HREF rewriter in Python

While I was working at DFAS-Cleveland, there was a discussion about moving our web pages from a Microsoft-based hosting solution to a UNIX-based one. The biggest problem we faced in this move was a strange one: case sensitivity. Our existing solution didn't care about the case of HREF values, but the UNIX system did. (A quick summary: UNIX sees THISFILE.HTML and thisfile.html as two separate files.)

Why did that matter? Great question. An HREF tells a link where to go when a user clicks on it. If the HREF for the link is THISFILE.HTML, but the file on the server is named thisfile.html, a UNIX-based server won't find the correct file, users will be unhappy, and eventually the world will explode*.
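To make the failure concrete, here is a minimal sketch that mimics the two lookup behaviors. It is written for Python 3 (unlike the 2005 script further down), and the function names and sample file names are made up for illustration.

```python
def link_resolves(href, files_on_server):
    """Exact-match lookup, the way a case-sensitive UNIX server behaves."""
    return href in files_on_server


def link_resolves_ignoring_case(href, files_on_server):
    """Case-insensitive lookup, the way the old hosting behaved."""
    lowered = {name.lower() for name in files_on_server}
    return href.lower() in lowered


# Hypothetical server contents: everything is stored in lowercase.
files = {"thisfile.html", "about.html"}

print(link_resolves("THISFILE.HTML", files))                # False
print(link_resolves_ignoring_case("THISFILE.HTML", files))  # True
```

The same link works on one server and breaks on the other, which is why every HREF has to agree with the file name exactly.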

Now the real fun part is making sure all the links have the correct naming convention. Actually it's not fun at all. It's tedious and boring and would take forever on the DFAS site. So I wrote a Python program to do it for me. I've since cleaned it up a little and removed all the DFAS-specific stuff, so I thought I'd put it here for future reference or for someone to experiment with.

I make no promises that this will work for your system. If the script doesn't work, eats your homework, destroys your computer, kicks your dog, makes your cat explode, or otherwise causes you any kind of harm in any way (including mental stress), you cannot hold me liable, sue me, or otherwise pass the problem on to me in any way.

What you need

What it does

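This program will start in a directory you specify (see the start variable), open each .htm file in that directory and all subdirectories, scan through the file for HREF and SRC values, convert those values as specified in the clean function, and then write any changes back to the file. Along the way it also renames files and directories using the same clean function and deletes any files ending in log or lck, so try it on a copy of your site first. Read the file specified in the log variable to see what happened or to check for errors.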
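The walk-and-rewrite loop described above can be sketched roughly as follows. This is a Python 3 illustration, not the original script: the name process_tree, its parameters, and the choice of latin-1 (used only because it round-trips arbitrary bytes) are my assumptions. The rewrite argument is any function that takes the file's text and returns the new text.

```python
import os


def process_tree(start, rewrite, extensions=(".htm", ".html")):
    """Apply rewrite() to every matching file under start.

    Returns the list of paths whose contents actually changed.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(start):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue  # skip anything that is not an HTML page
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so unknown encodings
            # survive the round trip; newline="" keeps line endings as-is.
            with open(path, encoding="latin-1", newline="") as f:
                original = f.read()
            updated = rewrite(original)
            if updated != original:
                with open(path, "w", encoding="latin-1", newline="") as f:
                    f.write(updated)
                changed.append(path)
    return changed
```

Calling process_tree("some/folder", str.lower) would lowercase whole pages, which is too blunt for real use but shows the plumbing. A real rewrite function would touch only the link values.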

Yes, the program uses regular expressions. But instead of leaving me with two problems, Kodos (the Python regex debugger), the Python re library documentation, and the Python Regex HOWTO made it easy. Hooray for Python!
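For anyone new to named groups, here is a small Python 3 sketch of the idea. The pattern is a simplified assumption of mine rather than the exact one in the script: it handles only double-quoted values and captures just the text between the quotes, so the attribute name itself is left alone. The names ATTR_RE and rewrite_links are hypothetical.

```python
import re

# Assumed pattern: "attr" is the attribute name up to the opening quote,
# "value" is the link text that needs cleaning.
ATTR_RE = re.compile(
    r'''(?P<attr>(?:href|src)\s*=\s*")(?P<value>[^">]*)''',
    re.IGNORECASE,
)


def clean(value):
    """Underscores, spaces and '[' become dashes; everything is lowercased."""
    return re.sub(r"[_ \[]", "-", value).lower()


def rewrite_links(html):
    """Clean every href/src value in html and leave the rest untouched."""
    def fix(match):
        return match.group("attr") + clean(match.group("value"))
    return ATTR_RE.sub(fix, html)


line = '<A HREF="My_Page.HTML"><IMG SRC="Big Logo.GIF"></A>'
print(rewrite_links(line))
# <A HREF="my-page.html"><IMG SRC="big-logo.gif"></A>
```

Passing a function to re.sub handles every link on a line in one pass, which is one possible design; the script below instead looks for one match per line.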

Using it

The program wasn't intended to be interactive. Just save a copy of the script to your computer, set the start and log variables, open a command window, and type python cleanup.py. That's it. (You might be able to reuse some of these functions in a library or an interactive tool; I just never had the need.)

License Information

massrename.py by Matt Bear is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

cleanup.py
"""
Python program to search files for HREF values that are uppercase or have junk in them
and convert HREF value to lowercase and clean out junk.

This program will start in a directory you specify (see start variable), 
open each file in the directory and all subdirectories, 
scan through the file for HREF values, 
convert the HREF values as specified in the clean function, 
and then write any changes back to the file. 
Read the file specified in the log variable to see what happened or see any errors.

Copyright 2005 by Matt Bear.
Released under terms of Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License 
http://creativecommons.org/licenses/by-nc-sa/3.0/us/

"""
import re,os,string

start = "c:\\technology"
log = "c:\\rename.log"

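# The named groups are read back in scrubHref and scrubSrc via match.group().
hrefpattern = r'''(?P<CleanThisHref>href\s*=\s*\"*[^">]*)'''
hrefregex = re.compile(hrefpattern,re.I|re.VERBOSE)
srcpattern = r'''(?P<CleanThisSrc>src\s*=\s*\"*[^">]*)'''
srcregex = re.compile(srcpattern,re.IGNORECASE|re.VERBOSE)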

def cleanContents(file):
    logfile.write("Working on file: "+str(file)+"\n")
    scrubHref(file)
    scrubSrc(file)

def scrubHref(file):
    logfile.write("Searching "+file+" for hrefs.\n")
    html = open(file).readlines()
    hfile = open(file,'w')
    for hline in html:
        hmatch=re.search(hrefregex,hline)
        if hmatch != None:
            oldhref = hmatch.group('CleanThisHref')
            newhref = clean(oldhref)
            hline = string.replace(hline,oldhref,newhref)
            hfile.write(hline)
            print ".",
        else:
            hfile.write(hline)
            print ".",
    hfile.close()
    html = ""

def scrubSrc(file):
    html = open(file).readlines()
    sfile = open(file,'w')
    for sline in html:
        smatch=re.search(srcregex,sline)
        if smatch!= None:
            oldsrc = smatch.group('CleanThisSrc')
            newsrc = clean(oldsrc)
            # Plain string replace: oldsrc may contain regex metacharacters like . or ?
            sline = string.replace(sline,oldsrc,newsrc)
            sfile.write(sline)
            print ".",
        else:
            sfile.write(sline)
            print ".",
    sfile.close()
    html=""

def clean(x):
    cleaned = re.sub("_","-",x).lower()#underscore to dashes, lowercased
    cleaned = re.sub(" ","-",cleaned) #spaces to dashes
    cleaned = re.sub('\[',"-",cleaned) # I don't remember what this is for.
    return cleaned

def cleanFilenames():
    fcount = 0
    for f in i[2]:
        os.rename(os.path.join(i[0],i[2][fcount]),os.path.join(i[0],clean(f)))
        logfile.write("Renamed file to: "+os.path.join(i[0],clean(f))+"\n")
        print ".",
        fcount = fcount+1
        if fcount == len(i[2]):fcount = 0
        # Match both .htm and .html pages
        if clean(f)[-4:] == ".htm" or clean(f)[-5:] == ".html":
            cleanContents(os.path.join(i[0],clean(f)))

def cleanDirnames():
    dcount = 0
    for d in i[1]:
        os.rename(os.path.join(i[0],d),os.path.join(i[0],clean(d)))
        logfile.write("Renamed directory to: "+os.path.join(i[0],clean(d))+"\n")
        print ".",
        # Update the list in place so os.walk descends into the renamed directory
        i[1][dcount] = clean(d)
        dcount = dcount+1

def purge():
    for file in i[2]:
        if file[-3:] == "log" or file[-3:] == "lck":
            logfile.write("Deleted "+file+"\n")
            os.remove(os.path.join(i[0],clean(file)))

print "Working",
for i in os.walk(start):
    logfile = open(log,"a")
    print ".",
    os.chdir(start)
    cleanFilenames()
    cleanDirnames()
    purge()
    logfile.close()

Notes

*Not really.

If you don't like to cut and paste, or the tabs are weird or something, you can download this copy of the script.

You may also want to look at this CodingHorror post which shows good RegEx coding behavior. It also has links to three RegEx posts (first post, second post, third post) by Mike Malone that cover the basics.