Python Performance Part 2 Redux: Split & Reduce Large Strings for ‘A Href’ Hypertext

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))

Timing Comparison: ~ 300% Performance Improvement

Note: hypertext.html is 48MB.

braydon@bgf:~/python_tests/extract$ time python split.py 

real    0m1.263s
user    0m1.112s
sys     0m0.156s

braydon@bgf:~/python_tests/extract$ time python split_2.py 

real    0m0.392s
user    0m0.268s
sys     0m0.120s

split.py

Previously, I had found the best solution to my problem was to split() the large string up by the “>” character, and then reduce to a list of hyperlinks.

def is_ahref(a,b):
    y = b.find("<a href=")
    if y != -1: a.append(b[y+9:-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])

There is a better solution. Splitting the text up by the “>” character is wasteful; there are many “>”s in html, and most of them that will not have hyperlinks. We don’t need to even check if the item in the list is an href if we split the string into a list that all will have an href, and then reduce it as before.

def is_ahref(a,b):
    z = b.find(">")
    a.append(b[1:z-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

fc = open("hypertext.html","r").read().split("<a href=")
hrefs = preduce(is_ahref,fc,[])

However because the size of the list will be exactly the same as it started, we shouldn’t need to use reduce(), but rather we can just map() a fuction to run through the entire list, ‘reducing’ it to a list of just the hyperlinks.

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))



Previous

Python Performance Part 2: Parsing Large Strings for ‘A Href’ Hypertext

Goal Write a fast Python script that will take a large string and reduce it to a list of all of the hyperlinks in the html string; such as [”http://world.org”,”/tree”]. Attempt 1: Self-Recursion f = open(‘hypertext_sm.html’,'r’) ahrefs = [] count…

Next

Python Performance Part 3: Python 3000 and Transforming Large Lists into Seperate Smaller Lists

Preface This is a redux of Python Performance Part 1, where the fastest method was using the reduce builtin function in Python2.5. December 3rd, Python 3000 final was released so I have downloaded it and gone over some of these…


Leave a Reply


News Calendar

June 2008
M T W T F S S
« Jan   Dec »
 1
2345678
9101112131415
16171819202122
23242526272829
30  

Information