Loading

Writing Own Regular Expressions Engine for Fun

I have been rewriting my own regex in C just for fun, since I found that python's implementation is not that fast, after reading following article.

<Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

I only implemented python's findall function. Actually, its relatively fast, but I am just doing it for fun.

Test string Length : 7200000 (7.2M Characters)
Sample : abbbcdexfbeczexczczkefanncdehbzzdezf * 200000
Computer : Mobile Intel Celeron 1.6MHz, 512 MB

1. a bit complicated regex, mine is 2.5x faster than python's re

regex pattern: [a]?[a-b]?[a]?[bezn][a-z0-9_]{1,3}?[e-k][^dsag]f?
my regex: 0.960999965668 s
1000000 matches [u'abbbcdexf', u'beczex', u'zczkef', u'anncdeh', u'bzzdezf', u'abbbcdexf', u'beczex'] .. [u'zczkef', u'anncde
python re: 2.3029999733 s
1000000 matches [u'abbbcdexf', u'beczex', u'zczkef', u'anncdeh', u'bzzdezf', u'abbbcdexf', u'beczex'] .. [u'zczkef', u'anncde

2. simple regex, 5x faster than python's re

regex pattern: [a-z]
my regex: 1.83299994469 s
7200000 matches [u'a', u'b', u'b', u'b', u'c', u'd', u'e'] .. [u'e', u'z', u'f']
python re: 9.35300016403 s
7200000 matches [u'a', u'b', u'b', u'b', u'c', u'd', u'e'] .. [u'e', u'z', u'f']

3. non-greedy matching, 4x faster than python's re

regex pattern: [a-z]{2,5}?
my regex: 1.40199995041 s
3600000 matches [u'ab', u'bb', u'cd', u'ex', u'fb', u'ec', u'ze'] .. [u'zz', u'de', u'zf']
python re: 5.8789999485 s
3600000 matches [u'ab', u'bb', u'cd', u'ex', u'fb', u'ec', u'ze'] .. [u'zz', u'de', u'zf']

4. fix range matching, 4x faster than python's re

regex pattern: [a-z]{5}
my regex: 0.690999984741 s
1440000 matches [u'abbbc', u'dexfb', u'eczex', u'czczk', u'efann', u'cdehb', u'zzdez'] .. [u'fannc', u'dehbz', u'zdezf']
python re: 2.66400003433 s
1440000 matches [u'abbbc', u'dexfb', u'eczex', u'czczk', u'efann', u'cdehb', u'zzdez'] .. [u'fannc', u'dehbz', u'zdezf']

Cheers,
Mark


2 comments:

Zatlite said...

Appraently, Google Go's implementation http://golang.org/pkg/regexp/ is worse than Python's. http://groups.google.com/group/golang-nuts/browse_thread/thread/f436d16dee9c8875?pli=1

So go and write a good one in Go and submit it for merging. Gain geek credits :P

မာ့ခ္ said...

Yeah bro, I knew that, but I am not going to write full feature regex engine, so may be they dont need that.

က်ေနာ္ဖတ္ေသာ အျခား ဘေလာ့ / ဆိုဒ္မ်ား