Crawling a blog's HTML content with Python (reading the HTML directly, not through RSS)

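# Python 2 script (urllib2 and unichr do not exist in Python 3)
import codecs, re, urllib2

f = urllib2.urlopen('http://www.soemin.net/2009/04/font-encoding-detection-for-zawgyi-and.html')

# Decode the page as UTF-8 and replace decimal character references with the characters they encode
htm = re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1))), f.read().decode("utf8"))

# Take everything inside the post-body div, up to the "clear: both" div that ends the post
txt = re.findall(r'<div[^>]+post-body[^>]+>\s*(.*?)\s*<div[^>]+clear:\s*both[^>]+></div>', htm, re.DOTALL)[0]

# Save the extracted text as UTF-8
codecs.open("crawl.txt", 'w+', "utf8").write(txt)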

# It also converts numeric entities such as &#4096; to က
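For anyone on Python 3, here is a rough sketch of the same two steps (entity decoding, then regex extraction). It reuses the two regular expressions from the script above; the function names, the `crawl` helper, and the tiny sample page are my own assumptions for illustration, not the real blog markup.

```python
import re
import urllib.request

# Same patterns as in the script above.
ENTITY_RE = re.compile(r"&#(\d+);")
POST_BODY_RE = re.compile(
    r'<div[^>]+post-body[^>]+>\s*(.*?)\s*<div[^>]+clear:\s*both[^>]+></div>',
    re.DOTALL,
)

def decode_numeric_entities(html):
    """Replace decimal references such as &#4096; with the character they encode."""
    return ENTITY_RE.sub(lambda m: chr(int(m.group(1))), html)

def extract_post_body(html):
    """Return the text inside the post-body div, or None if the pattern does not match."""
    match = POST_BODY_RE.search(html)
    return match.group(1) if match else None

def crawl(url, out_path="crawl.txt"):
    """Hypothetical helper: fetch url, extract the post body, save it as UTF-8."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf8")
    body = extract_post_body(decode_numeric_entities(html))
    if body is None:
        raise ValueError("post body not found")
    with open(out_path, "w", encoding="utf8") as out:
        out.write(body)
    return body

# Offline check on a hand-made page (assumed markup, no network needed).
sample = ('<div class="post-body entry-content"> &#4096;&#4097; '
          '<div style="clear: both;"></div></div>')
sample_body = extract_post_body(decode_numeric_entities(sample))
print([hex(ord(c)) for c in sample_body])  # ['0x1000', '0x1001']
```

Unlike the `[0]` index in the original, `extract_post_body` returns None when the page layout does not match, so a template change shows up as a clear error instead of an IndexError.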

The result will look like this:

ေဇာ္ဂ်ီနဲ့ ယူနီကုတ္ ၅.၁ ခြဲျခားျခင္း (Font Encoding Detection for Zawgyi and Unicode 5.1)

.....

အဓိကအားျဖင့္ကေတာ့
၁။ သေဝထိုး၊ ရရစ္၊ ရပင္းစတာေတြ နဲ့

....

.....

Cheers,
Soe Min
