Entries Tagged as 'python'

Better later than never: the pyparsing-based date parser

As promised in my earlier post, I am releasing the important part of my date parser, which uses pyparsing. Thanks to Ryan for reminding me 😉

Disclaimer: I never really finished (or polished) the code. It is just an excerpt from my actual program, which eventually ignored the dates. The grammar is probably faulty; I publish it only as an example of how it could be done…

import string

from pyparsing import *

# zero-prefixed and non-prefixed numbers from 1 up to 31 (1, 2, ... 01, 02, ... 31),
# followed by an optional ordinal suffix (st, nd, rd, th)
days_zero_prefix = " ".join(["%02d" % x for x in range(1, 10)])
days_no_prefix = " ".join(["%d" % x for x in range(1, 32)])
day_en_short = oneOf("st nd rd th")
day = ((oneOf(days_zero_prefix) ^ oneOf(days_no_prefix)).setResultsName("day")
       + Optional(Suppress(day_en_short))).setName("day")

# months, in numbers or names
months_zero_prefix = oneOf(" ".join(["%02d" % x for x in range(1, 10)]))
months_no_prefix = oneOf(" ".join(["%d" % x for x in range(1, 13)]))
months_en_long = oneOf("January February March April May June July August September October November December")
months_en_short = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
months_de_long = oneOf("Januar Februar März April Mai Juni Juli August September Oktober November Dezember")
months_de_short = oneOf("Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez")
month = (months_zero_prefix.setName("month_zp").setResultsName("month") |
         months_no_prefix.setName("month_np").setResultsName("month") |
         months_en_long.setName("month_enl").setResultsName("month") |
         months_en_short.setName("month_ens").setResultsName("month") |
         months_de_long.setResultsName("month") |
         months_de_short.setResultsName("month"))

# years, between 1995 and 2009 (highest_year is exclusive), two- or four-digit format
lowest_year, highest_year = 1995, 2010
years = " ".join(["%d" % x for x in range(lowest_year, highest_year)])
years_short = " ".join(["%02d" % (x % 100) for x in range(lowest_year, highest_year)])
year_long = oneOf(years)
year = (year_long ^ oneOf(years_short)).setResultsName("year")

# choice of separators between date items; if two occur, they should match
sep_literals = oneOf(". - /") ^ White() # ".-/ "
sec_lit = matchPreviousLiteral(sep_literals)

# optional comma
comma = Literal(",") | White()

# punctuation
punctuation = oneOf(list(string.punctuation))

# EBNF resulting
date_normal = day + Suppress(sep_literals) + month + Suppress(sec_lit) + year
date_rev = year + Suppress(sep_literals) + month + Suppress(sec_lit) + day
date_usa = month + Suppress(sep_literals) + day + Suppress(sec_lit) + year
date_written = (months_en_long.setResultsName("month") | months_en_short.setResultsName("month")) + day + Suppress(comma) + year

# HTML tag bounds
anyTag = anyOpenTag | anyCloseTag
date_start = Suppress(WordStart() ^ anyTag ^ punctuation)
# FIXME: there is a problem here
date_end = Suppress(WordEnd() ^ anyTag ^ punctuation)

# final BNF
parser = date_start + (date_normal ^ date_usa ^ date_written ^ date_rev) + date_end

# now do some parsing
for result, start, end in parser.scanString("13.01.2000"):
    # comparing a string with a pyparsing expression tests whether the string matches it
    print(result["day"], result["month"], result["year"] == year)

Note that the month names are given in English and German (as are the date formats). Also, I’m pretty sure that parsing date strings like this is very inefficient.

If you find any errors in this code or have questions, feel free to leave a comment 😉

Presenting… ZipDocSrv!

ZipDocSrv serves zipped documentation locally

I program in different languages (Python at home; Java, JavaScript, etc. for my thesis project), so I need a lot of documentation. However, when I commute I don’t have access to the net, so I have to download all that HTML documentation and unzip it into the respective docs folder.

Last time I looked, I had 269 MB worth of HTML documentation in my Python docs folder alone – rarely used, lying there, wasting space. Firefox (maybe other browsers as well) can browse zip files (with the jar: protocol), but I don’t feel comfy with that – so I scratched my own itch and wrote a local server for zip files (aptly named ZipDocSrv*). Packing all that stuff into a neat zip file, I compressed all the documentation to 71 MB, saving 74% of space!
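
The 74% figure follows directly from the two sizes quoted above. As a quick check (the function name is only illustrative):

```python
def space_saved(original_mb, compressed_mb):
    """Return the fraction of space saved by compression."""
    return 1 - compressed_mb / original_mb


# 269 MB of HTML docs packed into a 71 MB zip file
print(f"{space_saved(269, 71):.0%}")
```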

ZipDocSrv is a command-line app that lets you mount multiple zip files and open them in your normal browser. Additionally, I have added a GUI (as an exercise for me, actually, to get to know wxPython) for people who don’t like editing configuration files themselves.
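
To give an idea of what "mounting" zip files could look like, here is a minimal sketch using only the Python standard library. It is not the actual ZipDocSrv code: the names `resolve`, `make_handler` and `serve`, the URL layout `/<mount>/<path inside zip>`, and the fallback to `index.html` are all assumptions made for illustration.

```python
import mimetypes
import zipfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def resolve(mounts, url_path):
    """Map '/<mount>/<member>' to (content_type, data), or None if not found.

    `mounts` is a dict of mount name -> open zipfile.ZipFile (assumed layout).
    """
    parts = unquote(url_path).lstrip("/").split("/", 1)
    if len(parts) != 2 or parts[0] not in mounts:
        return None
    name, member = parts
    member = member or "index.html"  # assumed default document
    try:
        data = mounts[name].read(member)
    except KeyError:  # ZipFile.read raises KeyError for unknown members
        return None
    content_type = mimetypes.guess_type(member)[0] or "application/octet-stream"
    return content_type, data


def make_handler(mounts):
    """Build a request handler class that serves files out of `mounts`."""
    class ZipHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            found = resolve(mounts, self.path.split("?", 1)[0])
            if found is None:
                self.send_error(404)
                return
            content_type, data = found
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return ZipHandler


def serve(mounts, port=8000):
    """Serve the mounted archives on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(mounts)).serve_forever()
```

With a (hypothetical) archive `python-docs.zip`, calling `serve({"python": zipfile.ZipFile("python-docs.zip")})` would make its pages reachable under `http://127.0.0.1:8000/python/`.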

If you want to try ZipDocSrv, download the GUI app (packed with py2exe).

The code is written in Python. The (horrible! horrible! horrible! You have been warned!) source code can be downloaded here

Be warned: There will be bugs. If you find them or have any feedback, I’d be glad to hear about it in the comments!

* Actually, I haven’t found a better name for it – any suggestions?

Parsing Date Strings with pyparsing

Now that the Genre Detection stuff works pretty well (a lot has happened in the last months… I was too busy to write more about it here…), it is time to fine-tune the approach and enhance the results. One feature I deliberately left out in the beginning (back then I didn’t see how to achieve it in a short time) is the occurrence of date strings like “May 25th, 2008” or “21.12.2007” in the text of a web page, which is obviously a good feature for some web genres. There are a lot of regular expressions floating around that do exactly that, more or less elegantly and to a certain extent, but a) they are rarely understandable, b) they are therefore not easy to alter, c) they often miss some of the many formats of human-readable dates, and d) it is so much more fun to do it yourself 🙂 .

As I am fiddling around with Python anyway, I thought I would give pyparsing a try. After finding a few how-tos and articles (e.g. these slides), I got started rather easily, and I am pretty impressed by how quickly one can write a decent (in this case not descent 🙂 ) parser that Does What I Want. I am not finished yet, but the most basic use cases work. I still have to put together unit tests and polish the script. The one thing that could still be better is performance (heck, it is s-l-o-w when scanning a reasonably big HTML page for occurrences of date strings), but I think the responsibility is mostly mine, as my code follows a brute-force approach and I’m sure there are lots of ways to improve it.
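
For comparison, this is roughly what the regular-expression route looks like. It is a small sketch using only the standard library, not the pyparsing approach this post is about. The patterns, the names `NUMERIC`, `WRITTEN` and `find_dates`, and the decision to read numeric dates as day.month.year are assumptions made for illustration; only English month names and four-digit years are covered.

```python
import re
from datetime import date

LONG = ("January February March April May June July "
        "August September October November December").split()
MONTHS = {name[:3]: number for number, name in enumerate(LONG, start=1)}
# long names come first so that "March" is tried before "Mar"
MONTH_ALT = "|".join(LONG + sorted(MONTHS))

# 21.12.2007, 21-12-2007, 21/12/2007 (both separators must be the same)
NUMERIC = re.compile(r"\b(\d{1,2})([./-])(\d{1,2})\2(\d{4})\b")
# May 25th, 2008 / Dec 3 2007
WRITTEN = re.compile(
    r"\b(%s)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b" % MONTH_ALT)


def find_dates(text):
    """Return all valid dates found in `text`, sorted chronologically."""
    candidates = []
    for match in NUMERIC.finditer(text):
        day, _sep, month, year = match.groups()
        candidates.append((int(year), int(month), int(day)))
    for match in WRITTEN.finditer(text):
        month_name, day, year = match.groups()
        candidates.append((int(year), MONTHS[month_name[:3]], int(day)))
    dates = []
    for year, month, day in candidates:
        try:
            dates.append(date(year, month, day))
        except ValueError:  # e.g. 31.02.2007 is not a real date
            pass
    return sorted(dates)
```

Even this small version shows the problem: every additional format (German month names, two-digit years, year-first order) means another pattern or another layer of alternation.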

Well, we’ll see… I think as soon as the code looks a bit better I’ll release it here for anyone interested.

Corpus Building

Right now I am in the process of building a web document corpus for web genre detection. That raises the question of what a “good” corpus looks like. I have thought (and researched, of course 😉 ) about it for a while, and I came up with the following requirements:

  • A good corpus needs to consist of randomly acquired web documents. It is important that the pages within each genre (in my case blogs) differ a lot from each other in content and in structure – as diverse as documents of a single genre can be.
  • Many corpora found on the web have one severe flaw: having been collected for a single purpose, they contain the document data but rarely style sheets, JavaScript files or images. In my opinion these can provide features for detecting a genre as well, so I need them. Unfortunately, many corpora don’t indicate the original source (URL) of their documents, so it isn’t possible to fetch the same documents again and validly compare my own approach to the one used with these corpora. So I need my corpus to be reconstructible to allow future comparisons.
  • As I focus on a structural approach that depends neither on linguistic features nor on a single language, I need to build a multi-lingual corpus, meaning that I will incorporate documents in German, Czech, Russian etc. as well.
  • How many documents are enough? Related works state their corpus size to be around 200 – 300 hand-labelled documents per genre. Manual labelling is no fun, but it is needed to ensure the quality of the corpus.

This list is not exhaustive; there are many more requirements for a corpus, but this is a work in progress, so I will leave the other stuff out for now.

To select appropriate web resources for the blog genre, I first tried to grab the RSS feeds of blog pinging services (like this one from syndic8), but I soon ran into the problem of too much ping spam in there. So I tried the Open Directory Project (dmoz), downloaded the RDF data, parsed it using my favorite programming language Python and found a lot of URLs – beautifully categorized and hand-labelled (and the best thing is: not by me 😉 ).
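
Pulling categorized URLs out of such a dump could be sketched as follows. The element and attribute names (`ExternalPage`, `about`, `topic`) and the sample data are assumptions about the dump layout and are not taken from this post, so check them against the real file. The helper names `local` and `urls_by_topic` are made up as well.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny made-up sample; element and attribute names are assumptions.
SAMPLE = b"""<RDF>
  <ExternalPage about="http://example.org/blog-a">
    <topic>Top/Computers/Internet/On_the_Web/Weblogs</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/shop">
    <topic>Top/Shopping</topic>
  </ExternalPage>
</RDF>"""


def local(name):
    """Strip an XML namespace such as '{uri}name' down to 'name'."""
    return name.rsplit("}", 1)[-1]


def urls_by_topic(stream, wanted_prefix):
    """Collect page URLs whose topic starts with `wanted_prefix`."""
    result = defaultdict(list)
    # iterparse reads the file incrementally instead of loading it whole
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        url = next((value for key, value in elem.attrib.items()
                    if local(key) == "about"), None)
        topic = next((child.text for child in elem
                      if local(child.tag) == "topic"), None)
        if url and topic and topic.startswith(wanted_prefix):
            result[topic].append(url)
        elem.clear()  # drop the element's children to keep memory use low
    return dict(result)
```

Called as `urls_by_topic(open("content.rdf", "rb"), "Top/Computers")` (the file name is hypothetical), it returns a dict that maps each matching topic to its list of URLs.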

My next step is to randomly select documents, check that they have been categorized correctly by the dmoz folks, and download them. It is still a long way to go, but I think I have made good progress in a short time thanks to dmoz.
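
The random selection can be made repeatable, which also helps with the "reconstructible" requirement from the list above. The following is a minimal sketch under assumptions: the function names, the per-category sample size and the JSON manifest format are made up for illustration.

```python
import json
import random


def sample_urls(urls_by_category, per_category, seed):
    """Pick up to `per_category` distinct URLs per category, repeatably."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    picked = {}
    for category in sorted(urls_by_category):
        urls = sorted(set(urls_by_category[category]))
        picked[category] = rng.sample(urls, min(per_category, len(urls)))
    return picked


def manifest(picked, seed):
    """Serialize the selection so the corpus can be rebuilt later."""
    return json.dumps({"seed": seed, "urls": picked}, indent=2, sort_keys=True)
```

Storing the manifest next to the downloaded pages keeps the source URL of every document, so the same set can be fetched again for later comparisons.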