Parsing Date Strings with pyparsing

Now that the Genre Detection stuff works pretty well (a lot has happened in the last months, and I was too busy to write more about it here), it is time to fine-tune the approach and enhance the results. One feature I deliberately left out in the beginning, as I didn’t see how to achieve it in a short time, is the occurrence of date strings like “May 25th, 2008” or “21.12.2007” in the text of a web page, which is obviously a good feature for some web genres. There are a lot of regular expressions floating around that do this more or less elegantly, but they are a) rarely understandable, b) therefore not easy to alter, and c) often missing some of the many formats of human-readable dates. Besides, d) it is so much more fun to do it yourself 🙂 .

As I am fiddling around with Python anyway, I thought I would give pyparsing a try. After finding a few how-tos and articles (e.g. these slides), I got started rather easily, and I am pretty impressed by how quickly one can write a decent (in this case not descent 🙂 ) parser that Does What I Want. I am not finished yet, but the most basic use cases work. I still have to put together unit tests and polish the script. The one thing that could still be better is performance (heck, it is s-l-o-w scanning a reasonably big HTML page for occurrences of date strings), but I think the responsibility is mostly mine, as my code follows a brute-force approach and I’m sure there are lots of ways to improve it.
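To give an idea of what such a grammar can look like, here is a minimal sketch covering only the two example formats mentioned above. It is not the script described in this post: all names (`MONTHS`, `us_date`, `dotted_date`, `find_dates`) are made up for illustration, abbreviated month names and other formats are left out, and it assumes the camelCase pyparsing names (`oneOf`, `addCondition`, `scanString`), which should work in both pyparsing 2.x and 3.x.

```python
from pyparsing import Optional, Suppress, Word, nums, oneOf

# Hypothetical month list for this sketch; abbreviations are not handled.
MONTHS = ("January February March April May June July "
          "August September October November December").split()

# Numeric building blocks; the conditions reject implausible values.
day = Word(nums).addCondition(
    lambda t: len(t[0]) <= 2 and 1 <= int(t[0]) <= 31)
month_number = Word(nums).addCondition(
    lambda t: len(t[0]) <= 2 and 1 <= int(t[0]) <= 12)
year = Word(nums).addCondition(lambda t: len(t[0]) == 4)

month_name = oneOf(MONTHS, caseless=True)
ordinal_suffix = Optional(Suppress(oneOf("st nd rd th")))
comma = Optional(Suppress(","))

# "May 25th, 2008"
us_date = month_name("month") + day("day") + ordinal_suffix + comma + year("year")
# "21.12.2007"
dotted_date = (day("day") + Suppress(".") + month_number("month")
               + Suppress(".") + year("year"))

date_expr = us_date | dotted_date


def find_dates(text):
    """Return (start, end, fields) for every date-like string in text."""
    return [(start, end, tokens.asDict())
            for tokens, start, end in date_expr.scanString(text)]


if __name__ == "__main__":
    sample = "Posted on May 25th, 2008 and updated 21.12.2007."
    for start, end, fields in find_dates(sample):
        print(sample[start:end], fields)
```

Note that `scanString` tries the grammar at every position of the input, which is the brute-force behaviour referred to above, so a sketch like this will likely be slow on large pages as well.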

Well, we’ll see… I think as soon as the code looks a bit better I’ll release it here for anyone interested.

4 Responses to “Parsing Date Strings with pyparsing”

  1. Did you end up getting anywhere with this? I am about to tackle a similar problem and would love to take a look at your code.

  2. Actually, I haven’t pursued it further, as the dates didn’t prove to enhance the results of Genre Detection; I only have a (probably pretty faulty) fragment of the code. I will post my pyparsing code building the search string soonish 🙂 Stay tuned!

  3. […] promised in my earlier post I will just release the important part for my date parser using pyparsing. Thanks for Ryan to […]

  4. I have written a new blog post with the actual code I wrote then… See (pingback doesn’t seem to work)…
