Better later than never: the pyparsing-based date parser

As promised in my earlier post I will just release the important part for my date parser using pyparsing. Thanks for Ryan to remind me ;)

Disclaimer: I never really finished (or polished) the code – it is just an excerpt from my actual program that eventually ignored the dates. The grammar probably is faulty, I just publish it as an example how it could be done…

from pyparsing import *
 
# zero-prefixed and non-prefixed numbers from 1 up to 31 (1, 2, ... 01, 02, ...31),
# followed by ordinal (st, nd, rd, ...)
days_zero_prefix = " ".join([("%02d" % x) for x in xrange(1, 10)])
days_no_prefix = " ".join([("%d" % x) for x in xrange(1, 32)])
day_en_short = oneOf("st nd rd th")
day = ( (oneOf(days_zero_prefix) ^ oneOf(days_no_prefix) ).setResultsName("day")
+ Optional(Suppress(day_en_short)) ).setName("day")
 
# months, in numbers or names
months_zero_prefix = oneOf(" ".join([("%02d" % x) for x in xrange(1, 10)]))
months_no_prefix = oneOf(" ".join([("%d" % x) for x in xrange(1, 13)]))
months_en_long = oneOf("January February March April May June July August September October November December")
months_en_short = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
months_de_long = oneOf("Januar Februar M\cx3rz April Mai Juni Juli August September Oktober November Dezember")
months_de_short = oneOf("Jan Feb M\xc3r Apr Mai Jun Jul Aug Sep Okt Nov Dez")
month = (months_zero_prefix.setName("month_zp").setResultsName("month") |
months_no_prefix.setName("month_np").setResultsName("month") |
months_en_long.setName("month_enl").setResultsName("month") |
months_en_short.setName("month_ens").setResultsName("month") |
months_de_long.setResultsName("month") |
months_de_long.setResultsName("month"))
 
# years, between 1995 and 2009, two- or four-digit format
lowest_year, highest_year = 1995, 2010
years = " ".join([("%d" % x) for x in xrange(lowest_year, highest_year)])
years_short = " ".join([("%02d" % (x % 100)) for x in xrange(lowest_year, highest_year)])
year_long = oneOf(years)
year = (year_long ^ oneOf(years_short)).setResultsName("year")
 
# choice of separators between date items; if two occur, they should match
sep_literals = oneOf(". - /") ^ White() # ".-/ "
sec_lit = matchPreviousLiteral(sep_literals)
 
# optional comma
comma = Literal(",") | White()
# punctuation
punctuation = oneOf([x for x in string.punctuation])
 
# EBNF resulting
date_normal = day + Suppress(sep_literals) + month + Suppress(sec_lit) + year
date_rev = year + Suppress(sep_literals) + month + Suppress(sec_lit) + day
date_usa = month + Suppress(sep_literals) + day + Suppress(sec_lit) + year
date_written = (months_en_long.setResultsName("month") | months_en_short.setResultsName("month")) + day + Suppress(comma) + year
 
# HTML tag Bounds
anyTag = anyOpenTag | anyCloseTag
date_start = Suppress(WordStart() ^ anyTag ^ punctuation)
# FIXME: there is a problem here
date_end = Suppress(WordEnd() ^ anyTag ^ punctuation)
 
# final BNF
parser = date_start + (date_normal ^ date_usa ^ date_written ^ date_rev) + date_end
 
# now do some parsing
for result, a, b in parser.scanString("13.01.2000"):
print result["day"], result["month"], result["year"] == year

Note that the month names are given in English and German (as is the date format); Further, I’m pretty sure it is very inefficient doing the date string parsing like that.

If you find any errors in this code or have questions, feel free to leave a comment ;)

4 Responses to “Better later than never: the pyparsing-based date parser”

  1. Thanks for sharing this code – pulling dates from text has been problem I’ve been trying to crack for ages! Have made a few mods and added some tests and posted on my blog.

  2. Good that it is of some use – I primarily coded this, as I wanted to try pyparsing. The one thing that still bothers me is performance: if you skim through large texts it is unbelievable slow…

    Additionally, I have some test cases that won’t work with this code but never got to fix it. I will possibly look back into it when I have a little spare time…

  3. months_zero_prefix = oneOf(” “.join([("%02d" % x) for x in xrange(1, 10)]))

    =>

    months_zero_prefix = oneOf(["%02d"%x for x in xrange(1, 10)])

  4. [...] code : http://dpaste.com/hold/150360/ Actual dates: Original code from Phil here: http://pcscholl.de/2009/09/24/better-later-than-never-the-pyparsing-based-date-parser/ Code: http://dpaste.com/hold/150357/ This entry was posted in Tech. Bookmark the permalink. [...]

Discussion Area - Leave a Comment