Better late than never: the pyparsing-based date parser
As promised in my earlier post, I am finally releasing the important part of my pyparsing-based date parser. Thanks to Ryan for reminding me 😉
Disclaimer: I never really finished (or polished) the code. It is just an excerpt from my actual program, which eventually ignored the dates altogether. The grammar is probably faulty; I am publishing it only as an example of how it could be done…
from pyparsing import *
import string

# zero-prefixed and non-prefixed numbers from 1 up to 31 (1, 2, ... 01, 02, ... 31),
# followed by ordinal (st, nd, rd, ...)
days_zero_prefix = " ".join([("%02d" % x) for x in range(1, 10)])
days_no_prefix = " ".join([("%d" % x) for x in range(1, 32)])
day_en_short = oneOf("st nd rd th")
day = ((oneOf(days_zero_prefix) ^ oneOf(days_no_prefix)).setResultsName("day")
       + Optional(Suppress(day_en_short))).setName("day")

# months, in numbers or names
months_zero_prefix = oneOf(" ".join([("%02d" % x) for x in range(1, 10)]))
months_no_prefix = oneOf(" ".join([("%d" % x) for x in range(1, 13)]))
months_en_long = oneOf("January February March April May June July August September October November December")
months_en_short = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
months_de_long = oneOf("Januar Februar März April Mai Juni Juli August September Oktober November Dezember")
months_de_short = oneOf("Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez")
month = (months_zero_prefix.setName("month_zp").setResultsName("month")
         | months_no_prefix.setName("month_np").setResultsName("month")
         | months_en_long.setName("month_enl").setResultsName("month")
         | months_en_short.setName("month_ens").setResultsName("month")
         | months_de_long.setResultsName("month")
         | months_de_short.setResultsName("month"))

# years, between 1995 and 2009, two- or four-digit format
lowest_year, highest_year = 1995, 2010
years = " ".join([("%d" % x) for x in range(lowest_year, highest_year)])
years_short = " ".join([("%02d" % (x % 100)) for x in range(lowest_year, highest_year)])
year_long = oneOf(years)
year = (year_long ^ oneOf(years_short)).setResultsName("year")

# choice of separators between date items; if two occur, they should match
sep_literals = oneOf(". - /") ^ White()  # ".-/ "
sec_lit = matchPreviousLiteral(sep_literals)

# optional comma
comma = Literal(",") | White()

# punctuation
punctuation = oneOf(list(string.punctuation))

# EBNF resulting
date_normal = day + Suppress(sep_literals) + month + Suppress(sec_lit) + year
date_rev = year + Suppress(sep_literals) + month + Suppress(sec_lit) + day
date_usa = month + Suppress(sep_literals) + day + Suppress(sec_lit) + year
date_written = ((months_en_long.setResultsName("month")
                 | months_en_short.setResultsName("month"))
                + day + Suppress(comma) + year)

# HTML tag bounds
anyTag = anyOpenTag | anyCloseTag
date_start = Suppress(WordStart() ^ anyTag ^ punctuation)  # FIXME: there is a problem here
date_end = Suppress(WordEnd() ^ anyTag ^ punctuation)

# final BNF
parser = date_start + (date_normal ^ date_usa ^ date_written ^ date_rev) + date_end

# now do some parsing
for result, a, b in parser.scanString("13.01.2000"):
    print(result["day"], result["month"], result["year"])
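The parser hands back day, month, and year as plain strings (a numeric or named month, a two- or four-digit year). A minimal sketch of turning those three strings into a datetime.date could look like the following. The names MONTHS and to_date are hypothetical and not part of the original program, and the two-digit year rule simply assumes the 1995 to 2009 window used in the grammar.

```python
from datetime import date

# Hypothetical lookup table (not from the original program): maps every
# English and German month name used in the grammar to its month number.
MONTHS = {}
for names in (
    "January February March April May June July August September October November December",
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",
    "Januar Februar März April Mai Juni Juli August September Oktober November Dezember",
    "Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez",
):
    for number, name in enumerate(names.split(), start=1):
        MONTHS[name] = number


def to_date(day, month, year, lowest_year=1995):
    """Convert the three string tokens into a datetime.date.

    Assumption: two-digit years at or above lowest_year % 100 belong to
    the 1900s, everything below belongs to the 2000s.
    """
    month_number = MONTHS[month] if month in MONTHS else int(month)
    year_number = int(year)
    if year_number < 100:
        year_number += 1900 if year_number >= lowest_year % 100 else 2000
    # date() raises ValueError for impossible dates such as 31 February
    return date(year_number, month_number, int(day))
```

With a scanString result this would be called as to_date(result["day"], result["month"], result["year"]).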
Note that the month names are given in English and German (as is the date format). Also, I'm pretty sure that parsing date strings like this, with every valid day, month, and year spelled out as a literal, is very inefficient.
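One possibly cheaper approach is to match plain digits and check the ranges afterwards instead of enumerating every valid value. The sketch below swaps pyparsing for the standard re module and only covers the numeric day-month-year form with a four-digit year, so it is a simplification rather than an equivalent of the grammar. DATE_RE and find_dates are hypothetical names.

```python
import re
from datetime import date

# Hypothetical pattern: day, separator, month, the same separator again
# (backreference \2), then a four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})([./-])(\d{1,2})\2(\d{4})\b")


def find_dates(text, lowest_year=1995, highest_year=2010):
    """Return all day-month-year dates in text that pass the range checks."""
    found = []
    for match in DATE_RE.finditer(text):
        day = int(match.group(1))
        month = int(match.group(3))
        year = int(match.group(4))
        # same year window as the grammar: 1995 up to and including 2009
        if not lowest_year <= year < highest_year:
            continue
        try:
            found.append(date(year, month, day))
        except ValueError:
            # day or month out of range, e.g. 32.01.2000
            continue
    return found
```

Whether this is actually faster for a given input would need measuring; the point is only that the range checks move out of the grammar and into ordinary code.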
If you find any errors in this code or have questions, feel free to leave a comment 😉