Recipe: Building PyStemmer on Windows – fixed!

So I tried to build PyStemmer 1.1.0 for Python 2.7 on Windows and it did not work out of the box. On closer examination I found that PyStemmer’s setup.py is flawed on line 24:

...
and os.path.split(line.strip())[0] in library_core_dirs]
...

This line did not find all the necessary C files for the language stemmers. The reason is that line.strip() leaves the “\” at the end of each entry in mkinc_utf8.mak in place, and on Windows os.path.split treats that trailing backslash as a path separator. This can be fixed easily by changing the line to the following:

...
and os.path.split(line.split()[0].strip())[0] in library_core_dirs]
...
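To see the difference on any platform, here is a small sketch that uses ntpath (the Windows flavour of os.path) directly. The sample entry is only an assumption about what a line in mkinc_utf8.mak looks like, and src_c stands in for one of the directory names in library_core_dirs.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

# Assumed sample entry; lines in mkinc_utf8.mak end with a "\" continuation.
line = "  src_c/stem_UTF_8_danish.c \\\n"

# Original expression: strip() removes whitespace but keeps the backslash,
# which the Windows path rules then treat as a trailing separator.
broken = ntpath.split(line.strip())[0]

# Fixed expression: take the first whitespace-separated token first,
# so the backslash never reaches the path split.
fixed = ntpath.split(line.split()[0].strip())[0]

print(repr(broken))
print(repr(fixed))
```

Only the second value is a bare directory name, so only the fixed expression can pass the membership test against library_core_dirs.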

HTH.

Blog Blackout

After having updated to WordPress 2.9.2 this blog was not working anymore… A user-friendly message (“Error establishing a database connection”) instead of the content I am used to seeing. No further information, no additional error messages. Nice.

After fiddling around for a while (trying to find out the exact error by massaging the PHP code – brrrrr –, checking all settings in wp-config.php several times and following the odd advice about OLD_PASSWORD in WordPress’ FAQ), I found out that the wp_options table in MySQL was fried beyond repair. Literally. So I took a database dump of the remaining tables, deleted all tables and re-installed WP. Curiously, re-importing the dump failed miserably due to two missing columns in wp_links and wp_posts (both about categories). In what version of WP did those disappear? (shrugs)

Eventually I just added them again and, after an outage of approximately 1.5 days, the blog is back again…

Better late than never: the pyparsing-based date parser

As promised in my earlier post, I will just release the important part of my date parser using pyparsing. Thanks to Ryan for reminding me ;)

Disclaimer: I never really finished (or polished) the code – it is just an excerpt from my actual program, which eventually ignored the dates. The grammar is probably faulty; I just publish it as an example of how it could be done…

import string

from pyparsing import *

# zero-prefixed and non-prefixed numbers from 1 up to 31 (1, 2, ... 01, 02, ... 31),
# followed by an optional English ordinal suffix (st, nd, rd, th)
days_zero_prefix = " ".join(["%02d" % x for x in range(1, 10)])
days_no_prefix = " ".join(["%d" % x for x in range(1, 32)])
day_en_short = oneOf("st nd rd th")
day = ((oneOf(days_zero_prefix) ^ oneOf(days_no_prefix)).setResultsName("day")
       + Optional(Suppress(day_en_short))).setName("day")

# months, in numbers or names (English and German)
months_zero_prefix = oneOf(" ".join(["%02d" % x for x in range(1, 10)]))
months_no_prefix = oneOf(" ".join(["%d" % x for x in range(1, 13)]))
months_en_long = oneOf("January February March April May June July August September October November December")
months_en_short = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
months_de_long = oneOf("Januar Februar März April Mai Juni Juli August September Oktober November Dezember")
months_de_short = oneOf("Jan Feb Mär Apr Mai Jun Jul Aug Sep Okt Nov Dez")
month = (months_zero_prefix.setName("month_zp").setResultsName("month") |
         months_no_prefix.setName("month_np").setResultsName("month") |
         months_en_long.setName("month_enl").setResultsName("month") |
         months_en_short.setName("month_ens").setResultsName("month") |
         months_de_long.setResultsName("month") |
         months_de_short.setResultsName("month"))

# years, between 1995 and 2009, two- or four-digit format
# (highest_year is exclusive because range() stops before it)
lowest_year, highest_year = 1995, 2010
years = " ".join(["%d" % x for x in range(lowest_year, highest_year)])
years_short = " ".join(["%02d" % (x % 100) for x in range(lowest_year, highest_year)])
year_long = oneOf(years)
year = (year_long ^ oneOf(years_short)).setResultsName("year")

# choice of separators between date items; if two occur, they should match
sep_literals = oneOf(". - /") ^ White()  # ".-/ "
sec_lit = matchPreviousLiteral(sep_literals)

# optional comma
comma = Literal(",") | White()
# punctuation
punctuation = oneOf(list(string.punctuation))

# resulting date grammars
date_normal = day + Suppress(sep_literals) + month + Suppress(sec_lit) + year
date_rev = year + Suppress(sep_literals) + month + Suppress(sec_lit) + day
date_usa = month + Suppress(sep_literals) + day + Suppress(sec_lit) + year
date_written = (months_en_long.setResultsName("month") | months_en_short.setResultsName("month")) + day + Suppress(comma) + year

# HTML tag bounds
anyTag = anyOpenTag | anyCloseTag
date_start = Suppress(WordStart() ^ anyTag ^ punctuation)
# FIXME: there is a problem here
date_end = Suppress(WordEnd() ^ anyTag ^ punctuation)

# final BNF
parser = date_start + (date_normal ^ date_usa ^ date_written ^ date_rev) + date_end

# now do some parsing
for result, start, end in parser.scanString("13.01.2000"):
    print(result["day"], result["month"], result["year"])

Note that the month names are given in English and German (as is the date format). Further, I’m pretty sure that parsing date strings like this is very inefficient.
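For comparison, here is a much smaller sketch that handles only the purely numeric layouts by trying a list of datetime.strptime formats in turn. This is a different technique from the pyparsing grammar, not a replacement for it: it ignores month names, ordinals and HTML bounds, and the FORMATS list and parse_date name are made up for illustration.

```python
from datetime import datetime

# Hypothetical format list covering the numeric layouts named in the grammar
# (day-month-year, year-month-day, month/day/year); extend as needed.
FORMATS = ["%d.%m.%Y", "%d-%m-%Y", "%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y"]


def parse_date(text):
    """Return a datetime.date for the first matching format, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            # This format did not fit; try the next one.
            continue
    return None


print(parse_date("13.01.2000"))
print(parse_date("2000-01-13"))
print(parse_date("not a date"))
```

The order of FORMATS matters for ambiguous input such as 01-02-2000, which is exactly the kind of decision the grammar above leaves open as well.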

If you find any errors in this code or have questions, feel free to leave a comment ;)

Organizing working with the desktop

Working in an academic environment is fun though strenuous, as there is an abundance of different tasks and changing contexts to cope with. One of the bigger challenges (besides doing research itself) is organizing and channeling the tasks that lie ahead, and setting up a good system for keeping and archiving information. Over the last few years, some tools and ways of “doing it” (still, I wouldn’t go so far as to call them “best practices”) have emerged that match my needs, and in this post I want to share some of them. All tools mentioned here are for WinXP – if you know of worthy equivalents in the Linux world, please comment, I’d like to know about them!

Note: this blog post was sparked by a discussion with a colleague about my filing system, which made me think about how differently I treat my files now compared with some years ago.

Handling Documents

I like plain text. It is easily searchable (using any desktop search engine), versionable (e.g. with Bazaar, which I use), not dependent on any software (except a good, UTF-8-enabled text editor like my favoured Notepad++) and can be made readable as-is with minimal markup. That’s why I’d increasingly like to embrace reStructuredText (reST) – a format used for Python documentation but not limited to it. I regularly jot down stuff with it, but when it comes to exchanging documents with other people, it is too unwieldy (yet?), as only a few people outside the Python community know and value it. Another point is that tools for editing reST files in a “nice” way are scarce on Windows (I heard that support in TextMate is good – time again to envy Mac users) – more often than not I’d have to convert them to Office formats in order to really enjoy working with them. Besides that, the MS Office applications are used at work, so I rely on them… (and hey, Word is a pretty good piece of software – its work-over mode (Track Changes) is still unmatched!).

What I still need:

  • a good reST-to-anything converter (Pandoc is not good enough yet and still lacks a lot of features)
  • a good reST editor (sadly, I’m not seeing a plugin coming for Notepad++ yet)
  • oh yes, and in an editor I’d like an integrated live view of text diffs, so that a feature like Word’s work-over mode could at least be mimicked – however, I’ve no idea how that could work.

Organizing the Desktop

Desktop clutter – who doesn’t know it? No matter how often I clean it, after a short time it is unorganized and messy again. After having tried different strategies, I finally stumbled upon Fences, a desktop extension that allows grouping files into boxes (called “fences”, similar to MDI windows) that can be arranged on the desktop.

Screenshot - my desktop with fences

I really like it, although it still has minor bugs (sometimes it seems to crash explorer.exe). My personal highlight: with a double click on the desktop, everything except specified icons is hidden – the desktop is ready for a presentation without having colleagues bitch about your messy desktop later. Note the “Focused Actions” fence: it is my “hot zone” (see below in “Filing system”).

As a quick way of accessing often-used folders, I use the good old SlickRun – it just works (tried Launchy once, but it didn’t feel more comfortable, so I dumped it again in favour of SlickRun).

What I still need:

  • Right now I’m not too sure if I need anything different for my desktop – I don’t need those full-blown widgets that seem to be the rage. But I’m more than willing to try other stuff – if I need it I will know…

Filing system, Revisions and Synchronizing

As I really need to have all my files on my laptop’s disk, I decided to use a time-based filing system: for each year, I have a folder (e.g. “2009”), and in this folder there are sub-folders named with the date and the topic (e.g. “2009-08-04 Presentation <Conference Name>”), which hold all the files I need for this task (e.g. images I need, drafts, …). All different versions of the main file are – again – prefixed with the date. This works very well for re-finding stuff, especially as I have the excellent QTTabBar Explorer extension that allows filtering directory contents based on a file name’s substring.
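The naming scheme is simple enough to script. The following is only a sketch of the convention described above; the function names, the archive root and the sample file name are made up for illustration.

```python
import datetime
import os


def task_folder(archive_root, day, topic):
    """Return '<root>/<year>/<YYYY-MM-DD> <topic>' for one task."""
    folder_name = "%s %s" % (day.isoformat(), topic)
    return os.path.join(archive_root, "%04d" % day.year, folder_name)


def versioned_name(day, filename):
    """Prefix a file name with the date so versions sort chronologically."""
    return "%s %s" % (day.isoformat(), filename)


day = datetime.date(2009, 8, 4)
print(task_folder("archive", day, "Presentation"))
print(versioned_name(day, "slides.ppt"))
```

Because ISO dates sort correctly as plain text, sorting by name in any file manager also sorts folders and file versions by time.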

However, all files I am currently working on are in the “hot zone” (the above-mentioned “Focused Actions” fence) on my desktop, where I can quickly and easily access them. When I have finished working on such a document, I eventually move it to the respective folder in my archive.

Some files I have to work on over a longer time are in a special “drafts” folder. Everything in there changes very often (e.g. my notes about ideas I have, or todo files), so I have a versioning system installed there (in my case the distributed VCS Bazaar – yet I’m hearing great things about git and Mercurial, so I will eventually try them) that synchronizes with a local repository in my Dropbox folder. This is necessary as I have no internet access while commuting. So I’m carrying around my VCS, which is synchronized with all my computers as soon as I go online – works like a charm!

What I still need:

  • I’m missing the possibility to tag files, but I haven’t found my perfect solution yet. Does anybody know of an app for WinXP that makes this easy? One that lets me distribute my files all over the hard disk and still tracks them when they are copied or moved? Perhaps even integrated into Windows Explorer and the standard “Open File” dialog? That would be too sweet…
  • Ah… Dropbox. I really love it, especially the feature to share folders with other Dropbox users. Still, one thing that keeps me from having a paid plan for 50GB or 100GB is that my harddisk is small (60GB) and I can’t afford to put too much on Dropbox, as it will synchronize everything to all my computers. Why isn’t there a clever “only download this folder to my harddisk”-option so I could only have the necessary stuff on my disk?
  • Bazaar is nice, although sometimes a bit slow… I’d love to have it integrated as a plug-in in Notepad++, but in the meantime doing it by hand works fine.

Actions and Tasks, Getting Things Done?

I used to have a text file for todos, but lately I have discovered the “Getting Things Done” app MonkeyGTD (based on the mind-bogglingly good TiddlyWiki, entirely written in one (!) HTML file, using JavaScript-foo).

Screenshot of GTD-Wiki monkeyGTD

It offers everything I need, from tasks and “ticklers” (reminders) for tasks to general notes and journals. It is so much more lightweight than, for example, Chandler, and I don’t need all the bells and whistles that other apps have. Made into its own app using Prism for Firefox, it sits prominently on my desktop and is used regularly. And since it lives in Dropbox, I have it synchronized on all my computers.

What I still need:

  • Right now I don’t know. If I find a better solution, I will know it.

Conclusions

In this post I presented some ways of working with my computer that work for me. In blogs, these are often sold as “productivity tools”, whereas my impression is that the time saved by using these tools was spent looking for them and playing with them (and very often cleaning up the mess that some tools leave behind and transferring data to another app) – so I don’t really know if there was a productivity gain… Additionally, the tools mentioned above are all pretty light-weight, meaning that they are specialized for one task and do not try to be a full-fledged “suite for something”. I like that.

How do you organize your desktop and filing system? Do you know something that you have used successfully and that I might like? Go on, write about it in your own blog or below in my comments!

Presenting… ZipDocSrv!

ZipDocSrv serves zipped documentation locally

I program in different languages (Python at home; Java, JavaScript etc. for my thesis project), so I need a lot of documentation. However, when I commute I don’t have access to the net, so I have to download all that HTML documentation and unzip it into the respective docs folder.

Last time I looked I had 269 MB worth of HTML documentation in my Python docs folder alone – rarely used, lying there, wasting space. Firefox (maybe other browsers as well) allows browsing zip files (with the jar: protocol), but I don’t feel comfy with that – so I scratched my own itch and wrote a local server for zip files (aptly named ZipDocSrv*). Packing all that stuff into a neat zip file, I compressed all documentation to 71 MB, saving 74% of space!

ZipDocSrv is a command-line app that lets you mount multiple zip files and open them in your normal browser. Additionally I have added a GUI (as an exercise for me, actually, to get to know wxPython) for people who don’t like editing configuration files themselves.
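To give an idea of the principle, here is a minimal sketch using only the Python standard library. It is not the ZipDocSrv source: it serves a single archive, and the names lookup, make_handler and serve as well as the file name in the usage comment are made up for illustration.

```python
import mimetypes
import zipfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote, urlsplit


def lookup(archive, url_path):
    """Return (content_type, data) for url_path, or None if not in the archive."""
    # Drop query string and leading slash; "/" falls back to index.html.
    member = unquote(urlsplit(url_path).path).lstrip("/") or "index.html"
    try:
        data = archive.read(member)
    except KeyError:
        return None
    content_type = mimetypes.guess_type(member)[0] or "application/octet-stream"
    return content_type, data


def make_handler(zip_path):
    """Build a request handler class that serves members of one zip archive."""
    archive = zipfile.ZipFile(zip_path)

    class ZipHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            found = lookup(archive, self.path)
            if found is None:
                self.send_error(404, "Not found in archive")
                return
            content_type, data = found
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return ZipHandler


def serve(zip_path, port=8000):
    """Serve zip_path on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(zip_path)).serve_forever()

# Usage (hypothetical file name): serve("python-docs.zip"),
# then open http://127.0.0.1:8000/ in the browser.
```

Files are read straight out of the archive on each request, so nothing is ever unpacked to disk. Mounting several zip files would need an extra mapping from a URL prefix to an archive.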

If you want to try ZipDocSrv, download the GUI app (packed with py2exe).

The code is written in Python. The (horrible! horrible! horrible! You have been warned!) source code can be downloaded here

Be warned: There will be bugs. If you find them or have any feedback, I’d be glad to hear about it in the comments!


* Actually, I haven’t found a better name for it – any suggestions?