Parsing Date Strings with pyparsing

Now that the genre detection works pretty well (a lot has happened in the last months, and I was too busy to write about it here), it is time to fine-tune the approach and improve the results. One feature I deliberately left out in the beginning, because I didn’t see how to achieve it in a short time, is the occurrence of date strings like “May 25th, 2008” or “21.12.2007” in the text of a web page, which is obviously a good feature for some web genres. There are a lot of regular expressions floating around that do this more or less elegantly, but a) they are rarely understandable, b) they are therefore not easy to alter, c) they often miss some of the many formats of human-readable dates and d) it is so much more fun to do it yourself 🙂 .

As I am fiddling around with Python anyway, I thought I’d give pyparsing a try. After having found a few how-tos and articles (e.g. these slides), I got started rather easily, and I am pretty impressed by how quickly one can write a decent (in this case not descent 🙂 ) parser that Does What I Want. I am not finished yet, but the most basic use cases work. I still have to put together unit tests and polish the script. The one thing that could still be better is performance (heck, it is s-l-o-w when scanning a reasonably big HTML page for occurrences of date strings), but I think the responsibility is mostly mine, as my code follows a brute-force approach and I’m sure there are lots of ways to speed it up.
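
To give an idea of what such a grammar can look like, here is a minimal sketch. It is not the script described in this post. It assumes only the two formats mentioned above and English month names, and the names date_expr and find_dates are made up for illustration. It uses pyparsing's classic camelCase names (oneOf, scanString); newer releases also offer them as one_of and scan_string.

```python
from pyparsing import Optional, Suppress, Word, nums, oneOf

MONTHS = ("January February March April May June July August "
          "September October November December")

# Building blocks: one- or two-digit day, four-digit year.
day = Word(nums, min=1, max=2)("day")
year = Word(nums, exact=4)("year")

# "May 25th, 2008": month name, day, optional ordinal suffix, optional comma, year.
month_name = oneOf(MONTHS)("month")
ordinal = Optional(Suppress(oneOf("st nd rd th")))
us_date = month_name + day + ordinal + Optional(Suppress(",")) + year

# "21.12.2007": day, numeric month and year separated by dots.
month_num = Word(nums, min=1, max=2)("month")
dotted_date = day + Suppress(".") + month_num + Suppress(".") + year

# Illustrative top-level expression: either of the two formats.
date_expr = us_date | dotted_date


def find_dates(text):
    """Return (matched_text, day, month, year) for every date string in text."""
    found = []
    for tokens, start, end in date_expr.scanString(text):
        found.append((text[start:end], tokens["day"], tokens["month"], tokens["year"]))
    return found


if __name__ == "__main__":
    print(find_dates("Posted on May 25th, 2008 and updated on 21.12.2007."))
```

Note that scanString tries the expression at every position of the input, which is exactly the kind of brute-force scanning that gets slow on a big page. The sketch also does not check that the numbers form a valid calendar date.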

Well, we’ll see… I think as soon as the code looks a bit better I’ll release it here for anyone interested.

Week-end is nigh

What a week this has been… I had no chance to do anything research-related. But today I got some interesting input from Alex concerning grammar queries for tree structures. I will have a closer look into that when the new week has started.

Furthermore, my colleague Doreen has successfully presented our ELWMS.KOM prototype for personal resource-based learning in Berlin. Congrats!

Corpus Building

Right now I am in the process of building a web document corpus for web genre detection. That raises the question of what a “good” corpus looks like. I have thought (and researched, of course 😉 ) about it for a while and came up with the following requirements:

  • A good corpus needs to consist of randomly acquired web documents. It is important that the pages within a genre (in my case blogs) differ from each other in content and in structure, and are as diverse as documents of a single genre can be.
  • Many corpora found on the web have one severe flaw: being collected for a single purpose, they contain the document data but rarely style sheets, JavaScript files or images. In my opinion these can provide features for detecting a genre as well, so I need them. Unfortunately, many corpora don’t give the original source (URL) of their documents, so it isn’t possible to fetch the same documents again and validly compare my own approach to the one used with these corpora. So I need my corpus to be reconstructible to allow future comparisons.
  • As I focus on a structural approach that depends neither on linguistic features nor on one particular language, I need to build a multilingual corpus, meaning that I will incorporate documents in German, Czech, Russian etc. as well.
  • How many documents are enough? Related works state their corpus size to be around 200 – 300 hand-labelled documents per genre. Manual labelling is no fun, but it is needed to ensure the quality of the corpus.

This list is not exhaustive, as there are many more requirements for a corpus, but this is a work in progress, so I will just leave the other stuff out for now.

For selecting appropriate web resources for the blog genre, I first tried to grab the RSS feeds of blog pinging services (like this one from syndic8), but there was too much ping spam in them. So I turned to the open web directory project dmoz: I downloaded the RDF data, parsed it with my favorite programming language, Python, and found a lot of URLs, beautifully categorized and hand-labelled (and the best thing is: not by me 😉 ).
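
Pulling URLs out of such a dump could look roughly like the sketch below. It is an illustration, not the script I actually used. It assumes that each page is listed as an ExternalPage element with an about attribute and a topic child; the sample data, the category path and the function names are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of the directory dump (assumed structure, see above).
SAMPLE = b"""<RDF xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://blog.example.org/">
    <topic>Top/Computers/Internet/On_the_Web/Weblogs/Personal</topic>
  </ExternalPage>
  <ExternalPage about="http://shop.example.com/">
    <topic>Top/Shopping/Books</topic>
  </ExternalPage>
</RDF>"""


def local_name(tag):
    """Strip a namespace part such as '{http://...}' from an element tag."""
    return tag.rsplit("}", 1)[-1]


def urls_in_category(source, category_prefix):
    """Yield the URL of every page whose topic starts with category_prefix."""
    # iterparse streams the file, so the whole dump never has to fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        topic = next(
            (child.text or "" for child in elem if local_name(child.tag) == "topic"),
            "",
        )
        if topic.startswith(category_prefix):
            yield elem.get("about")
        elem.clear()  # drop the children of pages that are already handled


if __name__ == "__main__":
    blogs = urls_in_category(io.BytesIO(SAMPLE), "Top/Computers/Internet/On_the_Web/Weblogs")
    print(list(blogs))
```

For a real dump, pass an open file (in binary mode) instead of the io.BytesIO wrapper.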

My next step is to randomly select documents, check that they have been categorized correctly by the dmoz folks, and download them. There is still a long way to go, but I think I have made good progress in a short time thanks to dmoz.
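
The random selection can be kept reconstructible by drawing with a fixed seed and storing the chosen URLs. The following is only a sketch under those assumptions: the function names, the manifest layout and the example URLs are hypothetical.

```python
import json
import random


def sample_urls(urls, per_genre, seed):
    """Pick up to per_genre URLs at random; the same seed gives the same pick."""
    rng = random.Random(seed)
    # Deduplicate and sort first, so the draw depends on the seed only
    # and not on the order in which the URLs happened to be collected.
    candidates = sorted(set(urls))
    return rng.sample(candidates, min(per_genre, len(candidates)))


def build_manifest(genre, urls, seed):
    """Describe a sample so that the same documents can be fetched again later."""
    return {"genre": genre, "seed": seed, "urls": urls}


if __name__ == "__main__":
    # Hypothetical candidate list; in practice this comes from the directory dump.
    candidates = ["http://blog%d.example.org/" % i for i in range(1000)]
    chosen = sample_urls(candidates, per_genre=300, seed=2008)
    print(json.dumps(build_manifest("blog", chosen, seed=2008), indent=2)[:200])
```

Publishing such a manifest together with the labels would give other people the source URLs that many existing corpora lack.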

Working retreat in the Kleinwalsertal

Since yesterday my department has been on a working retreat in a guest house in the idyllic Kleinwalsertal. The weather is still kind of shitty, but we strongly expect it to clear up by the start of the weekend 😉

Communication between HTML and XUL

As promised yesterday, here is the source code for communicating between an unprivileged HTML page and a privileged Firefox extension.

The extension code

This code is to be inserted in the extension.

/**
 * Provides a way of transferring arbitrary JSON objects between an HTML page and
 * the extension; this script is to be inserted in the extension code
 *
 * For Client (HTML) - Javascript, see <code>dataclient.js</code>
 *
 * @author Phil
 * @date 2007/08/23
 * @see http://forums.mozillazine.org/viewtopic.php?t=171216
 * @see http://www.json.org/js.html for JSON support
 */
var DataTransferListener = {
	ELWMS_EVENT_NAME: "ELWMSDataTransferEvent",
	ELWMS_EVENT_BACK_NAME: "ELWMSDataBackchannelEvent",
	/**
	 * Listener that subscribes to custom ELWMS Event
	 *
	 * @param {Event} aEvent the event thrown by the HTML Element
	 */
	listenToHTML: function(aEvent) {
		// first we have to check if we allow this... based on URL or what?
		if (aEvent.target.ownerDocument.location.host != "localhost") {
			// TODO: add security here (e.g. from Pref/Setting of applet serving host)!
			// alert("As for security issues only secure HTML pages may pass data to ELWMS extension.");
			// return;
		}
		// data is an escaped JSON string
		var data = unescape(aEvent.target.getAttribute("data")).parseJSON();
		// what to do with the received data?
		var retval = DataTransferListener.handleData(data, aEvent.target);
		// if back data is given:
		if (retval != null) {
			// add escaped and JSONified Object to <code>returnvalue</code>-Attribute
			aEvent.target.setAttribute("returnvalue", escape(retval.toJSONString()));
			// fire event to notify HTML-Page of return value
			var ev = window.document.createEvent("Event");
			ev.initEvent(DataTransferListener.ELWMS_EVENT_BACK_NAME, true, false);
			aEvent.target.dispatchEvent(ev);
		}
	},
	/**
	 * this function should handle all arriving data
	 *
	 * @param {Object} data The JSON Object
	 * @param {HTMLNode} target The node that fired the event
	 */
	handleData : function(data, target) {
		alert("DataTransferListener.handleData: obtained " + data.name);
		if (data.id > 1000) {
			alert("DataTransferListener.handleData: returning changed data");
			return {id:2000, name:"Pong"};
		}
		return null;
	}
};
 
/**
 * According to the forum thread referenced above, the 4th parameter (wantsUntrusted)
 * denotes whether events from untrusted content are accepted.
 */
document.addEventListener(DataTransferListener.ELWMS_EVENT_NAME, DataTransferListener.listenToHTML, false, true);

The client-side (Javascript in HTML) code

/**
 * Provides a way of transferring arbitrary JSON objects from an HTML page to an
 * extension
 *
 * For Extension (XUL) - Javascript, see <code>datatransfer.js</code>
 *
 * @author Phil
 * @date 2007/08/23
 * @see http://forums.mozillazine.org/viewtopic.php?t=171216
 * @see http://www.json.org/js.html for JSON support
 */
var Communicator = {
	ELWMS_EVENT_NAME: "ELWMSDataTransferEvent",
	ELWMS_EVENT_BACK_NAME: "ELWMSDataBackchannelEvent",
	ELWMS_CALLER_ID : "elwmsdataelement",
	ELWMS_ELEMENT_NAME : "ELWMSDataElement",
 
	/**
	 * initializes the Element and Listeners
	 *
	 */
	init : function() {
		// create data / event firing Elements
		var element = Communicator.createElement();
		// register custom event on callback
		element.addEventListener(Communicator.ELWMS_EVENT_BACK_NAME, Communicator.calledBack, true);
	},
 
	/**
	 * creates the data element
	 *
	 * @return {HTMLElement} the created Element of the type <code>Communicator.ELWMS_ELEMENT_NAME</code>
	 */
	createElement : function() {
		// may I create an Event?
		if ("createEvent" in document) {
		  	// if element is not yet existing
	  		if (!document.getElementById(Communicator.ELWMS_CALLER_ID)) {
		  		var element = document.createElement(Communicator.ELWMS_ELEMENT_NAME);
		  		element.setAttribute("id", Communicator.ELWMS_CALLER_ID);
		  		// attribute containing "data parameter" for extension call
				element.setAttribute("data", "");
				// attribute containing "return value" of extension
				element.setAttribute("returnvalue", "");
		  		document.documentElement.appendChild(element);
		  		return element;
	  		} else {
	  			// element exists - return that
	  			return document.getElementById(Communicator.ELWMS_CALLER_ID);
	  		}
	  	} else {
	  		// some error...
	  		alert("dataclient.js - Communicator.createElement ERROR!");
	  		return null;
	  	}
	},
 
	/**
	 * calls the extension with JSON - data (object)
	 *
	 * @param {Object} data the data to transfer to extension - must be convertible to JSON
	 */
	call : function(data) {
		// create or get our element
		var element = Communicator.createElement();
		element.setAttribute("data", escape(data.toJSONString()));
		// create and fire custom Event to notify extension
		var ev = document.createEvent("Event");
		ev.initEvent(Communicator.ELWMS_EVENT_NAME, true, false);
		element.dispatchEvent(ev);
	},
 
	/**
	 * is called when the extension fires the ELWMS_EVENT_BACK_NAME event; data
	 * may be collected from <code>returnvalue</code>-Attribute.
	 *
	 * @param {Event} aEvent the event
	 */
	calledBack : function(aEvent) {
		// TODO: decide what to do here!
		alert("Communicator.calledBack : " + unescape(aEvent.target.getAttribute("returnvalue")));
	}
};
 
function func(aEvent) {
	Communicator.call({id:1100, name:"Ping"});
}
 
/**
 * on page load, the Communicator is initialized
 */
document.addEventListener("DOMContentLoaded", function(aEvent) {
	Communicator.init();
	// add event Listener on button to test...
	document.getElementById("communicator").addEventListener("click", func, true);
}, false);

I think the code is commented well enough to be understandable without further explanation. For a short overview of how this works, see yesterday’s post.