Corpus Building

Right now I am building a web document corpus for web genre detection. That raises the question of what a “good” corpus looks like. I have thought (and researched, of course 😉 ) about it for a while and came up with the following requirements:

  • A good corpus needs to consist of randomly acquired web documents. It is important that the pages within a genre (in my case blogs) differ widely in both content and structure, as widely as documents of a single genre can.
  • Many corpora found on the web have one severe flaw: having been collected for a single purpose, they contain the document data but rarely the style sheets, JavaScript files or images. In my opinion these can provide features for detecting a genre as well, so I need them. Unfortunately, many corpora give no hint of the original source (URL) of their documents, so it isn’t possible to fetch the same documents again and validly compare my own approach with the one used on these corpora. So I need my corpus to be reconstructible to allow future comparisons.
  • As I focus on a structural approach that doesn’t depend on linguistic features or on a single language, I need to build a multilingual corpus, meaning that I will incorporate documents in German, Czech, Russian etc. as well.
  • How many documents are enough? Related work reports corpus sizes of around 200 to 300 hand-labelled documents per genre. Manual labelling is no fun, but it is needed to ensure the quality of the corpus.

This list is not exhaustive; there are many more requirements for a corpus, but this is a work in progress, so I will leave the rest out for now.
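To make the "reconstructible" requirement a bit more concrete, here is a minimal sketch of a manifest entry that could be stored alongside each document. Everything in it is an assumption for illustration: the field names, the `manifest_entry` helper and the `ResourceCollector` class are hypothetical, and only style sheets, scripts and images referenced directly in the HTML are picked up.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of style sheets, scripts and images in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = None
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            ref = attrs.get("href")
        elif tag in ("script", "img"):
            ref = attrs.get("src")
        if ref:
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, ref))


def manifest_entry(url, html_bytes, genre, language, fetched_at):
    """Describe one document so that it can be fetched and checked again later."""
    collector = ResourceCollector(url)
    collector.feed(html_bytes.decode("utf-8", errors="replace"))
    return {
        "url": url,                # original source, needed for resampling
        "fetched_at": fetched_at,  # pages change, so record when it was seen
        "genre": genre,            # the hand-assigned label
        "language": language,
        "sha256": hashlib.sha256(html_bytes).hexdigest(),
        "resources": collector.resources,
    }
```

The hash lets a later download be compared against the version that was labelled, and the resource list records which style sheets, scripts and images belong to the page.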

To select appropriate web resources for the blog genre, I first tried to grab the RSS feeds of blog pinging services (like this one from syndic8), but soon ran into the problem of too much ping spam in them. So I turned to the Open Directory Project (dmoz), downloaded the RDF data, parsed it using my favorite programming language, Python, and found a lot of URLs, beautifully categorized and hand-labelled (and the best thing is: not by me 😉 ).
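A rough sketch of what such a parsing step might look like, not the script actually used. It assumes the dump stores each link as an `ExternalPage` element with an `about` attribute and a `topic` child; the `SAMPLE` snippet is hand-written to show that assumed layout and is not real dmoz data. The `iter_pages` helper and the category prefix are likewise hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written snippet illustrating the ASSUMED layout of the dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://blog.example.org/">
    <d:Title>Example blog</d:Title>
    <topic>Top/Computers/Internet/On_the_Web/Weblogs/Personal</topic>
  </ExternalPage>
  <ExternalPage about="http://shop.example.com/">
    <d:Title>Example shop</d:Title>
    <topic>Top/Shopping/Books</topic>
  </ExternalPage>
</RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(rdf_file, topic_prefix):
    """Yield (url, topic) for every page filed under the given category prefix."""
    # iterparse streams the file, so the whole dump is never parsed in one go.
    for _event, elem in ET.iterparse(rdf_file, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        url = elem.get("about")
        topic = next(
            (child.text or "" for child in elem if local_name(child.tag) == "topic"),
            "",
        )
        if url and topic.startswith(topic_prefix):
            yield url, topic
        elem.clear()  # drop the page's children once it has been handled


if __name__ == "__main__":
    prefix = "Top/Computers/Internet/On_the_Web/Weblogs"  # hypothetical category
    for url, topic in iter_pages(io.BytesIO(SAMPLE), prefix):
        print(url, topic)
```

For the real dump, the `io.BytesIO(SAMPLE)` argument would be replaced by an open file handle; the element and attribute names should be checked against the actual file first.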

My next step is to randomly select documents, check that they have been categorized correctly by the dmoz folks, and download them. There is still a long way to go, but I think I have made good progress in a short time thanks to dmoz.
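The selection and download step could be sketched roughly as follows. This is only an illustration under assumptions: `sample_urls` and `download` are hypothetical helpers, the User-Agent string is made up, and the seeded random generator is my suggestion for keeping the random draw repeatable, in line with the reconstructibility requirement.

```python
import random
import urllib.request


def sample_urls(urls, k, seed):
    """Draw up to k distinct URLs; the same seed always gives the same sample."""
    rng = random.Random(seed)
    # Deduplicate and sort first so the result does not depend on input order.
    candidates = sorted(set(urls))
    return rng.sample(candidates, min(k, len(candidates)))


def download(url, timeout=10):
    """Fetch the raw bytes of one page (no retries or error handling here)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "corpus-builder-sketch"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Each sampled page would still have to be checked by hand before it goes into the corpus; the code only covers the random draw and the fetch.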

2 Responses to “Corpus Building”

  1. My dissertation from 2005 might provide you with more insight on the web genre problem: Rosso M. (2005), Using Genre to Improve Web Search, PhD dissertation submitted for the degree of Doctor of Philosophy, University of North Carolina, Chapel Hill, USA.

    Also, look for a Journal of the American Society for Information Science and Technology (JASIST) article by me in the next month or two: User-based Identification of Web Genres

  2. […] that the Genre Detection stuff (a lot has happened in the last months… was too busy to write more about it here…) […]
