Right now I am in the process of building a web document corpus for web genre detection. That poses the question what a “good” corpus looks like. I have thought (and researched, of course 😉 ) about it for a while and I came up with following requirements:
- A good corpus needs to consist of randomly acquired web documents. It is important that the page in respective genres (in my case blogs) are pretty much different in themselves by content and by structure – as diverse as documents of a single genre can be.
- As I focus a structural approach that doesn’t depend on linguistic features or depends on one language, I need to build a multi-lingual corpus, meaning that I will incorporate documents in German, Czech, Russian etc. as well.
- How many documents are enough? Related works state their corpus size to be around 200 – 300 hand-labelled documents per genre. Manually labelling is no fun, but needed to ensure quality of the corpus.
This list is not exclusive, as there are many more requirements for a corpus, but this is a work-in-progress, and so I will just leave the other stuff out.
For selecting appropriate web resources for the blog genre I first tried to grab the RSS-feeds of blog pinging services (like this one from syndic8), but soon I had the problem of too much ping spam in there. So I tried the open web directory project dmoz, downloaded the RDF-Data, parsed it using my favorite programming language Python and found a lot of URLs – beautifully categorized and hand-labelled (and the best thing is: not by me 😉 ).
My next step is to randomly select documents, checking that they have been categorized correctly by the dmoz folks, and download them. It still is a long way, but I think I made a good progress in short time thanks to dmoz.