This week we managed to agree on 60% of the design document (which includes the crawler part), so I could start coding. By the end of the week I had implemented a crawling mechanism that takes a URL, fetches the raw page, parses the content and returns the plain text.

The crawler uses a cURL mechanism which fools the server into thinking my application is a browser (a fake Firefox running on Windows NT, aka Windows XP); it even has a cookie jar :) to store the site cookies. The crawler doesn’t have a login mechanism yet, but if we need authentication to read a page, I will need to send the login parameters through POST and enable CURLOPT_POST. For HTTPS pages, CURLOPT_SSL_VERIFYPEER controls whether cURL verifies the server’s certificate.
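Here is a minimal sketch of what such a cURL setup could look like. The function names (`buildCurlOptions`, `fetchPage`), the exact User-Agent string and the cookie file path are assumptions for illustration, not the project's actual code. Only the `CURLOPT_*` constants are real PHP cURL options.

```php
<?php
// Hypothetical helper: builds the cURL options for a browser-like request.
function buildCurlOptions(string $url, string $cookieFile, array $loginParams = []): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the page as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        // Assumed User-Agent: some Firefox build reporting Windows NT 5.1 (Windows XP).
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0) Gecko/20100101 Firefox/10.0',
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on later requests
        CURLOPT_SSL_VERIFYPEER => true,        // verify the server certificate on HTTPS pages
    ];

    // Only switch to POST when login parameters are supplied.
    if ($loginParams !== []) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($loginParams);
    }

    return $options;
}

// Hypothetical helper: performs the request and returns the raw page.
function fetchPage(string $url, string $cookieFile, array $loginParams = []): string
{
    $handle = curl_init();
    curl_setopt_array($handle, buildCurlOptions($url, $cookieFile, $loginParams));

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    // The handle is released when it goes out of scope.
    return $body;
}
```

Keeping the option list in its own function makes it easy to check what will be sent without touching the network, e.g. `fetchPage('http://ro.wikipedia.org/', '/tmp/cookies.txt')` for a plain GET.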

Since we might have different parsing algos for different sites, I created an AbstractCrawler with 2 abstract methods (startCrawling and parseText) that need to be implemented in each derived Crawler class. In startCrawling you can choose whether to log in or use an SSL connection (both to be implemented), and in parseText you can choose the best way to get plain text for that site.
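The class layout could be sketched roughly like this. Only the names AbstractCrawler, startCrawling and parseText come from the design. The method signatures, the `crawl` convenience method and the `CannedPageCrawler` subclass are assumptions made up for this example.

```php
<?php
abstract class AbstractCrawler
{
    // Fetch the raw page for $url; a subclass decides here about login or SSL.
    abstract public function startCrawling(string $url): string;

    // Turn the raw HTML into plain text in whatever way suits the site.
    abstract public function parseText(string $rawHtml): string;

    // Hypothetical convenience method tying the two steps together.
    public function crawl(string $url): string
    {
        return $this->parseText($this->startCrawling($url));
    }
}

// Hypothetical subclass for illustration only: it serves a canned page
// instead of going to the network.
class CannedPageCrawler extends AbstractCrawler
{
    private $html;

    public function __construct(string $html)
    {
        $this->html = $html;
    }

    public function startCrawling(string $url): string
    {
        return $this->html;
    }

    public function parseText(string $rawHtml): string
    {
        // Simplest possible strategy: drop the tags and decode entities.
        return trim(html_entity_decode(strip_tags($rawHtml), ENT_QUOTES, 'UTF-8'));
    }
}
```

A real subclass would do the cURL request in startCrawling and pick a site-specific extraction strategy in parseText; callers would only ever use `crawl`.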

I had to decide between a couple of libraries: PHP’s DOMDocument, Simple HTML DOM, Tidy, Ganon and phpQuery. Since we might have to work with broken HTML (unclosed tags), I also found HTMLPurifier, a PHP library which fixes broken HTML.

When I tested the libraries, all of them managed to parse broken HTML (so no need for HTMLPurifier), but Simple HTML DOM caught my attention through its ease of use and its reviews.
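To give an idea of what parsing broken HTML into plain text involves, here is a small sketch. It uses PHP’s built-in DOMDocument rather than Simple HTML DOM, simply because it needs no extra download. The function name `htmlToPlainText` and the choice to drop script and style elements are assumptions for this example.

```php
<?php
// Hypothetical helper: extracts plain text from possibly broken HTML.
function htmlToPlainText(string $html): string
{
    // Collect parse warnings (e.g. unclosed tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    // Prepending an XML encoding declaration is a common trick to make
    // libxml read the input as UTF-8, so diacritics survive.
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($document->documentElement === null) {
        return '';
    }

    // Remove elements whose content is not readable text.
    $xpath = new DOMXPath($document);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse runs of whitespace into single spaces.
    $text = (string) preg_replace('/\s+/u', ' ', $document->documentElement->textContent);

    return trim($text);
}
```

The parser closes the unclosed tags on its own, which matches what I saw in the tests: a separate repair step was not needed before extracting the text.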

  1. What websites do you plan to crawl? Will it be only actual literature, or are you using the term generically? Do you plan on using some kind of learning mechanism to learn where diacritics usually go?

    • Hi Strainu, we want to crawl actual literature sites like http://ro.wikipedia.org/, dilemaveche.ro/ or romlit.ro/, which have a strong base of diacritics. We might crawl http://www.strainu.ro/ too :) . Too bad you have both English and Romanian texts. We might need an English word list to determine whether to keep a paragraph or not. This depends on how much time we spend on the other parts (indexing and building statistics on diacritics).
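A rough sketch of that filtering idea, in case it helps the discussion: count how many words of a paragraph appear in a small English word list and skip the paragraph when the share is too high. The word list, the function names and the 0.2 threshold are all assumptions picked for illustration, and the code relies on the mbstring extension.

```php
<?php
// Hypothetical, deliberately tiny word list; a real one would be much larger.
// Words that also exist in Romanian (e.g. "in", "are") are left out on purpose.
const ENGLISH_WORDS = ['the', 'and', 'of', 'to', 'is', 'that', 'with', 'this', 'was', 'have'];

// Share of the words in $paragraph that appear in the English word list.
function englishWordRatio(string $paragraph): float
{
    // Split into lowercase word tokens (Unicode letters only).
    preg_match_all('/\p{L}+/u', mb_strtolower($paragraph, 'UTF-8'), $matches);
    $words = $matches[0];

    if ($words === []) {
        return 0.0;
    }

    $hits = 0;
    foreach ($words as $word) {
        if (in_array($word, ENGLISH_WORDS, true)) {
            $hits++;
        }
    }

    return $hits / count($words);
}

// Assumed threshold: treat a paragraph as English when at least 20% of its
// words are on the list.
function looksEnglish(string $paragraph, float $threshold = 0.2): bool
{
    return englishWordRatio($paragraph) >= $threshold;
}
```

Paragraphs for which `looksEnglish` returns true would simply be left out of the diacritics statistics.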

