This week we managed to agree on 60% of the design document (which includes the crawler part), so I could start coding. By the end of the week I had implemented a crawling mechanism that takes a URL, fetches the raw page, parses the content and returns the plain text.
The crawler uses cURL in a way that fools the server into thinking my application is a browser (a fake Firefox running on Windows NT 5.1, aka Windows XP), and it even has a cookie jar to store the site's cookies. The crawler doesn't have a login mechanism yet; if a page requires authentication, I will need to send the login parameters through POST by enabling CURLOPT_POST. For HTTPS pages I can configure CURLOPT_SSL_VERIFYPEER.
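The fetch step described above could look roughly like this. This is a hedged sketch, not the project's actual code: the function name, the cookie-jar path and the exact user-agent string are my own illustrative choices; the cURL options are the ones mentioned in the post.

```php
<?php
// Illustrative fetch step: pretend to be Firefox on Windows XP and keep a
// cookie jar so session cookies survive between requests. Names here
// (fetchPage, the jar path, the UA string) are assumptions for the example.
function fetchPage(string $url, string $cookieJar = '/tmp/crawler_cookies.txt')
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0) Gecko/20100101 Firefox/10.0',
        CURLOPT_COOKIEJAR      => $cookieJar,  // write cookies here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieJar,  // ...and send them back on the next request
        // For a page behind a login, the post mentions additionally setting:
        //   CURLOPT_POST => true, CURLOPT_POSTFIELDS => http_build_query($credentials)
        // and for HTTPS, CURLOPT_SSL_VERIFYPEER as appropriate.
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // raw page markup, or false on failure
}
```

The cookie jar is what makes the "fake browser" convincing: the same file is used both for writing (CURLOPT_COOKIEJAR) and reading (CURLOPT_COOKIEFILE), so a session cookie set on one request is sent back on the next.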
Since we might need different parsing algorithms for different sites, I created an AbstractCrawler with two abstract methods (startCrawling and parseText) that each derived crawler class must implement. In startCrawling you can choose to log in or open an SSL connection (both still to be implemented), and in parseText you can choose the best way to get plain text for that site.
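A minimal sketch of that class hierarchy, assuming the two abstract methods named above; everything else (the crawl() template method, the stubbed ExampleCrawler) is invented for illustration:

```php
<?php
// Sketch of the AbstractCrawler idea: the two abstract methods come from
// the design above, the rest is an assumed shape for the example.
abstract class AbstractCrawler
{
    // Fetch the raw page; a subclass may log in or open an SSL
    // connection here before requesting the URL.
    abstract public function startCrawling(string $url): string;

    // Site-specific extraction of plain text from the raw HTML.
    abstract public function parseText(string $html): string;

    // Template method tying the two steps together.
    public function crawl(string $url): string
    {
        return $this->parseText($this->startCrawling($url));
    }
}

// A hypothetical derived crawler; the fetch is stubbed so the
// example runs without a network connection.
class ExampleCrawler extends AbstractCrawler
{
    public function startCrawling(string $url): string
    {
        return '<html><body><p>Hello</p></body></html>'; // stub instead of a real fetch
    }

    public function parseText(string $html): string
    {
        return trim(strip_tags($html)); // crude but fine for this site
    }
}
```

Each new site then only needs its own subclass with those two methods; the rest of the pipeline stays shared.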
I had to choose among several libraries: PHP's DOMDocument, Simple HTML DOM, Tidy, Ganon and phpQuery. Since we might have to work with broken HTML (unclosed tags), I also found HTMLPurifier, a PHP library that fixes broken HTML.
As I tested the libraries, all of them managed to parse broken HTML (so there is no need for HTMLPurifier), but Simple HTML DOM caught my attention through its ease of use and its positive reviews.
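To show the broken-HTML tolerance in practice, here is the test I have in mind with PHP's built-in DOMDocument (one of the candidates above); the repair happens inside the parser itself, so no HTMLPurifier pass is needed. With Simple HTML DOM the equivalent would be roughly `str_get_html($broken)->plaintext`. The sample markup is my own.

```php
<?php
// Deliberately broken HTML: unclosed <p> and <b> tags, no </body></html>.
$broken = '<html><body><p>First paragraph<p>Second <b>bold';

libxml_use_internal_errors(true);  // keep the parse warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($broken);           // the parser closes the tags for us
libxml_clear_errors();

// The document tree is now well-formed, and plain text falls out directly.
echo $doc->textContent;
```

DOMDocument quietly rebuilds the tree (both paragraphs and the bold run survive), which matches what I saw across all the candidate libraries.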
Well, that’s all folks!