This week I build a stable crawling mechanism, which can crawl only a location of the site, having a better Link-following mechanism which follows only under the start URL. The crawled also has a mechanism which transforms relative links to absolute links.
A big problem which I still have is changing the way I am querying the database from idiorm to the paris library. I could not fix this because paris asks me for classes to build the table from.
e.g. If I have the table ‘CrawledPage’, I would need a class called ‘CrawledPage’ which extends ‘Model’.
Another problem when crawling was that my application parsed anything, even files which are not in html format (like png images). To fix this I added a mechanism which tells me what type of page I’m downloading: text, html, png, etc
I left my crawler running for a while and when I came back I found out that it was the 3rd system resources consummer. After some googling I found out that lost variable references are marked for cleanup and freed when the application is finished or when there’s insufficient system memory. After some more search I found out that newer versions of php let you call the garbage collection explicitly, option which was not present in older php versions..
I have to give credit to the logging mechanism build last week because it helped me a lot so far.