#5 DexOnline – Romanian Literature Crawler

Hi,

This week I built a stable crawling mechanism that can crawl just one section of the site: the improved link-following mechanism follows only links under the start URL. The crawler also has a mechanism that transforms relative links into absolute links.
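The two rules above can be sketched roughly like this — a minimal, illustrative version, not the crawler's actual code; the function names and the simple path-joining logic are my assumptions:

```php
<?php
// Sketch of the two link rules: make every link absolute, then
// follow it only if it sits under the start URL.
function absolutize($base, $link) {
    // Already absolute? Keep it as-is.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link: attach it to the scheme + host.
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }
    // Relative link: resolve against the base URL's directory.
    $dir = preg_replace('#/[^/]*$#', '/', $base);
    return $dir . $link;
}

function shouldFollow($startUrl, $absoluteLink) {
    // Follow only links under the start URL.
    return strpos($absoluteLink, $startUrl) === 0;
}
```

A real crawler would also need to handle `..` segments and query strings, which this sketch ignores.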

A big problem I still have is migrating my database queries from the Idiorm library to Paris. I could not finish this yet because Paris requires a model class for each table.
e.g. If I have the table 'CrawledPage', I need a class called 'CrawledPage' which extends 'Model'.
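The class-per-table convention looks roughly like this. The real `Model` base class comes from paris.php; a tiny stub stands in for it here so the snippet runs on its own (and its `factory()` is simplified — the real Paris factory returns a query wrapper, not a bare instance):

```php
<?php
// Stand-in for the Model base class that paris.php provides.
// Simplified: real Paris returns a wrapper you can chain queries on.
class Model {
    public static function factory($class_name) {
        return new $class_name();
    }
}

// Paris derives the table name 'CrawledPage' from the class name,
// so an empty class is enough to map the table.
class CrawledPage extends Model {
}

$page = Model::factory('CrawledPage');
// With the real library one would then chain query methods, e.g.:
// Model::factory('CrawledPage')->where('url', $url)->find_one();
```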

Another problem while crawling was that my application parsed everything it downloaded, even files that are not in HTML format (like PNG images). To fix this I added a mechanism which tells me what type of page I'm downloading: text, HTML, PNG, etc.
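One way to do this check — a sketch assuming the crawler inspects the HTTP Content-Type header before parsing; the function name and the whitelist are illustrative:

```php
<?php
// Decide whether a response is worth parsing, based on its
// Content-Type header. Binaries like image/png are skipped.
function isParsableContentType($contentType) {
    $parsable = array('text/html', 'application/xhtml+xml', 'text/plain');
    // Strip parameters such as '; charset=utf-8' before comparing.
    $parts = explode(';', $contentType);
    $mime = strtolower(trim($parts[0]));
    return in_array($mime, $parsable);
}
```

In practice the header can be read from `$http_response_header` after `file_get_contents()`, or via `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` when using cURL.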

I left my crawler running for a while, and when I came back I found that it was the third-largest consumer of system resources. After some googling I learned that variables whose references are lost are only marked for cleanup, and are freed when the application finishes or when system memory runs low. After some more searching I found that newer versions of PHP let you trigger garbage collection explicitly, an option that was not present in older PHP versions.
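A sketch of how this could look in a long-running crawl loop — the loop body is a placeholder; the explicit collection call is `gc_collect_cycles()`, which to my knowledge was introduced in PHP 5.3:

```php
<?php
// Keep a long-running crawl loop from piling up memory:
// drop references as soon as possible, then force collection
// of any leftover circular references.
function crawlBatch($urls) {
    foreach ($urls as $url) {
        $html = str_repeat('x', 1000); // placeholder for download + parse
        // ... process $html ...
        unset($html); // drop the reference so the memory can be reclaimed
    }
    if (function_exists('gc_collect_cycles')) {
        // Returns the number of collected cycles (PHP >= 5.3).
        return gc_collect_cycles();
    }
    return 0;
}
```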

I have to give credit to the logging mechanism built last week, because it has helped me a lot so far.
