#7 DexOnline – Romanian Literature Crawler

Hi,

Sorry I forgot to provide you with a link to my work:

Crawler

Last week I forgot to post, so I'll state my progress here: I learned how to use the Smarty library and used it to build a functional crawlerLog page, where you can follow the Crawler's progress from your computer or smartphone.

This week I used AJAX on the crawlerLog web page to refresh its information every 5 seconds. I also fixed the broken-HTML problem on www.romlit.ro in a general way: I repair the HTML with simple_html_dom by removing styles and scripts and adding body tags where there are none, so I don't have to use a different HTML parser for romlit. Finally, I improved the Crawler by adding features such as crawling only a certain area of a site, and by abstracting the database query layer so that switching database technology is easier (e.g. MySQL is not very scalable with the amount of data we continue to gather, so we may turn to PL/SQL).
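To make the HTML repair step more concrete, here is a minimal sketch of the idea. It is not the Crawler's actual code, which is PHP and relies on simple_html_dom; this version uses only Python's standard `re` module, and the name `repair_html` is hypothetical. It also assumes that "removing styles" means dropping `<style>` blocks, so inline `style` attributes are left alone.

```python
import re

# Matches whole <script>...</script> and <style>...</style> blocks,
# including any attributes on the opening tag.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Detects an existing opening <body> tag.
_HAS_BODY = re.compile(r"<body\b", re.IGNORECASE)


def repair_html(raw: str) -> str:
    """Strip scripts and styles, then wrap the markup in <body> if it has none.

    Hypothetical helper for illustration only; a regex pass like this is
    much cruder than a real HTML parser.
    """
    cleaned = _SCRIPT_STYLE.sub("", raw)
    if not _HAS_BODY.search(cleaned):
        cleaned = "<body>" + cleaned + "</body>"
    return cleaned
```

The cleaned string could then be handed to a single parser for every site, which is the point of doing the repair up front rather than special-casing romlit.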
