#9 DexOnline – Romanian Literature Crawler

Hi,

This week I did some testing and I decided that we will have better scrapped text if we just make custom HTML parsing for each domain. I saw that romlit.ro is placing valuable text between paragraph tags and wikipedia is using <div id=”mainContent”></div> and also paragraphs.

I also password protected my crawler status page (in browser) in an easy manner with .htaccess and htpasswd, to restrict regular access.

At the end of the week I started implementing the diacritics mechanism. This is a long shot because of mysql poor speed when working with millions of records so stay tuned to find out if we will decide to use mongodb instead.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>