This week I ran some tests and concluded that we will get better scraped text if we write a custom HTML parser for each domain. I noticed that romlit.ro places its valuable text between paragraph tags, while Wikipedia uses <div id="mainContent"></div> together with paragraphs.
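As a rough illustration of what a per-domain rule could look like, here is a minimal sketch (not the crawler's actual code) that pulls out just the paragraph text from a page, using Python's standard-library `html.parser`; the class and function names are my own invention for the example:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text that appears inside <p> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <p> tags we are currently inside
        self.paragraphs = []  # finished paragraph strings
        self._buf = []        # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self._buf).strip()
                if text:
                    self.paragraphs.append(text)
                self._buf = []

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return a list of paragraph texts found in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs
```

For a site like Wikipedia, the same idea would first narrow the document down to the content `div` and only then collect the paragraphs inside it.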
I also password-protected my crawler status page with .htaccess and htpasswd, a quick way to restrict casual access from the browser.
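For anyone curious, an Apache basic-auth setup along those lines looks roughly like the fragment below (the realm name and the .htpasswd path are placeholders, not the ones from my server); the password file itself is created with `htpasswd -c /path/to/.htpasswd username`:

```apacheconf
AuthType Basic
AuthName "Crawler status"
AuthUserFile /path/to/.htpasswd
Require valid-user
```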
At the end of the week I started implementing the diacritics mechanism. This is a long shot because of MySQL's poor speed when working with millions of records, so stay tuned to find out whether we decide to use MongoDB instead.
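The post does not describe how the mechanism works internally, but the usual approach for Romanian text is a lookup from the diacritic-stripped form of a word to its correctly spelled candidates. A minimal sketch of that idea, assuming a vocabulary of correctly spelled words is available (the function names are hypothetical):

```python
import unicodedata


def strip_diacritics(word):
    """Map a word to its diacritic-free form, e.g. 'țară' -> 'tara'."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (Unicode category 'Mn') left by decomposition.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def build_lookup(vocabulary):
    """Index correctly spelled words by their stripped form."""
    table = {}
    for word in vocabulary:
        table.setdefault(strip_diacritics(word), []).append(word)
    return table


def restore(word, table):
    """Return candidate diacritic spellings for a word typed without them."""
    return table.get(strip_diacritics(word), [word])
```

The scaling worry from the post applies to the `build_lookup` step: with millions of vocabulary rows, that index has to live in a database rather than in memory, which is where the MySQL-versus-MongoDB question comes in.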