#6 DexOnline – Romanian Literature Crawler


This week I had my first code review, which went better than I expected.
I tried crawling a local site of mine and found that, while building the crawler, I had accidentally hardcoded the link-building mechanism for wiki.dexonline.ro. My directory-depth mechanism for composing relative links wasn’t working as I expected: for example, I got localhost/example.com/index.php/aboutus.php instead of localhost/example.com/aboutus.php.
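
To illustrate the relative-link problem, here is a minimal sketch in Python (not the crawler's actual code; the function names are hypothetical). It contrasts naive concatenation, which treats index.php as if it were a directory, with standard relative-reference resolution from the standard library:

```python
from urllib.parse import urljoin

def naive_join(page_url, href):
    # Buggy approach: treats the current page as a directory,
    # so the page's own file name stays in the resulting path.
    return page_url + "/" + href

def resolve_link(page_url, href):
    # urljoin drops the last path segment of the page URL (index.php)
    # before appending the relative link.
    return urljoin(page_url, href)

page = "http://localhost/example.com/index.php"
print(naive_join(page, "aboutus.php"))    # http://localhost/example.com/index.php/aboutus.php
print(resolve_link(page, "aboutus.php"))  # http://localhost/example.com/aboutus.php
```

Resolving every link against the URL of the page it was found on, rather than against a fixed site root, also avoids hardcoding anything for a particular host.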

I also fixed the following URL issues:
1) http://www.example.com/ and example.com are the same.
2) http://www.example.com/ and http://www.example.com/index.html (or index.php, .aspx, .asp, .jsp, .pl, .py, etc.) are the same, or at least very likely the same; this depends on the server's directory index definition.
3) http://www.example.com/index.php and http://www.example.com/////index.php are the same; the extra slashes come from a server-side fault when building links dynamically.
4) http://www.example.com/, http://www.example.com and http://www.example.com/index.php/? are the same, since no GET parameters are defined.
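
The four rules above could be sketched roughly as follows. This is only an illustration in Python, not the crawler's real implementation: the function name `canonicalize`, the `INDEX_NAMES` pattern, the default `http` scheme and the choice to strip a leading `www.` (rather than add one) are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed set of directory index file names; the real set depends on
# how the server's directory index is configured.
INDEX_NAMES = re.compile(r"^index\.(html?|php|aspx?|jsp|pl|py)$", re.IGNORECASE)

def canonicalize(url):
    # Rule 1: a bare host such as "example.com" gets a default scheme,
    # and a leading "www." is dropped so both spellings compare equal.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Rule 3: collapse runs of slashes; rule 4: an empty path means "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    # Rules 2 and 4: drop a trailing index file (with or without a
    # trailing slash after it), keeping the directory it lives in.
    head, _, last = path.rstrip("/").rpartition("/")
    if INDEX_NAMES.match(last):
        path = head + "/"

    # An empty query string ("?" with no GET parameters) is dropped
    # automatically when the URL is reassembled.
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

for u in ["example.com",
          "http://www.example.com/index.html",
          "http://www.example.com/////index.php",
          "http://www.example.com/index.php/?"]:
    print(canonicalize(u))  # each prints http://example.com/
```

Comparing the canonical forms instead of the raw strings lets the crawler recognise all of these variants as a single page. Rule 2 remains a heuristic, since a server may be configured with a different directory index.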

At the end of the week I wrote a first design sketch for the Indexer & diacritics mechanism. I have already received some feedback, and I expect the specification to be finished by the end of next week.

