#4 DexOnline – Romanian Literature Crawler


This week I built a logging mechanism, mainly for exceptions. I had a problem initialising a static variable with the return value of a non-static function, and the error PHP gave me was no help at all (it wasn't expecting '(' after the function name). Finally, after looking at the variable declaration, it hit me that it shouldn't be declared static :P.
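A minimal sketch of the pattern that tripped me up (the file name and function are just illustrative, not from the actual crawler): PHP only allows static variables to be initialised with constant expressions, so a function call in the initialiser is a parse error. You can either drop `static` or keep it and assign lazily at run time:

```php
<?php
// Assumption: a simple file-based exception logger (names are illustrative).
// This line is a parse error, because a static initialiser must be a
// constant expression, not a function call:
//     static $log = fopen('crawler.log', 'a');
function logException(Exception $e): void {
    static $log = null;                        // constant initialiser: OK
    if ($log === null) {
        $log = fopen('crawler.log', 'a');      // assign lazily at run time
    }
    fwrite($log, date('c') . ' ' . $e->getMessage() . PHP_EOL);
    fflush($log);                              // make the entry visible now
}

logException(new Exception('connection refused'));
```

The lazy-assignment version keeps the handle open across calls, which is handy when the crawler logs many exceptions in one run.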
I also made my crawler nicer :) because one of the sites figured out that I'm not a browser, so I had to find a way to fool it (I changed the user agent).
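The user-agent change boils down to one cURL option. A minimal sketch (the UA string and function name are just examples, not what the crawler actually sends):

```php
<?php
// Override cURL's default User-Agent so the server sees a "browser".
// The UA string below is illustrative; any realistic browser string works.
function fetchPage(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
    $html = curl_exec($ch);                           // false on failure
    curl_close($ch);
    return $html;
}
```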
Another problem I encountered was that when I wanted to print a newline to the terminal, the output continued on the same line and '\n' was no help at all. After googling for a while I found out that PHP has a predefined constant named PHP_EOL which did the job.
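One likely explanation: in PHP, escape sequences are only expanded inside double quotes, so single-quoted '\n' is the two literal characters backslash and n. PHP_EOL sidesteps this entirely and also expands to the platform's native line ending:

```php
<?php
echo 'one' . '\n';     // single quotes: prints the literal characters \n
echo PHP_EOL;
echo "two\n";          // double quotes: a real newline
echo 'three' . PHP_EOL; // PHP_EOL: the platform's native line ending
```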
I also found out how to extract the HTTP status code (e.g. 200, 301, 404, 500). Until now I was using a function written by someone on Stack Overflow which was very limited in detail (it only returned true or false). After digging into curl_getinfo($curl_handler) I found out it returns an associative array whose ['http_code'] entry contains the status code; this covers the 4xx and 5xx series too, not just 2xx and 3xx. For transport-level failures (DNS errors, timeouts), where no HTTP code is available at all, I check curl_errno($curl_handler) instead.
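Putting the two together, a sketch of how the status extraction might look (function name and the choice of a HEAD-style request are my assumptions, not necessarily what the crawler does):

```php
<?php
// Return the HTTP status code for $url, or 0 on a transport-level
// failure (DNS error, timeout, connection refused, ...).
function httpStatus(string $url): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD-style: no body
    curl_exec($ch);
    $err  = curl_errno($ch);                         // 0 means cURL succeeded
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // e.g. 200, 301, 404, 500
    curl_close($ch);
    return $err === 0 ? $code : 0;
}
```

Note that curl_errno() reports cURL's own error codes (transport problems), never HTTP codes; a 404 or 500 is a perfectly successful transfer from cURL's point of view and shows up in 'http_code'.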

I hope this week's work solves the part where the crawler doesn't know what HTTP status code a page returns (which made a fool out of me at the RSoC presentation), and I hope I'll have better control over my crawler with all the logging going on.
I also hope to make my first commit to SVN soon.

Good Luck!
