Last week I finally finished my diacritics-learning application. I went through a lot of bugs and code changes once I discovered that, in UTF-8 (we use the utf8_general_ci collation), characters from [A-Za-z] take 1 byte while those from [ăâîșț] take 2 bytes. My first version of the application used single-byte string functions (I was testing each character to see whether it was 1 byte or 2 bytes long); then Cătălin showed me that there are multibyte string functions which simplify the code considerably, so I switched to them.
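The byte-versus-character distinction can be seen directly in PHP (a minimal sketch; it assumes the mbstring extension is available and the file is saved as UTF-8):

```php
<?php
// UTF-8 stores ASCII letters on 1 byte and Romanian diacritics on 2 bytes,
// so byte-oriented and character-oriented string functions disagree.
$word = "știință";                       // ș, ț, ă are 2-byte characters

echo strlen($word), "\n";                // 10 – counts bytes
echo mb_strlen($word, 'UTF-8'), "\n";    // 7  – counts characters
echo mb_substr($word, 0, 2, 'UTF-8');    // "șt" – extracts characters, not bytes
```

Using `substr()` here instead of `mb_substr()` would risk cutting a 2-byte character in half, which is exactly the class of bug the per-character byte tests were working around.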
My next steps are to build the diacritics-inserter application and to do a lot of testing. I will also have to see whether my diacritics-learning application scales with MySQL, since we will have millions of records in our database.
One idea is to use MongoDB; another is to store the records in multiple tables, using a reference table as the base pointer (some sort of hash table with huge buckets).
See you all at the grand finale.
This week I did some testing and decided that we will get better scraped text if we write custom HTML parsing for each domain. I saw that romlit.ro places its valuable text between paragraph tags, while Wikipedia uses <div id="mainContent"></div> along with paragraphs.
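The per-domain idea can be sketched as a lookup from domain to extraction rule. This uses PHP's built-in DOMDocument rather than the simple_html_dom library the project uses, just to keep the example self-contained; the function name and rule table are mine, with the selectors taken from the observations above:

```php
<?php
// Hypothetical sketch: each domain gets its own rule for where valuable text lives.
function extractText($html, $domain) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // @ silences warnings on broken HTML
    $xpath = new DOMXPath($doc);

    $queries = array(
        'www.romlit.ro'    => '//p',                             // text is in paragraphs
        'ro.wikipedia.org' => '//div[@id="mainContent"]//p',     // text inside mainContent
    );
    $query = isset($queries[$domain]) ? $queries[$domain] : '//p';

    $text = '';
    foreach ($xpath->query($query) as $node) {
        $text .= $node->textContent . "\n";
    }
    return $text;
}
```

Adding a new domain then means adding one entry to the rule table instead of writing a new parser.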
I also password-protected my crawler status page (in the browser) the easy way, with .htaccess and htpasswd, to restrict regular access.
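For reference, the usual shape of that setup looks like this (paths and the realm name are illustrative, not the project's actual values):

```apacheconf
# .htaccess in the crawler status directory
AuthType Basic
AuthName "Crawler status"
# file created beforehand with: htpasswd -c /path/to/.htpasswd admin
AuthUserFile /path/to/.htpasswd
Require valid-user
```

The .htpasswd file should live outside the web root so it can't be fetched directly.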
At the end of the week I started implementing the diacritics mechanism. This is a long shot because of MySQL's poor speed when working with millions of records, so stay tuned to find out whether we decide to use MongoDB instead.
This week I fixed a bug that inserted the same link into the database twice, rearranged the code for better readability, and built a TODO list for a clearer view of what should be done next.
I expect that next week Radu, Cătălin and I will agree on the Diacritics Tool design document so I can do some serious coding.
Sorry I forgot to provide you with a link to my work:
Last week I forgot to post, so I'll state my progress here: I learned how to use the Smarty library, with which I built a functional crawlerLog page that lets you follow the crawler's progress from your computer or smartphone.
This week I used AJAX on the crawlerLog web page to refresh its information every 5 seconds. I also fixed the www.romlit.ro broken-HTML problem at a general level (I repair the broken HTML using simple_html_dom, removing styles and scripts and adding body tags where there are none), so I don't need a separate HTML parser for romlit. I also improved the crawler by adding features like crawling only a certain area of a site, and I abstracted the database query layer to make a technology change faster later (e.g. MySQL is not very scalable given the amount of data we keep gathering, so we may turn to PL/SQL).
This week I had my first code review which went better than I expected.
I tried crawling a local site of mine and discovered that while building the crawler I had accidentally hardcoded the link-building mechanism for wiki.dexonline.ro (my directory-depth mechanism for composing relative links wasn't working as I expected: e.g. I got localhost/example.com/index.php/aboutus.php instead of localhost/example.com/aboutus.php).
I also fixed the following URL-equivalence issues:
1) http://www.example.com/ and example.com are the same
2) http://www.example.com/ and http://www.example.com/index.html (or .php, .aspx, .asp, .jsp, .pl, .py, etc.) are the same (or the same with high probability; this depends on the directory-index definition)
3) http://www.example.com/index.php and http://www.example.com/////index.php are also the same; the extra slashes are a server-side fault when building links dynamically.
4) http://www.example.com/, http://www.example.com and http://www.example.com/index.php/? are the same, since no GET parameters are defined.
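The four rules above can be folded into one canonicalization function. This is a minimal sketch (the function name is mine, and it only covers the cases listed, not a full URL normalizer):

```php
<?php
// Hypothetical sketch of the URL-equivalence rules 1)-4) above.
function canonicalizeUrl($url) {
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;                        // rule 1: bare domain
    }
    $parts = parse_url($url);
    $host  = preg_replace('/^www\./', '', strtolower($parts['host']));
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $path  = preg_replace('#/+#', '/', $path);          // rule 3: "////" -> "/"
    $path  = rtrim($path, '/');
    // rule 2: drop directory-index files (probably served for "/" anyway)
    $path  = preg_replace('#/index\.(html?|php|aspx?|jsp|pl|py)$#', '', $path);
    // rule 4: an empty query string ("?") carries no GET parameters
    $query = empty($parts['query']) ? '' : '?' . $parts['query'];
    return 'http://' . $host . ($path === '' ? '/' : $path) . $query;
}
```

All four example URLs then collapse to `http://example.com/`, so a single hash lookup on the canonical form is enough to detect duplicates.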
At the end of the week I wrote a first design sketch for the indexer & diacritics mechanism. I have already received some feedback, and I expect the specification to be finished by the end of next week.
This week I built a stable crawling mechanism that can crawl just one section of a site, with a better link-following mechanism that only follows links under the start URL. The crawler also has a mechanism that transforms relative links into absolute ones.
A big problem I still have is switching the database layer from the Idiorm library to Paris. I could not finish this because Paris requires a class for each table it maps.
e.g. For the table 'CrawledPage', I need a class called 'CrawledPage' which extends 'Model'.
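As I understand Paris, the required class can be nearly empty; it mostly serves to name the table. A sketch (field names are illustrative, and this assumes paris.php is on the include path):

```php
<?php
// Hypothetical Paris model for the 'CrawledPage' table.
require_once 'paris.php';

class CrawledPage extends Model {
    // Paris normally derives the table name from the class name; if the
    // table is literally called 'CrawledPage', it can be pinned explicitly:
    public static $_table = 'CrawledPage';
}

// Queries then go through the factory instead of ORM::for_table():
// $page = Model::factory('CrawledPage')->where('httpStatus', 200)->find_one();
```

So the migration cost is one small class per table rather than a rewrite of the queries themselves.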
Another problem when crawling was that my application parsed everything, even files that are not HTML (like PNG images). To fix this I added a mechanism that tells me what type of page I'm downloading: text, HTML, PNG, etc.
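One way to make that decision is from the Content-Type that cURL reports after a transfer. A sketch under that assumption (the helper name and the exact list of accepted types are mine):

```php
<?php
// Hypothetical filter: only hand text-like resources to the HTML parser.
function isParseable($contentType) {
    // text/html, text/plain and XHTML are worth parsing; image/png,
    // application/pdf and friends are skipped.
    return (bool) preg_match('#^(text/(html|plain)|application/xhtml\+xml)#i',
                             (string) $contentType);
}

// With a live cURL handle, the type is available after curl_exec():
// $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "text/html; charset=utf-8"
// if (isParseable($type)) { /* parse the page */ }
```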
I left my crawler running for a while and when I came back I found that it was the third-biggest consumer of system resources. After some googling I learned that variables whose references are lost are only marked for cleanup, and are freed when the application finishes or when system memory runs low. After some more searching I found that newer versions of PHP let you invoke the garbage collector explicitly, an option not present in older PHP versions.
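For reference, the explicit collector arrived in PHP 5.3. A long-running script like a crawler can call it between pages; a minimal sketch:

```php
<?php
// PHP 5.3+ exposes the cycle collector explicitly.
gc_enable();                    // make sure cycle collection is turned on

// ... crawl a page, then drop big references so they become collectable:
// unset($rawPage, $parsedText);

$freed = gc_collect_cycles();   // force a collection; returns cycles freed
echo "collected $freed cycles\n";
```

Calling `unset()` on large variables as soon as they are no longer needed matters as much as the explicit `gc_collect_cycles()` call, since only unreachable cycles can be freed.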
I have to give credit to the logging mechanism built last week, because it has helped me a lot so far.
This week I built a logging mechanism, mainly for exceptions. I had a problem initializing static variables with the return values of non-static functions, and the error PHP gave me was no help at all (it wasn't expecting '(' after the function name). Finally, after looking at the variable declaration, it hit me that it shouldn't be declared static :P.
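The underlying rule is that PHP property initializers must be constant expressions, so a function call there is a parse error; the value has to be assigned in a method instead. A minimal sketch (class and property names are illustrative):

```php
<?php
class Logger {
    // Not allowed – initializers must be constant expressions, and PHP's
    // parse error points unhelpfully at the '(' after the function name:
    // private static $start = microtime(true);

    public $start;                       // fix: plain property, set at runtime

    public function __construct() {
        $this->start = microtime(true);  // function calls are fine here
    }
}
```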
I also made my crawler nicer :) because one of the sites figured out that I'm not a browser, so I had to find a way to fool it (I changed the user agent).
Another problem I encountered was that when I wanted to print a newline to the terminal, output kept going on the same line and '\n' was no help at all. After googling for a while I found that PHP has a predefined constant named PHP_EOL which did the job.
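A likely cause (an assumption on my part, but a very common one): inside single quotes, '\n' is the two literal characters backslash and n; escapes only expand in double quotes. PHP_EOL sidesteps the question and is portable besides:

```php
<?php
echo 'step 1 done' . PHP_EOL;   // PHP_EOL is "\n" on Unix, "\r\n" on Windows
echo 'step 2 done' . PHP_EOL;

// The single-vs-double quote trap:
// echo 'no newline here \n';   // prints a literal backslash-n
// echo "real newline here\n";  // escapes expand in double quotes
```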
I also found out how to extract the HTTP status code (e.g. 200, 301, 404, 500). Until now I was using a function written by someone on Stack Overflow which was very limited in detail (it returned only true or false). After digging into curl_getinfo($curl_handler) I found that it returns an associative array whose ['http_code'] index contains the HTTP code. This works only for the 200 and 300 series; for HTTP codes 400 and above, I use curl_errno($curl_handler).
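Put together, the two calls might be wrapped like this (the function name is mine; the commented-out call is just a usage illustration):

```php
<?php
// Hypothetical sketch: read the status and transport error after a cURL transfer.
function fetchStatus($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $info  = curl_getinfo($ch);     // associative array with transfer details
    $code  = $info['http_code'];    // 200, 301, 404, ... (0 if nothing was received)
    $errno = curl_errno($ch);       // non-zero for transport-level failures
    curl_close($ch);
    return array($code, $errno);
}

// list($code, $errno) = fetchStatus('http://www.romlit.ro/');
```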
I hope this week's work solves the part where the crawler doesn't know what HTTP code a page returns (which made a fool out of me at the RSoC presentation), and I hope all this logging will give me better control over my crawler.
I also hope in a short time to do my first commit on SVN.
This week I learned how to use Idiorm, a PHP library for MySQL databases, and I implemented the crawler's DB side. The Idiorm INSERT usage turned out to be rather obscure: I couldn't find an example on the web, so I started reading the library's source. Eventually I found that you have to call $obj = ORM::for_table('table_name')->create(); to make an object with the table fields as PHP variables, then set the corresponding values ($obj->field_1 = $val_1; ... $obj->field_n = $val_n;) and finally call $obj->save();. I wrote this up because the other DexOnline intern will need it.
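Spelled out as a full snippet, the INSERT idiom described above looks like this (the table, columns and in-memory SQLite DSN are illustrative, and it assumes idiorm.php is on the include path):

```php
<?php
require_once 'idiorm.php';
ORM::configure('sqlite::memory:');      // any PDO DSN works; illustrative here
ORM::raw_execute('CREATE TABLE CrawledPage (url TEXT, httpStatus INTEGER)');

$page = ORM::for_table('CrawledPage')->create();   // fresh, empty record object
$page->url        = 'http://www.romlit.ro/';       // table columns become properties
$page->httpStatus = 200;
$page->save();                                     // issues the INSERT
```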
I also wrote a mechanism to manipulate URLs (transforming relative URLs into canonical ones) and a mechanism to determine whether a URL has already been used (hash + special cases).
I got stuck saving the rawPage and parsedText to the filesystem because of directory permissions. I didn't want to change the directory owner, so I moved the files to /tmp/DexContent/, but it still won't save them. I'm using file_put_contents($filename, $string), and $filename contains only alphanumeric characters and the '_' character.
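One way to debug this kind of silent failure is to make the save path report *why* it fails instead of letting file_put_contents() return false quietly. A sketch (function name is mine):

```php
<?php
// Hypothetical wrapper that surfaces the reason a save fails.
function savePage($dir, $filename, $content) {
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        throw new Exception("cannot create directory $dir");
    }
    if (!is_writable($dir)) {
        throw new Exception("$dir is not writable by " . get_current_user());
    }
    $bytes = file_put_contents($dir . '/' . $filename, $content);
    if ($bytes === false) {
        throw new Exception("write failed for $filename");
    }
    return $bytes;     // number of bytes written
}
```

A missing parent directory is a frequent culprit: file_put_contents() will not create /tmp/DexContent/ on its own, so the mkdir() with the recursive flag matters.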
This week we managed to agree on 60% of the design document (which includes the crawler part), so I can start coding. By the end of the week I had implemented a crawling mechanism which takes a URL, fetches the raw page, parses the content and returns the plain text.
The crawler uses a cURL mechanism which fools the server into thinking my application is a browser (a fake Firefox running on Windows NT, a.k.a. Windows XP); it even has a cookie jar to store the site's cookies. The crawler doesn't have a login mechanism yet, but if we ever need authentication to read a page, I will send the login parameters through POST and enable CURLOPT_POST. For HTTPS pages I can enable CURLOPT_SSL_VERIFYPEER.
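A sketch of that cURL setup (the URL, user-agent string and cookie-jar path are illustrative, not the project's actual values):

```php
<?php
$ch = curl_init('http://www.example.com/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    // pretend to be Firefox on Windows NT 5.1 (Windows XP):
    CURLOPT_USERAGENT  => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0) Gecko/20100101 Firefox/10.0',
    CURLOPT_COOKIEJAR  => '/tmp/dex_cookies.txt',   // write cookies here ...
    CURLOPT_COOKIEFILE => '/tmp/dex_cookies.txt',   // ... and send them back
));
// For authenticated pages: CURLOPT_POST + CURLOPT_POSTFIELDS with the form data.
// For HTTPS: CURLOPT_SSL_VERIFYPEER => true (plus a CA bundle if needed).
$rawPage = curl_exec($ch);
curl_close($ch);
```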
Since we might need different parsing algorithms for different sites, I created an AbstractCrawler with two abstract methods (startCrawling and parseText) that must be implemented by each derived crawler class. In startCrawling you can choose to log in or open an SSL connection (both still to be implemented), and in parseText you choose the best way to get plain text for that site.
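The contract looks roughly like this (the method bodies and the example subclass are mine; strip_tags() stands in for the real per-site extraction):

```php
<?php
// Sketch of the AbstractCrawler contract described above.
abstract class AbstractCrawler {
    // Site-specific entry point: log in / open SSL here if the site needs it.
    abstract public function startCrawling($startUrl);
    // Site-specific extraction of plain text from raw HTML.
    abstract public function parseText($html);
}

class RomlitCrawler extends AbstractCrawler {
    public function startCrawling($startUrl) {
        // romlit.ro needs no login; just begin from the start URL.
        return $startUrl;
    }
    public function parseText($html) {
        // romlit.ro keeps its valuable text in <p> tags; a real
        // implementation would select those, not strip everything.
        return trim(strip_tags($html));
    }
}
```

Adding support for a new site then means one new subclass, while the crawling loop itself stays generic.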
I had to decide between several libraries: PHP's DOMDocument, Simple HTML DOM, Tidy, Ganon and phpQuery. Since we may have to work with broken HTML (unclosed tags), I also found HTMLPurifier, a PHP library which fixes broken HTML.
As I tested them, all of these libraries managed to parse broken HTML (so no need for HTMLPurifier), but Simple HTML DOM caught my attention through its ease of use and its reviews.
Well, that’s all folks!
My name is Alin Ungureanu and this summer I’m coding for DexOnline. The project “Romanian Literature Crawler” was assigned to me and it has the following objectives:
- Find words that DEX online doesn’t know, but that occur frequently on the Internet. Write a script to crop usage examples for these words and pass them on to a team of linguists so that they can write definitions for them.
- Show usage examples along with our definitions. Offer an interface where admins can select the most relevant examples.
- Compute statistics on diacritics. For example, compute that, in the context "abcdSefgh", S has a 90% probability and Ș has a 10% probability. This can be used to insert diacritics into a text.
A week has passed and we haven't yet agreed on the design document: new ideas keep flowing, and some of them are big enough to count as standalone internship projects. We must finalize this document by late next week, because the clock is ticking and I am eager to write some code :).