#10 DexOnline – Romanian Literature Crawler

Hi,

Last week I finally finished my diacritics learning application. I went through a lot of bugs and code changes, since I discovered that MySQL's utf8 charset (with the utf8_general_ci collation) uses 1 byte for characters from [A-Za-z] and 2 bytes for the ones from [ăâîșț]. My first version of the application used 1-byte-per-char string functions, testing at each char whether it was a 1-byte or a 2-byte one. Then Cătălin showed me that there are multibyte string functions which greatly simplify the code, so I switched to them.
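
To make the difference concrete, here is a small illustration (it assumes PHP's mbstring extension is enabled):

    $word = "țară";                            // 4 characters, 6 bytes in UTF-8
    echo strlen($word);                        // 6  (ț and ă take 2 bytes each)
    echo mb_strlen($word, 'UTF-8');            // 4
    echo mb_substr($word, 0, 1, 'UTF-8');      // "ț"; substr($word, 0, 1) would cut it in half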

My next steps are to build the diacritics inserter application and to do a lot of testing. I will also have to see whether my diacritics learning application scales with MySQL, since we will have millions of records in our database. One idea is to use MongoDB; another is to store the records in multiple tables, using a reference table as the base pointer (some sort of hash table with huge buckets, sketched below).
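
Nothing is decided yet, but the bucketed layout could look roughly like this (all names here are hypothetical, and it assumes an existing PDO connection in $pdo):

    define('BUCKET_COUNT', 64);                       // hypothetical bucket count

    function bucketTable($key) {
        $bucket = abs(crc32($key)) % BUCKET_COUNT;    // stable hash -> bucket id
        return 'diacritics_' . $bucket;               // e.g. diacritics_17
    }

    $table = bucketTable($word);
    $stmt  = $pdo->prepare("SELECT * FROM `$table` WHERE word = ?");
    $stmt->execute(array($word));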

See you all at the grand finale.

#9 DexOnline – Romanian Literature Crawler

Hi,

This week I did some testing and I decided that we will get better scraped text if we just write a custom HTML parser for each domain. I saw that romlit.ro places its valuable text between paragraph tags, while Wikipedia uses <div id="mainContent"> plus paragraphs.
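
The per-domain extraction could be as simple as this sketch (using simple_html_dom, which the crawler already uses; the selectors are the ones observed above):

    require_once 'simple_html_dom.php';

    function extractText($domain, $htmlString) {
        $html = str_get_html($htmlString);
        $text = '';
        if ($domain == 'www.romlit.ro') {
            foreach ($html->find('p') as $p)          // valuable text sits between <p> tags
                $text .= $p->plaintext . "\n";
        } else if (strpos($domain, 'wikipedia') !== false) {
            foreach ($html->find('div[id=mainContent] p') as $p)
                $text .= $p->plaintext . "\n";
        }
        return $text;
    }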

I also password-protected my crawler status page (in the browser) with .htaccess and htpasswd, an easy way to restrict casual access.
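
For reference, the whole setup is only a few lines in .htaccess (the file path is a placeholder), plus one htpasswd -c command to create the user:

    AuthType Basic
    AuthName "Crawler status"
    AuthUserFile /path/to/.htpasswd
    Require valid-user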

At the end of the week I started implementing the diacritics mechanism. This is a long shot because of MySQL's poor speed when working with millions of records, so stay tuned to find out whether we decide to use MongoDB instead.

Fortnightly Post #4.7: Long time, no post

Hi there! It has been a while since I last posted. Time has passed swiftly and there were notable events galore. I have enjoyed the spare time that I planned from the start, and now it's time to get back to work.

Last time I talked about the “blueprints” for the tag page. Now it is almost complete, but unfortunately we might give it up. Why, you may ask? Because we haven't yet decided what format the images will be in. It depends on those who “draw” (better said, create) them. The images might be SVGs or sections of a 3D model, in which case the drawer will also be the one to tag them. Still, I am quite happy with my tag page, as I learnt a gamut of technologies along the way: JavaScript, JSON, AJAX, (better) PHP, and plugins like jCrop and Select2.

Because the updates for the tag page have stalled, I now have to focus on the presentation part. I have to create a gallery for the forthcoming images; I have one plugin in mind, but it first needs approval. Till then…

Happy birthday DEX Online!

#7 DexOnline – Romanian Literature Crawler

Hi,

Sorry I forgot to provide you with a link to my work:

Crawler

Last week I forgot to post, so I'll state my progress here: I learned how to use the Smarty library, with which I built a functional crawlerLog page that lets you watch the crawler's progress from your computer or smartphone.

This week I used AJAX on the crawlerLog page to refresh its information every 5 seconds, and I fixed the www.romlit.ro broken-HTML problem at a general level, so I don't have to use a different HTML parser for romlit: I repair the broken HTML with simple_html_dom, removing styles and scripts and adding body tags where there are none (a sketch follows below). I also improved the crawler with features like crawling only a certain area of a site, and I abstracted the database query layer for a faster technology change (e.g. MySQL does not scale well with the amount of data we keep gathering, so we may turn to PL/SQL).
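
The repair step looks roughly like this (a sketch; simple_html_dom is the parser named above):

    require_once 'simple_html_dom.php';

    function repairHtml($raw) {
        $html = str_get_html($raw);
        foreach ($html->find('script') as $node) $node->outertext = '';  // drop scripts
        foreach ($html->find('style') as $node)  $node->outertext = '';  // drop styles
        $fixed = $html->save();
        if (stripos($fixed, '<body') === false) {
            $fixed = '<body>' . $fixed . '</body>';   // add body tags where there are none
        }
        return $fixed;
    }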

#6 DexOnline – Romanian Literature Crawler

Hi,

This week I had my first code review, which went better than I expected.
I tried crawling a local site of mine and it turned out that while building the crawler I had accidentally hardcoded the link-building mechanism for wiki.dexonline.ro: my directory-depth mechanism for composing relative links wasn't working as I expected, e.g. I got localhost/example.com/index.php/aboutus.php instead of localhost/example.com/aboutus.php.

I also fixed the following URL equivalence cases (a normalization sketch follows the list):
1) http://www.example.com/ and example.com are the same.
2) http://www.example.com/ and http://www.example.com/index.html (or .php, .aspx, .asp, .jsp, .pl, .py, etc.) are the same, or the same with high probability; it depends on the directory index definition.
3) http://www.example.com/index.php and http://www.example.com/////index.php are the same; the extra slashes are a server fault when building links dynamically.
4) http://www.example.com/, http://www.example.com and http://www.example.com/index.php/? are the same, since no GET parameters are defined.
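
Put together, the normalization looks roughly like this (a sketch; the list of directory-index names is illustrative):

    function normalizeUrl($url) {
        // assumes $url already carries a scheme, e.g. "http://..."
        $parts = parse_url(trim($url));
        $host  = preg_replace('/^www\./', '', strtolower($parts['host']));          // case 1
        $path  = isset($parts['path']) ? $parts['path'] : '/';
        $path  = preg_replace('#/{2,}#', '/', $path);                               // case 3
        $path  = rtrim($path, '/');                                                 // unify trailing slash
        $path  = preg_replace('#/index\.(html?|php|aspx?|jsp|pl|py)$#i', '', $path);// case 2
        if ($path == '') {
            $path = '/';
        }
        // case 4: an empty query string ("?") simply disappears here
        $query = empty($parts['query']) ? '' : '?' . $parts['query'];
        return $host . $path . $query;
    }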

At the end of the week I wrote a first design sketch for the indexer & diacritics mechanism. I already got some feedback, and I expect the specification to be finished by the end of next week.

Fortnightly Post #2.6: Tag me well!

The code review went well; there were only some minor fixes that Cătălin Frâncu, my mentor, had to make. Now it's stable, there are no more errors or warnings, and everything seems right. I still have to add some comments to my code, as I reckon that a good implementation needs good documentation.

I’ve started writing the tag page and there are ideas galore. Some of them still need to be discussed, because in my mind they’re somewhat equivocal. Basically, this page will show one of the images that need tagging, along with some fields the volunteer has to complete (e.g. the lexeme in the tag, the coordinates of the tag's centre, the coordinates of the centre of the pointed area), and this information will be stored in the database.

Of course, the most difficult part is getting the coordinates from the image. Fortunately, all of that work is done by a plugin called jCrop. It lets users make a selection on an image and returns its coordinates (x, y) and size (width, height). With some JavaScript (which I have just started learning), I calculate the coordinates of the selection's centre, which populate the input fields. Based on these coordinates we plan to draw the tags and pointing arrows on the image using the HTML5 <canvas> tag. I really enjoy this part of the project as it seamlessly combines server-side and client-side scripting.
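
On the server side, saving one submitted tag could look roughly like this; the field, table and column names are invented, since the real schema is still being discussed (assumes an existing PDO connection in $pdo):

    $imageId = (int) $_POST['imageId'];
    $lexeme  = $_POST['lexeme'];
    $tagX    = (int) $_POST['tagCenterX'];    // centre of the tag label
    $tagY    = (int) $_POST['tagCenterY'];
    $areaX   = (int) $_POST['areaCenterX'];   // centre of the pointed area
    $areaY   = (int) $_POST['areaCenterY'];

    $stmt = $pdo->prepare(
        'INSERT INTO image_tag (image_id, lexeme, tag_x, tag_y, area_x, area_y)
         VALUES (?, ?, ?, ?, ?, ?)');
    $stmt->execute(array($imageId, $lexeme, $tagX, $tagY, $areaX, $areaY));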

That’s all for now and for a few days henceforth. I will indulge myself with a short vacation.

#5 DexOnline – Romanian Literature Crawler

Hi,

This week I built a stable crawling mechanism which can crawl just one section of a site, with a better link-following mechanism that only follows links under the start URL. The crawler also has a mechanism which transforms relative links into absolute ones (a sketch below).
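
Simplified, the relative-to-absolute transformation looks like this ("../" segments and query strings need extra care in the real thing):

    function makeAbsolute($pageUrl, $link) {
        if (preg_match('#^https?://#i', $link)) {
            return $link;                              // already absolute
        }
        $parts = parse_url($pageUrl);
        $base  = $parts['scheme'] . '://' . $parts['host'];
        if ($link[0] == '/') {
            return $base . $link;                      // root-relative
        }
        // directory of the current page, e.g. /a/b/page.php -> /a/b/
        $path = isset($parts['path']) ? $parts['path'] : '/';
        $dir  = preg_replace('#/[^/]*$#', '/', $path);
        return $base . $dir . $link;                   // document-relative
    }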

A big problem I still have is switching the database querying layer from Idiorm to the Paris library. I could not finish this yet because Paris requires a model class for each table:
e.g. for the table ‘CrawledPage’, I need a class called ‘CrawledPage’ which extends ‘Model’.
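
Following Paris's documented pattern, it would look like this (the table-name override is only needed because Paris defaults to a lowercased name):

    // One model class per table, as Paris expects.
    class CrawledPage extends Model {
        public static $_table = 'CrawledPage';
    }

    // Idiorm-style chaining, routed through the model:
    $page = Model::factory('CrawledPage')
        ->where('url', $url)
        ->find_one();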

Another problem when crawling was that my application parsed everything, even files which are not HTML (like PNG images). To fix this I added a mechanism which tells me what type of page I'm downloading: text, HTML, PNG, etc.
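
The check itself is small; a sketch using curl's reported Content-Type ($ch and $body come from the download step, parsePage() is a made-up name):

    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);   // e.g. "text/html; charset=utf-8"
    if ($contentType && strpos($contentType, 'text/html') === 0) {
        parsePage($body);
    }
    // otherwise: png, pdf, plain text, etc. -- skip the HTML parser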

I left my crawler running for a while and when I came back I found that it was the third-biggest consumer of system resources. After some googling I learned that variables whose references are lost are only marked for cleanup, and are actually freed when the application finishes or when system memory runs low. After some more searching I found that newer versions of PHP (5.3+) let you invoke the garbage collector explicitly, an option that was not present in older PHP versions.
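
A sketch of what the crawl loop might look like with explicit collection (downloadPage() and processPage() are made-up names):

    gc_enable();                      // make sure the collector is on (PHP 5.3+)
    foreach ($urls as $url) {
        $page = downloadPage($url);
        processPage($page);
        unset($page);                 // drop our reference...
        gc_collect_cycles();          // ...and reclaim circular garbage now
    }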

I have to give credit to the logging mechanism built last week; it has helped me a lot so far.

#4 DexOnline – Romanian Literature Crawler

Hi,

This week I built a logging mechanism, mainly for exceptions. I had a problem initialising a static variable with the return value of a non-static function, and the error PHP gave me was no help at all (it wasn't expecting ‘(’ after the function name). Finally, after looking at the variable declaration, it hit me that it shouldn't be declared static. :P
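
For the record, PHP 5 only allows constant expressions when initialising a static variable, which is where that cryptic parse error comes from. A sketch (Logger::create() is a made-up example):

    //     static $log = Logger::create();   // parse error near '('
    // One workaround is lazy initialisation inside the function:
    function getLogger() {
        static $log = null;
        if ($log === null) {
            $log = Logger::create();   // runs only on the first call
        }
        return $log;
    }
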
I also made my crawler nicer :) because one of the sites figured out that I'm not a browser, so I had to find a way to fool it (I changed the user agent).
Another problem I encountered: when I wanted to print a newline to the terminal, the output continued on the same line and ‘\n’ was no help at all (likely because escape sequences are not interpreted inside single-quoted PHP strings). After googling for a while I found out that PHP has a predefined constant named PHP_EOL which did the job.
I also found out how to extract the HTTP code (e.g. 200, 301, 404, 500). Until now I was using a function made by someone on Stack Overflow which was very limited in detail (it returned true or false). After looking deeply into curl_getinfo($curl_handler) I found out that it returns an associative array which contains the HTTP code at index [’http_code’]. In my setup this works for the 200 and 300 series; for HTTP codes 400 and above, I use curl_errno($curl_handler).
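
Put together (including the user-agent change mentioned above), the extraction looks roughly like this; the exact strings are illustrative:

    $ch = curl_init('http://www.example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible)');  // the "browser" disguise
    $body = curl_exec($ch);

    $info = curl_getinfo($ch);             // associative array
    $httpCode = $info['http_code'];        // e.g. 200, 301

    if ($body === false) {
        $error = curl_errno($ch);          // transport-level errors
    }
    curl_close($ch);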

I hope this week's work fixes the part where the crawler didn't know what HTTP code a page returned (which made a fool out of me at the RSoC presentation), and I hope all the logging will give me better control over my crawler.
I also hope to make my first commit on SVN soon.

Good Luck!

Fortnightly Post #2: Manage file with style!

Almost done with the elFinder file manager!

I have successfully bound the results of user actions to queries in the database. After completing an action (move, delete, copy, rename), elFinder (elf, henceforth) adds the name of the command to one array and the results to another. I created a function that builds queries from the data stored in those arrays. For example, if a new file is uploaded, the script creates a new entry in the table with the path of the added file and the user that completed the action; if a file is moved, it updates its path; and so on. This is possible because elf has a bind option that calls an external function whenever a specific user action completes.
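
A minimal sketch of that binding, following the logger example from elFinder's connector documentation (the table, columns and query() helper are invented for illustration):

    // Called by elFinder after the bound commands complete.
    function logAction($cmd, $result, $args, $elfinder) {
        if (!empty($result['added'])) {
            foreach ($result['added'] as $file) {
                // record the new file and the user who completed the action
                query('INSERT INTO user_file (name, user_id) VALUES (?, ?)',
                      array($file['name'], $_SESSION['user_id']));
            }
        }
        // $result['removed'] would be handled similarly for rm/move
    }

    $opts = array(
        'bind'  => array('upload rm rename paste duplicate' => 'logAction'),
        'roots' => array(/* volume configuration */),
    );
    $connector = new elFinderConnector(new elFinder($opts));
    $connector->run();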

I have sent the code to be reviewed by my mentors and I’m looking forward to my second commit.

My next task is to create a new page where text in images can be tagged. This is helpful in search engine indexing and will be implemented using jCrop.