Last week I finally finished my diacritics learning application. I went through
a lot of bugs and code changes after I discovered that UTF-8, the encoding behind
utf8_general_ci, uses 1 byte for characters from [A-Za-z] and 2 bytes for ones
from [ăâîșț]. My first version of the application used single-byte string
functions (I was testing each char to see whether it was 1 byte or 2 bytes
wide); then Cătălin showed me that there are multibyte string functions which
simplify the code considerably, so I used them.
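To make the byte-versus-character difference concrete, here is a minimal sketch in Python (the application itself is written in PHP, where the multibyte string functions play the same role). The sample word is my own choice, used only for illustration.

```python
# In UTF-8, ASCII letters take 1 byte and the Romanian diacritics take 2.
# Sample word "țară", written with escapes so the precomposed characters
# (U+021B and U+0103) are unambiguous. The word is an illustrative assumption.
word = "\u021bar\u0103"

encoded = word.encode("utf-8")

char_count = len(word)     # counts characters: 4
byte_count = len(encoded)  # counts bytes: 6

# Width in bytes of every character, to see which ones are 2 bytes wide.
widths = [(ch, len(ch.encode("utf-8"))) for ch in word]

print(char_count, byte_count)
print(widths)
```

Counting bytes instead of characters is exactly the kind of mismatch that a byte-oriented string function runs into.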
My next steps are to build the diacritics inserter application and to do a
lot of testing. I will also have to see if my diacritics learning application will
scale with MySQL, since we will have millions of records in our database.
One idea is to use MongoDB; another is to store the records in multiple tables, using a reference table as the base pointer (some sort of hashtable with huge buckets).
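The multiple-tables idea could look roughly like the Python sketch below. This is only an assumption about how the buckets might be chosen: the bucket count, the table name prefix and the function name are all hypothetical, and nothing here has been decided for the real project.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical number of bucket tables


def bucket_table(word: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the name of the (hypothetical) table that should hold `word`."""
    # crc32 gives the same value on every run, unlike Python's built-in
    # hash() for strings, so a word always lands in the same table.
    bucket = zlib.crc32(word.encode("utf-8")) % num_buckets
    return f"Diacritics_{bucket:02d}"


print(bucket_table("masa"))
```

Each lookup then touches only one of the smaller tables instead of one table with millions of rows.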
See you all at the grand finale.
This week I did some testing and decided that we will get better scraped text if we write custom HTML parsing for each domain. I saw that romlit.ro places valuable text between paragraph tags, while Wikipedia uses <div id="mainContent"></div> and also paragraphs.
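A per-domain rule set like this can be sketched in a few lines. The example below is Python with the standard library only, not the crawler's actual PHP code; the RULES table, the class name and the function name are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text inside <p> tags, optionally only within one container div."""

    def __init__(self, container_id=None):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0  # nesting depth inside the container div
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the container
            elif (self.container_id is not None
                  and dict(attrs).get("id") == self.container_id):
                self.div_depth = 1   # entering the container
        elif tag == "p":
            if self.container_id is None or self.div_depth > 0:
                self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and data.strip():
            self.chunks.append(data.strip())


# Hypothetical per-domain rules: None means "take every paragraph".
RULES = {"romlit.ro": None, "wikipedia.org": "mainContent"}


def extract_text(domain, html):
    parser = ParagraphExtractor(RULES.get(domain))
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Adding a new site then means adding one entry to the rules table rather than writing another parser.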
I also password-protected my crawler status page (in the browser) the easy way, with .htaccess and htpasswd, to restrict public access.
At the end of the week I started implementing the diacritics mechanism. This is a long shot because of MySQL’s poor speed when working with millions of records, so stay tuned to find out whether we decide to use MongoDB instead.
Hi, there! It has been a while since I last posted. Time has swiftly passed and there were notable events galore. I have enjoyed my spare time that I planned from the start and now it’s time to get back to work.
Because the updates for the tag page have stalled, I now have to focus on the presentation part. I have to create a gallery for the forthcoming images and I have one plugin in mind, but it first needs approval. Till then…
Happy birthday DEX Online!
This week I fixed a bug that inserted the same link into the database more than once. I also rearranged the code for better readability and built a TODO list to get a clearer view of what should be done next.
I expect that next week Radu, Cătălin and I will agree on the Diacritics Tool design document so I can do some serious coding.
Sorry I forgot to provide you with a link to my work:
Last week I forgot to post, so I’ll state my progress here: I learned how to use the Smarty library, with which I built a functional crawlerLog page that lets you follow the crawler’s progress on your computer or smartphone.
This week I used Ajax on the crawlerLog web page to refresh its information every 5 seconds, and I fixed the www.romlit.ro problem with broken HTML at a general level (I’m repairing the broken HTML by using simple_html_dom, removing styles and scripts and adding body tags where there are none), so I don’t have to use a different HTML parser for romlit. I also improved the crawler by adding features like crawling only a certain area of the site and abstracting the database query layer so we can switch technologies faster (e.g. MySQL is not very scalable with the amount of data we continue to gather, so we may turn to PL/SQL).
This week I had my first code review which went better than I expected.
I tried crawling a local site of mine and it turned out that while building the crawler I had accidentally hardcoded the link-building mechanism for wiki.dexonline.ro. My directory depth mechanism for composing relative links wasn’t working as I expected: e.g. I got localhost/example.com/index.php/aboutus.php instead of localhost/example.com/aboutus.php.
I also fixed the following URL issues:
1) http://www.example.com/ and example.com are the same.
2) http://www.example.com/ and http://www.example.com/index.html (or .php, .aspx, .asp, .jsp, .pl, .py, etc.) are the same, or at least the same with high probability; this depends on the directory index definition.
3) http://www.example.com/index.php and http://www.example.com/////index.php are the same; the extra slashes come from a server fault when building links dynamically.
4) http://www.example.com/ , http://www.example.com and http://www.example.com/index.php/? are the same, since there are no GET parameters defined.
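The four rules above can be folded into one normalization function. Here is a minimal sketch in Python (not the crawler's PHP code); the function name, the default http scheme and the list of index file names are assumptions, and dropping "www." is only safe for sites where both hosts really serve the same content.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed directory index names; the real list depends on the server's
# directory index definition (rule 2). A trailing slash is tolerated (rule 4).
INDEX_FILE = re.compile(r"/index\.(html?|php|aspx?|jsp|pl|py)/?$")


def normalize_url(url: str) -> str:
    """Map equivalent spellings of a URL to one canonical form."""
    if "://" not in url:
        url = "http://" + url  # rule 1: a missing scheme is assumed to be http
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]        # rule 1: treat www.example.com as example.com
    path = re.sub(r"/{2,}", "/", parts.path)  # rule 3: collapse repeated slashes
    path = INDEX_FILE.sub("/", path)          # rule 2: drop the index file
    if path == "":
        path = "/"                            # rule 4: empty path means root
    # An empty query string ("?" with no parameters) disappears on rebuild.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


print(normalize_url("http://www.example.com/////index.php"))
```

Storing the normalized form in the database is one way to avoid crawling the same page under several spellings.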
At the end of the week I wrote a first design sketch for the Indexer & diacritics mechanism. I already got some feedback and I expect that this specification will be finished by the end of next week.
The code review went well; there were some minor fixes that Cătălin Frâncu, my mentor, had to make. But now it’s stable, there are no more errors or warnings and everything seems right. I still have to add some comments to my code, as I reckon that a good implementation needs good documentation.
That’s all for now and for a few days henceforth. I will indulge myself with a short vacation.
This week I built a stable crawling mechanism that can crawl only one section of a site, thanks to a better link-following mechanism which follows only links under the start URL. The crawler also has a mechanism which transforms relative links into absolute links.
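Both mechanisms are small once a URL library does the joining. The sketch below uses Python's urllib.parse rather than the crawler's own PHP code, and the two function names are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Turn a link found on `page_url` into an absolute URL without fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url


def is_under_start(start_url: str, url: str) -> bool:
    """Follow a link only if it lies under the start URL."""
    # A plain prefix check; the start URL should end with "/" so that
    # http://host/dir does not also match http://host/directory2.
    return url.startswith(start_url)


start = "http://localhost/example.com/"
page = "http://localhost/example.com/index.php"

print(resolve_link(page, "aboutus.php"))
print(is_under_start(start, resolve_link(page, "../other/")))
```

The relative link is resolved against the directory of the current page, not against the page itself.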
A big problem which I still have is changing the way I query the database from Idiorm to the Paris library. I could not fix this because Paris asks me for classes to build the tables from.
E.g. if I have the table ‘CrawledPage’, I would need a class called ‘CrawledPage’ which extends ‘Model’.
Another problem when crawling was that my application parsed anything, even files which are not in HTML format (like PNG images). To fix this I added a mechanism which tells me what type of content I’m downloading: text, html, png, etc.
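One common way to make that decision is to look at the Content-Type response header before parsing. The Python sketch below shows only the header check; the function name and the set of accepted types are assumptions, not the crawler's actual mechanism.

```python
def is_html(content_type_header: str) -> bool:
    """Decide from a Content-Type header value whether the body is HTML."""
    # The header may carry parameters, e.g. "text/html; charset=UTF-8",
    # so only the part before the first ";" is compared.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


print(is_html("text/html; charset=UTF-8"))
print(is_html("image/png"))
```

Anything that fails the check can be logged and skipped instead of being handed to the HTML parser.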
I left my crawler running for a while and when I came back I found that it was the 3rd biggest consumer of system resources. After some googling I found out that lost variable references are marked for cleanup and freed when the application finishes or when there’s insufficient system memory. After some more searching I found out that newer versions of PHP let you call the garbage collector explicitly, an option which was not present in older PHP versions.
I have to give credit to the logging mechanism built last week because it has helped me a lot so far.
This week I built a logging mechanism, mainly for exceptions. I had a problem with initialising static variables with the return value of a non-static function, and the error PHP gave me was no help at all (it wasn’t expecting ‘(‘ after the function name). Finally, after looking at the variable declaration, it hit me that it shouldn’t be declared as static :P.
I also made my crawler nicer :) because one of the sites had figured out that I’m not a browser, so I had to find a way to fool it (I changed the user_agent).
Another problem that I encountered was that when I wanted to print a new line to the terminal, it continued writing on the same line and ‘\n’ was no help at all. After googling for a while I found out that PHP has a predefined constant named PHP_EOL which did the job.
I also found out how to extract the HTTP code (e.g. 200, 301, 404, 500). Until now I was using a function made by someone on Stack Overflow which was very limited in details (it returned true or false). After looking deeply into curl_getinfo($curl_handler) I found out it returns an associative array which at index ['http_code'] contains the HTTP code. This works only for the 200 and 300 series. For HTTP codes 400 and above, I use curl_errno($curl_handler).
I hope this week’s work solves the part where the crawler doesn’t know what HTTP code the page returns (which made a fool out of me at the RSoC presentation), and I hope I’ll have better control over my crawler with all the logging going on.
I also hope to do my first commit on SVN shortly.
Almost done with the elFinder file manager!
I have successfully bound the results of user actions to queries in the database. After completing an action (move, delete, copy, rename), elFinder (elf, henceforth) adds the name of the command to an array and the results to another array. I created a function that makes queries based on the data stored in those arrays. For example, if a new file is uploaded, the script creates a new entry in the table with the path of the added file and the user that completed the action; if a file is moved, it changes its path, and so on. This is possible because the elf has a bind option that calls an external function whenever a specific user action is completed.
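The idea of turning the two arrays into queries can be sketched as follows. This is a Python illustration with an in-memory SQLite database, not the actual PHP callback: the `files` table, the command names, the result keys and the user name are all hypothetical stand-ins, not elFinder's real data format.

```python
import sqlite3


def apply_actions(conn, commands, results, user):
    """Translate (command, result) pairs into queries on a `files` table."""
    for command, result in zip(commands, results):
        if command == "upload":
            # New file: record its path and the user who added it.
            conn.execute(
                "INSERT INTO files (path, user) VALUES (?, ?)",
                (result["path"], user),
            )
        elif command in ("move", "rename"):
            # Moved or renamed file: only its path changes.
            conn.execute(
                "UPDATE files SET path = ? WHERE path = ?",
                (result["new_path"], result["old_path"]),
            )
        elif command == "delete":
            conn.execute("DELETE FROM files WHERE path = ?", (result["path"],))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, user TEXT)")

apply_actions(
    conn,
    ["upload", "move"],
    [{"path": "img/a.png"}, {"old_path": "img/a.png", "new_path": "img/b.png"}],
    "ana",
)
rows = conn.execute("SELECT path, user FROM files").fetchall()
print(rows)
```

Keeping the mapping in one function means a new file manager command only needs one more branch.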
I have sent the code to be reviewed by my mentors and I’m looking forward to my second commit.
My next task is to create a new page where text in images can be tagged. This is helpful in search engine indexing and will be implemented using jCrop.