#10 DexOnline – Romanian Literature Crawler

Hi,

Last week I finally finished my diacritics learning application. I went through
a lot of bugs and code changes after discovering that UTF-8, the encoding behind
utf8_general_ci, uses 1 byte for characters from [A-Za-z] and 2 bytes for ones
from [ăâîșț]. My first version of the application used the single-byte string
functions, so I was testing at each character whether it was a 1-byte or a
2-byte one. Then Cătălin showed me that there are multibyte string functions
that simplify the code considerably, so I switched to them.
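
To make the byte-versus-character issue concrete, here is a minimal sketch in Python (not the code the application uses). The function names are made up for illustration, and the manual decoder assumes the input contains only 1-byte and 2-byte UTF-8 sequences, which holds for [A-Za-z] and [ăâîșț] but not for UTF-8 in general.

```python
def byte_widths(text):
    """Return (character, UTF-8 byte count) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def chars_from_bytes(raw):
    """Manual approach: walk the raw bytes and decide, byte by byte,
    whether a character occupies 1 or 2 bytes.

    Assumption: only 1-byte (ASCII) and 2-byte sequences occur.
    """
    chars = []
    i = 0
    while i < len(raw):
        # Lead bytes below 0x80 are single-byte ASCII characters.
        width = 1 if raw[i] < 0x80 else 2
        chars.append(raw[i:i + width].decode("utf-8"))
        i += width
    return chars


# "știință" written with escapes so the precomposed code points are explicit.
word = "\u0219tiin\u021b\u0103"
raw = word.encode("utf-8")

print(len(word), len(raw))      # character count vs. byte count
print(byte_widths(word))
print(chars_from_bytes(raw) == list(word))
```

A character-aware string API (the role the multibyte functions play) makes the manual loop unnecessary: iterating over `word` already yields whole characters.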

My next steps are to build the diacritics inserter application and to do a
lot of testing. I will also have to see whether my diacritics learning application
scales with MySQL, since we will have millions of records in our database.
One idea is to use MongoDB; another is to store the records in multiple tables, using a reference table as the base pointer (some sort of hashtable with huge buckets).
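
The multiple-tables idea could look roughly like the sketch below. This is only one possible reading of it, not a settled design: the bucket count, the `records_NN` table names, and the use of CRC32 as the hash are all assumptions chosen for illustration.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical number of bucket tables


def bucket_table(key, num_buckets=NUM_BUCKETS):
    """Map a record key to the name of the table (bucket) that would hold it.

    The hash plays the role of the reference table: it tells us which
    bucket to look in without scanning the others.
    """
    h = zlib.crc32(key.encode("utf-8"))
    return "records_%02d" % (h % num_buckets)


def group_by_table(keys, num_buckets=NUM_BUCKETS):
    """Group keys by destination table, e.g. to batch inserts per table."""
    groups = {}
    for key in keys:
        groups.setdefault(bucket_table(key, num_buckets), []).append(key)
    return groups


sample = ["stiinta", "\u0219tiin\u021b\u0103", "tara", "\u021bar\u0103"]
for table, keys in sorted(group_by_table(sample).items()):
    print(table, keys)
```

Each lookup then touches a single, smaller table, at the cost of having to pick the number of buckets up front.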

See you all at the grand finale.
