#7 dexonline – Approximate search using trigrams

And finally, my last post this summer.

The project is now finished. After analysing the data from the log file and comparing the two algorithms, Levenshtein and trigram, my mentor and I decided that the winner is…both of them :) . Trigram’s results are, from my point of view, better than I expected. Even if at first I was a little bit sceptic that the suggestions made would be the ones one would expect, they are actually pretty close to those given by the Levenshtein. And this is thanks to the fact that before a word is split into trigrams, “##” and “%%” are added at the beggining and end of it. So, for example, for the word “trigram”, the vector or trigrams is ['##t', '#tr', 'tri', 'rig', 'igr', 'gra', 'ram', 'am%', 'm%%'].

Of course that the results are not as precise as those given by the Levenshtein, because in that case, the positions of the letters on the keyboard were taken into consideration. Despite this and the fact that both are winners, I can say that trigram’s golden medal is more shiny, due to the fact that the execution time is very low. If before the start of the project the average time/search was of 0.6 seconds, the one for Levenshtein is 0.9 and for trigrams is of only 0.2 seconds.

And if it is to talk some more about statistics, only 25% of the searches don’t return a suggestion, whereas at the beggining, the percentace was 74%. Moreover, 32% of them return a single suggestion, redirecting to that page (before, only 13% were redirected).

The reason why I said that both of the algorithms are winners is that, although at first only the trigram was, we decided today to combine them by using the Levenshtein as a filter for the trigram’s results, in order to sort the suggested words according to the position of the letters, but since this was the last day of the internship and I wouldn’t have had time to finish it, my mentor decided to help me by getting the job done. Because Levenshtein is applied on a maximum of 20 words (at which trigram’s number of suggestions is limited), and not on the whole dictionary as it was in the first place, the execution time for this part is insignificant, and as a whole it still remains very low.

All in all, it was a great experience to work at this project during this summer, and I can say that the results are great and I learnt lots of things, considering that I hadn’t worked in php or mysql before.

#6 dexonline – Games

I guess the grand finally is approaching rather fast.

That’s a good thing considering my part seems to be almost complete.

The game can be played HERE and it now has difficulty and it’s more of a real game than just one question.

There still is the final deployment to be done and graphics need to be improved before that happens. Someone up there (tbr. Catalin) decided that someone else should do the graphics and design. Good choice considering my skills stop at MS Paint! :D

All in all this was a nice and productive experience.

#6 dexonline – Approximate search using trigrams

Not much time left untill the end of rsoc. Since my last post, I’ve been working on the trigram algorithm and I can say it works pretty well.

I have created a new table in the DEX database, NGram, that contains the trigrams and the id of the word in which they occur. When a user searches for a word, it is split into groups of trigrams and, after searching in the NGram table, the words that have the most trigrams in common with the misspelt one are suggested as corrections. The code is not online at this moment because it is a little messy and I have to improve it a little bit.

I have also made some improvements to the old code, the one with Levenshtein algorithm, and the search time about which I was complaining in my last post was considerably improved.

In this last week I will finish the trigram implementation and after a little more tests, my mentor and I will decide on a final form for the project.

#5 dexonline – Approximate search using trigrams

Nothing special in the last 2 weeks, due to the fact that I had 12 days off in which I was abroad. First, I was in Hungary with my highschool teachers and a group of 40 students for a week, and the next day after I arrived home, I went in Vienna for 4 more days with some friends. It was real fun, we have visited lots of interesting places and made hundreds of pictures.

But back to work now :) . In this period I analysed the log of the approximate search queries, and I concluded that, despite the fact that the average search time takes a little bit longer than I expected, 54% of them return at least a suggestion of correction, which I think is pretty good, considering that before, the percentage was just 25%. Anyway, the rest of 46% are searches that have nothing to do with the dictionary, like words in another languages or multiple words. If I’d show you some of them, you’ll probabil laugh all day.

I’ve also been making some research on trigrams and I agreed with my mentor on the steps that have to be made further on.

#5 dexonline – Games

The one thing I liked about RSoC was that they implied I should take one week off at some point to actually enjoy a bit of the summer vacation. So this last week was exactly that. I visited Bacau, Bicaz with Cheile Bicazului and it’s Red Lake then Borsec, Brasov, and the surrounding area like Sinaia, Busteni, Timisul de Sus. It was a lot of fun, with a lot of walking the mountains and visiting adventure parcs, and all this in under 7 days. :)

Now getting back to DEX. I corrected a list of mini-bugs that Catalin was kind enough to point out for me. And I added Box-Muller Transform to increase the difficulty level of the game.

I use this to get from the uniform distribution that rand() offers to a normal distribution. This way, the definitions that are wrong will be somewhat similar in meaning with the ones that are correct.

Next I will make the game learn from the player and modify the difficulty accordingly and also add support for sessions of games and results.

#4 dexonline – Approximate search using trigrams

Another post, another implementation :) . This time, the idea remained almost the same, but, with the help of my mentor, the code looks better.

First, instead of the big and ugly matrix that was first initialized from the beginning with all the values, I used some pretty functions for doing that. Unfortunatelly, that was much slower, so we decided to store the distances in another way: two vectors of 26 elements each, representing the coordinates of the letters on the keyboard.

Then, the code was simplified a little, and, after lots of tests, it seems to have no more bugs. Therefore, it is now in production, but for now only 25% of the approximate searches are using it, percentage to be gradually increased in the next period. I also added separate logging, in order to check the results.

So, I can say that this first idea is done and already in use. I will now start thinking about the second one, with trigrams, and a method to test which one is better.

#4 dexonline – Games

After I played around with the definition dictionary, and after I got another version of the dictionary to play with and after I managed to parse it well enough, well I made the game I wanted to make all alone.

It’s far from finished yet of course, but it’s a nice start, and it works.

Did I mention that it’s fun as well?

One should try it out! Here.

There are things to do like keep logs and, never thinking I would say that but, I need to make this game a lot harder, right now it feels more like a kid game. Now making a game difficulty level may prove to be a lot harder than making the game itself, but I do have some ideas like using a Gaussian with the center in the word when  choosing the wrong definitions, or marking the definition that contain a similar word to describe itself.

In the meantime, do enjoy the new game. Ohh and did I mention that the modifications I added to hangman are now on the main website?

#3 dexonline – Approximate search using trigrams

As I mentioned in my last post, during the last two weeks I have been changing my first implementation, giving up the idea of a filter applied on the suggestions found by the classic levenshtein.

Now, my implementation went directly into the mysql UDF. I made a matrix with the distances between each letter on the keyboard: 5 for neighbour letters on the horizontal, 10 for the ones on the vertical and 15 for the rest. Then, I have the following rules, instead of the classic +1 in the levenshtein algorithm: I add 5 for a missing letter, 5 for switched letters, and for typing the wrong letter or an extra letter 5, 10 or 15, depending on the distance.

The algorithm is not 100% perfect, it has some bugs on some searches, because there are some particular cases for which I haven’t found a rule. But overall, I tested it with 2500 searches from the log and the resulted suggestions are pretty good.

In the following days, the code will be put online, and step by step, an increasing percentage of searches will use it, in order to test its functioning.

#3 dexonline – Games

The last 2 weeks were rather different than what I expected.

College tends to introduce one in a rather large array of tools for use in development. I always had a tendency to believe tools should be kept at a minimum and concepts should stand at the core. I still believe I am right in this matter…

And now the “but”… in a development process one needs to use a lot of tools (and unlike college, all of them at the same time), this also was the case with DEX, which uses a wide array of tools for it’s tasks, from the standard php, apache, mysql to smarty and other interesting toys. What I didn’t expect was that I would be the one to introduce a new tool, unfortunately I have to.

The tool that I introduce has at first glance no connection between it and a web development process or a game, but somehow the knowledge I gathered in college comes in handy. I will add antlr (Another Tool for Language Recognition). I need it to parse the definitions as I mentioned in the previous post.

What it does?

Well a definition has a bit more data than one needs. It has type and the place where the word comes from and a number of example of uses and meanings for each. What I need for the games I’m making is just one simple definition for just one meaning of each word. So I parse the large definition and extract what I need. This is harder than usual because definitions don’t have a very strict syntax like is the case with a programming language and the grammar I am writing needs to accept as many variations as possible. Not to mention that different dictionary’s have a different syntax and a grammar is needed for each of them.

#2 dexonline – Approximate search using trigrams

Hello again!

Since my last post I have been improving the Levenshtein algorithm I was talking about. It now searches for words with distance one, then if there aren’t any, distance two and distance 3. I have decided together with my mentor and after taking o look at the log that a distance greater than 3 is not necessary, because nobody will type with so many mistakes, unless he or she is drunk.

Then, after obtaining a couple of suggestions, these are filtered, showing the user only those words that have 2 swapped letters, a letter typed instead of the correct one  and next to it on the keyboard, those that have a letter in minus or those with an extra letter that is next to one in the searched word. Depending on the word searched, this can lead to a single suggestion which means an automatic redirect to that word.

In order to filter the results, I first considered a kind of matrix with 10 colums and 3 lines containing the letters on the keyboard. I calculate a “sum” of the searched word and the sums of the suggested words and I keep only those that pass the conditions mentioned above. To get the sums, I add up the coordinates of each letter of the word. This idea was not as good as I expected, and now I am worinkg on implementing this directly into the levenshtein algorithm.

Other little things that I’ve done are writing a little program that tests the speed of the algorithm by repeatedly searching words taken from the log and also making the levenshtein not to consider the diacritics as differences.

I hope that until my next post I will finish this part and start the second idea, the one with trigrams. The winner between these two will end up in the site.