#7 dexonline – Approximate search using trigrams

And finally, my last post this summer.

The project is now finished. After analysing the data from the log file and comparing the two algorithms, Levenshtein and trigram, my mentor and I decided that the winner is…both of them :) . Trigram’s results are, from my point of view, better than I expected. Even if at first I was a little bit sceptic that the suggestions made would be the ones one would expect, they are actually pretty close to those given by the Levenshtein. And this is thanks to the fact that before a word is split into trigrams, “##” and “%%” are added at the beggining and end of it. So, for example, for the word “trigram”, the vector or trigrams is ['##t', '#tr', 'tri', 'rig', 'igr', 'gra', 'ram', 'am%', 'm%%'].

Of course that the results are not as precise as those given by the Levenshtein, because in that case, the positions of the letters on the keyboard were taken into consideration. Despite this and the fact that both are winners, I can say that trigram’s golden medal is more shiny, due to the fact that the execution time is very low. If before the start of the project the average time/search was of 0.6 seconds, the one for Levenshtein is 0.9 and for trigrams is of only 0.2 seconds.

And if it is to talk some more about statistics, only 25% of the searches don’t return a suggestion, whereas at the beggining, the percentace was 74%. Moreover, 32% of them return a single suggestion, redirecting to that page (before, only 13% were redirected).

The reason why I said that both of the algorithms are winners is that, although at first only the trigram was, we decided today to combine them by using the Levenshtein as a filter for the trigram’s results, in order to sort the suggested words according to the position of the letters, but since this was the last day of the internship and I wouldn’t have had time to finish it, my mentor decided to help me by getting the job done. Because Levenshtein is applied on a maximum of 20 words (at which trigram’s number of suggestions is limited), and not on the whole dictionary as it was in the first place, the execution time for this part is insignificant, and as a whole it still remains very low.

All in all, it was a great experience to work at this project during this summer, and I can say that the results are great and I learnt lots of things, considering that I hadn’t worked in php or mysql before.

#6 dexonline – Approximate search using trigrams

Not much time left untill the end of rsoc. Since my last post, I’ve been working on the trigram algorithm and I can say it works pretty well.

I have created a new table in the DEX database, NGram, that contains the trigrams and the id of the word in which they occur. When a user searches for a word, it is split into groups of trigrams and, after searching in the NGram table, the words that have the most trigrams in common with the misspelt one are suggested as corrections. The code is not online at this moment because it is a little messy and I have to improve it a little bit.

I have also made some improvements to the old code, the one with Levenshtein algorithm, and the search time about which I was complaining in my last post was considerably improved.

In this last week I will finish the trigram implementation and after a little more tests, my mentor and I will decide on a final form for the project.

#5 dexonline – Approximate search using trigrams

Nothing special in the last 2 weeks, due to the fact that I had 12 days off in which I was abroad. First, I was in Hungary with my highschool teachers and a group of 40 students for a week, and the next day after I arrived home, I went in Vienna for 4 more days with some friends. It was real fun, we have visited lots of interesting places and made hundreds of pictures.

But back to work now :) . In this period I analysed the log of the approximate search queries, and I concluded that, despite the fact that the average search time takes a little bit longer than I expected, 54% of them return at least a suggestion of correction, which I think is pretty good, considering that before, the percentage was just 25%. Anyway, the rest of 46% are searches that have nothing to do with the dictionary, like words in another languages or multiple words. If I’d show you some of them, you’ll probabil laugh all day.

I’ve also been making some research on trigrams and I agreed with my mentor on the steps that have to be made further on.

#4 dexonline – Approximate search using trigrams

Another post, another implementation :) . This time, the idea remained almost the same, but, with the help of my mentor, the code looks better.

First, instead of the big and ugly matrix that was first initialized from the beginning with all the values, I used some pretty functions for doing that. Unfortunatelly, that was much slower, so we decided to store the distances in another way: two vectors of 26 elements each, representing the coordinates of the letters on the keyboard.

Then, the code was simplified a little, and, after lots of tests, it seems to have no more bugs. Therefore, it is now in production, but for now only 25% of the approximate searches are using it, percentage to be gradually increased in the next period. I also added separate logging, in order to check the results.

So, I can say that this first idea is done and already in use. I will now start thinking about the second one, with trigrams, and a method to test which one is better.

#3 dexonline – Approximate search using trigrams

As I mentioned in my last post, during the last two weeks I have been changing my first implementation, giving up the idea of a filter applied on the suggestions found by the classic levenshtein.

Now, my implementation went directly into the mysql UDF. I made a matrix with the distances between each letter on the keyboard: 5 for neighbour letters on the horizontal, 10 for the ones on the vertical and 15 for the rest. Then, I have the following rules, instead of the classic +1 in the levenshtein algorithm: I add 5 for a missing letter, 5 for switched letters, and for typing the wrong letter or an extra letter 5, 10 or 15, depending on the distance.

The algorithm is not 100% perfect, it has some bugs on some searches, because there are some particular cases for which I haven’t found a rule. But overall, I tested it with 2500 searches from the log and the resulted suggestions are pretty good.

In the following days, the code will be put online, and step by step, an increasing percentage of searches will use it, in order to test its functioning.

#2 dexonline – Approximate search using trigrams

Hello again!

Since my last post I have been improving the Levenshtein algorithm I was talking about. It now searches for words with distance one, then if there aren’t any, distance two and distance 3. I have decided together with my mentor and after taking o look at the log that a distance greater than 3 is not necessary, because nobody will type with so many mistakes, unless he or she is drunk.

Then, after obtaining a couple of suggestions, these are filtered, showing the user only those words that have 2 swapped letters, a letter typed instead of the correct one  and next to it on the keyboard, those that have a letter in minus or those with an extra letter that is next to one in the searched word. Depending on the word searched, this can lead to a single suggestion which means an automatic redirect to that word.

In order to filter the results, I first considered a kind of matrix with 10 colums and 3 lines containing the letters on the keyboard. I calculate a “sum” of the searched word and the sums of the suggested words and I keep only those that pass the conditions mentioned above. To get the sums, I add up the coordinates of each letter of the word. This idea was not as good as I expected, and now I am worinkg on implementing this directly into the levenshtein algorithm.

Other little things that I’ve done are writing a little program that tests the speed of the algorithm by repeatedly searching words taken from the log and also making the levenshtein not to consider the diacritics as differences.

I hope that until my next post I will finish this part and start the second idea, the one with trigrams. The winner between these two will end up in the site.

#1 dexonline – Approximate search using trigrams


My name is Mihai Trifu and this summer I’m working at dexonline. My goal is to improve approximate search, which for the moment is not working very well.

The first two weeks were used mostly for installing the software needed, downloading and getting used to the existing code, taking a look at php and mysql (which I wasn’t familiar with)  and research about approximate search.

I also managed to write some code. The existing one only used to detect one letter errors, but now I’ve implemented a Levenshtein algorithm that also detects errors of distance 2 (distance one means inserting, deleting or changing a letter and also swapping two neighbour  ones).

At the moment I am working on improving this algorithm, making it faster and also adding some  kind of filters which will try to gues the searcher’s intention and reduce the number of suggested words.