And finally, my last post this summer.
The project is now finished. After analysing the data from the log file and comparing the two algorithms, Levenshtein and trigram, my mentor and I decided that the winner is… both of them. Trigram’s results are, from my point of view, better than I expected. At first I was a little sceptical that its suggestions would be the ones a user would expect, but they are actually pretty close to those given by Levenshtein. This is thanks to the fact that before a word is split into trigrams, “##” is added at its beginning and “%%” at its end. So, for example, for the word “trigram”, the vector of trigrams is ['##t', '#tr', 'tri', 'rig', 'igr', 'gra', 'ram', 'am%', 'm%%'].
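To make the padding trick concrete, here is a minimal sketch in Python (not the project's actual PHP code). The function names `trigrams` and `shared_trigrams` are my own for illustration, and counting shared trigrams is only an assumed way of comparing two words, not necessarily the scoring the project uses.

```python
def trigrams(word, start_pad="##", end_pad="%%"):
    """Split a word into overlapping 3-character pieces after padding it."""
    padded = start_pad + word + end_pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def shared_trigrams(a, b):
    """Assumed similarity measure: number of distinct trigrams two words share."""
    return len(set(trigrams(a)) & set(trigrams(b)))


print(trigrams("trigram"))
# ['##t', '#tr', 'tri', 'rig', 'igr', 'gra', 'ram', 'am%', 'm%%']
print(shared_trigrams("trigram", "trigran"))
# 6
```

Because of the padding, the first and last letters of a word each appear in three trigrams like every other letter, and a word's start and end can be told apart from the same letters in the middle of another word.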
Of course, the results are not as precise as those given by Levenshtein, because that implementation took the positions of the letters on the keyboard into consideration. Despite this, and even though both are winners, I can say that trigram’s gold medal shines brighter, because its execution time is very low. Before the start of the project the average time per search was 0.6 seconds; with Levenshtein it is 0.9 seconds, and with trigrams only 0.2 seconds.
As for some more statistics, only 25% of the searches now return no suggestion, whereas at the beginning the percentage was 74%. Moreover, 32% of them return a single suggestion and redirect straight to that page (before, only 13% were redirected).
The reason I said that both algorithms are winners is that, although at first only trigram was, we decided today to combine them: Levenshtein is used as a filter on trigram’s results, in order to sort the suggested words according to the position of the letters. Since this was the last day of the internship and I wouldn’t have had time to finish it, my mentor decided to help me by getting the job done. Because Levenshtein is now applied to a maximum of 20 words (the limit on the number of trigram suggestions), and not to the whole dictionary as it was in the first place, the execution time for this part is insignificant, and the total still remains very low.
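The two-stage idea can be sketched roughly like this, again in Python rather than the project's PHP. Everything here is an assumption made for illustration: the `suggest` function, ranking candidates by the number of shared trigrams, and the use of plain Levenshtein distance (the project's version also weighs keyboard positions, which this sketch leaves out). Only the limit of 20 candidates comes from the project.

```python
def trigrams(word):
    """Pad with '##' and '%%', then split into overlapping trigrams."""
    padded = "##" + word + "%%"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def levenshtein(a, b):
    """Plain edit distance (no keyboard weighting), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(query, dictionary, limit=20):
    """Stage 1: keep the top `limit` words by shared trigrams.
    Stage 2: sort only those candidates by Levenshtein distance."""
    q = set(trigrams(query))
    scored = [(len(q & set(trigrams(w))), w) for w in dictionary]
    scored = [(s, w) for s, w in scored if s > 0]
    scored.sort(key=lambda pair: -pair[0])
    candidates = [w for _, w in scored[:limit]]
    return sorted(candidates, key=lambda w: levenshtein(query, w))


words = ["trigram", "diagram", "program", "banana"]
print(suggest("trigran", words)[0])
# trigram
```

The expensive distance computation runs on at most `limit` words, however large the dictionary is, which is why the second stage adds so little to the total time.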
All in all, it was a great experience to work on this project this summer. I can say that the results are great and that I learnt lots of things, considering that I hadn’t worked with PHP or MySQL before.