DEX online is a Romanian-language dictionary. We have 1.9M unique monthly visitors, 13M monthly page views and 220K Facebook fans. Help us write new features and our users will love you! Also, win medals!

Romanian Literature Crawler

Project summary

We want to crawl and index web pages of Romanian literature, news articles, blogs and other texts. This is the groundwork for several exciting features beyond the scope of this internship:

  • Find words that DEX online doesn't know, but that occur frequently on the Internet. Write a script to extract usage examples for these words and pass them on to a team of linguists so that they can write definitions for them.
  • Show usage examples along with our definitions. Offer an interface where admins can select the most relevant examples.
  • Compute statistics on diacritics. For example, compute that, in the context "abcdSefgh", the middle letter is S with 90% probability and Ș with 10% probability. This can be used to insert diacritics into a text that lacks them.
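
To make the diacritics statistic concrete, here is a minimal sketch of how such context counts could be gathered from a corpus that already has correct diacritics. It is written in Python only for brevity (the project itself targets PHP), and the names strip_diacritics, context_counts and probabilities, as well as the default context radius of 4 letters, are illustrative assumptions rather than part of the project. It also assumes the corpus uses the comma-below letters ș and ț, not the legacy cedilla variants.

```python
import unicodedata
from collections import Counter, defaultdict

# Map each Romanian diacritic letter to its base letter (one character each,
# so the stripped text keeps the same length as the original).
BASE = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")

def strip_diacritics(text):
    """Return the text with Romanian diacritics replaced by base letters."""
    return text.translate(BASE)

def context_counts(corpus, radius=4):
    """Count, for each stripped context, which real letter occurred in it.

    The key is (left context, stripped letter, right context); the value is a
    Counter of the letters actually found in the corpus at that position.
    """
    corpus = unicodedata.normalize("NFC", corpus)  # use precomposed letters
    stripped = strip_diacritics(corpus)
    counts = defaultdict(Counter)
    for i, ch in enumerate(stripped):
        if ch.lower() in "aist":  # only these letters can carry a diacritic
            left = stripped[max(0, i - radius):i]
            right = stripped[i + 1:i + 1 + radius]
            counts[(left, ch, right)][corpus[i]] += 1
    return counts

def probabilities(counter):
    """Turn raw counts for one context into relative frequencies."""
    total = sum(counter.values())
    return {letter: n / total for letter, n in counter.items()}
```

For a text without diacritics, one could then look up each ambiguous letter's context and pick the most probable variant; how to handle unseen contexts (for example, by falling back to a shorter radius) is left open here.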

Project details

The crawler should:

  • read a whitelist of hosts and crawl only those hosts. We need trustworthy Romanian texts (good grammar, correct use of diacritics, etc.)
  • have a per-site rate limiting mechanism so as not to annoy the sites we crawl
  • extract raw text from pages, for example using regular expressions tailored to each site's page structure
  • ignore everything except the actual article text
  • be able to checkpoint and restart (it is almost guaranteed to run for days at a time)
  • store tuples of (URL, raw page, extracted text, timestamp) in a database
  • build an index of word -> page, taking into account inflected forms
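
As an illustration of the first two requirements (host whitelist and per-site rate limiting), the following is a minimal sketch, again in Python for brevity even though the crawler itself is expected to be written in PHP. The class name PoliteFrontier, its methods, and the 5-second default delay are assumptions made for this example, not an existing design.

```python
import time
from urllib.parse import urlparse

class PoliteFrontier:
    """Tracks which hosts may be crawled and when each was last contacted."""

    def __init__(self, whitelist, min_delay=5.0, clock=time.monotonic):
        self.whitelist = {host.lower() for host in whitelist}
        self.min_delay = min_delay      # minimum seconds between hits per host
        self.clock = clock              # injectable so tests can fake time
        self.last_fetch = {}            # host -> clock value of last request

    def allowed(self, url):
        """True only if the URL's host is on the whitelist."""
        host = urlparse(url).hostname   # already lowercased, None if missing
        return host is not None and host in self.whitelist

    def wait_time(self, url):
        """Seconds to wait before the URL's host may be contacted again."""
        last = self.last_fetch.get(urlparse(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def mark_fetched(self, url):
        """Record that the URL's host was contacted just now."""
        self.last_fetch[urlparse(url).hostname] = self.clock()
```

A crawl loop would skip any URL for which allowed() is false, sleep for wait_time() before each request, and call mark_fetched() afterwards. Persisting last_fetch and the queue of pending URLs to the database would be one way to support checkpointing and restarts.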

First steps: Installation instructions, newbie ticket list

Required knowledge

  • basic PHP; our code is home-brewed, clean, well-maintained and does not use advanced language features
  • basic MySQL; our database structure is hairy, but you will need to work with a very small part of it
  • basic Idiorm, a simple ORM library
  • you will work at arm's length from the core DEX online code, since the crawler is independent of our website; however, you may need to interface with our word database


