Apertium: Open source machine translation
Apertium is an open source shallow-transfer machine translation (MT) system. In addition to the translation engine, it also provides tools for manipulating linguistic data, and translators designed to run using the engine. At the time of writing, there are stable two-way (bidirectional) translators available for English-Catalan, English-Spanish, Catalan-Spanish, Catalan-French, Spanish-Portuguese, Spanish-Galician, and French-Spanish; as well as one-way translators that translate from Esperanto to Catalan and to Spanish, and from Romanian to Spanish. There are also a number of unstable translators in various stages of development. (A list of language pairs, updated daily, is available on the Apertium wiki.)
In other words, Apertium is the open-source Systran (the engine that powers Babelfish).
Apertium version 1 was based on existing translators that had been designed by the Transducens group at the Universitat d'Alacant, and funded by the OpenTrad consortium. Subsequent development has been funded by the university, as well as by Prompsit Language Engineering. While Apertium 1 was designed with the Romance languages of Spain in mind, Apertium 2 added support for less-related languages (Catalan-English); Apertium 3 added Unicode support.
Apertium is designed according to the Unix philosophy: translation is performed in stages by a set of tools that operate on a simple text stream. Other tools can be added to the pipeline as required, and the text stream can be modified using standard tools. There is also a wrapper script (called simply apertium) that takes care of most of the details.
$ echo 'Esta es Gloria, mi amiga argentina'|apertium es-en
This is Gloria, my Argentinian friend
(That example was picked at random from 'Teach Yourself Spanish Grammar' - translation quality is not always that high, though).
Apertium packages are available for Debian and Ubuntu (apt-get install apertium); packages are not yet available for other distributions, though Apertium has been used successfully on several of them. The mildly curious may prefer to try the Surf and Translate demo on the Apertium Web site.
I intend to follow this article with articles of a more tutorial nature; the rest of this article is intended to give an explanation of the most common terms in machine translation.
Types of Machine Translation
Machine translation systems differ in sophistication, and there are several basic approaches to translation. At the basic level, any translation system has to include dictionary lookup; however, this can also use a stemmer to find the basic form of a word (instead of looking up 'beers' in the dictionary, it looks up 'beer'), or a morphological analyser (which operates much like a stemmer, but also includes grammatical information - Apertium's analyser would return beer<n><pl> from the word 'beers', to tell the rest of the system that the word is a noun, and plural).
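The difference between a stemmer and a morphological analyser can be sketched in a few lines of Python. This is a toy illustration only, not how Apertium is implemented: the tiny lexicon, the single suffix rule, the function names, and the '*' marker for unknown words are all assumptions made for the example. Only the beer<n><pl> output format comes from the description above.

```python
# Toy lexicon, invented for illustration: surface form -> (lemma, tags).
LEXICON = {
    "beer": ("beer", ["n", "sg"]),
    "beers": ("beer", ["n", "pl"]),
}


def stem(word):
    """Naive stemmer: strip a trailing 's' and return only the base form."""
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word


def analyse(word):
    """Toy morphological analyser: return the lemma followed by its tags."""
    entry = LEXICON.get(word)
    if entry is None:
        # Assumption of this sketch: flag words missing from the lexicon.
        return "*" + word
    lemma, tags = entry
    return lemma + "".join("<%s>" % tag for tag in tags)


print(stem("beers"))     # base form only, no grammatical information
print(analyse("beers"))  # base form plus 'noun' and 'plural' tags
```

The stemmer throws the grammatical information away, while the analyser passes it on to the later stages of translation.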
Rule-based systems were the first 'real' kind of machine translation system. Rather than simply translating word to word, rules are developed that allow words to be reordered, to have different meanings depending on context, etc. The Georgetown-IBM experiment in 1954 was one of the first rule-based machine translation systems; Systran and Apertium are RBMT systems.
Example Based Machine Translation (EBMT) systems translate using the results of previous translations. Translation Memory systems are the most basic example of EBMT; more complicated TM systems (such as OmegaT), which use techniques such as fuzzy matching to suggest similar translations, are closer to the original idea behind EBMT.
Statistical Machine Translation (SMT) is, at its most basic, a more complicated form of word translation, where statistical weights are used to decide the most likely translation of a word. Modern SMT systems are phrase-based rather than word-based, and assemble translations using the overlap in phrases. Google Translate is based on SMT; there is also an open-source system for SMT called Moses.
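The 'most basic' word-based form described above can be sketched as follows. The probability table is hypothetical and the numbers are invented; a real SMT system would estimate them from large amounts of parallel text and would combine them with other models rather than picking each word in isolation.

```python
# Hypothetical translation table: source word -> {candidate: probability}.
# The figures are made up for this sketch.
TRANSLATION_PROBS = {
    "el": {"the": 1.0},
    "banco": {"bank": 0.7, "bench": 0.3},
}


def translate_word(word):
    """Pick the candidate with the highest weight; pass unknown words through."""
    candidates = TRANSLATION_PROBS.get(word)
    if not candidates:
        return word
    return max(candidates, key=candidates.get)


def translate(sentence):
    """Translate word by word, ignoring context entirely."""
    return " ".join(translate_word(word) for word in sentence.split())


print(translate("el banco"))
```

Because each word is chosen without looking at its neighbours, 'banco' always comes out as 'bank', even in a sentence about a park; phrase-based systems reduce this problem by weighting longer stretches of text.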
Interlingua systems are an extension of rule-based systems that use an intermediate language instead of direct translation. Systems based on Interlingua can then more readily translate between various combinations of languages. OpenLogos is an open-source Interlingua-based machine translator, based on the Logos system, a competitor of Systran.
Transfer-based systems are another approach to rule-based machine translation, influenced by the Interlingua idea. Instead of using a whole language, an intermediate representation of equivalent pieces is used. This still uses language-pair-specific translation, but the number of language-specific rules is reduced to a minimum. There are two kinds of transfer-based translation: shallow transfer (syntactic), where words are translated based on combinations of word types; and deep transfer (semantic), which uses a representation of the meaning of each word as a basis for how it should be translated.
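To make 'combinations of word types' concrete, here is a minimal sketch of a single shallow-transfer rule, using the phrase 'mi amiga argentina' from the earlier example. It is written in Python purely for illustration and is not Apertium's rule format; the dictionary, the tag names, and the function name are assumptions of the sketch. The rule says: when a noun is followed by an adjective, swap them.

```python
# Hypothetical bilingual dictionary of base forms, invented for the sketch.
BILINGUAL = {"mi": "my", "amigo": "friend", "argentino": "Argentinian"}


def transfer(words):
    """Apply one reordering rule to a list of (lemma, word_type) pairs."""
    reordered = []
    i = 0
    while i < len(words):
        is_noun_adj = (
            i + 1 < len(words)
            and words[i][1] == "n"
            and words[i + 1][1] == "adj"
        )
        if is_noun_adj:
            # Rule: noun + adjective becomes adjective + noun.
            reordered.append(words[i + 1])
            reordered.append(words[i])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Translate each base form, leaving unknown ones unchanged.
    return [BILINGUAL.get(lemma, lemma) for lemma, _ in reordered]


analysed = [("mi", "det"), ("amigo", "n"), ("argentino", "adj")]
print(" ".join(transfer(analysed)))
```

The rule never mentions 'amigo' or 'argentino' themselves, only their word types, so it applies equally to every noun and adjective in the dictionary.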
Most current machine translation systems are hybrid systems: Moses is primarily SMT, but can use morphological analysers to add extra confidence in translation options; Apertium uses statistical methods for word sense disambiguation.
SMT is the current focus of most serious research in MT, but rule-based systems still have a number of advantages. First and foremost, SMT systems require the availability of a large amount of text in both languages1, which for most language pairs is not available. Secondly, the generated dictionaries contain all likely word combinations for both languages, which both consume a lot of memory and take much more processing time than do the kind of dictionaries used in rule-based systems (which also have the advantage of being useful as human-readable dictionaries - TinyLex is a Java ME program for bilingual dictionaries that uses Apertium data).
Another aspect of SMT systems that may or may not be a drawback, depending on your perspective, is that they use monolingual models as a way of determining how to combine the phrases they translate. The upside is that, unless they encounter words that don't exist in their dictionaries, the output will be of better quality than with rule-based translation. The downside is that this translation may bear very little relation to the source sentence. With a rule-based system, by contrast, a bad translation will look like garbage, so at least the failure is obvious.
Why Open Source Translation?
Translation quality depends on the closeness of the languages involved: all other things being equal, a Spanish - Portuguese translator will give a better translation than a Spanish - English translator. Another factor is the domain: words that could be ambiguous in general use may only have one meaning in a specific context. This is well known, and for this reason, most commercial translation systems provide the ability to choose specific domains, and to specify meanings in a user-defined dictionary that can override the system dictionary.
What they don't provide, however, is a way to specify custom rules.
In the majority of machine translation uses, documents are translated in bulk, and later edited. Human translators are expensive, and machine translation is used to reduce this cost, or even to remove it entirely. In the majority of cases, the human editor will be expected to follow an in-house style guide; even if the translation is accurate and clear, it would most likely still require editing to conform to this style guide. Even if the translator can't give better accuracy, it can still reduce expense by reducing the amount of editing a document requires.
The usual solution to this is to combine translation memory with automatic translation. A better solution would be to combine translation memory with a fully customisable machine translator - an open source machine translator.
SMT is starting to be used by companies that seek to provide 'bespoke' machine translators with example-based features, which can adapt as corrections are made to the translation. However, as the selection of a phrase is based on the number of occurrences, the same correction has to be made a number of times - potentially hundreds or even thousands of times.
Consider this example:
Wolę piekło chaosu od piekła porządku.2
In Polish, the preposition 'od' means 'from', with a few exceptions. The above sentence is one example of such an exception:
I prefer the hell of chaos to the hell of order.
Writing a rule in Apertium to say that the preposition 'od' is 'to' following the verb 'woleć' is quite simple, and takes a lot less time than does writing enough examples for an SMT-based translator to infer the same, and doesn't carry the risk of harming cases that were previously handled correctly.
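The logic of such a rule can be sketched in Python. This is only an illustration of the idea, not Apertium's actual rule syntax; the function name is hypothetical, the input is assumed to be a list of base forms as a morphological analyser would supply, and 'following the verb' is simplified to 'anywhere after the verb in the same sentence'.

```python
def translate_od(lemmas):
    """Return the English translation chosen for each 'od' in a sentence.

    `lemmas` is a list of Polish base forms for one sentence.
    """
    seen_wolec = False
    choices = []
    for lemma in lemmas:
        if lemma == "woleć":
            seen_wolec = True
        elif lemma == "od":
            # Exception: after 'woleć' (to prefer), 'od' is 'to';
            # everywhere else the default 'from' is kept.
            choices.append("to" if seen_wolec else "from")
    return choices


# Base forms of: Wolę piekło chaosu od piekła porządku.
print(translate_od(["woleć", "piekło", "chaos", "od", "piekło", "porządek"]))
# A sentence without 'woleć' keeps the default translation.
print(translate_od(["list", "od", "Anna"]))
```

The exception is stated once and only fires in the context it names, which is why it cannot disturb sentences that were already translated correctly.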
A Glimpse of the Future
I hope I've made some of you more interested in Apertium: my next article will be a tutorial covering the creation of morphological analysers in Apertium. Anyone too impatient for that can find more information on the Apertium wiki, and there are usually a number of people on #apertium on irc.freenode.org available to answer questions.
1 In a paper from Google Research, the authors describe a method of overcoming this problem by using the statistical translation probabilities of multiple languages (which they have used recently, in their newest language additions); essentially, cross-referencing multiple bilingual dictionaries to create new ones. The tool that does this in Apertium is called apertium-crossdics. (The paper also suggests that "One solution is to create such parallel data by automatic translation and then retaining reliable translations by using confidence metrics", which is a bit like saying that infinite monkeys can at least translate the works of Shakespeare.)
Jimmy is a single father of one, who enjoys long walks... Oh, right.
Jimmy has been using computers from the tender age of seven, when his father
inherited an Amstrad PCW8256. After a few brief flirtations with an Atari ST
and numerous versions of DOS and Windows, Jimmy was introduced to Linux in 1998
and hasn't looked back.
In his spare time, Jimmy likes to play guitar and read: not at the same time,
but the picks make handy bookmarks.