Why Russian to English is difficult for Machine Translation
When we consider the history of machine translation, the science by which computers automatically translate from one human language to another, we see that much of it starts with Russian. One of the earliest mentions of automated translation involves the Russian scientist Peter Troyanskii, who, even before computers were available, submitted a proposal that included both a bilingual dictionary and a method for dealing with grammatical roles between languages, based on the grammatical system of Esperanto.
The first set of proposals for computer-based machine translation was presented in 1949 by Warren Weaver, a researcher at the Rockefeller Foundation, in his now-famous "Translation" memorandum. In it he wrote: “it is very tempting to say that a book written in Russian is simply a book written in English which was coded into the Russian code.” These proposals were based on information theory, on the successes of code-breaking during the Second World War, and on theories about the universal principles underlying natural language. But Weaver’s memo was not the only driver for this emerging field: what really kick-started research was Cold War fear and US analysts’ desire to easily read and translate Russian technical papers. Weaver also inspired the founders of Language Weaver to name their company after him in the early 2000s. Language Weaver was the first to commercialize and productize Statistical Machine Translation (SMT) and was the source of much of the subsequent innovation in SMT; its alumni went on to start Google Translate and Moses and to influence Amazon’s MT/AI initiatives, and the company and its intellectual property are now owned by SDL plc.
The Georgetown experiment, which in 1954 successfully demonstrated fully automatic translation of more than sixty Russian sentences into English, was one of the earliest recorded MT projects. Its researchers asserted that machine translation would be a solved problem within three to five years. That claim of being a few years away from solving MT has been a frequent refrain of the MT community, and almost seventy years later MT remains a challenging problem. Recent advances with Neural MT are welcome and indeed significant, but MT remains one of the most challenging research areas in AI.
Why is MT such a difficult NLP problem?
As the results of 70 years of ongoing MT research show, machine translation is indeed one of the most difficult problems in the Natural Language Processing (NLP) field. It is worth considering why this is so, as it explains why it has taken 70 years to get here, and why it may still take much more time to get to “always perfect” MT, even in these heady NMT breakthrough days.
To illustrate the difficulty, it is useful to contrast MT with the automated speech recognition (ASR) challenge. Take a simple sentence like: “Today, we are pleased to announce a significant breakthrough with our ongoing MT research, especially as it pertains to Russian to English translations.” In the case of ASR there is really only one correct answer: the computer either identifies each word correctly or it does not, and even when it misses a word, one can often recover the meaning from the context and the other correctly recognized words.
Computers perform well when problems have binary outcomes, where things are either right or wrong, and they tend to solve these kinds of problems much more effectively than problems where the “answers” are much less clear. If we consider the sentence above as a translation problem, it is a very different computing challenge. Language is complex and varied, and the exact same thing can be said and translated in many different ways, all of which can be considered correct. If you add the possibilities of slightly wrong or grossly wrong translations, you can see that there is a huge range of permutational possibilities. The sentence in question has many possible correct translations, and herein lies the problem: computers have no real way to assess these variations other than through probability calculations and statistical data density, which are almost entirely defined by the data you train on. If you train on a data set that does not contain every possible translation, you will have missed some possibilities, and the truth is that we NEVER train an engine on every possible acceptable translation.
Michael Housman, chief data science officer at RapportBoost.AI and faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example, and noted that machines were able to beat the best human Go player faster than anyone anticipated because of that game’s very clear rules and limited, definable set of moves.
Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then, of course, you need labeled data. You need to tell the machine to do it right or wrong.”
Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”
Erik Cambria, an academic AI researcher and assistant professor at Nanyang Technological University in Singapore, said: “The biggest issue with machine translation today is that we tend to go from the syntactic form of a sentence in the input language to the syntactic form of that sentence in the target language. That’s not what we humans do. We first decode the meaning of the sentence in the input language and then we encode that meaning into the target language.”
All these hindering factors will remain in effect for the foreseeable future, so we should not expect another big leap forward until we find huge masses of new, high-quality data or develop a new breakthrough in pattern detection methodology.
Why are some language combinations more difficult in MT?
In essence (grossly oversimplified), MT is a pattern detection and pattern matching technique where a computer is shown large volumes of clean equivalent sentences in two languages and it “learns” how to “translate” from analyzing these examples. NMT does this differently than SMT, but essentially they are both detecting patterns in the data they are shown, with NMT having a much deeper sense of what a pattern might be. This is why the quality and volume of the “training data” matters, as it defines the patterns that can be learned.
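To make the idea of pattern detection concrete, here is a deliberately tiny Python sketch (the sentence pairs are illustrative, not a real corpus): it simply counts source/target word co-occurrences across parallel sentences, which is the crudest possible form of the pattern learning that SMT and NMT perform at far greater depth and scale.

```python
from collections import Counter, defaultdict

# Toy illustration, not a real MT system: count how often each Russian word
# co-occurs with each English word across a tiny parallel "corpus".
parallel_corpus = [
    ("я пошёл в магазин", "i went to the shop"),
    ("я пошёл в школу", "i went to the school"),
    ("он пошёл в магазин", "he went to the shop"),
]

cooc = defaultdict(Counter)
for src, tgt in parallel_corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

# The most frequent co-occurring English words are a crude "translation" guess;
# real systems use alignment models or neural attention to resolve the ambiguity.
for s, counts in cooc.items():
    print(s, "->", counts.most_common(3))
```

Even this caricature shows why training data defines what can be learned: a word or construction that never appears in the parallel data leaves no pattern behind to match against.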
What we have seen over the last 70 years is that languages that are more similar tend to be easier to model (MT is, after all, a translation model). Thus it is much easier to build an MT system for Spanish <> Portuguese, because the two languages are very similar and share many equivalent linguistic structures. In contrast, English <> Japanese is much more challenging because there are big differences in linguistic characteristics: orthography (Japanese can be written in three scripts), morphology, grammar, word order, honorific structure and so on. And while English <> Japanese is difficult, it is much easier to build a model for Japanese <> Korean, since those languages have much more structural and linguistic similarity.
Thus the basic cause of difficulty is the fundamental linguistic difference between the two languages: linguistic concepts that exist in one language may not exist in the other, so equivalencies are very difficult to formulate and model. A recent research paper describes language modeling difficulty using the Europarl corpus, whose existence allows many comparative research experiments. A key finding of this study was that inflectional morphology is a big factor in difficulty: in the chart below, German (DE), Hungarian (HU), and Finnish (FI) are more difficult for this reason, while Swedish (SV) and Danish (DA) are easier because they are more similar to English.
A prior study based on the Europarl data has been a key reference for which language combinations are easy or difficult to model; roughly equivalent datasets were used to build these comparative models.
What this chart shows is that the easiest translation direction is Spanish to French (BLEU score of 40.2), and the hardest is Dutch to Finnish (10.3).
This shows that having much more Finnish data does not help raise output quality because the linguistic differences are much more significant. The Romance languages outperform many other language combinations with significantly less data.
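For readers unfamiliar with the metric: BLEU scores MT output by measuring n-gram overlap with a reference translation. Below is a minimal, single-reference Python sketch of the idea, not the exact implementation used in the studies above (real BLEU is computed at corpus level and usually with smoothing).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    against a single reference, times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return 100 * brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
print(simple_bleu("a cat was on a mat", "the cat sat on the mat"))      # near zero
```

The key point for the chart above is that BLEU rewards exact surface overlap with the reference, so language pairs whose correct outputs vary more freely in word choice and word order tend to score lower even when the translations are acceptable.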
Why Russian is especially difficult
Russian has always been considered to be one of the most difficult languages in MT, mostly because it is very different linguistically from English. Early NMT attempts were unable to outperform old RBMT models, and SMT models, in general, were rarely able to consistently beat the best RBMT models.
Russian differs from English significantly in inflection, morphology, word order and gender associations with nouns.
Inflection
Unlike English, Russian is a highly inflected language. Suffixes on nouns mark 6 distinct cases, which determine the role of the noun in the sentence (whether it's the subject, the direct object, the indirect object, something being possessed, something used as an instrument, or the object of a preposition). For example, all of these are different forms of the word "book":

| case | singular | plural |
|---|---|---|
| nominative | книга (kniga) | книги (knigi) |
| genitive | книги (knigi) | книг (knig) |
| dative | книге (knige) | книгам (knigam) |
| accusative | книгу (knigu) | книги (knigi) |
| instrumental | книгой, книгою (knigoj, knigoju) | книгами (knigami) |
| prepositional | книге (knige) | книгах (knigax) |
(from Wiktionary https://en.wiktionary.org/wiki/%D0%BA%D0%BD%D0%B8%D0%B3%D0%B0#Russian)
That's 12 forms of the same word, which are used depending on what role the word is playing in the sentence. But they're not all distinct; you can have the same form for different roles, like the singular genitive & the plural nominative.
Additionally, like Spanish or French, every noun has a gender. The word for "book" is feminine, but this is an arbitrary categorization; there's no reason why a book (книга kníga) is feminine while a table (стол stól) is masculine. But it matters because the case suffixes differ by gender (masculine, feminine, or neuter). So while there are 12 forms of the word "book" and 12 forms of the word "table", they don't share the same set of suffixes. When adjectives modify nouns, they need to agree with the noun, taking the same (or similar) suffix.
Also, like Spanish or French, verbs conjugate depending on tense (past vs. non-past), person (I vs. you vs. he/she/it), number (singular vs. plural), etc. So one verb may have several different forms, as well.
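To see why this matters for data-driven MT, here is a toy Python sketch (the forms follow the table above; the stem split is a naive illustration, not a real morphological analyzer). A word-level model sees each inflected form as a completely separate token, so it needs to have observed every form in its training data, whereas the subword segmentation used in NMT can at least share the common stem across forms.

```python
# The twelve case/number slots of "книга" collapse to nine distinct surface forms,
# each of which a word-level model treats as an unrelated vocabulary item.
kniga_forms = {
    ("nominative", "sg"): "книга",    ("nominative", "pl"): "книги",
    ("genitive", "sg"): "книги",      ("genitive", "pl"): "книг",
    ("dative", "sg"): "книге",        ("dative", "pl"): "книгам",
    ("accusative", "sg"): "книгу",    ("accusative", "pl"): "книги",
    ("instrumental", "sg"): "книгой", ("instrumental", "pl"): "книгами",
    ("prepositional", "sg"): "книге", ("prepositional", "pl"): "книгах",
}

distinct = sorted(set(kniga_forms.values()))
print(len(kniga_forms), "grammatical slots ->", len(distinct), "surface forms")

# A naive stem/suffix split, mimicking what subword units (e.g. BPE) achieve:
# the shared stem "книг" carries the lexical meaning, the suffix the grammar.
stem = "книг"
print({form: (stem, form[len(stem):] or "-") for form in distinct})
```

English, by contrast, gets by with just "book" and "books", which is one concrete reason Russian needs more (or better segmented) training data to cover the same vocabulary.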
Word order
In English, we use word order to accomplish what the suffixes on nouns do in Russian. Because Russian has these case markings, its word order is much freer. For example, these are all acceptable ways of saying "I went to the shop":
Я пошёл в магазин. (ya poshol v magazin)
Я в магазин пошёл. (ya v magazin poshol)
Пошёл я в магазин. (poshol ya v magazin)
Пошёл в магазин я. (poshol v magazin ya)
В магазин я пошёл. (v magazin ya poshol)
В магазин пошёл я. (v magazin poshol ya)
я ya = I
пошёл poshol = went
в v = to
магазин magazin = shop
Essentially, all orderings are possible, except that the preposition "to" (в v) must precede the word for "shop" (магазин magazin). You can imagine that as sentences get longer, the number of possible orderings increases. There are some limits on this: some orders in this example are dispreferred and sound strange or archaic, and others are used only to emphasize where you're going or who is going. But there are certainly more ways of saying the same thing than in English, which is stricter in its word order.
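The combinatorics are easy to verify with a small Python sketch: enumerate all orderings of the four words and keep those in which "в" immediately precedes "магазин" (grammatical acceptability, emphasis, and register are not modeled here).

```python
from itertools import permutations

# The four words of "Я пошёл в магазин"; the only hard constraint applied is
# that the preposition "в" must come directly before its object "магазин".
words = ["я", "пошёл", "в", "магазин"]

def preposition_ok(order):
    return order.index("в") + 1 == order.index("магазин")

valid = [" ".join(p) for p in permutations(words) if preposition_ok(p)]
print(len(valid), "orderings out of 24 permutations")  # 6, matching the list above
for sentence in valid:
    print(sentence)
```

For the MT system this flexibility means the same English sentence has many plausible Russian realizations (and vice versa), so the model cannot rely on position alone to work out which word plays which role.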
Difficult languages, in general, demand more skill from the MT system developer. They are not advised for the Moses or OpenNMT hacker who wants to see how their data might perform with open-source magic, and most such naive practitioners will generally stay away from these languages.
There are special challenges for an MT system developer who builds Russian <> English MT systems, for example:
- MT needs to pay more attention to Russian word inflections than to the order of the words, to know where to put the word in the English translation
- MT needs to be flexible enough to translate a familiar Russian source sentence that appears in an unfamiliar word order