On-Line Translation Tools Revisited

Notes on a talk given on Thursday April 21st 2011

by Chris Betterton-Jones

A PDF version of these notes can be viewed, downloaded and printed from my Google Docs.

Language - that very human attribute which separates us from animals! With language, we can share our experiences and learn from them. I can inform you how to cook eggs just by telling you; I can give you this knowledge in exchange for some eggs; I can negotiate how many eggs the cooking recipe is worth; and I can gossip about the deal with my friends to make sure I am not being cheated. Other animals do not have this skill.

There are between 3000 and 8000 different languages, which can be divided into language groups based largely on how closely together they have developed. English belongs to the "Indo-European" group of languages.

Note: English is not descended from Latin. It is a Germanic language with a lot of Latin vocabulary, borrowed from French in the Middle Ages.

Spanish, however, is descended from Latin, together with Portuguese, Catalan, Romanian, and French. Thus it is easier for humans (and machines) to translate from Spanish to French than from Spanish to English.

It is difficult for everyone to translate from an Indo-European language to one which has a completely different structure (such as Chinese).

Translation process
The translation process may be stated as:

  • Decoding the meaning of the source text; and
  • Re-encoding this meaning in the target language.

This is not so easy as it sounds! To decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language.

Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that "sounds" as if it has been written by a person?

Let us look at how human beings learn languages. This can provide insight into the different techniques used by computers to do translations, and why some translators are more accurate than others.

How humans learn languages:

How do we learn our mother tongue? By example, mimicry and usage.
This is a red apple. This is a green apple. This is a red ball. That is a blue ball.
Cor blimey!

How did you learn a foreign language at school?
Je suis, tu es, ….
Hablo, hablas, habla, hablamos, habláis, hablan…
i.e. by learning the language rules.
This is a completely different approach. Note: Languages evolve first. Then we make up the rules.

How did I try to learn Latin at school?
I was good at "set books" (66%) but bad at grammar (11%).
Set books - I learned the English text by heart and could recognise which Latin text corresponded with which English text - and make a good guess at what would come next. A statistical approach. Mindless… but effective!

Translation software uses all these techniques:
1. Rule-Based translation
The most famous is Systran, which was used by the EU. It powered AltaVista's Babel Fish and the early (until 2007) Google translation system.
Two main types of rule-based translation:
a) Transfer based: Language pairs
You look at both languages and define a set of rules which describe how their structures correspond.
You have to analyse:

  • Parts of speech: nouns, verbs, prepositions, number, gender
  • Ambiguous words/phrases: Eats, shoots and leaves (** See note below)
  • Dictionary translation
  • Re-order phrases and chunks
  • Make up the new sentence

Can be 90% accurate where the two languages are closely related.
La plume de ma tante / La pluma de mi tía / My aunt's pen ….or feather
(Note: Google / Bing: The pen of my aunt. Yahoo Babelfish: The feather of my aunt.)
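
The steps listed above can be sketched in a few lines of Python. This is only a toy, assuming a five-word French-English dictionary and one re-ordering rule that I have made up for the "plume de ma tante" example; the names LEXICON, apply_possessive_rule and translate are illustrative and have nothing to do with how Systran is actually built.

```python
# Toy French -> English transfer rules. The lexicon and the single
# re-ordering rule are illustrative assumptions, not a real system's data.
LEXICON = {
    "la": ("the", "DET"),
    "plume": ("pen", "NOUN"),  # could equally be "feather": the dictionary has to pick one
    "de": ("of", "PREP"),
    "ma": ("my", "POSS"),
    "tante": ("aunt", "NOUN"),
}


def translate_word_for_word(sentence):
    """Dictionary step: look up each word and its part of speech."""
    return [LEXICON[word] for word in sentence.lower().split()]


def apply_possessive_rule(tagged):
    """Re-ordering step: DET NOUN 'of' POSS NOUN -> POSS NOUN's NOUN."""
    tags = [tag for _, tag in tagged]
    if tags == ["DET", "NOUN", "PREP", "POSS", "NOUN"] and tagged[2][0] == "of":
        thing, owner_det, owner = tagged[1][0], tagged[3][0], tagged[4][0]
        return [owner_det, owner + "'s", thing]
    return [word for word, _ in tagged]


def translate(sentence, reorder=True):
    """Make up the new sentence, with or without the re-ordering rule."""
    tagged = translate_word_for_word(sentence)
    words = apply_possessive_rule(tagged) if reorder else [w for w, _ in tagged]
    return " ".join(words)


print(translate("La plume de ma tante", reorder=False))  # the pen of my aunt
print(translate("La plume de ma tante"))                 # my aunt's pen
```

Without the re-ordering rule you get the stilted "the pen of my aunt"; with it you get "my aunt's pen". Every extra rule has to be written by hand for each language pair, which is why this approach is so much work.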

(**A panda walks into a bar, sits down and orders a sandwich. He eats the sandwich, pulls out a gun and shoots the waiter dead. As the panda stands up to go, the bartender shouts, "Hey! Where are you going? You just shot my waiter and you didn't pay for your sandwich!"

The panda yells back at the bartender, "Hey, I'm a PANDA! Look it up!" The bartender opens his dictionary and sees the following definition for panda: "A tree dwelling marsupial of Asian origin, characterized by distinct black and white colouring. Eats shoots and leaves".)

b) Interlingual (oldest approach): No language pairs
Don't use language pairs, but use rules to translate into an abstract "super language" (the interlingua) and then translate from this into the target language.
Problem: It's extremely difficult to make a universal interlingua. Languages differ so much in the way they are put together, and so do cultures. Eskimos, who have lots of words for types of snow, are not so good at words for types of sand. Interlingua also takes a lot of computer brain power.
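
As a rough sketch of the interlingual idea, the Python below turns an English question into a language-neutral "frame" and then generates Japanese from the frame alone. The frame layout, the concept names (ASK_PRICE, RED and so on) and the function names are all assumptions of mine; the Japanese vocabulary is borrowed from the phrases used elsewhere in these notes.

```python
# English analyser: text -> language-neutral frame (the toy "interlingua").
EN_CONCEPTS = {"red": "RED", "small": "SMALL", "umbrella": "UMBRELLA", "camera": "CAMERA"}

# Japanese generator vocabulary: concept -> word.
JA_WORDS = {"RED": "akai", "SMALL": "chiisai", "UMBRELLA": "kasa", "CAMERA": "kamera"}


def analyse_english(sentence):
    """Handle only 'How much is that <attribute> <object>?' (a toy grammar)."""
    words = sentence.lower().rstrip("?").split()
    if len(words) != 6 or words[:4] != ["how", "much", "is", "that"]:
        raise ValueError("sentence not covered by this toy grammar")
    return {
        "act": "ASK_PRICE",
        "attribute": EN_CONCEPTS[words[4]],
        "object": EN_CONCEPTS[words[5]],
    }


def generate_japanese(frame):
    """Build the Japanese sentence from the frame, never from the English text."""
    if frame["act"] != "ASK_PRICE":
        raise ValueError("no generation rule for this act")
    return f"Ano {JA_WORDS[frame['attribute']]} {JA_WORDS[frame['object']]} wa ikura desu ka."


frame = analyse_english("How much is that small umbrella?")
print(frame)
print(generate_japanese(frame))  # Ano chiisai kasa wa ikura desu ka.
```

The attraction is that each new language needs only one analyser and one generator rather than a rule set for every pair. The catch is the frame itself: it has to cope with everything every language can say.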

2. Example Based translation

Use logic a bit like "The Egg Heads" quiz programme - you might not know the answer but can deduce the meaning by comparing pairs of sentences:

How's your Japanese?

How much is that red umbrella? Ano akai kasa wa ikura desu ka.
How much is that small camera? Ano chiisai kamera wa ikura desu ka.

From this pair of sentences you can figure out:

  • How much is that X ? corresponds to Ano X wa ikura desu ka.
  • red umbrella corresponds to akai kasa
  • small camera corresponds to chiisai kamera

This is not used on its own, but is useful in sorting out ambiguous phrases e.g. "put on"
e.g. I put on the lights. I put on a hat.
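
The umbrella and camera deduction can be done mechanically. The Python below is a minimal sketch, assuming the two sentences in each language differ in only one stretch of words and that punctuation has already been stripped; split_template and learn are names I have invented for the illustration.

```python
def split_template(a, b):
    """Split two sentences into shared prefix, the two differing middles, shared suffix."""
    a_words, b_words = a.split(), b.split()
    shortest = min(len(a_words), len(b_words))
    start = 0
    while start < shortest and a_words[start] == b_words[start]:
        start += 1
    end = 0
    while end < shortest - start and a_words[-1 - end] == b_words[-1 - end]:
        end += 1
    return (
        a_words[:start],
        a_words[start:len(a_words) - end],
        b_words[start:len(b_words) - end],
        a_words[len(a_words) - end:],
    )


def learn(pair_1, pair_2):
    """From two (source, target) examples, derive a template and a phrase table."""
    (src_1, tgt_1), (src_2, tgt_2) = pair_1, pair_2
    s_pre, s_mid_1, s_mid_2, s_suf = split_template(src_1, src_2)
    t_pre, t_mid_1, t_mid_2, t_suf = split_template(tgt_1, tgt_2)
    template = (" ".join(s_pre + ["X"] + s_suf), " ".join(t_pre + ["X"] + t_suf))
    phrases = {
        " ".join(s_mid_1): " ".join(t_mid_1),
        " ".join(s_mid_2): " ".join(t_mid_2),
    }
    return template, phrases


template, phrases = learn(
    ("How much is that red umbrella", "Ano akai kasa wa ikura desu ka"),
    ("How much is that small camera", "Ano chiisai kamera wa ikura desu ka"),
)
print(template)  # ('How much is that X', 'Ano X wa ikura desu ka')
print(phrases)   # {'red umbrella': 'akai kasa', 'small camera': 'chiisai kamera'}
```

Note that the sketch only learns whole phrases. To work out that "red" alone is "akai" it would need more examples, such as a red camera.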

3. Statistical translation
This is based on the method I used to translate my Latin set books. Nowadays it is by far the most widely-studied machine translation method. Basically they use huge numbers of bilingual texts and compare one language version with another. They use statistical tests to figure out what is the most likely meaning and translation. It's somewhat similar to the way word-processors try to guess the next word you are going to type, and how Speech Recognition programmes figure out the meaning of the words you speak so the programme can produce meaningful text. As in the case of Speech Recognition this technique is prone to fantastic bloopers!

This method is now used by Google, since Google's strength is loads and loads of documents (200 billion words of parallel translated docs from the UN). Before, Google used Systran. Statistical translation is not tailored to any specific pair of languages; it simply needs lots of parallel text to work with. Google therefore has 60 languages and Babelfish only 14.
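
To give a feel for "comparing one language version with another", here is a minimal Python sketch. It assumes a made-up parallel text of four Spanish/English phrases and uses a simple co-occurrence score (the Dice coefficient) to guess which English word goes with each Spanish word. Real statistical systems use far more elaborate models; this is not Google's method, and the names PARALLEL, train and best_translation are my own.

```python
from collections import Counter

# A tiny, made-up parallel text (Spanish, English).
PARALLEL = [
    ("la manzana roja", "the red apple"),
    ("la pelota roja", "the red ball"),
    ("la manzana verde", "the green apple"),
    ("la pelota azul", "the blue ball"),
]


def train(corpus):
    """Count how often each word appears, and how often word pairs appear together."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update((s, t) for s in src_words for t in tgt_words)
    return src_count, tgt_count, pair_count


def best_translation(word, model):
    """Pick the target word that most consistently co-occurs with the source word."""
    src_count, tgt_count, pair_count = model
    if word not in src_count:
        return word  # never seen in training: leave it untranslated

    def dice(target):
        return 2 * pair_count[(word, target)] / (src_count[word] + tgt_count[target])

    return max(tgt_count, key=dice)


def translate(sentence, model):
    return " ".join(best_translation(word, model) for word in sentence.split())


model = train(PARALLEL)
print(best_translation("manzana", model))   # apple
print(translate("la pelota verde", model))  # the ball green
```

The program was never told that "manzana" means "apple"; it worked it out from the counts. It also produces a small blooper, "the ball green", because this sketch knows nothing about English word order.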

4. Hybrid
Combines Rules based and Statistical techniques.
Two ways of doing this:

Rules tidied up by statistics:
Translations are performed using a rules based system. Statistics are then used in an attempt to adjust/correct the output.

Statistics guided by rules:
Rules are used at both ends. Firstly to pre-process the text in an attempt to guide the statistical system. Then again afterwards to tidy up.
This approach has a lot more power, flexibility and control when translating and is used by Microsoft's "Bing Translator".
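
Here is a minimal Python sketch of the first flavour, "rules tidied up by statistics". A rule-based dictionary step offers both "pen" and "feather" for "plume", and word-pair counts from some English text decide between them. The dictionary, the sample English text and the function names are all assumptions made for illustration, not how Bing or any real tool works.

```python
import itertools
from collections import Counter

# Rule-based side: a dictionary that admits "plume" is ambiguous.
AMBIGUOUS_LEXICON = {
    "la": ["the"],
    "plume": ["feather", "pen"],
    "de": ["of"],
    "ma": ["my"],
    "tante": ["aunt"],
}

# Statistical side: a scrap of made-up English text to count word pairs in.
ENGLISH_TEXT = "my aunt lost her pen . the pen of my uncle is red . a bird dropped a feather ."


def bigrams(words):
    """Adjacent word pairs, e.g. ['the', 'pen', 'of'] -> [('the', 'pen'), ('pen', 'of')]."""
    return list(zip(words, words[1:]))


BIGRAM_COUNTS = Counter(bigrams(ENGLISH_TEXT.split()))


def rule_based_candidates(sentence):
    """Every word-for-word translation the dictionary allows."""
    options = [AMBIGUOUS_LEXICON[word] for word in sentence.lower().split()]
    return [" ".join(choice) for choice in itertools.product(*options)]


def score(sentence):
    """How many of the sentence's word pairs have been seen in the English text."""
    return sum(BIGRAM_COUNTS[pair] for pair in bigrams(sentence.split()))


def pick_best(candidates):
    return max(candidates, key=score)


candidates = rule_based_candidates("La plume de ma tante")
print(candidates)             # ['the feather of my aunt', 'the pen of my aunt']
print(pick_best(candidates))  # the pen of my aunt
```

The rules do the translating; the statistics only break the tie. With a different English text (one full of birds, say) the same rules would come out in favour of the feather.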

Which is the Best?
So we have three major on-line Translation tools which work in different ways. Which is the "best"?
Try these tests with all three tools (Google, Babelfish and Bing - links at end of these notes)

1. Text nicked from a test for commercial Translation tools. Copy and paste the Spanish into a translation tool and see what comes out.

Spanish version of Little Red Riding Hood:

"Abuela, ¿por qué tienes los ojos tan grandes?" Caperucita Roja preguntó. "Para que yo pueda ver mejor," Dijo la abuela. "¡Oh, abuelita, ¿por qué tienes la boca tan grande?" "Para poder comerte mejor!" Entonces, la abuela salta de la cama.

Correct Translation (for reference)

"Grandma, why do you have such big eyes?" Little Red Riding Hood asked. "So that I can see better," the grandma said. "Oh, Grandma, why do you have such a big mouth?" "So I can eat you better!" Then, the grandma jumps out of the bed.

2. Translate the letter below with all three tools. Note Google's ability to offer alternate translations if you click on a translated word.

Dear Sir,
I am writing to complain about your company's poor service. Last month I ordered a new television and paid in advance. The television has not yet been delivered. Neither have I received any communication from you.
Yours sincerely,
John Smith

Each gives a slightly different result. How can we test which is "best"?

3. Look at the findings of:
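
One crude way to put a number on it, if you have a trusted human translation to compare against, is to count how many words the machine's output shares with the reference. The Python below is a minimal sketch of that idea (a word-overlap F1 score). The function names and the candidate sentence are made up for illustration and are not the output of any of the three tools; a score like this rewards matching words, not good style.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case words only; punctuation is ignored."""
    return re.findall(r"[a-z']+", text.lower())


def overlap_f1(candidate, reference):
    """Word-overlap score between 0.0 (nothing shared) and 1.0 (same words)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    matched = sum((cand & ref).values())  # words shared, respecting repeats
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Grandma, why do you have such big eyes?"
candidate = "Grandmother, why do you have eyes so big?"  # hypothetical machine output

print(overlap_f1(reference, reference))  # 1.0
print(overlap_f1(candidate, reference))  # 0.75
```

Paste each tool's output in as the candidate and the highest score wins, at least by this yardstick. It ignores word order completely, so treat it as a first impression rather than a verdict.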
"Comparison of online machine translation tools" by Ethan Shen - TC World June 2010


Summary of the comparison:

  • Google is strong in Spanish/English. This could be because many Latin American countries offer English translations of official documents.
  • Across almost every language Bing Translator and Yahoo Babelfish gain ground or surpass Google Translate as the text length gets shorter.
  • Translation quality is not a two way street. The engine that is best for translating in one direction is not necessarily the best tool to translate back the other way.
  • Brand bias. People tend to perceive Google as being "the best".

Translating entire web pages:
Google is tops, but is prone to getting things completely back to front. You still need to know something of the language to make a meaningful translation. Want to have a multilingual web site? Design the site so that it translates well using Google!

Useful links and further reading:

Machine translation

Example-based machine translation

Translation Tools:

Google Language Tools:

See also Google Toolbar - On-line translation of web pages for Firefox and IE (Chrome has the translation facility built in)

Bing: http://www.microsofttranslator.com/

Yahoo Babelfish: http://babelfish.yahoo.com/

Word Reference : http://www.wordreference.com - for dictionary, idioms and ambiguous usage.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License