
Statistical Machine Translation using Moses

This nifty tool called Moses can be used to translate text from one language to another. In this article, we show how you can do it in a few simple steps.

Wednesday, April 01, 2009


Machine translation is a growing research area because it provides fast and meaningful translation of text and speech from one language to another. It can be rule based, where hand-written rules are applied to convert one language into another, or statistical. Statistical machine translation (SMT) uses statistical methods to learn how to translate from a parallel corpus, that is, a collection of sentences in one language paired with their translations in the other. Translation can be word based or phrase based. The success of a machine translation system depends on how well the words of one language are aligned with the words of the other. Because the translation model is learned from data, an SMT system can be trained for any language pair; the only requirement is a bilingual parallel corpus. Commonly used translation models are phrase based and factored, and the translation itself is typically found with a beam-search decoder.
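
To make the idea of word alignment concrete, here is a minimal Python sketch of IBM Model 1, the simplest of the classic statistical alignment models, trained with a few rounds of expectation-maximization. It is an illustration only and not how Moses or its companion tools are implemented: the three sentence pairs are made-up toy data (Hindi in WX notation), the function names are our own, and refinements such as the NULL word are left out.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(h, e)] = P(Hindi word h | English word e) with EM.

    pairs is a list of (english_tokens, hindi_tokens) tuples.
    """
    hindi_vocab = {h for _, hindi in pairs for h in hindi}
    uniform = 1.0 / len(hindi_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of h aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for english, hindi in pairs:
            for h in hindi:
                # E-step: share one unit of h among the English words
                norm = sum(t[(h, e)] for e in english)
                for e in english:
                    share = t[(h, e)] / norm
                    count[(h, e)] += share
                    total[e] += share
        # M-step: renormalise the expected counts per English word
        t = defaultdict(float)
        for (h, e), c in count.items():
            t[(h, e)] = c / total[e]
    return t


def best_translation(table, english_word):
    """Return the Hindi word with the highest probability for english_word."""
    candidates = {h: p for (h, e), p in table.items() if e == english_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus, invented for this illustration.
toy_pairs = [
    ("a house".split(), "eka Gara".split()),
    ("a book".split(), "eka kiwAba".split()),
    ("small house".split(), "CotA Gara".split()),
]

table = train_ibm_model1(toy_pairs)
for word in ["a", "house", "book", "small"]:
    print(word, "->", best_translation(table, word))
```

Even with three sentence pairs, the co-occurrence counts are enough for the sketch to separate the words: 'house' appears with 'Gara' twice, so the probability mass shifts there and the remaining words fall into place.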

Direct Hit!

Applies To: Language Translators, Computational Linguists, NLP Researchers
Price: Free (GNU LGPL)
USP: Create your language translator
Primary Link: www.statmt.org/moses
Google Keywords: Statistical Machine Translation, Moses

Various machine translation tools are available, such as Apertium (GNU license), OpenLogos (the open source version of the Logos Machine Translation System), SYSTRAN (one of the oldest machine translation companies) and Moses (GNU Lesser General Public License).

Moses is a phrase-based machine translation tool for converting text from one language to another. Technical details are available on the Moses website, www.statmt.org/moses. In this article, we give a short step-by-step process for translating text with Moses, using English to Hindi as the example.

Parallel Corpus
Prepare a parallel corpus for the source (English) and target (Hindi) languages, which will be used for training the translation model. The corpus can be prepared from your existing translated data or obtained from the Internet, either free of cost or for a price; for example, a free version of the EMILLE corpus is available for research purposes. Smaller parallel corpora are also required for tuning and for testing the model.
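
One way to carve the tuning and test sets out of a single parallel corpus is sketched below in Python. It is a minimal illustration, not part of Moses: the function name, the set sizes and the placeholder sentences are all assumptions made for this example. The important point is that each English sentence stays paired with its Hindi translation when the data is shuffled and split.

```python
import random


def split_parallel(src_lines, tgt_lines, n_tune, n_test, seed=0):
    """Split aligned sentence lists into train, tune and test sets.

    Sentences are shuffled as pairs so that line i of the source always
    stays with line i of the target.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_tune + n_test >= len(src_lines):
        raise ValueError("not enough sentence pairs left for training")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split repeatable
    tune = pairs[:n_tune]
    test = pairs[n_tune:n_tune + n_test]
    train = pairs[n_tune + n_test:]
    return train, tune, test


# Placeholder data standing in for real English and Hindi sentences.
english = ["en %d" % i for i in range(10)]
hindi = ["hi %d" % i for i in range(10)]

train, tune, test = split_parallel(english, hindi, n_tune=2, n_test=2)
print(len(train), len(tune), len(test))
```

In practice the lines would be read from the two corpus files and each split written back out as a pair of files, one per language.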

Data preparation
The parallel corpus is converted to a format that is suitable for GIZA++, an open source word alignment tool that implements the IBM models. Before training Moses, the following software should be downloaded:

  • SRILM [ http://tinyurl.com/dx8m5m ] - A toolkit developed by SRI International (formerly the Stanford Research Institute) for building statistical language models.
  • GIZA++ [ http://tinyurl.com/cdem45 ] or [ http://giza-pp.googlecode.com/ ] - Developed by Franz Josef Och, this tool implements several alignment models, including the HMM model, and performs word alignment.
  • MKCLS [ http://tinyurl.com/c83mpx ] or [ http://www.fjoch.com/mkcls.html ] - Also developed by Franz Josef Och, this tool trains the word classes used in the SMT model. Building MKCLS and GIZA++ requires a recent GNU compiler.
  • Moses [ http://sourceforge.net/projects/mosesdecoder ]
  • Additional scripts [ http://tinyurl.com/cp8xz7 ] - Additional scripts for Moses training and tuning.
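
As a rough picture of what a statistical language model estimates, the following Python sketch trains a toy bigram model and scores sentences with it. It is only an illustration under simplifying assumptions: it uses add-one smoothing rather than the discounting methods a real toolkit such as SRILM provides, the two training sentences are made up, and the function names are our own.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised text."""
    contexts = Counter()
    bigrams = Counter()
    vocab = {"</s>", "<unk>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    # Words never seen in training are mapped to the <unk> token.
    words = [w if w in vocab else "<unk>" for w in sentence.split()]
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = contexts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Made-up training text for the illustration.
model = train_bigram(["this is a small house", "this is a small book"])

fluent = log_prob("this is a small house", model)
scrambled = log_prob("house small a is this", model)
print(fluent > scrambled)
```

The decoder uses this kind of score to prefer fluent output: word orders that were seen in the training text receive a higher probability than scrambled ones.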

For this article we are keeping /usr/home/PCQ/demo as the root directory for installation. The steps given below are relative to this root directory.

Getting started
Create a directory 'srilm' in the root directory, move the downloaded SRILM tar file into it, extract it and run make.
Then move the GIZA++ tar file to the root directory and extract it. This creates a directory GIZA++-v2. Run make inside GIZA++-v2, and then run make again from the same directory with the argument snt2cooc.out. This produces the binaries GIZA++ and snt2cooc.out. Create a directory 'bin' in the root directory and copy these two files into it.

Now move the mkcls tar file to the root directory and extract it. This creates the mkcls-v2 directory, in which you should run make. This produces the mkcls binary, which should also be copied to 'bin'.

Create a directory named 'moses' under the root, copy the Moses tar file into it and extract it. Now change to the moses directory and execute regenerate-makefiles.sh. Then run the configure script as ./configure --with-srilm=/usr/home/PCQ/demo/srilm, followed by 'make -j 4'.

Now go to the 'bin' directory under the root and create a 'moses-scripts' directory in it. Then go to the moses/scripts directory under the root and run 'make release'. Finally, move the scripts tar file (the additional scripts downloaded earlier) to the root directory and extract it. This completes the setup of Moses.

Training the Translator
Now we can process our parallel corpus. Create the directories working-dir/corpus in the root directory and copy the English and Hindi corpora into it. Filter out long sentences and lowercase the English data. The Hindi data does not need to be lowercased because it is in WX notation, in which upper and lower case letters stand for different sounds. Next, create a directory 'lm' inside 'working-dir' to build the language model. The English data is lowercased again for this step; these two steps create the English language model data. (The decoder scores its output with a model of the target language, so an English model applies when English is the target; for English-to-Hindi translation the model is built in the same way from the Hindi side, without lowercasing.) Now build the language model using SRILM. Once the language model is ready, train the translation model. After training, the system can also be tuned for better performance, although tuning is not mandatory.
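
The filtering and lowercasing step can be sketched in a few lines of Python. This is an illustration of the idea rather than the script shipped with Moses: the function name is ours, and the limit of 40 tokens is an assumed default, not a value prescribed in this article. English is lowercased, while the WX-encoded Hindi is left untouched.

```python
def clean_parallel(pairs, max_len=40, min_len=1):
    """Drop sentence pairs that are empty or too long; lowercase English only.

    pairs is a list of (english, hindi) strings. The max_len default is an
    assumption made for this sketch.
    """
    cleaned = []
    for english, hindi in pairs:
        en_tokens = english.split()
        hi_tokens = hindi.split()
        if not (min_len <= len(en_tokens) <= max_len):
            continue
        if not (min_len <= len(hi_tokens) <= max_len):
            continue
        # Case is significant in WX notation, so Hindi is not lowercased.
        cleaned.append((" ".join(en_tokens).lower(), " ".join(hi_tokens)))
    return cleaned


corpus = [
    ("This is a small House.", "yaha eka CotA AvAsagqha Hai."),
    ("", "yaha"),                        # empty English side, dropped
    (" ".join(["word"] * 50), "yaha"),   # longer than max_len, dropped
]

cleaned = clean_parallel(corpus)
print(cleaned)
```

Dropping a pair removes both sides at once, so the English and Hindi files stay aligned line by line after cleaning.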

To translate, give an English sentence as input to the decoder:

    echo 'This is a small House.' | /usr/home/PCQ/demo/moses/moses-cmd/src/moses -f moses.ini > out.txt

The translated Hindi sentence can be viewed with 'cat out.txt', which will show: yaha eka CotA AvAsagqha Hai.
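
If translations are needed from a script, the same command can be wrapped in a small Python helper. This is a sketch that assumes the decoder binary and moses.ini exist at the paths used in this article; the function names are our own, and only the -f option shown above is passed to the decoder.

```python
import subprocess

# Decoder path as used in this article; adjust it to your installation.
MOSES_BIN = "/usr/home/PCQ/demo/moses/moses-cmd/src/moses"


def build_command(ini_path="moses.ini", moses_bin=MOSES_BIN):
    """Return the argument list for the decoder."""
    return [moses_bin, "-f", ini_path]


def translate(sentence, ini_path="moses.ini", moses_bin=MOSES_BIN):
    """Pipe one sentence into the decoder and return what it prints."""
    result = subprocess.run(
        build_command(ini_path, moses_bin),
        input=sentence + "\n",   # the decoder reads sentences from stdin
        capture_output=True,
        text=True,
        check=True,              # raise if the decoder exits with an error
    )
    return result.stdout.strip()
```

Calling translate('This is a small House.') from the directory that holds moses.ini would then return the decoder output as a string instead of writing it to out.txt.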

Conclusion
Building the translation model from a large corpus requires a lot of memory, at least 2 GB. In addition, some steps in the process can take from a few minutes to a few hours, depending on processing power, memory and the size of the training corpus. The model described here is a baseline model, and research is ongoing to improve translation results. In general, the larger the training corpus, the better the translation quality of the trained model.

Nirav Shah & Sumit Goswami, IIT Kharagpur
