
Friday, 25 May 2012

The Translator Machine



Ranjan Kumar Goswami


This article describes a new approach to machine translation that translates English text into Bangla text with disambiguation. The translated Bengali text, written in English script, is also useful for learning Bengali (Bangla) as a foreign language. At the same time, rural Bengali people who do not know English well can understand English material through the translated output. The proposed approach combines rule-based and transformation-based machine translation schemes with three levels of parsing. This is a significant contribution towards the creation of a low-cost Human Language Technology (HLT). About two hundred million people in West Bengal and Tripura (two states in India) and in Bangladesh speak and write Bangla as their first language. The English to Bangla (E2B)-ANUBAD translator system, or E2B, takes a paragraph of English sentences as input and produces equivalent Bangla sentences. The E2B-ANUBAD system comprises a preprocessor, a morphological parser, a semantic parser that uses an English word ontology for context disambiguation, an electronic lexicon associated with grammatical information, and a discourse processor. It also employs a lexical disambiguation analyzer. The system does not rely on a stochastic approach. Rather, it is based on a special kind of hybrid of transformer and rule-based Natural Language Engineering (NLE) architectures, along with various linguistic knowledge components of both English and Bangla.
Introduction:
The Bangla language is characterized by a rich system of inflections (Vibhakti), derivation, and compound formation [5,6,7], which is why Natural Language Engineering (NLE) that generates Bangla output is a very challenging task.
NLE is the process of computer analysis of input provided in a human (natural) language and the conversion of this input into a useful form of representation. The input of an NLE system can be written text or speech. This paper is concerned with written text only. In order to process written text, we need (a) lexical, (b) syntactic, and (c) semantic knowledge about the language, and (d) discourse information along with real-world knowledge.
The purpose of lexical processing is to determine the meanings of individual words. Syntactic analysis deals with syntactic structure. Semantic analysis deals with the context-independent meaning representation, whereas discourse processing deals with the final meaning representation.
The term ontology simply denotes a group of "concepts" organized to reflect the relationships between them. Building the ontology is a basic task of the lexicographer. Each word forms a class in which more than one entity can be included. Suppose there are words like biscuits, pizza, and cake. All these words can be put under a single category, namely Food (edible items). This type of categorization can be performed through the is-a-kind-of relation. Such information is useful for context disambiguation. The E2B-BANGANUBAD system also employs such ontological analysis.
The proposed translator (E2B-ANUBAD) uses (i) the grammar of the input or source language, (ii) a source-to-target language dictionary, (iii) a set of source-to-target language rules, and (iv) an exception handler.
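The is-a-kind-of relation described above can be sketched as a small lookup table. This is only an illustration in Python, not the system's actual data structure (the system itself was built with VB 6.0 and MS Access); the table `IS_A_KIND_OF`, the helper names, and every entry beyond the biscuit, pizza, and cake examples are assumptions.

```python
# Hypothetical toy ontology: each concept maps to its parent category
# through the is-a-kind-of relation. The entries are illustrative only.
IS_A_KIND_OF = {
    "biscuit": "food",
    "pizza": "food",
    "cake": "food",
    "dog": "animal",
    "food": "entity",
    "animal": "entity",
}


def ancestors(concept):
    """Return every category reachable from `concept` via is-a-kind-of."""
    chain = []
    current = concept
    while current in IS_A_KIND_OF:
        current = IS_A_KIND_OF[current]
        chain.append(current)
    return chain


def is_a_kind_of(concept, category):
    """True if `concept` falls under `category` in the toy ontology."""
    return category in ancestors(concept)
```

With this table, `is_a_kind_of("pizza", "food")` holds while `is_a_kind_of("dog", "food")` does not, which is the kind of distinction a context disambiguator can use.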
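How these four components might fit together can be sketched on a single three-word sentence. The Python below is a minimal illustration under strong assumptions: the one-pattern grammar, the function names, and the pass-through treatment of unknown words are all hypothetical and are not taken from the actual system, which uses about 300 rules.

```python
# (ii) Toy source-to-target dictionary (illustrative entries only).
DICTIONARY = {"we": "Aamra", "drink": "Khai", "water": "Jol"}


# (i) Source grammar: one hypothetical pattern, subject-verb-object.
def parse_svo(tokens):
    if len(tokens) != 3:
        raise ValueError("only three-word subject-verb-object input is supported")
    subject, verb, obj = tokens
    return {"subject": subject, "verb": verb, "object": obj}


# (iii) Source-to-target rule: English SVO order becomes Bangla SOV order.
def transfer_svo_to_sov(parse):
    return [parse["subject"], parse["object"], parse["verb"]]


# (iv) Exception handler: a word missing from the dictionary is passed
# through unchanged (an assumption made for this sketch).
def lookup(word):
    return DICTIONARY.get(word.lower(), word)


def translate(sentence):
    tokens = sentence.rstrip(".").split()
    ordered = transfer_svo_to_sov(parse_svo(tokens))
    return " ".join(lookup(word) for word in ordered)
```

Under these assumptions, `translate("We drink water")` yields "Aamra Jol Khai", with the verb moved to the end of the sentence.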
The E2B Translator System:
This English to Bangla (E2B) translator is based on a special architecture that combines rule-based and transformer architectures. It currently uses 300 rules, and more are being developed. It is also augmented with a linguistic knowledge architecture. The system is enriched with a morphological parser, a semantic parser with an ontological analyzer, disambiguation processing, and a discourse analyzer [1,2,3,4,8,9,10]. The system has been developed using VB 6.0 and MS Access 2000. To begin with, the lexicon comprises only 2500 English words. The E2B translator system's user interface for input and output is shown below. This interface shows a paragraph of English (input) sentences in the upper text box and the translated paragraph of Bangla (output) sentences in the lower text box.
The system is capable of handling a word that is not present in the lexicon. It is also capable of lexical disambiguation, that is, handling a word with multiple part-of-speech (POS) tags or multiple meanings. For example, the English word "Light" has multiple POS tags, namely verb, adjective, and noun. Light (verb) means Jwalao in Bangla. Light (adjective) means Halka. Light (noun) means Baati. The E2B-BANGANUBAD system is capable of context disambiguation as well.
For example, for the English input sentence "I had a Pizza," the E2B-BANGANUBAD output is "Aami Ekta Pizza Kheyechhilam" (in Bangla). For the input sentence "I had a dog," the system's output is "Aamar Ekti Kukur Chhilo" (in Bangla). The word "had" has two different meanings depending on context. As an example of POS disambiguation, the system's output is "Aamra Jol Khai" for the English input sentence "We drink water" (water as a noun). For the input sentence "Water the tree" (water as a verb), the system's output is "Gaachh Tite Jol Dao". For the input sentence "This is a water tank" (water as an adjective), the system's output is "Eti Joler Tank".
This system does not use any pre-tagged English corpus because it does not follow a stochastic approach. It does not use a Hidden Markov Model (HMM) either. E2B-BANGANUBAD uses only its built-in POS tagger.
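The "Light" example can be pictured as a lexicon keyed by both the word and its POS tag. The Python below is a minimal sketch only; the `LEXICON` layout and the function names are assumptions, not the schema of the system's MS Access lexicon.

```python
# Hypothetical lexicon keyed by (English word, POS tag). The three entries
# come from the "Light" example; the layout itself is an assumption.
LEXICON = {
    ("light", "verb"): "Jwalao",
    ("light", "adjective"): "Halka",
    ("light", "noun"): "Baati",
}


def pos_tags(word):
    """Return, in sorted order, all POS tags recorded for `word`."""
    return sorted(pos for (entry, pos) in LEXICON if entry == word.lower())


def translate_word(word, pos):
    """Return the Bangla equivalent for `word` under `pos`, or None if absent."""
    return LEXICON.get((word.lower(), pos))
```

Once some other component has decided the POS tag, the lookup is unambiguous: `translate_word("Light", "noun")` gives "Baati", while `translate_word("Light", "verb")` gives "Jwalao".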
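The two readings of "had" can be separated by asking which ontological category the object belongs to. The Python below is a hedged sketch of that idea for the sentence pattern "I had a X" only; the tables `CATEGORY` and `BANGLA_NOUN`, the function names, and the decision rule are assumptions rather than the system's actual rules.

```python
# Hypothetical ontology categories and noun translations for the two examples.
CATEGORY = {"pizza": "food", "dog": "animal"}
BANGLA_NOUN = {"pizza": "Pizza", "dog": "Kukur"}


def sense_of_had(obj):
    """Choose the 'eat' sense when the object is food, else 'possess'."""
    return "eat" if CATEGORY.get(obj.lower()) == "food" else "possess"


def translate_i_had_a(obj):
    """Translate 'I had a <obj>' using the sense selected above."""
    noun = BANGLA_NOUN.get(obj.lower(), obj)
    if sense_of_had(obj) == "eat":
        return f"Aami Ekta {noun} Kheyechhilam"
    return f"Aamar Ekti {noun} Chhilo"
```

With these tables, `translate_i_had_a("Pizza")` gives "Aami Ekta Pizza Kheyechhilam" and `translate_i_had_a("dog")` gives "Aamar Ekti Kukur Chhilo".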
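A tagger that works without a tagged corpus has to rely on hand-written rules. The Python below sketches what such rules might look like for the single word "water"; the two positional rules and the `KNOWN_NOUNS` set are invented for illustration and do not describe the system's built-in tagger.

```python
# Hypothetical set of nouns that "water" may modify in this sketch.
KNOWN_NOUNS = {"tank", "tree"}


def tag_water(tokens):
    """Tag the first occurrence of 'water' with two hand-written rules."""
    words = [token.lower() for token in tokens]
    position = words.index("water")
    # Rule 1: sentence-initial "water" is read as an imperative verb.
    if position == 0:
        return "verb"
    # Rule 2: "water" directly before a known noun is read as a modifier.
    if position + 1 < len(words) and words[position + 1] in KNOWN_NOUNS:
        return "adjective"
    # Default: treat "water" as a noun.
    return "noun"
```

Applied to the tokens of "We drink water", "Water the tree", and "This is a water tank", these rules return noun, verb, and adjective respectively.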
Conclusion:
This system is capable of handling the most challenging "disambiguation" aspects of NLE through semantic net analysis. The E2B-BANGANUBAD translator system is not strictly based on any conventional rule-based, stochastic, or transformation-based NLE. It is based on a special kind of hybrid architecture of rule-based and transformer systems, along with an integrated parser for both morphological and semantic analysis. The system gives exactly one translated output sentence for each English sentence. Much attention has been given to making this complex NLE translator generate a deterministic output sentence for each input (source) sentence. Both lexical and context disambiguation processing have been incorporated and work satisfactorily. The system also incorporates various linguistic components such as Bangla inflections (Vibhakti), derivation, Karaka (endings), and compound formation. The system is easily upgradable with new grammatical rules and lexicons. Work on enhancing this translator is ongoing. This is a low-cost, domain-independent translator system that aims to produce reliable output with high performance and accuracy, and it is dedicated to helping rural Bengali people understand English text. This is a significant step toward the creation of an affordable HLT.
Acknowledgement: The author is thankful to Dr. A.B. Saha, Executive Director, CDAC, Kolkata, for his encouragement.
References:
  1. Akshar Bharati, et al., "Natural Language Processing," PHI, 2000.
  2. Terry Winograd, "Understanding Natural Language," Academic Press, New York, 1972.
  3. D. Jurafsky and J.H. Martin, "Speech and Language Processing," Pearson Education, 2000.
  4. Ma Quing, "Natural Language Processing with Neural Networks," Language Engineering Conference 2002, Hyderabad.
  5. Goutam Kumar Saha, et al., "Computer Assisted Bangla POS Tagging," Proceedings of the International Symposium ISTRANS 2004, Tata McGraw-Hill, New Delhi, 2004.
  6. K.C. Dash (ed.), "Indian Semantics," Agamakala Publications, Delhi, 1994.
  7. Bamondeb Chakroborty, "Uchchotoro Bangla Byakaron," Akshay Malancha, 2003.
  8. Akshar Bharati, et al., "A Computational Grammar for Indian Languages Processing," Indian Linguistics Journal, 52, 91-103, 1991a.
  9. Goutam Kumar Saha, "BANGANUBAD - An English to Bangla Translator," in press, International Journal CPOL, 2005, USA.
  10. Goutam Kumar Saha, "Bangla Text Parsing with Intelligence," Proceedings of the International Conference MS'05, 2005, Morocco.
