Neural machine translation of chemical nomenclature between English and Chinese


Journal of Cheminformatics Open Access

RESEARCH ARTICLE

Neural machine translation of chemical nomenclature between English and Chinese

Tingjun Xu*, Weiming Chen, Junhong Zhou, Jingfang Dai, Yingyong Li and Yingli Zhao

*Correspondence: [email protected] Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 LingLing Road, Shanghai 200032, China

Abstract

Machine translation of chemical nomenclature has considerable application prospects in processing chemical text data across languages. However, rule-based machine translation tools face significant complications in building rule sets, especially for translating chemical names between English and Chinese, the two most used languages of chemical nomenclature in the world. We applied two types of neural networks to the task of translating chemical nomenclature between English and Chinese, and compared them with an existing rule-based machine translation tool. The results show that deep learning based approaches have a good chance of surpassing rule-based translation tools in machine translation of chemical nomenclature between English and Chinese.

Introduction

Chemical names are primitive representations of chemicals, widely used by chemists in research articles, patents, and data materials to describe chemical substances. Names that conform to the chemical nomenclatures of IUPAC and CAS are exact expressions of molecular structures [1, 2]; therefore, those names can be used as identifiers of substances in chemical databases, and can easily be recognized by machine for conversion into conventional chemical structure representations [3, 4]. English and Chinese are the two most used languages of chemical nomenclature in the world, according to the number of results Google returns when searching for a chemical name in different languages [5]. However, the linguistic differences between English and Chinese chemical names have limited exchanges between users on both sides [6]. Therefore, machine translation of chemical nomenclature would be more applicable than manual translation in chemical data processing. For example, data sets of compound names derived by chemical named entity recognition systems from Chinese text-mining materials would be more valuable, because bulk machine translation of Chinese chemical names into English makes it possible to convert the names into connection tables of chemical structures, given that the vast majority of “name to structure” tools support only English nomenclature [7–9].

Unfortunately, much work remains to be done in machine translation of chemical nomenclature beyond existing research [10–12], especially in translation of chemical names between English and Chinese [13, 14]. Analyzing Chinese chemical names is significantly complicated, which makes it difficult to set up a sophisticated machine translation rule set for converting the various Chinese chemical names to or from English [13, 14]. For example, the Chinese chemical name of