RNN based machine translation and transliteration for Twitter data
M. K. Vathsala¹ · Holi Ganga²

Received: 13 February 2020 / Accepted: 2 June 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

The present work analyzes social media data containing code-switching and transliterates and translates it into English using a special kind of recurrent neural network (RNN), the Long Short-Term Memory (LSTM) network. TensorFlow is used to express the LSTM model. Twitter data is stored in MongoDB to enable easy handling and processing. The data is parsed field by field with a Python script and cleaned using regular expressions. The LSTM model is trained on 1 M data samples and is then used for transliteration and translation of the Twitter data. Translating and transliterating social media data makes the content available in a language understood by the majority of the population; with this, any content that is anti-social or a threat to law and order can be readily identified and blocked at the source.

Keywords Long short-term memory (LSTM) · Recurrent neural network (RNN) · Sequence-to-sequence · Python · Translation · Transliteration · Twitter · Machine translation (MT) · BLEU · TensorFlow
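As a minimal sketch of the data-handling steps described in the abstract, the Python snippet below retrieves tweets from a MongoDB collection and cleans them with regular expressions. The database, collection, and field names (`tweets_db`, `raw_tweets`, `text`) and the specific cleaning rules are illustrative assumptions, not the authors' actual schema or preprocessing.

```python
# Sketch of the tweet retrieval and regex-cleaning step.
# Database/collection/field names below are hypothetical placeholders.
import re
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["tweets_db"]["raw_tweets"]   # assumed names

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#")
NON_TEXT_RE = re.compile(r"[^\w\s']", flags=re.UNICODE)

def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, hashtag symbols and stray punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub("", text)
    text = NON_TEXT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

# Pull only the tweet text field and clean each document.
cleaned = [clean_tweet(doc["text"]) for doc in collection.find({}, {"text": 1})]
```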
1 Introduction

Machine Translation (MT) has evolved over five decades, and developers have closely followed each stage of this evolution. The most primitive approach is Statistical Machine Translation (SMT), which uses predictive algorithms to teach a system to translate text: existing translated text is used to translate the input into the required language. The major drawback of this approach is that it requires bilingual material for the model to learn from, which also hampers its ability to handle obscure languages. The Neural Machine Translation (NMT) approach addresses the major drawbacks of SMT and yields more accurate translations. NMT is based on deep learning techniques; aided by existing statistical models, the input data is distributed among the layers of the network, enabling a faster response.
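For concreteness, the following is a minimal sketch of the kind of LSTM-based sequence-to-sequence (encoder-decoder) model that underlies the NMT approach discussed here, written with the TensorFlow/Keras API. The vocabulary sizes, embedding size, and hidden dimension are placeholder values, not the configuration reported in this paper.

```python
# Minimal LSTM encoder-decoder (sequence-to-sequence) sketch in TensorFlow/
# Keras. All sizes below are assumed placeholder values.
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab = 20000, 20000   # assumed vocabulary sizes
embed_dim, hidden_dim = 256, 512      # assumed model dimensions

# Encoder: embeds the source sequence and keeps the final LSTM states.
enc_inputs = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence conditioned on the encoder states.
dec_inputs = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_out, _, _ = layers.LSTM(hidden_dim, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
logits = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```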
* M. K. Vathsala
  [email protected]

1 Dept. of ISE, MSRIT, Bengaluru, VTU, Belagavi, Karnataka 560054, India

2 Dept. of ISE, Global Academy of Technology, Bengaluru, VTU, Belagavi 560098, India
Based on the definitions of these two approaches, it is clear that NMT can handle intricate computations better than the conventional statistical model. Despite the abundant work on NMT, its extension to handling social media content has received little attention. The data posted on social networking sites has been the source of major cyber-attacks. Twitter statistics indicate a whopping 313 million tweets posted monthly. The volume of data is very large, which makes it difficult to analyse the information posted in different regional languages. A tweet posted in a regional