Hadoop Framework for Entity Recognition Within High Velocity Streams Using Deep Learning


Abstract Social media such as Twitter and Facebook are sources of stream data. They generate unstructured formal text on various topics, containing emotions expressed about persons, organizations, locations, movies, etc. Such stream data is characterized by high velocity and volume, and is often incomplete, incorrect, cryptic, and noisy. In our earlier work, a Hadoop framework was proposed for recognizing and resolving entities within semi-structured data such as e-catalogs. This paper extends the framework to recognize and resolve entities in unstructured data such as tweets. Such a system can be used in data integration, de-duplication, event detection, and sentiment analysis. The proposed framework recognizes predefined entities in streams, using Natural Language Processing (NLP) to extract local context features and MapReduce for entity resolution. Test results showed that the proposed entity recognition system could identify predefined entities such as location, organization, and person with an accuracy of 72%.





Keywords Entity recognition · Natural language processing · Stream data · Entity resolution · Hadoop framework · Tweets · Supervised learning









S. Vasavi (✉) Department of CSE, VR Siddhartha Engineering College, Vijayawada, India e-mail: [email protected]
S. Prabhakar Benny University College of Engineering for Women, Kakatiya University, Warangal, India e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2018 S.C. Satapathy et al. (eds.), Data Engineering and Intelligent Computing, Advances in Intelligent Systems and Computing 542, DOI 10.1007/978-981-10-3223-3_23

1 Introduction Entity recognition involves identifying named entities in a given formal text and classifying each into one of a set of predefined entity types. This process helps in the Extract, Transform and Load (ETL) process of data integration. Applications such as automated question answering and entity de-duplication also require entity recognition. An entity represents a real-world object such as a person, location, organization, or other object. For instance, in the sentence “Jawaharlal Nehru is the first prime minister of India”, “Jawaharlal Nehru” and “India” are person and location entities, respectively. The literature provides approaches, methods, techniques, and tools for entity recognition. Natural language processing plays a vital role in recognizing and extracting entities. It performs various syntactic and semantic analyses on the text to recognize atomic elements (nouns, verbs, prepositions, quantifiers) of information such as person, organization, location, numeric value, date and time, and unit of measurement. The following five categories of Named Entity Recognition (NER) methods are used for automatic recognition and classification of named entities:

1. Rule-based NER
2. Machine learning-based NER
3. Hybrid NER
4. Statistical-based
5. Deep learning
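To make the first category concrete, the following is a minimal sketch of a rule-based recognizer applied to the example sentence from the introduction. It is not the framework proposed in this paper: the `GAZETTEER` dictionary, the `recognize_entities` function, and the whole-phrase matching rule are hypothetical names and choices introduced only for illustration, and a practical system would need much larger name lists and richer rules.

```python
import re

# Hypothetical hand-made name lists (assumption, not from the paper).
GAZETTEER = {
    "PERSON": {"jawaharlal nehru"},
    "LOCATION": {"india"},
}


def recognize_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            # Rule: a case-insensitive, whole-phrase match of a listed name.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), label))
    found.sort()  # order by position in the text
    return [(surface, label) for _, surface, label in found]


sentence = "Jawaharlal Nehru is the first prime minister of India"
entities = recognize_entities(sentence)
print(entities)
```

A lookup of this kind only finds names that are already listed, which is one reason noisy and cryptic text such as tweets is difficult for purely rule-based methods.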

In rule-based NER, names are extracted using a set of human-made rules. In machine learning-based NER, supervised and uns