Textual Data Analysis with NLTK

In this book, you have seen various analysis techniques and numerous examples that worked on data in numerical or tabular form, which is easily processed through mathematical expressions and statistical techniques. But most data is composed of text, which follows grammatical rules (or sometimes not even that :)) that differ from language to language. In text, the words and the meanings attributable to them (as well as the emotions they convey) can be a very useful source of information.

In this chapter, you will learn about some text analysis techniques using the NLTK (Natural Language Toolkit) library, which lets you perform otherwise complex operations. The topics covered will also help you understand this important part of data analysis.

Text Analysis Techniques

In recent years, with the advent of Big Data and the immense amount of textual data coming from the Internet, many text analysis techniques have been developed out of necessity. This form of data can be very difficult to analyze, but at the same time it is a rich source of useful information, especially given how much of it is available. Just think of all the literature ever produced and the countless posts published on the Internet. Comments on social networks and chats can also be a great source of data, especially for gauging the degree of approval or disapproval of a particular topic.

© Fabio Nelli 2018 F. Nelli, Python Data Analytics, https://doi.org/10.1007/978-1-4842-3913-1_13

Analyzing these texts has therefore become a subject of enormous interest, and many techniques have been introduced for this purpose, forming a discipline in its own right. Some of the more important techniques are the following:

  • Analysis of the frequency distribution of words
  • Pattern recognition
  • Tagging
  • Analysis of links and associations
  • Sentiment analysis
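
As a rough illustration of the first technique in the list, the sketch below counts word frequencies using only Python's standard library. The function name word_frequencies, the regular expression used to split the text into words, and the sample sentence are all assumptions made for this example. NLTK provides its own FreqDist class for this kind of count, which is not used here.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs in the text (hypothetical helper)."""
    # Lowercase the text so that "The" and "the" are counted together,
    # then keep only runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A made-up sample sentence, used only for illustration.
sample = "The cat sat on the mat. The mat was flat."
freq = word_frequencies(sample)

# Show the two most frequent words with their counts.
print(freq.most_common(2))  # [('the', 3), ('mat', 2)]
```

This simple split ignores details that a real tokenizer handles, such as hyphenated words, numbers, and punctuation attached to words.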

The Natural Language Toolkit (NLTK)

If you program in Python and want to analyze data in text form, one of the most commonly used tools at the moment is the Python Natural Language Toolkit (NLTK). NLTK is a Python library (https://www.nltk.org/) that contains many tools specialized in processing and analyzing text data. NLTK was created in 2001 for educational purposes, and over time it has grown into a full-fledged analysis tool.

Within the NLTK library, there is also a large collection of sample texts, called corpora. This collection is taken largely from literature and is very useful as a basis for applying the techniques developed with the NLTK library. In particular, it is used to perform tests (a role similar to that of the MNIST dataset in TensorFlow, which is discussed in Chapter 9).

Installing NLTK on your computer is a very simple operation. Because it is a very popular Python library, you simply need to install it using pip or conda. On Linux systems, use this:

    pip install nltk
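
After installation, a quick check like the following sketch can confirm that Python is able to find the library. The helper names nltk_is_installed and fetch_corpus are hypothetical, as is the choice of the "gutenberg" corpus in the comment. The sketch assumes that the sample corpora are fetched separately with nltk.download, which needs a network connection, so that step is only defined here and not run.

```python
import importlib.util


def nltk_is_installed() -> bool:
    """Return True if the nltk package can be found (hypothetical helper)."""
    return importlib.util.find_spec("nltk") is not None


def fetch_corpus(name: str) -> bool:
    """Download one NLTK corpus by name (hypothetical helper).

    Assumes nltk is installed and a network connection is available.
    """
    import nltk

    return nltk.download(name)


if nltk_is_installed():
    import nltk

    print("NLTK version:", nltk.__version__)
    # To fetch a sample corpus, you could then call, for example:
    # fetch_corpus("gutenberg")
else:
    print("NLTK not found; install it with: pip install nltk")
```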