Natural Language Processing with Python

You might often hear the phrase natural language processing with Python. Why? Natural Language Processing (NLP), a subfield of Artificial Intelligence (AI), concerns itself with facilitating interactions between humans and machines (computer systems and applications) using natural language.

NLP combines elements of linguistics, computer science, and data science. In practice, it involves developing systems capable of extracting language information from unstructured data sets, analyzing and interpreting language, and using the results to perform specific tasks or solve particular problems. This typically requires the development of software and related infrastructure, and in this context, you will often hear mention of natural language processing with Python.

What is Python?

Python is a high-level, general-purpose programming language whose code does not have to be compiled into machine code before it runs. For this reason, Python is known as an interpreted language, and it aims to produce clear, logical code for small and large-scale projects alike. The National Security Agency (NSA) employs Python for intelligence analysis and cryptography.

The Python programming ecosystem provides many resources and tutorials that support the deployment of Artificial Intelligence (AI) and Machine Learning (ML) systems. Though the language allows for the processing and handling of complex computational systems, the syntax of Python reads much like standard English. This makes Python relatively easy to learn. It also makes natural language processing using Python a logical match.

NLP Libraries and Tools

As we've observed, natural language processing is quite a complex field, so the last thing you want to be doing as an NLP programmer is racking your brains for the proper syntax and toolkits to perform various functions. Fortunately, much of the grunt work can be eliminated by taking advantage of the various NLP libraries and resources available online, including predefined data sets, routines, and natural language processing with Python source code.

There are many open-source Natural Language Processing libraries and frameworks, including Apache OpenNLP, the GATE NLP library, TensorFlow (often used for NLP through sequence-to-sequence, or Seq2seq, models), and the Stanford NLP suite.

NLP with Python – Basic Procedures

What is natural language processing with Python? The best way to illustrate is by running through some basic procedures using the Natural Language Toolkit (NLTK), which is written in Python and is the most popular library for natural language processing. It runs on Python 3 and has a large community that supports the project.

On Windows, macOS, or Linux systems, you can install NLTK using the “pip” command:

$ pip install nltk

To verify a successful installation, open a Python interpreter and type:

import nltk

Next, you'll need to download and install the various NLTK packages (corpora, models, and other data sets) that are relevant to your work. Type the following:

import nltk
nltk.download()

This will display the NLTK downloader and give you the option to choose which packages you wish to install.
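If you already know which packages you need, you can skip the graphical downloader and fetch them by name. This short sketch grabs the data sets used later in this article:

import nltk

nltk.download('punkt')          # tokenizer models
nltk.download('stopwords')      # stop word lists for various languages
nltk.download('vader_lexicon')  # lexicon used by the VADER sentiment analyzer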

In a typical Python and natural language processing scenario, you’ll need to take certain steps in order to clean up and categorize the text data that you’ll be working with. A standard set of procedures might include the following:

Extracting Data from HTML or XML Files

The Python library Beautiful Soup enables you to extract information from XML and HTML files and allows you to clean HTML tags out of web page text.

You first need to use the urllib module to fetch a specific web page:

import urllib.request

response = urllib.request.urlopen('https://WEBPAGE ADDRESS')
html = response.read()
print(html)

You can then call up Beautiful Soup to clean up the text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html5lib')
text = soup.get_text(strip=True)
print(text)

NLP with Python – Tokenization

Using text vectorization, NLP tools transform the text into a form that machines can understand. Training data is then fed to machine learning (ML) algorithms, along with logical output tags that enable the machines to make the correct associations between a particular input and its corresponding output.
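As a minimal illustration of text vectorization, the sketch below uses scikit-learn's CountVectorizer (the article itself doesn't name a specific vectorizer, so this is just one common choice) to turn each document into a vector of word counts:

from sklearn.feature_extraction.text import CountVectorizer

# two tiny example documents
corpus = ["NLP with Python is fun", "Python makes NLP easy"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one row of word counts per document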

Tokenization is used to break a string of words into semantically useful units called tokens. Sentence tokenization splits the sentences within a body of text, while word tokenization splits the words within a sentence. 
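Before tokenizing a whole web page, it can help to see NLTK's own tokenizers on a plain string. This minimal sketch assumes the 'punkt' models have been downloaded, as shown earlier:

import nltk

text = "NLP is fascinating. Python makes it approachable."
print(nltk.sent_tokenize(text))  # ['NLP is fascinating.', 'Python makes it approachable.']
print(nltk.word_tokenize(text))  # ['NLP', 'is', 'fascinating', '.', 'Python', 'makes', ...]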

A natural language processing with Python example using the Beautiful Soup library to tokenize the text on a web page might look like this:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
print(tokens)

Word Frequency Distribution

To establish how many times a particular word or token appears in a given piece of text, it’s necessary to calculate a frequency distribution. In the NLTK Python library, this function is performed by a class known as FreqDist. Its syntax looks like this:

from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://PAGE_ADDRESS')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

The function’s output displays the frequency distribution for the document, including the most frequent token.
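If you only care about the top of the distribution, FreqDist also provides a most_common() method; continuing from the freq object above:

# print the ten most frequent tokens and their counts
for token, count in freq.most_common(10):
    print(token, count)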

Removing Stop Words

In any given body of text, there’s usually a proliferation of common connective words like “a,” “of,” and “the.” These are known as stop words. Because they occur so often, they can hamper natural language processing analysis by cluttering up the data set without significantly adding to the underlying meaning of the text. 

The NLTK library provides a set of common stop words, which you can load into a list. The set consists of stop words in various languages, so when loading it, you'll need to specify the particular language that you're working in. For example:

stopwords = nltk.corpus.stopwords.words("english")

Note that all the words in the NLTK stop words list are lowercase, so you may need to apply str.lower() to account for any stop words that are capitalized in your own text. The syntax for stop word removal might look something like this:

sr = nltk.corpus.stopwords.words('english')
clean_tokens = [token for token in tokens if token.lower() not in sr]
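To see the effect of removing stop words, you can rebuild the frequency distribution from the cleaned token list, continuing the example above:

freq_clean = nltk.FreqDist(clean_tokens)
for token, count in freq_clean.most_common(10):
    print(token, count)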

NLP with Python – Performing Sentiment Analysis

Sentiment analysis in NLP is a technique for studying unstructured data streams to extract text or audio that expresses certain opinions. It’s extensively used in market research and social media monitoring for assessing the state of public opinion, reactions to the performance of various products or services, and so on.

For natural language processing sentiment analysis in Python, the NLTK library provides a pre-built sentiment analyzer in the form of VADER (the Valence Aware Dictionary and sEntiment Reasoner). Designed with social media monitoring in mind, VADER works best with short sentences containing slang and abbreviations.

To use VADER, you first create an instance of the tool nltk.sentiment.SentimentIntensityAnalyzer, then use .polarity_scores() on a raw string of data:

from nltk.sentiment import SentimentIntensityAnalyzer

# the analyzer's lexicon must be downloaded first: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Dude,this NLTK really works!")
{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

The output displays four scores: neg, neu, and pos, which are each between 0 and 1 and together add up to 1, plus a compound score that ranges from -1 (most negative) to +1 (most positive).
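A simple way to turn these scores into a label is to threshold the compound score. The ±0.05 cut-offs below follow a convention commonly used with VADER, though they are not the only possible choice:

def classify_sentiment(text):
    # label a string as positive, negative, or neutral based on VADER's compound score
    scores = sia.polarity_scores(text)
    if scores['compound'] >= 0.05:
        return 'positive'
    if scores['compound'] <= -0.05:
        return 'negative'
    return 'neutral'

print(classify_sentiment("Dude,this NLTK really works!"))  # 'positive', per the compound score shown above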

Some NLP Projects using Python

There are many NLP projects out there, covering a variety of use cases. Some of the natural language processing with Python projects worthy of note include the following.

The DeepMoji deep learning model has been trained on 1.2 billion tweets containing emojis and aims to draw inferences about how language is used to express emotions. It can be used for sentiment analysis, and the repository includes code snippets, training data, and tests for evaluating the code.

The NLP GitHub Project focuses on creating a bot that can learn from WhatsApp conversations exported from your phone and then hold a conversation that mimics your own speaking or writing patterns.

Automatic Summarization of Scientific Papers is another GitHub NLP project that uses natural language processing to create a supervised learning-based system that can summarize scientific papers. It's designed for people who regularly have to read extensive research documents and is a suitable learning project for beginner or intermediate Python programmers.

Also available on GitHub, the Speech Emotion Analyzer project aims to create a neural network model that detects emotions from the conversations we have in daily life. The model can detect up to five different emotions from male or female speakers.


Terry Brown

Terry is an experienced product management and marketing professional who has worked for technology-based companies for over 30 years, in industries including Telecoms, IT Service Management (ITSM), Managed Service Providers (MSP), Enterprise Security, Business Intelligence (BI), and Healthcare. He has extensive experience defining and driving marketing strategy to align with and support the sales process. He is also a fan of craft beer and Lotus cars.