
What is Natural Language Processing?

Natural Language Processing with Python

What is Natural Language Processing? It’s a term that you may have heard in connection with computing and Artificial Intelligence (AI).

Natural Language Processing is the area of computer science that’s dedicated to creating machines, computer systems, and applications that can interpret and understand text or speech input in natural human language and provide output or responses using the same medium.

The Natural Language Processing definition you’ll find at dictionary.com puts it like this:

 

“The application of machine learning algorithms to the analysis, understanding, and manipulation of written or spoken examples of human language. Abbreviation: NLP.”

 

So NLP (Natural Language Processing) is the sub-branch of Artificial Intelligence that uses a combination of linguistics, computer science, statistical analysis, and Machine Learning (ML) to give systems the ability to understand text and spoken words in natural language, in much the same way as human beings can.

A more detailed way to define Natural Language Processing is to describe it as a discipline that combines computational linguistics (the rule-based modeling of human language) with statistical modeling, machine learning, and deep learning models.

 

NLP Basics

If you’re beginning Natural Language Processing, it’s easier to start with the written word.

This avoids the added complexity of transcribing speech into text or generating natural human voices.

Natural Language Processing was initially developed with two distinct objectives: understanding human language input and generating human language output.

Natural Language Understanding, or NLU, may be considered the passive, analytical side of Natural Language Processing.

It has its structural basis in text or speech analysis and manifests through text and speech classification.

Natural Language Generation or NLG is the more active mode of NLP.

In practice, conversational systems capable of providing human language responses to human input will alternate the functions of Natural Language Understanding and Natural Language Generation, as NLP algorithms analyze and comprehend a natural-language statement, then formulate an appropriate response.

 

First Steps in Developing NLP Projects

For an easy introduction to Natural Language Processing at a practical level, some knowledge of machine learning basics is essential.

However, by adopting a project-based approach, it’s possible to develop and train NLP models even without an intensive background in mathematics or theoretical computer science.

By gaining an introduction to Natural Language Processing at the project level, you can revisit your machine learning basics, gain a greater understanding of NLP applications (especially if the projects are based on actual use cases), and acquire new skills during the project implementation stage.

Some of the Natural Language Processing steps during project implementation (see the code sketch after this list) would include:

  • Mining unstructured data sources to extract text-based information.
  • Separating the text into units that a machine can understand.
  • Further refining these units through Natural Language Processing procedures like tokenization, stemming, word count frequency, etc.
  • Implementing a pre-trained learning model.
  • Deploying the training model as an Application Programming Interface (API).
  • Connecting the API to your main application.
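As a rough illustration, the text-preparation steps above might look like this in Python using the NLTK library (the library choice and the sample text are assumptions for demonstration, not prescriptions):

```python
# A minimal sketch of the preprocessing steps above, using NLTK
# (assumes `pip install nltk`; the sample text is illustrative).
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)  # one-off download of tokenizer data

raw_text = "Bakers were baking bread. The baked loaves sold quickly."

# Separate the text into units (tokens) that a machine can work with.
tokens = nltk.word_tokenize(raw_text.lower())

# Refine the tokens: keep alphabetic words and reduce each to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens if t.isalpha()]

# Word count frequency over the refined tokens.
print(nltk.FreqDist(stems).most_common(3))
```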

 

 

Practical Applications of NLP

What is NLP (Natural Language Processing) from a practical or user perspective?

There are numerous applications of NLP at the consumer and corporate levels, some of them so commonplace and familiar that we now take them for granted.

Some of the most common uses of Natural Language Processing occur at the level of our interactions with everyday computer software, mobile devices, and the internet.

For example, NLP is the basis for spam and email filtering and provides the mechanism that Gmail uses to classify incoming messages as Important or Promotions, or as suitable for your Primary inbox.

Other examples of Natural Language Processing in action include autocorrect, autocomplete, and the grammar and spell-checking of text or speech input into word-processing applications, text boxes, and internet search engines.

What can Natural Language Processing do at the corporate or organizational level? Well, one of the significant applications of NLP for businesses is the use of chatbots.

Conversational interfaces are now routinely deployed to provide answers to Frequently Asked Questions, field customer service or technical support queries, and as a direct line of communication between individual consumers and brands large or small.

 

History of Natural Language Processing

The history of Natural Language Processing spans from the early twentieth century to the present day, charting an evolutionary path from the earliest concepts of linguistic structure and computational science to today’s advanced applications and systems.

Among the key events in a brief history of Natural Language Processing are:

1916 – Course in General Linguistics
Based on the work and writings of Swiss linguistics professor Ferdinand de Saussure, collaborators Albert Sechehaye and Charles Bally publish “Cours de Linguistique Générale” (“Course in General Linguistics”), a book laying the foundations for the structuralist approach to linguistics and computational science.

1950 – Alan Turing
Alan Turing documents his concept of the “thinking machine,” proposing that if a machine could take part in a conversation via a teleprinter, and if it imitated a human being so completely that there were no discernible differences, then the machine could be considered intelligent, or capable of thinking.

1952 – The Hodgkin-Huxley model
The Hodgkin-Huxley model demonstrates how the human brain uses neurons to form an electrical network.


1957 – Noam Chomsky
Noam Chomsky publishes “Syntactic Structures,” a book laying out the conceptual basis for Phrase-Structure Grammar, which could methodically translate natural language sentences into a format usable by computers.

1958 – John McCarthy
John McCarthy introduces the programming language LISP (short for LISt Processing), which is still in use today.

1964 – US National Research Council
ELIZA, a “typewritten” comment and response system, is introduced. By rearranging sentences and following relatively simple grammar rules, the program could emulate the responses of a psychotherapist. In the same year, the US National Research Council (NRC) created the Automatic Language Processing Advisory Committee (ALPAC), a body charged with assessing Natural Language Processing research progress.

1966 – NRC and ALPAC
The NRC and ALPAC halt funding for NLP research, apparently signaling the death of Natural Language Processing.

The 1970s – Fresh Ideas
Fresh ideas are brought into NLP, such as building conceptual ontologies that structure real-world information into a form that computers can understand.

The 1980s – Problems in NLP are Addressed
Many significant problems in Natural Language Processing are addressed using symbolic approaches such as breaking text into tokens and assigning meanings to these tokens and their mutual relationships.

The 1990s – Paving the Way
The first advanced statistical models replace most Natural Language Processing systems based on complex sets of hand-written rules. This paves the way for systems built on automatic learning.

1997 – RNN Models
The neural era of Natural Language Processing begins as long short-term memory (LSTM) recurrent neural network (RNN) models are introduced.

2001 – Yoshua Bengio
Yoshua Bengio and his team propose the first neural “language” model, using a feed-forward neural network.

2008 – Collobert and Weston
Collobert and Weston apply multi-task learning, a sub-field of machine learning where multiple learning tasks are solved simultaneously, to neural networks for NLP.

2011 – Siri
Apple’s Siri hits the market as one of the world’s first successful NLP / AI assistants for general consumers.

2013 – Word2Vec
Mikolov et al. introduce Word2Vec, one of the most popular word embedding models. The year also sees widespread adoption of recurrent neural networks (RNNs), convolutional neural networks (CNNs), and recursive neural networks in NLP.

2014 – Sutskever et al.
Sutskever et al. propose sequence-to-sequence learning, a general end-to-end approach for mapping one sequence to another via a neural network.

2015 – Enabling Neural Machine Translation
Bahdanau et al. introduce the principle of attention, the key concept enabling neural machine translation (NMT) models to outperform classic sentence-based Machine Translation systems.

2018 to Present – Pre-trained Language Models
Large pre-trained language models can learn word representations from large unannotated bodies of text, enabling efficient learning with significantly less data.

 

 

Trends

Much of the innovation currently taking place in the Artificial Intelligence arena is in the field of natural language technologies and processes.

The prevailing trend in this area over the past decade has been a shift from rules-based models to training models based on machine learning.

Some of the current trends in Natural Language Processing or NLP include the following:

Hybrid Deployment Models

One of the prevailing trends in NLP is the deployment of neural networks which use smaller quantities of training data alongside conventional rules-based models.

This enables more accurate text analysis and facilitates conversational AI, sentiment analysis, and various other applications.

This hybrid approach is advantageous in situations where a large body of reliable training data for Natural Language Processing is not available.

Model makers can start with a rules-based dynamic, then later switch to using learned models.

 

Using Natural Language Generation (NLG) to Demystify Data Science

Natural Language Generation or NLG uses text analytics and Natural Language Processing techniques to first understand written or spoken text input and then produce natural language responses to what’s been said.

Much of what’s going on “under the hood” may be incomprehensible to the average business user in the often complex fields of machine learning and data science.

NLG makes it possible to design systems that can explain what’s happening in simple terms, making the concepts and mechanics more accessible to anyone who isn’t a data scientist or specialist in the system or application concerned.

 

New Approaches to Training Data

Extensive and accurate bodies of training data are essential for extracting the maximum benefit from Natural Language Processing that relies on machine learning and deep neural networks.

There’s currently a lack of sufficient training data, and several methods are being developed to overcome this problem.

Rather than relying on a considerable volume of data, most depend on refining the available resources using domain-specific information.

For example, BERT (Bidirectional Encoder Representations from Transformers) relies on mechanisms called transformers that pre-train language models by reading text in both directions at once, rather than simply from left to right.

This results in a greater understanding of the context and meaning of the text and minimizes the quantity of data required.
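As a hedged illustration of that bidirectional behavior, BERT’s masked-word prediction can be probed with the Hugging Face `transformers` library (an assumed toolchain; the article itself names only BERT):

```python
# Probing BERT's bidirectional masked-word prediction with Hugging Face
# `transformers` (assumes `pip install transformers`; the model downloads
# on first run).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the context on BOTH sides of [MASK] to rank candidate words.
for prediction in unmasker("The chef [MASK] the bread in the oven."):
    print(prediction["token_str"], round(prediction["score"], 3))
```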

 

Limitations

Natural Language Processing (NLP) uses a combination of linguistics, computer science, and statistical analysis to transform everyday spoken or written language into something that can be processed, understood, and acted on by machines.

Human language is notoriously complex, dynamic, and quirky, posing challenges to NLP that have yet to be fully overcome.

Some of the limitations of Natural Language Processing include:

Difficulties With Context And Shades Of Meaning

Often, the same words or phrases can have different meanings depending on the context of a sentence.

In addition, many words (“hear,” “here”) have the same pronunciation but different meanings.

And a language may contain several different words (synonyms) that all have the same meaning.

To construct an NLP system capable of handling all these kinds of permutations, modelers must include all of a word’s possible meanings and all possible synonyms.

 

Problems With Sarcasm And Irony

Sarcastic or ironic statements typically include words or phrases that say one thing, but in the context of the statement, actually mean the exact opposite.

Although NLP models can be trained with common triggers that indicate sarcasm, it’s a complex process.

 

Interpreting Ambiguous Phrases and References

A single word can serve as a verb, noun, or adjective in different contexts.

Whole sentences may also have different meanings when viewed in a different context.

Part of Speech or PoS tagging is one NLP technique that can assist designers in overcoming this problem, but it’s far from perfect.
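As a small sketch, here is PoS tagging with NLTK (an assumed library choice), showing the same word used as a verb and as a noun:

```python
# PoS tagging with NLTK (assumes `pip install nltk`). The tagger should
# distinguish the verb and noun uses of "book", though no tagger is perfect.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Book a table for two.", "I read a good book."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
```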

 

 

Confusion Due to Misuse or Misspelling

Misused, mispronounced, or misspelled words can wreak havoc on accurate text analysis and disrupt the capabilities of autocorrect mechanisms.

These issues may be reduced to some extent as Natural Language Processing databases grow and as individual users train their AI and voice assistant systems over time.

 

Slang and Colloquialisms

As we’ve observed, language is dynamic, with new slang terms, abbreviations, and buzzwords being added all the time.

Interpreting these phrases can present problems for NLP at the text analytics level, while keeping up with the changes occurring in a language or dialect can produce issues at the data and training levels.

 

Industry or Domain-Specific Languages

Certain disciplines, such as medicine and law, have their own unique vocabularies and sub-languages, which Natural Language Processing systems designed for generic text find difficult to handle.

This often makes it necessary to design and train analysis tools for a specific domain or industry language.

 

Uncommon or Low-Resource Languages

There are many languages in the world that have relatively few native speakers or don’t have extensive resources on the web to provide training data.

For these tongues, Natural Language Processing may not be practical.

However, new NLP techniques such as multilingual transformers and multilingual sentence embedding are beginning to address this issue by identifying and exploiting the universal similarities between languages.

 

Tools

By analyzing and converting spoken or written text into a form that machines can understand and act upon, Natural Language Processing tools help process unstructured information from numerous sources.

They have applications in text and sentiment analysis, subject classification, and user-level applications like spell-checkers, autocorrect mechanisms, search engines, virtual assistants, and chatbots with conversational interfaces.

Much of the Natural Language Processing software for commercial use is deployed as SaaS, or Software as a Service: cloud-based solutions that users can implement with little or no code.

SaaS platforms often offer pre-trained Natural Language Processing models for “plug and play” operation, or Application Programming Interfaces (APIs), for those who wish to simplify their NLP deployment in a flexible manner that requires little coding.

For example, Aylien is a SaaS API, which uses deep learning and NLP to analyze large volumes of text-based data, such as social media commentary, academic publications, and real-time content from news outlets.

And the Google Cloud Natural Language API provides several pre-trained models for sentiment analysis, content classification, and other functions.

For individuals and developers seeking Natural Language Processing software, open source is often the easiest way to go.

For example, spaCy is an open-source Python library for Natural Language Processing that supports large data volumes and includes many pre-trained NLP models.

It focuses on ease of use, typically offering a single recommended option for each particular task rather than a long menu of alternatives.
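A minimal spaCy session might look like the following (assuming `pip install spacy` and the small English model, installed via `python -m spacy download en_core_web_sm`):

```python
# A minimal spaCy pipeline run: tokenization, PoS tagging, and lemmatization
# happen in a single call to the loaded model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing powers modern search engines.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)
```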

The Python programming language is used extensively in Natural Language Processing.

Natural Language Processing with PyTorch harnesses the power of a deep learning library with rich NLP capabilities.

As with other NLP libraries, PyTorch Natural Language Processing begins with loading the required libraries and data sets, setting up a model architecture, defining a training function, building an NLP model, then testing its accuracy.
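A compressed, illustrative sketch of that workflow follows; the toy data, dimensions, and model architecture are assumptions for demonstration, not a production recipe:

```python
# A toy PyTorch text classifier: load data, define a model, train, then test.
import torch
import torch.nn as nn

# 1. "Data": pretend each document is already encoded as 10 token IDs.
X = torch.randint(0, 100, (32, 10))   # 32 documents, vocabulary of 100
y = torch.randint(0, 2, (32,))        # binary labels

# 2. Model architecture: embedding -> mean pooling -> linear classifier.
class TinyTextClassifier(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        return self.fc(self.embed(token_ids).mean(dim=1))  # pool over tokens

model = TinyTextClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# 3. Training loop.
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# 4. Test accuracy (on the training data here, purely for illustration).
accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"accuracy: {accuracy:.2f}")
```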

 

Techniques

To make unstructured text data comprehensible to machines, there are several Natural Language Processing techniques that NLP designers must routinely perform. They include:

Tokenization

This involves splitting content into “tokens”, which are individual terms or sentences that make it easier for an NLP system to work with the data.

 

Lemmatization and Stemming

These are preprocessing techniques used in cleaning up NLP text data and preparing a data set. In lemmatization, a given word is converted into its “root” dictionary form or “lemma.”

Stemming reduces a word to its immediate root, so, for example, “baking” becomes “bake.”
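To make the contrast concrete, here is a brief NLTK-based sketch (an assumed library choice) comparing the stemmer’s chopped roots with the lemmatizer’s dictionary forms:

```python
# Stemming vs. lemmatization with NLTK (assumes `pip install nltk`).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database for the lemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["baking", "studies", "was"]:
    # The stemmer chops a suffix; the lemmatizer returns a dictionary form.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```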

 

 

Word Clouds

A word cloud or tag cloud is a data visualization technique that represents words from a body of text in a chart.

The more frequent or important words display in a larger font, with font size decreasing for less significant words.
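For instance, the open-source `wordcloud` package for Python can produce such a chart in a few lines (the package choice and sample text are illustrative assumptions):

```python
# Rendering a word cloud (assumes `pip install wordcloud`).
from wordcloud import WordCloud

text = "language processing language models text data text language"
cloud = WordCloud(width=400, height=200, background_color="white").generate(text)
cloud.to_file("word_cloud.png")  # more frequent words render larger
```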

 

Keyword Extraction

This technique uses keyword extraction algorithms to extract the most important words or phrases from a body of text or a collection of text passages.

The TextRank algorithm, for example, works on the same principle as the PageRank algorithm that Google uses to assign importance to different web pages.
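As a heavily simplified sketch of the TextRank idea, one can build a word co-occurrence graph and run PageRank over it using `networkx` (the library, the window size, and the tiny corpus are assumptions; the published algorithm adds refinements omitted here):

```python
# A toy TextRank-style keyword ranker over a word co-occurrence graph.
import networkx as nx

words = ("natural language processing helps machines understand "
         "natural language text").split()

graph = nx.Graph()
for w1, w2 in zip(words, words[1:]):   # adjacent words co-occur (window of 2)
    if w1 != w2:
        graph.add_edge(w1, w2)

# Words that co-occur with many well-connected words rank highest.
scores = nx.pagerank(graph)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 3))
```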

 

Named Entity Recognition

Named Entity Recognition is an integral part of Natural Language Processing methodology, which is used to identify entities in unstructured text data and assign them to a list of pre-defined categories such as “persons,” “organizations,” or “dates.”
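Here is a minimal example using spaCy’s pre-trained English model (an assumed tool; any NER-capable library would illustrate the same idea):

```python
# Named Entity Recognition with spaCy (assumes the en_core_web_sm model).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple released Siri in 2011, a year after Google opened a London office.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. Apple -> ORG, 2011 -> DATE
```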

 

Topic Modeling

Given a collection of disparate documents, topic modeling is an NLP technique that uses algorithms to identify patterns of words and phrases that can assist in clustering the documents and grouping them by topics.
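A compact sketch using scikit-learn’s Latent Dirichlet Allocation (LDA) implementation follows; the four-document corpus and the choice of two topics are illustrative assumptions:

```python
# Topic modeling with LDA (assumes `pip install scikit-learn`).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The striker scored and the team won the match.",
    "The election results gave the party a majority.",
    "Fans cheered as the team lifted the trophy.",
    "Voters turned out in record numbers for the election.",
]

counts = CountVectorizer(stop_words="english").fit(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.transform(docs))

# Show the top words that characterize each discovered topic.
terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in topic.argsort()[-4:]])
```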

 

Sentiment Analysis

In Natural Language Processing, sentiment analysis is used to establish whether a piece of text or commentary is positive, negative, or neutral in tone.

It has applications in social media monitoring, Customer Relationship Management, and the analysis of reviews.

Sentiment analysis and Natural Language Processing are, therefore, useful tools for commercial enterprises and data analysts.

For Natural Language Processing sentiment analysis, Python is commonly used, as the programming language provides several resources and libraries with pre-built data sets and algorithms.

A finer-grained form of sentiment analysis captures favorability for specific subjects: it extracts the positive or negative sentiment associated with particular subjects within a document, instead of classifying the entire document as positive or negative.

Applying this kind of sentiment analysis in Natural Language Processing makes it possible to identify sentiments in web pages and news articles with great precision.
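For example, NLTK ships with the VADER sentiment analyzer; this short sketch (the tool choice is an assumption) scores two reviews:

```python
# Sentiment scoring with NLTK's VADER analyzer (assumes `pip install nltk`).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
for review in ["The product is excellent!", "Terrible service, never again."]:
    scores = analyzer.polarity_scores(review)
    print(review, "->", scores["compound"])  # >0 positive, <0 negative, ~0 neutral
```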

 

News

Online, there are several sites dedicated to Natural Language Processing news. Some recent headlines include:

“Toward a machine learning model that can reason about everyday actions.”

“Hey, Alexa! Sorry, I fooled you …”

 

Applications

Among the many Natural Language Processing applications in everyday use are autocorrect, grammar and spell-checking, machine translation, and speech recognition.

Natural Language Processing applications in the consumer realm include spam and email filtering, voice and AI assistants like Siri, and conversational chatbots.

Chatbots are also among the major business applications of Natural Language Processing.

Together with sentiment analysis, chatbots are helping brands and organizations to interact with consumers and deliver better customer service.

There’s an evolving breed of Natural Language Processing security applications, such as software that can perform Malicious Language Processing to identify malware code and phishing text.

 

Characteristics

Among the several characteristics of Natural Language Processing are:

Syntax: the arrangement of words in a sentence to make grammatical sense. NLP uses syntax to assess meaning based on grammatical rules.

Semantics: applying algorithms to understand the meaning and structure of sentences.

Parsing: the grammatical analysis of a sentence. In Natural Language Processing, this involves breaking the sentence into parts of speech such as nouns, verbs, and adverbs.

Word segmentation: taking a string of text and deriving word forms from it.

Stemming: reducing words to their root form, which is useful when analyzing a piece of text for all instances of a particular word.

Morphological segmentation: dividing words into smaller parts called morphemes. In NLP, this has particular applications in machine translation and speech recognition.

Sentence breaking: creating boundaries between the sentences of large bodies of text.
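Two of these characteristics, sentence breaking and word segmentation, can be demonstrated in a few lines of NLTK (an assumed library choice):

```python
# Sentence breaking, then word segmentation, with NLTK.
import nltk

nltk.download("punkt", quiet=True)

text = "NLP is growing fast. It powers chatbots, search, and translation."
sentences = nltk.sent_tokenize(text)      # sentence breaking
print(sentences)
print(nltk.word_tokenize(sentences[0]))   # word segmentation of sentence one
```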

 

 

Why Natural Language Processing Is Difficult

Learning a new language, or even learning to communicate effectively in your own, can be a tough challenge for humans.

This in itself can explain why Natural Language Processing is difficult.

Natural Language Processing or NLP is the science of teaching and developing machines capable of extracting language information from unstructured data sources, analyzing, interpreting, and understanding that language, then using this understanding to help solve particular problems or perform specific tasks.

One challenge to performing NLP is the sheer size and complexity of the lexicon or word base of a language.

The vocabulary of an average English speaker typically consists of around 20,000 words, which is roughly one-tenth of the more than 200,000 entries in the Oxford English Dictionary.

So designing for NLP requires massive databases of words. Then there’s the complexity of grammar.

Sentence construction, context, ambiguity, colloquialism, synonyms, antonyms, and irony all contribute to the challenge of designing NLP systems capable of taking all these nuances into account.

 

Deep Learning

Deep learning is a form of machine learning based on neural networks.

Deep learning for Natural Language Processing opens up a number of possibilities, including recognizing patterns in text data, inferring meaning from the context of words and phrases, and determining the emotional tone of text passages.

Applications of deep learning in the NLP space are helping to facilitate and improve the performance of web searches, social media feeds, and interactions with voice assistants.

 

Text Mining

In Natural Language Processing, text mining is the process of examining large collections of written data to discover relevant information and convert that information into data that can be used for further analysis.

Natural Language Processing and text mining are a logical fit, as NLP is a component of text mining that performs the linguistic analysis needed to help machines interpret and understand text.

 

People Also Asked

 

How long does it take to learn Natural Language Processing?

Natural Language Processing (NLP) is a discipline that largely relies on research and constant learning.

It’s also a field that incorporates several sub-disciplines or modules, so there’s no definitive timeframe for learning every aspect of every part of the discipline.

Having said this, it’s possible to gain a fundamental grounding in NLP within a period of around three to six months.

This would typically involve a study of core disciplines including Linguistics (parts of speech, the structure of language, etc.), Statistical Analysis (word counts, extracting meaningful words from text, etc.), Language Models or Ontologies of various kinds, and core NLP techniques like Tokenization and Named Entity Recognition.

On the Natural Language Processing programming front, learning a widely used language such as Python and gaining familiarity with NLP libraries and resources like NLTK could take several months to a couple of years, depending on your level of proficiency.

 

What is the goal of Natural Language Processing?

Natural Language Processing or NLP involves the analysis and modeling of speech and text data with the aim of developing machines and applications capable of interacting with humans using standard or natural language as the communications medium.

NLP systems are designed to accept input from humans in natural language and to provide output or responsive action on the same basis.

These goals rely on a number of different approaches.

A symbolic approach to NLP is based on a framework of generally accepted rules of speech within a given language.

A statistical approach to NLP bases system design on the mathematical analysis of large bodies of text data to recognize and isolate recurring themes.

A combination of statistical and symbolic approaches is used in a connectionist approach to NLP.

Commonly accepted linguistic rules are taken in conjunction with statistical analysis and observations to tailor NLP systems for specific applications.

 

Where is Natural Language Processing used?

Natural Language Processing (NLP) is used in grammar and spell checking, a standard feature of software that relies on text or speech input.

NLP-powered tools can check spelling and grammar on the fly, suggesting synonyms and alternative phrases to improve text clarity. The tools also have applications in language learning.

Chatbots for consumer interaction with brands and customer service are now a standard feature of corporate websites, portals, and social media platforms, and they often use Natural Language Processing for more realistic communication with users.

These automated systems can operate 24/7/365, reducing staff burdens on an organization.

Sentiment analysis is an NLP technique used in monitoring user or customer commentary and feedback on various platforms for the positive, negative, or neutral views expressed in the text.

It’s extensively used in social media monitoring and Customer Relationship Management (CRM) applications.

Other uses of NLP include autocomplete and autocorrect for text entry and machine language translation.

 

Which language is best for Natural Language Processing?

Natural Language Processing (NLP) can, in theory, be adapted to any human language or dialect.

Though NLP systems often find it easier to analyze linguistic structures in English and other languages that use white space between words and sentences, applications can be designed for languages such as Mandarin Chinese that don’t feature white space.

On the NLP development front, Python is the most widely used programming language.

It provides a wide range of tools and libraries for handling specific NLP tasks, such as the Natural Language Toolkit (NLTK), an open-source collection of libraries, programs, and educational resources for constructing NLP programs.
