Before looking into NLP, let's understand what natural language is. In simple terms, it is the language we use to express ourselves and a basic means of communication. More specifically, a language is a mutually agreed set of protocols involving words and sounds that we use to communicate with each other.
Natural Language Processing (NLP) is the field concerned with enabling computers to analyze and understand such language. For example, you probably have some emails in your mailbox that were automatically labeled as spam. This is done with the help of NLP: the email service analyzes the content of each email, understands its context, and then decides whether it should be marked as spam.
Text Analytics and NLP
Text analytics is the method of extracting meaningful insights and answering questions from text data. This text data need not be in a human language. Let’s understand this with an example. Suppose you have a text file that contains a log of your outgoing phone calls and SMS messages in the following format:
In the above figure, the first two fields represent the date and time at which the call was made or the SMS was sent. The third field represents the type of record: voice_call for a call and sms for an SMS. The fourth field holds the phone number and name of the contact. If the number is not in the contact list, the name is left blank. The last field depends on the record type: for a voice_call record it holds the duration of the call, and for an sms record it holds the text of the message.
The following figure shows records of call data stored in a text file:
The data shown in the preceding figure is not exactly human language, but it still contains information that can be extracted through analysis. A couple of questions that can be answered by looking at this data are as follows:
- How many New Year greetings were sent by SMS on 1st January?
- How many people were contacted whose name is not in the contact list?
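Both questions can be answered with a few lines of code. The following is only a minimal sketch: the comma-separated layout, the `number:name` form of the fourth field, the year, and the sample records are all assumptions made for illustration, since the actual layout is the one given in the figure. The names `Record`, `parse_log`, `count_new_year_sms`, and `count_unknown_contacts` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sample records, invented for illustration only.
# Assumed layout: date,time,type,number:name,duration-or-message
SAMPLE_LOG = """\
2019-01-01,00:05:10,sms,5550101:Alice,Happy New Year!
2019-01-01,00:07:42,sms,5550102:,happy new year to you
2019-01-01,09:15:00,voice_call,5550103:Bob,125
2019-01-02,18:30:21,sms,5550104:,Running late
2019-01-03,11:02:05,voice_call,5550102:,40
"""


@dataclass
class Record:
    date: str
    time: str
    kind: str     # "voice_call" or "sms"
    number: str
    name: str     # empty when the number is not in the contact list
    payload: str  # call duration or SMS text


def parse_log(text):
    """Turn the raw log text into a list of Record objects."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first four commas only, so commas inside an SMS survive.
        date, time, kind, contact, payload = line.split(",", 4)
        number, _, name = contact.partition(":")
        records.append(Record(date, time, kind, number, name, payload))
    return records


def count_new_year_sms(records, date="2019-01-01"):
    """Count SMS records sent on the given date that contain the greeting."""
    return sum(
        1
        for r in records
        if r.kind == "sms"
        and r.date == date
        and "happy new year" in r.payload.lower()
    )


def count_unknown_contacts(records):
    """Count distinct numbers that have no name in the contact list."""
    return len({r.number for r in records if not r.name})


records = parse_log(SAMPLE_LOG)
print(count_new_year_sms(records))      # 2
print(count_unknown_contacts(records))  # 2
```

Note that nothing here requires understanding human language: simple field splitting and string matching are enough, which is what makes this text analytics rather than NLP.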
The art of extracting useful insights from any given text data can be referred to as text analytics. NLP, on the other hand, is not just restricted to text data. Voice (speech) recognition and analysis also come under the domain of NLP. NLP can be broadly categorized into two types: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
- NLU: NLU is the part of NLP that covers reading. It interprets the input and extracts the intent (the action being requested) and the entities (the things the action applies to).
- NLG: NLG is the part of NLP that covers writing, in which the computer produces text in a human language.
For example, when a human speaks to a machine, the machine interprets the human language with the help of the NLU process. It then uses the NLG process to generate an appropriate response in a form that the human can easily understand.
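To make the split between the two concrete, here is a deliberately simple sketch. It uses keyword patterns for the NLU step and fixed templates for the NLG step, which is a stand-in for illustration and not how production systems work. The intent names, patterns, and reply templates are all hypothetical.

```python
import re

# Hypothetical intents and keyword patterns, chosen only for illustration.
INTENT_PATTERNS = {
    "get_weather": re.compile(r"\b(weather|forecast)\b", re.IGNORECASE),
    "set_alarm": re.compile(r"\b(alarm|wake me)\b", re.IGNORECASE),
}
CITY_PATTERN = re.compile(r"\bin ([A-Z][a-z]+)")
TIME_PATTERN = re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b", re.IGNORECASE)


def understand(utterance):
    """NLU step: map raw text to an intent plus entities."""
    intent = "unknown"
    for name, pattern in INTENT_PATTERNS.items():
        if pattern.search(utterance):
            intent = name
            break
    entities = {}
    city = CITY_PATTERN.search(utterance)
    if city:
        entities["city"] = city.group(1)
    time = TIME_PATTERN.search(utterance)
    if time:
        entities["time"] = time.group(1)
    return {"intent": intent, "entities": entities}


def generate(parsed):
    """NLG step: turn the structured result into a sentence using templates."""
    intent, entities = parsed["intent"], parsed["entities"]
    if intent == "get_weather" and "city" in entities:
        return f"Looking up the weather for {entities['city']}."
    if intent == "set_alarm" and "time" in entities:
        return f"Your alarm is set for {entities['time']}."
    return "Sorry, I did not understand that."


parsed = understand("What is the weather in Paris?")
print(parsed)
print(generate(parsed))
```

The structured dictionary returned by `understand` is the boundary between the two halves: everything before it is NLU, and everything after it is NLG.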
Kick Starting an NLP Project
Suppose you are working on a project in which you need to collect tweets and analyze their sentiments. We will explain how this is carried out by discussing each phase.
An NLP project can be divided into six major phases, which are followed sequentially.
1. Data Collection
This is the initial phase of any NLP project. Our purpose is to collect data as per our requirements. For this, we may either use existing data, collect data from various online repositories, or create our own dataset by crawling the web. In our case, we will collect tweets.
2. Data Preprocessing
Once the data is collected, we need to clean it, as dirty data tends to reduce the effectiveness and accuracy of the models built on it. In our case, we will remove unnecessary URLs, words, and other noise from the collected tweets.
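As a rough illustration of this phase, the sketch below cleans a single tweet with regular expressions from Python's standard library. The choice of what counts as noise (URLs, mentions, the hashtag symbol, punctuation, digits) and the tiny stop-word list are assumptions for the example. A real project would tune both to its data.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "it", "rt"}


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, symbols, and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")         # keep the hashtag word, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation, and emoji
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("RT @user: Loving the new phone!!! https://example.com #happy"))
# loving new phone happy
```

The order of the steps matters: URLs and mentions are removed before punctuation is stripped, because otherwise their fragments would survive as ordinary-looking words.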
3. Feature Extraction
Computers understand only binary digits: 0 and 1. Thus, every instruction we feed into a computer gets transformed into binary digits. Similarly, machine learning models generally accept only numeric input. As such, it becomes necessary to convert the text data into an equivalent numerical form.
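One common way to do this conversion is a bag-of-words representation, in which each text becomes a vector of word counts. The sketch below is one possible approach among several, and the sample tweets are made up for illustration. The function names `build_vocabulary` and `vectorize` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(document, vocabulary):
    """Return a count vector; words outside the vocabulary are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Made-up, already cleaned tweets.
tweets = ["good phone good price", "bad battery", "good battery"]
vocab = build_vocabulary(tweets)
print(vocab)
# {'bad': 0, 'battery': 1, 'good': 2, 'phone': 3, 'price': 4}
for tweet in tweets:
    print(vectorize(tweet, vocab))
# [0, 0, 2, 1, 1]
# [1, 1, 0, 0, 0]
# [0, 1, 1, 0, 0]
```

Every tweet ends up as a vector of the same length, which is what lets a model treat them as rows of a numeric table. Word order is lost in this representation.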
4. Model Development
Once the feature set is ready, we need to develop a suitable model that can be trained to gain knowledge from the data. These models are generally statistical, machine learning-based, deep learning-based, or reinforcement learning-based. In our case, we will build a model that is capable of extracting sentiment from the numeric features of the tweets.
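As an example of a simple statistical model, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing, using only the standard library. This is one illustrative choice and not necessarily the model a real project would settle on. The class name `NaiveBayesSentiment` and the four training tweets are made up for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.split():
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, far too small for real use.
train_texts = [
    "love this phone",
    "great battery great screen",
    "hate this phone",
    "terrible battery awful screen",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("love great screen"))  # positive
print(model.predict("awful phone"))        # negative
```

Working with log probabilities avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word that is absent from one class from forcing that class's probability to zero.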
5. Model Assessment
After developing a model, it is essential to benchmark it. This process of benchmarking is known as model assessment. In this step, we will evaluate the performance of our model by comparing it to others. This can be done using different metrics, such as precision, recall, and accuracy. In our case, we will evaluate the newly created model by checking how well it performs when extracting the sentiments of the tweets.
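The three metrics named above have standard definitions for a binary task: accuracy is the fraction of predictions that are correct, precision is TP / (TP + FP), and recall is TP / (TP + FN), where TP, FP, and FN are the counts of true positives, false positives, and false negatives. The sketch below computes them by hand. The function name `evaluate` and the label lists are made up for illustration.

```python
def evaluate(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted
        # or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up true labels and model predictions for five tweets.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
print(evaluate(y_true, y_pred))
```

Here the model gets 3 of 5 tweets right (accuracy 0.6), and both precision and recall come out to 2/3. The metrics can diverge sharply on imbalanced data, which is why accuracy alone is rarely enough.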
6. Model Deployment
This is the final stage for most NLP projects. In this stage, the models are put into production. They are either integrated into an existing system or used as the basis for new products. In our case, we will deploy our model to production so that it can extract sentiments from tweets in real time.
In summary, we learned what NLP is and how it can be used to extract useful information. We also looked at the different phases an NLP project has to pass through.