Introduction to NLTK and Its Most Common Algorithms Used for Natural Language Processing (I)

Introduction

In today’s blog, we will discuss a very important Python package that is widely used for Natural Language Processing (NLP): NLTK. NLTK stands for Natural Language Tool Kit and is one of the most powerful NLP libraries; it is often called the mother of all NLP libraries. It is commonly used for data pre-processing in Machine Learning projects. NLTK mainly provides the algorithms given below:

  1. Tokenization
  2. Stopwords
  3. Stemming
  4. Lemmatization
  5. Parts of speech tagging

NLTK Installation

pip install nltk
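Note that the word and sentence tokenizers used below rely on NLTK's pre-trained Punkt models, which are downloaded separately from the package itself. A one-time download from a Python shell takes care of this:

```python
# Download the Punkt tokenizer models needed by word_tokenize and sent_tokenize
import nltk
nltk.download('punkt')
```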

NLP

Natural Language Processing is the task of understanding text or speech with the aid of a software program or machine. By analogy, just as humans can interact, understand each other's views, and reply with a suitable answer, in NLP the computer carries out these interactions instead of a human.

Text Analysis Operations Using NLTK

1. Tokenization

Tokenization is the first step in text analytics. It is the process of breaking a text paragraph down into smaller chunks. A token is a single entity that acts as a building block for a sentence or paragraph. For example, each word is a token if a sentence is tokenized into words, and each sentence is a token if a paragraph is tokenized into sentences.

1.1 Word Tokenization

The word tokenizer breaks a text paragraph into words. For example, if you want to separate every word in a paragraph, you can use this tokenizer. It is the most widely used tokenizer in text analysis.

# Importing word tokenizer from NLTK
from nltk.tokenize import word_tokenize 

# Create an input String
paragraph = "Hello, how are you doing today? The weather is cool. The sky is dark."

# breaking the paragraph into word tokens and assigning it to a new variable. 
tokenized_text = word_tokenize(paragraph) 

# Printing the tokenized paragraph. It will return a list of tokens.
print(tokenized_text) 

OUTPUT:
['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is',
 'cool', '.', 'The', 'sky', 'is', 'dark', '.']

1.2 Sentence Tokenization

The sentence tokenizer breaks text into sentences. For example, if you have a text paragraph and want to separate each sentence from it, you can use the sentence tokenizer. It is the second most widely used tokenizer in text analysis.

# Importing sent tokenizer from NLTK
from nltk.tokenize import sent_tokenize

# Create an input String
paragraph = "Hello, how are you doing today? The weather is cool. The sky is dark."

# Use the sent_tokenize method to break the string into sentences.
tokenized_text = sent_tokenize(paragraph)
print(tokenized_text)

OUTPUT:
['Hello, how are you doing today?', 'The weather is cool.', 'The sky is dark.']

1.3 Space Tokenization

The Space Tokenizer extracts tokens from a sentence or paragraph based on spaces only. Suppose you have a paragraph and want to split it wherever a space occurs; you can use the Space Tokenizer.

# Importing Space Tokenizer from NLTK
from nltk.tokenize import SpaceTokenizer

# Create an input String
paragraph = "Hello, how are you doing today? \nThe weather is cool.\t The sky is dark."

# Use SpaceTokenizer's tokenize method to split on spaces only.
tokenized_text = SpaceTokenizer().tokenize(paragraph)
print(tokenized_text)

OUTPUT:
['Hello,', 'how', 'are', 'you', 'doing', 'today?', '\nThe', 'weather', 'is', 
'cool.\t', 'The', 'sky', 'is', 'dark.']

Note: In the above output, you can see that the words are tokenized based on spaces only; the \n and \t characters remain attached to their tokens, so the output differs from word tokenization.

1.4 Tab Tokenization

The Tab Tokenizer extracts tokens from a sentence or paragraph based on tab characters. Suppose you have a paragraph and want to split it wherever a tab occurs; you can use the Tab Tokenizer.

# Importing Tab Tokenizer from NLTK
from nltk.tokenize import TabTokenizer

# Create an input String
paragraph = "Hello, how are you doing today? \nThe weather is cool.\t The sky is dark."

# Use TabTokenizer's tokenize method to split on tabs.
tokenized_text = TabTokenizer().tokenize(paragraph)
print(tokenized_text)

OUTPUT:
['Hello, how are you doing today? \nThe weather is cool.', ' The sky is dark.']

Note: In the above output, you can see that the text is tokenized based on tabs, so the output differs from space tokenization. There is only one tab in our paragraph, after "cool.", so the Tab Tokenizer divides the paragraph into two parts: one before the \t and one after it.

1.5 Line Tokenization

The Line Tokenizer extracts tokens from a sentence or paragraph on the basis of lines. Suppose you have a paragraph and want to split it into its separate lines; you can use the Line Tokenizer.

# Importing Line Tokenizer from NLTK
from nltk.tokenize import LineTokenizer 

# Create an input String
paragraph = "Hello, how are you doing today? \nThe weather is cool.\t The sky is dark."

# Use LineTokenizer's tokenize method to split on newlines.
tokenized_text = LineTokenizer().tokenize(paragraph)
print(tokenized_text)

OUTPUT:
['Hello, how are you doing today? ', 'The weather is cool.\t The sky is dark.']

Note: In the above output, you can see that the text is tokenized on the basis of lines, so the output differs from tab tokenization. There is only one newline in our paragraph, after "today?", so the Line Tokenizer divides the paragraph into two parts: one before the \n and one after it.

1.6 Whitespace Tokenization

The Whitespace Tokenizer extracts tokens from a sentence or paragraph by splitting on any whitespace: spaces, newlines, and tabs. Suppose you have a paragraph and want to separate each word regardless of spaces, newlines, and tabs; you can use the Whitespace Tokenizer.

# Import WhitespaceTokenizer from the Natural Language Tool Kit (NLTK)
from nltk.tokenize import WhitespaceTokenizer

# Create an input String
paragraph = "Hello, how are you doing today? \nThe weather is cool.\t The sky is dark."

# Use WhitespaceTokenizer's tokenize method to split on whitespace, newlines & tabs.
tokenized_text = WhitespaceTokenizer().tokenize(paragraph)

# Printing the new text which will be a list.
print(tokenized_text)

OUTPUT:
['Hello,', 'how', 'are', 'you', 'doing', 'today?', 'The', 'weather', 'is', 'cool.',
 'The', 'sky', 'is', 'dark.']

Note: In the above output, you can see that all the spaces, newlines, and tabs are removed. The output differs from space tokenization, where the \n and \t characters remained inside the tokens.
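Incidentally, splitting on arbitrary runs of whitespace is also what Python's built-in str.split() does when called with no arguments, so you can get the same token list without NLTK. This is just an illustrative comparison, not part of NLTK itself:

```python
# str.split() with no arguments splits on any run of spaces, newlines, and tabs
paragraph = "Hello, how are you doing today? \nThe weather is cool.\t The sky is dark."
tokens = paragraph.split()
print(tokens)
# ['Hello,', 'how', 'are', 'you', 'doing', 'today?', 'The', 'weather', 'is',
#  'cool.', 'The', 'sky', 'is', 'dark.']
```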

Conclusion

In this blog, we have learned about tokenization in detail, along with its different methods in NLTK. We will cover the other NLTK algorithms in upcoming blogs. Please feel free to leave comments and feedback in the section below. To know more about our services, please visit Loginworks Softwares Inc.
