Intro to NLTK and Its Common Algorithms Used for NLP – P2

Introduction

In our previous blog, we covered tokenization and its methods. In today’s blog, we will discuss further NLTK algorithms, starting with stopwords.

1. Stopwords

Stopwords are the most common words in any natural language. For the purpose of analyzing text and building Natural Language Processing models, these stopwords add little value to the document, so we usually remove them. Commonly used stopwords include ‘is,’ ‘s,’ ‘am,’ ‘or,’ ‘who,’ ‘as,’ ‘from,’ ‘him,’ ‘each,’ ‘the,’ ‘themselves,’ ‘when,’ ‘to,’ ‘at,’ etc.

A Few Key Benefits of Removing Stopwords

  • It decreases the dataset size, which in turn reduces the time needed to train the model.
  • It can potentially improve performance, as fewer and only more meaningful tokens are left. This can increase classification accuracy.

To check the list of stopwords, you can run the commands below.

# Import NLTK
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Printing set of English Stopwords
print(set(stopwords.words('english')))

OUTPUT:
{'in', 'be', 'an', 'aren', 'between', 'most', 'these', 'both', "weren't", 'his',
'while', 'so', 've', "that'll", 'we', 'such', 'doesn', 'from', 'y', 'what',
'shan', 'wouldn', 'him', 'no', "didn't", 'on', 'it', 's', "you'd", 'm',
"shouldn't", 'her', 'ain', 'itself', 'been', 'as', "she's", 'into', 'doing',
'there', 'where', "it's", 'more', 'each', 'needn', 'here', 'does', 'its', 'those',
"aren't", "hasn't", "shan't", 'won', 'ours', 'yours', "wouldn't", 'again',
'yourselves', 'off', 'only', 'am', 'will', 'who', 'the', 'very', 'has', 'you',
'nor', 'mightn', 'at', 'himself', "you're", 'because', 'herself', 'if', 'is',
"mustn't", 'just', 'until', 'once', 'before', 'when', 'up', 'don', 'weren', 'too',
'should', 'didn', 'was', 'were', 'not', 'she', 'of', 'are', 'few', 'this',
"should've", "hadn't", 'had', 'he', 'with', 'whom', 'ma', "couldn't", 'myself',
'or', 't', 'my', 'having', 'haven', 'i', 'couldn', 'during', 'for', 'have',
'under', "haven't", 'did', 'your', "doesn't", 'can', 'wasn', 'our', 'through',
'a', 'isn', "you've", 'now', "wasn't", 'out', 'yourself', 'than', 'some', 'hadn',
're', "mightn't", 'they', 'mustn', 'being', 'theirs', 'further', 'hasn',
'shouldn', 'but', 'which', 'how', 'other', 'hers', 'and', 'against', 'by',
"don't", 'to', 'themselves', 'that', 'o', 'over', 'own', 'do', 'about', 'any',
"needn't", 'ourselves', "you'll", 'below', "won't", 'them', 'why', 'me', 'above',
'their', 'down', 'all', 'then', 'after', 'd', 'll', 'same', "isn't"}

We have printed the stopwords of the English language. You can also print the stopwords of another language, such as German.
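For example, here is a quick sketch of printing German stopwords (the stopwords corpus ships word lists for many languages, which you can enumerate with fileids()):

# Print the first ten German stopwords
from nltk.corpus import stopwords
print(stopwords.words('german')[:10])

# List every language for which a stopword list is available
print(stopwords.fileids())

Next, let us remove the English stopwords from a sample paragraph.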

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Create an input string
paragraph = "Hello, how are you doing today? The weather is cool. The sky is dark."

# Create a set of English Stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(paragraph)

# Removing stopwords from word_tokens and inserting the rest of the
# words into the filtered_sentence list using a list comprehension.
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print("Word Tokens : ", word_tokens)
print("Filtered Sentence : ", filtered_sentence)

OUTPUT:
Word Tokens :  ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The',
'weather', 'is', 'cool', '.', 'The', 'sky', 'is', 'dark', '.']
Filtered Sentence :  ['Hello', ',', 'today', '?', 'The', 'weather', 'cool', '.',
'The', 'sky', 'dark', '.']
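Notice that ‘The’ survives the filter: the NLTK stopword list is lowercase, and the membership test is case-sensitive. A minimal tweak, lowercasing each token before the comparison, removes it as well:

# Lowercase each token before checking membership, so 'The' is
# also treated as a stopword
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print("Filtered Sentence : ", filtered_sentence)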

2. Stemming

Stemming is the process of reducing a word to a root word by cutting off the end or the beginning of the word. For example, the words ‘connection,’ ‘connected,’ and ‘connecting’ are all reduced to the root word ‘connect.’ It is a rule-based process of stripping prefixes and suffixes from a word.

Applications

  • It reduces the number of unique words in the text.
  • It is used in information retrieval systems like search engines.

Some Stemming Algorithms are:

  • Porter’s Stemmer algorithm
  • Krovetz Stemmer
  • Lovins Stemmer
  • N-Gram Stemmer
  • Dawson Stemmer
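To see stemming in action, here is a minimal sketch using NLTK’s Porter stemmer (the most widely used of these) on the words from the example above:

from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# All three inflected forms are stripped down to the same root
for word in ["connection", "connected", "connecting"]:
    print(word, "->", porter_stemmer.stem(word))

OUTPUT:
connection -> connect
connected -> connect
connecting -> connect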

3. Lemmatization

It is the method of changing the words of a sentence into their dictionary form. Lemmatization in linguistics is the procedure of grouping together the inflected varieties of a word so that they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.

It is usually more sophisticated than stemming. A stemmer works on an individual word without knowledge of the context. For example, the word “better” has “good” as its lemma. Stemming will miss this mapping because finding it requires a dictionary look-up.
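A minimal sketch of this case with NLTK’s WordNet lemmatizer (note that the part of speech must be passed as pos='a' for an adjective, since the lemmatizer treats every word as a noun by default):

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# With the adjective POS tag, the dictionary look-up maps
# 'better' to its lemma 'good'
print(wordnet_lemmatizer.lemmatize('better', pos='a'))   # good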

Why Is Lemmatization Better Than Stemming?

Lemmatization is an intelligent operation, while stemming is a general operation. In lemmatization, the proper form of the word is looked up in the dictionary. Hence, lemmatization helps in forming better machine learning features than stemming.

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The WordNet lemmatizer needs the 'wordnet' corpus
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()
text = "studies studying rocks corpora cry cries"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
    print()

OUTPUT:
Stemming for studies is studi
Lemma for studies is study

Stemming for studying is studi
Lemma for studying is studying

Stemming for rocks is rock
Lemma for rocks is rock

Stemming for corpora is corpora
Lemma for corpora is corpus

Stemming for cry is cri
Lemma for cry is cry

Stemming for cries is cri
Lemma for cries is cry

Note

If you look at the above output, the stemming for ‘studies’ and ‘studying’ is the same (‘studi’), but the lemmatizer produces different outputs for the two words. You can also compare the other outputs. The lemmatizer leaves ‘studying’ unchanged because, without a part-of-speech argument, it treats every word as a noun. Overall, lemmatization produces more meaningful results than stemming.
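Here is a small sketch showing how passing the verb POS tag fixes the ‘studying’ case:

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# With pos='v', the lemmatizer treats 'studying' as a verb form
print(wordnet_lemmatizer.lemmatize('studying', pos='v'))   # study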

4. Parts of Speech (POS) Tagging

Parts of speech tagging is responsible for reading the text in a language and assigning a specific tag (part of speech) to every word. The tag is a part-of-speech tag and signifies whether the word is a noun, verb, adjective, and so on.

For example: within the sentence “Answer the question,” ‘answer’ is a verb; however, within the sentence “Give me your answer,” ‘answer’ is a noun.

To understand the meaning of any sentence or to extract relationships, POS Tagging is a very important step.
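As a quick, minimal sketch (assuming the ‘averaged_perceptron_tagger’ resource has been downloaded; the exact tags produced can vary by NLTK version), NLTK’s built-in tagger can be applied like this:

import nltk
from nltk.tokenize import word_tokenize

# The default NLTK POS tagger needs this resource
nltk.download('averaged_perceptron_tagger')

# Each token is paired with a part-of-speech tag; here 'answer'
# should come out as a noun, e.g. ('answer', 'NN')
print(nltk.pos_tag(word_tokenize("Give me your answer")))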

Different POS Tagging Methods

  • Lexical Based Methods
  • Rule-Based Methods
  • Probabilistic Methods
  • Deep Learning Methods

Parts of speech tagging is a broad topic, so we will study it in detail in a future blog.

Conclusion

In this blog, we have covered stopwords, stemming, lemmatization, and parts of speech tagging. Please feel free to leave feedback and comments in the section below. To know more about our services, please visit Loginworks Softwares Inc.
