Introduction
In our previous blog, we covered tokenization and its methods. In today’s blog, we will discuss further algorithms of NLTK, starting with stopwords.
1. Stopwords
Stopwords are the most common words in any natural language. When analyzing text and building Natural Language Processing models, these words add little value to the document, so we usually remove them. Common English stopwords include ‘is,’ ‘s,’ ‘am,’ ‘or,’ ‘who,’ ‘as,’ ‘from,’ ‘him,’ ‘each,’ ‘the,’ ‘themselves,’ ‘when,’ ‘to,’ ‘at,’ etc.
A Few Key Benefits of Removing Stopwords
- It reduces the dataset size, which in turn reduces the time needed to train a model.
- Removing stopwords can potentially improve performance, since fewer and only meaningful tokens remain. It can increase classification accuracy.
To check the list of stopwords, you can run the commands below.
```python
# Import NLTK
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Printing the set of English Stopwords
print(set(stopwords.words('english')))
```

OUTPUT:

```
{'in', 'be', 'an', 'aren', 'between', 'most', 'these', 'both', "weren't", 'his', 'while', 'so', 've', "that'll", 'we', 'such', 'doesn', 'from', 'y', 'what', 'shan', 'wouldn', 'him', 'no', "didn't", 'on', 'it', 's', "you'd", 'm', "shouldn't", 'her', 'ain', 'itself', 'been', 'as', "she's", 'into', 'doing', 'there', 'where', "it's", 'more', 'each', 'needn', 'here', 'does', 'its', 'those', "aren't", "hasn't", "shan't", 'won', 'ours', 'yours', "wouldn't", 'again', 'yourselves', 'off', 'only', 'am', 'will', 'who', 'the', 'very', 'has', 'you', 'nor', 'mightn', 'at', 'himself', "you're", 'because', 'herself', 'if', 'is', "mustn't", 'just', 'until', 'once', 'before', 'when', 'up', 'don', 'weren', 'too', 'should', 'didn', 'was', 'were', 'not', 'she', 'of', 'are', 'few', 'this', "should've", "hadn't", 'had', 'he', 'with', 'whom', 'ma', "couldn't", 'myself', 'or', 't', 'my', 'having', 'haven', 'i', 'couldn', 'during', 'for', 'have', 'under', "haven't", 'did', 'your', "doesn't", 'can', 'wasn', 'our', 'through', 'a', 'isn', "you've", 'now', "wasn't", 'out', 'yourself', 'than', 'some', 'hadn', 're', "mightn't", 'they', 'mustn', 'being', 'theirs', 'further', 'hasn', 'shouldn', 'but', 'which', 'how', 'other', 'hers', 'and', 'against', 'by', "don't", 'to', 'themselves', 'that', 'o', 'over', 'own', 'do', 'about', 'any', "needn't", 'ourselves', "you'll", 'below', "won't", 'them', 'why', 'me', 'above', 'their', 'down', 'all', 'then', 'after', 'd', 'll', 'same', "isn't"}
```
We have printed the stopwords of the English language. You can also print the stopwords of other languages, such as German.
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Create an input string
paragraph = "Hello, how are you doing today? The weather is cool. The sky is dark."

# Create a set of English Stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(paragraph)

# Removing stopwords from word_tokens and inserting the rest of the words
# in the filtered_sentence list using a list comprehension.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print("Word Tokens : ", word_tokens)
print("Filtered Sentence : ", filtered_sentence)
```

OUTPUT:

```
Word Tokens :  ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'cool', '.', 'The', 'sky', 'is', 'dark', '.']
Filtered Sentence :  ['Hello', ',', 'today', '?', 'The', 'weather', 'cool', '.', 'The', 'sky', 'dark', '.']
```
2. Stemming
Stemming is the process of reducing a word to a root word by cutting off the end or the beginning of the word. For example, the words connection, connected, and connecting are all reduced to the root word connect. It is a rule-based process of stripping prefixes and suffixes from a word.
Applications
- It reduces the number of unique words in the text.
- It is used in information retrieval systems like search engines.
Some Stemming Algorithms are:
- Porter’s Stemmer algorithm
- Krovetz Stemmer
- Lovins Stemmer
- N-Gram Stemmer
- Dawson Stemmer
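The rule-based suffix stripping described above can be sketched with NLTK’s PorterStemmer (the word list here is just an illustration):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Each inflected form is cut back to the same root by suffix rules.
for word in ["connection", "connected", "connecting", "connects"]:
    print(word, "->", stemmer.stem(word))  # all four print 'connect'
```

Note that the stem need not be a dictionary word: for example, the Porter stemmer reduces “studies” to “studi.”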
3. Lemmatization
Lemmatization is the method of changing the words of a sentence to their dictionary form. In linguistics, lemmatization is the procedure of grouping together the inflected forms of a word so that they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.
It is usually more sophisticated than stemming. A stemmer works on an individual word without knowledge of the context. For example, the word “better” has “good” as its lemma. Stemming will miss this connection, because finding it requires a dictionary look-up.
Why Is Lemmatization Better Than Stemming?
Lemmatization is an intelligent operation, while stemming is a general rule-based operation. In lemmatization, the proper form of the word is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features than stemming.
```python
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()

text = "studies studying rocks corpora cry cries"
tokenization = nltk.word_tokenize(text)

for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
    print()
```

OUTPUT:

```
Stemming for studies is studi
Lemma for studies is study

Stemming for studying is studi
Lemma for studying is studying

Stemming for rocks is rock
Lemma for rocks is rock

Stemming for corpora is corpora
Lemma for corpora is corpus

Stemming for cry is cri
Lemma for cry is cry

Stemming for cries is cri
Lemma for cries is cry
```
Note
If you look at the above output, the stem for both studies and studying is the same (studi), but the lemmatizer produces different outputs for the two. You can compare the other outputs as well. This is why lemmatization generally gives better results.
4. Parts of Speech (POS) Tagging
Parts of speech tagging is responsible for reading text in a language and assigning a specific token (part of speech) to every word. The tag is a part-of-speech tag and signifies whether the word is a noun, verb, adjective, and so on.
For example: in the sentence “Answer the question,” answer is a verb; however, in the sentence “Give me your answer,” answer is a noun.
To understand the meaning of any sentence or to extract relationships, POS Tagging is a very important step.
Different POS Tagging Methods
- Lexical Based Methods
- Rule-Based Methods
- Probabilistic Methods
- Deep Learning Methods
Part of Speech is a very long topic, so we will study this topic in a new blog.
Conclusion
In this blog, we have covered stopwords, stemming, lemmatization, and parts of speech tagging. Please feel free to leave feedback and comments in the section below. To know more about our services, please visit Loginworks Softwares Inc.