This post is meant as a summary of many of the concepts that I learned in Marti Hearst's Natural Language Processing class at the UC Berkeley School of Information. I wanted to record the concepts and approaches I learned, with quick overviews of the code you need to get them working. I figured it could help some other people get a handle on the goals and the code needed to get things done.
I would encourage anyone else to take a look at the Natural Language Processing with Python book and to read more about scikit-learn.
Tokenization
The goal of tokenization is to break up a sentence or paragraph into specific tokens or words. We basically want to convert human language into a more abstract representation that computers can work with.
Sometimes you want to split text sentence by sentence, and other times you just want to split out words.
Sentence Tokenizers
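Here's a minimal sketch using NLTK's default Punkt-based `sent_tokenize` (the sample text is just illustrative):

```python
import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence model

text = "Hello world. This is a test sentence. Tokenize me, please!"
sentences = nltk.sent_tokenize(text)
print(sentences)
# ['Hello world.', 'This is a test sentence.', 'Tokenize me, please!']
```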
Word Tokenizers
Here's a popular word regular expression tokenizer from the NLTK book that works quite well.
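A sketch adapted from the pattern given in the NLTK book; the exact pattern used in class may have differed slightly:

```python
import nltk

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
'''
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```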
Part of Speech Tagging
Once you've tokenized the sentences, you need to tag them. Tagging is not necessary for all purposes, but it does help the computer better understand the objects and references in your sentences. Remember, our goal is to encode semantics, not just words, and tagging can help us do that.
Unfortunately, this is an imperfect science; it's never going to work out perfectly, because the same sequence of words can have multiple valid readings. Let me show you what I mean with some comical examples of garden path sentences.
The first headline is comical because it has two readings: death can happen more slowly than we thought it would (as in, we had an expectation of death happening at a certain speed), or, semantically stranger, the speed of death can be compared to the speed of thought. Once you learn about these kinds of comical sentence structures, you start to see them more often.
The second one is also comical, with two meanings as well: McDonald's fries (the food) are the holy grail for potato farmers, or, more comically, McDonald's fries (the verb) the actual holy grail for potato farmers. A comical mental image.
(Headline images from the Sentence first blog.)
Thus part-of-speech tagging is never perfect, because there are so many possible interpretations.
Built-in Tagger
This is the built-in tagger, the one that NLTK recommends. It's pretty slow when working on a fairly large corpus.
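A minimal sketch with NLTK's built-in `pos_tag` (the current default is the averaged perceptron tagger; the default at the time of the class may have differed):

```python
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```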
Unigram, Bigram, and Backoff Tagging
These are backoff taggers; basically, it's just a dictionary lookup to tag parts of speech. You train it on a tagged corpus (or corpora) and then use it to tag sentences in the future.
Here's how you train the tagger on the Brown corpus. This is a unigram tagger, so on its own it's not going to perform really well: any word it hasn't seen goes untagged, unless we back off to NN (noun) or whatever default part of speech we give it.
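A sketch of training a unigram tagger on the Brown news category, assuming the standard NLTK corpus readers:

```python
import nltk
from nltk.corpus import brown

nltk.download('brown')
brown_tagged_sents = brown.tagged_sents(categories='news')

unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag("The dog jumped over the fence".split()))
# words never seen in training come back tagged as None
```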
This is a true backoff tagger that defaults to a certain part of speech. It will look for trigram occurrences of a given word formation, and if it doesn't find any, it will back off to the bigram tagger, and so on down the chain.
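A sketch of the usual NLTK backoff chain: trigram backs off to bigram, bigram to unigram, and unigram to a default NN tag.

```python
import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')

t0 = nltk.DefaultTagger('NN')                             # last resort: call it a noun
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)
t3 = nltk.TrigramTagger(brown_tagged_sents, backoff=t2)

print(t3.tag("The dog jumped over the fence".split()))
```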
What's nice is that, to speed things up, you can actually just pickle the backoff tagger so that it's easier to deploy if need be.
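A sketch using Python's standard `pickle` module (the filename is arbitrary):

```python
import pickle

# save the trained backoff tagger (t3 from the snippet above)
with open('tagger.pickle', 'wb') as f:
    pickle.dump(t3, f)

# later, load it without retraining
with open('tagger.pickle', 'rb') as f:
    tagger = pickle.load(f)
```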
Removing Punctuation
At times you'll need to remove certain punctuation marks - this is an easy way to do so.
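One common approach (an assumption on my part; the original snippet may have differed) is a RegexpTokenizer that simply drops anything that isn't a word character:

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')   # keep word characters, drop punctuation
print(tokenizer.tokenize("Hello, world! How's it going?"))
# ['Hello', 'world', 'How', 's', 'it', 'going']
```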
Stopwords
Here's an easy way to remove stop words.
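A minimal sketch with NLTK's built-in English stopword list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stops = set(stopwords.words('english'))

words = ['this', 'is', 'a', 'small', 'sample', 'sentence']
filtered = [w for w in words if w.lower() not in stops]
print(filtered)  # ['small', 'sample', 'sentence']
```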
Extend it with:
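For example (the extra words are just placeholders for whatever counts as noise in your corpus):

```python
custom_stops = stops | {'etc', 'also', 'via'}   # add your own domain-specific stopwords
filtered = [w for w in words if w.lower() not in custom_stops]
```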
Stemming
Stemming is the process by which endings are removed from words in order to remove things like tense or plurality. It's not appropriate for all cases, but it can make it easier to connect different tenses and see whether you're covering the same subject matter.
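A minimal sketch using NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['running', 'runs', 'cats', 'caresses']:
    print(stemmer.stem(word))
# run, run, cat, caress
```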
Frequency Distributions
A common go-to for seeing what's going on with a text data set, frequency distributions allow you to see the frequency at which certain words occur, and to plot it if need be.
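A minimal sketch over the Brown news words:

```python
import nltk
from nltk.corpus import brown

fd = nltk.FreqDist(w.lower() for w in brown.words(categories='news'))
print(fd.most_common(10))   # the ten most frequent words with their counts
fd.plot(10)                 # plotting requires matplotlib
```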
Collocations, Bigrams, Trigrams
Bigrams and trigrams are just pairs or triples of words that are commonly found together; association measures score how relevant those pairings are by a given statistical measurement.
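A sketch using NLTK's collocation finders, scoring bigrams by pointwise mutual information (PMI); the frequency filter is my own choice, to screen out rare pairs:

```python
from nltk.corpus import brown
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(brown.words(categories='news'))
finder.apply_freq_filter(3)                      # ignore pairs seen fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 10))     # top 10 bigrams by PMI
```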
Chunking
Chunking basically just grabs chunks of text that might be more meaningful to your research or program. You define a pattern of parts of speech and run it over your corpus, and it will extract the phrasing that you need.
Remember, you've got to customize the patterns to the part-of-speech tagger that you're using, like the Brown tagger or the Stanford Tagger, since their tag sets differ.
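A sketch of noun-phrase chunking with `nltk.RegexpParser`; this grammar assumes Penn Treebank style tags from `pos_tag`, so adjust it for other tag sets:

```python
import nltk

grammar = r'NP: {<DT>?<JJ>*<NN.*>+}'   # optional determiner, adjectives, then nouns
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree.leaves())
```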
Splitting Training Sets + Test Sets
This is a simple way that Marti showed us to split off a test set.
This splits it into thirds.
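A sketch of one plausible reading: shuffle the data, hold out a third for testing, and train on the remaining two thirds.

```python
import random

data = list(range(90))              # stand-in for your labeled examples
random.shuffle(data)

cutoff = len(data) // 3
test_set, train_set = data[:cutoff], data[cutoff:]   # 1/3 test, 2/3 train
```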
Train, Dev, Test Sets
This splits it into halves.
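A sketch of one plausible reading: half the data for training, with the remaining half split again into dev and test sets.

```python
import random

data = list(range(100))             # stand-in for your labeled examples
random.shuffle(data)

half = len(data) // 2
quarter = half + len(data) // 4
train_set = data[:half]             # 50% train
dev_set = data[half:quarter]        # 25% dev
test_set = data[quarter:]           # 25% test
```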
Simpler Test Sets
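A sketch of one simple approach, holding out every fourth item (the modulus is an arbitrary choice):

```python
data = list(range(100))             # stand-in for your labeled examples

train_set = [x for i, x in enumerate(data) if i % 4 != 0]
test_set = [x for i, x in enumerate(data) if i % 4 == 0]   # every 4th item held out
```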
Classifiers & Scikit-learn
Now there are plenty of different ways of classifying text. This isn't an exhaustive list, but it's a pretty good starting point.
TF-IDF
See my other two posts on TF-IDF here:
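For quick reference, a minimal scikit-learn sketch (assumes scikit-learn 1.x; the documents are placeholders, and the linked posts give the full treatment):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat chased the dog"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # rows: documents, columns: terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```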
Naive Bayes Classifiers
This is a simple Naive Bayes classifier.
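A minimal sketch with NLTK's NaiveBayesClassifier; the bag-of-words feature function and the tiny training set are illustrative:

```python
import nltk

def bag_of_words(text):
    # simplest possible feature extractor: each word present maps to True
    return {word: True for word in text.lower().split()}

train_data = [
    (bag_of_words("a great wonderful movie"), 'pos'),
    (bag_of_words("loved it fantastic film"), 'pos'),
    (bag_of_words("terrible boring waste of time"), 'neg'),
    (bag_of_words("awful film hated it"), 'neg'),
]

classifier = nltk.NaiveBayesClassifier.train(train_data)
print(classifier.classify(bag_of_words("what a wonderful film")))   # likely 'pos'
classifier.show_most_informative_features(5)
```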
SVC Classifier
SVMs need numerical inputs; they can't take text-based features directly, so you have to convert those features into numbers before passing them to this classifier.
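A sketch using a CountVectorizer to turn text into a numeric matrix before fitting scikit-learn's SVC (the data is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ["a great wonderful movie", "loved it fantastic film",
         "terrible boring waste of time", "awful film hated it"]
labels = ['pos', 'pos', 'neg', 'neg']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # text features -> numeric matrix

clf = SVC(kernel='linear')
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["what a wonderful film"])))
```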
Decision Tree Classification
This is a simple decision tree classifier.
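A sketch with NLTK's DecisionTreeClassifier, reusing the same kind of boolean feature dictionaries as the Naive Bayes example:

```python
import nltk

train_data = [
    ({'great': True, 'movie': True}, 'pos'),
    ({'loved': True, 'film': True}, 'pos'),
    ({'terrible': True, 'boring': True}, 'neg'),
    ({'awful': True, 'film': True}, 'neg'),
]

classifier = nltk.DecisionTreeClassifier.train(train_data)
print(classifier.classify({'great': True, 'film': True}))
```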
Maximum Entropy Classifier
A maximum entropy classifier and some helpful explainers here.
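A sketch with NLTK's MaxentClassifier, training with the GIS algorithm and a small iteration cap so the toy example finishes quickly:

```python
import nltk

train_data = [
    ({'great': True, 'movie': True}, 'pos'),
    ({'loved': True, 'film': True}, 'pos'),
    ({'terrible': True, 'boring': True}, 'neg'),
    ({'awful': True, 'film': True}, 'neg'),
]

classifier = nltk.MaxentClassifier.train(train_data, algorithm='gis', max_iter=10)
print(classifier.classify({'great': True, 'film': True}))
```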
Cross Validating Classifiers
One thing you'll need to do to avoid over-fitting is cross-validate with k-folds. This can help you see where you might be over-fitting on your corpus.
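A sketch with scikit-learn's `cross_val_score` (this uses the current model_selection API; at the time of the class the equivalent lived in an older module):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["a great wonderful movie", "loved it fantastic film",
         "terrible boring waste of time", "awful film hated it"] * 5
labels = ['pos', 'pos', 'neg', 'neg'] * 5

X = CountVectorizer().fit_transform(texts)
scores = cross_val_score(MultinomialNB(), X, labels, cv=5)   # 5-fold cross-validation
print(scores, scores.mean())
```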
Creating Pipelines for Classifiers
Finally, creating pipelines can help speed things up immensely, especially when you're moving to more production-level code.
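A sketch of a standard scikit-learn Pipeline chaining vectorization, TF-IDF weighting, and a classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["a great wonderful movie", "loved it fantastic film",
         "terrible boring waste of time", "awful film hated it"]
labels = ['pos', 'pos', 'neg', 'neg']

pipeline = Pipeline([
    ('vect', CountVectorizer()),      # raw text -> token counts
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('clf', MultinomialNB()),         # the actual classifier
])

pipeline.fit(texts, labels)
print(pipeline.predict(["what a wonderful film"]))
```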
If you are a data scientist, or aspire to be one, investing your time in learning natural language processing (NLP) will be an investment in your future. 2020 saw a surge in the field of natural language processing. In this blog post you will discover six popular NLP libraries and their applications.
Preprocessing Libraries
Preprocessing is a crucial step in any machine learning pipeline. If you are building a language model, you would have to create word vectors, which involves removing stop words and converting words to their root forms.
#1 Spacy
Spacy is a popular Python library for sentence tokenization and lemmatization. It is an industry-grade library which can be used for text preprocessing and for training deep learning based text classifiers.
Getting started with Spacy: Named Entity Recognition (NER) is an important task in natural language processing. NER helps in extracting important entities like locations, organization names, etc.
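A minimal sketch; the two example sentences are reconstructed around the Stockholm and Mumbai entities mentioned below, and it assumes the small English model has been downloaded (`python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The conference was held in Stockholm. She later moved to Mumbai.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Stockholm GPE
# Mumbai GPE
```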
The above code processes two sentences and extracts the location mentioned in each.
As seen from the output, the code was able to extract Stockholm and Mumbai and associate them with the GPE label, which indicates countries, cities, or states.
#2 NLTK
NLTK is another popular Python library for text preprocessing. It was started as an academic project and soon became very popular amongst researchers and academicians.
Let us see how we can do part-of-speech tagging using NLTK. Part-of-speech tagging is used to extract the important parts of speech, like nouns, pronouns, adverbs, adjectives, etc.
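A minimal sketch (the example sentence is my own):

```python
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "John is visiting Sweden this summer"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('John', 'NNP'), ('is', 'VBZ'), ('visiting', 'VBG'), ...]
```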
The parts of speech extracted from the above sentence are printed as (word, tag) pairs.
Applications
A popular application of NLP is to categorize a document into a given set of labels. There are a number of Python libraries which can help you train deep learning based models for topic modeling, text summarization, sentiment analysis, etc. Let us have a look at some of these popular libraries.
Most deep learning based NLP models rely on pretrained language models through a process called transfer learning. A language model is trained on a huge corpus of documents and can then be fine-tuned for a specific domain. Some popular libraries which help in using pretrained models and building industry-grade NLP applications are listed below.
#3 FARM
FARM is a popular open source package developed by a Berlin-based company. It makes developers' lives easier by providing nice functionality like experiment tracking, multitask learning, and parallelized processing of documents.
#4 Flair
Flair is a popular PyTorch based framework which helps developers build state-of-the-art NLP applications like named entity recognition, part-of-speech tagging, sense disambiguation, and classification.
#5 Transformers
Transformers is a popular Python library for easily accessing pretrained models, with support for both PyTorch and TensorFlow. If you want to build an entire NLP pipeline using pretrained models for natural language understanding and generation tasks, Transformers will make your life easier.
#6 Gensim
Gensim is another popular Python library, widely used for topic modelling, which provides an easy-to-use interface for popular algorithms like word2vec to find synonymous words.
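A minimal word2vec sketch with the gensim 4.x API; the toy corpus is far too small for meaningful similarities and is only there to show the interface:

```python
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
    ["natural", "language", "processing", "uses", "machine", "learning"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(model.wv.most_similar("learning", topn=3))   # nearest neighbours by cosine similarity
```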