A field combining Computer Science, AI, and Linguistics to enable computers to understand, interpret, and generate human language.
NLP bridges the gap between human language and computer understanding. It allows systems like Alexa, Siri, and Google Home to process spoken and written words.
One of the most common applications of NLP is classifying emails as Ham (Not Spam) or Spam.
Artificial Neural Network
Best for Tabular Data. Order doesn't matter.
Convolutional Neural Network
Best for Images/Video. Spatial features.
Recurrent Neural Network
Best for Sequential Data (NLP).
In language, the order of words defines meaning. "The food is good" vs "The food is not good".
Evolution of NLP Architectures
Standard Neural Networks (ANNs) process inputs simultaneously (in parallel). They don't have a concept of "time" or "order".
Grammarly, Mobile Keyboards, Spelling correction.
LinkedIn/Gmail auto-suggested responses ("Great work!", "Thanks for sharing").
Google Translate, Social Media "See Translation" features.
Siri, Alexa, Google Assistant (Speech Recognition).
Process of breaking down text into smaller units called tokens (words, characters, or subwords).
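As a minimal sketch of word-level tokenization (using Python's built-in `re` module rather than a full tokenizer such as NLTK's `word_tokenize`):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("The food is not good!"))
# ['the', 'food', 'is', 'not', 'good']
```

Character- or subword-level tokenization follows the same idea with smaller units.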
Gathering raw text data from various sources (web, documents, user input).
Cleaning and preparing raw data for analysis.
Converting text into numerical vectors that machines can understand.
Choosing the architecture best suited to the task.
Stemming: A crude heuristic process that chops off the ends of words. Often results in non-words. Examples: PorterStemmer, LancasterStemmer.

Lemmatization: Uses vocabulary and morphological analysis to return the base/dictionary form (the Lemma). Example: WordNetLemmatizer.

Stopword Removal: Filtering out high-frequency words that add little semantic meaning. Example: stopwords.words("english").

Since machines cannot understand raw text, we must convert it into numerical vectors. (Note: This is the core mathematical transformation after the practical preprocessing phase.)
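A toy illustration of the difference between the three steps (the real PorterStemmer and WordNetLemmatizer are far more sophisticated; the suffix rules, lemma dictionary, and stopword set below are simplified stand-ins):

```python
# Toy stemmer: chops common suffixes, may produce non-words.
def crude_stem(word):
    for suffix in ("ing", "ly", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup returning the base form.
LEMMAS = {"studies": "study", "running": "run", "better": "good"}

def crude_lemma(word):
    return LEMMAS.get(word, word)

# Tiny subset of what stopwords.words("english") returns.
STOPWORDS = {"the", "is", "a", "an", "and", "are"}

tokens = ["the", "studies", "are", "running", "smoothly"]
print([crude_stem(w) for w in tokens])   # stemming: 'studies' -> 'stud' (a non-word)
print([crude_lemma(w) for w in tokens])  # lemmatization: 'studies' -> 'study'
print([w for w in tokens if w not in STOPWORDS])  # stopword removal
```

Stemming is fast but lossy; lemmatization is slower but returns valid dictionary words.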
| Word | The | food | is | great | bad | Pizza | Amazing |
|---|---|---|---|---|---|---|---|
| The | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| food | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | 0 | 0 | 0 | ... | ... | ... | ... |
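The table above can be sketched as a vocabulary lookup in pure Python (a real pipeline would use something like scikit-learn's OneHotEncoder):

```python
vocab = ["The", "food", "is", "great", "bad", "Pizza", "Amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary length with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("food"))
# [0, 1, 0, 0, 0, 0, 0]
# With 50k unique words, every vector would have 50,000 entries.
```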
If we have 50k unique words, our input vector size is 50,000. This means the model has to learn weights for 50,000 features.
With limited training data, a model with too many parameters will simply memorize the data (including noise) rather than learning general patterns, i.e. it overfits. This explosion of sparse features as the vocabulary grows is known as the Curse of Dimensionality.
Ideally, the vectors for Food, Pizza, and Burger should be close together. But in OHE every pair of words is equidistant and orthogonal, so no semantic similarity is captured.
In this technique, we count the occurrence of words in a document but ignore the order. The frequency of each word is used as a feature.
| Sentence | good | boy | girl | others... |
|---|---|---|---|---|
| S1 | 1 | 1 | 0 | ... |
| S2 | 1 | 0 | 1 | ... |
| S3 | 1 | 1 | 1 | ... |
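The count table can be sketched as follows (scikit-learn's CountVectorizer does the same with more options; the three sentences are assumed from the S1–S3 rows shown in the TF-IDF example below):

```python
from collections import Counter

sentences = ["good boy", "good girl", "boy girl good"]
vocab = sorted({w for s in sentences for w in s.split()})  # ['boy', 'girl', 'good']

def bow_vector(sentence):
    """Count each vocabulary word's occurrences; word order is discarded."""
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

for s in sentences:
    print(s, "->", bow_vector(s))
# good boy -> [1, 0, 1], etc.
```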
N-grams help capture some context by grouping N consecutive words together. This can be used with BoW or TF-IDF.
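A minimal n-gram helper (a sketch; NLTK's `ngrams` utility provides the same grouping):

```python
def ngrams(tokens, n):
    """Group n consecutive tokens into tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "food", "is", "not", "good"]
print(ngrams(tokens, 2))
# [('the', 'food'), ('food', 'is'), ('is', 'not'), ('not', 'good')]
```

The bigram ('not', 'good') keeps the negation that separate unigrams lose.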
Unigrams: ["not", "good"] → Bigram: ["not good"]. The bigram preserves the negation that separate unigrams lose.

TF-IDF (Term Frequency–Inverse Document Frequency) reflects how important a word is to a document in a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Here TF = (count of term in document) / (total terms in document) and IDF = ln(total documents / documents containing the term), so a word like "good" that appears in every sentence gets IDF = ln(3/3) = 0.
| Sentence | good | boy | girl |
|---|---|---|---|
| S1 (good boy) | 0 (1/2 * 0) | 0.2025 (1/2 * 0.405) | 0 |
| S2 (good girl) | 0 | 0 | 0.2025 (1/2 * 0.405) |
| S3 (boy girl good) | 0 | 0.135 (1/3 * 0.405) | 0.135 (1/3 * 0.405) |
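The table values can be reproduced with TF = count/doc_length and IDF = ln(N/df). Since ln(3/2) ≈ 0.4055, the exact products come out near 0.2027 and 0.1352; the table rounds the IDF to 0.405 first. (Note: scikit-learn's TfidfVectorizer uses a smoothed IDF, so its numbers differ.)

```python
import math

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]

N = len(docs)
df = {w: sum(w in d for d in docs) for w in vocab}   # document frequency
idf = {w: math.log(N / df[w]) for w in vocab}        # idf('good') = ln(3/3) = 0

def tfidf(doc):
    """TF-IDF vector for one document over the fixed vocabulary."""
    return [doc.count(w) / len(doc) * idf[w] for w in vocab]

for d in docs:
    print(d, [round(x, 4) for x in tfidf(d)])
# ['good', 'boy'] [0.0, 0.2027, 0.0], etc.
```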