
Natural Language Processing (NLP)

A field combining Computer Science, AI, and Linguistics to enable computers to understand, interpret, and generate human language.

What is it?

NLP bridges the gap between human language and computer understanding. It allows systems like Alexa, Siri, and Google Home to process spoken and written words.

Example: Spam Classification

One of the most common applications of NLP is classifying emails as Ham (Not Spam) or Spam.

  • Input: Email Body + Subject
  • Process: NLP converts text to Vectors
  • Output: Classification Label
Email Input → NLP Processing → Vectors → Classifier → Spam / Ham
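The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not a real spam filter: the tiny email dataset and labels here are invented for demonstration, and it assumes scikit-learn is available.

```python
# Minimal sketch of the spam pipeline: text -> vectors -> classifier.
# The toy dataset below is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",               # spam
    "Claim your free lottery money",      # spam
    "Meeting rescheduled to Monday",      # ham
    "Please review the attached report",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()   # converts text to count vectors
X = vectorizer.fit_transform(emails)

clf = MultinomialNB()            # simple probabilistic classifier
clf.fit(X, labels)

test = vectorizer.transform(["free prize money"])
print(clf.predict(test))         # classifies the new email
```

On this toy data the unseen email shares its words with the spam examples, so the classifier labels it spam.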

Deep Learning Approaches

ANN

Artificial Neural Network

Best for Tabular Data. Order doesn't matter.

CNN

Convolutional Neural Network

Best for Images/Video. Spatial features.

RNN

Recurrent Neural Network

Best for Sequential Data (NLP).

Why Does Sequence Matter?

In language, the order of words defines meaning: "The food is good" vs. "The food is not good".

Simple RNN → LSTM / GRU → Bi-Directional RNN → Encoder/Decoder → Self-Attention → Transformers → Gen AI / LLMs

Evolution of NLP Architectures

Why not just use an ANN?

Standard Neural Networks (ANNs) process inputs simultaneously (in parallel). They don't have a concept of "time" or "order".

"food good bad not" as a Bag of Words loses semantic meaning. The network sees the same words with different counts and misses the negation, because everything is processed in parallel with no notion of sequence.

Practical Use Cases

Auto-Correction & Suggestion

Grammarly, Mobile Keyboards, Spelling correction.

Smart Replies

LinkedIn/Gmail auto-suggested responses ("Great work!", "Thanks for sharing").

Machine Translation

Google Translate, Social Media "See Translation" features.

Voice Assistants

Siri, Alexa, Google Assistant (Speech Recognition).

Common Terminologies

  • Corpus: The whole body of text/data (e.g., a whole paragraph or book).
  • Documents: Individual instances in the corpus (e.g., sentences).
  • Vocabulary: The set of unique words present in the text.
  • Words: All tokens present in the corpus.

Tokenization

Process of breaking down text into smaller units called tokens (words, characters, or subwords).

// Input Corpus
"My name is Shivam and I have interest in Machine Learning, NLP and DL. I am also a Full Stack developer."
Sentence Tokenization:
1. "My name is Shivam..."
2. "I am also a Full Stack..."
Word Tokenization:
"My", "name", "is", "Shivam", "and", "I", "have", "interest", ...

👣 General NLP Pipeline

  1. Text Input & Data Collection

    Gathering raw text data from various sources (web, documents, user input).

  2. Text Preprocessing

    Cleaning and preparing raw data for analysis.

    • Tokenization: Splitting text into smaller units (words/sentences).
    • Normalization: Lowercasing, removing punctuation.
    • Stopword Removal: Removing common words like "and", "is", "the".
  3. Text Representation (Feature Extraction)

    Converting text into numerical vectors that machines can understand.

    Method A: Bag of Words (BoW), TF-IDF
    Method B: Word Embeddings (Word2Vec, GloVe)
    Input Text → Vector
  4. Model Selection

    Choosing the right architecture associated with the task.

    Supervised · Unsupervised · Transformers / BERT / GPT

Text Preprocessing II

Normalization Techniques: Stemming vs. Lemmatization

1. Stemming

Crude heuristic process that chops off the ends of words. Often results in non-words.

Input: "Running", "Runs", "Ran"
Output: "Run"
Input: "Better"
Output: "Bet" (Incorrect meaning)
NLTK: PorterStemmer, LancasterStemmer

2. Lemmatization

Uses vocabulary and morphological analysis to return the base/dictionary form (Lemma).

Input: "Running", "Runs"
Output: "Run"
Input: "Better"
Output: "Good" (Correct context)
NLTK: WordNetLemmatizer

Stopwords Removal

Filtering out high-frequency words that add little semantic meaning.

Common stopwords: "is", "the", "and", "a", "an", "in", "on"
Input: "This is a sample sentence, showing off the stop words filtration."
Output: "sample sentence, showing stop words filtration."
NLTK: stopwords.words("english")

Python (NLTK) Implementation

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# 1. Tokenization
tokens = nltk.word_tokenize(text)

# 2. Stopwords
clean_tokens = [word for word in tokens if word not in stopwords.words("english")]

# 3. Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("eating")) # output: eat
print(lemmatizer.lemmatize("better", pos="a")) # output: good

Text Representation (Text → Vectors)

Since machines cannot understand raw text, we must convert it into numerical vectors. (Note: this is the core mathematical transformation that follows the preprocessing phase.)

1. One Hot Encoding (OHE)

Corpus (Documents)
D1: The food is great
D2: The food is bad
D3: Pizza is Amazing
Vocabulary (Unique Words): The, food, is, great, bad, Pizza, Amazing

Word     The  food  is  great  bad  Pizza  Amazing
The       1    0    0    0     0     0      0
food      0    1    0    0     0     0      0
is        0    0    1    0     0     0      0
...      ...  ...  ...  ...   ...   ...    ...
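The table above can be produced with a few lines of plain Python. A minimal sketch, building the vocabulary in order of first appearance:

```python
# One-hot encoding over the example corpus (pure Python).
corpus = ["The food is great", "The food is bad", "Pizza is Amazing"]

# Build the vocabulary of unique words in order of first appearance.
vocab = []
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)
# vocab: ['The', 'food', 'is', 'great', 'bad', 'Pizza', 'Amazing']

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("food"))  # [0, 1, 0, 0, 0, 0, 0]
```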

Advantages

  • Easy to implement and understand intuition.
  • Works well for simple categorical data.

Disadvantages

  • Sparsity & Overfitting: Matrix is mostly zeros.
  • Fixed Size: ML Algorithms require fixed shape & size input matrix.
  • No Semantic Meaning: Words are orthogonal.
  • OOV (Out of Vocabulary) Issue: words not seen during training cannot be represented.

1. Sparse Matrix & Overfitting

If we have 50k unique words, our input vector size is 50,000. This means the model has to learn weights for 50,000 features.

With limited training data, a model with too many parameters will just memorize the data (including noise) rather than learning general patterns. This is known as the Curse of Dimensionality.

Vocabulary (50,000 words) → Input Features (50,000-dim) → Model Weights (massive!) → Result: Overfitting (memorizing noise)

2. No Semantic Meaning captured

Ideally, Food, Pizza, and Burger should be close together in vector space. But in OHE they are all distinct and equidistant (orthogonal).

Food → (0, 1, 0)
Pizza → (1, 0, 0)
Burger → (0, 0, 1)
(e.g. vocabulary: Food, Pizza, Burger)
Finding similarity:
Dist(Food, Pizza) = √2
Dist(Food, Burger) = √2
Dist(Pizza, Burger) = √2
All distances are equal. Conclusion: no semantic meaning is captured.
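The equidistance claim is easy to verify numerically. A quick sketch computing the Euclidean distance between the three one-hot vectors:

```python
# Every pair of one-hot vectors is exactly sqrt(2) apart,
# so one-hot encoding captures no similarity between words.
import math

food   = (0, 1, 0)
pizza  = (1, 0, 0)
burger = (0, 0, 1)

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(dist(food, pizza))    # 1.4142... == sqrt(2)
print(dist(food, burger))   # same
print(dist(pizza, burger))  # same
```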

2. Bag of Words (BoW)

In this technique, we count the occurrence of words in a document but ignore the order. The frequency of each word is used as a feature.

Corpus

  • S1: He is a good boy
  • S2: She is a good girl
  • S3: Boy and girl are good
Raw Text → Lowercase → Stopwords Removal → BoW Vector

Vocabulary (Frequency)

good 3
boy 2
girl 2
(Sorted by Frequency)

BoW Output Matrix

Sentence  good  boy  girl  ...
S1         1     1    0    ...
S2         1     0    1    ...
S3         1     1    1    ...

Advantages

  • Simple and intuitive to understand.
  • Fixed sized input for ML algorithms.

Disadvantages

  • Sparsity & Overfitting: Matrix contains mostly zeros.
  • Ordering Lost: word order is discarded, so meaning that depends on it (e.g., negation) is lost.
  • OOV: Doesn't handle new words well.
  • No Semantic Meaning: Words are still treated independently.

3. N-grams

N-grams help capture some context by grouping N consecutive words together. This can be used with BoW or TF-IDF.

Unigrams (N=1)
["this", "is", "food"]
Bigrams (N=2)
["this is", "is food"]
Trigrams (N=3)
["this is food"]

Why use N-grams? (Context Example)

Unigrams (BoW)
"The food is not good"
Result: ["not", "good"]
Problem: "not" might be removed as stopword, or treated separately. Model sees "good" and predicts Positive.
Bigrams
"The food is not good"
Result: ["not good"]
Success: "not good" is a single feature. Model learns this means Negative.

4. TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF reflects how important a word is to a document in a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

TF (Term Frequency) =
(No. of occurrences of the word in the sentence) / (Total no. of words in the sentence)
IDF (Inverse Document Frequency) =
logₑ(Total no. of sentences / No. of sentences containing the word)

Example Calculation

Total Sentences (N) = 3
IDF Values:
IDF(good) = log(3/3) = 0
IDF(boy)  = log(3/2) ≈ 0.405
IDF(girl) = log(3/2) ≈ 0.405

Final TF-IDF Matrix (Calculated)

Sentence            good         boy                   girl
S1 (good boy)       0 (1/2 × 0)  0.2025 (1/2 × 0.405)  0
S2 (good girl)      0            0                     0.2025 (1/2 × 0.405)
S3 (boy girl good)  0            0.135 (1/3 × 0.405)   0.135 (1/3 × 0.405)
Note: 'good' has 0 weight because it appears in all documents (Stopword-like behavior).
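The table can be reproduced by hand with the exact formulas from this section. A sketch in plain Python (note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF plus normalization, so its numbers would differ):

```python
# TF-IDF computed with the formulas above: tf * ln(N / df), no smoothing.
import math

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]
N = len(docs)

def idf(word):
    """Natural log of (total docs / docs containing the word)."""
    df = sum(1 for d in docs if word in d)
    return math.log(N / df)

def tfidf(word, doc):
    """Term frequency in the doc, weighted by the word's IDF."""
    tf = doc.count(word) / len(doc)
    return tf * idf(word)

for doc in docs:
    print([round(tfidf(w, doc), 4) for w in vocab])
# [0.0, 0.2027, 0.0]
# [0.0, 0.0, 0.2027]
# [0.0, 0.1352, 0.1352]
```

The small differences from the table (0.2027 vs. 0.2025) come from the table rounding ln(3/2) to 0.405 before multiplying.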

Advantages

  • Intuitive logic.
  • Fixed Size Input.
  • Word importance is getting captured (rare words get high weight).

Disadvantages

  • Sparsity: Still exists.
  • OOV: Out of Vocabulary issue remains.