
Natural Language Processing (NLP)

A field combining Computer Science, AI, and Linguistics to enable computers to understand, interpret, and generate human language.

What is it?

NLP bridges the gap between human language and computer understanding. It allows systems like Alexa, Siri, and Google Home to process spoken and written words.

Example: Spam Classification

One of the most common applications of NLP is classifying emails as Ham (Not Spam) or Spam.

  • Input: Email Body + Subject
  • Process: NLP converts text to Vectors
  • Output: Classification Label
Email Input → NLP Processing → Vectors → Classifier → Spam / Ham
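The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not a real spam filter: the tiny email dataset and labels here are invented for demonstration, and it assumes scikit-learn is available.

```python
# Minimal sketch of the spam pipeline: text -> vectors -> classifier.
# The toy dataset below is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",               # spam
    "Claim your free lottery money",      # spam
    "Meeting rescheduled to Monday",      # ham
    "Please review the attached report",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()   # converts text to count vectors
X = vectorizer.fit_transform(emails)

clf = MultinomialNB()            # simple probabilistic classifier
clf.fit(X, labels)

test = vectorizer.transform(["free prize money"])
print(clf.predict(test))         # classifies the new email
```

On this toy data the unseen email shares its words with the spam examples, so the classifier labels it spam.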

Deep Learning Approaches

ANN

Artificial Neural Network

Best for Tabular Data. Order doesn't matter.

CNN

Convolutional Neural Network

Best for Images/Video. Spatial features.

RNN

Recurrent Neural Network

Best for Sequential Data (NLP).

Why Does Sequence Matter?

In language, the order of words defines meaning: "The food is good" vs. "The food is not good".

Simple RNN → LSTM / GRU → Bi-Directional RNN → Encoder/Decoder → Self-Attention → Transformers → Gen AI / LLMs

Evolution of NLP Architectures

Why not just use an ANN?

Standard Neural Networks (ANNs) process inputs simultaneously (in parallel). They don't have a concept of "time" or "order".

"food good bad not" as a Bag of Words loses semantic meaning. The network sees the same words with different counts and misses the negation, because everything is processed in parallel with no notion of sequence.

Practical Use Cases

Auto-Correction & Suggestion

Grammarly, Mobile Keyboards, Spelling correction.

Smart Replies

LinkedIn/Gmail auto-suggested responses ("Great work!", "Thanks for sharing").

Machine Translation

Google Translate, Social Media "See Translation" features.

Voice Assistants

Siri, Alexa, Google Assistant (Speech Recognition).

Common Terminologies

  • Corpus: The whole body of text/data (e.g., a whole paragraph or book).
  • Documents: Individual instances in the corpus (e.g., sentences).
  • Vocabulary: The set of unique words present in the text.
  • Words: All tokens present in the corpus.

Tokenization

Process of breaking down text into smaller units called tokens (words, characters, or subwords).

// Input Corpus
"My name is Shivam and I have interest in Machine Learning, NLP and DL. I am also a Full Stack developer."
Sentence Tokenization:
1. "My name is Shivam..."
2. "I am also a Full Stack..."
Word Tokenization:
"My", "name", "is", "Shivam", "and", "I", "have", "interest", ...

👣 General NLP Pipeline

  1. Text Input & Data Collection

    Gathering raw text data from various sources (web, documents, user input).

  2. Text Preprocessing

    Cleaning and preparing raw data for analysis.

    • Tokenization: Splitting text into smaller units (words/sentences).
    • Normalization: Lowercasing, removing punctuation.
    • Stopword Removal: Removing common words like "and", "is", "the".
  3. Text Representation (Feature Extraction)

    Converting text into numerical vectors that machines can understand.

    Method A: Bag of Words (BoW), TF-IDF
    Method B: Word Embeddings (Word2Vec, GloVe)
    Input Text → Vector
  4. Model Selection

    Choosing the right architecture associated with the task.

    Supervised · Unsupervised · Transformers / BERT / GPT

Text Preprocessing II

Normalization Techniques: Stemming vs. Lemmatization

1. Stemming

Crude heuristic process that chops off the ends of words. Often results in non-words.

Input: "Running", "Runs", "Ran"
Output: "Run"
Input: "Better"
Output: "Bet" (Incorrect meaning)
NLTK: PorterStemmer, LancasterStemmer

2. Lemmatization

Uses vocabulary and morphological analysis to return the base/dictionary form (Lemma).

Input: "Running", "Runs"
Output: "Run"
Input: "Better"
Output: "Good" (Correct context)
NLTK: WordNetLemmatizer

Stopwords Removal

Filtering out high-frequency words that add little semantic meaning.

Common stopwords: "is", "the", "and", "a", "an", "in", "on"
Input: "This is a sample sentence, showing off the stop words filtration."
Output: "sample sentence, showing stop words filtration."
NLTK: stopwords.words("english")

Python (NLTK) Implementation

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# 1. Tokenization
tokens = nltk.word_tokenize(text)

# 2. Stopwords
clean_tokens = [word for word in tokens if word not in stopwords.words("english")]

# 3. Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("eating")) # output: eat
print(lemmatizer.lemmatize("better", pos="a")) # output: good

Text Representation (Text → Vectors)

Since machines cannot understand raw text, we must convert it into numerical vectors. (Note: this is the core mathematical transformation that follows the preprocessing phase.)

1. One Hot Encoding (OHE)

Corpus (Documents)
D1: The food is great
D2: The food is bad
D3: Pizza is Amazing
Vocabulary (Unique Words): The, food, is, great, bad, Pizza, Amazing

Word     The  food  is  great  bad  Pizza  Amazing
The       1    0    0    0     0     0      0
food      0    1    0    0     0     0      0
is        0    0    1    0     0     0      0
...      ...  ...  ...  ...   ...   ...    ...
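The table above can be produced with a few lines of plain Python. A minimal sketch, building the vocabulary in order of first appearance:

```python
# One-hot encoding over the example corpus (pure Python).
corpus = ["The food is great", "The food is bad", "Pizza is Amazing"]

# Build the vocabulary of unique words in order of first appearance.
vocab = []
for doc in corpus:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)
# vocab: ['The', 'food', 'is', 'great', 'bad', 'Pizza', 'Amazing']

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("food"))  # [0, 1, 0, 0, 0, 0, 0]
```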

Advantages

  • Easy to implement and understand intuition.
  • Works well for simple categorical data.

Disadvantages

  • Sparsity & Overfitting: Matrix is mostly zeros.
  • Fixed Size: ML Algorithms require fixed shape & size input matrix.
  • No Semantic Meaning: Words are orthogonal.
  • OOV (Out of Vocabulary) Issue: words not seen during training cannot be represented.

1. Sparse Matrix & Overfitting

If we have 50k unique words, our input vector size is 50,000. This means the model has to learn weights for 50,000 features.

With limited training data, a model with too many parameters will just memorize the data (including noise) rather than learning general patterns. This is known as the Curse of Dimensionality.

Vocabulary (50,000 words) → Input Features (50,000-dim) → Model Weights (massive!) → Result: Overfitting (memorizing noise)

2. No Semantic Meaning captured

Ideally, Food, Pizza, and Burger should be close together in vector space. But in OHE they are all distinct and equidistant (orthogonal).

Food → (0, 1, 0)
Pizza → (1, 0, 0)
Burger → (0, 0, 1)
(e.g. vocabulary: Food, Pizza, Burger)
Finding similarity:
Dist(Food, Pizza) = √2
Dist(Food, Burger) = √2
Dist(Pizza, Burger) = √2
All distances are equal. Conclusion: no semantic meaning is captured.
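The equidistance claim is easy to verify numerically. A quick sketch computing the Euclidean distance between the three one-hot vectors:

```python
# Every pair of one-hot vectors is exactly sqrt(2) apart,
# so one-hot encoding captures no similarity between words.
import math

food   = (0, 1, 0)
pizza  = (1, 0, 0)
burger = (0, 0, 1)

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(dist(food, pizza))    # 1.4142... == sqrt(2)
print(dist(food, burger))   # same
print(dist(pizza, burger))  # same
```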

2. Bag of Words (BoW)

In this technique, we count the occurrence of words in a document but ignore the order. The frequency of each word is used as a feature.

Corpus

  • S1: He is a good boy
  • S2: She is a good girl
  • S3: Boy and girl are good
Raw Text → Lowercase → Stopwords Removal → BoW Vector

Vocabulary (Frequency)

good 3
boy 2
girl 2
(Sorted by Frequency)

BoW Output Matrix

Sentence  good  boy  girl  ...
S1         1     1    0    ...
S2         1     0    1    ...
S3         1     1    1    ...

Advantages

  • Simple and intuitive to understand.
  • Fixed sized input for ML algorithms.

Disadvantages

  • Sparsity & Overfitting: Matrix contains mostly zeros.
  • Ordering Lost: word order is discarded, so meaning that depends on it (e.g., negation) is lost.
  • OOV: Doesn't handle new words well.
  • No Semantic Meaning: Words are still treated independently.

3. N-grams

N-grams help capture some context by grouping N consecutive words together. This can be used with BoW or TF-IDF.

Unigrams (N=1)
["this", "is", "food"]
Bigrams (N=2)
["this is", "is food"]
Trigrams (N=3)
["this is food"]

Why use N-grams? (Context Example)

Unigrams (BoW)
"The food is not good"
Result: ["not", "good"]
Problem: "not" might be removed as stopword, or treated separately. Model sees "good" and predicts Positive.
Bigrams
"The food is not good"
Result: ["not good"]
Success: "not good" is a single feature. Model learns this means Negative.

4. TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF reflects how important a word is to a document in a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

TF (Term Frequency) =
(No. of occurrences of the word in the sentence) / (Total no. of words in the sentence)
IDF (Inverse Document Frequency) =
logₑ(Total no. of sentences / No. of sentences containing the word)

Example Calculation

Total Sentences (N) = 3
IDF Values:
IDF(good) = log(3/3) = 0
IDF(boy)  = log(3/2) ≈ 0.405
IDF(girl) = log(3/2) ≈ 0.405

Final TF-IDF Matrix (Calculated)

Sentence            good         boy                   girl
S1 (good boy)       0 (1/2 × 0)  0.2025 (1/2 × 0.405)  0
S2 (good girl)      0            0                     0.2025 (1/2 × 0.405)
S3 (boy girl good)  0            0.135 (1/3 × 0.405)   0.135 (1/3 × 0.405)
Note: 'good' has 0 weight because it appears in all documents (Stopword-like behavior).
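The table can be reproduced by hand with the exact formulas from this section. A sketch in plain Python (note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF plus normalization, so its numbers would differ):

```python
# TF-IDF computed with the formulas above: tf * ln(N / df), no smoothing.
import math

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]
N = len(docs)

def idf(word):
    """Natural log of (total docs / docs containing the word)."""
    df = sum(1 for d in docs if word in d)
    return math.log(N / df)

def tfidf(word, doc):
    """Term frequency in the doc, weighted by the word's IDF."""
    tf = doc.count(word) / len(doc)
    return tf * idf(word)

for doc in docs:
    print([round(tfidf(w, doc), 4) for w in vocab])
# [0.0, 0.2027, 0.0]
# [0.0, 0.0, 0.2027]
# [0.0, 0.1352, 0.1352]
```

The small differences from the table (0.2027 vs. 0.2025) come from the table rounding ln(3/2) to 0.405 before multiplying.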

Advantages

  • Intuitive logic.
  • Fixed Size Input.
  • Word importance is getting captured (rare words get high weight).

Disadvantages

  • Sparsity: Still exists.
  • OOV: Out of Vocabulary issue remains.