
Unlocking Language - An Introduction to Natural Language Processing (NLP)

 — #AI/ML #CS Basics

Hello, language lovers and tech explorers! Ever marvel at how your phone understands your voice commands, or how search engines seem to grasp the meaning behind your queries? That's the magic of Natural Language Processing (NLP) – a fascinating field of Artificial Intelligence that empowers computers to understand, interpret, and generate human language.

Today, we're embarking on an introductory journey into NLP. We'll look at what it is, the core challenges it tackles (hello, ambiguity!), the fundamental theories from probability and information science that underpin it, essential techniques like language modeling, and a peek into the linguistic structures that make it all work.

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a vibrant interdisciplinary field that combines computer science, AI, and linguistics. Its primary goal is to enable computers to process, analyze, understand, and generate human language in a way that is both meaningful and useful.

From the spam filter in your email to sophisticated machine translation services, NLP is already deeply embedded in our daily lives. Some exciting applications include:

  • Question Answering: Systems like IBM's Watson can understand complex questions and provide answers based on vast amounts of text.
  • Information Extraction: Automatically pulling structured information from unstructured text, like creating calendar entries from emails.
  • Machine Translation: Translating text from one language to another, either fully automatically or by assisting human translators.
  • Sentiment Analysis: Determining the emotional tone (positive, negative, neutral) expressed in a piece of text, like product reviews.
  • Spam Detection: Identifying and filtering unwanted emails.
  • Part-of-Speech (POS) Tagging: Assigning grammatical categories (noun, verb, adjective, etc.) to words in a sentence.
  • Named Entity Recognition (NER): Identifying and classifying named entities like persons, organizations, and locations in text.

The Hurdles: Why is NLP So Hard?

If understanding language seems intuitive to us humans, why is it such a monumental task for computers? The primary culprit is ambiguity. Human language is inherently ambiguous at many levels:

  • Lexical Ambiguity: Words can have multiple meanings. For example, "bank" can refer to a financial institution or the side of a river. "Fed raises interest rates" – is "raises" a verb or noun? Is "interest" a noun or adjective?
  • Syntactic Ambiguity (Structural Ambiguity): A sentence can have multiple grammatical structures. The classic example: "I saw a man on a hill with a telescope." Who has the telescope?
  • Semantic Ambiguity: The meaning of a sentence can be unclear even if its syntax is well-formed.
  • Discourse Ambiguity (e.g., Anaphora): Pronouns like "he," "she," or "it" can refer to different entities depending on the preceding context. "Carter told Mubarak he shouldn't run again." Who is 'he'?

Headlines often playfully (or accidentally) exploit ambiguity, leading to "crash blossoms" like "Violinist Linked to JAL Crash Blossoms" or "Teacher Strikes Idle Kids".

Beyond ambiguity, other challenges include:

  • Non-standard English: Tweets, text messages, and colloquial language often deviate from formal grammar and spelling (e.g., "Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever").
  • Segmentation Issues: Languages like Chinese, Japanese, and Thai don't use spaces to separate words, making word tokenization a complex first step. German has long compound words (e.g., Lebensversicherungsgesellschaftsangestellter for "life insurance company employee") that may need splitting for tasks like information retrieval.
  • World Knowledge: Understanding language often requires vast amounts of common sense and real-world knowledge that computers lack (e.g., "Mary and Sue are sisters" vs. "Mary and Sue are mothers" implies different age ranges and relationships).
  • Idioms and Figurative Language: Phrases like "dark horse" or "get cold feet" don't mean what their individual words suggest.
  • Tricky Entity Names: Names of movies, songs, or genes can look like regular phrases (e.g., "Where is A Bug's Life playing?").

To make progress, NLP systems rely on knowledge about language (grammatical rules, word meanings), knowledge about the world, and ways to combine these, often through probabilistic models built from language data.

Foundations: Basic Probability & Information Theory for NLP

Probability theory and information theory provide the mathematical bedrock for many modern NLP techniques, especially for handling ambiguity and making predictions.

Basics of Probability Theory

Probability helps us quantify uncertainty. In NLP, we often want to assign probabilities to linguistic events, such as the likelihood of a particular word sequence. A fundamental tool is the Chain Rule of Probability, which allows us to calculate the joint probability of a sequence of events (like words in a sentence): $P(w_1, w_2, \dots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_n \mid w_1, \dots, w_{n-1})$. This can be written more compactly as $P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \dots w_{i-1})$.
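As a quick worked instance, for the (purely illustrative) three-word sentence "the cat sat" the chain rule unrolls to $P(\text{the cat sat}) = P(\text{the}) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{the cat})$ – each factor conditions on everything that came before it.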

Elements of Information Theory

Information theory, pioneered by Claude Shannon, deals with the quantification, storage, and communication of information.

  • Entropy: A measure of the uncertainty or randomness associated with a random variable. In NLP, it can quantify the uncertainty of predicting the next word.
  • Perplexity: A common intrinsic evaluation metric for language models. It measures how well a probability model predicts a sample; a lower perplexity indicates that the language model is better at predicting the sample. Perplexity $PP(W)$ for a word sequence $W = w_1 w_2 \dots w_N$ is the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$. The intuition, often called the Shannon Game, is how well we can predict the next word in a sequence: a better model assigns higher probability to the words that actually occur. Perplexity can be seen as the weighted equivalent branching factor – the average number of choices a random variable has when trying to predict the next word.
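As a rough sketch of how the formula above translates to code, here is a minimal perplexity computation in Python; the unigram probabilities and the `</s>` end-of-sentence token are hypothetical stand-ins for a real language model.

```python
import math

# A minimal perplexity computation; the unigram probabilities and the </s>
# end-of-sentence token below are hypothetical stand-ins for a real model.
unigram = {"the": 0.07, "cat": 0.001, "sat": 0.0005, "</s>": 0.02}

def perplexity(words, prob):
    # PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space for numerical stability.
    log_prob = sum(math.log(prob(w)) for w in words)
    return math.exp(-log_prob / len(words))

print(perplexity(["the", "cat", "sat", "</s>"], lambda w: unigram[w]))
```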

Language Modeling: Predicting What's Next

A Language Model (LM) is a statistical model that assigns a probability to a sequence of words, $P(W)$, or predicts the probability of an upcoming word given the previous words, $P(w_n \mid w_1, \dots, w_{n-1})$. LMs are fundamental to many NLP tasks, including speech recognition, machine translation, and spelling correction.

N-gram Models: The Power of Context

Calculating the probability of a long sequence of words using the full chain rule is often infeasible due to data sparsity – we would rarely see most long sequences in our training data. To simplify this, N-gram models make a Markov Assumption: the probability of a word depends only on the $N-1$ preceding words (its context).

  • Unigram Model ($N=1$): Assumes each word is independent of the previous ones: $P(w_1 w_2 \dots w_n) \approx \prod_{i=1}^{n} P(w_i)$. Sentences generated from unigram models are typically just random collections of frequent words.
  • Bigram Model ($N=2$): The probability of a word depends only on the immediately preceding word: $P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})$. Generated sentences start to show some local coherence.
  • Trigram Model ($N=3$): The probability depends on the previous two words: $P(w_i \mid w_{i-2}, w_{i-1})$.

And so on for 4-grams, 5-grams, etc. While N-grams are a simplification (language has long-distance dependencies), they are often surprisingly effective.

Estimating N-gram Probabilities: The probabilities are typically estimated using Maximum Likelihood Estimation (MLE) from a large text corpus. For a bigram: $P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$.
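To make the MLE estimate concrete, here is a minimal Python sketch that counts unigrams and bigrams over a tiny toy corpus (the sentences are illustrative) and computes $P(w_i \mid w_{i-1})$ by dividing counts:

```python
from collections import Counter

# MLE bigram estimation over a tiny toy corpus (illustrative sentences only).
corpus = [["<s>", "i", "am", "sam", "</s>"],
          ["<s>", "sam", "i", "am", "</s>"],
          ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_mle(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("i", "<s>"))  # 2/3: "i" opens two of the three sentences
print(p_mle("am", "i"))   # 2/3: "am" follows two of the three occurrences of "i"
```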

The Noisy Channel Model: Decoding Corrupted Information

The Noisy Channel Model is a powerful framework applicable to many NLP tasks, including spelling correction, machine translation, and speech recognition.

  • Intuition: Imagine an intended word or sentence $w$ (the source) is transmitted through a "noisy channel" (e.g., a typist, a speaker in a noisy room, a translation process) and is observed as $x$ (the output), which might be a corrupted version of $w$.
  • Goal: Given the observed, possibly corrupted output $x$, find the original, intended word/sentence $\hat{w}$ that is most probable: $\hat{w} = \arg\max_w P(w \mid x)$.
  • Applying Bayes' Rule: $P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$. Since $P(x)$ is constant for all candidate $w$'s, we maximize $\hat{w} = \arg\max_w P(x \mid w)\,P(w)$.
  • Components:
    • $P(w)$: The Language Model (or source model). This gives the prior probability that $w$ is the intended word/sentence; more fluent or common sentences get higher probability.
    • $P(x \mid w)$: The Channel Model (or error model). This gives the probability of observing $x$ if $w$ was the intended word/sentence. For spelling correction, this would be the probability of a certain typo occurring.
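The following Python sketch puts these pieces together for toy spelling correction; the candidate list, the unigram probabilities, and the fixed single-edit error probability are all hypothetical stand-ins for models estimated from real corpora and error data:

```python
# Toy noisy-channel spelling correction: pick argmax_w P(x|w) P(w).
# The candidates, unigram probabilities, and single-edit error probability
# are hypothetical; a real system estimates them from corpora and typo logs.
observed = "acress"

language_model = {"across": 2e-4, "acres": 8e-5, "actress": 5e-5, "cress": 1e-6}  # P(w)

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def channel_prob(x, w):
    # Crude error model P(x|w): one fixed probability for any single-edit typo.
    return 1e-3 if edit_distance(x, w) == 1 else 1e-6

best = max(language_model, key=lambda w: channel_prob(observed, w) * language_model[w])
print(best)  # "across": all candidates are one edit away, so the prior P(w) decides
```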

Smoothing and the EM Algorithm: Handling Data Sparsity & Hidden Variables

Smoothing Techniques: Tackling the Zero-Probability Problem

A major issue with MLE for N-grams is data sparsity. Many N-grams that are perfectly valid will not appear in the training corpus, leading to a zero probability estimate ($P=0$). If such an N-gram then appears in a test sentence, the entire sentence probability becomes zero, which is disastrous. Smoothing techniques redistribute some probability mass from seen N-grams to unseen ones.

  • Add-one (Laplace) Smoothing: The simplest approach: add 1 to all N-gram counts before normalizing. For bigrams: $P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$, where $V$ is the vocabulary size. While simple, it is often too blunt for language modeling but can be useful for tasks like text categorization (a short code sketch of smoothing follows this list).
  • Good-Turing Smoothing: A more sophisticated technique that uses the frequency of N-grams seen once ($N_1$) to estimate the total probability mass for unseen N-grams. For an N-gram with count $c$, its adjusted count $c^*$ is roughly $(c+1)\frac{N_{c+1}}{N_c}$.
  • Interpolation and Backoff: These methods combine probabilities from different order N-grams.
    • Interpolation: Linearly combines probabilities from different orders, e.g., $\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1})$. The $\lambda$ weights sum to 1 and can be learned from a held-out corpus.
    • Backoff: Uses the highest-order N-gram if it has good evidence (count > 0); otherwise, it "backs off" to a lower-order N-gram (e.g., if trigram count is zero, use bigram probability, scaled by a backoff weight). "Stupid backoff" is a simple version used for web-scale N-grams.
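Here is the smoothing sketch referenced above: add-one smoothing and simple two-way interpolation over a tiny hypothetical corpus. The $\lambda$ weights are illustrative rather than tuned on held-out data, and Good-Turing and backoff are omitted for brevity:

```python
from collections import Counter

# Add-one smoothing and simple two-way interpolation over a tiny hypothetical
# corpus; the lambda weights are illustrative, not tuned on held-out data.
corpus = [["<s>", "i", "am", "sam", "</s>"],
          ["<s>", "sam", "i", "am", "</s>"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))
V = len(unigrams)              # vocabulary size (includes <s> and </s> here)
N = sum(unigrams.values())     # total token count

def p_add1(w, prev):
    # Add-one: (count(prev, w) + 1) / (count(prev) + V) is never zero.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_interp(w, prev, lams=(0.3, 0.7)):
    # Linear interpolation of unigram and bigram MLE estimates (weights sum to 1).
    return lams[0] * unigrams[w] / N + lams[1] * bigrams[(prev, w)] / unigrams[prev]

print(p_add1("i", "am"))     # the unseen bigram "am i" still receives probability mass
print(p_interp("am", "i"))
```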

EM Algorithm (Expectation-Maximization)

The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, particularly when the model depends on unobserved latent variables.

  • Concept: It alternates between two steps:
    1. E-step (Expectation): Estimates the values of the latent variables, given the current parameter estimates and observed data.
    2. M-step (Maximization): Re-estimates the model parameters to maximize the expected log-likelihood found in the E-step.
  • NLP Applications: EM is used in various NLP tasks, including training Hidden Markov Models (HMMs) for Part-of-Speech tagging (where the true POS tags are latent variables), unsupervised learning of word alignments in machine translation, and learning parameters of probabilistic context-free grammars with latent annotations.
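To show the E-step/M-step alternation in isolation, here is a minimal EM sketch for a toy mixture of two biased coins (hypothetical data, not an NLP model); HMM training and word-alignment learning follow the same pattern with different likelihoods:

```python
import math

# Minimal EM for a toy mixture of two biased coins (hypothetical data).
# This only demonstrates the E-step / M-step loop that HMM training and
# word-alignment learning also follow, with different model likelihoods.
trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]  # (heads, flips) per trial
theta_a, theta_b = 0.6, 0.5                             # initial head probabilities

def binom_lik(heads, flips, p):
    return math.comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

for _ in range(20):
    # E-step: posterior probability that each trial was generated by coin A.
    resp_a = [binom_lik(h, n, theta_a) /
              (binom_lik(h, n, theta_a) + binom_lik(h, n, theta_b))
              for h, n in trials]
    # M-step: re-estimate each coin's bias from expected (fractional) counts.
    theta_a = sum(r * h for r, (h, n) in zip(resp_a, trials)) / \
              sum(r * n for r, (h, n) in zip(resp_a, trials))
    theta_b = sum((1 - r) * h for r, (h, n) in zip(resp_a, trials)) / \
              sum((1 - r) * n for r, (h, n) in zip(resp_a, trials))

print(round(theta_a, 3), round(theta_b, 3))  # the two estimates drift apart
```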

(While the NLP.pdf mentions EM in the context of learning latent annotations for PCFGs, a general introduction to EM is typically more extensive and often covered in broader ML courses.)

Linguistics for NLP: The Structure Underpinning Language

Effective NLP requires an understanding of the structure of language itself.

Text Normalization: Preparing Text for Processing

Before sophisticated analysis, text often needs to be normalized:

  • Tokenization: Segmenting a stream of characters into meaningful units called tokens (usually words). This involves handling punctuation, hyphens, contractions (like "what're" → "what", "are"), and special cases like "Finland's". For languages without explicit word separators, such as Chinese or Japanese, tokenization (word segmentation) is a more complex task, often handled with algorithms like Maximum Matching (a toy segmentation sketch follows this list).
  • Word Normalization:
    • Case Folding: Converting all text to a single case (usually lowercase) to treat "The" and "the" as the same word. This is common in information retrieval but might discard useful information for other tasks.
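As referenced above, here is a toy greedy maximum-matching segmenter; the mini-dictionary is hypothetical, and a real system would use a large lexicon plus heuristics for unknown words:

```python
# Toy greedy maximum-matching (MaxMatch) word segmentation; the mini-dictionary
# is hypothetical, and a real segmenter would use a large lexicon plus
# heuristics for unknown words.
dictionary = {"我", "喜欢", "自然", "语言", "处理", "自然语言", "自然语言处理"}
max_len = max(len(w) for w in dictionary)

def max_match(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("我喜欢自然语言处理"))  # ['我', '喜欢', '自然语言处理']
```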

Phonology and Morphology: Sounds and Word Forms

  • Phonology: The study of sound systems in language (less directly covered in these introductory materials).
  • Morphology: The study of the internal structure of words and how they are formed. Words are made up of morphemes, the smallest meaning-bearing units:
    • Stems: The core meaning-bearing part of a word (e.g., "connect" in "connection").
    • Affixes: Prefixes, suffixes, infixes that attach to stems, often carrying grammatical information (e.g., "-ion", "-s", "un-").

Two common morphological processing tasks in NLP are:

  • Lemmatization: Reducing inflected word forms to their lemma – their base dictionary form (e.g., "cars" → "car"; "am," "are," "is" → "be"). This usually requires a dictionary and part-of-speech information.
  • Stemming: A cruder process of chopping off affixes to obtain a common stem. It doesn't always result in a valid word, but it groups related word forms (e.g., "automate," "automatic," "automation" might all stem to "automat"). Porter's algorithm is a widely used rule-based stemmer for English.
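For a hands-on comparison, the sketch below runs NLTK's Porter stemmer and WordNet lemmatizer on a few words, assuming NLTK is installed and the WordNet data has been fetched once via `nltk.download('wordnet')`:

```python
# Stemming vs. lemmatization with NLTK, assuming the package is installed and
# the WordNet data has been downloaded once via nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["automate", "automatic", "automation", "cars"]:
    print(word, "->", stemmer.stem(word))        # crude suffix-stripped stems

print(lemmatizer.lemmatize("cars"))              # dictionary form: "car"
print(lemmatizer.lemmatize("is", pos="v"))       # irregular verb form maps to "be"
```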

Syntax: The Grammar of Language

Syntax deals with the rules governing how words combine to form grammatical phrases and sentences.

  • Phrase Structure Grammar (Constituency Grammar): This view organizes a sentence into nested constituents or phrases (e.g., Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase (PP)). The relationships are often represented by Context-Free Grammars (CFGs) or their probabilistic counterparts, Probabilistic Context-Free Grammars (PCFGs). Headed phrase structure identifies the most important word (the head) within each phrase (e.g., the head of an NP is usually a noun).
  • Dependency Grammar: This approach represents syntactic structure as a set of binary, asymmetric relationships called dependencies between lexical items (words). Each dependency links a head (governor) to a dependent (modifier or argument). For example, in "The boy put the tortoise on the rug," "put" is the head, "boy" is its subject dependent, and "tortoise" is its object dependent. Dependency structures are usually trees. They can sometimes represent non-projective dependencies (crossing arcs when words are laid out linearly), which are challenging for some phrase-structure formalisms.
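To see constituency ambiguity concretely, the sketch below defines a toy CFG with NLTK and parses a simplified version of the telescope sentence from earlier; the grammar is illustrative only, and a chart parser returns one tree per PP attachment:

```python
import nltk

# A toy CFG (illustrative only) that licenses two parses of a simplified
# telescope sentence: the PP attaches either to the VP or to the object NP.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
NP -> Pronoun | Det N | NP PP
PP -> P NP
Pronoun -> 'I'
Det -> 'a'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw a man with a telescope".split()):
    print(tree)  # one tree per attachment: "saw ... with a telescope" vs. "man with a telescope"
```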

Conclusion: The Intricate Dance of Language and Computation

Natural Language Processing is a rich and challenging field that sits at the intersection of computer science, AI, and linguistics. From deciphering the meaning hidden within ambiguous sentences to predicting the flow of language with statistical models and understanding its grammatical structure, NLP constantly pushes the boundaries of what machines can do with human language. The foundational concepts of probability, information theory, language modeling, and linguistic analysis are essential tools in this exciting endeavor.

As we continue to generate and consume vast amounts of textual and spoken data, the importance and sophistication of NLP techniques will only continue to grow, shaping how we interact with technology and access information.