Exploring Context-Free Languages and Pushdown Automata

 — #AFL #CS Basics

Welcome back, language theory explorers! In our last discussion, we thoroughly examined regular languages and their corresponding finite automata. We saw their power but also their limitations – they struggle with structures requiring unbounded memory, like matching nested parentheses or ensuring equal counts of different symbols ($a^n b^n$).

Today, we move up the Chomsky hierarchy to Context-Free Languages (CFLs). These languages are generated by Context-Free Grammars (CFGs) and recognized by Pushdown Automata (PDA). They form the theoretical backbone for describing the syntax of most programming languages. Let's unravel the details!

1. Context-Free Grammars (CFGs)

We increase the generative power of grammars by relaxing the restrictions found in regular grammars.

Definition: A grammar $G = (V, T, S, P)$ is context-free if all productions in $P$ have the form $A \rightarrow x$, where $A \in V$ (a single variable) and $x \in (V \cup T)^*$ (any string of variables and terminals).

A language $L$ is context-free if there exists a CFG $G$ such that $L = L(G)$.

Key Difference: Unlike regular grammars (right/left-linear), the right-hand side $x$ can be any mix of variables and terminals. The "context-free" name comes from the fact that the variable $A$ can be replaced by $x$ regardless of the context (surrounding symbols) in which $A$ appears in a sentential form.

Examples:

  • $G_1: S \rightarrow aSb \mid \lambda$. Generates $L(G_1) = \{a^n b^n : n \ge 0\}$. This language is context-free but not regular.
  • $G_2: S \rightarrow aSa \mid bSb \mid \lambda$. Generates $L(G_2) = \{ww^R : w \in \{a, b\}^*\}$. This language (the even-length palindromes) is also context-free but not regular.
  • $G_3: S \rightarrow SS \mid aSb \mid bSa \mid \lambda$. Generates $L(G_3) = \{w \in \{a, b\}^* : n_a(w) = n_b(w)\}$ (balanced numbers of $a$'s and $b$'s). Context-free, not regular. This grammar is not linear (due to $S \rightarrow SS$), unlike the first two examples.

Since every regular grammar is a context-free grammar (it satisfies the CFG production form), the family of regular languages is a proper subset of the context-free languages.
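The defining production $S \rightarrow aSb$ of $G_1$ can be mirrored directly in code: a string is in $\{a^n b^n\}$ exactly when repeatedly stripping a leading $a$ and a trailing $b$ empties it. A minimal sketch (the function name is ours):

```python
def generates_anbn(w: str) -> bool:
    """Membership test for L(G1) = {a^n b^n : n >= 0}, mirroring the
    production S -> aSb | λ: peel a leading 'a' and a trailing 'b'
    until the string is empty (λ) or the pattern breaks."""
    while w:
        if w[0] == "a" and w[-1] == "b":
            w = w[1:-1]           # undo one application of S -> aSb
        else:
            return False
    return True                   # reached λ

print(generates_anbn("aabb"), generates_anbn("abab"))  # True False
```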

2. Derivations, Sentential Forms, and Trees

Understanding how CFGs generate strings involves derivations and their visual representation.

Leftmost and Rightmost Derivations

When a sentential form contains multiple variables, we have a choice of which one to replace next. To standardize this:

  • Leftmost Derivation: In each step, the leftmost variable in the sentential form is replaced.
  • Rightmost Derivation: In each step, the rightmost variable is replaced.

Example: Grammar: $S \rightarrow AB$, $A \rightarrow aaA \mid \lambda$, $B \rightarrow Bb \mid \lambda$. String: $aab$.

  • Leftmost: $S \Rightarrow AB \Rightarrow aaAB \Rightarrow aaB \Rightarrow aaBb \Rightarrow aab$
  • Rightmost: $S \Rightarrow AB \Rightarrow ABb \Rightarrow Ab \Rightarrow aaAb \Rightarrow aab$

(This grammar generates $L = \{a^{2n} b^m : n, m \ge 0\}$: the $A$-part yields pairs of $a$'s, the $B$-part yields $b$'s.)

For any string derivable in a CFG, there exists at least one leftmost and at least one rightmost derivation.

Sentential Forms and Derivation (Parse) Trees

A derivation tree (or parse tree) graphically represents a derivation, independent of the order of rule application.

Definition: An ordered tree is a derivation tree for CFG $G = (V, T, S, P)$ if:

  1. The root is labeled $S$.
  2. Every leaf is labeled with a symbol from $T \cup \{\lambda\}$.
  3. Every interior node is labeled with a symbol from $V$.
  4. If an interior node is labeled $A$ and its children (left-to-right) are labeled $a_1, a_2, \dots, a_n$, then $A \rightarrow a_1 a_2 \dots a_n$ must be a production in $P$.
  5. A leaf labeled $\lambda$ has no siblings.

  • Partial Derivation Tree: Satisfies 3, 4, 5, but the root might not be $S$, and leaves can be variables.
  • Yield: The string formed by reading the labels of the leaves from left to right (omitting any $\lambda$ leaves).

Theorem:

  • Every $w \in L(G)$ is the yield of some derivation tree for $G$.
  • The yield of any derivation tree for $G$ is in $L(G)$.
  • The yield of any partial derivation tree rooted at $S$ is a sentential form of $G$.

Derivation trees capture the structure and application of rules, while leftmost/rightmost derivations impose an order.
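The yield of a derivation tree is easy to compute mechanically: concatenate the leaf labels left to right, skipping λ. A small sketch (the `Node` class and the example tree, built for $S \rightarrow aSb \mid \lambda$, are our own illustrative choices):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A derivation-tree node: label is a variable, a terminal, or "λ"."""
    label: str
    children: list = field(default_factory=list)

def yield_of(tree: Node) -> str:
    """Concatenate leaf labels left to right, omitting λ leaves."""
    if not tree.children:
        return "" if tree.label == "λ" else tree.label
    return "".join(yield_of(c) for c in tree.children)

# Derivation tree for aabb under S -> aSb | λ:
t = Node("S", [Node("a"),
               Node("S", [Node("a"), Node("S", [Node("λ")]), Node("b")]),
               Node("b")])
print(yield_of(t))  # aabb
```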

3. Parsing and Ambiguity

  • Parsing: The process of finding a derivation (or derivation tree) for a given string $w$ according to a grammar $G$. It determines whether $w \in L(G)$ and reveals its syntactic structure.
  • Membership Problem: Deciding whether $w \in L(G)$.

Parsing Algorithms

  • Exhaustive Search (Brute-Force): Systematically generate all possible (say, leftmost) derivations of length 1, length 2, etc., and see if $w$ is produced. This works but can be extremely inefficient (potentially exponential, on the order of $|P|^{c|w|}$ sentential forms) and might not terminate if λ-productions or unit-production cycles allow derivations to loop or shrink.
  • Termination: If the grammar has no λ-productions and no unit productions ($A \rightarrow B$), every step of a derivation increases the sentential form's length or its terminal count, so any derivation of $w$ takes at most $2|w|$ steps. Exhaustive search then terminates after examining on the order of $|P|^{2|w|}$ sentential forms, deciding membership.
  • CYK Algorithm: A more efficient dynamic programming algorithm for membership (and parsing). Requires the grammar to be in Chomsky Normal Form (see below). Runs in $O(|w|^3)$ time.
  • LL and LR Parsers: For restricted classes of CFGs (LL and LR grammars), linear-time ($O(|w|)$) parsing is possible. These are crucial for compiler construction.
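To make the CYK idea concrete, here is a compact sketch of the $O(|w|^3)$ dynamic program. The CNF grammar below (for $\{a^n b^n : n \ge 1\}$) and the dictionary encoding of productions are our own illustrative choices:

```python
def cyk(w, rules, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form.
    rules maps each variable to a list of right-hand sides, each
    either a single terminal like "a" or a variable pair like "BC"."""
    n = len(w)
    if n == 0:
        return False          # CNF grammars cannot derive λ
    # table[i][j] = set of variables deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        for A, rhss in rules.items():
            if ch in rhss:
                table[i][0].add(A)
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start position
            for k in range(1, length):      # split point
                for A, rhss in rules.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[i][k - 1]
                                and rhs[1] in table[i + k][length - k - 1]):
                            table[i][length - 1].add(A)
    return start in table[0][n - 1]

# CNF grammar for {a^n b^n : n >= 1}: S -> AB | AX, X -> SB, A -> a, B -> b
rules = {"S": ["AB", "AX"], "X": ["SB"], "A": ["a"], "B": ["b"]}
print(cyk("aabb", rules), cyk("abb", rules))  # True False
```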

Ambiguity

Definition: A CFG $G$ is ambiguous if there exists some string $w \in L(G)$ that has two or more distinct derivation trees (or equivalently, two or more distinct leftmost or rightmost derivations).

Example: The grammar $E \rightarrow E+E \mid E*E \mid (E) \mid a \mid b \mid c$ is ambiguous for expressions like $a+b*c$: one tree groups $a+b$ first, the other groups $b*c$ first.

Resolving Ambiguity:

  1. External Rules: Impose precedence rules (e.g., $*$ before $+$).
  2. Rewrite Grammar: Modify the grammar to enforce precedence and associativity, making it unambiguous. Example: introduce Term ($T$) and Factor ($F$) non-terminals:
     $E \rightarrow E + T \mid T$
     $T \rightarrow T * F \mid F$
     $F \rightarrow (E) \mid a \mid b \mid c$
     This grammar allows only the parse corresponding to standard operator precedence for $a+b*c$.
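The rewritten grammar lends itself to a standard recursive-descent evaluator, with the left recursion in $E$ and $T$ unrolled into loops (the usual top-down trick). The numeric values assigned to $a$, $b$, $c$ below are purely illustrative, chosen so the precedence is visible in the result:

```python
def evaluate(tokens, values={"a": 2, "b": 3, "c": 4}):
    """Recursive-descent evaluator for the unambiguous grammar
    E -> E + T | T,  T -> T * F | F,  F -> (E) | a | b | c,
    with left recursion unrolled into while-loops. The terminal
    values are illustrative assumptions, not part of the grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                 # F -> (E) | a | b | c
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            v = expr()
            pos += 1              # consume the closing ")"
            return v
        return values[tok]

    def term():                   # T -> F (* F)*
        nonlocal pos
        v = factor()
        while peek() == "*":
            pos += 1
            v *= factor()
        return v

    def expr():                   # E -> T (+ T)*
        nonlocal pos
        v = term()
        while peek() == "+":
            pos += 1
            v += term()
        return v

    return expr()

print(evaluate(list("a+b*c")), evaluate(list("(a+b)*c")))  # 14 20
```

With $a=2$, $b=3$, $c=4$, the value 14 confirms the grammar forces $a+(b*c)$ rather than $(a+b)*c$.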

Definition:

  • A CFL $L$ is unambiguous if there exists at least one unambiguous CFG that generates it.
  • A CFL $L$ is inherently ambiguous if every CFG that generates it is ambiguous. Example: $L = \{a^n b^n c^m\} \cup \{a^n b^m c^m\}$, $n, m \ge 0$. A string of the form $a^k b^k c^k$ belongs to both parts of the union and can be parsed both ways, and it can be shown that no unambiguous grammar for $L$ exists.

Ambiguity is undesirable in programming languages, where unique interpretation is required. Deciding whether an arbitrary CFG is ambiguous, or whether two CFGs are equivalent, is undecidable in general.

4. Simplifying Context-Free Grammars

For theoretical proofs and practical algorithms (like CYK), it's useful to simplify CFGs by removing certain types of productions without changing the language generated (except possibly for $\lambda$).

Assumption: For simplification, we often focus on λ-free languages ($L - \{\lambda\}$). If needed, $\lambda$ can be added back later via a new start symbol $S_0 \rightarrow S \mid \lambda$.

Key Simplifications:

  1. Eliminating λ-Productions:

    • Nullable Variable: A variable $A$ is nullable if $A \Rightarrow^* \lambda$.
    • Procedure: (a) Find the set $V_N$ of all nullable variables: start with every $A$ having a production $A \rightarrow \lambda$, then add $B$ whenever $B \rightarrow A_1 \dots A_k$ with all $A_i$ nullable; repeat until no change. (b) Build a new production set $P'$: for each production $A \rightarrow x_1 x_2 \dots x_m$ in the original $P$, add to $P'$ this production and every production formed by deleting one or more nullable $x_i$ — but never add $A \rightarrow \lambda$ to $P'$, even if all $x_i$ are nullable.
    • Result: A grammar $G'$ without λ-productions such that $L(G') = L(G) - \{\lambda\}$.
  2. Eliminating Unit Productions:

    • Unit Production: A production of the form $A \rightarrow B$, where $A, B \in V$.
    • Procedure: (a) Find all pairs $(A, B)$ such that $A \Rightarrow^* B$ using only unit productions (e.g., via a dependency graph). (b) Initialize $P'$ with all non-unit productions from the original $P$. (c) For every pair $(A, B)$ found in step (a), add $A \rightarrow y$ to $P'$ for every non-unit production $B \rightarrow y$ in the original $P$.
    • Result: An equivalent grammar $G'$ without unit productions.
  3. Eliminating Useless Productions:

    • Useless Variable/Production: A variable $A$ is useless if it cannot participate in the derivation of any terminal string (i.e., either $S \not\Rightarrow^* xAy$ for any $x, y$, or $A \not\Rightarrow^* w$ for any $w \in T^*$). A production involving a useless variable is useless.
    • Procedure: (a) (Derives Terminals): Find the set $V_T$ of variables that can derive a terminal string: start with every $A$ having a production $A \rightarrow w$ with $w \in T^*$; add $B$ whenever $B \rightarrow x_1 \dots x_k$ and each $x_i$ is a terminal or already in $V_T$; repeat. Keep only productions involving variables in $V_T$ and terminals. (b) (Reachable from $S$): Find the set $V_S$ of variables reachable from the start symbol $S$ using the remaining productions (e.g., via a dependency graph). Keep only productions involving variables in $V_S$.
    • Result: An equivalent grammar $G'$ with no useless variables or productions.
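The fixed-point computation of nullable variables (step (a) of λ-elimination) is only a few lines of code. This sketch uses our own encoding of productions as (variable, right-hand-side tuple) pairs:

```python
def nullable_variables(productions):
    """Step (a) of λ-elimination: compute the nullable variables by a
    fixed-point iteration. productions is a list of
    (variable, rhs-tuple) pairs; an empty tuple encodes A -> λ."""
    nullable = {head for head, rhs in productions if rhs == ()}
    changed = True
    while changed:
        changed = False
        for head, rhs in productions:
            if head not in nullable and rhs and all(s in nullable for s in rhs):
                nullable.add(head)
                changed = True
    return nullable

# S -> AB, A -> aaA | λ, B -> Bb | λ
prods = [("S", ("A", "B")), ("A", ("a", "a", "A")), ("A", ()),
         ("B", ("B", "b")), ("B", ())]
print(sorted(nullable_variables(prods)))  # ['A', 'B', 'S']
```

$A$ and $B$ are directly nullable, and $S$ becomes nullable in the next pass because every symbol of $S \rightarrow AB$ is nullable.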

Theorem: Any CFL $L$ not containing $\lambda$ has an equivalent CFG with no λ-productions, no unit productions, and no useless productions. (Apply the eliminations in the order: λ-productions, unit productions, useless productions.)

5. Normal Forms for CFGs

Normal forms are restricted grammar structures that are still powerful enough to generate all CFLs (except possibly λ).

Chomsky Normal Form (CNF)

Definition: A CFG is in Chomsky Normal Form (CNF) if all productions are of the form:

  • $A \rightarrow BC$ (where $A, B, C \in V$)
  • $A \rightarrow a$ (where $A \in V$, $a \in T$)

Theorem: Any CFL $L$ with $\lambda \notin L$ has an equivalent grammar in CNF.

Conversion Algorithm:

  1. Start with a CFG $G$ having no λ-, unit, or useless productions.
  2. Step 1 (Isolate Terminals): Create a grammar $G_1$.
    • Keep productions of the form $A \rightarrow a$.
    • In each production $A \rightarrow x_1 \dots x_n$ ($n \ge 2$), replace every terminal $a$ among the $x_i$ with a new variable $B_a$ and add the production $B_a \rightarrow a$. The production becomes $A \rightarrow C_1 \dots C_n$, where $C_i = x_i$ if $x_i \in V$ and $C_i = B_a$ if $x_i = a$.
    • Result: All productions are $A \rightarrow a$ or $A \rightarrow C_1 \dots C_n$ (each $C_i$ a variable).
  3. Step 2 (Reduce RHS Length): Create the final grammar $G'$.
    • Keep productions $A \rightarrow a$ and $A \rightarrow C_1 C_2$.
    • For each production $A \rightarrow C_1 C_2 \dots C_n$ ($n > 2$), introduce new variables $D_1, \dots, D_{n-2}$ and replace it with the chain $A \rightarrow C_1 D_1$, $D_1 \rightarrow C_2 D_2$, ..., $D_{n-2} \rightarrow C_{n-1} C_n$.
    • Result: All productions are $A \rightarrow BC$ or $A \rightarrow a$.

Usefulness: CNF is required for algorithms like CYK parsing.
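Step 2 of the conversion (reducing RHS length) can be sketched directly; the production encoding and the naming scheme for the fresh $D_i$ variables are our own:

```python
def binarize(productions):
    """Step 2 of the CNF conversion: replace each A -> C1...Cn (n > 2)
    with the chain A -> C1 D1, D1 -> C2 D2, ..., D_{n-2} -> C_{n-1} C_n.
    productions is a list of (variable, rhs-tuple) pairs whose right-hand
    sides contain only variables (terminals were isolated in Step 1)."""
    result, counter = [], 0
    for head, rhs in productions:
        while len(rhs) > 2:
            counter += 1
            fresh = f"D{counter}"                   # new variable D_i
            result.append((head, (rhs[0], fresh)))
            head, rhs = fresh, rhs[1:]
        result.append((head, rhs))
    return result

print(binarize([("S", ("A", "B", "C", "D"))]))
# [('S', ('A', 'D1')), ('D1', ('B', 'D2')), ('D2', ('C', 'D'))]
```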

Greibach Normal Form (GNF)

Definition: A CFG is in Greibach Normal Form (GNF) if all productions are of the form:

  • $A \rightarrow ax$ (where $A \in V$, $a \in T$, $x \in V^*$)

Theorem: Any CFL $L$ with $\lambda \notin L$ has an equivalent grammar in GNF.

Conversion: The conversion process is more complex than for CNF, often involving steps to eliminate left recursion and then systematically substituting to get the required form.

Usefulness: GNF ensures that each step in a leftmost derivation produces exactly one terminal symbol, which is useful in relating CFGs to Pushdown Automata.

6. Pushdown Automata (PDA)

PDAs are automata equipped with a stack as auxiliary storage, allowing them to recognize context-free languages.

Conceptual Model: A finite control unit, an input tape (read-only, moves right), and a stack (LIFO) of infinite capacity.

Formal Definition: A (nondeterministic) pushdown automaton (NPDA) is a 7-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, z, F)$:

  • $Q$: Finite set of states.
  • $\Sigma$: Finite input alphabet.
  • $\Gamma$: Finite stack alphabet.
  • $\delta: Q \times (\Sigma \cup \{\lambda\}) \times \Gamma \rightarrow$ finite subsets of $Q \times \Gamma^*$: The transition function.
  • $q_0 \in Q$: The initial state.
  • $z \in \Gamma$: The initial stack symbol (stack bottom marker).
  • $F \subseteq Q$: The set of final (accepting) states.

Transition Function $\delta(q, a, b)$: Contains pairs $(p, \gamma)$ meaning: if in state $q$, reading input symbol $a$ (or consuming no input when $a = \lambda$), with symbol $b$ on top of the stack, the automaton may move to state $p$, pop $b$ from the stack, and push the string $\gamma$ onto the stack (rightmost symbol first, so the leftmost symbol of $\gamma$ ends up on top).

Instantaneous Description (ID): A triple $(q, w, u)$ representing the current state $q$, the remaining unread input $w$, and the stack contents $u$ (top is leftmost).

Move: $(q_1, aw, bx) \vdash (q_2, w, \gamma x)$ if $(q_2, \gamma) \in \delta(q_1, a, b)$. $\vdash^*$ denotes zero or more moves.

Acceptance (by final state): An NPDA $M$ accepts a string $w \in \Sigma^*$ if $(q_0, w, z) \vdash^* (p, \lambda, u)$ for some state $p \in F$ and any stack content $u \in \Gamma^*$. Language Accepted ($L(M)$): The set of all strings accepted by $M$. (Note: Acceptance by empty stack is an alternative, equivalent definition.)

Example PDA for $L = \{a^n b^n : n \ge 0\}$:

  • Push a symbol (e.g., '1') for each 'a'.
  • Pop a '1' for each 'b'.
  • Accept if the stack is back down to the initial symbol 'z' when the input is finished. This requires transitions like $\delta(q_{read\_a}, a, \cdot) \rightarrow$ push 1 and $\delta(q_{read\_b}, b, 1) \rightarrow$ pop 1, plus state changes and start/end transitions.
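This PDA is in fact deterministic, so it can be simulated with an explicit stack in a few lines. The state names and stack symbols below follow the sketch above but are otherwise our own choices:

```python
def accepts_anbn(w: str) -> bool:
    """Simulate the (deterministic) PDA for {a^n b^n : n >= 0}:
    state q0 pushes a '1' per 'a'; once a 'b' is seen we move to q1
    and pop a '1' per 'b'; accept when the input is exhausted with
    only the bottom marker 'z' left on the stack."""
    stack, state = ["z"], "q0"
    for ch in w:
        if state == "q0" and ch == "a":
            stack.append("1")               # push for each a
        elif ch == "b" and stack[-1] == "1":
            stack.pop()                     # pop for each b
            state = "q1"
        else:
            return False                    # no applicable transition
    return stack == ["z"]

print(accepts_anbn("aaabbb"), accepts_anbn("aab"))  # True False
```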

Deterministic PDA (DPDA): An NPDA is deterministic if:

  1. $|\delta(q, a, b)| \le 1$ for all $q, a, b$ (at most one move possible).
  2. If $\delta(q, \lambda, b) \neq \emptyset$, then $\delta(q, c, b) = \emptyset$ for all $c \in \Sigma$ (no choice between a λ-move and consuming input).

Deterministic CFL (DCFL): A language accepted by some DPDA. The DCFLs are a proper subset of the CFLs (e.g., $\{ww^R\}$ is context-free but not deterministic).

7. Equivalence: PDAs and CFGs

The fundamental connection mirroring FA/Regular Expressions.

Theorem: For any context-free language $L$, there exists an NPDA $M$ such that $L = L(M)$.

Proof Idea (CFG in GNF → NPDA):

  • Construct a 3-state NPDA (states $q_0, q_1, q_f$).
  • Initial move: Push the grammar's start symbol $S$ onto the stack: $\delta(q_0, \lambda, z) = \{(q_1, Sz)\}$.
  • Simulation move: For each grammar production $A \rightarrow a x_1 x_2 \dots x_k$ (where $a \in T$, $x_i \in V$), add $(q_1, x_1 x_2 \dots x_k)$ to $\delta(q_1, a, A)$. This reads the terminal $a$, pops variable $A$, and pushes the variable part of the RHS.
  • Final move: Accept when the stack is empty (down to $z$): $\delta(q_1, \lambda, z) = \{(q_f, z)\}$.
  • The PDA effectively simulates a leftmost derivation, using its stack to hold the pending variables.
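The simulation move can be sketched as a backtracking search over the stack contents; the GNF grammar for $\{a^n b^n : n \ge 1\}$ and the production encoding are our own illustrative choices:

```python
def gnf_pda_accepts(w, productions, start="S"):
    """Simulate the NPDA built from a GNF grammar: the stack holds the
    pending variables, and each production A -> a x1...xk becomes the
    move "read a, pop A, push x1...xk". Nondeterminism is resolved by
    a backtracking search. productions: list of (A, a, vars-tuple)."""
    def step(i, stack):
        if i == len(w):
            return not stack                 # input consumed, stack empty
        if not stack:
            return False
        top, rest = stack[0], stack[1:]
        for head, a, xs in productions:
            if head == top and a == w[i] and step(i + 1, tuple(xs) + rest):
                return True
        return False
    return step(0, (start,))

# GNF grammar for {a^n b^n : n >= 1}: S -> aSB | aB, B -> b
prods = [("S", "a", ("S", "B")), ("S", "a", ("B",)), ("B", "b", ())]
print(gnf_pda_accepts("aabb", prods), gnf_pda_accepts("aab", prods))  # True False
```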

Theorem: If $L = L(M)$ for some NPDA $M$, then $L$ is a context-free language.

Proof Idea (NPDA → CFG):

  • More complex. Assume (wlog) that the NPDA $M$ satisfies certain conditions (a single final state, entered only with an empty stack; moves push or pop at most one symbol).
  • Create grammar variables of the form $(q_i A q_j)$, intended to represent the set of input strings $w$ that take $M$ from state $q_i$ to state $q_j$ while consuming $w$ and netting the removal of stack symbol $A$.
  • Grammar Productions:
    • If $\delta(q_i, a, A)$ contains $(q_j, \lambda)$ (pop $A$), add the production $(q_i A q_j) \rightarrow a$.
    • If $\delta(q_i, a, A)$ contains $(q_j, BC)$ (replace $A$ with $BC$), add productions $(q_i A q_k) \rightarrow a (q_j B q_l) (q_l C q_k)$ for all possible states $q_l, q_k$. This simulates popping $B$ (going from $q_j$ to $q_l$) and then popping $C$ (going from $q_l$ to $q_k$).
  • Start Symbol: $(q_0 z q_f)$, where $q_f$ is the unique final state.
  • The construction ensures $(q_0 z q_f) \Rightarrow^* w$ if and only if $M$ accepts $w$.

These theorems establish that NPDAs and CFGs define the same class of languages: the context-free languages.

8. Properties of Context-Free Languages

How do CFLs behave under operations? Can we decide things about them?

Closure Properties

  • Closed Under: Union, concatenation, Kleene star, homomorphism, reversal.
  • NOT Closed Under: Intersection ($L_1 = \{a^n b^n c^m\}$ and $L_2 = \{a^n b^m c^m\}$ are CFLs, but $L_1 \cap L_2 = \{a^n b^n c^n\}$ is not) and complementation (follows from non-closure under intersection via De Morgan's laws).
  • Closed Under Regular Intersection: If $L_1$ is context-free and $L_2$ is regular, then $L_1 \cap L_2$ is context-free. The proof constructs a PDA that simulates the original PDA and a DFA for $L_2$ simultaneously, using state pairs.

Pumping Lemma for Context-Free Languages

Similar to the regular version, but reflects the structure of parse trees. Used to prove languages are not context-free.

Pumping Lemma for CFLs: Let $L$ be a context-free language. Then there exists a constant $m \ge 1$ (the pumping length) such that any string $w \in L$ with $|w| \ge m$ can be divided into five substrings, $w = uvxyz$, satisfying:

  1. $|vxy| \le m$ (the "pumpable" region is bounded).
  2. $|vy| \ge 1$ (at least one of the pumped parts is non-empty).
  3. For all integers $i \ge 0$, the string $w_i = uv^i xy^i z$ is also in $L$.

Proof Idea: For a sufficiently long $w$, its derivation tree (using a simplified grammar, e.g., CNF) must be tall. A tall tree must have a long path from the root to a leaf, and since there are finitely many variables, some variable $A$ must repeat on such a path ($S \Rightarrow^* uAz \Rightarrow^* uvAyz \Rightarrow^* uvxyz$). The substring $vxy$ is the yield of the subtree rooted at the upper $A$; $x$ is the yield of the subtree rooted at the lower $A$. The constraint $|vxy| \le m$ comes from bounding the height of subtrees without repeated variables; $|vy| \ge 1$ holds because the grammar has no useless, λ-, or unit productions. Pumping corresponds to repeating ($i > 1$) or removing ($i = 0$) the derivation segment $A \Rightarrow^* vAy$.

Using the CFL Pumping Lemma: The same "game" as in the regular case, but the adversary has more freedom in choosing the decomposition $uvxyz$ (only $|vxy| \le m$ and $|vy| \ge 1$ are guaranteed). You must show that for every valid decomposition, pumping ($uv^i xy^i z$) leads to a contradiction for some $i$.

Example: $L = \{a^n b^n c^n : n \ge 0\}$ is not context-free. Assume it is, and let $m$ be the pumping length. Choose $w = a^m b^m c^m$. The adversary decomposes $w = uvxyz$ with $|vxy| \le m$ and $|vy| \ge 1$.

  • Case 1: $vxy$ contains only one type of symbol (e.g., only $a$'s). Then $w_0 = uxz$ (or $w_2 = uv^2xy^2z$) has unequal numbers of $a$'s, $b$'s, and $c$'s. Contradiction.
  • Case 2: $vxy$ contains symbols of two types (e.g., $a$'s and $b$'s, or $b$'s and $c$'s); it cannot contain all three because $|vxy| \le m$. If $vxy$ has $a$'s and $b$'s, pump with $i = 2$: $w_2$ has more $a$'s or $b$'s (or both) than $c$'s. Contradiction. The case of $b$'s and $c$'s is symmetric.

Since all possible valid decompositions lead to a contradiction, $L$ is not context-free.
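The case analysis can even be checked by brute force for a small pumping length: enumerate every decomposition $w = uvxyz$ with $|vxy| \le m$ and $|vy| \ge 1$, and verify that pumping breaks membership every time. A sketch (function names are ours):

```python
def in_L(w: str) -> bool:
    """Membership in {a^n b^n c^n : n >= 0}."""
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n

def survives_pumping(w: str, m: int) -> bool:
    """Return True if SOME decomposition w = uvxyz with |vxy| <= m and
    |vy| >= 1 keeps u v^i x y^i z in L for i = 0, 1, 2. The pumping
    lemma demands such a decomposition, so False refutes it for w."""
    n = len(w)
    for start in range(n):                        # where v begins
        for vlen in range(m + 1):
            for xlen in range(m + 1 - vlen):
                for ylen in range(m + 1 - vlen - xlen):
                    if vlen + ylen == 0 or start + vlen + xlen + ylen > n:
                        continue
                    u = w[:start]
                    v = w[start:start + vlen]
                    x = w[start + vlen:start + vlen + xlen]
                    y = w[start + vlen + xlen:start + vlen + xlen + ylen]
                    z = w[start + vlen + xlen + ylen:]
                    if all(in_L(u + v * i + x + y * i + z) for i in (0, 1, 2)):
                        return True
    return False

m = 4
print(survives_pumping("a" * m + "b" * m + "c" * m, m))  # False
```

No decomposition survives, matching the contradiction argument above (this checks one candidate $m$; the proof rules out all of them).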

Other Non-CFL Examples: $\{ww : w \in \{a, b\}^*\}$, $\{a^{n!} : n \ge 0\}$, $\{a^n b^j : n = j^2\}$.

Decision Problems

Compared to regular languages, fewer properties are decidable for CFLs:

  • Decidable:
    • Membership ($w \in L(G)$?): Yes (e.g., the CYK algorithm).
    • Emptiness ($L(G) = \emptyset$?): Yes (check whether the start symbol $S$ can derive any terminal string, as in the useless-variable computation).
    • Finiteness/Infiniteness (is $L(G)$ infinite?): Yes (check whether the grammar's dependency graph has a cycle through a variable that is reachable from $S$ and can derive a terminal string).
  • Undecidable:
    • Ambiguity: Is CFG $G$ ambiguous?
    • Equivalence: Is $L(G_1) = L(G_2)$?
    • Inclusion: Is $L(G_1) \subseteq L(G_2)$?
    • Regularity: Is $L(G)$ regular?
    • Intersection Emptiness: Is $L(G_1) \cap L(G_2) = \emptyset$? (Proven via reduction from the Post Correspondence Problem.)
    • Universality: Is $L(G) = \Sigma^*$?

Conclusion

Context-free languages, generated by CFGs and recognized by PDAs, significantly expand our descriptive power beyond regular languages, capturing essential syntactic structures like nesting found in programming languages and arithmetic expressions. We've seen how to manipulate CFGs (simplification, normal forms), the power and limitations of PDAs (especially the difference between deterministic and nondeterministic variants), and how CFL properties (closure, pumping lemma, decidability) differ from those of regular languages. While more powerful, the increased complexity leads to fundamental limitations, with many important questions about arbitrary CFLs being algorithmically undecidable.