Exploring Context-Free Languages and Pushdown Automata

 — #AFL #CS Basics

Welcome back, language theory explorers! In our last discussion, we thoroughly examined regular languages and their corresponding finite automata. We saw their power but also their limitations – they struggle with structures requiring unbounded memory, like matching nested parentheses or ensuring equal counts of different symbols ($a^n b^n$).

Today, we move up the Chomsky hierarchy to Context-Free Languages (CFLs). These languages are generated by Context-Free Grammars (CFGs) and recognized by Pushdown Automata (PDA). They form the theoretical backbone for describing the syntax of most programming languages. Let's unravel the details!

1. Context-Free Grammars (CFGs)

We increase the generative power of grammars by relaxing the restrictions found in regular grammars.

Definition: A grammar $G = (V, T, S, P)$ is context-free if all productions in $P$ have the form $A \rightarrow x$, where $A \in V$ (a single variable) and $x \in (V \cup T)^*$ (any string of variables and terminals).

A language $L$ is context-free if there exists a CFG $G$ such that $L = L(G)$.

Key Difference: Unlike regular grammars (right/left-linear), the right-hand side $x$ can be any mix of variables and terminals. The "context-free" name comes from the fact that the variable $A$ can be replaced by $x$ regardless of the context (surrounding symbols) in which $A$ appears in a sentential form.

Examples:

  • $G_1: S \rightarrow aSb \mid \lambda$. Generates $L(G_1) = \{a^n b^n : n \ge 0\}$. This language is context-free but not regular.
  • $G_2: S \rightarrow aSa \mid bSb \mid \lambda$. Generates $L(G_2) = \{ww^R : w \in \{a, b\}^*\}$. This language (the even-length palindromes) is also context-free but not regular.
  • $G_3: S \rightarrow SS \mid aSb \mid bSa \mid \lambda$. Generates $L(G_3) = \{w \in \{a, b\}^* : n_a(w) = n_b(w)\}$ (balanced numbers of $a$'s and $b$'s). Context-free, not regular. This grammar is not linear (due to $S \rightarrow SS$), unlike the first two examples.

Since every regular grammar is a context-free grammar (it satisfies the CFG production form), the family of regular languages is a proper subset of the context-free languages.
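The defining production $S \rightarrow aSb$ of $G_1$ can be mirrored directly in code: a string is in $\{a^n b^n\}$ exactly when repeatedly stripping a leading $a$ and a trailing $b$ empties it. A minimal sketch (the function name is ours):

```python
def generates_anbn(w: str) -> bool:
    """Membership test for L(G1) = {a^n b^n : n >= 0}, mirroring the
    production S -> aSb | λ: peel a leading 'a' and a trailing 'b'
    until the string is empty (λ) or the pattern breaks."""
    while w:
        if w[0] == "a" and w[-1] == "b":
            w = w[1:-1]           # undo one application of S -> aSb
        else:
            return False
    return True                   # reached λ

print(generates_anbn("aabb"), generates_anbn("abab"))  # True False
```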

2. Derivations, Sentential Forms, and Trees

Understanding how CFGs generate strings involves derivations and their visual representation.

Leftmost and Rightmost Derivations

When a sentential form contains multiple variables, we have a choice of which one to replace next. To standardize this:

  • Leftmost Derivation: In each step, the leftmost variable in the sentential form is replaced.
  • Rightmost Derivation: In each step, the rightmost variable is replaced.

Example: Grammar: $S \rightarrow AB$, $A \rightarrow aaA \mid \lambda$, $B \rightarrow Bb \mid \lambda$. String: $aab$.

  • Leftmost: $S \Rightarrow AB \Rightarrow aaAB \Rightarrow aaB \Rightarrow aaBb \Rightarrow aab$
  • Rightmost: $S \Rightarrow AB \Rightarrow ABb \Rightarrow Ab \Rightarrow aaAb \Rightarrow aab$

(This grammar generates $L = \{a^{2n} b^m : n, m \ge 0\}$: the $A$-part yields pairs of $a$'s, the $B$-part yields $b$'s.)

For any string derivable in a CFG, there exists at least one leftmost and at least one rightmost derivation.

Sentential Forms and Derivation (Parse) Trees

A derivation tree (or parse tree) graphically represents a derivation, independent of the order of rule application.

Definition: An ordered tree is a derivation tree for CFG $G = (V, T, S, P)$ if:

  1. The root is labeled $S$.
  2. Every leaf is labeled with a symbol from $T \cup \{\lambda\}$.
  3. Every interior node is labeled with a symbol from $V$.
  4. If an interior node is labeled $A$ and its children (left-to-right) are labeled $a_1, a_2, \dots, a_n$, then $A \rightarrow a_1 a_2 \dots a_n$ must be a production in $P$.
  5. A leaf labeled $\lambda$ has no siblings.

  • Partial Derivation Tree: Satisfies 3, 4, 5, but the root might not be $S$, and leaves can be variables.
  • Yield: The string formed by reading the labels of the leaves from left to right (omitting any $\lambda$ leaves).

Theorem:

  • Every $w \in L(G)$ is the yield of some derivation tree for $G$.
  • The yield of any derivation tree for $G$ is in $L(G)$.
  • The yield of any partial derivation tree rooted at $S$ is a sentential form of $G$.

Derivation trees capture the structure and application of rules, while leftmost/rightmost derivations impose an order.
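The yield of a derivation tree is easy to compute mechanically: concatenate the leaf labels left to right, skipping λ. A small sketch (the `Node` class and the example tree, built for $S \rightarrow aSb \mid \lambda$, are our own illustrative choices):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A derivation-tree node: label is a variable, a terminal, or "λ"."""
    label: str
    children: list = field(default_factory=list)

def yield_of(tree: Node) -> str:
    """Concatenate leaf labels left to right, omitting λ leaves."""
    if not tree.children:
        return "" if tree.label == "λ" else tree.label
    return "".join(yield_of(c) for c in tree.children)

# Derivation tree for aabb under S -> aSb | λ:
t = Node("S", [Node("a"),
               Node("S", [Node("a"), Node("S", [Node("λ")]), Node("b")]),
               Node("b")])
print(yield_of(t))  # aabb
```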

3. Parsing and Ambiguity

  • Parsing: The process of finding a derivation (or derivation tree) for a given string $w$ according to a grammar $G$. It determines whether $w \in L(G)$ and reveals its syntactic structure.
  • Membership Problem: Deciding whether $w \in L(G)$.

Parsing Algorithms

  • Exhaustive Search (Brute-Force): Systematically generate all possible (say, leftmost) derivations of length 1, length 2, etc., and see if $w$ is produced. This works but can be extremely inefficient (potentially exponential, on the order of $|P|^{c|w|}$ sentential forms) and might not terminate if λ-productions or unit-production cycles allow derivations to loop or shrink.
  • Termination: If the grammar has no λ-productions and no unit productions ($A \rightarrow B$), every step of a derivation increases the sentential form's length or its terminal count, so any derivation of $w$ takes at most $2|w|$ steps. Exhaustive search then terminates after examining on the order of $|P|^{2|w|}$ sentential forms, deciding membership.
  • CYK Algorithm: A more efficient dynamic programming algorithm for membership (and parsing). Requires the grammar to be in Chomsky Normal Form (see below). Runs in $O(|w|^3)$ time.
  • LL and LR Parsers: For restricted classes of CFGs (LL and LR grammars), linear-time ($O(|w|)$) parsing is possible. These are crucial for compiler construction.
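To make the CYK idea concrete, here is a compact sketch of the $O(|w|^3)$ dynamic program. The CNF grammar below (for $\{a^n b^n : n \ge 1\}$) and the dictionary encoding of productions are our own illustrative choices:

```python
def cyk(w, rules, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form.
    rules maps each variable to a list of right-hand sides, each
    either a single terminal like "a" or a variable pair like "BC"."""
    n = len(w)
    if n == 0:
        return False          # CNF grammars cannot derive λ
    # table[i][j] = set of variables deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        for A, rhss in rules.items():
            if ch in rhss:
                table[i][0].add(A)
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # start position
            for k in range(1, length):      # split point
                for A, rhss in rules.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[i][k - 1]
                                and rhs[1] in table[i + k][length - k - 1]):
                            table[i][length - 1].add(A)
    return start in table[0][n - 1]

# CNF grammar for {a^n b^n : n >= 1}: S -> AB | AX, X -> SB, A -> a, B -> b
rules = {"S": ["AB", "AX"], "X": ["SB"], "A": ["a"], "B": ["b"]}
print(cyk("aabb", rules), cyk("abb", rules))  # True False
```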

Ambiguity

Definition: A CFG $G$ is ambiguous if there exists some string $w \in L(G)$ that has two or more distinct derivation trees (or equivalently, two or more distinct leftmost or rightmost derivations).

Example: The grammar $E \rightarrow E+E \mid E*E \mid (E) \mid a \mid b \mid c$ is ambiguous for expressions like $a+b*c$: one tree groups $a+b$ first, the other groups $b*c$ first.

Resolving Ambiguity:

  1. External Rules: Impose precedence rules (e.g., $*$ before $+$).
  2. Rewrite Grammar: Modify the grammar to enforce precedence and associativity, making it unambiguous. Example: introduce Term ($T$) and Factor ($F$) non-terminals:
     $E \rightarrow E + T \mid T$
     $T \rightarrow T * F \mid F$
     $F \rightarrow (E) \mid a \mid b \mid c$
     This grammar allows only the parse corresponding to standard operator precedence for $a+b*c$.
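The rewritten grammar lends itself to a standard recursive-descent evaluator, with the left recursion in $E$ and $T$ unrolled into loops (the usual top-down trick). The numeric values assigned to $a$, $b$, $c$ below are purely illustrative, chosen so the precedence is visible in the result:

```python
def evaluate(tokens, values={"a": 2, "b": 3, "c": 4}):
    """Recursive-descent evaluator for the unambiguous grammar
    E -> E + T | T,  T -> T * F | F,  F -> (E) | a | b | c,
    with left recursion unrolled into while-loops. The terminal
    values are illustrative assumptions, not part of the grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                 # F -> (E) | a | b | c
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            v = expr()
            pos += 1              # consume the closing ")"
            return v
        return values[tok]

    def term():                   # T -> F (* F)*
        nonlocal pos
        v = factor()
        while peek() == "*":
            pos += 1
            v *= factor()
        return v

    def expr():                   # E -> T (+ T)*
        nonlocal pos
        v = term()
        while peek() == "+":
            pos += 1
            v += term()
        return v

    return expr()

print(evaluate(list("a+b*c")), evaluate(list("(a+b)*c")))  # 14 20
```

With $a=2$, $b=3$, $c=4$, the value 14 confirms the grammar forces $a+(b*c)$ rather than $(a+b)*c$.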

Definition:

  • A CFL $L$ is unambiguous if there exists at least one unambiguous CFG that generates it.
  • A CFL $L$ is inherently ambiguous if every CFG that generates it is ambiguous. Example: $L = \{a^n b^n c^m\} \cup \{a^n b^m c^m\}$, $n, m \ge 0$. A string of the form $a^k b^k c^k$ belongs to both parts of the union and can be parsed both ways, and it can be shown that no unambiguous grammar for $L$ exists.

Ambiguity is undesirable in programming languages, where unique interpretation is required. Deciding whether an arbitrary CFG is ambiguous, or whether two CFGs are equivalent, is undecidable in general.

4. Simplifying Context-Free Grammars

For theoretical proofs and practical algorithms (like CYK), it's useful to simplify CFGs by removing certain types of productions without changing the language generated (except possibly for $\lambda$).

Assumption: For simplification, we often focus on λ-free languages ($L - \{\lambda\}$). If needed, $\lambda$ can be added back later via a new start symbol $S_0 \rightarrow S \mid \lambda$.

Key Simplifications:

  1. Eliminating λ-Productions:

    • Nullable Variable: A variable $A$ is nullable if $A \Rightarrow^* \lambda$.
    • Procedure: (a) Find the set $V_N$ of all nullable variables: start with every $A$ having a production $A \rightarrow \lambda$, then add $B$ whenever $B \rightarrow A_1 \dots A_k$ with all $A_i$ nullable; repeat until no change. (b) Build a new production set $P'$: for each production $A \rightarrow x_1 x_2 \dots x_m$ in the original $P$, add to $P'$ this production and every production formed by deleting one or more nullable $x_i$ — but never add $A \rightarrow \lambda$ to $P'$, even if all $x_i$ are nullable.
    • Result: A grammar $G'$ without λ-productions such that $L(G') = L(G) - \{\lambda\}$.
  2. Eliminating Unit Productions:

    • Unit Production: A production of the form $A \rightarrow B$, where $A, B \in V$.
    • Procedure: (a) Find all pairs $(A, B)$ such that $A \Rightarrow^* B$ using only unit productions (e.g., via a dependency graph). (b) Initialize $P'$ with all non-unit productions from the original $P$. (c) For every pair $(A, B)$ found in step (a), add $A \rightarrow y$ to $P'$ for every non-unit production $B \rightarrow y$ in the original $P$.
    • Result: An equivalent grammar $G'$ without unit productions.
  3. Eliminating Useless Productions:

    • Useless Variable/Production: A variable $A$ is useless if it cannot participate in the derivation of any terminal string (i.e., either $S \not\Rightarrow^* xAy$ for any $x, y$, or $A \not\Rightarrow^* w$ for any $w \in T^*$). A production involving a useless variable is useless.
    • Procedure: (a) (Derives Terminals): Find the set $V_T$ of variables that can derive a terminal string: start with every $A$ having a production $A \rightarrow w$ with $w \in T^*$; add $B$ whenever $B \rightarrow x_1 \dots x_k$ and each $x_i$ is a terminal or already in $V_T$; repeat. Keep only productions involving variables in $V_T$ and terminals. (b) (Reachable from $S$): Find the set $V_S$ of variables reachable from the start symbol $S$ using the remaining productions (e.g., via a dependency graph). Keep only productions involving variables in $V_S$.
    • Result: An equivalent grammar $G'$ with no useless variables or productions.
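The fixed-point computation of nullable variables (step (a) of λ-elimination) is only a few lines of code. This sketch uses our own encoding of productions as (variable, right-hand-side tuple) pairs:

```python
def nullable_variables(productions):
    """Step (a) of λ-elimination: compute the nullable variables by a
    fixed-point iteration. productions is a list of
    (variable, rhs-tuple) pairs; an empty tuple encodes A -> λ."""
    nullable = {head for head, rhs in productions if rhs == ()}
    changed = True
    while changed:
        changed = False
        for head, rhs in productions:
            if head not in nullable and rhs and all(s in nullable for s in rhs):
                nullable.add(head)
                changed = True
    return nullable

# S -> AB, A -> aaA | λ, B -> Bb | λ
prods = [("S", ("A", "B")), ("A", ("a", "a", "A")), ("A", ()),
         ("B", ("B", "b")), ("B", ())]
print(sorted(nullable_variables(prods)))  # ['A', 'B', 'S']
```

$A$ and $B$ are directly nullable, and $S$ becomes nullable in the next pass because every symbol of $S \rightarrow AB$ is nullable.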

Theorem: Any CFL $L$ not containing $\lambda$ has an equivalent CFG with no λ-productions, no unit productions, and no useless productions. (Apply the eliminations in the order: λ-productions, unit productions, useless productions.)

5. Normal Forms for CFGs

Normal forms are restricted grammar structures that are still powerful enough to generate all CFLs (except possibly λ).

Chomsky Normal Form (CNF)

Definition: A CFG is in Chomsky Normal Form (CNF) if all productions are of the form:

  • $A \rightarrow BC$ (where $A, B, C \in V$)
  • $A \rightarrow a$ (where $A \in V$, $a \in T$)

Theorem: Any CFL $L$ with $\lambda \notin L$ has an equivalent grammar in CNF.

Conversion Algorithm:

  1. Start with a CFG $G$ having no λ-, unit, or useless productions.
  2. Step 1 (Isolate Terminals): Create a grammar $G_1$.
    • Keep productions of the form $A \rightarrow a$.
    • In each production $A \rightarrow x_1 \dots x_n$ ($n \ge 2$), replace every terminal $a$ among the $x_i$ with a new variable $B_a$ and add the production $B_a \rightarrow a$. The production becomes $A \rightarrow C_1 \dots C_n$, where $C_i = x_i$ if $x_i \in V$ and $C_i = B_a$ if $x_i = a$.
    • Result: All productions are $A \rightarrow a$ or $A \rightarrow C_1 \dots C_n$ (each $C_i$ a variable).
  3. Step 2 (Reduce RHS Length): Create the final grammar $G'$.
    • Keep productions $A \rightarrow a$ and $A \rightarrow C_1 C_2$.
    • For each production $A \rightarrow C_1 C_2 \dots C_n$ ($n > 2$), introduce new variables $D_1, \dots, D_{n-2}$ and replace it with the chain $A \rightarrow C_1 D_1$, $D_1 \rightarrow C_2 D_2$, ..., $D_{n-2} \rightarrow C_{n-1} C_n$.
    • Result: All productions are $A \rightarrow BC$ or $A \rightarrow a$.

Usefulness: CNF is required for algorithms like CYK parsing.
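Step 2 of the conversion (reducing RHS length) can be sketched directly; the production encoding and the naming scheme for the fresh $D_i$ variables are our own:

```python
def binarize(productions):
    """Step 2 of the CNF conversion: replace each A -> C1...Cn (n > 2)
    with the chain A -> C1 D1, D1 -> C2 D2, ..., D_{n-2} -> C_{n-1} C_n.
    productions is a list of (variable, rhs-tuple) pairs whose right-hand
    sides contain only variables (terminals were isolated in Step 1)."""
    result, counter = [], 0
    for head, rhs in productions:
        while len(rhs) > 2:
            counter += 1
            fresh = f"D{counter}"                   # new variable D_i
            result.append((head, (rhs[0], fresh)))
            head, rhs = fresh, rhs[1:]
        result.append((head, rhs))
    return result

print(binarize([("S", ("A", "B", "C", "D"))]))
# [('S', ('A', 'D1')), ('D1', ('B', 'D2')), ('D2', ('C', 'D'))]
```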

Greibach Normal Form (GNF)

Definition: A CFG is in Greibach Normal Form (GNF) if all productions are of the form:

  • $A \rightarrow ax$ (where $A \in V$, $a \in T$, $x \in V^*$)

Theorem: Any CFL $L$ with $\lambda \notin L$ has an equivalent grammar in GNF.

Conversion: The conversion process is more complex than for CNF, often involving steps to eliminate left recursion and then systematically substituting to get the required form.

Usefulness: GNF ensures that each step in a leftmost derivation produces exactly one terminal symbol, which is useful in relating CFGs to Pushdown Automata.

6. Pushdown Automata (PDA)

PDAs are automata equipped with a stack as auxiliary storage, allowing them to recognize context-free languages.

Conceptual Model: A finite control unit, an input tape (read-only, moves right), and a stack (LIFO) of infinite capacity.

Formal Definition: A (nondeterministic) pushdown automaton (NPDA) is a 7-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, z, F)$:

  • $Q$: Finite set of states.
  • $\Sigma$: Finite input alphabet.
  • $\Gamma$: Finite stack alphabet.
  • $\delta: Q \times (\Sigma \cup \{\lambda\}) \times \Gamma \rightarrow$ finite subsets of $Q \times \Gamma^*$: The transition function.
  • $q_0 \in Q$: The initial state.
  • $z \in \Gamma$: The initial stack symbol (stack bottom marker).
  • $F \subseteq Q$: The set of final (accepting) states.

Transition Function $\delta(q, a, b)$: Contains pairs $(p, \gamma)$ meaning: if in state $q$, reading input symbol $a$ (or consuming no input when $a = \lambda$), with symbol $b$ on top of the stack, the automaton may move to state $p$, pop $b$ from the stack, and push the string $\gamma$ onto the stack (rightmost symbol first, so the leftmost symbol of $\gamma$ ends up on top).

Instantaneous Description (ID): A triple $(q, w, u)$ representing the current state $q$, the remaining unread input $w$, and the stack contents $u$ (top is leftmost).

Move: $(q_1, aw, bx) \vdash (q_2, w, \gamma x)$ if $(q_2, \gamma) \in \delta(q_1, a, b)$. $\vdash^*$ denotes zero or more moves.

Acceptance (by final state): An NPDA $M$ accepts a string $w \in \Sigma^*$ if $(q_0, w, z) \vdash^* (p, \lambda, u)$ for some state $p \in F$ and any stack content $u \in \Gamma^*$. Language Accepted ($L(M)$): The set of all strings accepted by $M$. (Note: Acceptance by empty stack is an alternative, equivalent definition.)

Example PDA for $L = \{a^n b^n : n \ge 0\}$:

  • Push a symbol (e.g., '1') for each 'a'.
  • Pop a '1' for each 'b'.
  • Accept if the stack is back down to the initial symbol 'z' when the input is finished. This requires transitions like $\delta(q_{read\_a}, a, \cdot) \rightarrow$ push 1 and $\delta(q_{read\_b}, b, 1) \rightarrow$ pop 1, plus state changes and start/end transitions.
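This PDA is in fact deterministic, so it can be simulated with an explicit stack in a few lines. The state names and stack symbols below follow the sketch above but are otherwise our own choices:

```python
def accepts_anbn(w: str) -> bool:
    """Simulate the (deterministic) PDA for {a^n b^n : n >= 0}:
    state q0 pushes a '1' per 'a'; once a 'b' is seen we move to q1
    and pop a '1' per 'b'; accept when the input is exhausted with
    only the bottom marker 'z' left on the stack."""
    stack, state = ["z"], "q0"
    for ch in w:
        if state == "q0" and ch == "a":
            stack.append("1")               # push for each a
        elif ch == "b" and stack[-1] == "1":
            stack.pop()                     # pop for each b
            state = "q1"
        else:
            return False                    # no applicable transition
    return stack == ["z"]

print(accepts_anbn("aaabbb"), accepts_anbn("aab"))  # True False
```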

Deterministic PDA (DPDA): An NPDA is deterministic if:

  1. $|\delta(q, a, b)| \le 1$ for all $q, a, b$ (at most one move possible).
  2. If $\delta(q, \lambda, b) \neq \emptyset$, then $\delta(q, c, b) = \emptyset$ for all $c \in \Sigma$ (no choice between a λ-move and consuming input).

Deterministic CFL (DCFL): A language accepted by some DPDA. The DCFLs are a proper subset of the CFLs (e.g., $\{ww^R\}$ is context-free but not deterministic).

7. Equivalence: PDAs and CFGs

The fundamental connection mirroring FA/Regular Expressions.

Theorem: For any context-free language $L$, there exists an NPDA $M$ such that $L = L(M)$.

Proof Idea (CFG in GNF → NPDA):

  • Construct a 3-state NPDA (states $q_0, q_1, q_f$).
  • Initial move: Push the grammar's start symbol $S$ onto the stack: $\delta(q_0, \lambda, z) = \{(q_1, Sz)\}$.
  • Simulation move: For each grammar production $A \rightarrow a x_1 x_2 \dots x_k$ (where $a \in T$, $x_i \in V$), add $(q_1, x_1 x_2 \dots x_k)$ to $\delta(q_1, a, A)$. This reads the terminal $a$, pops variable $A$, and pushes the variable part of the RHS.
  • Final move: Accept when the stack is empty (down to $z$): $\delta(q_1, \lambda, z) = \{(q_f, z)\}$.
  • The PDA effectively simulates a leftmost derivation, using its stack to hold the pending variables.
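The simulation move can be sketched as a backtracking search over the stack contents; the GNF grammar for $\{a^n b^n : n \ge 1\}$ and the production encoding are our own illustrative choices:

```python
def gnf_pda_accepts(w, productions, start="S"):
    """Simulate the NPDA built from a GNF grammar: the stack holds the
    pending variables, and each production A -> a x1...xk becomes the
    move "read a, pop A, push x1...xk". Nondeterminism is resolved by
    a backtracking search. productions: list of (A, a, vars-tuple)."""
    def step(i, stack):
        if i == len(w):
            return not stack                 # input consumed, stack empty
        if not stack:
            return False
        top, rest = stack[0], stack[1:]
        for head, a, xs in productions:
            if head == top and a == w[i] and step(i + 1, tuple(xs) + rest):
                return True
        return False
    return step(0, (start,))

# GNF grammar for {a^n b^n : n >= 1}: S -> aSB | aB, B -> b
prods = [("S", "a", ("S", "B")), ("S", "a", ("B",)), ("B", "b", ())]
print(gnf_pda_accepts("aabb", prods), gnf_pda_accepts("aab", prods))  # True False
```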

Theorem: If $L = L(M)$ for some NPDA $M$, then $L$ is a context-free language.

Proof Idea (NPDA → CFG):

  • More complex. Assume (wlog) that the NPDA $M$ satisfies certain conditions (a single final state, entered only with an empty stack; moves push or pop at most one symbol).
  • Create grammar variables of the form $(q_i A q_j)$, intended to represent the set of input strings $w$ that take $M$ from state $q_i$ to state $q_j$ while consuming $w$ and netting the removal of stack symbol $A$.
  • Grammar Productions:
    • If $\delta(q_i, a, A)$ contains $(q_j, \lambda)$ (pop $A$), add the production $(q_i A q_j) \rightarrow a$.
    • If $\delta(q_i, a, A)$ contains $(q_j, BC)$ (replace $A$ with $BC$), add productions $(q_i A q_k) \rightarrow a (q_j B q_l) (q_l C q_k)$ for all possible states $q_l, q_k$. This simulates popping $B$ (going from $q_j$ to $q_l$) and then popping $C$ (going from $q_l$ to $q_k$).
  • Start Symbol: $(q_0 z q_f)$, where $q_f$ is the unique final state.
  • The construction ensures $(q_0 z q_f) \Rightarrow^* w$ if and only if $M$ accepts $w$.

These theorems establish that NPDAs and CFGs define the same class of languages: the context-free languages.

8. Properties of Context-Free Languages

How do CFLs behave under operations? Can we decide things about them?

Closure Properties

  • Closed Under: Union, concatenation, Kleene star, homomorphism, reversal.
  • NOT Closed Under: Intersection ($L_1 = \{a^n b^n c^m\}$ and $L_2 = \{a^n b^m c^m\}$ are CFLs, but $L_1 \cap L_2 = \{a^n b^n c^n\}$ is not) and complementation (follows from non-closure under intersection via De Morgan's laws).
  • Closed Under Regular Intersection: If $L_1$ is context-free and $L_2$ is regular, then $L_1 \cap L_2$ is context-free. The proof constructs a PDA that simulates the original PDA and a DFA for $L_2$ simultaneously, using state pairs.

Pumping Lemma for Context-Free Languages

Similar to the regular version, but reflects the structure of parse trees. Used to prove languages are not context-free.

Pumping Lemma for CFLs: Let $L$ be a context-free language. Then there exists a constant $m \ge 1$ (the pumping length) such that any string $w \in L$ with $|w| \ge m$ can be divided into five substrings, $w = uvxyz$, satisfying:

  1. $|vxy| \le m$ (the "pumpable" region is bounded).
  2. $|vy| \ge 1$ (at least one of the pumped parts is non-empty).
  3. For all integers $i \ge 0$, the string $w_i = uv^i xy^i z$ is also in $L$.

Proof Idea: For a sufficiently long $w$, its derivation tree (using a simplified grammar, e.g., CNF) must be tall. A tall tree must have a long path from the root to a leaf, and since there are finitely many variables, some variable $A$ must repeat on such a path ($S \Rightarrow^* uAz \Rightarrow^* uvAyz \Rightarrow^* uvxyz$). The substring $vxy$ is the yield of the subtree rooted at the upper $A$; $x$ is the yield of the subtree rooted at the lower $A$. The constraint $|vxy| \le m$ comes from bounding the height of subtrees without repeated variables; $|vy| \ge 1$ holds because the grammar has no useless, λ-, or unit productions. Pumping corresponds to repeating ($i > 1$) or removing ($i = 0$) the derivation segment $A \Rightarrow^* vAy$.

Using the CFL Pumping Lemma: The same "game" as in the regular case, but the adversary has more freedom in choosing the decomposition $uvxyz$ (only $|vxy| \le m$ and $|vy| \ge 1$ are guaranteed). You must show that for every valid decomposition, pumping ($uv^i xy^i z$) leads to a contradiction for some $i$.

Example: $L = \{a^n b^n c^n : n \ge 0\}$ is not context-free. Assume it is, and let $m$ be the pumping length. Choose $w = a^m b^m c^m$. The adversary decomposes $w = uvxyz$ with $|vxy| \le m$ and $|vy| \ge 1$.

  • Case 1: $vxy$ contains only one type of symbol (e.g., only $a$'s). Then $w_0 = uxz$ (or $w_2 = uv^2xy^2z$) has unequal numbers of $a$'s, $b$'s, and $c$'s. Contradiction.
  • Case 2: $vxy$ contains symbols of two types (e.g., $a$'s and $b$'s, or $b$'s and $c$'s); it cannot contain all three because $|vxy| \le m$. If $vxy$ has $a$'s and $b$'s, pump with $i = 2$: $w_2$ has more $a$'s or $b$'s (or both) than $c$'s. Contradiction. The case of $b$'s and $c$'s is symmetric.

Since all possible valid decompositions lead to a contradiction, $L$ is not context-free.
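The case analysis can even be checked by brute force for a small pumping length: enumerate every decomposition $w = uvxyz$ with $|vxy| \le m$ and $|vy| \ge 1$, and verify that pumping breaks membership every time. A sketch (function names are ours):

```python
def in_L(w: str) -> bool:
    """Membership in {a^n b^n c^n : n >= 0}."""
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n

def survives_pumping(w: str, m: int) -> bool:
    """Return True if SOME decomposition w = uvxyz with |vxy| <= m and
    |vy| >= 1 keeps u v^i x y^i z in L for i = 0, 1, 2. The pumping
    lemma demands such a decomposition, so False refutes it for w."""
    n = len(w)
    for start in range(n):                        # where v begins
        for vlen in range(m + 1):
            for xlen in range(m + 1 - vlen):
                for ylen in range(m + 1 - vlen - xlen):
                    if vlen + ylen == 0 or start + vlen + xlen + ylen > n:
                        continue
                    u = w[:start]
                    v = w[start:start + vlen]
                    x = w[start + vlen:start + vlen + xlen]
                    y = w[start + vlen + xlen:start + vlen + xlen + ylen]
                    z = w[start + vlen + xlen + ylen:]
                    if all(in_L(u + v * i + x + y * i + z) for i in (0, 1, 2)):
                        return True
    return False

m = 4
print(survives_pumping("a" * m + "b" * m + "c" * m, m))  # False
```

No decomposition survives, matching the contradiction argument above (this checks one candidate $m$; the proof rules out all of them).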

Other Non-CFL Examples: $\{ww : w \in \{a, b\}^*\}$, $\{a^{n!} : n \ge 0\}$, $\{a^n b^j : n = j^2\}$.

Decision Problems

Compared to regular languages, fewer properties are decidable for CFLs:

  • Decidable:
    • Membership ($w \in L(G)$?): Yes (e.g., the CYK algorithm).
    • Emptiness ($L(G) = \emptyset$?): Yes (check whether the start symbol $S$ can derive any terminal string, as in the useless-variable computation).
    • Finiteness/Infiniteness (is $L(G)$ infinite?): Yes (check whether the grammar's dependency graph has a cycle through a variable that is reachable from $S$ and can derive a terminal string).
  • Undecidable:
    • Ambiguity: Is CFG $G$ ambiguous?
    • Equivalence: Is $L(G_1) = L(G_2)$?
    • Inclusion: Is $L(G_1) \subseteq L(G_2)$?
    • Regularity: Is $L(G)$ regular?
    • Intersection Emptiness: Is $L(G_1) \cap L(G_2) = \emptyset$? (Proven via reduction from the Post Correspondence Problem.)
    • Universality: Is $L(G) = \Sigma^*$?

Conclusion

Context-free languages, generated by CFGs and recognized by PDAs, significantly expand our descriptive power beyond regular languages, capturing essential syntactic structures like nesting found in programming languages and arithmetic expressions. We've seen how to manipulate CFGs (simplification, normal forms), the power and limitations of PDAs (especially the difference between deterministic and nondeterministic variants), and how CFL properties (closure, pumping lemma, decidability) differ from those of regular languages. While more powerful, the increased complexity leads to fundamental limitations, with many important questions about arbitrary CFLs being algorithmically undecidable.