Exploring Context-Free Languages and Pushdown Automata
Welcome back, language theory explorers! In our last discussion, we thoroughly examined regular languages and their corresponding finite automata. We saw their power but also their limitations – they struggle with structures requiring unbounded memory, like matching nested parentheses or ensuring equal counts of different symbols (e.g., {a^n b^n : n ≥ 0}).
Today, we move up the Chomsky hierarchy to Context-Free Languages (CFLs). These languages are generated by Context-Free Grammars (CFGs) and recognized by Pushdown Automata (PDA). They form the theoretical backbone for describing the syntax of most programming languages. Let's unravel the details!
1. Context-Free Grammars (CFGs)
We increase the generative power of grammars by relaxing the restrictions found in regular grammars.
Definition: A grammar G = (V, T, S, P) is context-free if every production in P has the form A → x, where A ∈ V (a single variable) and x ∈ (V ∪ T)* (any string of variables and terminals).
A language L is context-free if there exists a CFG G such that L = L(G).
Key Difference: Unlike regular grammars (right/left-linear), the right-hand side x can be any mix of variables and terminals. The "context-free" name comes from the fact that the variable A can be replaced by x regardless of the context (surrounding symbols) in which A appears in a sentential form.
Examples:
- S → aSb | λ. Generates {a^n b^n : n ≥ 0}. This language is context-free but not regular.
- S → aSa | bSb | λ. Generates {w w^R : w ∈ {a, b}*}. This language (even-length palindromes) is also context-free but not regular.
- S → SS | aSb | bSa | λ. Generates {w ∈ {a, b}* : n_a(w) = n_b(w)} (balanced numbers of a's and b's). Context-free, not regular. This grammar is not linear (due to S → SS), unlike the first two examples.
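To make the generative process tangible, here is a minimal Python sketch (the encoding and function name are my own, not from any standard library) that enumerates the short strings derivable from the first grammar above by breadth-first expansion of sentential forms. It relies on the fact that terminals never disappear once produced; grammars with productions like S → SS would need extra pruning.

```python
from collections import deque

def generate(productions, start, max_len):
    """Enumerate terminal strings of length <= max_len derivable from `start`.

    `productions` maps a variable to a list of right-hand sides (strings);
    uppercase letters are variables, everything else is a terminal, and
    λ is written as the empty string.
    """
    results, seen = set(), {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        pos = next((i for i, c in enumerate(form) if c.isupper()), None)
        if pos is None:                      # no variables left: a generated string
            results.add(form)
            continue
        for rhs in productions[form[pos]]:   # replace the leftmost variable
            new = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(not c.isupper() for c in new)
            if terminals <= max_len and new not in seen:  # terminals never disappear
                seen.add(new)
                queue.append(new)
    return sorted(results, key=len)

# S -> aSb | λ
print(generate({"S": ["aSb", ""]}, "S", 6))   # ['', 'ab', 'aabb', 'aaabbb']
```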
Since every regular grammar is a context-free grammar (it satisfies the CFG production form), the family of regular languages is a proper subset of the context-free languages.
2. Derivations, Sentential Forms, and Trees
Understanding how CFGs generate strings involves derivations and their visual representation.
Leftmost and Rightmost Derivations
When a sentential form contains multiple variables, we have a choice of which one to replace next. To standardize this:
- Leftmost Derivation: In each step, the leftmost variable in the sentential form is replaced.
- Rightmost Derivation: In each step, the rightmost variable is replaced.
Example: Grammar: S → AB, A → aaA | λ, B → Bb | λ. String: aab.
- Leftmost: S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab
- Rightmost: S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab
For any string derivable in a CFG, there exists at least one leftmost and at least one rightmost derivation.
Sentential Forms and Derivation (Parse) Trees
A derivation tree (or parse tree) graphically represents a derivation, independent of the order of rule application.
Definition: An ordered tree is a derivation tree for CFG G = (V, T, S, P) if:
- The root is labeled S.
- Every leaf is labeled with a symbol from T ∪ {λ}.
- Every interior node is labeled with a symbol from V.
- If an interior node is labeled A and its children (left-to-right) are labeled a1, a2, …, an, then A → a1a2…an must be a production in P.
- A leaf labeled λ has no siblings.
- Partial Derivation Tree: Satisfies the last three conditions, but the root need not be labeled S, and leaves may also be labeled with variables.
- Yield: The string formed by reading the labels of the leaves from left to right (omitting any λ leaves).
Theorem:
- Every w ∈ L(G) is the yield of some derivation tree for G.
- The yield of any derivation tree for G is in L(G).
- The yield of any partial derivation tree rooted at S is a sentential form of G.
Derivation trees capture the structure and application of rules, while leftmost/rightmost derivations impose an order.
3. Parsing and Ambiguity
- Parsing: The process of finding a derivation (or derivation tree) for a given string w according to a grammar G. It essentially determines whether w ∈ L(G) and reveals the syntactic structure of w.
- Membership Problem: Deciding whether w ∈ L(G).
Parsing Algorithms
- Exhaustive Search (Brute-Force): Systematically generate all possible (say, leftmost) derivations of length 1, length 2, etc., and check whether w is produced. This works but can be extremely inefficient (potentially exponential in |w|) and might not terminate if the grammar has λ-productions or unit-production cycles that allow sentential forms to shrink or loop. (A brute-force sketch appears after this list.)
- Termination: If the grammar has no λ-productions and no unit productions (A → B), exhaustive search is guaranteed to terminate: every derivation step either increases the length of the sentential form or the number of terminal symbols, so any derivation of w has at most 2|w| steps, and membership is decided after finitely many rounds.
- CYK Algorithm: A more efficient dynamic programming algorithm for membership (and parsing). Requires the grammar to be in Chomsky Normal Form (see below). Runs in O(|w|^3) time for a fixed grammar.
- LL and LR Parsers: For restricted classes of CFGs (LL and LR grammars), linear-time (O(|w|)) parsing is possible. These are crucial for compiler construction.
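Here is a brute-force sketch of that idea in Python, assuming a λ-free, unit-free grammar encoded as a dictionary (the encoding and names are mine). It searches leftmost derivations breadth-first and prunes sentential forms that are already longer than the target string or whose fixed terminal prefix diverges from it:

```python
from collections import deque

def exhaustive_parse(productions, start, w):
    """Decide membership of w by brute-force search over leftmost derivations.

    Assumes no λ-productions and no unit productions, so sentential forms
    never shrink and derivations of w are bounded in length.
    Uppercase letters are variables; everything else is a terminal.
    """
    queue = deque([start])
    seen = {start}
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        pos = next((i for i, c in enumerate(form) if c.isupper()), None)
        if pos is None:
            continue                          # terminal string != w, dead end
        # Terminals to the left of the leftmost variable are fixed forever,
        # so they must form a prefix of w.
        if form[:pos] != w[:pos]:
            continue
        for rhs in productions[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            if len(new) <= len(w) and new not in seen:   # forms never shrink
                seen.add(new)
                queue.append(new)
    return False

# Unambiguous, λ-free, unit-free grammar for {a^n b^n : n >= 1}: S -> aSb | ab
g = {"S": ["aSb", "ab"]}
print(exhaustive_parse(g, "S", "aabb"))   # True
print(exhaustive_parse(g, "S", "aab"))    # False
```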
Ambiguity
Definition: A CFG G is ambiguous if there exists some string w ∈ L(G) that has two or more distinct derivation trees (or, equivalently, two or more distinct leftmost derivations, or two or more distinct rightmost derivations).
Example: The grammar E → E + E | E * E | (E) | a | b | c is ambiguous for expressions like a+b*c: one tree groups a+b first, the other groups b*c first.
Resolving Ambiguity:
- External Rules: Impose precedence rules (e.g., * before +).
- Rewrite Grammar: Modify the grammar to enforce precedence and associativity, making it unambiguous. Example: Introduce Term (T) and Factor (F) non-terminals: E → E + T | T, T → T * F | F, F → (E) | a | b | c. This grammar only allows the parse corresponding to standard operator precedence for a+b*c (see the sketch below).
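One common way to realize this unambiguous grammar is a recursive-descent parser in which the left recursion of E → E + T and T → T * F is replaced by iteration (which also preserves left associativity). The sketch below, with function names of my own choosing, builds a tree for a+b*c and groups b*c first:

```python
def parse_expression(tokens, pos=0):
    """E -> E + T | T, implemented iteratively to avoid left recursion."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)            # left-associative grouping
    return node, pos

def parse_term(tokens, pos):
    """T -> T * F | F"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_factor(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos

def parse_factor(tokens, pos):
    """F -> (E) | a | b | c"""
    if tokens[pos] == "(":
        node, pos = parse_expression(tokens, pos + 1)
        return node, pos + 1                 # skip the closing ")"
    return tokens[pos], pos + 1              # an identifier

tree, _ = parse_expression(list("a+b*c"))
print(tree)   # ('+', 'a', ('*', 'b', 'c'))  -- * binds tighter than +
```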
Definition:
- A CFL is unambiguous if there exists at least one unambiguous CFG that generates it.
- A CFL is inherently ambiguous if every CFG that generates it is ambiguous. Example: L = {a^n b^n c^m : n, m ≥ 0} ∪ {a^n b^m c^m : n, m ≥ 0}. A string of the form a^n b^n c^n can be derived in two essentially different ways, reflecting the two parts of the union, and it can be shown that every grammar for L is ambiguous.
Ambiguity is undesirable in programming languages where unique interpretation is required. Deciding if an arbitrary CFG is ambiguous, or if two CFGs are equivalent, are generally undecidable problems.
4. Simplifying Context-Free Grammars
For theoretical proofs and practical algorithms (like CYK), it's useful to simplify CFGs by removing certain types of productions without changing the language generated (except possibly for λ).
Assumption: For simplification, we often focus on λ-free languages (λ ∉ L). If needed, λ can be added back later via a new start symbol S0 with productions S0 → S | λ.
Key Simplifications:
- Eliminating λ-Productions:
  - Nullable Variable: A variable A is nullable if A ⇒* λ.
  - Procedure: (a) Find the set of all nullable variables. (Start with every A that has a production A → λ; then add B whenever B → A1A2…An with all Ai nullable; repeat until nothing changes.) (b) Build a new production set: for each production A → x1x2…xm in the original P, add this production together with all productions obtained by deleting one or more nullable xi. If all xi are deleted, do not add A → λ. (A short code sketch of step (a) follows the theorem at the end of this section.)
  - Result: A grammar without λ-productions that generates L(G) − {λ}.
- Eliminating Unit Productions:
  - Unit Production: A production of the form A → B, where A, B ∈ V.
  - Procedure: (a) Find all pairs (A, B) such that A ⇒* B using only unit productions (a dependency graph helps). (b) Initialize the new production set with all non-unit productions from the original P. (c) For every pair (A, B) found in step (a) and every non-unit production B → y in the original P, add A → y to the new set.
  - Result: An equivalent grammar without unit productions.
- Eliminating Useless Productions:
  - Useless Variable/Production: A variable A is useless if it cannot participate in the derivation of any terminal string (i.e., either S ⇒* xAy holds for no x, y, or A ⇒* w holds for no w ∈ T*). A production involving a useless variable is useless.
  - Procedure: (a) (Can reach terminals) Find the set of variables that can derive a terminal string: start with every A having a production A → w with w ∈ T*; add B whenever B → x and every symbol of x is a terminal or a variable already in the set; repeat. Keep only productions whose symbols are terminals or variables in this set. (b) (Reachable from S) Among the remaining productions, find all variables reachable from the start symbol S (e.g., using a dependency graph), and keep only productions involving reachable variables and terminals.
  - Result: An equivalent grammar with no useless variables or productions.
Theorem: Any CFL not containing λ has an equivalent CFG with no λ-productions, no unit productions, and no useless productions. (Apply the eliminations in the order: λ, unit, useless).
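As an illustration of step (a) of the λ-elimination procedure, here is a minimal fixed-point computation in Python (the grammar encoding is an assumption of mine); the same iterate-until-stable pattern also underlies the unit-pair and useful-variable computations:

```python
def nullable_variables(productions):
    """Return the set of nullable variables: those A with A =>* λ.

    `productions` is a list of (variable, rhs) pairs, where rhs is a tuple of
    symbols, variables are uppercase strings, and λ is the empty tuple.
    """
    nullable = {A for A, rhs in productions if rhs == ()}
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for A, rhs in productions:
            if A not in nullable and rhs and all(sym in nullable for sym in rhs):
                nullable.add(A)
                changed = True
    return nullable

# S -> ABaC, A -> BC, B -> b | λ, C -> D | λ, D -> d
P = [("S", ("A", "B", "a", "C")), ("A", ("B", "C")),
     ("B", ("b",)), ("B", ()), ("C", ("D",)), ("C", ()), ("D", ("d",))]
print(nullable_variables(P))   # {'A', 'B', 'C'}
```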
5. Normal Forms for CFGs
Normal forms are restricted grammar structures that are still powerful enough to generate all CFLs (except possibly λ).
Chomsky Normal Form (CNF)
Definition: A CFG is in Chomsky Normal Form (CNF) if all productions are of the form:
- A → BC (where A, B, C ∈ V)
- A → a (where A ∈ V and a ∈ T)
Theorem: Any CFL L with λ ∉ L has an equivalent grammar in CNF.
Conversion Algorithm:
- Start with a CFG having no λ, unit, or useless productions.
- Step 1 (Isolate Terminals): Create an intermediate grammar.
  - Keep productions of the form A → a.
  - For each production A → x1x2…xn (n ≥ 2), replace every terminal a occurring in the right-hand side with a new variable Ba and add the production Ba → a. The original production becomes A → C1C2…Cn, where Ci = xi if xi is a variable and Ci = Ba if xi is the terminal a.
  - Result: All productions are of the form A → a or A → C1C2…Cn with every Ci a variable (n ≥ 2).
- Step 2 (Reduce RHS Length): Create the final grammar.
  - Keep productions A → a and A → C1C2.
  - For productions A → C1C2…Cn with n > 2, introduce new variables D1, D2, …, D(n−2) and replace the production with the set: A → C1D1, D1 → C2D2, …, D(n−2) → C(n−1)Cn.
  - Result: All productions are of the form A → a or A → BC.
Usefulness: CNF is required for algorithms like CYK parsing.
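For concreteness, here is a compact CYK membership sketch in Python, assuming the grammar is already in CNF and is represented by two dictionaries (this representation is my own choice; the sample grammar is one CNF grammar for {a^n b^n : n ≥ 1}):

```python
def cyk_member(w, terminal_rules, binary_rules, start="S"):
    """CYK membership test for a CNF grammar.

    terminal_rules: maps a terminal a to the set of variables A with A -> a.
    binary_rules:   maps a pair (B, C) to the set of variables A with A -> BC.
    table[l][i] holds the variables deriving the substring of length l at i.
    """
    n = len(w)
    if n == 0:
        return False                       # a CNF grammar cannot derive λ
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, a in enumerate(w):
        table[1][i] = set(terminal_rules.get(a, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):          # w[i:i+split] + w[i+split:i+length]
                for B in table[split][i]:
                    for C in table[length - split][i + split]:
                        table[length][i] |= binary_rules.get((B, C), set())
    return start in table[n][0]

# CNF grammar for {a^n b^n : n >= 1}: S -> AT | AB, T -> SB, A -> a, B -> b
terminal = {"a": {"A"}, "b": {"B"}}
binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}
print(cyk_member("aabb", terminal, binary))   # True
print(cyk_member("abab", terminal, binary))   # False
```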
Greibach Normal Form (GNF)
Definition: A CFG is in Greibach Normal Form (GNF) if all productions are of the form:
- A → ax (where a ∈ T and x ∈ V*)
Theorem: Any CFL L with λ ∉ L has an equivalent grammar in GNF.
Conversion: The conversion process is more complex than for CNF, often involving steps to eliminate left recursion and then systematically substituting to get the required form.
Usefulness: GNF ensures that each step in a leftmost derivation produces exactly one terminal symbol, which is useful in relating CFGs to Pushdown Automata.
6. Pushdown Automata (PDA)
PDAs are automata equipped with a stack as auxiliary storage, allowing them to recognize context-free languages.
Conceptual Model: A finite control unit, an input tape (read-only, moves right), and a stack (LIFO) of infinite capacity.
Formal Definition: A (nondeterministic) pushdown automaton (NPDA) is a 7-tuple M = (Q, Σ, Γ, δ, q0, z, F):
- Q: Finite set of states.
- Σ: Finite input alphabet.
- Γ: Finite stack alphabet.
- δ: Q × (Σ ∪ {λ}) × Γ → finite subsets of Q × Γ*: The transition function.
- q0 ∈ Q: The initial state.
- z ∈ Γ: The initial stack symbol (stack bottom marker).
- F ⊆ Q: The set of final (accepting) states.
Transition Function δ: δ(q, a, b) contains pairs (q', x) meaning: if in state q, reading input a (or reading nothing, if a = λ), with symbol b on top of the stack, the automaton can transition to state q', pop b from the stack, and push the string x onto the stack (rightmost symbol of x first, so the leftmost symbol of x becomes the new top).
Instantaneous Description (ID): A triplet (q, w, u) representing the current state q, the remaining unread input w, and the stack contents u (top is leftmost).
Move: (q, aw, bu) ⊢ (q', w, xu) if (q', x) ∈ δ(q, a, b). ⊢* denotes zero or more moves.
Acceptance (by final state): An NPDA M accepts a string w if (q0, w, z) ⊢* (qf, λ, u) for some qf ∈ F and any stack content u. Language Accepted: L(M) is the set of all strings accepted by M. (Note: Acceptance by empty stack is an alternative, equivalent definition.)
Example PDA for {a^n b^n : n ≥ 0}:
- Push a symbol (e.g., '1') onto the stack for each 'a' read.
- Pop a '1' for each 'b' read.
- Accept if the stack is empty (back to the initial symbol z) when the input is finished. This requires transitions like δ(q0, a, 1) = {(q0, 11)} and δ(q1, b, 1) = {(q1, λ)}, plus transitions to start pushing on z, to switch from the pushing state to the popping state on the first 'b', and to move to a final state when z reappears. (A small simulator sketch follows this example.)
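To see these transitions in action, here is a small nondeterministic PDA simulator sketch in Python (the configuration-exploration approach and all names are mine). It follows the conventions above: λ-moves use the empty string, the leftmost pushed symbol becomes the new stack top, and acceptance is by final state:

```python
def npda_accepts(delta, start, stack_start, finals, w):
    """Depth-first simulation of an NPDA, accepting by final state.

    delta maps (state, input_symbol_or_'', stack_top) to a set of
    (new_state, pushed_string) pairs; the leftmost pushed symbol becomes
    the new stack top, matching the ID convention above.
    """
    seen = set()

    def run(state, pos, stack):
        if (state, pos, stack) in seen:
            return False                      # avoid re-exploring configurations
        seen.add((state, pos, stack))
        if pos == len(w) and state in finals:
            return True
        if not stack:
            return False
        top, rest = stack[0], stack[1:]
        moves = []
        if pos < len(w):
            moves += [(p, x, pos + 1) for p, x in delta.get((state, w[pos], top), ())]
        moves += [(p, x, pos) for p, x in delta.get((state, "", top), ())]   # λ-moves
        return any(run(p, new_pos, x + rest) for p, x, new_pos in moves)

    return run(start, 0, stack_start)

# NPDA for {a^n b^n : n >= 0}: push a '1' per 'a', pop one per 'b', accept on z.
delta = {
    ("q0", "a", "z"): {("q0", "1z")}, ("q0", "a", "1"): {("q0", "11")},
    ("q0", "b", "1"): {("q1", "")},   ("q1", "b", "1"): {("q1", "")},
    ("q0", "", "z"): {("qf", "z")},   ("q1", "", "z"): {("qf", "z")},
}
print(npda_accepts(delta, "q0", "z", {"qf"}, "aaabbb"))  # True
print(npda_accepts(delta, "q0", "z", {"qf"}, "aab"))     # False
```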
Deterministic PDA (DPDA): An NPDA is deterministic if:
- δ(q, a, b) contains at most one element, for every q ∈ Q, a ∈ Σ ∪ {λ}, b ∈ Γ. (At most one move possible.)
- If δ(q, λ, b) is non-empty, then δ(q, c, b) is empty for every c ∈ Σ. (No choice between a λ-move and consuming input.)
Deterministic CFL (DCFL): A language accepted by some DPDA. The DCFLs are a proper subset of the CFLs (e.g., {w w^R : w ∈ {a, b}+} is context-free but not deterministic).
7. Equivalence: PDAs and CFGs
This is the fundamental connection, mirroring the equivalence between finite automata and regular expressions.
Theorem: For any context-free language L, there exists an NPDA M such that L = L(M).
Proof Idea (CFG in GNF → NPDA):
- Construct a 3-state NPDA with states {q0, q1, qf}.
- Initial move: Push the grammar's start symbol onto the stack: δ(q0, λ, z) = {(q1, Sz)}.
- Simulation move: For each grammar production A → ax (where a ∈ T and x ∈ V*, as guaranteed by GNF), add a PDA transition (q1, x) ∈ δ(q1, a, A). This reads the terminal a, pops the variable A, and pushes the variable part x of the right-hand side.
- Final move: Accept when the stack is empty (down to z): δ(q1, λ, z) = {(qf, z)}.
- The PDA effectively simulates a leftmost derivation using its stack to hold the pending variables.
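Under the stated assumptions, the construction is mechanical enough to sketch in a few lines of Python: each GNF production A → ax becomes a transition that reads a, pops A, and pushes x. The helper below (names are mine) produces a transition table in the same format as the npda_accepts simulator sketched earlier, so the two can be combined:

```python
def gnf_to_npda_delta(gnf_productions, start_variable="S", stack_bottom="z"):
    """Build the transition table of the 3-state NPDA (q0, q1, qf) that
    simulates leftmost derivations of a grammar in Greibach Normal Form.

    gnf_productions: list of (A, a, x) triples for productions A -> a x,
    where a is a terminal and x is a (possibly empty) string of variables.
    """
    delta = {("q0", "", stack_bottom): {("q1", start_variable + stack_bottom)},
             ("q1", "", stack_bottom): {("qf", stack_bottom)}}
    for A, a, x in gnf_productions:
        delta.setdefault(("q1", a, A), set()).add(("q1", x))   # read a, pop A, push x
    return delta

# GNF grammar for {a^n b^n : n >= 1}: S -> aSB | aB, B -> b
gnf = [("S", "a", "SB"), ("S", "a", "B"), ("B", "b", "")]
delta = gnf_to_npda_delta(gnf)
# Assumes npda_accepts from the simulator sketch above is in scope.
print(npda_accepts(delta, "q0", "z", {"qf"}, "aabb"))   # True
print(npda_accepts(delta, "q0", "z", {"qf"}, "abb"))    # False
```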
Theorem: If L = L(M) for some NPDA M, then L is a context-free language.
Proof Idea (NPDA → CFG):
- More complex. Assume (without loss of generality) that the NPDA satisfies certain conditions: it has a single final state that is entered only when the stack is empty, and every move either pops the top symbol or replaces it with exactly two symbols.
- Create grammar variables of the form (qi A qj), intended to represent the set of input strings that take M from state qi to state qj with the net effect of popping the stack symbol A.
- Grammar Productions:
  - If δ(qi, a, A) contains (qj, λ) (pop A), add the production (qi A qj) → a.
  - If δ(qi, a, A) contains (qj, BC) (replace A with BC), add the productions (qi A ql) → a (qj B qk)(qk C ql) for all possible states qk, ql. This simulates popping B (going from qj to qk) and then popping C (going from qk to ql).
- Start Symbol: (q0 z qf), where qf is the unique final state.
- The construction ensures that w ∈ L(G) if and only if M accepts w.
These theorems establish that NPDAs and CFGs define the same class of languages: the context-free languages.
8. Properties of Context-Free Languages
How do CFLs behave under operations? Can we decide things about them?
Closure Properties
- Closed Under: Union, Concatenation, Kleene Star, Homomorphism, Reversal.
- NOT Closed Under: Intersection (L1 = {a^n b^n c^m : n, m ≥ 0} and L2 = {a^n b^m c^m : n, m ≥ 0} are CFLs, but L1 ∩ L2 = {a^n b^n c^n : n ≥ 0} is not) or Complementation (follows from non-closure under intersection via De Morgan's laws).
- Closed Under Regular Intersection: If L1 is context-free and L2 is regular, then L1 ∩ L2 is context-free. The proof constructs a PDA that simulates the original PDA for L1 and a DFA for L2 simultaneously, using pairs of states.
Pumping Lemma for Context-Free Languages
Similar to the regular version, but reflects the structure of parse trees. Used to prove languages are not context-free.
Pumping Lemma for CFLs: Let L be a context-free language. Then there exists a constant m (the pumping length) such that any string w ∈ L with |w| ≥ m can be divided into five substrings, w = uvxyz, satisfying:
- |vxy| ≤ m (the "pumpable" region is bounded).
- |vy| ≥ 1 (at least one of the pumped parts is non-empty).
- For all integers i ≥ 0, the string u v^i x y^i z is also in L.
Proof Idea: For a sufficiently long w, its derivation tree (using a simplified grammar, e.g., in CNF) must be tall. A tall tree must have a long path from the root to some leaf. Since there are finitely many variables, some variable A must repeat on a long enough path (A ⇒* vAy ⇒* vxy). The substring vxy is the yield of the subtree rooted at the upper occurrence of A; x is the yield of the subtree rooted at the lower occurrence. The bound |vxy| ≤ m comes from bounding the height of subtrees with no repeated variable, and |vy| ≥ 1 holds because we assume a grammar with no useless, λ-, or unit productions. Pumping corresponds to repeating (i > 1) or removing (i = 0) the derivation segment A ⇒* vAy.
Using the CFL Pumping Lemma: Similar "game" as the regular version, but the adversary has more freedom in choosing the decomposition (only |vxy| ≤ m and |vy| ≥ 1 are guaranteed). You must show that for every valid decomposition, some choice of i makes u v^i x y^i z fall outside L, yielding a contradiction.
Example: L = {a^n b^n c^n : n ≥ 0} is not context-free. Assume it is, and let m be the pumping length. Choose w = a^m b^m c^m. The adversary decomposes w = uvxyz with |vxy| ≤ m and |vy| ≥ 1. Case 1: vy contains only one type of symbol (e.g., only a's). Then u v^0 x y^0 z (or u v^2 x y^2 z) has unequal numbers of a's, b's, and c's. Contradiction. Case 2: vy contains symbols of two types (a's and b's, or b's and c's). It cannot contain all three types because |vxy| ≤ m. If vy has a's and b's, pump with i = 2: the result has more a's or b's (or both) than c's. Contradiction. Similarly if vy has b's and c's. Since all possible valid decompositions lead to a contradiction, L is not context-free.
Other Non-CFL Examples: {ww : w ∈ {a, b}*}, {a^(n!) : n ≥ 0}, {a^p : p prime}.
Decision Problems
Compared to regular languages, fewer properties are decidable for CFLs:
- Decidable:
- Membership (w ∈ L(G)?): Yes (e.g., the CYK algorithm).
- Emptiness (L(G) = ∅?): Yes (check whether the start symbol can derive any terminal string; this is the same fixed-point computation used to find useful variables; see the sketch after this list).
- Finiteness/Infiniteness (Is L(G) infinite?): Yes (after simplification, check whether the grammar's dependency graph has a cycle through a variable that is reachable from S and capable of producing terminals).
- Undecidable:
- Ambiguity: Is a given CFG G ambiguous?
- Equivalence: Is L(G1) = L(G2)?
- Inclusion: Is L(G1) ⊆ L(G2)?
- Regularity: Is L(G) regular?
- Intersection Emptiness: Is L(G1) ∩ L(G2) = ∅? (Proven via reduction from the Post Correspondence Problem.)
- Universality: Is L(G) = Σ*?
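The emptiness test mentioned above reduces to the "generating variables" fixed point from Section 4; a minimal Python sketch (encoding mine) is:

```python
def is_empty(productions, start="S"):
    """Emptiness test: L(G) is empty iff the start symbol derives no terminal string.

    `productions` is a list of (variable, rhs) pairs with rhs a tuple of symbols;
    uppercase strings are variables. Computes the 'generating' variables by a
    fixed point, exactly as in step (a) of useless-production elimination.
    """
    generating = set()
    changed = True
    while changed:
        changed = False
        for A, rhs in productions:
            if A not in generating and all(
                    sym in generating or not sym.isupper() for sym in rhs):
                generating.add(A)
                changed = True
    return start not in generating

# S -> AB | a, B -> b, A -> AA   (A generates nothing, but S -> a does)
P = [("S", ("A", "B")), ("S", ("a",)), ("B", ("b",)), ("A", ("A", "A"))]
print(is_empty(P))                                    # False, since S => a
print(is_empty([("S", ("A",)), ("A", ("A", "a"))]))   # True
```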
Conclusion
Context-free languages, generated by CFGs and recognized by PDAs, significantly expand our descriptive power beyond regular languages, capturing essential syntactic structures like nesting found in programming languages and arithmetic expressions. We've seen how to manipulate CFGs (simplification, normal forms), the power and limitations of PDAs (especially the difference between deterministic and nondeterministic variants), and how CFL properties (closure, pumping lemma, decidability) differ from those of regular languages. While more powerful, the increased complexity leads to fundamental limitations, with many important questions about arbitrary CFLs being algorithmically undecidable.