Parsing Deep Dive - How Compilers Make Sense of Your Code's Structure!
Hey Code Architects! In our previous post on Lexical Analysis, we saw how the Lexical Analyzer chops up your raw source code into a stream of tokens – the basic "words" of your programming language. But just like having words isn't enough to make a meaningful sentence, having tokens isn't enough for the compiler to understand your program.
Enter the Syntax Analyzer, also known as the Parser! This is the compiler's grammar expert. Its job is to take that stream of tokens and determine if they form a grammatically valid structure according to the rules of the programming language. Think of it as checking if your code's "sentences" are correctly constructed. If lexical analysis is about words, syntax analysis is about sentence structure.
Ready to dive into how compilers understand the hierarchy and structure of your code? Let's get parsing!
The Blueprint: Context-Free Grammars (CFGs)
Before a parser can analyze anything, it needs a rulebook that defines the language's syntax. This rulebook is formally known as a Context-Free Grammar (CFG).
A CFG provides a precise and straightforward way to describe how phrases or constructs in a programming language are built from smaller parts. It consists of:
- Terminals: These are the basic symbols or tokens of the language, like keywords (`if`, `while`), identifiers (`x`, `myVar`), operators (`+`, `*`), and literals (`10`, `"hello"`). They are the "words" supplied by the lexical analyzer.
- Non-terminals (Variables): These represent sets of strings of terminals. They define the syntactic structures, like an `expression`, a `statement`, or a `declaration`.
- Start Symbol: One special non-terminal that represents the entire program or the highest-level structure (e.g., `program` or `compilation_unit`). The goal of parsing is to show that the input token stream can be derived from this start symbol.
- Productions (Rules): These specify how non-terminals can be replaced by sequences of terminals and/or other non-terminals. They look like `A -> α`, where `A` is a non-terminal and `α` is a string of terminals and/or non-terminals. The `->` can be read as "can be replaced by" or "derives".
Example CFG for simple arithmetic expressions:
```
E -> E + T   // An expression can be an expression plus a term
E -> T       // Or an expression can simply be a term
T -> T * F   // A term can be a term times a factor
T -> F       // Or a term can simply be a factor
F -> (E)     // A factor can be an expression in parentheses
F -> id      // Or a factor can be an identifier (token representing a variable/number)
```
Here, `E` (Expression), `T` (Term), and `F` (Factor) are non-terminals. `+`, `*`, `(`, `)`, and `id` are terminals. `E` is the start symbol of this mini-grammar.
Using these rules, the parser tries to build a Parse Tree (or Abstract Syntax Tree, AST) for a given sequence of tokens. This tree graphically shows how the start symbol of the grammar derives the input string. For instance, for the input `id + id * id`, a parse tree shows how `E` derives this structure while respecting the rules.
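Spelled out as a leftmost derivation (each step expands the leftmost non-terminal), the parse looks like this; reading top to bottom traces the tree from the root down to the leaves:

```
E => E + T          // E -> E + T
  => T + T          // E -> T
  => F + T          // T -> F
  => id + T         // F -> id
  => id + T * F     // T -> T * F
  => id + F * F     // T -> F
  => id + id * F    // F -> id
  => id + id * id   // F -> id
```

Notice that `id * id` ends up grouped under a single `T`: that is exactly how this grammar encodes `*` binding tighter than `+`.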
Top-Down Parsing: Starting from the Goal
Top-down parsing attempts to find the leftmost derivation for an input string by starting with the grammar's start symbol and repeatedly applying productions to expand non-terminals until the input token stream is matched. Imagine trying to build a parse tree from the root (start symbol) downwards to the leaves (terminals).
Recursive Descent Parsing
This is a straightforward and popular top-down technique: you write a set of mutually recursive procedures, typically one for each non-terminal in the grammar (see the sketch below).

- When the procedure for a non-terminal `A` is called, it decides which production `A -> α` to use based on the current input token.
- It then processes the symbols in `α` from left to right. If a symbol is a non-terminal, it calls the procedure for that non-terminal. If it's a terminal, it matches it against the current input token and advances.
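Here is a minimal sketch of such a parser in Python for our expression grammar. It is illustrative rather than production code: it uses the iterative form `E -> T { + T }` and `T -> F { * F }` to sidestep the left recursion issue discussed in the LL(1) section below, and it assumes the lexer delivers tokens as plain strings.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens          # e.g. ["id", "+", "id", "*", "id"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1                 # consume the terminal and advance

    def parse_E(self):                # E -> T { + T }
        self.parse_T()
        while self.peek() == "+":
            self.match("+")
            self.parse_T()

    def parse_T(self):                # T -> F { * F }
        self.parse_F()
        while self.peek() == "*":
            self.match("*")
            self.parse_F()

    def parse_F(self):                # F -> ( E ) | id
        if self.peek() == "(":
            self.match("(")
            self.parse_E()
            self.match(")")
        else:
            self.match("id")

Parser(["id", "+", "id", "*", "id"]).parse_E()   # parses without error
```

Each `parse_X` procedure mirrors one non-terminal, and the one-token `peek` is exactly the "lookahead" that predictive parsing formalizes below.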
Backtracking
What if a recursive descent parser makes a wrong choice of production? Suppose `A` has two productions, `A -> α` and `A -> β`. If the parser picks `A -> α` and then fails to match the rest of the input, it may need to backtrack: reset the input pointer to where it was before trying `α`, and then try `A -> β`.
While simple to understand, backtracking can be very inefficient, potentially leading to exponential time complexity for some grammars.
Predictive Parsing (LL(k) Parsers)
To avoid the inefficiency of backtracking, we use predictive parsers. These parsers try to "predict" the correct production to use by looking ahead at the next few input symbols.
- LL(k) means the parser scans the input from Left to right, constructs a Leftmost derivation, and uses k symbols of lookahead.
- LL(1) parsers (using one lookahead symbol) are very common. They require LL(1) grammars, which must not have:
  - Left recursion: productions like `E -> E + T` send a simple recursive descent parser into an infinite loop, so left recursion must be eliminated.
  - Common prefixes that cannot be distinguished by one lookahead symbol. For example, with `S -> if E then S else S` and `S -> if E then S`, the parser can't decide which rule to pick by looking at just `if`; such rules must be left-factored.
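Both problems have standard grammar transformations. Left recursion is rewritten into right recursion through a helper non-terminal, and common prefixes are factored out so the decision is deferred until a distinguishing token arrives (ε denotes the empty string):

```
// Left recursion eliminated:
E  -> T E'
E' -> + T E' | ε

// Common prefix left-factored:
S  -> if E then S S'
S' -> else S | ε
```

(The left-factored `if` grammar still hides an ambiguity, the "dangling else" problem, which we revisit at the end of this post.)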
- LL(1) parsers are often implemented using a parsing table built from two sets:
  - `FIRST(α)`: The set of terminals that can begin any string derived from `α`.
  - `FOLLOW(A)`: The set of terminals that can appear immediately after the non-terminal `A` in some sentential form.
Predictive parsing is efficient (linear time) but works only for a restricted class of grammars.
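As a concrete sketch of how these sets are computed, here is a small fixed-point computation of FIRST for the left-recursion-free expression grammar shown above (representing the grammar as Python dictionaries is my own convention, not a standard API; `"eps"` stands for ε):

```python
EPS = "eps"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
nonterminals = set(grammar)
first = {nt: set() for nt in grammar}

def first_of(sym):
    # Terminals (and eps) are their own FIRST set.
    return first[sym] if sym in nonterminals else {sym}

changed = True
while changed:                        # iterate until no set grows
    changed = False
    for nt, prods in grammar.items():
        for prod in prods:
            for sym in prod:
                f = first_of(sym)
                added = f - {EPS} - first[nt]
                if added:
                    first[nt] |= added
                    changed = True
                if EPS not in f:      # stop unless sym can derive eps
                    break
            else:
                # every symbol in the production can derive eps
                if EPS not in first[nt]:
                    first[nt].add(EPS)
                    changed = True

for nt in sorted(grammar):
    print(nt, "->", sorted(first[nt]))
```

Running it gives FIRST(E) = FIRST(T) = FIRST(F) = { (, id }, FIRST(E') = { +, ε }, and FIRST(T') = { *, ε }. FOLLOW sets are computed by a similar fixed-point pass over the productions.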
Bottom-Up Parsing: Building from the Leaves
Bottom-up parsing works by starting with the input token stream and attempting to "reduce" it to the grammar's start symbol. It's like building a parse tree from the leaves upwards to the root. The core idea is to find substrings of the current working string that match the right-hand side (RHS) of a production and replace them with the non-terminal on the left-hand side (LHS).
Shift-Reduce Parsing
This is the most common bottom-up parsing technique. It uses a stack to hold grammar symbols and an input buffer. There are four possible actions:
- Shift: Move the next input token onto the top of the stack.
- Reduce: If the top of the stack contains a sequence of symbols `α` that matches the RHS of a production `A -> α`, replace `α` on the stack with `A`. This corresponds to building a node in the parse tree. The sequence `α` is called a handle.
- Accept: If the stack contains the start symbol and the input buffer is empty, parsing is successful.
- Error: If none of the above actions are possible, a syntax error is detected.
The challenge is deciding when to shift and when to reduce (and which production to use for reduction).
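To see these actions in play, here is a trace of a shift-reduce parse of `id + id * id` with our unambiguous expression grammar (`$` marks the bottom of the stack and the end of the input). Note the decision point in the middle: with `E + T` on the stack and `*` incoming, the parser shifts instead of reducing by `E -> E + T`, and that choice is exactly how `*`'s higher precedence is realized:

```
Stack            Input            Action
$                id + id * id $   shift
$ id             + id * id $      reduce by F -> id
$ F              + id * id $      reduce by T -> F
$ T              + id * id $      reduce by E -> T
$ E              + id * id $      shift
$ E +            id * id $        shift
$ E + id         * id $           reduce by F -> id
$ E + F          * id $           reduce by T -> F
$ E + T          * id $           shift (not reduce by E -> E + T)
$ E + T *        id $             shift
$ E + T * id     $                reduce by F -> id
$ E + T * F      $                reduce by T -> T * F
$ E + T          $                reduce by E -> E + T
$ E              $                accept
```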
LR Parsers (LR(k))
LR parsers are powerful shift-reduce parsers that can handle a large class of context-free grammars.
- L means they scan the input from Left to right.
- R means they construct a Rightmost derivation in reverse.
- k is the number of lookahead symbols (usually 0 or 1).
LR parsers are table-driven, using two tables:
- ACTION table: Tells the parser what to do (shift, reduce, accept, or error) given the current state (on top of the stack) and the next input token.
- GOTO table: Tells the parser which state to transition to after a reduction.
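The driver loop that consults these tables is short and generic; all the grammar-specific intelligence lives in the tables, which is why tools can generate them. Here is a sketch in Python, assuming ACTION and GOTO have already been built (by hand or by a generator) as dictionaries with the encodings shown in the comments:

```python
# Sketch of the generic table-driven LR driver loop. Assumed encodings:
#   ACTION[(state, token)] = ("shift", next_state) | ("reduce", prod_index)
#                          | ("accept",)
#   GOTO[(state, nonterminal)] = next_state
#   productions[prod_index] = (lhs_nonterminal, rhs_length)
def lr_parse(tokens, ACTION, GOTO, productions):
    stack = [0]                        # stack of states; 0 is the start state
    tokens = list(tokens) + ["$"]      # $ marks end of input
    pos = 0
    while True:
        state, tok = stack[-1], tokens[pos]
        action = ACTION.get((state, tok))
        if action is None:             # no table entry: syntax error
            raise SyntaxError(f"unexpected {tok!r} in state {state}")
        if action[0] == "shift":
            stack.append(action[1])    # consume the token, push the new state
            pos += 1
        elif action[0] == "reduce":
            lhs, rhs_len = productions[action[1]]
            if rhs_len:                # pop one state per RHS symbol
                del stack[-rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                          # ("accept",)
            return True
```

Notice that moving from SLR(1) to LALR(1) to LR(1) changes only the tables, never this loop.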
There are several types of LR parsers, varying in power and table size:
- LR(0): The simplest, but too weak for most practical grammars. It uses no lookahead.
- SLR(1) (Simple LR): More powerful than LR(0). It uses the FOLLOW sets of non-terminals to help decide on reductions, which removes many conflicts.
- LR(1): The most powerful LR parser using one lookahead symbol. It can parse a very wide range of grammars, but its parsing tables can be very large.
- LALR(1) (Look-Ahead LR): A compromise between SLR(1) and LR(1). It has significantly smaller tables than LR(1) but is nearly as powerful. This is the technique used by popular parser generators like YACC (Yet Another Compiler-Compiler) and Bison.
LR parsers are efficient (linear time) and can handle a broader range of grammars than LL parsers, including left-recursive ones.
Oops! Handling Errors in Parsing
Syntax errors are common! A good parser shouldn't just give up at the first error. It should:
- Report errors clearly and accurately.
- Recover from the error quickly to continue parsing and find more potential errors in the same compilation run.
Common error recovery strategies include:
- Panic Mode: This is the simplest and most common method. When an error is detected, the parser discards input tokens one by one until a "synchronizing token" is found (e.g., a semicolon `;` or a closing brace `}`). Then it attempts to continue parsing (see the sketch after this list).
- Phrase-Level Recovery: The parser might try to replace a prefix of the remaining input (or a portion of the stack) with a small correction that allows the parse to continue, for example inserting a missing semicolon or deleting an extraneous comma.
- Error Productions: The grammar can be augmented with special productions that generate common errors. If an error production is used, the parser can generate a more specific error message.
- Global Correction: Theoretically, a parser could find the minimal sequence of changes to make an erroneous input string syntactically correct. However, this is generally too complex and costly to implement in practice.
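As an illustration of panic mode, here is a tiny Python helper a recursive descent parser might call after reporting an error. The choice of synchronizing tokens is language-specific and assumed here:

```python
# Panic-mode recovery sketch: skip tokens until a synchronizing token,
# then resume parsing just past it. SYNC is an assumed, language-specific set.
SYNC = {";", "}"}

def synchronize(tokens, pos):
    while pos < len(tokens) and tokens[pos] not in SYNC:
        pos += 1                       # discard tokens up to the sync point
    return pos + 1 if pos < len(tokens) else pos  # step past the sync token
```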
The Trouble with Ambiguity: Ambiguous Grammars
An ambiguous grammar is a CFG that can produce more than one parse tree for the same input sentence. This means the same sequence of tokens can be interpreted in multiple ways.
Example of ambiguity:
Consider the grammar `E -> E + E | E * E | id`.

For the input `id + id * id`, we can get two parse trees:
- One that groups `(id + id) * id` (addition first)
- Another that groups `id + (id * id)` (multiplication first)
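Drawn as trees, the two readings look like this (the operator nearer the root is applied last):

```
   Addition first:        Multiplication first:

        E                        E
      / | \                    / | \
     E  *  E                  E  +  E
   / | \   |                  |   / | \
  E  +  E  id                id  E  *  E
  |     |                        |     |
  id    id                       id    id
```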
Ambiguity is problematic because the compiler needs a single, unambiguous interpretation to generate correct code.
Resolving Ambiguity:
- Rewrite the Grammar: The most common fix is to rewrite the grammar so the ambiguity disappears. For expressions, this usually means introducing new non-terminals and structuring productions to enforce precedence (e.g., `*` binds tighter than `+`) and associativity (e.g., `-` is left-associative: `a-b-c` means `(a-b)-c`). Our earlier example does exactly this:

  ```
  E -> E + T | T
  T -> T * F | F
  F -> (E) | id
  ```

  This grammar unambiguously enforces that `*` has higher precedence than `+`, and that both operators are left-associative.

- Parser-Specific Rules: Some parser generators (like YACC/Bison) allow you to specify precedence and associativity rules for operators directly, without rewriting the grammar extensively. This keeps the grammar simpler while still resolving ambiguities in common cases like expressions (see the sketch after this list).

- The "Dangling Else" Problem: Another classic ambiguity arises with `if-then-else` statements:

  ```
  stmt -> if expr then stmt
  stmt -> if expr then stmt else stmt
  ```

  In `if E1 then if E2 then S1 else S2`, which `if` does the `else` attach to? The common convention is to associate the `else` with the closest preceding unmatched `then`. LR parsers usually resolve this correctly by shifting the `else` rather than reducing the `if-then` part.
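For instance, here is a Bison-style fragment (a sketch; the `ID` token and the surrounding boilerplate are assumed) that resolves the ambiguous expression grammar with declarations instead of rewriting. `%left` lines declare left-associative operators, listed from lowest to highest precedence:

```
%left '+' '-'        /* lower precedence  */
%left '*' '/'        /* higher precedence */
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '(' expr ')'
     | ID
     ;
```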
Wrapping Up the Structure!
Syntax analysis is a cornerstone of compilation, transforming a linear sequence of tokens into a hierarchical structure (the parse tree) that reflects the program's grammar. We've seen how Context-Free Grammars define this structure, and explored two main strategies for parsing:
- Top-down parsing (like Recursive Descent and LL(1)) starts from the grammar's main rule.
- Bottom-up parsing (like Shift-Reduce and the powerful LR family - SLR, LALR, LR(1)) builds up from the tokens.
We also touched upon crucial aspects like handling errors gracefully and resolving ambiguities in grammars to ensure our code has a single, clear meaning. Understanding these concepts not only demystifies compilers but can also help you appreciate the design of programming languages themselves!