A shorthand for expressing patterns of text. An expression which generates a language from the operations union (alternation), concatenation, and concatenation closure.
*
(RE, EBNF) Concatenation closure; occurrence of the preceding item zero or more times.
|
(BNF, RE, EBNF) or, or alternatively.
( )
Grouping.
The empty string
regular language
A language which can be generated by regular expressions.
self-embedding
A characteristic of languages in which an N-phrase of the language may contain an N-phrase.
self-embedding language
A language which is self-embedding.
Self-embedding languages are not regular languages.
Regular languages are not self-embedding.
Programming languages are typically self-embedding.
The tokens of programming languages are typically regular.
Extended BNF
Extended Backus-Naur Form (EBNF)
BNF augmented by the use of regular expressions on the right hand side of productions.
::=
(BNF, EBNF) is defined as, may consist of.
*
(RE, EBNF) Concatenation closure; occurrence of the preceding item zero or more times.
|
(BNF, RE, EBNF) or, or alternatively.
( )
Grouping.
Grammar Transformations
left factorization
The regrouping of expressions so that the repetion of symbols on the left side is eliminated.
left recursion
The condition in which the definition of a symbol begins with the same symbol.
Starter Sets
starter set
The set of all terminal symbols which can begin a non-terminal symbol.
Parsing
The Bottom-up Parsing Strategy
bottom-up parsing
Parsing in which the syntax tree is built from the terminal symbols toward the root.
The top-down Parsing Strategy
top-down parsing
Parsing in which the syntax tree is built from the root toward the terminal symbols.
Recursive-descent Parsing
Systematic Development of a Recursive-descent Parsing
Abstract Syntax Trees
Representation
Construction
Scanning
Scanning is similar to parsing.
Scanning functions at a lower level than parsing.
In parsing, the goal is usually a program, and the terminal symbols are tokens.
In scanning, the goal is tokens, and the terminal symbols are individual characters.
separator
A source program element which separates tokens.
Blanks spaces, tabs, new lines, and comments are separators.
Separators enhance human readablility of programs.
Separators are not part of the programs phrase structure.
Lexical Grammar
lexical grammar
A grammar which describes tokens.
The start symbol for a lexical grammar is Token.
The lexical grammar should include the non-terminal symbol Comment.
The terminal symbols for a lexical grammar are individual characters.
The lexical grammar must not be self-embedding.
Systematic scanner generation
Step 1 - Express the lexical grammar in EBNF.
Step 1A - left factor the lexical grammar.
Step 1B - remove any left recursion from the grammar.
Step 2 - For each production N ::= X, write a (private) Scanner method ScanN with a body determined by X.
Step 3 - Assemble the Scanner.
Step 3A - Include a (private) class variable currentChar which will be the character currently being examined.
Step 3B - Write a (private) method take, analagous to the parser method accept, and a (private) method takeIt, analogous to the Parser method acceptIt.
Step 3C - Add code to the ScanN methods to capture the token's spelling and type.
Step 3D - Write a public scan method which scans on the regular expression 'Separator* Token', discarding any separators and returning a Token.
Case Study: Syntactic Analysis in the Triangle Compiler