Lexical analysis
Lexical analysis is the process of translating a raw Unicode character stream into a sequence of tokens, which are the terminal symbols of the syntactic grammar. A program that performs lexical analysis may be called a lexer, tokenizer, or scanner, though "scanner" can also refer to the first stage of a lexer. In Java, the translation proceeds in three steps:
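- Translate every Unicode escape in the raw stream into the corresponding Unicode character; for example, \u000A becomes the line feed character (0A). Note that \n is not a Unicode escape: it is an escape sequence that is only meaningful inside character and string literals.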
- Recognize line terminators, dividing the stream produced by step 1 into input characters and line terminators. Knowing where each line ends is what allows an error message to report the line number of the offending source, which helps when debugging.
- Divide the result of step 2 into white space (including line terminators), comments, and tokens. White space and comments are discarded; only the tokens are kept.
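The first two steps can be sketched in Java as follows. This is a minimal illustration, not how any real compiler is implemented. The class name InputPreprocessing and both method names are hypothetical. The sketch assumes that a backslash begins a Unicode escape only when it is preceded by an even number of backslashes and followed by one or more u characters and four hexadecimal digits, and that the line terminators are CR, LF, and CR LF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: a sketch of steps 1 and 2, not a real compiler API.
final class InputPreprocessing {

    // Step 1: replace each Unicode escape (a backslash, one or more 'u',
    // then four hex digits) with the UTF-16 code unit it names.
    static String translateUnicodeEscapes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean hasNext = i + 1 < raw.length();
            if (c == '\\' && hasNext && raw.charAt(i + 1) == '\\') {
                // A pair of backslashes: the second one cannot start an escape.
                out.append(c).append(c);
                i += 2;
            } else if (c == '\\' && hasNext && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++; // skip every 'u'
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("truncated Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(raw.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value);
                i = j + 4;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Step 2: split on line terminators; element n of the result is line n + 1.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                lines.add(text.substring(start, i));
                // CR immediately followed by LF counts as one terminator.
                if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') i++;
                start = i + 1;
            }
        }
        lines.add(text.substring(start)); // the last line, possibly empty
        return lines;
    }

    public static void main(String[] args) {
        // The literal below holds a backslash, 'u', and "000A" as six separate characters.
        String raw = "int x;\\u000Aint y;";
        List<String> lines = splitLines(translateUnicodeEscapes(raw));
        for (int n = 0; n < lines.size(); n++) {
            System.out.println((n + 1) + ": " + lines.get(n));
        }
    }
}
```

Running main prints two numbered lines, because the escape is turned into a line feed in step 1 and is then recognized as a line terminator in step 2. The order of the steps matters: an escaped line feed ends a line just as a literal one does.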
Tokens
Tokens are a central concept in a compiler. Java has five kinds of token:
- Identifier
- Keyword
- Literal
- Separator
- Operator
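The five kinds listed above can be made concrete with a toy scanner. The following is only a sketch under simplifying assumptions. The names MiniLexer, Kind, and Token are hypothetical, and the keyword, separator, and operator sets are small subsets chosen for illustration. Only // comments, decimal integer literals, and single-character operators are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical toy scanner: classifies input into the five token kinds.
final class MiniLexer {

    enum Kind { IDENTIFIER, KEYWORD, LITERAL, SEPARATOR, OPERATOR }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    // Deliberately incomplete subsets, for illustration only.
    private static final Set<String> KEYWORDS = Set.of("boolean", "int", "if", "else", "return");
    private static final Set<String> WORD_LITERALS = Set.of("true", "false", "null");
    private static final String SEPARATORS = "(){}[];,.";
    private static final String OPERATORS = "=+-*/<>!";

    static List<Token> tokenize(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space is discarded
            } else if (src.startsWith("//", i)) {
                // An end-of-line comment is discarded up to the line terminator.
                while (i < src.length() && src.charAt(i) != '\n' && src.charAt(i) != '\r') i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                String word = src.substring(start, i);
                // Keywords and the word-like literals look like identifiers,
                // so they are told apart only after the whole word is read.
                Kind kind = KEYWORDS.contains(word) ? Kind.KEYWORD
                        : WORD_LITERALS.contains(word) ? Kind.LITERAL
                        : Kind.IDENTIFIER;
                tokens.add(new Token(kind, word));
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= '0' && src.charAt(i) <= '9') i++;
                tokens.add(new Token(Kind.LITERAL, src.substring(start, i)));
            } else if (SEPARATORS.indexOf(c) >= 0) {
                tokens.add(new Token(Kind.SEPARATOR, String.valueOf(c)));
                i++;
            } else if (OPERATORS.indexOf(c) >= 0) {
                tokens.add(new Token(Kind.OPERATOR, String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("boolean ok = true; // flag"));
    }
}
```

For the input in main, the scanner yields a keyword (boolean), an identifier (ok), an operator (=), a literal (true), and a separator (;). The comment and the spaces produce no tokens at all.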
Tokens are non-terminal symbols of the lexical grammar, whose terminal symbols are characters. For example:
BooleanLiteral:
    true
    false
In the syntactic grammar, by contrast, tokens are the terminal symbols. A parser, which analyzes the syntax of a program, takes the token stream as input and produces an abstract syntax tree (AST) as output.