CS 1104 Introduction to Computer Science

COMPILERS AND TRANSLATORS

Lexemes and Tokens

A Lexeme is a string of characters that forms a lowest-level syntactic unit of the programming language. Lexemes are the "words" and punctuation of the programming language.

A Token is a syntactic category that forms a class of lexemes. These are the "nouns", "verbs", and other parts of speech for the programming language.

In a practical programming language there is a very large number of lexemes, perhaps even an infinite number, but only a small number of tokens. For example, every possible variable name is a distinct lexeme, yet all of them belong to the single token IDENTIFIER.
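The contrast between many lexemes and few tokens can be illustrated with a minimal Python sketch. The two patterns below are assumptions chosen for illustration, not the definitions of any particular language; the helper name classify is likewise hypothetical.

```python
import re

# Hypothetical token classes, each described by a pattern over characters.
# Each pattern accepts an unbounded number of different lexemes.
TOKEN_PATTERNS = {
    "IDENTIFIER": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
    "INTEGER": re.compile(r"[0-9]+"),
}

def classify(lexeme):
    """Return the token whose pattern matches the whole lexeme, or None."""
    for token, pattern in TOKEN_PATTERNS.items():
        if pattern.fullmatch(lexeme):
            return token
    return None

# Five different lexemes fall into only two tokens.
for lexeme in ["y", "t", "total_2", "3", "2000"]:
    print(lexeme, classify(lexeme))
```

Any number of further names or integer constants could be added to the loop without adding a new token.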

One of the major tasks of the Lexical Analyzer is to create pairs of lexemes and tokens; that is, to collect all the characters of a lexeme and to attach a potential token to it. This is done in order to:

Convert the text into a fixed format that will be easier to analyze at the next stage, and
Recognize lexemes (as far as possible) so as to save time in the next stage.

 

 


EXAMPLE

while (y  >=  t) y  =  y - 3 ;

will be represented by the sequence of pairs:

Lexeme    Token**
while     WHILE
(         LPAREN
y         IDENTIFIER
>=        COMPARISON
t         IDENTIFIER
)         RPAREN
y         IDENTIFIER
=         ASSIGNMENT
y         IDENTIFIER
-         ARITHMETIC
3         INTEGER
;         SEMICOLON
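
The pairing of lexemes with tokens for the example statement can be sketched in Python as follows. This is only an illustration under stated assumptions: the token names are taken from the table, while the patterns, the SKIP and ERROR classes, and the function name tokenize are hypothetical choices, not part of any specific compiler.

```python
import re

# Assumed token classes. Order matters: the keyword WHILE is tried before
# IDENTIFIER, and COMPARISON before ASSIGNMENT so that ">=" is one lexeme.
TOKEN_SPEC = [
    ("WHILE", r"while\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INTEGER", r"[0-9]+"),
    ("COMPARISON", r"<=|>=|==|!=|<|>"),
    ("ASSIGNMENT", r"="),
    ("ARITHMETIC", r"[-+*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),   # white space separates lexemes and is discarded
    ("ERROR", r"."),    # any other single character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Collect the characters of each lexeme and attach a token to it."""
    pairs = []
    for match in MASTER.finditer(text):
        token = match.lastgroup   # name of the pattern that matched
        lexeme = match.group()
        if token == "SKIP":
            continue
        if token == "ERROR":
            raise ValueError(f"unexpected character {lexeme!r}")
        pairs.append((lexeme, token))
    return pairs

for lexeme, token in tokenize("while (y  >=  t) y  =  y - 3 ;"):
    print(lexeme, token)
```

Note that the lexeme y appears three times in the output, each time paired with the same token IDENTIFIER, which is why the result is an ordered sequence rather than a set.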

** In practice these tokens are normally fixed-length names or numbers that serve as the entry points to the compiler routines that provide additional information and processing for the particular token type.
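
One way to picture tokens as numbers that lead to compiler routines is a small dispatch table, sketched here in Python. Everything in it is hypothetical: the numeric codes, the three handler functions, and the messages they return are invented for illustration, and a real compiler would organize this differently.

```python
from enum import IntEnum

class Token(IntEnum):
    """Assumed numeric codes for three of the tokens in the example."""
    WHILE = 0
    IDENTIFIER = 1
    INTEGER = 2

# Hypothetical routines, one per token type.
def handle_while(lexeme):
    return "start of a loop"

def handle_identifier(lexeme):
    return f"look up the name {lexeme!r} in the symbol table"

def handle_integer(lexeme):
    return f"integer constant with value {int(lexeme)}"

# The token's number is the index of its routine in this table.
ROUTINES = [handle_while, handle_identifier, handle_integer]

def process(lexeme, token):
    """Use the token's numeric code to select the routine for this lexeme."""
    return ROUTINES[token](lexeme)

print(process("y", Token.IDENTIFIER))
print(process("3", Token.INTEGER))
```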




© J.A.N. LEE
Last Updated 2000/10/24