What Is a Lexer?

Jun 2026 • Compiler Series • Part 2

A lexer is a component that converts raw source code characters into tokens for the parser.

What is a lexer?

When the compiler gets code like:

    .
    .
    total = price + 42;
    .
    .

It doesn't yet know what any of it means. To the compiler, it's just a stream of unclassified characters:

't' 'o' 't' 'a' 'l' ' ' '=' ' ' 'p' ...
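To see this raw view for yourself, here is a minimal Python sketch that splits the statement into individual characters. The variable names are only illustrative.

```python
# The statement from the example, as the lexer first sees it.
source = "total = price + 42;"

# A string is just a sequence of characters; nothing is classified yet.
chars = list(source)

# Show the first few characters, spaces included.
print(chars[:9])
```

Notice that the spaces show up as characters too. At this stage nothing tells the compiler that they are any less important than the letters.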

Now we have two options: either we make sense of these characters now, or we leave them raw and force every later stage to deal with individual characters.

If you are the smart one (unlike me), you will see that handling each character separately later is tedious, so we should group them now and set some rules about which character sequences are allowed and which are not.

A lexer is this small set of rules that classifies a stream of characters into tokens. Don't confuse it with the grammar: the lexer doesn't know which arrangements of tokens are right or wrong in a language.

Take an example from standard English:
jump3d D0g m004. ov3r

Suppose our rule says that words can't contain digits. Then the lexer can't classify any of these as valid words.

And if we write:
jumped Dog moon. over

The lexer sees that each of these matches the pattern for a valid word, so it will make tokens out of them. It still doesn't know whether the sentence is grammatically correct. It doesn't know the rules of grammar; it just classifies the stream of characters, one word at a time.

Note that digits in identifiers are valid in many programming languages, usually anywhere except the first character. The analogy above is only meant to illustrate the mechanics of lexing.
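The English analogy can be sketched in a few lines of Python. This is only an illustration under an assumed rule (a word is letters only, optionally followed by one period); the WORD pattern and the classify helper are hypothetical names made up for this sketch.

```python
import re

# Assumed rule for the analogy: letters only, optionally ending in a period.
WORD = re.compile(r"[A-Za-z]+\.?")

def classify(text):
    """Label each whitespace-separated chunk as WORD or INVALID."""
    result = []
    for chunk in text.split():
        kind = "WORD" if WORD.fullmatch(chunk) else "INVALID"
        result.append((kind, chunk))
    return result

print(classify("jump3d D0g m004. ov3r"))
print(classify("jumped Dog moon. over"))
```

Every chunk in the first sentence comes back as INVALID and every chunk in the second as WORD, even though the second sentence is still nonsense as English. Checking that is the grammar's job, not the lexer's.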

What is a token?

A token is the smallest meaningful unit of a program. Almost everything you see in source code, such as names, numbers, operators, and punctuation, is a token.

A token is commonly written as TokenType(text), where text is the exact characters matched from the source.

So total = price + 42 becomes:

[IDENT(total), ASSIGN(=), IDENT(price), PLUS(+), NUMBER(42)]
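In code, a token is usually just a small record holding a type and the matched text. Here is a minimal sketch, assuming a hypothetical Token class with kind and lexeme fields (names chosen for this example, not taken from any particular compiler), that prints itself in the TokenType(text) style.

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # category, e.g. "IDENT" or "NUMBER"
    lexeme: str  # the exact text matched in the source

    def __str__(self):
        # Render in the TokenType(text) style used in this post.
        return f"{self.kind}({self.lexeme})"

tokens = [
    Token("IDENT", "total"),
    Token("ASSIGN", "="),
    Token("IDENT", "price"),
    Token("PLUS", "+"),
    Token("NUMBER", "42"),
]

rendered = "[" + ", ".join(str(t) for t in tokens) + "]"
print(rendered)
```

Keeping the type and the text separate is what lets the parser ask "is this an IDENT?" without caring which identifier it is.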

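Whitespace carries no meaning here; it only separates tokens and is then discarded. Languages like Python, where indentation is significant, follow slightly different rules: Python's tokenizer, for example, turns indentation changes into INDENT and DEDENT tokens.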
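You can watch Python treat indentation differently by using the tokenize module from its standard library. The sketch below feeds it a tiny two-line snippet (made up for this example) and collects the name of each token type it reports.

```python
import io
import tokenize

# A made-up snippet with one indented block.
snippet = "if ready:\n    total = 1\n"

# generate_tokens expects a readline-style callable that returns text lines.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(snippet).readline)
]

print(names)
```

The list contains an INDENT where the block starts and a DEDENT where it ends, so in Python leading whitespace does turn into tokens instead of being thrown away.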

Lexer Output Example

Take this reference pseudocode:

fn add(a,b){
    price = 10;
    total = price + 42;
    print(total);
}

The lexer will scan it and output the following token array:

[
    FUNCTION(fn),
    IDENT(add),
    LPAREN((),
    IDENT(a),
    COMMA(,),
    IDENT(b),
    RPAREN()),
    LBRACE({),

    IDENT(price),
    ASSIGN(=),
    NUMBER(10),
    SEMICOLON(;),

    IDENT(total),
    ASSIGN(=),
    IDENT(price),
    PLUS(+),
    NUMBER(42),
    SEMICOLON(;),

    IDENT(print),
    LPAREN((),
    IDENT(total),
    RPAREN()),
    SEMICOLON(;),

    RBRACE(})
]

Toy Lexer Implementation

Below is a toy lexer written in Python. It handles only the tokens needed for the statement total = price + 42; and labels anything else UNKNOWN:

source = """
total = price + 42;
"""

tokens = []

i = 0
while i < len(source):
    ch = source[i]

    # Skip whitespace
    if ch.isspace():
        i += 1
        continue

    # Identifier
    if ch.isalpha() or ch == "_":
        start = i

        while i < len(source) and (source[i].isalnum() or source[i] == "_"):
            i += 1

        value = source[start:i]
        tokens.append(("IDENT", value))
        continue

    # Number
    if ch.isdigit():
        start = i

        while i < len(source) and source[i].isdigit():
            i += 1

        value = source[start:i]
        tokens.append(("NUMBER", value))
        continue

    # Single-character operators and punctuation
    if ch == "=":
        tokens.append(("ASSIGN", ch))
    elif ch == "+":
        tokens.append(("PLUS", ch))
    elif ch == ";":
        tokens.append(("SEMICOLON", ch))
    else:
        tokens.append(("UNKNOWN", ch))

    i += 1

for token in tokens:
    print(token)
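The toy lexer covers a single statement, so parentheses, braces, commas, and the fn keyword from the reference pseudocode would all come out as UNKNOWN or IDENT. One way to cover them is sketched below. Note that it swaps the hand-written loop for a different technique, a table of regular expressions joined into one pattern, and that TOKEN_SPEC, KEYWORDS, and lex are hypothetical names chosen for this sketch.

```python
import re

# Assumed token table for the pseudocode language in this post.
# Order matters: earlier patterns are tried first.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("LBRACE",    r"\{"),
    ("RBRACE",    r"\}"),
    ("COMMA",     r","),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),   # whitespace, discarded
    ("UNKNOWN",   r"."),     # any other single character
]

# Identifiers that are really keywords get their own token type.
KEYWORDS = {"fn": "FUNCTION"}

# One named group per token type, combined into a single pattern.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Return a list of (type, text) pairs for the given source."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup      # name of the group that matched
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "IDENT":
            kind = KEYWORDS.get(text, kind)
        tokens.append((kind, text))
    return tokens

program = """
fn add(a,b){
    price = 10;
    total = price + 42;
    print(total);
}
"""

for token in lex(program):
    print(token)
```

The keyword check happens after matching an identifier, so a name like fnord still lexes as a single IDENT instead of being split into fn and ord.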

Aadarsh Chandra (Pi)

I am an 18-year-old self-employed developer building full-stack web applications, Python/Rust backends, and low-level systems. I learn by building things from first principles.