Skip to main content

Overview

The Mini-Compilador Educativo transforms source code through six sequential phases, each building on the output of the previous one. This page provides an in-depth look at how a sample program flows through the entire compilation pipeline.

Example Program

We’ll trace this simple program through all phases:
let a = 5;
let b = 10;
let c = a + b * 2;
print c;
Expected output: 30 (because b * 2 = 20, then a + 20 = 25… wait, let me recalculate: 10 * 2 = 20, 5 + 20 = 25. Actually the output should be 25.)

Phase 1: Lexical Analysis (Scanner)

1

Input

Raw character stream:
let a = 5;\nlet b = 10;\nlet c = a + b * 2;\nprint c;
2

Processing

The Scanner class reads character-by-character:
  1. Recognizes keywords by dictionary lookup
  2. Identifies operators as single characters
  3. Accumulates digit sequences into numbers
  4. Collects alphanumeric sequences into identifiers
  5. Tracks line and column positions
3

Output

List of tokens with metadata:
[
  Token(LET, 'let', línea=1, col=1),
  Token(IDENTIFICADOR, 'a', línea=1, col=5),
  Token(IGUAL, '=', línea=1, col=7),
  Token(NUMERO, '5', línea=1, col=9, valor=5),
  Token(PUNTO_COMA, ';', línea=1, col=10),
  
  Token(LET, 'let', línea=2, col=1),
  Token(IDENTIFICADOR, 'b', línea=2, col=5),
  Token(IGUAL, '=', línea=2, col=7),
  Token(NUMERO, '10', línea=2, col=9, valor=10),
  Token(PUNTO_COMA, ';', línea=2, col=11),
  
  Token(LET, 'let', línea=3, col=1),
  Token(IDENTIFICADOR, 'c', línea=3, col=5),
  Token(IGUAL, '=', línea=3, col=7),
  Token(IDENTIFICADOR, 'a', línea=3, col=9),
  Token(SUMA, '+', línea=3, col=11),
  Token(IDENTIFICADOR, 'b', línea=3, col=13),
  Token(MULTIPLICACION, '*', línea=3, col=15),
  Token(NUMERO, '2', línea=3, col=17, valor=2),
  Token(PUNTO_COMA, ';', línea=3, col=18),
  
  Token(PRINT, 'print', línea=4, col=1),
  Token(IDENTIFICADOR, 'c', línea=4, col=7),
  Token(PUNTO_COMA, ';', línea=4, col=8),
  
  Token(FIN_ARCHIVO, '', línea=4, col=9)
]

Key Features

Position Tracking

Every token stores its line and column for precise error reporting.

Value Extraction

Numeric tokens carry both the original string (lexema) and parsed integer (valor).

Comment Handling

// comments are consumed without generating tokens (see compfinal.py:319-327).

Error Collection

Invalid characters add to Scanner.errores list but don’t stop scanning.

Phase 2: Syntactic Analysis (Parser)

1

Input

Token stream from Scanner
2

Processing

The Parser uses recursive descent to match tokens against grammar rules:
  1. parsear() → processes statements until EOF
  2. _sentencia() → dispatches to _declaracion_variable() or _sentencia_print()
  3. _expresion() → delegates to _suma_resta()_multiplicacion_division()_primario()
  4. Operator precedence handled by nesting depth
3

Output

Abstract Syntax Tree (AST)

AST Representation

Programa
├── DeclaracionVariable
│   ├── nombre: 'a'
│   └── valor:
│       └── NumeroLiteral(5)
├── DeclaracionVariable
│   ├── nombre: 'b'
│   └── valor:
│       └── NumeroLiteral(10)
├── DeclaracionVariable
│   ├── nombre: 'c'
│   └── valor:
│       └── ExpresionBinaria
│           ├── operador: '+'
│           ├── izquierda: Identificador('a')
│           └── derecha:
│               └── ExpresionBinaria
│                   ├── operador: '*'
│                   ├── izquierda: Identificador('b')
│                   └── derecha: NumeroLiteral(2)
└── SentenciaPrint
    └── expresion:
        └── Identificador('c')

Precedence in Action

Expression: a + b * 2Why it parses as a + (b * 2) and not (a + b) * 2:
  1. _expresion() calls _suma_resta()
  2. _suma_resta() calls _multiplicacion_division() for left side → gets a
  3. Sees +, so calls _multiplicacion_division() for right side
  4. _multiplicacion_division() processes b * 2 completely (consuming both tokens) before returning
  5. Result: ExpresionBinaria(+, a, ExpresionBinaria(*, b, 2))
Source: compfinal.py:802-843

Phase 3: Semantic Analysis

1

Input

AST from Parser
2

Processing

The AnalizadorSemantico traverses the AST with these checks:Variable Tracking:
  • Maintains variables_declaradas set
  • On DeclaracionVariable: analyze expression first, then add variable
  • On Identificador: verify variable exists in set
Division by Zero:
  • On ExpresionBinaria with / operator
  • If right side is NumeroLiteral with valor == 0, report error
3

Output

Boolean success flag + error list

Semantic Trace

# Statement 1: let a = 5;
1. Analyze expression: NumeroLiteral(5) ✓
2. Add 'a' to variables_declaradas
3. variables_declaradas = {'a'}

# Statement 2: let b = 10;
1. Analyze expression: NumeroLiteral(10) ✓
2. Add 'b' to variables_declaradas
3. variables_declaradas = {'a', 'b'}

# Statement 3: let c = a + b * 2;
1. Analyze expression:
   - Left side: Identificador('a')
     → Check: 'a' in variables_declaradas ✓
   - Right side: ExpresionBinaria(*, ...)
     - Left: Identificador('b')
       → Check: 'b' in variables_declaradas ✓
     - Right: NumeroLiteral(2) ✓
     - Operator: * (not division) ✓
2. Add 'c' to variables_declaradas
3. variables_declaradas = {'a', 'b', 'c'}

# Statement 4: print c;
1. Analyze expression: Identificador('c')
   → Check: 'c' in variables_declaradas ✓

# Result: No errors ✓
Semantic analysis only catches literal division by zero. Runtime divisions like 10 / x where x becomes 0 are not detected.
Source: compfinal.py:968-1085

Phase 4: Intermediate Code Generation

1

Input

Validated AST
2

Processing

The GeneradorIR creates Three Address Code (TAC):
  • Each operation becomes one TAC instruction
  • Temporary variables (t0, t1, …) store intermediate results
  • Variables use their original names
3

Output

List of TAC instructions as strings

Generated IR Code

a = 5
b = 10
t0 = b * 2
t1 = a + t0
c = t1
print c

TAC Properties

Explicit Ordering

Operations occur in sequential order, making control flow obvious.

Single Operator

Each instruction has at most one operator, simplifying optimization and code generation.

Platform Independent

TAC is abstract; not tied to any specific CPU architecture.

Optimization Ready

Easy to apply optimizations like constant folding, dead code elimination, etc.
Source: compfinal.py:1090-1149

Phase 5: Interpretation

1

Input

AST (same as used for IR generation)
2

Processing

The Interprete directly executes the AST:
  • Maintains variables dictionary for runtime values
  • Recursively evaluates expressions
  • Outputs print statement results to console
3

Output

Console output from print statements

Execution Trace

# Statement 1: let a = 5;
evaluar(NumeroLiteral(5)) → 5
variables['a'] = 5
# variables = {'a': 5}

# Statement 2: let b = 10;
evaluar(NumeroLiteral(10)) → 10
variables['b'] = 10
# variables = {'a': 5, 'b': 10}

# Statement 3: let c = a + b * 2;
evaluar(ExpresionBinaria(+, ...)):
  izquierda = evaluar(Identificador('a'))
    → variables['a'] → 5
  derecha = evaluar(ExpresionBinaria(*, ...)):
    izquierda = evaluar(Identificador('b'))
      → variables['b'] → 10
    derecha = evaluar(NumeroLiteral(2))
2
    operador.lexema == '*'
10 * 220
  operador.lexema == '+'
5 + 2025
variables['c'] = 25
# variables = {'a': 5, 'b': 10, 'c': 25}

# Statement 4: print c;
evaluar(Identificador('c')) → variables['c'] → 25
print("     Resultado:")
print("                → 25")
The interpreter uses floor division for / operator: 7 / 2 evaluates to 3, not 3.5.Source: compfinal.py:1209
Source: compfinal.py:1156-1215

Phase 6: Assembly Code Generation

1

Input

AST
2

Processing

The GeneradorASM produces x86 assembly for EMU8086:
  1. Data section: Declare all variables as words (dw)
  2. Code section: Translate each statement:
    • Use AX register as accumulator
    • Use stack for nested expressions
    • Call print_num procedure for output
  3. Utility procedures: Include print_num for integer output
3

Output

Complete .asm file as string

Generated Assembly

.model small
.stack 100h

.data
a dw 0
b dw 0
c dw 0
msg db 13,10,'$'

.code
main proc
mov ax, @data
mov ds, ax

; let a = 5;
mov ax, 5
mov a, ax

; let b = 10;
mov ax, 10
mov b, ax

; let c = a + b * 2;
mov ax, a
push ax
mov ax, b
push ax
mov ax, 2
mov bx, ax
pop ax
imul bx
mov bx, ax
pop ax
add ax, bx
mov c, ax

; print c;
mov ax, c
call print_num
mov ah,09h
lea dx,msg
int 21h

mov ah,4ch
int 21h
main endp

; imprimir numero en AX
print_num proc
push ax
push bx
push cx
push dx
mov cx,0
mov bx,10
cmp ax,0
jne convert
mov dl,'0'
mov ah,02h
int 21h
jmp done
convert:
next_digit:
xor dx,dx
div bx
push dx
inc cx
cmp ax,0
jne next_digit
print_loop:
pop dx
add dl,'0'
mov ah,02h
int 21h
loop print_loop
done:
pop dx
pop cx
pop bx
pop ax
ret
print_num endp

end main

Assembly Generation Strategy

Binary expressions use stack to preserve left operand:
1. Evaluate left → result in AX
2. Push AX
3. Evaluate right → result in AX
4. Move AX to BX
5. Pop AX (restore left)
6. Perform operation (AX op BX → AX)
This handles nested expressions naturally through recursion.
Source: compfinal.py:1221-1390

Phase Comparison

Here’s how the same expression a + b * 2 is represented at each phase:
PhaseRepresentation
1. Lexer[ID(a), SUMA, ID(b), MULT, NUM(2)]
2. ParserBinExpr(+, ID(a), BinExpr(*, ID(b), Lit(2)))
3. Semantic✓ Variables exist, no errors
4. IRt0 = b * 2
t1 = a + t0
5. InterpreterEvaluates to: 25
6. Assemblymov ax, b
imul 2
add ax, a (simplified)

Error Handling Across Phases

Detected by: Scanner
Examples:
  • Invalid characters: let x = 5@;
  • Unexpected symbols: let x = $100;
Handling:
  • Error added to Scanner.errores
  • Token type set to ERROR
  • Scanning continues to find more errors
Output:
Error léxico en línea 1, columna 10: carácter inesperado '@'
Detected by: Parser
Examples:
  • Missing semicolon: let x = 5
  • Wrong token order: = x 5 let;
  • Unmatched parentheses: let x = (5 + 3;
Handling:
  • ErrorSintaxis exception raised
  • Error caught and added to Parser.errores
  • Synchronization to next statement boundary
  • Parsing continues
Output:
Error de sintaxis en línea 1, columna 10: Se esperaba ';' al final de la declaración. Se encontró ''
Detected by: AnalizadorSemantico
Examples:
  • Undeclared variable: print x; (without let x = ...;)
  • Division by zero: let x = 10 / 0;
Handling:
  • Error added to AnalizadorSemantico.errores
  • Analysis continues to find more errors
Output:
Error semántico en línea 2, columna 7: la variable 'x' no ha sido declarada
Once semantic analysis passes, subsequent phases assume correctness:
  • IR Generation: Always succeeds with valid AST
  • Interpretation: May encounter runtime division by zero (not caught)
  • Assembly Generation: Produces syntactically correct code
The compiler does not detect:
  • Runtime division by zero (variable divisor)
  • Integer overflow
  • Stack overflow from deep recursion

Complete Compilation Output

═══════════════════════════════════════════════════════════════════════════
  MINI-COMPILADOR EDUCATIVO
═══════════════════════════════════════════════════════════════════════════

[FASE 1] Análisis Léxico...
         (Convirtiendo código en tokens)
  ✓ Completado: 24 tokens generados

[FASE 2] Análisis Sintáctico...
         (Verificando gramática y construyendo AST)
  ✓ Completado: 4 sentencias parseadas

[FASE 3] Análisis Semántico...
         (Verificando que el código tenga sentido)
  ✓ Completado: 3 variables verificadas

[FASE 4] Generación de Código Intermedio (IR)
         (Three Address Code)
    a = 5
    b = 10
    t0 = b * 2
    t1 = a + t0
    c = t1
    print c

[FASE 5] Ejecución del programa...
              (Interpretando el AST)
     Resultado:
                → 25
  ✓ Ejecución finalizada

───────────────────────────────────────────────────────────────────────────
  ✓ RESULTADO: ¡Compilación exitosa!

Performance Characteristics

Time Complexity

  • Lexer: O(n) where n = character count
  • Parser: O(n) where n = token count
  • Semantic: O(n) where n = AST node count
  • IR/ASM: O(n) where n = AST node count
  • Overall: O(n) linear in input size

Space Complexity

  • Token list: O(n) tokens
  • AST: O(n) nodes
  • Symbol table: O(v) variables
  • IR code: O(n) instructions
  • Overall: O(n) linear in input size
The compiler completes in milliseconds for typical programs (< 100 lines). GUI shows compilation time in status bar.

Next Steps

Try It Yourself

Experiment with the compiler by:
  1. Modifying the example program
  2. Introducing intentional errors to see error messages
  3. Viewing the AST and IR output (enable in GUI checkboxes)
  4. Exporting and running the assembly code in EMU8086
See Installation and Quick Start for setup and usage instructions.