Skip to main content

Overview

The Mini-Compilador Educativo follows a classic multi-phase compiler architecture, transforming high-level source code through six distinct stages into executable x86 assembly code. Each phase is encapsulated in its own class, creating a modular and maintainable design. Compiler Architecture Diagram

Architecture Diagram

┌──────────────────────────────────────────────────────────────┐
│                      Source Code                              │
│                  "let x = 5 + 3;"                            │
└────────────────────┬─────────────────────────────────────────┘


         ┌───────────────────────┐
         │   1. Scanner (Lexer)  │  Character stream → Tokens
         │   TipoToken, Token    │  [LET, ID(x), =, NUM(5), +, NUM(3), ;]
         └───────────┬───────────┘


         ┌───────────────────────┐
         │   2. Parser           │  Tokens → AST
         │   Recursive Descent   │  DeclaracionVariable(x, BinExpr(+, 5, 3))
         └───────────┬───────────┘


         ┌───────────────────────┐
         │ 3. SemanticAnalyzer   │  Validates AST semantics
         │   Variable tracking   │  Checks: undeclared vars, div by zero
         └───────────┬───────────┘


         ┌───────────────────────┐
         │ 4. GeneradorIR        │  AST → Three Address Code
         │   TAC generation      │  t0 = 5 + 3
         └───────────┬───────────┘  x = t0

                     ├─────────────────────┐
                     │                     │
                     ▼                     ▼
         ┌───────────────────┐  ┌─────────────────────┐
         │ 5. Interprete     │  │ 6. GeneradorASM     │
         │   Executes AST    │  │   TAC → x86 ASM     │
         │   Runtime eval    │  │   EMU8086 format    │
         └───────────────────┘  └─────────────────────┘

Core Components

Phase 1: Scanner (Lexical Analysis)

Class: Scanner
Input: Raw source code string
Output: List of Token objects
@dataclass
class Token:
    tipo: TipoToken        # Token type (LET, NUMERO, etc.)
    lexema: str            # Original text ("let", "42", etc.)
    linea: int             # Line number
    columna: int           # Column position
    valor: Any = None      # Numeric value for NUMERO tokens
Key Features:
  • Character-by-character scanning with position tracking
  • Comment handling (// single-line comments)
  • Reserved word dictionary lookup
  • Error collection with line/column information
Source: compfinal.py:207-491

Phase 2: Parser (Syntactic Analysis)

Class: Parser
Input: List of tokens from Scanner
Output: Abstract Syntax Tree (AST) as Programa object
Expressions (produce values):
  • NumeroLiteral - Integer constants
  • Identificador - Variable references
  • ExpresionBinaria - Binary operations (+, -, *, /)
  • ExpresionAgrupada - Parenthesized expressions
Statements (perform actions):
  • DeclaracionVariable - Variable assignments
  • SentenciaPrint - Print statements
Root Node:
  • Programa - Container for all statements
Parsing Strategy:
Recursive descent with operator precedence handling:
expresion → suma_resta
suma_resta → mult_div (('+' | '-') mult_div)*
mult_div → primario (('*' | '/') primario)*
primario → NUMERO | IDENTIFICADOR | '(' expresion ')'
The nesting ensures multiplication/division bind tighter than addition/subtraction. Example AST:
Programa
└── DeclaracionVariable
    ├── nombre: 'x'
    └── valor:
        └── ExpresionBinaria
            ├── operador: '+'
            ├── izquierda: NumeroLiteral(5)
            └── derecha: NumeroLiteral(3)
Source: compfinal.py:653-950

Phase 3: Semantic Analyzer

Class: AnalizadorSemantico
Input: AST from Parser
Output: Boolean success + error list

Variable Tracking

Maintains a set of declared variables. Detects use of undeclared identifiers.

Division by Zero

Catches literal division by zero (e.g., x / 0) at compile time.
Validation Rules:
  1. Variables must be declared before use
  2. No division by zero with numeric literals
  3. Warns on variable redeclaration
Implementation: Visitor pattern recursively traversing the AST. Source: compfinal.py:968-1085

Phase 4: Intermediate Code Generator

Class: GeneradorIR
Input: Validated AST
Output: Three Address Code (TAC) as list of strings
TAC Format:
t0 = 5 + 3
x = t0
print x
Characteristics:
  • Each instruction has at most one operator
  • Temporary variables (t0, t1, …) hold intermediate results
  • Platform-independent representation
  • Suitable for optimization passes
Source: compfinal.py:1090-1149

Phase 5: Interpreter

Class: Interprete
Input: AST
Output: Runtime execution results
The interpreter executes the program by evaluating the AST directly, providing immediate feedback without requiring assembly or machine code generation.
Features:
  • Variable storage in dictionary
  • Direct expression evaluation
  • Integer arithmetic (using floor division for /)
  • Print output to console
Source: compfinal.py:1156-1215

Phase 6: Assembly Code Generator

Class: GeneradorASM
Input: AST
Output: x86 assembly code string (EMU8086 format)
Generated Structure:
.model small
.stack 100h

.data
    x dw 0          ; Variable declarations
    msg db 13,10,'$'

.code
main proc
    mov ax, @data
    mov ds, ax
    
    ; Generated instructions
    mov ax, 5
    push ax
    mov ax, 3
    mov bx, ax
    pop ax
    add ax, bx      ; Result in AX
    mov x, ax       ; Store to variable
    
    mov ah, 4ch
    int 21h
main endp

; print_num procedure...
end main
Key Techniques:
  • Stack-based evaluation of expressions
  • AX register for accumulator
  • BX register for right operand
  • Included print_num routine for output
Source: compfinal.py:1221-1390

Data Flow

Here’s how a simple program flows through the compiler:
1

Input Code

let result = 10 + 5 * 2;
print result;
2

After Lexer

[LET, ID(result), =, NUM(10), +, NUM(5), *, NUM(2), ;, PRINT, ID(result), ;, EOF]
3

After Parser

Programa[
  DeclaracionVariable(result, BinExpr(+, 10, BinExpr(*, 5, 2))),
  SentenciaPrint(ID(result))
]
4

After Semantic Analysis

✓ All variables declared
✓ No division by zero
5

After IR Generation

t0 = 5 * 2
t1 = 10 + t0
result = t1
print result
6

After Interpreter

Console output: 20
7

After ASM Generation

Produces .asm file ready for EMU8086

Error Handling Strategy

Collected in Scanner.errores list:
  • Invalid characters
  • Reported with line/column
  • Compilation halts if errors present

Class Relationships

Key Design Principles

Separation of Concerns

Each phase handles one responsibility, making the codebase easy to understand and extend.

Visitor Pattern

AST traversal uses implicit visitor pattern through type checking with isinstance().

Error Recovery

Parser synchronization allows reporting multiple errors in one pass.

Immutable AST

AST nodes use @dataclass for clean, immutable structures.
The compiler currently has no optimization phase. The IR is generated but not optimized before assembly generation. This is a potential area for future enhancement.