Architecture Overview

Package Structure

The parser package is organized into focused modules:

tulit/parser/
├── parser.py                 # Abstract base Parser class
├── models.py                 # Domain models (Article, Citation, Recital, etc.)
├── registry.py               # Parser registry pattern for dynamic parser selection
├── normalization.py          # Text normalization strategies
├── exceptions.py             # Custom exception hierarchy
├── strategies/
│   └── article_extraction.py # Article extraction strategies
├── xml/
│   ├── xml.py                # Abstract XMLParser base class
│   ├── helpers.py            # XML utilities (XMLNodeExtractor, XMLValidator)
│   ├── formex.py             # Formex4Parser
│   ├── boe.py                # BOEXMLParser (Spanish Official Gazette)
│   └── akomantoso/           # Akoma Ntoso parser package
│       ├── base.py           # AkomaNtosoParser base class
│       ├── akn4eu.py         # AKN4EUParser variant
│       ├── german.py         # GermanLegalDocMLParser variant
│       ├── luxembourg.py     # LuxembourgAKNParser variant
│       ├── extractors.py     # Article extraction utilities
│       └── utils.py          # Format detection and factory functions
└── html/
    ├── html_parser.py        # Abstract HTMLParser base class
    ├── veneto.py             # VenetoHTMLParser (regional documents)
    └── cellar/               # EU Cellar parser package
        ├── cellar.py         # CellarHTMLParser (semantic XHTML)
        ├── cellar_standard.py # CellarStandardHTMLParser (simple structure)
        └── proposal.py       # ProposalHTMLParser (legislative proposals)

Design Patterns

Registry Pattern

The ParserRegistry class implements the Registry pattern to enable dynamic parser selection:

from tulit.parser.registry import ParserRegistry, get_parser_for_format

# Register a new parser
# (CustomParser stands for your own Parser subclass, defined elsewhere)
registry = ParserRegistry()
registry.register('custom_format', CustomParser)

# Get parser by format
parser = get_parser_for_format('akn')

Benefits:

Loose coupling between parser selection and usage
Easy to add new parsers without modifying existing code
Centralized parser management
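
To illustrate the idea behind these benefits, here is a minimal sketch of how a registry of this kind might work internally. SketchRegistry, DummyParser, and the create() method are hypothetical names chosen for this example; the real ParserRegistry in registry.py may use different method names and behavior.

```python
from typing import Dict, Type


class DummyParser:
    """Hypothetical stand-in for a concrete parser class."""


class SketchRegistry:
    """Illustrative registry mapping format names to parser classes."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Type] = {}

    def register(self, format_name: str, parser_cls: Type) -> None:
        # Normalize the key so lookups are case-insensitive.
        self._parsers[format_name.lower()] = parser_cls

    def create(self, format_name: str, **kwargs):
        # Look up the class and instantiate it; fail loudly on unknown formats.
        try:
            parser_cls = self._parsers[format_name.lower()]
        except KeyError:
            known = ", ".join(sorted(self._parsers)) or "none"
            raise ValueError(
                f"No parser registered for {format_name!r} (known: {known})"
            ) from None
        return parser_cls(**kwargs)


registry = SketchRegistry()
registry.register("custom_format", DummyParser)
parser = registry.create("CUSTOM_FORMAT")  # resolves to DummyParser
```

Because callers only name a format, a new parser can be added with one register() call and no changes to calling code.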

Strategy Pattern

The Strategy pattern is applied in two places so that algorithms can be swapped independently of the parsers that use them:

Text Normalization Strategies:

from tulit.parser.normalization import (
    WhitespaceNormalizer,
    UnicodeNormalizer,
    CompositeNormalizer
)

# Compose multiple normalization strategies
normalizer = CompositeNormalizer([
    WhitespaceNormalizer(),
    UnicodeNormalizer()
])
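
How such composition could work is sketched below, under stated assumptions: the Sketch* classes are hypothetical stand-ins written for this example, and the real classes in normalization.py may apply different rules. The only interface assumed is a normalize() method that takes and returns a string.

```python
import re
import unicodedata
from abc import ABC, abstractmethod
from typing import List


class SketchNormalizer(ABC):
    """Hypothetical strategy interface: one text-in, text-out method."""

    @abstractmethod
    def normalize(self, text: str) -> str: ...


class SketchWhitespace(SketchNormalizer):
    def normalize(self, text: str) -> str:
        # Collapse any run of whitespace to a single space and trim the ends.
        return re.sub(r"\s+", " ", text).strip()


class SketchUnicode(SketchNormalizer):
    def normalize(self, text: str) -> str:
        # NFKC folds compatibility characters, e.g. no-break space -> space.
        return unicodedata.normalize("NFKC", text)


class SketchComposite(SketchNormalizer):
    def __init__(self, steps: List[SketchNormalizer]) -> None:
        self._steps = steps

    def normalize(self, text: str) -> str:
        # Apply each strategy in order, feeding each output into the next step.
        for step in self._steps:
            text = step.normalize(text)
        return text


normalizer = SketchComposite([SketchUnicode(), SketchWhitespace()])
cleaned = normalizer.normalize("Article\u00a01 \n  Scope ")
```

The composite exposes the same interface as a single strategy, so a parser can accept either without knowing the difference. In general the order of the steps can affect the result.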

Article Extraction Strategies:

from tulit.parser.strategies.article_extraction import (
    FormexArticleStrategy,
    CellarStandardArticleStrategy
)

# Each parser uses an appropriate strategy
# (root_element is the already-parsed root of the document)
strategy = FormexArticleStrategy()
articles = strategy.extract_articles(root_element)

Benefits:

Algorithms can be selected at runtime
Easy to add new strategies
Promotes code reuse through composition

Factory Pattern

Factory functions create appropriate parser instances:

from tulit.parser.xml.akomantoso import create_akn_parser

# Automatically detect format and create appropriate parser
parser = create_akn_parser('document.akn')

Benefits:

Encapsulates complex object creation logic
Client code doesn’t need to know about concrete classes
Enables automatic format detection
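
One plausible way for a factory to detect a format is to inspect the namespace of the XML root element. The sketch below does this with the standard library. The class names, the VARIANT_MARKERS table, and the example.org namespace are all invented for illustration, and create_akn_parser may detect formats differently.

```python
import xml.etree.ElementTree as ET


class BaseSketchParser:
    """Hypothetical generic parser used when no variant is recognized."""


class EUSketchParser(BaseSketchParser):
    """Hypothetical variant parser."""


# Assumed mapping from a marker found in the root namespace to a parser class.
# The marker string is invented for this example.
VARIANT_MARKERS = {
    "example.org/variant-eu": EUSketchParser,
}


def detect_namespace(xml_text: str) -> str:
    # ElementTree renders qualified tags as "{namespace}localname".
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def create_sketch_parser(xml_text: str) -> BaseSketchParser:
    namespace = detect_namespace(xml_text)
    for marker, parser_cls in VARIANT_MARKERS.items():
        if marker in namespace:
            return parser_cls()
    # Fall back to the generic parser rather than failing.
    return BaseSketchParser()


eu_doc = '<doc xmlns="http://example.org/variant-eu/1.0"><body/></doc>'
plain_doc = "<doc><body/></doc>"
eu_parser = create_sketch_parser(eu_doc)
plain_parser = create_sketch_parser(plain_doc)
```

Supporting another variant then means adding one entry to the table, and callers never reference the concrete classes.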

Template Method Pattern

The base Parser class defines the parsing workflow as a template method:

from abc import ABC

class Parser(ABC):
    def parse(self, file: str, **options) -> 'Parser':
        """Template method defining the parsing workflow."""
        self.get_root(file)
        self.get_preface()
        self.get_preamble()
        self.get_formula()
        self.get_citations()
        self.get_recitals()
        self.get_preamble_final()
        self.get_body()
        self.get_chapters()
        self.get_articles()
        self.get_conclusions()
        return self

Benefits:

Consistent parsing workflow across all parsers
Subclasses override only specific steps
Reduces code duplication
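
The sketch below shows, in reduced form, how a subclass plugs into a workflow like this one. SketchParser and MinimalParser are hypothetical, the workflow is cut down to three steps, and which steps are abstract in the real Parser class is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchParser(ABC):
    """Hypothetical, shortened version of the parsing workflow."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # records step order for demonstration

    def parse(self, file: str) -> "SketchParser":
        # Template method: the order is fixed here, steps come from subclasses.
        self.get_root(file)
        self.get_preface()
        self.get_articles()
        return self

    @abstractmethod
    def get_root(self, file: str) -> None: ...

    def get_preface(self) -> None:
        # Default hook: formats without a preface can leave this untouched.
        self.calls.append("preface:default")

    @abstractmethod
    def get_articles(self) -> None: ...


class MinimalParser(SketchParser):
    def get_root(self, file: str) -> None:
        self.calls.append(f"root:{file}")

    def get_articles(self) -> None:
        self.calls.append("articles:custom")


result = MinimalParser().parse("act.xml")
# result.calls == ["root:act.xml", "preface:default", "articles:custom"]
```

The subclass never calls the steps itself. It only supplies them, and parse() decides when each one runs.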

Domain Models

Structured domain objects provide type-safe access to document components:

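from tulit.parser.models import Article, Citation, Recital, Chapter

# For reference, Article is declared in models.py along these lines
# (the imports from dataclasses and typing are omitted here):
@dataclass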
class Article:
    """Represents a legal article with metadata and content."""
    number: str
    title: Optional[str] = None
    content: Optional[str] = None
    children: List[ArticleChild] = field(default_factory=list)

Benefits:

Type safety and IDE autocompletion
Clear data contracts
Easier testing and validation
Self-documenting code
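
A short usage sketch of a dataclass with this field layout follows. SketchArticle copies the fields from the Article example, while SketchChild is a hypothetical placeholder, since the shape of ArticleChild is not shown here.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SketchChild:
    """Hypothetical child element, e.g. a numbered paragraph."""
    text: str


@dataclass
class SketchArticle:
    """Copies the field layout of the Article example."""
    number: str
    title: Optional[str] = None
    content: Optional[str] = None
    children: List[SketchChild] = field(default_factory=list)


first = SketchArticle(number="1", title="Scope")
second = SketchArticle(number="2")

# default_factory gives every instance its own list, so appending to one
# article does not leak into another.
first.children.append(SketchChild(text="This Regulation applies to ..."))

as_dict = asdict(first)  # nested dataclasses become nested dicts
```

Converting with asdict() is convenient for serializing parsed documents or for comparing expected and actual results in tests.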

Exception Hierarchy

A custom exception hierarchy provides granular error handling:

ParserError (base exception)
├── ParseError (parsing failures)
├── ValidationError (schema validation failures)
├── ExtractionError (data extraction failures)
└── FileLoadError (file loading failures)

Benefits:

Specific error handling at appropriate levels
Clear error semantics
Better debugging information
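
A hierarchy like this lets callers catch errors at whichever level suits them. The sketch below redefines the five classes locally so that it runs on its own; the real classes in exceptions.py may carry extra attributes, and the load() and describe() helpers are hypothetical.

```python
class ParserError(Exception):
    """Base class: catch this to handle any parser failure."""


class ParseError(ParserError): ...
class ValidationError(ParserError): ...
class ExtractionError(ParserError): ...
class FileLoadError(ParserError): ...


def load(path: str) -> str:
    # Hypothetical helper: wrap the low-level OSError in a domain exception,
    # keeping the original as __cause__ for debugging.
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        raise FileLoadError(f"Could not read {path}") from exc


def describe(path: str) -> str:
    try:
        load(path)
    except FileLoadError as exc:
        # Most specific handler first.
        return f"load failed: {exc}"
    except ParserError as exc:
        # Any other parser failure is handled generically.
        return f"parser failed: {exc}"
    return "ok"


outcome = describe("/nonexistent/dir/missing.xml")
```

Ordering the except clauses from specific to general is what makes the granularity useful: the base class acts as a catch-all only after the targeted handlers have had their chance.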

XML Utilities

Centralized XML utilities reduce code duplication:

XMLNodeExtractor:

XPath-based node extraction
Namespace-aware queries
Text content extraction with normalization

XMLValidator:

Schema validation with error handling
Support for both local and remote schemas
Detailed validation error reporting

Benefits:

Consistent XML processing across parsers
Robust error handling
Reduced code duplication
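
The kind of helper that XMLNodeExtractor centralizes can be approximated with the standard library. The sketch below is an assumption-laden illustration: extract_texts(), the ex namespace, and the sample document are invented for this example and do not reflect the actual API in helpers.py.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical namespace and document, used only for this example.
NS: Dict[str, str] = {"ex": "http://example.org/ns"}
SAMPLE = """
<ex:act xmlns:ex="http://example.org/ns">
  <ex:article><ex:num>1</ex:num><ex:p>First   provision.</ex:p></ex:article>
  <ex:article><ex:num>2</ex:num><ex:p>Second
  provision.</ex:p></ex:article>
</ex:act>
"""


def extract_texts(root: ET.Element, path: str, ns: Dict[str, str]) -> List[str]:
    # findall supports a limited XPath subset; the prefix map resolves "ex:".
    texts = []
    for node in root.findall(path, ns):
        # itertext gathers text from the node and all of its descendants.
        raw = "".join(node.itertext())
        texts.append(" ".join(raw.split()))  # collapse whitespace
    return texts


root = ET.fromstring(SAMPLE.strip())
numbers = extract_texts(root, ".//ex:article/ex:num", NS)
paragraphs = extract_texts(root, ".//ex:article/ex:p", NS)
```

Keeping the namespace map and whitespace handling in one function means individual parsers only supply a path.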

Module Organization

Modules are kept focused and maintainable:

Modules average 200 to 300 lines
Each module has a single, clear responsibility
Concerns are clearly separated between modules
The code is easy to navigate and maintain

Sizes of key modules and packages:

parser.py: 315 lines - Core abstract base class
xml.py: 663 lines - XML parser base with utilities
akomantoso/: 7 focused modules for different variants
cellar/: 3 specialized HTML parsers for EU documents

Testing Strategy

The codebase maintains comprehensive test coverage:

126 tests passing across all parsers
16 tests skipped (external API dependencies)
Unit tests for all public methods
Integration tests for complete parsing workflows
Edge case coverage for error handling

Extension Points

The architecture provides multiple extension points:

Adding a New Parser:

Inherit from XMLParser or HTMLParser
Implement required abstract methods
Register with ParserRegistry
Add tests

Adding a New Normalization Strategy:

Inherit from TextNormalizationStrategy
Implement normalize() method
Use in parser through composition

Adding a New Article Extraction Strategy:

Inherit from ArticleExtractionStrategy
Implement extract_articles() method
Use in parser initialization
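
The three steps for an article extraction strategy can be sketched as follows. All names here are hypothetical: SketchExtractionStrategy stands in for ArticleExtractionStrategy, whose real signature may differ, and the return type (a list of dicts) is an assumption made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchExtractionStrategy(ABC):
    """Hypothetical interface; the real base class may differ."""

    @abstractmethod
    def extract_articles(self, root: ET.Element) -> List[Dict[str, str]]: ...


class TagBasedStrategy(SketchExtractionStrategy):
    """Treats every <article> element as an article; num holds its number."""

    def extract_articles(self, root: ET.Element) -> List[Dict[str, str]]:
        articles = []
        for node in root.iter("article"):
            # Join all descendant text and collapse whitespace.
            text = " ".join("".join(node.itertext()).split())
            articles.append({"number": node.get("num", ""), "content": text})
        return articles


class SketchDocParser:
    """Hypothetical parser that receives its strategy at initialization."""

    def __init__(self, strategy: SketchExtractionStrategy) -> None:
        self._strategy = strategy

    def get_articles(self, xml_text: str) -> List[Dict[str, str]]:
        return self._strategy.extract_articles(ET.fromstring(xml_text))


doc = '<act><article num="1">Scope</article><article num="2">Definitions</article></act>'
articles = SketchDocParser(TagBasedStrategy()).get_articles(doc)
```

Passing the strategy to the constructor keeps the parser unaware of how articles are located, so a format with a different layout only needs a new strategy class.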