Architecture Overview

Package Structure

The parser package is organized into focused modules:

tulit/parser/
├── parser.py                 # Abstract base Parser class
├── models.py                 # Domain models (Article, Citation, Recital, etc.)
├── registry.py               # Parser registry pattern for dynamic parser selection
├── normalization.py          # Text normalization strategies
├── exceptions.py             # Custom exception hierarchy
├── strategies/
│   └── article_extraction.py # Article extraction strategies
├── xml/
│   ├── xml.py                # Abstract XMLParser base class
│   ├── helpers.py            # XML utilities (XMLNodeExtractor, XMLValidator)
│   ├── formex.py             # Formex4Parser
│   ├── boe.py                # BOEXMLParser (Spanish Official Gazette)
│   └── akomantoso/           # Akoma Ntoso parser package
│       ├── base.py           # AkomaNtosoParser base class
│       ├── akn4eu.py         # AKN4EUParser variant
│       ├── german.py         # GermanLegalDocMLParser variant
│       ├── luxembourg.py     # LuxembourgAKNParser variant
│       ├── extractors.py     # Article extraction utilities
│       └── utils.py          # Format detection and factory functions
└── html/
    ├── html_parser.py        # Abstract HTMLParser base class
    ├── veneto.py             # VenetoHTMLParser (regional documents)
    └── cellar/               # EU Cellar parser package
        ├── cellar.py         # CellarHTMLParser (semantic XHTML)
        ├── cellar_standard.py # CellarStandardHTMLParser (simple structure)
        └── proposal.py       # ProposalHTMLParser (legislative proposals)

Design Patterns

Registry Pattern

The ParserRegistry class implements the Registry pattern to enable dynamic parser selection:

from tulit.parser.registry import ParserRegistry, get_parser_for_format

# Register a new parser
registry = ParserRegistry()
registry.register('custom_format', CustomParser)

# Get parser by format
parser = get_parser_for_format('akn')

Benefits:

  • Loose coupling between parser selection and usage

  • Easy to add new parsers without modifying existing code

  • Centralized parser management

Strategy Pattern

Multiple strategy patterns are used for algorithmic flexibility:

Text Normalization Strategies:

from tulit.parser.normalization import (
    WhitespaceNormalizer,
    UnicodeNormalizer,
    CompositeNormalizer
)

# Compose multiple normalization strategies
normalizer = CompositeNormalizer([
    WhitespaceNormalizer(),
    UnicodeNormalizer()
])

Article Extraction Strategies:

from tulit.parser.strategies.article_extraction import (
    FormexArticleStrategy,
    CellarStandardArticleStrategy
)

# Each parser uses an appropriate strategy
strategy = FormexArticleStrategy()
articles = strategy.extract_articles(root_element)

Benefits:

  • Algorithms can be selected at runtime

  • Easy to add new strategies

  • Promotes code reuse through composition

Factory Pattern

Factory functions create appropriate parser instances:

from tulit.parser.xml.akomantoso import create_akn_parser

# Automatically detect format and create appropriate parser
parser = create_akn_parser('document.akn')

Benefits:

  • Encapsulates complex object creation logic

  • Client code doesn’t need to know about concrete classes

  • Enables automatic format detection

Template Method Pattern

The base Parser class defines the parsing workflow as a template method:

class Parser(ABC):
    def parse(self, file: str, **options) -> 'Parser':
        """Template method defining the parsing workflow."""
        self.get_root(file)
        self.get_preface()
        self.get_preamble()
        self.get_formula()
        self.get_citations()
        self.get_recitals()
        self.get_preamble_final()
        self.get_body()
        self.get_chapters()
        self.get_articles()
        self.get_conclusions()
        return self

Benefits:

  • Consistent parsing workflow across all parsers

  • Subclasses override only specific steps

  • Reduces code duplication

Domain Models

Structured domain objects provide type-safe access to document components:

from tulit.parser.models import Article, Citation, Recital, Chapter

@dataclass
class Article:
    """Represents a legal article with metadata and content."""
    number: str
    title: Optional[str] = None
    content: Optional[str] = None
    children: List[ArticleChild] = field(default_factory=list)

Benefits:

  • Type safety and IDE autocompletion

  • Clear data contracts

  • Easier testing and validation

  • Self-documenting code

Exception Hierarchy

Custom exception hierarchy provides granular error handling:

ParserError (base exception)
├── ParseError (parsing failures)
├── ValidationError (schema validation failures)
├── ExtractionError (data extraction failures)
└── FileLoadError (file loading failures)

Benefits:

  • Specific error handling at appropriate levels

  • Clear error semantics

  • Better debugging information

XML Utilities

Centralized XML utilities reduce code duplication:

XMLNodeExtractor:

  • XPath-based node extraction

  • Namespace-aware queries

  • Text content extraction with normalization

XMLValidator:

  • Schema validation with error handling

  • Support for both local and remote schemas

  • Detailed validation error reporting

Benefits:

  • Consistent XML processing across parsers

  • Robust error handling

  • Reduced code duplication

Module Organization

Modules are kept focused and maintainable:

  • Modules average 200-300 lines

  • Each module has a single, clear responsibility

  • Better separation of concerns

  • Easy to navigate and maintain

Key organizational achievements:

  • parser.py: 315 lines - Core abstract base class

  • xml.py: 663 lines - XML parser base with utilities

  • akomantoso/: 7 focused modules for different variants

  • cellar/: 3 specialized HTML parsers for EU documents

Testing Strategy

The codebase maintains comprehensive test coverage:

  • 126 tests passing across all parsers

  • 16 tests skipped (external API dependencies)

  • Unit tests for all public methods

  • Integration tests for complete parsing workflows

  • Edge case coverage for error handling

Extension Points

The architecture provides multiple extension points:

Adding a New Parser:

  1. Inherit from XMLParser or HTMLParser

  2. Implement required abstract methods

  3. Register with ParserRegistry

  4. Add tests

Adding a New Normalization Strategy:

  1. Inherit from TextNormalizationStrategy

  2. Implement normalize() method

  3. Use in parser through composition

Adding a New Article Extraction Strategy:

  1. Inherit from ArticleExtractionStrategy

  2. Implement extract_articles() method

  3. Use in parser initialization