Architecture Overview ===================== Package Structure ----------------- The parser package is organized into focused modules:: tulit/parser/ ├── parser.py # Abstract base Parser class ├── models.py # Domain models (Article, Citation, Recital, etc.) ├── registry.py # Parser registry pattern for dynamic parser selection ├── normalization.py # Text normalization strategies ├── exceptions.py # Custom exception hierarchy ├── strategies/ │ └── article_extraction.py # Article extraction strategies ├── xml/ │ ├── xml.py # Abstract XMLParser base class │ ├── helpers.py # XML utilities (XMLNodeExtractor, XMLValidator) │ ├── formex.py # Formex4Parser │ ├── boe.py # BOEXMLParser (Spanish Official Gazette) │ └── akomantoso/ # Akoma Ntoso parser package │ ├── base.py # AkomaNtosoParser base class │ ├── akn4eu.py # AKN4EUParser variant │ ├── german.py # GermanLegalDocMLParser variant │ ├── luxembourg.py # LuxembourgAKNParser variant │ ├── extractors.py # Article extraction utilities │ └── utils.py # Format detection and factory functions └── html/ ├── html_parser.py # Abstract HTMLParser base class ├── veneto.py # VenetoHTMLParser (regional documents) └── cellar/ # EU Cellar parser package ├── cellar.py # CellarHTMLParser (semantic XHTML) ├── cellar_standard.py # CellarStandardHTMLParser (simple structure) └── proposal.py # ProposalHTMLParser (legislative proposals) Design Patterns --------------- Registry Pattern ~~~~~~~~~~~~~~~~ The ``ParserRegistry`` class implements the Registry pattern to enable dynamic parser selection: .. code-block:: python from tulit.parser.registry import ParserRegistry, get_parser_for_format # Register a new parser registry = ParserRegistry() registry.register('custom_format', CustomParser) # Get parser by format parser = get_parser_for_format('akn') **Benefits:** * Loose coupling between parser selection and usage * Easy to add new parsers without modifying existing code * Centralized parser management Strategy Pattern ~~~~~~~~~~~~~~~~ Multiple strategy patterns are used for algorithmic flexibility: **Text Normalization Strategies:** .. code-block:: python from tulit.parser.normalization import ( WhitespaceNormalizer, UnicodeNormalizer, CompositeNormalizer ) # Compose multiple normalization strategies normalizer = CompositeNormalizer([ WhitespaceNormalizer(), UnicodeNormalizer() ]) **Article Extraction Strategies:** .. code-block:: python from tulit.parser.strategies.article_extraction import ( FormexArticleStrategy, CellarStandardArticleStrategy ) # Each parser uses an appropriate strategy strategy = FormexArticleStrategy() articles = strategy.extract_articles(root_element) **Benefits:** * Algorithms can be selected at runtime * Easy to add new strategies * Promotes code reuse through composition Factory Pattern ~~~~~~~~~~~~~~~ Factory functions create appropriate parser instances: .. code-block:: python from tulit.parser.xml.akomantoso import create_akn_parser # Automatically detect format and create appropriate parser parser = create_akn_parser('document.akn') **Benefits:** * Encapsulates complex object creation logic * Client code doesn't need to know about concrete classes * Enables automatic format detection Template Method Pattern ~~~~~~~~~~~~~~~~~~~~~~~ The base ``Parser`` class defines the parsing workflow as a template method: .. code-block:: python class Parser(ABC): def parse(self, file: str, **options) -> 'Parser': """Template method defining the parsing workflow.""" self.get_root(file) self.get_preface() self.get_preamble() self.get_formula() self.get_citations() self.get_recitals() self.get_preamble_final() self.get_body() self.get_chapters() self.get_articles() self.get_conclusions() return self **Benefits:** * Consistent parsing workflow across all parsers * Subclasses override only specific steps * Reduces code duplication Domain Models ------------- Structured domain objects provide type-safe access to document components: .. code-block:: python from tulit.parser.models import Article, Citation, Recital, Chapter @dataclass class Article: """Represents a legal article with metadata and content.""" number: str title: Optional[str] = None content: Optional[str] = None children: List[ArticleChild] = field(default_factory=list) **Benefits:** * Type safety and IDE autocompletion * Clear data contracts * Easier testing and validation * Self-documenting code Exception Hierarchy ------------------- Custom exception hierarchy provides granular error handling: .. code-block:: python ParserError (base exception) ├── ParseError (parsing failures) ├── ValidationError (schema validation failures) ├── ExtractionError (data extraction failures) └── FileLoadError (file loading failures) **Benefits:** * Specific error handling at appropriate levels * Clear error semantics * Better debugging information XML Utilities ------------- Centralized XML utilities reduce code duplication: **XMLNodeExtractor:** * XPath-based node extraction * Namespace-aware queries * Text content extraction with normalization **XMLValidator:** * Schema validation with error handling * Support for both local and remote schemas * Detailed validation error reporting **Benefits:** * Consistent XML processing across parsers * Robust error handling * Reduced code duplication Module Organization ------------------- Modules are kept focused and maintainable: * Modules average 200-300 lines * Each module has a single, clear responsibility * Better separation of concerns * Easy to navigate and maintain Key organizational achievements: * ``parser.py``: 315 lines - Core abstract base class * ``xml.py``: 663 lines - XML parser base with utilities * ``akomantoso/``: 7 focused modules for different variants * ``cellar/``: 3 specialized HTML parsers for EU documents Testing Strategy ---------------- The codebase maintains comprehensive test coverage: * **126 tests passing** across all parsers * **16 tests skipped** (external API dependencies) * Unit tests for all public methods * Integration tests for complete parsing workflows * Edge case coverage for error handling Extension Points ---------------- The architecture provides multiple extension points: **Adding a New Parser:** 1. Inherit from ``XMLParser`` or ``HTMLParser`` 2. Implement required abstract methods 3. Register with ``ParserRegistry`` 4. Add tests **Adding a New Normalization Strategy:** 1. Inherit from ``TextNormalizationStrategy`` 2. Implement ``normalize()`` method 3. Use in parser through composition **Adding a New Article Extraction Strategy:** 1. Inherit from ``ArticleExtractionStrategy`` 2. Implement ``extract_articles()`` method 3. Use in parser initialization