Parsers
This subpackage contains modules for parsing legal documents in various formats into a JSON representation.
The formats currently supported are:
XML Formats:
Formex 4 (EU legislative documents)
Akoma Ntoso 3.0 (multiple variants: EU, German LegalDocML, Luxembourg CSD13)
BOE XML (Spanish Official Gazette)
HTML Formats:
Cellar XHTML (semantic structure)
Cellar Standard HTML (simple structure)
EU Legislative Proposals
Core Parser Architecture
Parser Base Module
This module provides the abstract Parser base class and JSON validation utilities. All concrete parsers should inherit from the Parser class and implement the required abstract methods.
The module imports domain models, exceptions, the registry, and normalization strategies from their respective focused modules.
- class tulit.parser.parser.Parser
Bases: ABC
Abstract base class for legal document parsers.
All subclasses must implement get_preface(), get_articles(), and parse().
Optional methods with default implementations: get_preamble(), get_formula(), get_citations(), get_recitals(), get_preamble_final(), get_body(), get_chapters(), and get_conclusions().
- root
Root element of the XML or HTML document.
- Type:
lxml.etree._Element or bs4.BeautifulSoup
- preamble
The preamble section of the document.
- Type:
lxml.etree.Element or bs4.Tag or None
- body
The body section of the document.
- Type:
lxml.etree.Element or bs4.Tag or None
- articles
List of articles extracted from the body. Each article is a dictionary with the keys ‘eId’ (article identifier), ‘text’ (article text), and ‘children’ (list of child elements of the article).
- Type:
list of dict
- abstract get_preface() str | None
Extract document preface/title.
MUST be implemented by all subclasses.
- Returns:
Document title/preface text
- Return type:
str or None
- abstract get_articles() None
Extract articles from document body.
MUST be implemented by all subclasses. Extracts articles and stores them in self.articles as a list of dictionaries.
- Returns:
Articles are stored in self.articles attribute
- Return type:
None
- abstract parse(file: str, **options) Parser
Parse document and extract all components.
MUST be implemented by all subclasses.
- get_preamble() Any | None
Extract preamble section.
Override in subclass if format has preamble. Default returns None.
- Returns:
Preamble element or None if not present
- Return type:
Any or None
- get_formula() str | None
Extract formula (enacting clause).
Override in subclass if format has formula. Default returns None.
- Returns:
Formula text or None if not present
- Return type:
str or None
- get_citations() list[dict[str, str]]
Extract citations/references.
Override in subclass if format has citations. Default returns empty list.
- get_recitals() list[dict[str, str]]
Extract recitals (whereas clauses).
Override in subclass if format has recitals. Default returns empty list.
- get_preamble_final() str | None
Extract final preamble text.
Override in subclass if format has final preamble. Default returns None.
- Returns:
Final preamble text or None if not present
- Return type:
str or None
- get_body() Any | None
Extract body section.
Override in subclass if needed. Default returns None.
- Returns:
Body element or None
- Return type:
Any or None
- get_chapters() list[dict[str, Any]]
Extract chapters.
Override in subclass if format has chapters. Default returns empty list.
- get_conclusions() dict[str, Any] | None
Extract conclusions section.
Override in subclass if format has conclusions. Default returns None.
- to_dict() dict[str, Any]
Convert the parser’s extracted data to a dictionary.
This version ensures that common non-JSON-native objects are converted to JSON-serializable forms. It will:
Call .to_dict() on domain model objects (Citation, Article, etc.) if available.
Recursively convert lists and dicts.
Convert BeautifulSoup Tag objects to their text content.
Convert lxml elements to their concatenated text content.
- Returns:
A dictionary containing all extracted elements from the document with JSON-serializable values.
- Return type:
dict[str, Any]
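The conversion rules listed for to_dict() can be illustrated with a small standalone helper. This is a minimal sketch, not tulit's implementation: the function name to_jsonable and the Citation stand-in class are hypothetical, and elements are recognised by duck typing (an itertext() method for lxml-style elements, a get_text() method for BeautifulSoup-style tags) so that the example runs with only the standard library.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any


class Citation:
    """Hypothetical stand-in for a domain model that exposes to_dict()."""

    def __init__(self, eId: str, text: str) -> None:
        self.eId = eId
        self.text = text

    def to_dict(self) -> dict[str, str]:
        return {"eId": self.eId, "text": self.text}


def to_jsonable(value: Any) -> Any:
    """Recursively convert a value into JSON-serializable data."""
    if hasattr(value, "to_dict"):
        # Domain model objects describe themselves.
        return to_jsonable(value.to_dict())
    if isinstance(value, dict):
        return {str(key): to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if hasattr(value, "itertext"):
        # XML elements: concatenate the text of the element and its descendants.
        return "".join(value.itertext())
    if hasattr(value, "get_text"):
        # BeautifulSoup-style tags: use their text content.
        return value.get_text()
    return value


element = ET.fromstring("<p>Article <b>1</b> applies.</p>")
data = to_jsonable({
    "citations": [Citation("cit_1", "Having regard to ...")],
    "body": element,
})
serialized = json.dumps(data)
```

After conversion, every value in data is a plain string, list, or dict, so json.dumps() succeeds without a custom encoder.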
Domain Models
Domain Models Module
This module contains domain model classes representing legal document structures. These models provide a clear, type-safe representation of legal documents, independent of the parsing implementation.
- class tulit.parser.models.Citation(eId: str, text: str)
Bases: object
Represents a citation in a legal document.
- class tulit.parser.models.Recital(eId: str, text: str)
Bases: object
Represents a recital (whereas clause) in a legal document.
- class tulit.parser.models.ArticleChild(eId: str, text: str, amendment: bool | None = None)
Bases: object
Represents a child element of an article (paragraph, point, etc.).
- class tulit.parser.models.Article(eId: str, num: str, heading: str | None = None, children: List[ArticleChild] = None)
Bases: object
Represents an article in a legal document.
- children
Child elements (paragraphs, points)
- Type:
List[ArticleChild]
- children: List[ArticleChild] = None
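The constructor signatures above suggest plain data containers. A minimal sketch of how such models could be written as dataclasses follows. The field names come from the signatures above, but the use of dataclasses, the to_dict() helper built on asdict(), and the replacement of a None children value by an empty list are assumptions made for illustration, not tulit's actual code.

```python
from dataclasses import asdict, dataclass
from typing import Any, List, Optional


@dataclass
class ArticleChild:
    eId: str
    text: str
    amendment: Optional[bool] = None


@dataclass
class Article:
    eId: str
    num: str
    heading: Optional[str] = None
    children: Optional[List[ArticleChild]] = None

    def __post_init__(self) -> None:
        # Assumption: treat a missing children list as empty, so each
        # instance gets its own list instead of sharing a mutable default.
        if self.children is None:
            self.children = []

    def to_dict(self) -> dict[str, Any]:
        # asdict() also converts nested dataclasses inside lists.
        return asdict(self)


article = Article(eId="art_1", num="Article 1")
article.children.append(
    ArticleChild(eId="art_1__para_1", text="This Regulation applies to ...")
)
article_dict = article.to_dict()
```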
Parser Registry
Parser Registry Module
This module provides a registry pattern for managing parser implementations. It allows for dynamic parser discovery and instantiation based on format types.
- class tulit.parser.registry.ParserRegistry
Bases: object
Registry for managing parser implementations.
This class implements the Registry pattern to allow dynamic parser discovery and instantiation. Parsers can be registered with format identifiers and aliases, and then retrieved by format name.
Example
>>> registry = ParserRegistry()
>>> registry.register('xml', XMLParser)
>>> parser = registry.create('xml')
- __init__()
Initialize an empty parser registry.
- register(format_id: str, parser_class: Type, aliases: List[str] | None = None) None
Register a parser class for a given format.
- Parameters:
- Raises:
ParserError – If format_id or any alias is already registered
- register_factory(format_id: str, factory_func: Callable, aliases: List[str] | None = None) None
Register a factory function for creating parser instances.
This is useful when parser instantiation requires special logic or when dealing with parser variants.
- create(format_id: str, *args, **kwargs)
Create a parser instance for the given format.
- Parameters:
format_id (str) – Format identifier or alias
*args – Arguments to pass to parser constructor
**kwargs – Arguments to pass to parser constructor
- Returns:
An instance of the requested parser
- Return type:
- Raises:
ParserError – If format_id is not registered
- list_formats() List[str]
List all registered format identifiers.
- Returns:
List of format identifiers (not including aliases)
- Return type:
List[str]
- tulit.parser.registry.get_parser_registry() ParserRegistry
Get the global parser registry instance.
- Returns:
The global parser registry
- Return type:
ParserRegistry
- tulit.parser.registry.register_parser(format_id: str, parser_class: Type = None, factory: Callable = None, aliases: List[str] | None = None) None
Convenience function to register a parser in the global registry.
- Parameters:
Example
>>> register_parser('xml', XMLParser, aliases=['xmldoc'])
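The registration and lookup behaviour described in this section (format identifiers, aliases, an error on duplicate registration, an error on unknown formats) can be sketched with a small standalone class. SketchRegistry and DummyXMLParser are hypothetical names, and the ParserError defined here is only a stand-in for the package's exception; the real ParserRegistry may store and resolve entries differently.

```python
from typing import Callable, Dict, List, Optional


class ParserError(Exception):
    """Stand-in for the package's base parser exception."""


class SketchRegistry:
    """Hypothetical, simplified registry; not tulit's ParserRegistry."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable] = {}
        self._aliases: Dict[str, str] = {}

    def register(self, format_id: str, parser_class: Callable,
                 aliases: Optional[List[str]] = None) -> None:
        # Refuse to overwrite an existing identifier or alias.
        for name in [format_id, *(aliases or [])]:
            if name in self._factories or name in self._aliases:
                raise ParserError(f"'{name}' is already registered")
        self._factories[format_id] = parser_class
        for alias in aliases or []:
            self._aliases[alias] = format_id

    def create(self, format_id: str, *args, **kwargs):
        # Resolve an alias to its canonical identifier before lookup.
        canonical = self._aliases.get(format_id, format_id)
        if canonical not in self._factories:
            raise ParserError(f"No parser registered for '{format_id}'")
        return self._factories[canonical](*args, **kwargs)

    def list_formats(self) -> List[str]:
        # Aliases are deliberately excluded.
        return sorted(self._factories)


class DummyXMLParser:
    """Placeholder parser class used only for this example."""


registry = SketchRegistry()
registry.register("xml", DummyXMLParser, aliases=["xmldoc"])
parser = registry.create("xmldoc")
```

Because a class is itself callable, the same mapping can hold either parser classes or factory functions, which is one way a register_factory() variant could share the lookup path.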
Text Normalization
Text Normalization Strategies Module
This module provides text normalization strategies following the Strategy pattern. Different normalization algorithms can be selected at runtime, making parsers more flexible and testable.
- class tulit.parser.normalization.TextNormalizationStrategy
Bases: ABC
Abstract base class for text normalization strategies.
The Strategy pattern allows different text cleaning/normalization algorithms to be selected at runtime, making parsers more flexible and testable.
Example
>>> normalizer = WhitespaceNormalizer()
>>> clean_text = normalizer.normalize("  multiple   spaces  ")
>>> clean_text
'multiple spaces'
- class tulit.parser.normalization.WhitespaceNormalizer(fix_punctuation: bool = True)
Bases: TextNormalizationStrategy
Normalizes whitespace in text.
Removes newlines, tabs, carriage returns
Collapses multiple spaces to single space
Strips leading/trailing whitespace
Optionally fixes spacing before punctuation
- class tulit.parser.normalization.UnicodeNormalizer(unicode_form: str | None = None, replace_nbsp: bool = True)
Bases: TextNormalizationStrategy
Normalizes unicode characters in text.
Replaces non-breaking spaces with regular spaces
Optionally normalizes unicode to a specific form (NFC, NFD, NFKC, NFKD)
- class tulit.parser.normalization.PatternReplacementNormalizer(patterns: List[tuple[str, str]])
Bases: TextNormalizationStrategy
Normalizes text using regex pattern replacements.
Useful for removing specific markers, formatting codes, or document-specific artifacts.
- __init__(patterns: List[tuple[str, str]])
Initialize pattern replacement normalizer.
- Parameters:
patterns (List[tuple[str, str]]) – List of (pattern, replacement) tuples for regex substitution
Example
>>> normalizer = PatternReplacementNormalizer([
...     (r'▼[A-Z]\d*', ''),   # Remove consolidation markers
...     (r'^\(\d+\)', '')     # Remove leading numbers in parentheses
... ])
- class tulit.parser.normalization.CompositeNormalizer(strategies: List[TextNormalizationStrategy])
Bases: TextNormalizationStrategy
Composite strategy that applies multiple normalizers in sequence.
This allows combining different normalization strategies in a specific order to achieve complex text cleaning operations.
Example
>>> normalizer = CompositeNormalizer([
...     UnicodeNormalizer(),
...     WhitespaceNormalizer(),
...     PatternReplacementNormalizer([(r'▼[A-Z]\d*', '')])
... ])
>>> clean_text = normalizer.normalize(raw_text)
- __init__(strategies: List[TextNormalizationStrategy])
Initialize composite normalizer.
- Parameters:
strategies (List[TextNormalizationStrategy]) – List of normalizers to apply in order
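How a composite of strategies behaves on real text is easier to see in a runnable form. The following is a minimal sketch under stated assumptions: the class names (Strategy, UnicodeStep, WhitespaceStep, PatternStep, CompositeStep) are hypothetical stand-ins, and the regular expressions are one plausible reading of the behaviour described above rather than the library's own patterns.

```python
import re
import unicodedata
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple


class Strategy(ABC):
    """Hypothetical stand-in for TextNormalizationStrategy."""

    @abstractmethod
    def normalize(self, text: str) -> str:
        ...


class UnicodeStep(Strategy):
    def __init__(self, unicode_form: Optional[str] = None) -> None:
        self.unicode_form = unicode_form

    def normalize(self, text: str) -> str:
        # Replace non-breaking spaces, then optionally apply NFC/NFD/NFKC/NFKD.
        text = text.replace("\u00a0", " ")
        if self.unicode_form:
            text = unicodedata.normalize(self.unicode_form, text)
        return text


class WhitespaceStep(Strategy):
    def normalize(self, text: str) -> str:
        # Collapse newlines, tabs, and runs of spaces, then trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
        # Remove whitespace that precedes punctuation.
        return re.sub(r"\s+([.,;:!?])", r"\1", text)


class PatternStep(Strategy):
    def __init__(self, patterns: List[Tuple[str, str]]) -> None:
        self.patterns = [(re.compile(p), repl) for p, repl in patterns]

    def normalize(self, text: str) -> str:
        for pattern, replacement in self.patterns:
            text = pattern.sub(replacement, text)
        return text


class CompositeStep(Strategy):
    def __init__(self, strategies: List[Strategy]) -> None:
        self.strategies = strategies

    def normalize(self, text: str) -> str:
        # Each strategy receives the output of the previous one.
        for strategy in self.strategies:
            text = strategy.normalize(text)
        return text


normalizer = CompositeStep([
    PatternStep([(r"▼[A-Z]\d*", "")]),  # consolidation markers
    UnicodeStep(),
    WhitespaceStep(),
])
clean_text = normalizer.normalize("▼M1 Article\u00a01 \n\t applies , here .")
```

Order matters in a composite: running the whitespace step last cleans up any gaps left behind by the pattern removal.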
- tulit.parser.normalization.create_standard_normalizer() CompositeNormalizer
Create a standard text normalizer suitable for most legal documents.
Applies:
1. Unicode normalization (non-breaking spaces)
2. Whitespace normalization (newlines, tabs, multiple spaces)
3. Punctuation spacing fixes
- Returns:
Composite normalizer with standard strategies
- Return type:
CompositeNormalizer
- tulit.parser.normalization.create_html_normalizer() CompositeNormalizer
Create a normalizer for HTML-based legal documents.
Applies:
1. Pattern removal (consolidation markers)
2. Unicode normalization
3. Whitespace normalization
- Returns:
Composite normalizer for HTML documents
- Return type:
CompositeNormalizer
- tulit.parser.normalization.create_formex_normalizer() CompositeNormalizer
Create a normalizer for Formex XML documents.
Applies:
1. Pattern removal (leading parentheses numbers)
2. Unicode normalization
3. Whitespace normalization
- Returns:
Composite normalizer for Formex documents
- Return type:
CompositeNormalizer
Parser Exceptions
Parser Exceptions Module
This module contains all custom exception classes for the parser package. Organizing exceptions in a dedicated module improves maintainability and allows for better exception handling patterns.
- exception tulit.parser.exceptions.ParserError
Bases: Exception
Base exception for all parser-related errors.
- exception tulit.parser.exceptions.ParseError
Bases: ParserError
Raised when parsing fails due to malformed input.
- exception tulit.parser.exceptions.ValidationError
Bases: ParserError
Raised when validation against a schema fails.
- exception tulit.parser.exceptions.ExtractionError
Bases: ParserError
Raised when extraction of specific content fails.
- exception tulit.parser.exceptions.FileLoadError
Bases: ParserError
Raised when loading a file fails.
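Because every exception derives from ParserError, callers can handle a specific failure first and fall back to the base class for everything else. The sketch below redefines two of the exceptions locally so that it is self-contained; the load_document() helper is hypothetical and only illustrates wrapping a low-level error in a package-specific one.

```python
class ParserError(Exception):
    """Local stand-in for the package's base exception."""


class FileLoadError(ParserError):
    """Local stand-in: raised when loading a file fails."""


def load_document(path: str) -> str:
    """Hypothetical loader that wraps OS errors in FileLoadError."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Chain the original error so the root cause stays visible.
        raise FileLoadError(f"Cannot load {path}") from exc


try:
    load_document("/nonexistent/path/document.xml")
    outcome = "loaded"
except FileLoadError as exc:
    # Most specific handler first.
    outcome = f"file problem: {exc}"
except ParserError as exc:
    # Catch-all for any other parser-related failure.
    outcome = f"other parser problem: {exc}"
```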
XML Parsers
Base XML Parser
XML Parser Base Module
This module provides the abstract XMLParser base class for XML-based document parsers. All XML parsers should inherit from XMLParser and implement the required abstract methods.
The XMLParser class integrates XML validation, node extraction utilities, and text normalization from the organized helper modules.
- class tulit.parser.xml.xml.XMLParser(normalizer: TextNormalizationStrategy | None = None)
Bases: Parser
Abstract base class for XML parsers.
Provides common XML parsing utilities and helper methods. Uses XMLValidator for schema validation and TextNormalizationStrategy for text processing.
Subclasses must implement get_preface(), get_articles(), and parse() or use the provided parse() template method by overriding component methods.
- validation_errors
Validation errors if the XML file is invalid.
- Type:
lxml.etree._LogEntry or None
- normalizer
Strategy for text normalization operations.
- __init__(normalizer: TextNormalizationStrategy | None = None) None
Initializes the Parser object with default attributes.
- Parameters:
normalizer (TextNormalizationStrategy, optional) – Text normalization strategy to use. Defaults to standard normalizer.
- load_schema(schema: str) None
Load an XSD schema for XML validation.
Delegates to XMLValidator for actual schema loading.
- Parameters:
schema (str) – Filename of the XSD schema file
- Return type:
None
- validate(file: str, format: str) bool
Validate an XML file against the loaded schema.
Delegates to XMLValidator for actual validation.
- remove_node(tree, node)
Removes specified nodes from the XML tree while preserving their tail text.
Delegates to XMLNodeExtractor for node removal.
- Parameters:
tree (lxml.etree._Element) – The XML tree or subtree to process.
node (str) – XPath expression identifying the nodes to remove.
- Returns:
The modified XML tree with specified nodes removed.
- Return type:
lxml.etree._Element
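The reason tail text needs special care is that, in element-tree APIs, the text following an element is stored on that element, so removing the element would silently drop it. The sketch below shows the idea with the standard library's xml.etree.ElementTree and a plain tag name instead of an XPath expression; the function name remove_nodes_keep_tail is hypothetical, and this is not the XMLNodeExtractor implementation.

```python
import xml.etree.ElementTree as ET


def remove_nodes_keep_tail(root: ET.Element, tag: str) -> ET.Element:
    """Remove every element named `tag`, keeping the text that follows it."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if previous is not None:
                    # Attach the tail to the preceding sibling.
                    previous.tail = (previous.tail or "") + tail
                else:
                    # No preceding sibling: the tail joins the parent's text.
                    parent.text = (parent.text or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


paragraph = ET.fromstring(
    "<p>See<note>n1</note> the text<b>x</b><note>n2</note> end</p>"
)
remove_nodes_keep_tail(paragraph, "note")
remaining_text = "".join(paragraph.itertext())
```

Without the tail handling, the words "the text" and "end" would disappear together with the removed note elements.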
- get_root(file: str | None = None)
Parses an XML file and returns its root element using secure parser settings.
- Parameters:
file (str, optional) – Path to the XML file. If not provided, uses the file path from parse()
- Return type:
None
- Raises:
FileLoadError – If file cannot be loaded or parsed
- get_preface(preface_xpath, paragraph_xpath) None
Extracts paragraphs from the preface section of the document.
- get_formula(formula_xpath: str, paragraph_xpath: str) str
Extracts formula text from the preamble.
- Parameters:
- Returns:
Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.
- Return type:
str or None
- get_citations(citations_xpath, citation_xpath, extract_eId=None)
Extracts citations from the preamble.
- Parameters:
- Returns:
Updates the instance’s citations attribute with the found citations.
- Return type:
None
- get_recitals(recitals_xpath, recital_xpath, text_xpath, extract_intro=None, extract_eId=None)
Extracts recitals from the preamble.
- Parameters:
recitals_xpath (str) – XPath expression to locate the recitals section.
recital_xpath (str) – XPath expression to locate individual recitals.
text_xpath (str) – XPath expression to locate the text within each recital.
extract_intro (function, optional) – Function to handle the extraction of the introductory recital.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.
- Returns:
Updates the instance’s recitals attribute with the found recitals.
- Return type:
None
- get_preamble_final(preamble_final_xpath) str
Extracts the final preamble text from the document.
- Parameters:
preamble_final_xpath (str) – XPath expression to locate the final preamble element.
- Returns:
Updates the instance’s preamble_final attribute with the found final preamble text.
- Return type:
None
- get_body(body_xpath) None
Extracts the body element from the document.
- Parameters:
body_xpath (str) – XPath expression to locate the body element. For Akoma Ntoso, this is usually ‘.//akn:body’, while for Formex it is ‘.//ENACTING.TERMS’.
- Returns:
Updates the instance’s body attribute with the found body element.
- Return type:
None
- get_chapters(chapter_xpath: str, num_xpath: str, heading_xpath: str, extract_eId=None, get_headings=None) None
Extracts chapter information from the document.
- Parameters:
chapter_xpath (str) – XPath expression to locate the chapter elements.
num_xpath (str) – XPath expression to locate the chapter number within each chapter element.
heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.
- Returns:
Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with the keys ‘eId’ (chapter identifier), ‘chapter_num’ (chapter number), and ‘chapter_heading’ (chapter heading text).
- Return type:
None
- abstract get_articles() None
Extracts articles from the body section.
MUST be implemented by all XML parser subclasses. Subclasses should extract articles according to their specific XML format and store them in self.articles.
- Returns:
Articles are stored in self.articles attribute
- Return type:
None
- get_conclusions()
Extracts conclusions from the body section.
Override in subclass if format has conclusions. Default implementation does nothing.
- Return type:
None
- parse(file: str, **options) XMLParser
Template method that orchestrates the entire parsing workflow.
DO NOT OVERRIDE THIS METHOD. Instead, override individual component extraction methods like get_preface(), get_articles(), etc.
- Parameters:
- Returns:
Self for method chaining with the parsed data stored in its attributes.
- Return type:
XMLParser
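The template-method arrangement described for parse() can be sketched as follows. TemplateParser and ToyParser are hypothetical classes, and the order in which the component methods are called here is an assumption; the point is only that parse() fixes the workflow while subclasses supply the individual extraction steps.

```python
from abc import ABC, abstractmethod


class TemplateParser(ABC):
    """Hypothetical base class illustrating a parse() template method."""

    def __init__(self) -> None:
        self.source = None
        self.preface = None
        self.articles = []
        self.conclusions = None

    def parse(self, file: str, **options) -> "TemplateParser":
        # The workflow is fixed here; subclasses customise the steps.
        self.source = file
        self.get_preface()
        self.get_articles()
        self.get_conclusions()
        return self  # enables method chaining

    @abstractmethod
    def get_preface(self) -> None:
        ...

    @abstractmethod
    def get_articles(self) -> None:
        ...

    def get_conclusions(self) -> None:
        """Optional hook: the default implementation does nothing."""
        return None


class ToyParser(TemplateParser):
    def get_preface(self) -> None:
        self.preface = f"Title of {self.source}"

    def get_articles(self) -> None:
        self.articles = [{"eId": "art_1", "text": "Sample text", "children": []}]


parser = ToyParser().parse("document.xml")
```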
XML Helpers
XML Helper Utilities Module
This module provides utility classes for common XML operations including XPath-based extraction, validation, and node manipulation. These utilities reduce code duplication across XML-based parsers.
- class tulit.parser.xml.helpers.XMLNodeExtractor(namespaces: dict[str, str] | None = None)
Bases: object
Utility class for XPath-based XML node extraction and manipulation.
This class encapsulates common XPath operations and text extraction patterns, reducing duplication and complexity in XML parsers.
Example
>>> extractor = XMLNodeExtractor({'akn': 'http://...'})
>>> node = extractor.find(root, './/akn:article')
>>> text = extractor.extract_text(node)
- __init__(namespaces: dict[str, str] | None = None)
Initialize the node extractor.
- Parameters:
namespaces (dict, optional) – Dictionary of namespace prefixes to URIs
- find(element: _Element, xpath: str) _Element | None
Find the first element matching the XPath expression.
- Parameters:
element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression
- Returns:
First matching element or None
- Return type:
lxml.etree._Element or None
- findall(element: _Element, xpath: str) List[_Element]
Find all elements matching the XPath expression.
- extract_text(element: _Element, strip: bool = True) str
Extract all text content from an element and its descendants.
- extract_text_from_all(parent: _Element, xpath: str, strip: bool = True) List[str]
Extract text from all elements matching the XPath.
- safe_find(element: _Element, xpath: str, default: _Element | None = None) _Element | None
Safely find an element, returning default if not found.
- Parameters:
element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression
default (lxml.etree._Element, optional) – Value to return if not found
- Returns:
Found element or default value
- Return type:
lxml.etree._Element or default
- safe_find_text(element: _Element, xpath: str, default: str = '') str
Safely find an element and extract its text.
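The "safe" lookups above follow a common pattern: search with a namespace map, and return a default instead of failing when nothing matches. A minimal sketch using the standard library's xml.etree.ElementTree is given below. The free functions extract_text() and safe_find_text() are hypothetical equivalents written for illustration, and the namespace URI is the standard Akoma Ntoso 3.0 namespace, used here as an assumption about the input.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Assumed namespace map for the sample document below.
AKN_NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}


def extract_text(element: ET.Element, strip: bool = True) -> str:
    """Concatenate the text of an element and all of its descendants."""
    text = "".join(element.itertext())
    return text.strip() if strip else text


def safe_find_text(element: ET.Element, xpath: str,
                   namespaces: Optional[Dict[str, str]] = None,
                   default: str = "") -> str:
    """Return the text of the first match, or `default` if nothing matches."""
    found = element.find(xpath, namespaces)
    return default if found is None else extract_text(found)


article = ET.fromstring(
    '<akn:article xmlns:akn="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">'
    "<akn:num> Article 1 </akn:num>"
    "<akn:heading>Subject <akn:i>matter</akn:i></akn:heading>"
    "</akn:article>"
)
num = safe_find_text(article, "akn:num", AKN_NS)
heading = safe_find_text(article, "akn:heading", AKN_NS)
missing = safe_find_text(article, "akn:paragraph", AKN_NS, default="n/a")
```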
- class tulit.parser.xml.helpers.XMLValidator
Bases: object
Handles XML schema loading and validation.
This class provides robust schema validation with proper error handling and logging. It supports both XSD and RelaxNG schemas.
Example
>>> validator = XMLValidator()
>>> validator.load_schema('schema.xsd')
>>> is_valid = validator.validate(xml_root)
- __init__()
Initialize the XML validator.
Formex Parser
- class tulit.parser.xml.formex.Formex4Parser
Bases: XMLParser
A parser for processing and extracting content from Formex XML files.
The parser handles XML documents following the Formex schema for legal documents. It inherits from the XMLParser class and provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.
- get_preface() None
Extracts the preface from the document. It is assumed that the preface is contained within the TITLE and P elements.
- get_preamble() None
Extracts the preamble from the document. It is assumed that the preamble is contained within the PREAMBLE element, while notes are contained within the NOTE elements.
- get_formula() None
Extracts the formula from the preamble. The formula is assumed to be contained within the PREAMBLE.INIT element.
- Returns:
Formula text from the preamble.
- Return type:
- get_citations() None
Extracts citations from the preamble. Citations are assumed to be contained within the GR.VISA and VISA elements. The citation identifier is set as the index of the citation in the preamble.
- Returns:
List of dictionaries containing citation data with the keys ‘eId’ (citation identifier, which is the index of the citation in the preamble) and ‘text’ (citation text).
- Return type:
- get_recitals() None
Extracts recitals from the preamble. Recitals are assumed to be contained within the GR.CONSID and CONSID elements. The introductory recital is extracted separately. The recital identifier is set as the index of the recital in the preamble.
- Returns:
List of dictionaries containing recital text and eId for each recital. Returns None if no recitals are found.
- Return type:
list or None
- get_preamble_final() None
Extracts the final preamble text from the document. The final preamble text is assumed to be contained within the PREAMBLE.FINAL element.
- get_body() None
Extracts the body section from the document. The body is assumed to be contained within the ENACTING.TERMS element.
- get_chapters() None
Extracts chapter information from the document. Chapter numbers and headings are assumed to be contained within the TITLE element. The chapter identifier is set as the index of the chapter in the document.
- Returns:
List of dictionaries containing chapter data with the keys ‘eId’ (chapter identifier), ‘chapter_num’ (chapter number), and ‘chapter_heading’ (chapter heading text).
- Return type:
- get_articles() None
Extracts articles from the ENACTING.TERMS section using FormexArticleStrategy.
This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.
- Returns:
Articles with identifier and content.
- Return type:
- get_conclusions() None
Extracts conclusions from the document. The conclusion text is assumed to be contained within the FINAL section of the document. The signature details are assumed to be contained within the SIGNATURE element.
- Returns:
Dictionary containing the conclusion text and signature details.
- Return type:
- parse(file: str, **options) Formex4Parser
Parses a FORMEX XML document to extract its components, which are inherited from the XMLParser class. If the input is a directory, searches for the correct XML file (one containing ACT or DECISION tags).
- Parameters:
- Returns:
Self for method chaining with parsed data.
- Return type:
Formex4Parser
Akoma Ntoso Parsers
Akoma Ntoso Base Parser
This module provides the base AkomaNtosoParser class for processing legal documents in the Akoma Ntoso 3.0 format. All variant parsers (AKN4EU, German LegalDocML, Luxembourg) inherit from this base class.
- class tulit.parser.xml.akomantoso.base.AkomaNtosoParser
Bases: XMLParser
Base parser for processing Akoma Ntoso 3.0 legal documents.
The parser handles XML documents following the Akoma Ntoso 3.0 schema for legal documents. It provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.
Example
>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> articles = parser.articles
- get_preface() None
Extract preface information from the document.
The preface is contained within the ‘preface’ element in the XML file.
- get_preamble() None
Extract preamble information from the document.
The preamble is contained within the ‘preamble’ element in the XML file.
- get_formula() None
Extract formula from the preamble.
The formula is contained within the ‘formula’ element in the XML file. The formula text is extracted from all paragraphs within the formula element.
- Returns:
Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.
- Return type:
str or None
- get_citations() None
Extract citations from the preamble.
The citations are contained within the ‘citations’ element. Each citation is extracted from the ‘citation’ element, with text from all paragraphs.
- get_recitals() None
Extract recitals from the preamble.
Recitals are contained within the ‘recitals’ element. Each recital is extracted from the ‘recital’ element, with text from all paragraphs.
- get_preamble_final() None
Extract the final part of the preamble.
This is typically the text after citations and recitals, contained in the ‘preamble.final’ block.
- get_body() None
Extract the body section from the document.
The body contains the main content including articles, chapters, etc.
- get_chapters() None
Extract chapters from the body.
Chapters structure the main content and may contain articles.
- extract_eId(element: _Element, index: int | None = None) str
Extract the element ID (eId) from an XML element.
The standard Akoma Ntoso format uses ‘eId’ attribute for element identification. Subclasses may override this for format-specific ID extraction.
- get_articles() None
Extract articles from the body using AKNArticleExtractor.
Articles are the main structural units of legal documents. This method uses AKNArticleExtractor to handle the extraction logic. Also handles sections for jurisdictions that use sections instead of articles.
- get_conclusions() None
Extract conclusions from the document.
Conclusions contain closing text and signatures.
- parse(file: str, **options) AkomaNtosoParser
Parse an Akoma Ntoso document to extract all components.
This method validates the document against the Akoma Ntoso 3.0 schema and extracts all content using the orchestrator pattern.
- Parameters:
- Returns:
Self for method chaining
- Return type:
AkomaNtosoParser
Example
>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> print(len(parser.articles))
AKN4EU Parser
This module provides the AKN4EU parser for European Union legal documents using the Akoma Ntoso for EU (AKN4EU) format.
- class tulit.parser.xml.akomantoso.akn4eu.AKN4EUParser
Bases: AkomaNtosoParser
Parser for AKN4EU (Akoma Ntoso for European Union) documents.
This parser handles EU legal documents that use the AKN4EU variant of Akoma Ntoso, which includes EU-specific extensions and conventions.
Key Differences from Standard Akoma Ntoso:
Uses XML ‘id’ attribute instead of ‘eId’ for element identification
Follows EU-specific document structure conventions
Example
>>> parser = AKN4EUParser()
>>> parser.parse('eu_regulation.xml')
>>> print(parser.preface)
German LegalDocML Parser
This module provides the parser for German LegalDocML documents, which follow the Akoma Ntoso structure but use a German-specific namespace.
- class tulit.parser.xml.akomantoso.german.GermanLegalDocMLParser
Bases: AkomaNtosoParser
Parser for German LegalDocML documents.
This parser handles German legal documents that follow the Akoma Ntoso structure but use the German RIS (Rechtsinformationssystem) namespace.
German LegalDocML Namespace: http://Inhaltsdaten.LegalDocML.de/1.8.2/
Key Differences from Standard Akoma Ntoso:
Uses German-specific namespace while maintaining AKN structure
Schema validation is skipped (German-specific schema variations)
All XPath queries work seamlessly due to namespace remapping
Example
>>> parser = GermanLegalDocMLParser()
>>> parser.parse('german_law.xml')
>>> print(parser.articles)
- parse(file: str, **options) GermanLegalDocMLParser
Parse a German LegalDocML document to extract its components.
German LegalDocML follows Akoma Ntoso structure but uses a German-specific namespace and may have schema variations. This method bypasses schema validation and directly extracts the content.
- Parameters:
- Returns:
Self for method chaining
- Return type:
GermanLegalDocMLParser
Example
>>> parser = GermanLegalDocMLParser()
>>> parser.parse('bgb.xml')
Luxembourg Akoma Ntoso Parser
This module provides the parser for Luxembourg legal documents using the Committee Specification Draft 13 (CSD13) variant of Akoma Ntoso 3.0.
- class tulit.parser.xml.akomantoso.luxembourg.LuxembourgAKNParser
Bases: AkomaNtosoParser
Parser for Luxembourg Akoma Ntoso documents (CSD13 variant).
This parser handles Luxembourg Legilux documents which use the Committee Specification Draft 13 (CSD13) namespace variant of Akoma Ntoso 3.0.
Luxembourg Namespace: http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13
Key Differences from Standard Akoma Ntoso:
Uses CSD13 namespace variant
Uses ‘id’ attribute instead of ‘eId’ for element identification
Content is nested in <alinea><content><p> structure
Includes Luxembourg-specific metadata namespace (http://www.scl.lu)
Example
>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_law.xml')
>>> print(parser.articles)
- extract_eId(element: _Element, index: int | None = None) str
Extract element ID from ‘id’ attribute (Luxembourg convention).
Luxembourg documents use the ‘id’ attribute instead of ‘eId’ for element identification.
- parse(file: str, **options) LuxembourgAKNParser
Parse a Luxembourg Akoma Ntoso document to extract its components.
Luxembourg documents use the CSD13 variant and may have specific structural differences. This method bypasses schema validation and uses the orchestrator for content extraction.
- Parameters:
- Returns:
Self for method chaining
- Return type:
LuxembourgAKNParser
Example
>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_code.xml')
Akoma Ntoso Utility Functions
This module provides utility functions for detecting Akoma Ntoso formats and creating appropriate parser instances.
- tulit.parser.xml.akomantoso.utils.detect_akn_format(file_path: str) str
Automatically detect the Akoma Ntoso format/dialect based on the XML namespace.
This function examines the root element’s namespace to determine which variant of Akoma Ntoso is being used (standard, German LegalDocML, Luxembourg CSD13, or AKN4EU).
- Parameters:
file_path (str) – Path to the XML file
- Returns:
Format identifier: ‘german’, ‘akn4eu’, ‘luxembourg’, or ‘akn’ (standard)
- Return type:
str
Example
>>> format_type = detect_akn_format('document.xml')
>>> print(format_type)
akn4eu
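Namespace-based detection can be sketched by reading only the root element and comparing its namespace with the URIs documented for each variant. The sketch below is an assumption about how such detection might work, not the body of detect_akn_format(): the function names are hypothetical, only the German and Luxembourg namespaces quoted in this document are checked, and the rule that distinguishes AKN4EU from standard Akoma Ntoso is not shown because it is not documented here.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URIs as quoted in the parser descriptions in this document.
GERMAN_NS = "http://Inhaltsdaten.LegalDocML.de/1.8.2/"
LUXEMBOURG_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"


def root_namespace(source) -> str:
    """Return the namespace URI of the root element ('' if it has none)."""
    # iterparse yields the root's start event first, so the rest of the
    # document never has to be read.
    for _event, element in ET.iterparse(source, events=("start",)):
        tag = element.tag
        return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    return ""


def detect_by_namespace(source) -> str:
    """Hypothetical detector covering only the German and Luxembourg variants."""
    namespace = root_namespace(source)
    if namespace == GERMAN_NS:
        return "german"
    if namespace == LUXEMBOURG_NS:
        return "luxembourg"
    return "akn"  # fallback: treat as standard Akoma Ntoso


sample = (
    b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13">'
    b"<act/></akomaNtoso>"
)
detected = detect_by_namespace(io.BytesIO(sample))
```

The source argument may be a file path or a binary file object, since ET.iterparse() accepts both.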
- tulit.parser.xml.akomantoso.utils.create_akn_parser(file_path: str | None = None, format: str | None = None) XMLParser
Factory function to create the appropriate Akoma Ntoso parser.
This function uses the registry pattern to instantiate the correct parser based on either explicit format specification or automatic detection.
- Parameters:
- Returns:
Appropriate parser instance for the detected or specified format
- Return type:
XMLParser
- Raises:
ValueError – If neither file_path nor format is provided
Example
>>> # Auto-detect format
>>> parser = create_akn_parser(file_path='document.xml')
>>>
>>> # Explicitly specify format
>>> parser = create_akn_parser(format='german')
- tulit.parser.xml.akomantoso.utils.register_akn_parsers() None
Register all Akoma Ntoso parser variants in the registry.
This function should be called during module initialization to ensure all parser types are available for the factory function.
Helper classes for Akoma Ntoso article and content extraction.
This module provides specialized extractors to reduce duplication across AkomaNtoso parser variants and improve code organization.
- class tulit.parser.xml.akomantoso.extractors.AKNArticleExtractor(namespaces: Dict[str, str], id_attr: str = 'eId')
Bases: object
Extracts article information from Akoma Ntoso documents.
Centralizes common article extraction logic used across different AKN parser variants (standard, AKN4EU, German, Luxembourg).
- __init__(namespaces: Dict[str, str], id_attr: str = 'eId')
Initialize with namespace configuration.
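To show why the extractor takes both a namespace mapping and an identifier attribute name, here is a minimal sketch built on the standard library. The `list_article_ids` helper and the sample document are assumptions for this example and say nothing about how `AKNArticleExtractor` is implemented internally.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Prefix-to-URI mapping used to resolve the 'akn:' prefix in path queries.
AKN_NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}


def list_article_ids(
    root: ET.Element, namespaces: Dict[str, str], id_attr: str = "eId"
) -> List[str]:
    """Collect the identifier attribute of every article element."""
    ids = []
    for article in root.iterfind(".//akn:article", namespaces):
        # Dialects that use a different attribute name pass it as id_attr.
        ids.append(article.get(id_attr, ""))
    return ids


doc = ET.fromstring(
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">'
    '<body><article eId="art_1"/><article eId="art_2"/></body>'
    "</akomaNtoso>"
)
print(list_article_ids(doc, AKN_NS))
```

Keeping both values as constructor arguments lets one extractor serve several variants without subclassing.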
- class tulit.parser.xml.akomantoso.extractors.AKNParseOrchestrator(parser)
Bases:
object
Orchestrates the parsing workflow for Akoma Ntoso documents.
Implements Template Method pattern to reduce parse() method duplication across different AKN parser variants.
- __init__(parser)
Initialize with reference to parser instance.
- Parameters:
parser (AkomaNtosoParser) – The parser instance to orchestrate.
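The idea of one shared parsing workflow can be sketched as a fixed sequence of steps run against whichever parser is supplied, in the spirit of the Template Method pattern. The class names, the `STEPS` tuple, and the `run` method below are hypothetical and are not taken from `AKNParseOrchestrator`.

```python
from typing import List


class RecordingParser:
    """Hypothetical parser whose steps only record that they ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def get_preface(self) -> None:
        self.calls.append("preface")

    def get_articles(self) -> None:
        self.calls.append("articles")


class ParseOrchestrator:
    """Runs a fixed sequence of steps against any parser that offers them."""

    STEPS = ("get_preface", "get_preamble", "get_articles")

    def __init__(self, parser: object) -> None:
        self.parser = parser

    def run(self) -> object:
        for name in self.STEPS:
            step = getattr(self.parser, name, None)
            # Optional steps that a parser does not define are skipped.
            if callable(step):
                step()
        return self.parser


parser = ParseOrchestrator(RecordingParser()).run()
print(parser.calls)
```

Because the order lives in one place, each parser variant only supplies the steps and does not repeat the sequencing in its own parse() method.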
- class tulit.parser.xml.akomantoso.extractors.AKNContentProcessor(namespaces: Dict[str, str])
Bases:
object
Processes complex content structures in Akoma Ntoso documents.
Handles lists, tables, and nested structures common across different AKN document types.
- __init__(namespaces: Dict[str, str])
Initialize with namespace configuration.
- Parameters:
namespaces (dict) – XML namespace mapping for XPath queries.
BOE Parser
- class tulit.parser.xml.boe.BOEXMLParser
Bases:
XMLParser
Parser for BOE XML documents to LegalJSON.
Uses BOEArticleStrategy to extract articles, reducing code duplication and improving maintainability.
- get_articles() None
Extract articles using BOEArticleStrategy.
This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.
- get_chapters() list
Extracts chapter information from the document.
- Parameters:
chapter_xpath (str) – XPath expression to locate the chapter elements.
num_xpath (str) – XPath expression to locate the chapter number within each chapter element.
heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.
- Returns:
Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with keys:
- ‘eId’: Chapter identifier
- ‘chapter_num’: Chapter number
- ‘chapter_heading’: Chapter heading text
- Return type:
None
- get_citations() list
Extracts citations from the preamble.
- Parameters:
- Returns:
Updates the instance’s citations attribute with the found citations.
- Return type:
None
- get_recitals() list
Extracts recitals from the preamble.
- Parameters:
recitals_xpath (str) – XPath expression to locate the recitals section.
recital_xpath (str) – XPath expression to locate individual recitals.
text_xpath (str) – XPath expression to locate the text within each recital.
extract_intro (function, optional) – Function to handle the extraction of the introductory recital.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.
- Returns:
Updates the instance’s recitals attribute with the found recitals.
- Return type:
None
- get_formula() None
Extracts formula text from the preamble.
- Parameters:
- Returns:
Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.
- Return type:
str or None
- get_preamble_final() None
Extracts the final preamble text from the document.
- Parameters:
preamble_final_xpath (str) – XPath expression to locate the final preamble element.
- Returns:
Updates the instance’s preamble_final attribute with the found final preamble text.
- Return type:
None
- get_conclusions() None
Extracts conclusions from the body section.
Override in subclass if format has conclusions. Default implementation does nothing.
- Return type:
None
- parse(file: str, **options) BOEXMLParser
Parse a BOE XML document.
- Parameters:
- Returns:
Self for method chaining
- Return type:
BOEXMLParser
HTML Parsers
Base HTML Parser
- class tulit.parser.html.html_parser.HTMLParser
Bases:
Parser
Abstract base class for HTML parsers.
Provides common HTML parsing utilities and a template parse() method. Subclasses must implement get_preface() and get_articles(). Optional methods like get_preamble(), get_chapters(), etc. can be overridden.
- get_root(file: str) None
Loads an HTML file and parses it with BeautifulSoup.
- Parameters:
file (str) – The path to the HTML file.
- Returns:
The root element is stored in the parser under the ‘root’ attribute.
- Return type:
None
- parse(file: str, **options) HTMLParser
Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.
- Parameters:
- Returns:
Self for method chaining with the parsed elements stored in the attributes.
- Return type:
HTMLParser
Cellar HTML Parsers
- class tulit.parser.html.cellar.cellar.CellarHTMLParser
Bases:
HTMLParser
- get_preface() None
Extracts the preface text from the HTML, if available.
- Returns:
The extracted preface is stored in the ‘preface’ attribute.
- Return type:
None
- get_preamble() None
Extracts the preamble text from the HTML, if available.
- Returns:
The extracted preamble is stored in the ‘preamble’ attribute.
- Return type:
None
- get_formula() None
Extracts the formula from the HTML, if present.
- Returns:
The extracted formula is stored in the ‘formula’ attribute.
- Return type:
None
- get_citations() None
Extracts citations from the HTML.
- Returns:
The extracted citations are stored in the ‘citations’ attribute.
- Return type:
None
- get_recitals() None
Extracts recitals from the HTML.
- Returns:
The extracted recitals are stored in the ‘recitals’ attribute.
- Return type:
None
- get_preamble_final() None
Extracts the final preamble text from the HTML, if available.
- Returns:
The extracted final preamble is stored in the ‘preamble_final’ attribute.
- Return type:
None
- get_body() None
Extracts the body content from the HTML.
- Returns:
The extracted body content is stored in the ‘body’ attribute.
- Return type:
None
- get_articles() None
Extracts articles from the HTML. Each <div> with an id starting with “art” is treated as an article (eId). Subsequent subdivisions are processed based on the closest parent with an id.
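The id-prefix rule described above can be illustrated with a short sketch that treats a well-formed XHTML fragment as XML. The sample markup, the `find_article_divs` helper, and the flattened 'text' value are assumptions for this example; real Cellar XHTML carries an XML namespace and richer structure, and the sketch does not reproduce the handling of subdivisions by closest parent id.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified, namespace-free sample invented for this illustration.
XHTML = (
    "<html><body>"
    '<div id="art_1"><p>Article 1</p>'
    '<div id="001.001"><p>First paragraph.</p></div></div>'
    '<div id="art_2"><p>Article 2</p></div>'
    '<div id="cpt_I"><p>Chapter I</p></div>'
    "</body></html>"
)


def find_article_divs(root: ET.Element) -> List[Dict[str, str]]:
    """Return one entry per <div> whose id starts with 'art'."""
    articles = []
    for div in root.iter("div"):
        div_id = div.get("id", "")
        if div_id.startswith("art"):
            # Join all nested text and collapse runs of whitespace.
            text = " ".join(" ".join(div.itertext()).split())
            articles.append({"eId": div_id, "text": text})
    return articles


root = ET.fromstring(XHTML)
print([article["eId"] for article in find_article_divs(root)])
```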
- parse(file: str, **options) CellarHTMLParser
Parses an XHTML document. If the input is a directory, searches for XHTML files.
- Parameters:
- Returns:
Self for method chaining with extracted content.
- Return type:
CellarHTMLParser
- class tulit.parser.html.cellar.cellar_standard.CellarStandardHTMLParser
Bases:
HTMLParser
Parser for standard HTML format documents from EU Cellar. This format wraps content in <TXT_TE> tags with simple <p> structure, unlike the semantic XHTML format with class-based structure.
- get_preface() None
Extract document title/preface. In standard HTML, this is typically in the metadata or first heading.
- get_preamble() None
Extract preamble content. In standard HTML, the preamble typically includes the decision-making body, references, and recitals.
- get_formula() None
Extract the formula (decision-making body statement). Usually starts with “THE COUNCIL”, “THE COMMISSION”, etc.
- get_citations()
Extract citations (legal references). Usually contains phrases like “Having regard to”.
- get_recitals()
Extract recitals (whereas clauses). Usually starts with “Whereas:” followed by numbered items.
- get_preamble_final()
Extract final preamble statement (e.g., “HAS ADOPTED THIS DECISION:”).
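The text cues mentioned for the formula, citations, recitals, and final preamble statement can be combined into one pass over the paragraphs, as in the sketch below. The `classify_preamble` helper, its exact prefixes, and the sample paragraphs are assumptions for illustration; the parser's own heuristics may be more elaborate.

```python
from typing import Dict, List


def classify_preamble(paragraphs: List[str]) -> Dict[str, List[str]]:
    """Sort preamble paragraphs into groups using simple text cues."""
    groups: Dict[str, List[str]] = {
        "formula": [],
        "citations": [],
        "recitals": [],
        "preamble_final": [],
    }
    in_recitals = False
    for raw in paragraphs:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("Having regard to"):
            groups["citations"].append(text)
        elif text.startswith("Whereas"):
            # The marker itself only opens the recitals block.
            in_recitals = True
        elif text.startswith(("HAS ADOPTED", "HAVE ADOPTED")):
            in_recitals = False
            groups["preamble_final"].append(text)
        elif in_recitals:
            groups["recitals"].append(text)
        elif text.startswith("THE ") and text.isupper():
            groups["formula"].append(text)
    return groups


sample_paragraphs = [
    "THE COUNCIL OF THE EUROPEAN UNION,",
    "Having regard to the Treaty on the Functioning of the European Union,",
    "Whereas:",
    "(1) First recital.",
    "(2) Second recital.",
    "HAS ADOPTED THIS DECISION:",
]
result = classify_preamble(sample_paragraphs)
print({name: len(items) for name, items in result.items()})
```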
- get_body()
The body is the TXT_TE container itself.
- get_chapters()
Extract chapters. In standard HTML, these might be section headings. For most documents, this may not apply.
- get_articles()
Extract articles from the document using CellarStandardArticleStrategy.
This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.
- get_conclusions()
Extract conclusion text (e.g., “Done at Brussels, …”).
- parse(file_path: str, **options) CellarStandardHTMLParser
Parse a standard HTML document and extract all components. If the input is a directory, searches for HTML files.
- Parameters:
- Returns:
Self for method chaining with parsed document.
- Return type:
CellarStandardHTMLParser
- class tulit.parser.html.cellar.proposal.ProposalHTMLParser
Bases:
HTMLParser
Parser for European Commission proposal documents (COM documents).
These documents have a different structure than regular EUR-Lex legislative acts. They typically contain:
- Metadata (institution, date, reference numbers)
- Proposal status and title
- Explanatory Memorandum with sections and subsections
- Sometimes the actual legal act text at the end
- get_metadata() None
Extracts metadata from the Commission proposal HTML.
Metadata includes:
- Institution name (e.g., “EUROPEAN COMMISSION”)
- Emission date and location
- Reference numbers (COM number, interinstitutional reference)
- Proposal status
- Document type
- Title/subject
- Returns:
The extracted metadata is stored in the ‘metadata’ attribute.
- Return type:
None
- get_explanatory_memorandum() None
Extracts the Explanatory Memorandum section from the proposal.
The Explanatory Memorandum typically contains:
- Title (class="Exposdesmotifstitre")
- Sections with headings (class="li ManualHeading1", "li ManualHeading2", etc.)
- Numbered paragraphs (class="li ManualNumPar1")
- Normal text (class="Normal")
- Returns:
The extracted content is stored in the ‘explanatory_memorandum’ attribute.
- Return type:
None
- get_preface() None
For proposals, the preface is the combination of status, document type, and title. This extracts from the SECOND occurrence (the actual legal act), not the first (cover page).
- get_preamble() None
Extracts the preamble of the legal act (not the explanatory memorandum). The preamble appears after the explanatory memorandum and contains:
- Interinstitutional reference
- Status
- Document type
- Title
- Institution acting
- Citations (Having regard to…)
- Recitals (Whereas…)
- Returns:
Sets self.preamble to the preamble element
- Return type:
None
- get_formula() None
Extracts the formula from the preamble (e.g., “THE COUNCIL OF THE EUROPEAN UNION,”).
- Returns:
The extracted formula is stored in the ‘formula’ attribute.
- Return type:
None
- get_citations() None
Extracts citations from the preamble (paragraphs starting with “Having regard to”). Citations appear between the formula and “Whereas:”
- Returns:
The extracted citations are stored in the ‘citations’ attribute.
- Return type:
None
- get_recitals() None
Extracts recitals from the preamble (paragraphs with class “li ManualConsidrant”). Recitals may span multiple content divs.
- Returns:
The extracted recitals are stored in the ‘recitals’ attribute.
- Return type:
None
- get_preamble_final() None
Extracts the final formula of the preamble (e.g., “HAS ADOPTED THIS DECISION:”).
- Returns:
The extracted final preamble is stored in the ‘preamble_final’ attribute.
- Return type:
None
- get_body() None
Extracts the body of the legal act (the enacting terms/articles).
- Returns:
Sets self.body to the body element
- Return type:
None
- get_articles() None
Extracts articles from the body of the legal act.
Note: Due to the complex nested structure of Proposal documents (content divs, list concatenation, nested siblings), the full extraction logic remains in parser helper methods. The strategy pattern provides a consistent interface but delegates to parser-specific methods for the actual complex traversal logic.
- Returns:
The extracted articles are stored in the ‘articles’ attribute.
- Return type:
None
- get_conclusions() None
Extracts conclusions from the legal act (signature section).
- Returns:
The extracted conclusions are stored in the ‘conclusions’ attribute.
- Return type:
None
- parse(file: str) ProposalHTMLParser
Parses a Commission proposal HTML file and extracts all relevant information.
- Parameters:
file (str) – Path to the HTML file to parse.
- Returns:
The parser object with parsed elements stored in attributes.
- Return type:
ProposalHTMLParser
Other HTML Parsers
- class tulit.parser.html.veneto.VenetoHTMLParser
Bases:
HTMLParser
- get_root(file: str) None
Loads an HTML file and parses it with BeautifulSoup.
- Parameters:
file (str) – The path to the HTML file.
- Returns:
The root element is stored in the parser under the ‘root’ attribute.
- Return type:
None
- get_preface() None
Extracts the preface text from the HTML, if available.
- Returns:
The extracted preface is stored in the ‘preface’ attribute.
- Return type:
None
- get_preamble()
Extracts the preamble text from the HTML, if available.
- Returns:
The extracted preamble is stored in the ‘preamble’ attribute.
- Return type:
None
- get_formula()
Extracts the formula from the HTML, if present.
- Returns:
The extracted formula is stored in the ‘formula’ attribute.
- Return type:
None
- get_citations()
Extracts citations from the HTML.
- Returns:
The extracted citations are stored in the ‘citations’ attribute.
- Return type:
None
- get_recitals()
Extracts recitals from the HTML.
- Returns:
The extracted recitals are stored in the ‘recitals’ attribute.
- Return type:
None
- get_preamble_final()
Extracts the final preamble text from the HTML, if available.
- Returns:
The extracted final preamble is stored in the ‘preamble_final’ attribute.
- Return type:
None
- get_body()
Extracts the body content from the HTML.
- Returns:
The extracted body content is stored in the ‘body’ attribute.
- Return type:
None
- get_chapters()
Extracts chapters from the HTML, grouping them by their IDs and headings.
- get_articles()
Extracts articles from the HTML. Each <h6> is treated as an article heading, and the next <div> contains the article content. Subdivisions are separated by <br> tags and stored as children.
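The heading-plus-content layout described above can be sketched on a well-formed fragment with the standard library. The sample markup and the helpers `split_on_br` and `extract_articles` are invented for this illustration, and the sketch assumes the content <div> is the element immediately after each <h6>; the parser itself works on real HTML through BeautifulSoup.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Well-formed sample invented for this illustration.
SNIPPET = (
    "<body>"
    "<h6>Art. 1 - Purpose</h6>"
    "<div>First subdivision.<br/>Second subdivision.<br/>Third subdivision.</div>"
    "<h6>Art. 2 - Scope</h6>"
    "<div>Only subdivision.</div>"
    "</body>"
)


def split_on_br(div: ET.Element) -> List[str]:
    """Split the text of a <div> into pieces separated by <br> elements."""
    parts = [div.text or ""]
    for child in div:
        if child.tag == "br":
            # Text that follows a <br> starts a new subdivision.
            parts.append(child.tail or "")
        else:
            # Text of other inline elements stays in the current subdivision.
            parts[-1] += "".join(child.itertext()) + (child.tail or "")
    return [part.strip() for part in parts if part.strip()]


def extract_articles(body: ET.Element) -> List[Dict[str, object]]:
    """Pair each <h6> heading with the <div> that directly follows it."""
    children = list(body)
    articles = []
    for index, element in enumerate(children):
        if element.tag != "h6":
            continue
        nxt = children[index + 1] if index + 1 < len(children) else None
        content = split_on_br(nxt) if nxt is not None and nxt.tag == "div" else []
        articles.append({"heading": (element.text or "").strip(), "children": content})
    return articles


for article in extract_articles(ET.fromstring(SNIPPET)):
    print(article["heading"], len(article["children"]))
```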
- get_conclusions()
Extracts conclusions from the HTML, if present.
- parse(file)
Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.
- Parameters:
- Returns:
Self for method chaining with the parsed elements stored in the attributes.
- Return type:
Article Extraction Strategies
Article Extraction Strategy Pattern
This module provides a hierarchy of strategies for extracting articles from different document formats (XML, HTML). It eliminates code duplication across parser classes by centralizing common article extraction logic.
Design Pattern: Strategy Pattern
Purpose: Encapsulate article extraction algorithms and make them interchangeable
- class tulit.parser.strategies.article_extraction.ArticleExtractionStrategy
Bases:
ABC
Abstract base class for article extraction strategies.
This defines the interface that all concrete extraction strategies must implement. Each strategy encapsulates a specific algorithm for extracting articles from a particular document format.
- abstract extract_articles(document: Any, **kwargs) List[Dict[str, Any]]
Extract articles from the given document.
- Parameters:
document (Any) – The document to extract articles from (XML Element, HTML BeautifulSoup, etc.)
**kwargs (dict) – Additional parameters specific to the extraction strategy
- Returns:
List of article dictionaries with keys: ‘eId’, ‘num’, ‘heading’, ‘children’
- Return type:
List[Dict[str, Any]]
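A compact sketch of the interface described above: an abstract strategy, one toy concrete strategy, and a parser that delegates to whichever strategy it is given. `ExtractionStrategy`, `LineStrategy`, `MiniParser`, and the `prefix` keyword are hypothetical names for this example; only the method name `extract_articles` and the result keys come from the description.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ExtractionStrategy(ABC):
    """Interface: turn a document into a list of article dictionaries."""

    @abstractmethod
    def extract_articles(self, document: Any, **kwargs) -> List[Dict[str, Any]]:
        ...


class LineStrategy(ExtractionStrategy):
    """Toy strategy: every non-empty line of a string is one article."""

    def extract_articles(self, document: Any, **kwargs) -> List[Dict[str, Any]]:
        prefix = kwargs.get("prefix", "art_")
        lines = [line.strip() for line in str(document).splitlines() if line.strip()]
        return [
            {"eId": f"{prefix}{n}", "num": str(n), "heading": line, "children": []}
            for n, line in enumerate(lines, start=1)
        ]


class MiniParser:
    """Parser that delegates article extraction to an injected strategy."""

    def __init__(self, strategy: ExtractionStrategy) -> None:
        self.strategy = strategy
        self.articles: List[Dict[str, Any]] = []

    def get_articles(self, document: Any) -> None:
        self.articles = self.strategy.extract_articles(document)


mini = MiniParser(LineStrategy())
mini.get_articles("Scope\n\nDefinitions\n")
print([article["eId"] for article in mini.articles])
```

Swapping `LineStrategy` for another subclass changes the extraction algorithm without touching the parser, which is the point of the pattern.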
- class tulit.parser.strategies.article_extraction.XMLArticleExtractionStrategy(namespaces: Dict[str, str] | None = None)
Bases:
ArticleExtractionStrategy
Base strategy for extracting articles from XML documents.
Provides common XML operations like namespace handling, XPath queries, and text extraction.
- class tulit.parser.strategies.article_extraction.HTMLArticleExtractionStrategy(article_pattern: str | None = None)
Bases:
ArticleExtractionStrategy
Base strategy for extracting articles from HTML documents.
Provides common HTML operations like element finding, class matching, and text extraction using BeautifulSoup.
- class tulit.parser.strategies.article_extraction.FormexArticleStrategy(namespaces: Dict[str, str] | None = None)
Bases:
XMLArticleExtractionStrategy
Strategy for extracting articles from Formex XML documents.
Formex uses ARTICLE elements with IDENTIFIER attributes, and content is stored in PARAG, ALINEA, or LIST/ITEM elements.
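To make the element layout concrete, the sketch below walks a heavily simplified fragment and collects PARAG or ALINEA blocks per ARTICLE. The sample document, the `ACT` wrapper, and the `extract_formex_articles` helper are assumptions for this example; real Formex files contain more elements, and the sketch ignores LIST/ITEM content.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified sample invented for this illustration.
FORMEX = (
    "<ACT>"
    '<ARTICLE IDENTIFIER="001">'
    '<PARAG IDENTIFIER="001.001"><ALINEA>First paragraph.</ALINEA></PARAG>'
    '<PARAG IDENTIFIER="001.002"><ALINEA>Second paragraph.</ALINEA></PARAG>'
    "</ARTICLE>"
    '<ARTICLE IDENTIFIER="002"><ALINEA>Single alinea.</ALINEA></ARTICLE>'
    "</ACT>"
)


def extract_formex_articles(root: ET.Element) -> List[Dict[str, object]]:
    """Return one dictionary per ARTICLE with its PARAG or ALINEA children."""
    articles = []
    for article in root.iter("ARTICLE"):
        parags = article.findall("PARAG")
        # Articles without PARAG wrappers hold their ALINEA elements directly.
        blocks = parags if parags else article.findall("ALINEA")
        children = []
        for block in blocks:
            text = " ".join("".join(block.itertext()).split())
            children.append({"eId": block.get("IDENTIFIER"), "text": text})
        articles.append({"eId": article.get("IDENTIFIER"), "children": children})
    return articles


for item in extract_formex_articles(ET.fromstring(FORMEX)):
    print(item["eId"], len(item["children"]))
```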
- class tulit.parser.strategies.article_extraction.BOEArticleStrategy(namespaces: Dict[str, str] | None = None)
Bases:
XMLArticleExtractionStrategy
Strategy for extracting articles from Spanish BOE XML documents.
BOE uses <p class="articulo"> for article titles and <p class="parrafo"> for content paragraphs.
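Because titles and content are sibling paragraphs rather than nested elements, extraction amounts to grouping each content paragraph under the most recent title. The sketch below shows that grouping on an invented fragment; the `texto` wrapper, the generated `art_N` identifiers, and the `group_boe_paragraphs` helper are assumptions and not the behaviour of `BOEArticleStrategy`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Sample invented for this illustration.
BOE = (
    "<texto>"
    '<p class="articulo">Artículo 1. Objeto.</p>'
    '<p class="parrafo">Primer párrafo.</p>'
    '<p class="parrafo">Segundo párrafo.</p>'
    '<p class="articulo">Artículo 2. Ámbito.</p>'
    '<p class="parrafo">Único párrafo.</p>'
    "</texto>"
)


def group_boe_paragraphs(root: ET.Element) -> List[Dict[str, object]]:
    """Attach each content paragraph to the article title that precedes it."""
    articles: List[Dict[str, object]] = []
    for p in root.iter("p"):
        css_class = p.get("class", "")
        text = " ".join("".join(p.itertext()).split())
        if css_class == "articulo":
            # Hypothetical identifier scheme: number articles in document order.
            articles.append(
                {"eId": f"art_{len(articles) + 1}", "heading": text, "children": []}
            )
        elif css_class == "parrafo" and articles:
            articles[-1]["children"].append(text)
    return articles


for entry in group_boe_paragraphs(ET.fromstring(BOE)):
    print(entry["eId"], entry["heading"], len(entry["children"]))
```

Content paragraphs that appear before the first title are dropped in this sketch; a real strategy might route them to the preamble instead.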
- class tulit.parser.strategies.article_extraction.CellarStandardArticleStrategy
Bases:
HTMLArticleExtractionStrategy
Strategy for extracting articles from Cellar HTML documents (standard format).
Cellar documents use specific paragraph patterns to mark article starts and structure content.
- class tulit.parser.strategies.article_extraction.ProposalArticleStrategy
Bases:
HTMLArticleExtractionStrategy
Strategy for extracting articles from EU Proposal HTML documents.
Proposals use <p class="Titrearticle"> for article headers and various paragraph classes for content.