Parsers

This subpackage contains modules for parsing legal documents in various formats into a JSON representation.

The formats currently supported are:

XML Formats:
- Formex 4 (EU legislative documents)
- Akoma Ntoso 3.0 (multiple variants: EU, German LegalDocML, Luxembourg CSD13)
- BOE XML (Spanish Official Gazette)

HTML Formats:
- Cellar XHTML (semantic structure)
- Cellar Standard HTML (simple structure)
- EU Legislative Proposals

Core Parser Architecture

Parser Base Module

This module provides the abstract Parser base class and JSON validation utilities. All concrete parsers should inherit from the Parser class and implement the required abstract methods.

Domain models, exceptions, the parser registry, and normalization strategies are defined in their own focused modules and imported here.

class tulit.parser.parser.Parser

Bases: ABC

Abstract base class for legal document parsers.

All subclasses must implement:
- get_preface()
- get_articles()
- parse()

Optional methods with default implementations:
- get_preamble()
- get_formula()
- get_citations()
- get_recitals()
- get_preamble_final()
- get_body()
- get_chapters()
- get_conclusions()
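The contract described above can be illustrated with a minimal, self-contained sketch. It does not import tulit: SketchParser is a hypothetical stand-in that mirrors only the three abstract methods and one optional hook, and PlainTextParser is an invented toy format in which `file` holds the document text rather than a path.

```python
from abc import ABC, abstractmethod


class SketchParser(ABC):
    """Hypothetical stand-in mirroring the documented Parser contract."""

    def __init__(self) -> None:
        self.preface = None
        self.articles = []

    @abstractmethod
    def get_preface(self): ...

    @abstractmethod
    def get_articles(self) -> None: ...

    @abstractmethod
    def parse(self, file: str, **options): ...

    def get_citations(self) -> list:
        # Optional hook: the default is an empty list, as documented.
        return []


class PlainTextParser(SketchParser):
    """Toy format: the first line is the title, every other line an article."""

    def get_preface(self):
        self.preface = self._lines[0] if self._lines else None
        return self.preface

    def get_articles(self) -> None:
        # Articles are stored on the instance, not returned.
        self.articles = [
            {"eId": f"art_{i}", "text": line, "children": []}
            for i, line in enumerate(self._lines[1:], start=1)
        ]

    def parse(self, file: str, **options):
        # Assumption for the sketch: `file` is the text itself, not a path.
        self._lines = [ln.strip() for ln in file.splitlines() if ln.strip()]
        self.get_preface()
        self.get_articles()
        return self  # returning self enables method chaining


parser = PlainTextParser().parse("Regulation X\nFirst article.\nSecond article.")
```

Because the base class is abstract, instantiating SketchParser directly raises TypeError, while the subclass only has to supply the three required methods and inherits the defaults for everything else.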

root

Root element of the XML or HTML document.

Type:: lxml.etree._Element or bs4.BeautifulSoup

preface

Extracted preface text from the document.

Type:: str or None

preamble

The preamble section of the document.

Type:: lxml.etree.Element or bs4.Tag or None

formula

The formula element extracted from the preamble.

Type:: str or None

citations

List of extracted citations from the preamble.

Type:: list

recitals

List of extracted recitals from the preamble.

Type:: list

preamble_final

The final preamble text extracted from the document.

Type:: str or None

body

The body section of the document.

Type:: lxml.etree.Element or bs4.Tag or None

chapters

List of extracted chapters from the body.

Type:: list

articles

List of extracted articles from the body. Each article is a dictionary with keys:
- ‘eId’: Article identifier
- ‘text’: Article text
- ‘children’: List of child elements of the article

Type:: list

conclusions

Extracted conclusions from the body.

Type:: None or dict

__init__() → None

Initializes the Parser object.


abstract get_preface() → str | None

Extract document preface/title.

MUST be implemented by all subclasses.

Returns:: Document title/preface text
Return type:: str or None

abstract get_articles() → None

Extract articles from document body.

MUST be implemented by all subclasses. Extracts articles and stores them in self.articles as a list of dictionaries.

Returns:: Articles are stored in self.articles attribute
Return type:: None

abstract parse(file: str, **options) → Parser

Parse document and extract all components.

MUST be implemented by all subclasses.

Parameters:

file (str) – Path to document file
**options (dict) – Optional parser-specific configuration options

Returns:

Self (for method chaining)

Return type:

Parser

get_preamble() → Any | None

Extract preamble section.

Override in subclass if format has preamble. Default returns None.

Returns:: Preamble element or None if not present
Return type:: Any or None

get_formula() → str | None

Extract formula (enacting clause).

Override in subclass if format has formula. Default returns None.

Returns:: Formula text or None if not present
Return type:: str or None

get_citations() → list[dict[str, str]]

Extract citations/references.

Override in subclass if format has citations. Default returns empty list.

Returns:: List of citation dictionaries
Return type:: list[dict[str, str]]

get_recitals() → list[dict[str, str]]

Extract recitals (whereas clauses).

Override in subclass if format has recitals. Default returns empty list.

Returns:: List of recital dictionaries
Return type:: list[dict[str, str]]

get_preamble_final() → str | None

Extract final preamble text.

Override in subclass if format has final preamble. Default returns None.

Returns:: Final preamble text or None if not present
Return type:: str or None

get_body() → Any | None

Extract body section.

Override in subclass if needed. Default returns None.

Returns:: Body element or None
Return type:: Any or None

get_chapters() → list[dict[str, Any]]

Extract chapters.

Override in subclass if format has chapters. Default returns empty list.

Returns:: List of chapter dictionaries
Return type:: list[dict[str, Any]]

get_conclusions() → dict[str, Any] | None

Extract conclusions section.

Override in subclass if format has conclusions. Default returns None.

Returns:: Conclusions dictionary or None if not present
Return type:: dict[str, Any] or None

to_dict() → dict[str, Any]

Convert the parser’s extracted data to a dictionary.

This version ensures that common non-JSON-native objects are converted to JSON-serializable forms. It will:
- Call .to_dict() on domain model objects (Citation, Article, etc.) if available.
- Recursively convert lists and dicts.
- Convert BeautifulSoup Tag objects to their text content.
- Convert lxml elements to their concatenated text content.

Returns:: A dictionary containing all extracted elements from the document with JSON-serializable values.
Return type:: dict
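The conversion rules listed for to_dict() could be sketched roughly as follows. This is not the library's implementation: to_jsonable and SketchCitation are hypothetical names, and the standard library's xml.etree.ElementTree stands in for lxml so the example runs without third-party packages (the BeautifulSoup case is omitted for the same reason).

```python
import json
import xml.etree.ElementTree as ET
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Hypothetical helper: recursively convert a value to JSON-friendly data."""
    if hasattr(value, "to_dict"):
        # Domain-model style objects expose their own conversion.
        return to_jsonable(value.to_dict())
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, ET.Element):
        # Stand-in for an lxml element: concatenate all descendant text.
        return "".join(value.itertext())
    return value


class SketchCitation:
    """Stand-in for a domain model object with a to_dict() method."""

    def __init__(self, eId: str, text: str) -> None:
        self.eId, self.text = eId, text

    def to_dict(self) -> dict:
        return {"eId": self.eId, "text": self.text}


body = ET.fromstring("<body><p>Article 1</p><p> applies.</p></body>")
data = to_jsonable(
    {"citations": [SketchCitation("cit_1", "Having regard to ...")], "body": body}
)
serialized = json.dumps(data)  # succeeds because every value is now plain data
```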

class tulit.parser.parser.LegalJSONValidator(schema_path: str | None = None)

Bases: object

Validator for LegalJSON output using the LegalJSON schema.

validate(data: dict[str, Any]) → bool: Validate a LegalJSON object against the LegalJSON schema. Returns True if valid, False otherwise.

Domain Models

Domain Models Module

This module contains domain model classes representing legal document structures. These models provide a clear, type-safe representation of legal documents, independent of the parsing implementation.

class tulit.parser.models.Citation(eId: str, text: str)

Bases: object

Represents a citation in a legal document.

eId: str

text: str

to_dict() → Dict[str, Any]: Convert citation to dictionary format.

class tulit.parser.models.Recital(eId: str, text: str)

Bases: object

Represents a recital (whereas clause) in a legal document.

eId: str

text: str

to_dict() → Dict[str, Any]: Convert recital to dictionary format.

class tulit.parser.models.ArticleChild(eId: str, text: str, amendment: bool | None = None)

Bases: object

Represents a child element of an article (paragraph, point, etc.).

eId

Element identifier

Type:: str

text

Content text

Type:: str

amendment

Whether this is an amendment marker

Type:: bool, optional

eId: str

text: str

amendment: bool | None = None

to_dict() → Dict[str, Any]: Convert article child to dictionary format.

class tulit.parser.models.Article(eId: str, num: str, heading: str | None = None, children: List[ArticleChild] = None)

Bases: object

Represents an article in a legal document.

eId

Article identifier

Type:: str

num

Article number

Type:: str

heading

Article heading/title

Type:: str, optional

children

Child elements (paragraphs, points)

Type:: List[ArticleChild]

eId: str

num: str

heading: str | None = None

children: List[ArticleChild] = None

to_dict() → Dict[str, Any]: Convert article to dictionary format.

class tulit.parser.models.Chapter(eId: str, num: str, heading: str | None = None)

Bases: object

Represents a chapter in a legal document.

eId

Chapter identifier

Type:: str

num

Chapter number

Type:: str

heading

Chapter heading/title

Type:: str, optional

eId: str

num: str

heading: str | None = None

to_dict() → Dict[str, Any]: Convert chapter to dictionary format.
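As an illustration of how such models nest, here is a minimal sketch that assumes the models behave like plain dataclasses and that to_dict() mirrors the field names; neither assumption is confirmed by this reference. The class names are prefixed with Sketch to make clear they are stand-ins, and the sketch uses an empty-list default for children, which differs from the documented default of None.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchArticleChild:
    eId: str
    text: str
    amendment: Optional[bool] = None

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


@dataclass
class SketchArticle:
    eId: str
    num: str
    heading: Optional[str] = None
    # Sketch-only choice: a fresh list per instance instead of None.
    children: List[SketchArticleChild] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() recurses into nested dataclasses, so children are converted too.
        return asdict(self)


article = SketchArticle(
    eId="art_1",
    num="Article 1",
    heading="Subject matter",
    children=[SketchArticleChild(eId="art_1__para_1", text="This Regulation lays down rules.")],
)
article_dict = article.to_dict()
```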

Parser Registry

Parser Registry Module

This module provides a registry pattern for managing parser implementations. It allows for dynamic parser discovery and instantiation based on format types.

class tulit.parser.registry.ParserRegistry

Bases: object

Registry for managing parser implementations.

This class implements the Registry pattern to allow dynamic parser discovery and instantiation. Parsers can be registered with format identifiers and aliases, and then retrieved by format name.

Example

>>> registry = ParserRegistry()
>>> registry.register('xml', XMLParser)
>>> parser = registry.create('xml')

__init__(): Initialize an empty parser registry.

register(format_id: str, parser_class: Type, aliases: List[str] | None = None) → None

Register a parser class for a given format.

Parameters:

format_id (str) – Primary identifier for this parser format
parser_class (Type) – The parser class to register
aliases (List[str], optional) – Alternative names for this format

Raises:

ParserError – If format_id or any alias is already registered

register_factory(format_id: str, factory_func: Callable, aliases: List[str] | None = None) → None

Register a factory function for creating parser instances.

This is useful when parser instantiation requires special logic or when dealing with parser variants.

Parameters:

format_id (str) – Primary identifier for this parser format
factory_func (Callable) – Function that returns a parser instance
aliases (List[str], optional) – Alternative names for this format

create(format_id: str, *args, **kwargs)

Create a parser instance for the given format.

Parameters:

format_id (str) – Format identifier or alias
*args – Arguments to pass to parser constructor
**kwargs – Arguments to pass to parser constructor

Returns:

An instance of the requested parser

Return type:

Parser

Raises:

ParserError – If format_id is not registered

list_formats() → List[str]

List all registered format identifiers.

Returns:: List of format identifiers (not including aliases)
Return type:: List[str]

list_aliases() → Dict[str, str]

Get mapping of aliases to their primary format identifiers.

Returns:: Mapping of alias -> format_id
Return type:: Dict[str, str]

is_registered(format_id: str) → bool

Check if a format or alias is registered.

Parameters:: format_id (str) – Format identifier or alias to check
Returns:: True if format is registered
Return type:: bool
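The registry behaviour described above (primary identifiers, aliases, and an error on duplicates or unknown formats) can be sketched in a few lines. SketchRegistry, SketchRegistryError, and ToyParser are hypothetical names used only for illustration; the real class may store and resolve entries differently.

```python
from typing import Callable, Dict, List, Optional


class SketchRegistryError(Exception):
    """Stand-in for the documented ParserError."""


class SketchRegistry:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable] = {}
        self._aliases: Dict[str, str] = {}

    def register(self, format_id: str, factory: Callable,
                 aliases: Optional[List[str]] = None) -> None:
        # Reject the registration if the identifier or any alias is taken.
        for name in [format_id, *(aliases or [])]:
            if self.is_registered(name):
                raise SketchRegistryError(f"'{name}' is already registered")
        self._factories[format_id] = factory
        for alias in aliases or []:
            self._aliases[alias] = format_id

    def is_registered(self, name: str) -> bool:
        return name in self._factories or name in self._aliases

    def create(self, name: str, *args, **kwargs):
        # Resolve an alias to its primary identifier before the lookup.
        format_id = self._aliases.get(name, name)
        if format_id not in self._factories:
            raise SketchRegistryError(f"Unknown format: '{name}'")
        return self._factories[format_id](*args, **kwargs)

    def list_formats(self) -> List[str]:
        return list(self._factories)


class ToyParser:
    def __init__(self, strict: bool = False) -> None:
        self.strict = strict


registry = SketchRegistry()
registry.register("toy", ToyParser, aliases=["toy-xml"])
parser = registry.create("toy-xml", strict=True)
```

Since a class is itself a callable that returns an instance, the same storage can hold both parser classes and factory functions in this sketch.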

tulit.parser.registry.get_parser_registry() → ParserRegistry

Get the global parser registry instance.

Returns:: The global parser registry
Return type:: ParserRegistry

tulit.parser.registry.register_parser(format_id: str, parser_class: Type = None, factory: Callable = None, aliases: List[str] | None = None) → None

Convenience function to register a parser in the global registry.

Parameters:

format_id (str) – Primary identifier for the parser
parser_class (Type, optional) – Parser class to register
factory (Callable, optional) – Factory function that returns a parser instance
aliases (List[str], optional) – Alternative names for the parser

Example

>>> register_parser('xml', XMLParser, aliases=['xmldoc'])

tulit.parser.registry.get_parser(format_id: str, **kwargs)

Convenience function to get a parser from the global registry.

Parameters:

format_id (str) – Parser format identifier or alias
**kwargs (dict) – Arguments to pass to parser constructor/factory

Returns:

Instantiated parser

Return type:

Parser

Example

>>> parser = get_parser('xml', schema_path='schema.xsd')

Text Normalization

Text Normalization Strategies Module

This module provides text normalization strategies following the Strategy pattern. Different normalization algorithms can be selected at runtime, making parsers more flexible and testable.

class tulit.parser.normalization.TextNormalizationStrategy

Bases: ABC

Abstract base class for text normalization strategies.

The Strategy pattern allows different text cleaning/normalization algorithms to be selected at runtime, making parsers more flexible and testable.

Example

>>> normalizer = WhitespaceNormalizer()
>>> normalizer.normalize("  multiple   spaces  ")
'multiple spaces'

abstract normalize(text: str) → str

Normalize the given text according to the strategy’s rules.

Parameters:: text (str) – Text to normalize
Returns:: Normalized text
Return type:: str

class tulit.parser.normalization.WhitespaceNormalizer(fix_punctuation: bool = True)

Bases: TextNormalizationStrategy

Normalizes whitespace in text.

Removes newlines, tabs, carriage returns
Collapses multiple spaces to single space
Strips leading/trailing whitespace
Optionally fixes spacing before punctuation

__init__(fix_punctuation: bool = True)

Initialize whitespace normalizer.

Parameters:: fix_punctuation (bool, optional) – Whether to remove spaces before punctuation (default: True)

normalize(text: str) → str: Remove and normalize whitespace.
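The four steps listed for WhitespaceNormalizer can be approximated with two regular expressions. This is a sketch under assumptions: normalize_whitespace is a hypothetical function, and the exact set of punctuation marks the library treats is not stated in this reference.

```python
import re


def normalize_whitespace(text: str, fix_punctuation: bool = True) -> str:
    """Hypothetical function mirroring the documented behaviour."""
    # \s+ matches any run of whitespace, including tabs, newlines and
    # carriage returns; each run collapses to a single space.
    text = re.sub(r"\s+", " ", text).strip()
    if fix_punctuation:
        # Assumed punctuation set: drop a space directly before these marks.
        text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text


cleaned = normalize_whitespace("  Article 1 \n\t applies , however ;\r\n not here .  ")
unfixed = normalize_whitespace("  multiple   spaces  ", fix_punctuation=False)
```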

class tulit.parser.normalization.UnicodeNormalizer(unicode_form: str | None = None, replace_nbsp: bool = True)

Bases: TextNormalizationStrategy

Normalizes unicode characters in text.

Replaces non-breaking spaces with regular spaces
Optionally normalizes unicode to a specific form (NFC, NFD, NFKC, NFKD)

__init__(unicode_form: str | None = None, replace_nbsp: bool = True)

Initialize unicode normalizer.

Parameters:

unicode_form (str, optional) – Unicode normalization form (‘NFC’, ‘NFD’, ‘NFKC’, ‘NFKD’)
replace_nbsp (bool, optional) – Whether to replace non-breaking spaces with regular spaces (default: True)

normalize(text: str) → str: Normalize unicode characters.
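Both documented options map onto standard-library operations. The sketch below uses a hypothetical function, normalize_unicode, with the same parameter names as the constructor above; the order in which the real class applies the two steps is an assumption.

```python
import unicodedata
from typing import Optional


def normalize_unicode(text: str, unicode_form: Optional[str] = None,
                      replace_nbsp: bool = True) -> str:
    """Hypothetical function mirroring the documented options."""
    if replace_nbsp:
        # U+00A0 is the non-breaking space.
        text = text.replace("\u00a0", " ")
    if unicode_form is not None:
        # unicode_form must be one of 'NFC', 'NFD', 'NFKC', 'NFKD'.
        text = unicodedata.normalize(unicode_form, text)
    return text


# 'e' followed by a combining acute accent, then a non-breaking space.
decomposed = "Re\u0301glement\u00a0(UE)"
composed = normalize_unicode(decomposed, unicode_form="NFC")
```

NFC merges the base letter and the combining accent into a single code point, so the result is one character shorter than the input.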

class tulit.parser.normalization.PatternReplacementNormalizer(patterns: List[tuple[str, str]])

Bases: TextNormalizationStrategy

Normalizes text using regex pattern replacements.

Useful for removing specific markers, formatting codes, or document-specific artifacts.

__init__(patterns: List[tuple[str, str]])

Initialize pattern replacement normalizer.

Parameters:: patterns (List[tuple[str, str]]) – List of (pattern, replacement) tuples for regex substitution

Example

>>> normalizer = PatternReplacementNormalizer([
...     (r'▼[A-Z]\d*', ''),  # Remove consolidation markers
...     (r'^\(\d+\)', '')     # Remove leading numbers in parentheses
... ])

normalize(text: str) → str: Apply pattern replacements.

class tulit.parser.normalization.CompositeNormalizer(strategies: List[TextNormalizationStrategy])

Bases: TextNormalizationStrategy

Composite strategy that applies multiple normalizers in sequence.

This allows combining different normalization strategies in a specific order to achieve complex text cleaning operations.

Example

>>> normalizer = CompositeNormalizer([
...     UnicodeNormalizer(),
...     WhitespaceNormalizer(),
...     PatternReplacementNormalizer([(r'▼[A-Z]\d*', '')])
... ])
>>> clean_text = normalizer.normalize(raw_text)

__init__(strategies: List[TextNormalizationStrategy])

Initialize composite normalizer.

Parameters:: strategies (List[TextNormalizationStrategy]) – List of normalizers to apply in order

normalize(text: str) → str: Apply all strategies in sequence.
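Applying strategies in sequence amounts to function composition. The sketch below uses plain functions instead of strategy classes to keep it short; compose and the three step functions are hypothetical names, and the marker pattern is the one from the example above.

```python
import re
from typing import Callable, List

Normalizer = Callable[[str], str]


def compose(steps: List[Normalizer]) -> Normalizer:
    """Return a function that applies each step in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


def replace_nbsp(text: str) -> str:
    return text.replace("\u00a0", " ")


def strip_markers(text: str) -> str:
    # Same consolidation-marker pattern as in the example above.
    return re.sub(r"▼[A-Z]\d*", "", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


normalize = compose([replace_nbsp, strip_markers, collapse_whitespace])
result = normalize("▼M1  Article\u00a01\n shall apply")

# Order matters: stripping markers after whitespace cleanup leaves a stray space.
reordered = compose([collapse_whitespace, strip_markers])("▼M1  Article 1")
```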

tulit.parser.normalization.create_standard_normalizer() → CompositeNormalizer

Create a standard text normalizer suitable for most legal documents.

Applies:
1. Unicode normalization (non-breaking spaces)
2. Whitespace normalization (newlines, tabs, multiple spaces)
3. Punctuation spacing fixes

Returns:: Composite normalizer with standard strategies
Return type:: CompositeNormalizer

tulit.parser.normalization.create_html_normalizer() → CompositeNormalizer

Create a normalizer for HTML-based legal documents.

Applies:
1. Pattern removal (consolidation markers)
2. Unicode normalization
3. Whitespace normalization

Returns:: Composite normalizer for HTML documents
Return type:: CompositeNormalizer

tulit.parser.normalization.create_formex_normalizer() → CompositeNormalizer

Create a normalizer for Formex XML documents.

Applies:
1. Pattern removal (leading parentheses numbers)
2. Unicode normalization
3. Whitespace normalization

Returns:: Composite normalizer for Formex documents
Return type:: CompositeNormalizer

Parser Exceptions

Parser Exceptions Module

This module contains all custom exception classes for the parser package. Organizing exceptions in a dedicated module improves maintainability and allows for better exception handling patterns.

exception tulit.parser.exceptions.ParserError

Bases: Exception

Base exception for all parser-related errors.

exception tulit.parser.exceptions.ParseError

Bases: ParserError

Raised when parsing fails due to malformed input.

exception tulit.parser.exceptions.ValidationError

Bases: ParserError

Raised when validation against a schema fails.

exception tulit.parser.exceptions.ExtractionError

Bases: ParserError

Raised when extraction of specific content fails.

exception tulit.parser.exceptions.FileLoadError

Bases: ParserError

Raised when loading a file fails.
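Because every exception derives from ParserError, callers can handle a specific failure first and fall back to the base class. The sketch below defines local stand-ins with the same names and inheritance as documented above so that it runs without the package; load_text and describe_failure are hypothetical helpers.

```python
class ParserError(Exception):
    """Local stand-in for tulit.parser.exceptions.ParserError."""


class FileLoadError(ParserError):
    """Local stand-in for tulit.parser.exceptions.FileLoadError."""


def load_text(path: str) -> str:
    # Hypothetical loader: wrap low-level I/O errors in the package hierarchy.
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        raise FileLoadError(f"Cannot load {path}") from exc


def describe_failure(path: str) -> str:
    try:
        load_text(path)
    except FileLoadError as exc:
        # Most specific handler first.
        return f"load failed: {exc}"
    except ParserError as exc:
        # Catch-all for any other parser-related error.
        return f"parser failed: {exc}"
    return "ok"


message = describe_failure("/nonexistent/dir/missing.xml")
```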

XML Parsers

Base XML Parser

XML Parser Base Module

This module provides the abstract XMLParser base class for XML-based document parsers. All XML parsers should inherit from XMLParser and implement the required abstract methods.

The XMLParser class integrates XML validation, node extraction utilities, and text normalization from the organized helper modules.

class tulit.parser.xml.xml.XMLParser(normalizer: TextNormalizationStrategy | None = None)

Bases: Parser

Abstract base class for XML parsers.

Provides common XML parsing utilities and helper methods. Uses XMLValidator for schema validation and TextNormalizationStrategy for text processing.

Subclasses must implement get_preface(), get_articles(), and parse() or use the provided parse() template method by overriding component methods.

valid

Indicates whether the XML file is valid against the schema.

Type:: bool or None

format

The format of the XML file (e.g., ‘Akoma Ntoso’, ‘Formex 4’).

Type:: str or None

validation_errors

Validation errors if the XML file is invalid.

Type:: lxml.etree._LogEntry or None

namespaces

Dictionary containing XML namespaces.

Type:: dict

normalizer

Strategy for text normalization operations.

Type:: TextNormalizationStrategy

__init__(normalizer: TextNormalizationStrategy | None = None) → None

Initializes the Parser object with default attributes.

Parameters:: normalizer (TextNormalizationStrategy, optional) – Text normalization strategy to use. Defaults to standard normalizer.

property namespaces: dict[str, str]: Get the XML namespaces dictionary.

load_schema(schema: str) → None

Load an XSD schema for XML validation.

Delegates to XMLValidator for actual schema loading.

Parameters:: schema (str) – Filename of the XSD schema file
Return type:: None

validate(file: str, format: str) → bool

Validate an XML file against the loaded schema.

Delegates to XMLValidator for actual validation.

Parameters:

file (str) – Path to the XML file to validate
format (str) – Name of the format for logging (e.g., ‘Akoma Ntoso’, ‘Formex 4’)

Returns:

True if valid, False otherwise. Also updates self.valid attribute.

Return type:

bool

remove_node(tree, node)

Removes specified nodes from the XML tree while preserving their tail text.

Delegates to XMLNodeExtractor for node removal.

Parameters:

tree (lxml.etree._Element) – The XML tree or subtree to process.
node (str) – XPath expression identifying the nodes to remove.

Returns:

The modified XML tree with specified nodes removed.

Return type:

lxml.etree._Element
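Preserving tail text matters because, in the ElementTree data model, the text that follows an element is stored on that element and would vanish with it. The sketch below shows the idea with the standard library rather than lxml, and matches nodes by tag name instead of an XPath expression; remove_preserving_tail is a hypothetical helper, not the library's implementation.

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(tree: ET.Element, tag: str) -> ET.Element:
    """Remove every descendant with the given tag, keeping the text after it."""
    for parent in list(tree.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if previous is not None:
                    # Attach the tail to the preceding kept sibling.
                    previous.tail = (previous.tail or "") + tail
                else:
                    # No kept sibling before it: append to the parent's text.
                    parent.text = (parent.text or "") + tail
                parent.remove(child)
            else:
                previous = child
    return tree


p = ET.fromstring(
    "<P>The Council<NOTE>OJ L 1</NOTE> has adopted<NOTE>x</NOTE> this act.</P>"
)
remove_preserving_tail(p, "NOTE")
text = "".join(p.itertext())
```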

get_root(file: str | None = None)

Parses an XML file and returns its root element using secure parser settings.

Parameters:: file (str, optional) – Path to the XML file. If not provided, uses the file path from parse()
Return type:: None
Raises:: FileLoadError – If file cannot be loaded or parsed

get_preface(preface_xpath, paragraph_xpath) → None

Extracts paragraphs from the preface section of the document.

Parameters:

preface_xpath (str) – XPath expression to locate the preface element.
paragraph_xpath (str) – XPath expression to locate the paragraphs within the preface.

Returns:

Updates the instance’s preface attribute with the found preface element.

Return type:

None

get_preamble(preamble_xpath, notes_xpath) → None

Extracts the preamble section from the document.

Parameters:

preamble_xpath (str) – XPath expression to locate the preamble element.
notes_xpath (str) – XPath expression to locate notes within the preamble.

Returns:

Updates the instance’s preamble attribute with the found preamble element

Return type:

None

get_formula(formula_xpath: str, paragraph_xpath: str) → str

Extracts formula text from the preamble.

Parameters:

formula_xpath (str) – XPath expression to locate the formula element.
paragraph_xpath (str) – XPath expression to locate the paragraphs within the formula.

Returns:

Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.

Return type:

str or None

get_citations(citations_xpath, citation_xpath, extract_eId=None)

Extracts citations from the preamble.

Parameters:

citations_xpath (str) – XPath to locate the citations section.
citation_xpath (str) – XPath to locate individual citations.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s citations attribute with the found citations.

Return type:

None

get_recitals(recitals_xpath, recital_xpath, text_xpath, extract_intro=None, extract_eId=None)

Extracts recitals from the preamble.

Parameters:

recitals_xpath (str) – XPath expression to locate the recitals section.
recital_xpath (str) – XPath expression to locate individual recitals.
text_xpath (str) – XPath expression to locate the text within each recital.
extract_intro (function, optional) – Function to handle the extraction of the introductory recital.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s recitals attribute with the found recitals.

Return type:

None

get_preamble_final(preamble_final_xpath) → str

Extracts the final preamble text from the document.

Parameters:: preamble_final_xpath (str) – XPath expression to locate the final preamble element.
Returns:: Updates the instance’s preamble_final attribute with the found final preamble text.
Return type:: None

get_body(body_xpath) → None

Extracts the body element from the document.

Parameters:: body_xpath (str) – XPath expression to locate the body element. For Akoma Ntoso, this is usually ‘.//akn:body’, while for Formex it is ‘.//ENACTING.TERMS’.
Returns:: Updates the instance’s body attribute with the found body element.
Return type:: None

get_chapters(chapter_xpath: str, num_xpath: str, heading_xpath: str, extract_eId=None, get_headings=None) → None

Extracts chapter information from the document.

Parameters:

chapter_xpath (str) – XPath expression to locate the chapter elements.
num_xpath (str) – XPath expression to locate the chapter number within each chapter element.
heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with keys:
- ‘eId’: Chapter identifier
- ‘chapter_num’: Chapter number
- ‘chapter_heading’: Chapter heading text

Return type:

None

abstract get_articles() → None

Extracts articles from the body section.

MUST be implemented by all XML parser subclasses. Subclasses should extract articles according to their specific XML format and store them in self.articles.

Returns:: Articles are stored in self.articles attribute
Return type:: None

get_conclusions()

Extracts conclusions from the body section.

Override in subclass if format has conclusions. Default implementation does nothing.

Return type:: None

parse(file: str, **options) → XMLParser

Template method that orchestrates the entire parsing workflow.

DO NOT OVERRIDE THIS METHOD. Instead, override individual component extraction methods like get_preface(), get_articles(), etc.

Parameters:

file (str) – Path to the XML file to parse.
**options (dict) – Optional configuration:
- schema (str): Path to the XSD schema file
- format (str): Format of the XML file (e.g., ‘Akoma Ntoso’, ‘Formex 4’)

Returns:

Self for method chaining with the parsed data stored in its attributes.

Return type:

XMLParser
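The template-method arrangement can be illustrated with a small skeleton: parse() fixes the sequence of steps, and subclasses customise behaviour by overriding individual steps. The classes below are hypothetical, and the step list and its order are assumptions; the real parse() may call more methods in a different sequence.

```python
class SketchXMLParser:
    """Hypothetical skeleton; the real step list and order may differ."""

    def parse(self, file: str, **options):
        self.file = file
        self.log = []
        # The template fixes the workflow; each step may be overridden.
        for step in (self.get_preface, self.get_body, self.get_articles):
            step()
        return self  # self is returned for method chaining

    def get_preface(self) -> None:
        self.log.append("preface")

    def get_body(self) -> None:
        self.log.append("body")

    def get_articles(self) -> None:
        self.log.append("articles")


class CustomParser(SketchXMLParser):
    # Only the component method changes; parse() is inherited untouched.
    def get_articles(self) -> None:
        self.log.append("custom articles")


log = CustomParser().parse("doc.xml").log
```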

XML Helpers

XML Helper Utilities Module

This module provides utility classes for common XML operations including XPath-based extraction, validation, and node manipulation. These utilities reduce code duplication across XML-based parsers.

class tulit.parser.xml.helpers.XMLNodeExtractor(namespaces: dict[str, str] | None = None)

Bases: object

Utility class for XPath-based XML node extraction and manipulation.

This class encapsulates common XPath operations and text extraction patterns, reducing duplication and complexity in XML parsers.

namespaces

Dictionary of XML namespaces for XPath queries

Type:: dict

Example

>>> extractor = XMLNodeExtractor({'akn': 'http://...'})
>>> node = extractor.find(root, './/akn:article')
>>> text = extractor.extract_text(node)

__init__(namespaces: dict[str, str] | None = None)

Initialize the node extractor.

Parameters:: namespaces (dict, optional) – Dictionary of namespace prefixes to URIs

find(element: _Element, xpath: str) → _Element | None

Find the first element matching the XPath expression.

Parameters:

element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression

Returns:

First matching element or None

Return type:

lxml.etree._Element or None

findall(element: _Element, xpath: str) → List[_Element]

Find all elements matching the XPath expression.

Parameters:

element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression

Returns:

List of matching elements

Return type:

list[lxml.etree._Element]

extract_text(element: _Element, strip: bool = True) → str

Extract all text content from an element and its descendants.

Parameters:

element (lxml.etree._Element) – Element to extract text from
strip (bool, optional) – Whether to strip whitespace (default: True)

Returns:

Concatenated text content

Return type:

str

extract_text_from_all(parent: _Element, xpath: str, strip: bool = True) → List[str]

Extract text from all elements matching the XPath.

Parameters:

parent (lxml.etree._Element) – Parent element to search from
xpath (str) – XPath expression
strip (bool, optional) – Whether to strip whitespace (default: True)

Returns:

List of extracted text strings

Return type:

list[str]

safe_find(element: _Element, xpath: str, default: _Element | None = None) → _Element | None

Safely find an element, returning default if not found.

Parameters:

element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression
default (lxml.etree._Element, optional) – Value to return if not found

Returns:

Found element or default value

Return type:

lxml.etree._Element or default

safe_find_text(element: _Element, xpath: str, default: str = '') → str

Safely find an element and extract its text.

Parameters:

element (lxml.etree._Element) – Root element to search from
xpath (str) – XPath expression
default (str, optional) – Value to return if not found

Returns:

Extracted text or default value

Return type:

str
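The "safe" lookups combine a namespace-aware search with a fallback value. A rough equivalent using the standard library instead of lxml is sketched below; the free function safe_find_text, the namespace URI urn:example:doc, and the element names are all invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical namespace mapping, used only in this sketch.
NS = {"d": "urn:example:doc"}


def safe_find_text(element: ET.Element, path: str, default: str = "",
                   namespaces: Optional[Dict[str, str]] = None) -> str:
    """Return the stripped text of the first match, or `default` if none."""
    node = element.find(path, namespaces)
    if node is None:
        return default
    # itertext() yields the text of the node and all of its descendants.
    return "".join(node.itertext()).strip()


doc = ET.fromstring(
    '<doc xmlns="urn:example:doc">'
    "<article><heading> Scope <i>and</i> aims </heading></article>"
    "</doc>"
)
heading = safe_find_text(doc, ".//d:heading", namespaces=NS)
missing = safe_find_text(doc, ".//d:num", default="n/a", namespaces=NS)
```

Returning a default instead of None lets callers concatenate or compare the result without an extra check for a missing node.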

remove_nodes(tree: _Element, xpath: str, preserve_tail: bool = True) → _Element

Remove nodes matching XPath, optionally preserving tail text.

Parameters:

tree (lxml.etree._Element) – Tree to modify
xpath (str) – XPath expression for nodes to remove
preserve_tail (bool, optional) – Whether to preserve tail text (default: True)

Returns:

Modified tree

Return type:

lxml.etree._Element

class tulit.parser.xml.helpers.XMLValidator

Bases: object

Handles XML schema loading and validation.

This class provides robust schema validation with proper error handling and logging. It supports both XSD and RelaxNG schemas.

Example

>>> validator = XMLValidator()
>>> validator.load_schema('schema.xsd')
>>> is_valid = validator.validate(xml_root)

__init__(): Initialize the XML validator.

load_schema(schema_path: str, schema_type: str = 'xsd') → bool

Load an XML schema file.

Parameters:

schema_path (str) – Path to the schema file
schema_type (str, optional) – Type of schema (‘xsd’ or ‘relaxng’), default: ‘xsd’

Returns:

True if schema loaded successfully

Return type:

bool

validate(xml_tree: _Element) → bool

Validate an XML tree against the loaded schema.

Parameters:: xml_tree (lxml.etree._Element) – XML tree to validate
Returns:: True if validation succeeds
Return type:: bool

get_validation_errors() → List[str]

Get list of validation error messages.

Returns:: List of error messages from last validation
Return type:: list[str]

Formex Parser

class tulit.parser.xml.formex.Formex4Parser

Bases: XMLParser

A parser for processing and extracting content from Formex XML files.

The parser handles XML documents following the Formex schema for legal documents. It inherits from the XMLParser class and provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.

__init__() → None: Initializes the Formex4Parser object with the Formex namespace.

get_preface() → None: Extracts the preface from the document. It is assumed that the preface is contained within the TITLE and P elements.

get_preamble() → None: Extracts the preamble from the document. It is assumed that the preamble is contained within the PREAMBLE element, while notes are contained within the NOTE elements.

get_formula() → None

Extracts the formula from the preamble. The formula is assumed to be contained within the PREAMBLE.INIT element.

Returns:: Formula text from the preamble.
Return type:: str

get_citations() → None

Extracts citations from the preamble. Citations are assumed to be contained within the GR.VISA and VISA elements. The citation identifier is set as the index of the citation in the preamble.

Returns:: List of dictionaries containing citation data with keys:
- ‘eId’: Citation identifier, which is the index of the citation in the preamble
- ‘text’: Citation text
Return type:: list

get_recitals() → None

Extracts recitals from the preamble. Recitals are assumed to be contained within the GR.CONSID and CONSID elements. The introductory recital is extracted separately. The recital identifier is set as the index of the recital in the preamble.

Returns:: List of dictionaries containing recital text and eId for each recital. Returns None if no recitals are found.
Return type:: list or None

get_preamble_final() → None: Extracts the final preamble text from the document. The final preamble text is assumed to be contained within the PREAMBLE.FINAL element.

get_body() → None: Extracts the body section from the document. The body is assumed to be contained within the ENACTING.TERMS element.

get_chapters() → None

Extracts chapter information from the document. Chapter numbers and headings are assumed to be contained within the TITLE element. The chapter identifier is set as the index of the chapter in the document.

Returns:: List of dictionaries containing chapter data with keys:
- ‘eId’: Chapter identifier
- ‘chapter_num’: Chapter number
- ‘chapter_heading’: Chapter heading text
Return type:: list

get_articles() → None

Extracts articles from the ENACTING.TERMS section using FormexArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

Returns:: Articles with identifier and content.
Return type:: list

get_conclusions() → None

Extracts conclusions from the document. The conclusion text is assumed to be contained within the FINAL section of the document. The signature details are assumed to be contained within the SIGNATURE element.

Returns:: Dictionary containing the conclusion text and signature details.
Return type:: dict

clean_text(element: _Element) → str

parse(file: str, **options) → Formex4Parser

Parses a Formex XML document and extracts its components using the workflow inherited from the XMLParser class. If the input is a directory, searches it for the correct XML file (one containing ACT or DECISION tags).

Parameters:

file (str) – Path to the FORMEX XML file or directory containing FORMEX files.
**options (dict) – Optional configuration options (passed to parent XMLParser)

Returns:

Self for method chaining with parsed data.

Return type:

Formex4Parser
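
The directory handling described for parse() might look roughly like the sketch below. The helper name find_formex_file, the alphabetical search order, and the rule of taking the first match are assumptions made for illustration; the text only states that the chosen file contains ACT or DECISION tags.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_formex_file(path):
    """Return path if it is a file; otherwise the first XML file in the
    directory whose tree contains an ACT or DECISION element."""
    if os.path.isfile(path):
        return path
    for name in sorted(os.listdir(path)):
        if not name.lower().endswith(".xml"):
            continue
        candidate = os.path.join(path, name)
        try:
            root = ET.parse(candidate).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if any(el.tag in ("ACT", "DECISION") for el in root.iter()):
            return candidate
    return None


# Usage with a throwaway directory holding one metadata file and one act.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a_doc.xml"), "w", encoding="utf-8") as fh:
        fh.write("<DOC><BIB.INSTANCE/></DOC>")
    with open(os.path.join(tmp, "b_act.xml"), "w", encoding="utf-8") as fh:
        fh.write("<ACT><TITLE/></ACT>")
    chosen = os.path.basename(find_formex_file(tmp))
```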

Akoma Ntoso Parsers

Akoma Ntoso Base Parser

This module provides the base AkomaNtosoParser class for processing legal documents in the Akoma Ntoso 3.0 format. All variant parsers (AKN4EU, German LegalDocML, Luxembourg) inherit from this base class.

class tulit.parser.xml.akomantoso.base.AkomaNtosoParser

Bases: XMLParser

Base parser for processing Akoma Ntoso 3.0 legal documents.

The parser handles XML documents following the Akoma Ntoso 3.0 schema for legal documents. It provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.

namespaces

Dictionary mapping namespace prefixes to their URIs.

Type:: dict

Example

>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> articles = parser.get_articles()

__init__() → None: Initialize the Akoma Ntoso parser with standard namespaces.

get_preface() → None

Extract preface information from the document.

The preface is contained within the ‘preface’ element in the XML file.

get_preamble() → None

Extract preamble information from the document.

The preamble is contained within the ‘preamble’ element in the XML file.

get_formula() → None

Extract formula from the preamble.

The formula is contained within the ‘formula’ element in the XML file. The formula text is extracted from all paragraphs within the formula element.

Returns:: Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.
Return type:: str or None

get_citations() → None

Extract citations from the preamble.

The citations are contained within the ‘citations’ element. Each citation is extracted from the ‘citation’ element, with text from all paragraphs.

get_recitals() → None

Extract recitals from the preamble.

Recitals are contained within the ‘recitals’ element. Each recital is extracted from the ‘recital’ element, with text from all paragraphs.

get_preamble_final() → None

Extract the final part of the preamble.

This is typically the text after citations and recitals, contained in the ‘preamble.final’ block.

get_body() → None

Extract the body section from the document.

The body contains the main content including articles, chapters, etc.

get_chapters() → None

Extract chapters from the body.

Chapters structure the main content and may contain articles.

extract_eId(element: _Element, index: int | None = None) → str

Extract the element ID (eId) from an XML element.

The standard Akoma Ntoso format uses ‘eId’ attribute for element identification. Subclasses may override this for format-specific ID extraction.

Parameters:

element (lxml.etree._Element) – XML element to extract ID from
index (int, optional) – Index to use if no ID attribute is found

Returns:

The element ID, or formatted index if no ID found

Return type:

str
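
A minimal sketch of the ID lookup with an index fallback, assuming a fallback label of the form 'element_N'. The actual formatting of the index is not documented here, so both the label format and the function name sketch_extract_eid are hypothetical.

```python
import xml.etree.ElementTree as ET


def sketch_extract_eid(element, index=None, id_attr="eId"):
    """Return the ID attribute if present, else a label built from index."""
    value = element.get(id_attr)
    if value is not None:
        return value
    # Assumed fallback format; the real formatting is not documented here.
    return f"element_{index}" if index is not None else ""


with_id = ET.fromstring('<article eId="art_1"/>')
without_id = ET.fromstring("<article/>")
found = sketch_extract_eid(with_id)
fallback = sketch_extract_eid(without_id, index=3)
```

Passing id_attr as a parameter mirrors how the variant parsers differ only in which attribute they read.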

get_articles() → None

Extract articles from the body using AKNArticleExtractor.

Articles are the main structural units of legal documents. This method uses AKNArticleExtractor to handle the extraction logic. Also handles sections for jurisdictions that use sections instead of articles.

get_conclusions() → None

Extract conclusions from the document.

Conclusions contain closing text and signatures.

parse(file: str, **options) → AkomaNtosoParser

Parse an Akoma Ntoso document to extract all components.

This method validates the document against the Akoma Ntoso 3.0 schema and extracts all content using the orchestrator pattern.

Parameters:

file (str) – Path to the Akoma Ntoso XML file to parse
**options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

AkomaNtosoParser

Example

>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> print(len(parser.articles))

AKN4EU Parser

This module provides the AKN4EU parser for European Union legal documents using the Akoma Ntoso for EU (AKN4EU) format.

class tulit.parser.xml.akomantoso.akn4eu.AKN4EUParser

Bases: AkomaNtosoParser

Parser for AKN4EU (Akoma Ntoso for European Union) documents.

This parser handles EU legal documents that use the AKN4EU variant of Akoma Ntoso, which includes EU-specific extensions and conventions.

Key Differences from Standard Akoma Ntoso:
- Uses XML ‘id’ attribute instead of ‘eId’ for element identification
- Follows EU-specific document structure conventions

Example

>>> parser = AKN4EUParser()
>>> parser.parse('eu_regulation.xml')
>>> print(parser.preface)

__init__() → None: Initialize the AKN4EU parser.

extract_eId(element: _Element, index: int | None = None) → str

Extract element ID from XML ‘id’ attribute (AKN4EU convention).

AKN4EU documents use the ‘xml:id’ attribute (the ‘id’ attribute in the XML namespace) instead of the ‘eId’ attribute.

Parameters:

element (lxml.etree._Element) – XML element to extract ID from
index (int, optional) – Index to use if no ID attribute is found

Returns:

The element ID from xml:id attribute, or formatted index if not found

Return type:

str
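
For readers unfamiliar with namespaced attributes, the sketch below shows how an xml:id attribute is addressed, here with the standard library's ElementTree (lxml uses the same qualified-name spelling). It follows the 'xml:id' wording of the Returns description; the sample element is made up.

```python
import xml.etree.ElementTree as ET

# Qualified name under which parsers expose the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

element = ET.fromstring('<article xml:id="art_1"/>')
akn4eu_id = element.get(XML_ID)
plain_id = element.get("id")  # None: xml:id is distinct from a bare 'id'
```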

German LegalDocML Parser

This module provides the parser for German LegalDocML documents, which follow the Akoma Ntoso structure but use a German-specific namespace.

class tulit.parser.xml.akomantoso.german.GermanLegalDocMLParser

Bases: AkomaNtosoParser

Parser for German LegalDocML documents.

This parser handles German legal documents that follow the Akoma Ntoso structure but use the German RIS (Rechtsinformationssystem) namespace.

German LegalDocML Namespace: http://Inhaltsdaten.LegalDocML.de/1.8.2/

Key Differences from Standard Akoma Ntoso:
- Uses a German-specific namespace while maintaining the AKN structure
- Schema validation is skipped (German-specific schema variations)
- XPath queries written for the base parser work unchanged because the namespace prefix is remapped to the German namespace

Example

>>> parser = GermanLegalDocMLParser()
>>> parser.parse('german_law.xml')
>>> print(parser.articles)
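
The namespace remapping mentioned among the key differences can be illustrated with a small sketch. It assumes the queries use a prefix named 'akn' (the prefix name is not given in this section) and uses the standard library's ElementTree in place of lxml; the sample document is invented.

```python
import xml.etree.ElementTree as ET

GERMAN_NS = "http://Inhaltsdaten.LegalDocML.de/1.8.2/"
# Bind the prefix used in queries to the German namespace URI.
namespaces = {"akn": GERMAN_NS}

SAMPLE = f"""
<akomaNtoso xmlns="{GERMAN_NS}">
  <act><body>
    <article eId="art_1"><num>1</num></article>
    <article eId="art_2"><num>2</num></article>
  </body></act>
</akomaNtoso>
"""

root = ET.fromstring(SAMPLE)
# The same query string would work against standard Akoma Ntoso if
# 'akn' were bound to the OASIS namespace instead.
article_ids = [a.get("eId") for a in root.findall(".//akn:article", namespaces)]
```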

__init__() → None: Initialize the German LegalDocML parser with German namespace.

parse(file: str, **options) → GermanLegalDocMLParser

Parse a German LegalDocML document to extract its components.

German LegalDocML follows Akoma Ntoso structure but uses a German-specific namespace and may have schema variations. This method bypasses schema validation and directly extracts the content.

Parameters:

file (str) – Path to the German LegalDocML XML file to parse
**options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

GermanLegalDocMLParser

Example

>>> parser = GermanLegalDocMLParser()
>>> parser.parse('bgb.xml')

Luxembourg Akoma Ntoso Parser

This module provides the parser for Luxembourg legal documents using the Committee Specification Draft 13 (CSD13) variant of Akoma Ntoso 3.0.

class tulit.parser.xml.akomantoso.luxembourg.LuxembourgAKNParser

Bases: AkomaNtosoParser

Parser for Luxembourg Akoma Ntoso documents (CSD13 variant).

This parser handles Luxembourg Legilux documents which use the Committee Specification Draft 13 (CSD13) namespace variant of Akoma Ntoso 3.0.

Luxembourg Namespace: http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13

Key Differences from Standard Akoma Ntoso:
- Uses CSD13 namespace variant
- Uses ‘id’ attribute instead of ‘eId’ for element identification
- Content is nested in <alinea><content><p> structure
- Includes Luxembourg-specific metadata namespace (http://www.scl.lu)

Example

>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_law.xml')
>>> print(parser.articles)

__init__() → None: Initialize the Luxembourg parser with CSD13 namespace.

extract_eId(element: _Element, index: int | None = None) → str

Extract element ID from ‘id’ attribute (Luxembourg convention).

Luxembourg documents use the ‘id’ attribute instead of ‘eId’ for element identification.

Parameters:

element (lxml.etree._Element) – XML element to extract ID from
index (int, optional) – Index to use if no ID attribute is found

Returns:

The ID value from the ‘id’ attribute, or formatted index if not found

Return type:

str

parse(file: str, **options) → LuxembourgAKNParser

Parse a Luxembourg Akoma Ntoso document to extract its components.

Luxembourg documents use the CSD13 variant and may have specific structural differences. This method bypasses schema validation and uses the orchestrator for content extraction.

Parameters:

file (str) – Path to the Luxembourg Akoma Ntoso XML file to parse
**options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

LuxembourgAKNParser

Example

>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_code.xml')

get_articles() → None

Extract articles from the body using AKNArticleExtractor with ‘id’ attribute.

Luxembourg documents use ‘id’ instead of ‘eId’ for element identification.

Akoma Ntoso Utility Functions

This module provides utility functions for detecting Akoma Ntoso formats and creating appropriate parser instances.

tulit.parser.xml.akomantoso.utils.detect_akn_format(file_path: str) → str

Automatically detect the Akoma Ntoso format/dialect based on the XML namespace.

This function examines the root element’s namespace to determine which variant of Akoma Ntoso is being used (standard, German LegalDocML, Luxembourg CSD13, or AKN4EU).

Parameters:: file_path (str) – Path to the XML file
Returns:: Format identifier: ‘german’, ‘akn4eu’, ‘luxembourg’, or ‘akn’ (standard)
Return type:: str

Example

>>> format_type = detect_akn_format('document.xml')
>>> print(format_type)
'akn4eu'
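
Namespace-based detection could be sketched as below. This is an illustration under assumptions: it takes the XML text rather than a file path, it only distinguishes the two namespaces quoted in this document (German and Luxembourg), and it treats everything else as standard 'akn'. How AKN4EU is recognised is not described here, so that branch is deliberately left out.

```python
import xml.etree.ElementTree as ET

GERMAN_NS = "http://Inhaltsdaten.LegalDocML.de/1.8.2/"
LUXEMBOURG_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"


def sketch_detect_format(xml_text):
    """Classify a document by the namespace of its root element."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as '{uri}localname'.
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if namespace == GERMAN_NS:
        return "german"
    if namespace == LUXEMBOURG_NS:
        return "luxembourg"
    return "akn"  # fallback; AKN4EU detection is not covered by this sketch


german = sketch_detect_format(f'<akomaNtoso xmlns="{GERMAN_NS}"/>')
lux = sketch_detect_format(f'<akomaNtoso xmlns="{LUXEMBOURG_NS}"/>')
other = sketch_detect_format(
    '<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"/>'
)
```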

tulit.parser.xml.akomantoso.utils.create_akn_parser(file_path: str | None = None, format: str | None = None) → XMLParser

Factory function to create the appropriate Akoma Ntoso parser.

This function uses the registry pattern to instantiate the correct parser based on either explicit format specification or automatic detection.

Parameters:

file_path (str, optional) – Path to the XML file (required for auto-detection)
format (str, optional) – Explicitly specify format: ‘german’, ‘akn4eu’, ‘luxembourg’, or ‘akn’. If not provided, the format is auto-detected from file_path.

Returns:

Appropriate parser instance for the detected or specified format

Return type:

XMLParser

Raises:

ValueError – If neither file_path nor format is provided

Example

>>> # Auto-detect format
>>> parser = create_akn_parser(file_path='document.xml')
>>>
>>> # Explicitly specify format
>>> parser = create_akn_parser(format='german')
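
The registry-backed factory can be pictured with the following sketch. The stub classes, the module-level _REGISTRY dict, the register helper, and the detect callback are all hypothetical stand-ins; only the overall behaviour (explicit format wins, otherwise detect from the path, ValueError when neither is given) comes from the description above.

```python
class StandardParser:
    """Stand-in for the standard Akoma Ntoso parser."""


class GermanParser:
    """Stand-in for the German LegalDocML parser."""


_REGISTRY = {}


def register(name, parser_cls):
    """Make parser_cls available to the factory under name."""
    _REGISTRY[name] = parser_cls


def sketch_create_parser(file_path=None, format=None, detect=None):
    """Instantiate the parser registered for format, detecting it if needed."""
    if file_path is None and format is None:
        raise ValueError("Either file_path or format must be provided")
    if format is None:
        # detect is a hypothetical callable mapping a path to a format name.
        format = detect(file_path)
    try:
        return _REGISTRY[format]()
    except KeyError:
        raise ValueError(f"Unknown format: {format!r}") from None


register("akn", StandardParser)
register("german", GermanParser)

explicit = sketch_create_parser(format="german")
detected = sketch_create_parser(file_path="document.xml", detect=lambda _: "akn")
```

Keeping the mapping in a registry means a new variant only needs one register call rather than a change to the factory itself.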

tulit.parser.xml.akomantoso.utils.register_akn_parsers() → None

Register all Akoma Ntoso parser variants in the registry.

This function should be called during module initialization to ensure all parser types are available for the factory function.

Helper classes for Akoma Ntoso article and content extraction.

This module provides specialized extractors to reduce duplication across AkomaNtoso parser variants and improve code organization.

class tulit.parser.xml.akomantoso.extractors.AKNArticleExtractor(namespaces: Dict[str, str], id_attr: str = 'eId')

Bases: object

Extracts article information from Akoma Ntoso documents.

Centralizes common article extraction logic used across different AKN parser variants (standard, AKN4EU, German, Luxembourg).

__init__(namespaces: Dict[str, str], id_attr: str = 'eId')

Initialize with namespace configuration.

Parameters:

namespaces (dict) – XML namespace mapping for XPath queries.
id_attr (str) – The attribute name used for element IDs (default ‘eId’).

extract_article_metadata(article: _Element) → Dict[str, str | None]

Extract basic article metadata (eId, num, heading).

Parameters:: article (etree._Element) – The article XML element.
Returns:: Dictionary with ‘eId’, ‘num’, and ‘heading’ keys.
Return type:: dict

extract_paragraphs_by_eid(node: _Element) → List[Dict[str, str]]

Extract paragraph text grouped by nearest parent eId.

Parameters:: node (etree._Element) – XML node to process for text extraction.
Returns:: List of dicts with ‘eId’ and ‘text’ keys.
Return type:: list
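
Grouping paragraph text by the nearest parent eId might work along these lines. The sketch is an assumption-laden illustration: it ignores namespaces, uses the standard library's ElementTree (which lacks lxml's getparent(), hence the parent map), and joins the text of paragraphs sharing an ancestor with a single space.

```python
import xml.etree.ElementTree as ET

# Invented sample; real documents carry an Akoma Ntoso namespace.
SAMPLE = """
<article eId="art_1">
  <paragraph eId="art_1__para_1">
    <content><p>First paragraph.</p></content>
  </paragraph>
  <paragraph eId="art_1__para_2">
    <content><p>Second,</p><p>continued.</p></content>
  </paragraph>
</article>
"""


def sketch_paragraphs_by_eid(node, id_attr="eId"):
    """Group the text of <p> elements under their nearest ancestor with an ID."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in node.iter() for child in parent}
    grouped = {}  # dicts keep insertion order, so document order is preserved
    for p in node.iter("p"):
        ancestor = parents.get(p)
        while ancestor is not None and ancestor.get(id_attr) is None:
            ancestor = parents.get(ancestor)
        eid = ancestor.get(id_attr) if ancestor is not None else None
        text = " ".join("".join(p.itertext()).split())
        grouped.setdefault(eid, []).append(text)
    return [{"eId": eid, "text": " ".join(parts)} for eid, parts in grouped.items()]


paragraphs = sketch_paragraphs_by_eid(ET.fromstring(SAMPLE))
```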

class tulit.parser.xml.akomantoso.extractors.AKNParseOrchestrator(parser)

Bases: object

Orchestrates the parsing workflow for Akoma Ntoso documents.

Implements the Template Method pattern to reduce parse() method duplication across different AKN parser variants.

__init__(parser)

Initialize with reference to parser instance.

Parameters:: parser (AkomaNtosoParser) – The parser instance to orchestrate.

execute_parse_step(method_name: str, description: str) → None

Execute a single parsing step with error handling and logging.

Parameters:

method_name (str) – Name of the parser method to call.
description (str) – Human-readable description for logging.

execute_standard_workflow() → None

Execute standard AKN parsing workflow.

This is the common sequence used by most AKN parsers: preface -> preamble -> formula -> citations -> recitals -> preamble_final -> body -> chapters -> articles -> conclusions
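
The workflow above can be sketched as a fixed list of steps run through one error-handling wrapper. The class and logger names are hypothetical, and the choice to log a warning and continue after a failed step is an assumption; the text only says that each step runs "with error handling and logging".

```python
import logging

logger = logging.getLogger("akn_sketch")

# Method names in the order given for the standard workflow.
STEPS = [
    ("get_preface", "preface"),
    ("get_preamble", "preamble"),
    ("get_formula", "formula"),
    ("get_citations", "citations"),
    ("get_recitals", "recitals"),
    ("get_preamble_final", "preamble final"),
    ("get_body", "body"),
    ("get_chapters", "chapters"),
    ("get_articles", "articles"),
    ("get_conclusions", "conclusions"),
]


class SketchOrchestrator:
    def __init__(self, parser):
        self.parser = parser

    def execute_parse_step(self, method_name, description):
        """Call one parser method; log and continue if it raises (assumed)."""
        try:
            getattr(self.parser, method_name)()
        except Exception as exc:
            logger.warning("Failed to parse %s: %s", description, exc)

    def execute_standard_workflow(self):
        for method_name, description in STEPS:
            self.execute_parse_step(method_name, description)


class RecordingParser:
    """Stand-in parser that records calls and fails on one step."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        if not name.startswith("get_"):
            raise AttributeError(name)

        def step():
            self.calls.append(name)
            if name == "get_formula":
                raise ValueError("no formula")

        return step


parser = RecordingParser()
SketchOrchestrator(parser).execute_standard_workflow()
```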

class tulit.parser.xml.akomantoso.extractors.AKNContentProcessor(namespaces: Dict[str, str])

Bases: object

Processes complex content structures in Akoma Ntoso documents.

Handles lists, tables, and nested structures common across different AKN document types.

__init__(namespaces: Dict[str, str])

Initialize with namespace configuration.

Parameters:: namespaces (dict) – XML namespace mapping for XPath queries.

extract_list_items(parent: _Element) → List[Dict[str, str]]

Extract list items from an AKN element.

Parameters:: parent (etree._Element) – Parent element containing list items.
Returns:: List of dicts with ‘eId’ and ‘text’ keys.
Return type:: list

extract_table_content(table: _Element) → Dict[str, any]

Extract table content from an AKN table element.

Parameters:: table (etree._Element) – Table element to process.
Returns:: Dictionary with ‘eId’ and ‘rows’ keys.
Return type:: dict

BOE Parser

class tulit.parser.xml.boe.BOEXMLParser

Bases: XMLParser

Parser that converts BOE XML documents to LegalJSON.

Uses BOEArticleStrategy to extract articles, reducing code duplication and improving maintainability.

get_preface() → str | None

Extracts paragraphs from the preface section of the document.

Parameters:

preface_xpath (str) – XPath expression to locate the preface element.
paragraph_xpath (str) – XPath expression to locate the paragraphs within the preface.

Returns:

Updates the instance’s preface attribute with the found preface element.

Return type:

None

get_articles() → None

Extract articles using BOEArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

get_chapters() → list

Extracts chapter information from the document.

Parameters:

chapter_xpath (str) – XPath expression to locate the chapter elements.
num_xpath (str) – XPath expression to locate the chapter number within each chapter element.
heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with keys:
- ‘eId’: Chapter identifier
- ‘chapter_num’: Chapter number
- ‘chapter_heading’: Chapter heading text

Return type:

None

get_citations() → list

Extracts citations from the preamble.

Parameters:

citations_xpath (str) – XPath to locate the citations section.
citation_xpath (str) – XPath to locate individual citations.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s citations attribute with the found citations.

Return type:

None

get_recitals() → list

Extracts recitals from the preamble.

Parameters:

recitals_xpath (str) – XPath expression to locate the recitals section.
recital_xpath (str) – XPath expression to locate individual recitals.
text_xpath (str) – XPath expression to locate the text within each recital.
extract_intro (function, optional) – Function to handle the extraction of the introductory recital.
extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s recitals attribute with the found recitals.

Return type:

None

get_preamble() → None

Extracts the preamble section from the document.

Parameters:

preamble_xpath (str) – XPath expression to locate the preamble element.
notes_xpath (str) – XPath expression to locate notes within the preamble.

Returns:

Updates the instance’s preamble attribute with the found preamble element.

Return type:

None

get_formula() → None

Extracts formula text from the preamble.

Parameters:

formula_xpath (str) – XPath expression to locate the formula element.
paragraph_xpath (str) – XPath expression to locate the paragraphs within the formula.

Returns:

Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.

Return type:

str or None

get_preamble_final() → None

Extracts the final preamble text from the document.

Parameters:: preamble_final_xpath (str) – XPath expression to locate the final preamble element.
Returns:: Updates the instance’s preamble_final attribute with the found final preamble text.
Return type:: None

get_conclusions() → None

Extracts conclusions from the body section.

Override in subclass if format has conclusions. Default implementation does nothing.

Return type:: None

parse(file: str, **options) → BOEXMLParser

Parse a BOE XML document.

Parameters:

file (str) – Path to the BOE XML file
**options (dict) – Optional configuration options

Returns:

Self for method chaining

Return type:

BOEXMLParser

HTML Parsers

Base HTML Parser

class tulit.parser.html.html_parser.HTMLParser

Bases: Parser

Abstract base class for HTML parsers.

Provides common HTML parsing utilities and a template parse() method. Subclasses must implement get_preface() and get_articles(). Optional methods like get_preamble(), get_chapters(), etc. can be overridden.

__init__() → None: Initializes the HTML parser and sets up the BeautifulSoup instance.

get_root(file: str) → None

Loads an HTML file and parses it with BeautifulSoup.

Parameters:: file (str) – The path to the HTML file.
Returns:: The root element is stored in the parser under the ‘root’ attribute.
Return type:: None

parse(file: str, **options) → HTMLParser

Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.

Parameters:

file (str) – Path to the HTML file to parse.
**options (dict) – Optional configuration options

Returns:

Self for method chaining with the parsed elements stored in the attributes.

Return type:

HTMLParser

Cellar HTML Parsers

class tulit.parser.html.cellar.cellar.CellarHTMLParser

Bases: HTMLParser

get_preface() → None

Extracts the preface text from the HTML, if available.

Parameters:: None –
Returns:: The extracted preface is stored in the ‘preface’ attribute.
Return type:: None

get_preamble() → None

Extracts the preamble text from the HTML, if available.

Parameters:: None –
Returns:: The extracted preamble is stored in the ‘preamble’ attribute.
Return type:: None

get_formula() → None

Extracts the formula from the HTML, if present.

Parameters:: None –
Returns:: The extracted formula is stored in the ‘formula’ attribute.
Return type:: None

get_citations() → None

Extracts citations from the HTML.

Parameters:: None –
Returns:: The extracted citations are stored in the ‘citations’ attribute.
Return type:: None

get_recitals() → None

Extracts recitals from the HTML.

Parameters:: None –
Returns:: The extracted recitals are stored in the ‘recitals’ attribute.
Return type:: None

get_preamble_final() → None

Extracts the final preamble text from the HTML, if available.

Parameters:: None –
Returns:: The extracted final preamble is stored in the ‘preamble_final’ attribute.
Return type:: None

get_body() → None

Extracts the body content from the HTML.

Parameters:: None –
Returns:: The extracted body content is stored in the ‘body’ attribute.
Return type:: None

get_chapters() → None: Extracts chapters from the HTML, grouping them by their IDs and headings.

get_articles() → None

Extracts articles from the HTML. Each <div> with an id starting with “art” is treated as an article (eId). Subsequent subdivisions are processed based on the closest parent with an id.

Returns:: List of articles, each containing its eId and associated content.
Return type:: list[dict]
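
The "closest parent with an id" rule can be illustrated with the sketch below. It is not the parser's code: it parses a small well-formed XHTML fragment with the standard library's ElementTree instead of BeautifulSoup, and the sample markup, class names, and output shape ('eId' plus 'children') are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented XHTML fragment; ids and class names are illustrative only.
SAMPLE = """
<div id="enc_1">
  <div id="art_1">
    <p class="oj-ti-art">Article 1</p>
    <div id="001.001"><p class="oj-normal">First paragraph.</p></div>
    <p class="oj-normal">Unnumbered text.</p>
  </div>
  <div id="art_2">
    <p class="oj-ti-art">Article 2</p>
    <p class="oj-normal">Only paragraph.</p>
  </div>
</div>
"""


def sketch_cellar_articles(body):
    """Treat each <div id="art..."> as an article; key text by nearest id."""
    parents = {child: parent for parent in body.iter() for child in parent}
    articles = []
    for div in body.iter("div"):
        if not div.get("id", "").startswith("art"):
            continue
        children = []
        for p in div.iter("p"):
            # Walk upwards until an element with an id is found; the
            # article div itself has one, so the loop always terminates.
            owner = parents[p]
            while owner.get("id") is None:
                owner = parents[owner]
            text = " ".join("".join(p.itertext()).split())
            children.append({"eId": owner.get("id"), "text": text})
        articles.append({"eId": div.get("id"), "children": children})
    return articles


articles = sketch_cellar_articles(ET.fromstring(SAMPLE))
```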

get_conclusions() → None: Extracts conclusions from the HTML, if present.

parse(file: str, **options) → CellarHTMLParser

Parses an XHTML document. If the input is a directory, searches for XHTML files.

Parameters:

file (str) – Path to the XHTML file or directory containing XHTML files.
**options (dict) – Optional configuration options

Returns:

Self for method chaining with extracted content.

Return type:

CellarHTMLParser

class tulit.parser.html.cellar.cellar_standard.CellarStandardHTMLParser

Bases: HTMLParser

Parser for standard HTML format documents from EU Cellar. This format wraps content in <TXT_TE> tags with simple <p> structure, unlike the semantic XHTML format with class-based structure.

get_preface() → None: Extract document title/preface. In standard HTML, this is typically in the metadata or first heading.

get_preamble() → None: Extract preamble content. In standard HTML, the preamble typically includes the decision-making body, references, and recitals.

get_formula() → None: Extract the formula (decision-making body statement). Usually starts with “THE COUNCIL”, “THE COMMISSION”, etc.

get_citations(): Extract citations (legal references). Usually contains phrases like “Having regard to”.

get_recitals(): Extract recitals (whereas clauses). Usually starts with “Whereas:” followed by numbered items.

get_preamble_final(): Extract final preamble statement (e.g., “HAS ADOPTED THIS DECISION:”).

get_body(): The body is the TXT_TE container itself.

get_chapters(): Extract chapters. In standard HTML, these might be section headings. For most documents, this may not apply.

get_articles()

Extract articles from the document using CellarStandardArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

get_conclusions(): Extract conclusion text (e.g., “Done at Brussels, …”).

parse(file_path: str, **options) → CellarStandardHTMLParser

Parse a standard HTML document and extract all components. If the input is a directory, searches for HTML files.

Parameters:

file_path (str) – Path to the HTML file or directory containing HTML files
**options (dict) – Optional configuration:
- validate (bool) – Whether to validate against the LegalJSON schema (default: False)

Returns:

Self for method chaining with parsed document.

Return type:

CellarStandardHTMLParser

class tulit.parser.html.cellar.proposal.ProposalHTMLParser

Bases: HTMLParser

Parser for European Commission proposal documents (COM documents).

These documents have a different structure than regular EUR-Lex legislative acts. They typically contain:
- Metadata (institution, date, reference numbers)
- Proposal status and title
- Explanatory Memorandum with sections and subsections
- Sometimes the actual legal act text at the end

get_metadata() → None

Extracts metadata from the Commission proposal HTML.

Metadata includes:
- Institution name (e.g., “EUROPEAN COMMISSION”)
- Emission date and location
- Reference numbers (COM number, interinstitutional reference)
- Proposal status
- Document type
- Title/subject

Returns:: The extracted metadata is stored in the ‘metadata’ attribute.
Return type:: None

get_explanatory_memorandum() → None

Extracts the Explanatory Memorandum section from the proposal.

The Explanatory Memorandum typically contains:
- Title (class="Exposdesmotifstitre")
- Sections with headings (class="li ManualHeading1", "li ManualHeading2", etc.)
- Numbered paragraphs (class="li ManualNumPar1")
- Normal text (class="Normal")

Returns:: The extracted content is stored in the ‘explanatory_memorandum’ attribute.
Return type:: None

get_preface() → None: For proposals, the preface is the combination of status, document type, and title. This extracts from the SECOND occurrence (the actual legal act), not the first (cover page).

get_preamble() → None

Extracts the preamble of the legal act (not the explanatory memorandum). The preamble appears after the explanatory memorandum and contains:
- Interinstitutional reference
- Status
- Document type
- Title
- Institution acting
- Citations (Having regard to…)
- Recitals (Whereas…)

Returns:: Sets self.preamble to the preamble element
Return type:: None

get_formula() → None

Extracts the formula from the preamble (e.g., “THE COUNCIL OF THE EUROPEAN UNION,”).

Returns:: The extracted formula is stored in the ‘formula’ attribute.
Return type:: None

get_citations() → None

Extracts citations from the preamble (paragraphs starting with “Having regard to”). Citations appear between the formula and “Whereas:”

Returns:: The extracted citations are stored in the ‘citations’ attribute.
Return type:: None

get_recitals() → None

Extracts recitals from the preamble (paragraphs with class “li ManualConsidrant”). Recitals may span multiple content divs.

Returns:: The extracted recitals are stored in the ‘recitals’ attribute.
Return type:: None

get_preamble_final() → None

Extracts the final formula of the preamble (e.g., “HAS ADOPTED THIS DECISION:”).

Returns:: The extracted final preamble is stored in the ‘preamble_final’ attribute.
Return type:: None

get_body() → None

Extracts the body of the legal act (the enacting terms/articles).

Returns:: Sets self.body to the body element
Return type:: None

get_articles() → None

Extracts articles from the body of the legal act.

Note: Due to the complex nested structure of Proposal documents (content divs, list concatenation, nested siblings), the full extraction logic remains in parser helper methods. The strategy pattern provides a consistent interface but delegates to parser-specific methods for the actual complex traversal logic.

Returns:: The extracted articles are stored in the ‘articles’ attribute.
Return type:: None

get_conclusions() → None

Extracts conclusions from the legal act (signature section).

Returns:: The extracted conclusions are stored in the ‘conclusions’ attribute.
Return type:: None

parse(file: str) → ProposalHTMLParser

Parses a Commission proposal HTML file and extracts all relevant information.

Parameters:: file (str) – Path to the HTML file to parse.
Returns:: The parser object with parsed elements stored in attributes.
Return type:: ProposalHTMLParser

Other HTML Parsers

class tulit.parser.html.veneto.VenetoHTMLParser

Bases: HTMLParser

get_root(file: str) → None

Loads an HTML file and parses it with BeautifulSoup.

Parameters:: file (str) – The path to the HTML file.
Returns:: The root element is stored in the parser under the ‘root’ attribute.
Return type:: None

get_preface() → None

Extracts the preface text from the HTML, if available.

Parameters:: None –
Returns:: The extracted preface is stored in the ‘preface’ attribute.
Return type:: None

get_preamble()

Extracts the preamble text from the HTML, if available.

Parameters:: None –
Returns:: The extracted preamble is stored in the ‘preamble’ attribute.
Return type:: None

get_formula()

Extracts the formula from the HTML, if present.

Parameters:: None –
Returns:: The extracted formula is stored in the ‘formula’ attribute.
Return type:: None

get_citations()

Extracts citations from the HTML.

Parameters:: None –
Returns:: The extracted citations are stored in the ‘citations’ attribute.
Return type:: None

get_recitals()

Extracts recitals from the HTML.

Parameters:: None –
Returns:: The extracted recitals are stored in the ‘recitals’ attribute.
Return type:: None

get_preamble_final()

Extracts the final preamble text from the HTML, if available.

Parameters:: None –
Returns:: The extracted final preamble is stored in the ‘preamble_final’ attribute.
Return type:: None

get_body()

Extracts the body content from the HTML.

Parameters:: None –
Returns:: The extracted body content is stored in the ‘body’ attribute.
Return type:: None

get_chapters(): Extracts chapters from the HTML, grouping them by their IDs and headings.

get_articles(): Extracts articles from the HTML. Each <h6> is treated as an article heading, and the next <div> contains the article content. Subdivisions are separated by <br> tags and stored as children.

get_conclusions(): Extracts conclusions from the HTML, if present.

parse(file)

Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.

Parameters:

file (str) – Path to the HTML file to parse.
**options (dict) – Optional configuration options

Returns:

Self for method chaining with the parsed elements stored in the attributes.

Return type:

HTMLParser

Article Extraction Strategies

Article Extraction Strategy Pattern

This module provides a hierarchy of strategies for extracting articles from different document formats (XML, HTML). It eliminates code duplication across parser classes by centralizing common article extraction logic.

Design Pattern: Strategy Pattern
Purpose: Encapsulate article extraction algorithms and make them interchangeable

class tulit.parser.strategies.article_extraction.ArticleExtractionStrategy

Bases: ABC

Abstract base class for article extraction strategies.

This defines the interface that all concrete extraction strategies must implement. Each strategy encapsulates a specific algorithm for extracting articles from a particular document format.

abstract extract_articles(document: Any, **kwargs) → List[Dict[str, Any]]

Extract articles from the given document.

Parameters:

document (Any) – The document to extract articles from (XML Element, HTML BeautifulSoup, etc.)
**kwargs (dict) – Additional parameters specific to the extraction strategy

Returns:

List of article dictionaries with keys: ‘eId’, ‘num’, ‘heading’, ‘children’

Return type:

List[Dict[str, Any]]
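
To make the interface concrete, here is a minimal sketch of the Strategy pattern with one toy strategy. All class names are hypothetical, and the toy strategy works on plain text lines rather than XML or HTML; only the extract_articles signature and the article keys come from the description above.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchStrategy(ABC):
    """Interface: every strategy turns a document into article dicts."""

    @abstractmethod
    def extract_articles(self, document: Any, **kwargs) -> List[Dict[str, Any]]:
        ...


class LineStrategy(SketchStrategy):
    """Toy strategy: every line starting with 'Article' opens a new article."""

    def extract_articles(self, document: str, **kwargs) -> List[Dict[str, Any]]:
        articles: List[Dict[str, Any]] = []
        for line in document.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("Article"):
                articles.append({
                    "eId": f"art_{len(articles) + 1}",
                    "num": line,
                    "heading": None,
                    "children": [],
                })
            elif articles:
                articles[-1]["children"].append({"text": line})
        return articles


class SketchParser:
    """A parser delegates to whichever strategy it was given."""

    def __init__(self, strategy: SketchStrategy):
        self.strategy = strategy
        self.articles: List[Dict[str, Any]] = []

    def get_articles(self, document: Any) -> None:
        self.articles = self.strategy.extract_articles(document)


sketch_parser = SketchParser(LineStrategy())
sketch_parser.get_articles("Article 1\nFirst text.\nArticle 2\nSecond text.")
```

Because the parser only depends on the interface, swapping formats means passing a different strategy object rather than subclassing the parser.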

class tulit.parser.strategies.article_extraction.XMLArticleExtractionStrategy(namespaces: Dict[str, str] | None = None)

Bases: ArticleExtractionStrategy

Base strategy for extracting articles from XML documents.

Provides common XML operations like namespace handling, XPath queries, and text extraction.

__init__(namespaces: Dict[str, str] | None = None)

Initialize XML extraction strategy.

Parameters:: namespaces (dict, optional) – XML namespace mappings

class tulit.parser.strategies.article_extraction.HTMLArticleExtractionStrategy(article_pattern: str | None = None)

Bases: ArticleExtractionStrategy

Base strategy for extracting articles from HTML documents.

Provides common HTML operations like element finding, class matching, and text extraction using BeautifulSoup.

__init__(article_pattern: str | None = None)

Initialize HTML extraction strategy.

Parameters:: article_pattern (str, optional) – Regex pattern to identify article markers

class tulit.parser.strategies.article_extraction.FormexArticleStrategy(namespaces: Dict[str, str] | None = None)

Bases: XMLArticleExtractionStrategy

Strategy for extracting articles from Formex XML documents.

Formex uses ARTICLE elements with IDENTIFIER attributes, and content is stored in PARAG, ALINEA, or LIST/ITEM elements.

extract_articles(document: _Element, **kwargs) → List[Dict[str, Any]]

Extract articles from Formex XML document.

Parameters:

document (lxml.etree._Element) – The body element containing articles
**kwargs (dict) – Optional: 'remove_notes' (bool) – whether to remove NOTE elements

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
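
The traversal described above can be sketched roughly as follows, using the standard library's xml.etree.ElementTree in place of lxml. This is not the library's implementation: the wrapper element, the TI.ART title element, the default of remove_notes, and the shape of each child dictionary are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List, Set


def text_without(element: ET.Element, skip_tags: Set[str]) -> str:
    """Collect text recursively, skipping the content of elements in skip_tags."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in skip_tags:
            parts.append(text_without(child, skip_tags))
        parts.append(child.tail or "")  # text after a skipped element is kept
    return "".join(parts)


def clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_formex_articles(body: ET.Element, **kwargs) -> List[Dict[str, Any]]:
    # Assumed default: notes are removed unless the caller says otherwise.
    skip = {"NOTE"} if kwargs.get("remove_notes", True) else set()
    articles = []
    for article in body.iter("ARTICLE"):
        # TI.ART as the title element is an assumption about the format.
        title = next((c for c in article if c.tag == "TI.ART"), None)
        children = [
            {"eId": parag.get("IDENTIFIER"), "text": clean(text_without(parag, skip))}
            for parag in article.iter("PARAG")
        ]
        articles.append({
            "eId": article.get("IDENTIFIER"),
            "num": clean("".join(title.itertext())) if title is not None else None,
            "heading": None,
            "children": children,
        })
    return articles


body = ET.fromstring(
    '<ENACTING.TERMS><ARTICLE IDENTIFIER="001"><TI.ART>Article 1</TI.ART>'
    '<PARAG IDENTIFIER="001.001"><ALINEA>This Regulation applies'
    '<NOTE> (see OJ L 1)</NOTE> to all operators.</ALINEA></PARAG>'
    '</ARTICLE></ENACTING.TERMS>'
)
without_notes = extract_formex_articles(body)
with_notes = extract_formex_articles(body, remove_notes=False)
```

Skipping NOTE content during text collection, rather than deleting elements from the tree, leaves the input document unmodified. The real strategy may instead remove the elements in place.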

class tulit.parser.strategies.article_extraction.BOEArticleStrategy(namespaces: Dict[str, str] | None = None)

Bases: XMLArticleExtractionStrategy

Strategy for extracting articles from Spanish BOE XML documents.

BOE uses <p class="articulo"> for article titles and <p class="parrafo"> for content paragraphs.

extract_articles(document: _Element, **kwargs) → List[Dict[str, Any]]

Extract articles from BOE XML document.

Parameters:

document (lxml.etree._Element) – The root element
**kwargs (dict) – Not used for BOE

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
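
Grouping by paragraph class, as described above, might look like the following sketch. Each articulo paragraph opens a new article and the parrafo paragraphs that follow are attached to it. The sketch uses xml.etree.ElementTree rather than lxml, and the wrapper elements, identifier scheme, and child dictionary shape are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List, Optional


def extract_boe_articles(root: ET.Element) -> List[Dict[str, Any]]:
    articles: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    for p in root.iter("p"):
        css_class = p.get("class")
        text = " ".join("".join(p.itertext()).split())
        if css_class == "articulo":
            # A title paragraph starts a new article.
            current = {
                "eId": f"art_{len(articles) + 1}",  # assumed identifier scheme
                "num": text,
                "heading": None,
                "children": [],
            }
            articles.append(current)
        elif css_class == "parrafo" and current is not None:
            # Content paragraphs before the first title are ignored.
            current["children"].append({"text": text})
    return articles


root = ET.fromstring(
    '<documento><texto>'
    '<p class="parrafo">Preámbulo.</p>'
    '<p class="articulo">Artículo 1. Objeto.</p>'
    '<p class="parrafo">Esta ley regula el objeto.</p>'
    '<p class="articulo">Artículo 2. Ámbito.</p>'
    '<p class="parrafo">Se aplica en todo el territorio.</p>'
    '<p class="parrafo">Quedan excluidos ciertos supuestos.</p>'
    '</texto></documento>'
)
boe_articles = extract_boe_articles(root)
```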

class tulit.parser.strategies.article_extraction.CellarStandardArticleStrategy

Bases: HTMLArticleExtractionStrategy

Strategy for extracting articles from Cellar HTML documents (standard format).

Cellar documents use specific paragraph patterns to mark article starts and structure content.

extract_articles(document: Any, **kwargs) → List[Dict[str, Any]]

Extract articles from Cellar HTML document.

Parameters:

document (BeautifulSoup element) – The txt_te container element
**kwargs (dict) – Optional: 'stop_markers' (list) – text patterns that signal the end of the articles

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
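
To illustrate the role of stop_markers, the sketch below works on a plain list of paragraph strings instead of a BeautifulSoup element. It assumes that a stop marker is matched as a prefix of the paragraph text and that an article starts at a paragraph such as "Article 1". Both are assumptions for illustration, as is the split_articles helper.

```python
import re
from typing import Any, Dict, List, Optional

ARTICLE_START = re.compile(r"^Article\s+\d+$")  # assumed article marker


def split_articles(paragraphs: List[str],
                   stop_markers: Optional[List[str]] = None) -> List[Dict[str, Any]]:
    stop_markers = stop_markers or []
    articles: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    for raw in paragraphs:
        text = raw.strip()
        # A stop marker ends extraction: nothing after it belongs to an article.
        if any(text.startswith(marker) for marker in stop_markers):
            break
        if ARTICLE_START.match(text):
            current = {"eId": f"art_{len(articles) + 1}", "num": text,
                       "heading": None, "children": []}
            articles.append(current)
        elif current is not None and text:
            current["children"].append({"text": text})
    return articles


paragraphs = [
    "Article 1",
    "This Decision is addressed to the Member States.",
    "Article 2",
    "It shall enter into force on the day of its publication.",
    "Done at Brussels, 1 January 2020.",
    "For the Commission",
]
stopped = split_articles(paragraphs, stop_markers=["Done at"])
unstopped = split_articles(paragraphs)
```

Without a stop marker, the closing formula and signature lines are swallowed by the last article. Comparing stopped and unstopped shows the difference.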

class tulit.parser.strategies.article_extraction.ProposalArticleStrategy

Bases: HTMLArticleExtractionStrategy

Strategy for extracting articles from EU Proposal HTML documents.

Proposals use <p class="Titrearticle"> for article headers and various paragraph classes for content.

extract_articles(document: Any, **kwargs) → List[Dict[str, Any]]

Extract articles from Proposal HTML document.

Parameters:

document (BeautifulSoup root) – The document root element
**kwargs (dict) – Not used for Proposal

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
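
The header-based grouping described above can be sketched with the standard library's html.parser in place of BeautifulSoup. Each Titrearticle paragraph opens an article, and every later paragraph, whatever its class, is attached as content. The Normal class in the sample, the identifier scheme, and the TitreArticleCollector class are assumptions for this example.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class TitreArticleCollector(HTMLParser):
    """Hypothetical collector that groups <p> elements under article headers."""

    def __init__(self) -> None:
        super().__init__()
        self.articles: List[Dict[str, Any]] = []
        self._in_p = False
        self._is_title = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._is_title = dict(attrs).get("class") == "Titrearticle"
            self._buffer = []

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or not self._in_p:
            return
        text = " ".join("".join(self._buffer).split())
        if self._is_title:
            self.articles.append({"eId": f"art_{len(self.articles) + 1}",
                                  "num": text, "heading": None, "children": []})
        elif self.articles and text:
            # Paragraphs before the first header are not part of any article.
            self.articles[-1]["children"].append({"text": text})
        self._in_p = False


html_text = (
    '<div><p class="Normal">Explanatory memorandum.</p>'
    '<p class="Titrearticle">Article 1</p>'
    '<p class="Normal">This Regulation lays down rules.</p>'
    '<p class="Titrearticle">Article 2</p>'
    '<p class="Normal">Definitions apply.</p></div>'
)
collector = TitreArticleCollector()
collector.feed(html_text)
collector.close()
proposal_articles = collector.articles
```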