Parsers

This subpackage contains modules for parsing legal documents in various formats into a JSON representation.

The formats currently supported are:

  • XML Formats:

    • Formex 4 (EU legislative documents)

    • Akoma Ntoso 3.0 (multiple variants: EU, German LegalDocML, Luxembourg CSD13)

    • BOE XML (Spanish Official Gazette)

  • HTML Formats:

    • Cellar XHTML (semantic structure)

    • Cellar Standard HTML (simple structure)

    • EU Legislative Proposals

Core Parser Architecture

Parser Base Module

This module provides the abstract Parser base class and JSON validation utilities. All concrete parsers should inherit from the Parser class and implement the required abstract methods.

The module imports domain models, exceptions, the registry, and normalization strategies from their own dedicated modules.

class tulit.parser.parser.Parser

Bases: ABC

Abstract base class for legal document parsers.

All subclasses must implement:

  • get_preface()

  • get_articles()

  • parse()

Optional methods with default implementations:

  • get_preamble()

  • get_formula()

  • get_citations()

  • get_recitals()

  • get_preamble_final()

  • get_body()

  • get_chapters()

  • get_conclusions()

root

Root element of the XML or HTML document.

Type:

lxml.etree._Element or bs4.BeautifulSoup

preface

Extracted preface text from the document.

Type:

str or None

preamble

The preamble section of the document.

Type:

lxml.etree.Element or bs4.Tag or None

formula

The formula element extracted from the preamble.

Type:

str or None

citations

List of extracted citations from the preamble.

Type:

list

recitals

List of extracted recitals from the preamble.

Type:

list

preamble_final

The final preamble text extracted from the document.

Type:

str or None

body

The body section of the document.

Type:

lxml.etree.Element or bs4.Tag or None

chapters

List of extracted chapters from the body.

Type:

list

articles

List of extracted articles from the body. Each article is a dictionary with keys:

  • ‘eId’: Article identifier

  • ‘text’: Article text

  • ‘children’: List of child elements of the article

Type:

list

conclusions

Extracted conclusions from the body.

Type:

None or dict

__init__() None

Initializes the Parser object.

Parameters:

None

abstract get_preface() str | None

Extract document preface/title.

MUST be implemented by all subclasses.

Returns:

Document title/preface text

Return type:

str or None

abstract get_articles() None

Extract articles from document body.

MUST be implemented by all subclasses. Extracts articles and stores them in self.articles as a list of dictionaries.

Returns:

Articles are stored in self.articles attribute

Return type:

None

abstract parse(file: str, **options) Parser

Parse document and extract all components.

MUST be implemented by all subclasses.

Parameters:
  • file (str) – Path to document file

  • **options (dict) – Optional parser-specific configuration options

Returns:

Self (for method chaining)

Return type:

Parser
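The contract above (three abstract methods, results stored on the instance, parse() returning self) can be illustrated with a minimal sketch. BaseParser below is a stand-in that mirrors the documented interface, not the library class, and PlainTextParser with its line-per-article format and art_N identifiers is purely hypothetical.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Stand-in for the documented Parser interface (not the library class)."""

    def __init__(self) -> None:
        self.preface = None
        self.articles = []

    @abstractmethod
    def get_preface(self): ...

    @abstractmethod
    def get_articles(self) -> None: ...

    @abstractmethod
    def parse(self, file: str, **options): ...


class PlainTextParser(BaseParser):
    """Hypothetical format: first line is the title, each other line an article."""

    def get_preface(self):
        self.preface = self._lines[0] if self._lines else None
        return self.preface

    def get_articles(self) -> None:
        # Results are stored on the instance rather than returned.
        self.articles = [
            {"eId": f"art_{i}", "text": line, "children": []}
            for i, line in enumerate(self._lines[1:], start=1)
        ]

    def parse(self, file: str, **options):
        with open(file, encoding="utf-8") as handle:
            self._lines = [line.strip() for line in handle if line.strip()]
        self.get_preface()
        self.get_articles()
        return self  # returning self enables method chaining
```

Because the three methods are abstract, instantiating the base class directly raises TypeError, which catches incomplete subclasses early.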

get_preamble() Any | None

Extract preamble section.

Override in subclass if format has preamble. Default returns None.

Returns:

Preamble element or None if not present

Return type:

Any or None

get_formula() str | None

Extract formula (enacting clause).

Override in subclass if format has formula. Default returns None.

Returns:

Formula text or None if not present

Return type:

str or None

get_citations() list[dict[str, str]]

Extract citations/references.

Override in subclass if format has citations. Default returns empty list.

Returns:

List of citation dictionaries

Return type:

list[dict[str, str]]

get_recitals() list[dict[str, str]]

Extract recitals (whereas clauses).

Override in subclass if format has recitals. Default returns empty list.

Returns:

List of recital dictionaries

Return type:

list[dict[str, str]]

get_preamble_final() str | None

Extract final preamble text.

Override in subclass if format has final preamble. Default returns None.

Returns:

Final preamble text or None if not present

Return type:

str or None

get_body() Any | None

Extract body section.

Override in subclass if needed. Default returns None.

Returns:

Body element or None

Return type:

Any or None

get_chapters() list[dict[str, Any]]

Extract chapters.

Override in subclass if format has chapters. Default returns empty list.

Returns:

List of chapter dictionaries

Return type:

list[dict[str, Any]]

get_conclusions() dict[str, Any] | None

Extract conclusions section.

Override in subclass if format has conclusions. Default returns None.

Returns:

Conclusions dictionary or None if not present

Return type:

dict[str, Any] or None

to_dict() dict[str, Any]

Convert the parser’s extracted data to a dictionary.

This version ensures that common non-JSON-native objects are converted to JSON-serializable forms. It will:

  • Call .to_dict() on domain model objects (Citation, Article, etc.) if available.

  • Recursively convert lists and dicts.

  • Convert BeautifulSoup Tag objects to their text content.

  • Convert lxml elements to their concatenated text content.

Returns:

A dictionary containing all extracted elements from the document with JSON-serializable values.

Return type:

dict
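The conversion rules listed for to_dict() could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the helper name to_serializable is hypothetical, the Citation class is a local stand-in, and xml.etree.ElementTree stands in for lxml so the example runs with the standard library alone (BeautifulSoup handling is omitted).

```python
import json
import xml.etree.ElementTree as ET
from typing import Any


def to_serializable(value: Any) -> Any:
    """Recursively convert a value into JSON-friendly types."""
    if hasattr(value, "to_dict"):
        # Domain model objects expose their own conversion.
        return to_serializable(value.to_dict())
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    if isinstance(value, ET.Element):
        # Stand-in for an lxml element: concatenate all descendant text.
        return "".join(value.itertext())
    return value


class Citation:
    """Local stand-in for a domain model with a to_dict() method."""

    def __init__(self, eId: str, text: str) -> None:
        self.eId, self.text = eId, text

    def to_dict(self) -> dict:
        return {"eId": self.eId, "text": self.text}


data = {
    "citations": [Citation("cit_1", "Having regard to the Treaty")],
    "body": ET.fromstring("<body><p>Article 1</p><p> applies.</p></body>"),
}
serialized = json.dumps(to_serializable(data))
```

Checking for to_dict() first keeps the helper independent of any particular model class.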

class tulit.parser.parser.LegalJSONValidator(schema_path: str | None = None)

Bases: object

Validator for LegalJSON output using the LegalJSON schema.

validate(data: dict[str, Any]) bool

Validate a LegalJSON object against the LegalJSON schema. Returns True if valid, False otherwise.

Domain Models

Domain Models Module

This module contains domain model classes representing legal document structures. These models provide a clear, type-safe representation of legal documents, independent of the parsing implementation.

class tulit.parser.models.Citation(eId: str, text: str)

Bases: object

Represents a citation in a legal document.

eId: str
text: str
to_dict() Dict[str, Any]

Convert citation to dictionary format.

class tulit.parser.models.Recital(eId: str, text: str)

Bases: object

Represents a recital (whereas clause) in a legal document.

eId: str
text: str
to_dict() Dict[str, Any]

Convert recital to dictionary format.

class tulit.parser.models.ArticleChild(eId: str, text: str, amendment: bool | None = None)

Bases: object

Represents a child element of an article (paragraph, point, etc.).

eId

Element identifier

Type:

str

text

Content text

Type:

str

amendment

Whether this is an amendment marker

Type:

bool, optional

eId: str
text: str
amendment: bool | None = None
to_dict() Dict[str, Any]

Convert article child to dictionary format.

class tulit.parser.models.Article(eId: str, num: str, heading: str | None = None, children: List[ArticleChild] = None)

Bases: object

Represents an article in a legal document.

eId

Article identifier

Type:

str

num

Article number

Type:

str

heading

Article heading/title

Type:

str, optional

children

Child elements (paragraphs, points)

Type:

List[ArticleChild]

eId: str
num: str
heading: str | None = None
children: List[ArticleChild] = None
to_dict() Dict[str, Any]

Convert article to dictionary format.
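A sketch of how models with these fields might nest when converted to dictionaries is given below. It assumes the models behave like dataclasses and that the output keys match the field names; neither is confirmed here. Unlike the documented signature, which defaults children to None, the sketch uses an empty list per instance so that callers can always iterate.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ArticleChild:
    eId: str
    text: str
    amendment: Optional[bool] = None

    def to_dict(self) -> Dict[str, Any]:
        return {"eId": self.eId, "text": self.text, "amendment": self.amendment}


@dataclass
class Article:
    eId: str
    num: str
    heading: Optional[str] = None
    # default_factory gives each article its own list (a sketch-level choice).
    children: List[ArticleChild] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "eId": self.eId,
            "num": self.num,
            "heading": self.heading,
            # Children are converted too, so the result contains only plain types.
            "children": [child.to_dict() for child in self.children],
        }


article = Article(
    eId="art_1",
    num="Article 1",
    heading="Scope",
    children=[ArticleChild("art_1__para_1", "This Regulation applies.")],
)
as_dict = article.to_dict()
```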

class tulit.parser.models.Chapter(eId: str, num: str, heading: str | None = None)

Bases: object

Represents a chapter in a legal document.

eId

Chapter identifier

Type:

str

num

Chapter number

Type:

str

heading

Chapter heading/title

Type:

str, optional

eId: str
num: str
heading: str | None = None
to_dict() Dict[str, Any]

Convert chapter to dictionary format.

Parser Registry

Parser Registry Module

This module provides a registry pattern for managing parser implementations. It allows for dynamic parser discovery and instantiation based on format types.

class tulit.parser.registry.ParserRegistry

Bases: object

Registry for managing parser implementations.

This class implements the Registry pattern to allow dynamic parser discovery and instantiation. Parsers can be registered with format identifiers and aliases, and then retrieved by format name.

Example

>>> registry = ParserRegistry()
>>> registry.register('xml', XMLParser)
>>> parser = registry.create('xml')
__init__()

Initialize an empty parser registry.

register(format_id: str, parser_class: Type, aliases: List[str] | None = None) None

Register a parser class for a given format.

Parameters:
  • format_id (str) – Primary identifier for this parser format

  • parser_class (Type) – The parser class to register

  • aliases (List[str], optional) – Alternative names for this format

Raises:

ParserError – If format_id or any alias is already registered

register_factory(format_id: str, factory_func: Callable, aliases: List[str] | None = None) None

Register a factory function for creating parser instances.

This is useful when parser instantiation requires special logic or when dealing with parser variants.

Parameters:
  • format_id (str) – Primary identifier for this parser format

  • factory_func (Callable) – Function that returns a parser instance

  • aliases (List[str], optional) – Alternative names for this format

create(format_id: str, *args, **kwargs)

Create a parser instance for the given format.

Parameters:
  • format_id (str) – Format identifier or alias

  • *args – Arguments to pass to parser constructor

  • **kwargs – Arguments to pass to parser constructor

Returns:

An instance of the requested parser

Return type:

Parser

Raises:

ParserError – If format_id is not registered

list_formats() List[str]

List all registered format identifiers.

Returns:

List of format identifiers (not including aliases)

Return type:

List[str]

list_aliases() Dict[str, str]

Get mapping of aliases to their primary format identifiers.

Returns:

Mapping of alias -> format_id

Return type:

Dict[str, str]

is_registered(format_id: str) bool

Check if a format or alias is registered.

Parameters:

format_id (str) – Format identifier or alias to check

Returns:

True if format is registered

Return type:

bool
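The registry behaviour described above (primary identifiers, aliases, rejection of duplicates, creation by either name) can be sketched in a few lines. SimpleRegistry, RegistryError, and DummyParser are hypothetical stand-ins written for this illustration; the real classes are ParserRegistry and ParserError. Since a class is itself a callable that returns an instance, the sketch folds register() and register_factory() into a single method.

```python
from typing import Any, Callable, Dict, List, Optional


class RegistryError(Exception):
    """Stand-in for the documented ParserError."""


class SimpleRegistry:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Any]] = {}
        self._aliases: Dict[str, str] = {}

    def is_registered(self, name: str) -> bool:
        return name in self._factories or name in self._aliases

    def register(self, format_id: str, factory: Callable[..., Any],
                 aliases: Optional[List[str]] = None) -> None:
        # Check every name before mutating, so a failure leaves no partial state.
        for name in [format_id, *(aliases or [])]:
            if self.is_registered(name):
                raise RegistryError(f"'{name}' is already registered")
        self._factories[format_id] = factory
        for alias in aliases or []:
            self._aliases[alias] = format_id

    def create(self, name: str, *args: Any, **kwargs: Any) -> Any:
        format_id = self._aliases.get(name, name)  # resolve alias to primary id
        if format_id not in self._factories:
            raise RegistryError(f"No parser registered for '{name}'")
        return self._factories[format_id](*args, **kwargs)

    def list_formats(self) -> List[str]:
        return list(self._factories)  # primary identifiers only, no aliases


class DummyParser:
    def __init__(self, strict: bool = False) -> None:
        self.strict = strict


registry = SimpleRegistry()
registry.register("formex", DummyParser, aliases=["fmx4"])
parser = registry.create("fmx4", strict=True)
```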

tulit.parser.registry.get_parser_registry() ParserRegistry

Get the global parser registry instance.

Returns:

The global parser registry

Return type:

ParserRegistry

tulit.parser.registry.register_parser(format_id: str, parser_class: Type = None, factory: Callable = None, aliases: List[str] | None = None) None

Convenience function to register a parser in the global registry.

Parameters:
  • format_id (str) – Primary identifier for the parser

  • parser_class (Type, optional) – Parser class to register

  • factory (Callable, optional) – Factory function that returns a parser instance

  • aliases (List[str], optional) – Alternative names for the parser

Example

>>> register_parser('xml', XMLParser, aliases=['xmldoc'])
tulit.parser.registry.get_parser(format_id: str, **kwargs)

Convenience function to get a parser from the global registry.

Parameters:
  • format_id (str) – Parser format identifier or alias

  • **kwargs (dict) – Arguments to pass to parser constructor/factory

Returns:

Instantiated parser

Return type:

Parser

Example

>>> parser = get_parser('xml', schema_path='schema.xsd')

Text Normalization

Text Normalization Strategies Module

This module provides text normalization strategies following the Strategy pattern. Different normalization algorithms can be selected at runtime, making parsers more flexible and testable.

class tulit.parser.normalization.TextNormalizationStrategy

Bases: ABC

Abstract base class for text normalization strategies.

The Strategy pattern allows different text cleaning/normalization algorithms to be selected at runtime, making parsers more flexible and testable.

Example

>>> normalizer = WhitespaceNormalizer()
>>> normalizer.normalize("  multiple   spaces  ")
'multiple spaces'
abstract normalize(text: str) str

Normalize the given text according to the strategy’s rules.

Parameters:

text (str) – Text to normalize

Returns:

Normalized text

Return type:

str

class tulit.parser.normalization.WhitespaceNormalizer(fix_punctuation: bool = True)

Bases: TextNormalizationStrategy

Normalizes whitespace in text.

  • Removes newlines, tabs, carriage returns

  • Collapses multiple spaces to single space

  • Strips leading/trailing whitespace

  • Optionally fixes spacing before punctuation

__init__(fix_punctuation: bool = True)

Initialize whitespace normalizer.

Parameters:

fix_punctuation (bool, optional) – Whether to remove spaces before punctuation (default: True)

normalize(text: str) str

Remove and normalize whitespace.
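The whitespace rules listed above can be approximated with two regular expressions. This is a sketch, not the library code: the function name is hypothetical, the set of punctuation characters is an assumption, and newlines and tabs are replaced by a single space here so that adjacent words do not run together.

```python
import re


def normalize_whitespace(text: str, fix_punctuation: bool = True) -> str:
    # Newlines, tabs, carriage returns, and runs of spaces become one space.
    text = re.sub(r"\s+", " ", text).strip()
    if fix_punctuation:
        # Drop the space that extraction often leaves before punctuation
        # (the character set below is an assumption).
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text


cleaned = normalize_whitespace("Article 1 ,\n\tparagraph 2 .")
```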

class tulit.parser.normalization.UnicodeNormalizer(unicode_form: str | None = None, replace_nbsp: bool = True)

Bases: TextNormalizationStrategy

Normalizes unicode characters in text.

  • Replaces non-breaking spaces with regular spaces

  • Optionally normalizes unicode to a specific form (NFC, NFD, NFKC, NFKD)

__init__(unicode_form: str | None = None, replace_nbsp: bool = True)

Initialize unicode normalizer.

Parameters:
  • unicode_form (str, optional) – Unicode normalization form (‘NFC’, ‘NFD’, ‘NFKC’, ‘NFKD’)

  • replace_nbsp (bool, optional) – Whether to replace non-breaking spaces with regular spaces (default: True)

normalize(text: str) str

Normalize unicode characters.
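For reference, both behaviours map directly onto the standard library: str.replace for the non-breaking space (U+00A0) and unicodedata.normalize for the four normalization forms. The function below is a hypothetical sketch with the same parameters as the documented constructor, not the class itself.

```python
import unicodedata
from typing import Optional


def normalize_unicode(text: str, unicode_form: Optional[str] = None,
                      replace_nbsp: bool = True) -> str:
    if replace_nbsp:
        text = text.replace("\u00a0", " ")  # non-breaking space -> regular space
    if unicode_form is not None:
        # Accepts 'NFC', 'NFD', 'NFKC', or 'NFKD'.
        text = unicodedata.normalize(unicode_form, text)
    return text


spaced = normalize_unicode("Article\u00a01")
composed = normalize_unicode("e\u0301", unicode_form="NFC")
```

NFC composes a base letter and a combining accent into one code point, while the compatibility forms (NFKC, NFKD) additionally fold ligatures and similar presentation variants.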

class tulit.parser.normalization.PatternReplacementNormalizer(patterns: List[tuple[str, str]])

Bases: TextNormalizationStrategy

Normalizes text using regex pattern replacements.

Useful for removing specific markers, formatting codes, or document-specific artifacts.

__init__(patterns: List[tuple[str, str]])

Initialize pattern replacement normalizer.

Parameters:

patterns (List[tuple[str, str]]) – List of (pattern, replacement) tuples for regex substitution

Example

>>> normalizer = PatternReplacementNormalizer([
...     (r'▼[A-Z]\d*', ''),  # Remove consolidation markers
...     (r'^\(\d+\)', '')     # Remove leading numbers in parentheses
... ])
normalize(text: str) str

Apply pattern replacements.

class tulit.parser.normalization.CompositeNormalizer(strategies: List[TextNormalizationStrategy])

Bases: TextNormalizationStrategy

Composite strategy that applies multiple normalizers in sequence.

This allows combining different normalization strategies in a specific order to achieve complex text cleaning operations.

Example

>>> normalizer = CompositeNormalizer([
...     UnicodeNormalizer(),
...     WhitespaceNormalizer(),
...     PatternReplacementNormalizer([(r'▼[A-Z]\d*', '')])
... ])
>>> clean_text = normalizer.normalize(raw_text)
__init__(strategies: List[TextNormalizationStrategy])

Initialize composite normalizer.

Parameters:

strategies (List[TextNormalizationStrategy]) – List of normalizers to apply in order

normalize(text: str) str

Apply all strategies in sequence.
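The composite idea can be shown with a self-contained sketch. The classes below are simplified stand-ins with hypothetical names, not the library's normalizers; the point is only that each strategy exposes normalize() and the composite feeds the output of one into the next, so ordering affects the result.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Tuple


class Strategy(ABC):
    @abstractmethod
    def normalize(self, text: str) -> str: ...


class PatternReplacement(Strategy):
    def __init__(self, patterns: List[Tuple[str, str]]) -> None:
        # Compile once; patterns are applied in the order given.
        self._patterns = [(re.compile(p), repl) for p, repl in patterns]

    def normalize(self, text: str) -> str:
        for pattern, replacement in self._patterns:
            text = pattern.sub(replacement, text)
        return text


class CollapseWhitespace(Strategy):
    def normalize(self, text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()


class Composite(Strategy):
    def __init__(self, strategies: List[Strategy]) -> None:
        self._strategies = list(strategies)

    def normalize(self, text: str) -> str:
        for strategy in self._strategies:
            text = strategy.normalize(text)
        return text


marker_removal = PatternReplacement([(r"▼[A-Z]\d*", "")])
pipeline = Composite([marker_removal, CollapseWhitespace()])
cleaned = pipeline.normalize("▼M1  Article 1\n shall apply")
```

Running the whitespace step before the marker removal would leave a leading space behind, which is one reason to remove markers first.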

tulit.parser.normalization.create_standard_normalizer() CompositeNormalizer

Create a standard text normalizer suitable for most legal documents.

Applies:

  1. Unicode normalization (non-breaking spaces)

  2. Whitespace normalization (newlines, tabs, multiple spaces)

  3. Punctuation spacing fixes

Returns:

Composite normalizer with standard strategies

Return type:

CompositeNormalizer

tulit.parser.normalization.create_html_normalizer() CompositeNormalizer

Create a normalizer for HTML-based legal documents.

Applies:

  1. Pattern removal (consolidation markers)

  2. Unicode normalization

  3. Whitespace normalization

Returns:

Composite normalizer for HTML documents

Return type:

CompositeNormalizer

tulit.parser.normalization.create_formex_normalizer() CompositeNormalizer

Create a normalizer for Formex XML documents.

Applies:

  1. Pattern removal (leading parentheses numbers)

  2. Unicode normalization

  3. Whitespace normalization

Returns:

Composite normalizer for Formex documents

Return type:

CompositeNormalizer

Parser Exceptions

Parser Exceptions Module

This module contains all custom exception classes for the parser package. Organizing exceptions in a dedicated module improves maintainability and allows for better exception handling patterns.

exception tulit.parser.exceptions.ParserError

Bases: Exception

Base exception for all parser-related errors.

exception tulit.parser.exceptions.ParseError

Bases: ParserError

Raised when parsing fails due to malformed input.

exception tulit.parser.exceptions.ValidationError

Bases: ParserError

Raised when validation against a schema fails.

exception tulit.parser.exceptions.ExtractionError

Bases: ParserError

Raised when extraction of specific content fails.

exception tulit.parser.exceptions.FileLoadError

Bases: ParserError

Raised when loading a file fails.
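Because every exception derives from ParserError, callers can handle a specific failure first and fall back to the base class for everything else. The sketch below redefines a small part of the hierarchy locally so that it runs on its own; real code would import the classes from tulit.parser.exceptions. The load() and describe_failure() helpers are hypothetical.

```python
class ParserError(Exception):
    """Local stand-in for the documented base exception."""


class FileLoadError(ParserError):
    """Local stand-in: raised when loading a file fails."""


def load(path: str) -> str:
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Chain the original error so the root cause stays visible.
        raise FileLoadError(f"Cannot load {path}") from exc


def describe_failure(path: str) -> str:
    try:
        load(path)
    except FileLoadError as exc:
        return f"file problem: {exc}"  # most specific handler first
    except ParserError as exc:
        return f"parser problem: {exc}"  # catches any other parser error
    return "ok"


message = describe_failure("/nonexistent/definitely-missing.xml")
```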

XML Parsers

Base XML Parser

XML Parser Base Module

This module provides the abstract XMLParser base class for XML-based document parsers. All XML parsers should inherit from XMLParser and implement the required abstract methods.

The XMLParser class integrates XML validation, node extraction utilities, and text normalization from the organized helper modules.

class tulit.parser.xml.xml.XMLParser(normalizer: TextNormalizationStrategy | None = None)

Bases: Parser

Abstract base class for XML parsers.

Provides common XML parsing utilities and helper methods. Uses XMLValidator for schema validation and TextNormalizationStrategy for text processing.

Subclasses must implement get_preface() and get_articles(), and either implement parse() themselves or, preferably, use the provided parse() template method and override the individual component methods.

valid

Indicates whether the XML file is valid against the schema.

Type:

bool or None

format

The format of the XML file (e.g., ‘Akoma Ntoso’, ‘Formex 4’).

Type:

str or None

validation_errors

Validation errors if the XML file is invalid.

Type:

lxml.etree._LogEntry or None

namespaces

Dictionary containing XML namespaces.

Type:

dict

normalizer

Strategy for text normalization operations.

Type:

TextNormalizationStrategy

__init__(normalizer: TextNormalizationStrategy | None = None) None

Initializes the XMLParser object with default attributes.

Parameters:

normalizer (TextNormalizationStrategy, optional) – Text normalization strategy to use. Defaults to standard normalizer.

property namespaces: dict[str, str]

Get the XML namespaces dictionary.

load_schema(schema: str) None

Load an XSD schema for XML validation.

Delegates to XMLValidator for actual schema loading.

Parameters:

schema (str) – Filename of the XSD schema file

Return type:

None

validate(file: str, format: str) bool

Validate an XML file against the loaded schema.

Delegates to XMLValidator for actual validation.

Parameters:
  • file (str) – Path to the XML file to validate

  • format (str) – Name of the format for logging (e.g., ‘Akoma Ntoso’, ‘Formex 4’)

Returns:

True if valid, False otherwise. Also updates self.valid attribute.

Return type:

bool

remove_node(tree, node)

Removes specified nodes from the XML tree while preserving their tail text.

Delegates to XMLNodeExtractor for node removal.

Parameters:
  • tree (lxml.etree._Element) – The XML tree or subtree to process.

  • node (str) – XPath expression identifying the nodes to remove.

Returns:

The modified XML tree with specified nodes removed.

Return type:

lxml.etree._Element
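Preserving tail text matters because, in the element tree model, the text that follows a removed element belongs to that element. A rough standard-library sketch of the idea is shown below; the library itself works on lxml trees, and the function name and sample markup here are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def remove_nodes_preserving_tail(tree: ET.Element, path: str) -> ET.Element:
    # ElementTree has no getparent(), so build a child -> parent map first.
    parent_of = {child: parent for parent in tree.iter() for child in parent}
    for node in tree.findall(path):
        parent = parent_of.get(node)
        if parent is None:
            continue  # the root itself cannot be removed
        tail = node.tail or ""
        siblings = list(parent)
        index = siblings.index(node)
        if index == 0:
            # No previous sibling: the tail joins the parent's leading text.
            parent.text = (parent.text or "") + tail
        else:
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + tail
        parent.remove(node)
    return tree


root = ET.fromstring("<P>Article 1<NOTE>footnote</NOTE> applies.</P>")
remove_nodes_preserving_tail(root, ".//NOTE")
result = "".join(root.itertext())
```

Without the tail handling, removing the NOTE element would also drop the words that follow it.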

get_root(file: str | None = None)

Parses an XML file and returns its root element using secure parser settings.

Parameters:

file (str, optional) – Path to the XML file. If not provided, uses the file path from parse()

Return type:

None

Raises:

FileLoadError – If file cannot be loaded or parsed

get_preface(preface_xpath, paragraph_xpath) None

Extracts paragraphs from the preface section of the document.

Parameters:
  • preface_xpath (str) – XPath expression to locate the preface element.

  • paragraph_xpath (str) – XPath expression to locate the paragraphs within the preface.

Returns:

Updates the instance’s preface attribute with the found preface element.

Return type:

None

get_preamble(preamble_xpath, notes_xpath) None

Extracts the preamble section from the document.

Parameters:
  • preamble_xpath (str) – XPath expression to locate the preamble element.

  • notes_xpath (str) – XPath expression to locate notes within the preamble.

Returns:

Updates the instance’s preamble attribute with the found preamble element

Return type:

None

get_formula(formula_xpath: str, paragraph_xpath: str) str

Extracts formula text from the preamble.

Parameters:
  • formula_xpath (str) – XPath expression to locate the formula element.

  • paragraph_xpath (str) – XPath expression to locate the paragraphs within the formula.

Returns:

Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.

Return type:

str or None

get_citations(citations_xpath, citation_xpath, extract_eId=None)

Extracts citations from the preamble.

Parameters:
  • citations_xpath (str) – XPath to locate the citations section.

  • citation_xpath (str) – XPath to locate individual citations.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s citations attribute with the found citations.

Return type:

None

get_recitals(recitals_xpath, recital_xpath, text_xpath, extract_intro=None, extract_eId=None)

Extracts recitals from the preamble.

Parameters:
  • recitals_xpath (str) – XPath expression to locate the recitals section.

  • recital_xpath (str) – XPath expression to locate individual recitals.

  • text_xpath (str) – XPath expression to locate the text within each recital.

  • extract_intro (function, optional) – Function to handle the extraction of the introductory recital.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s recitals attribute with the found recitals.

Return type:

None

get_preamble_final(preamble_final_xpath) str

Extracts the final preamble text from the document.

Parameters:

preamble_final_xpath (str) – XPath expression to locate the final preamble element.

Returns:

Updates the instance’s preamble_final attribute with the found final preamble text.

Return type:

None

get_body(body_xpath) None

Extracts the body element from the document.

Parameters:

body_xpath (str) – XPath expression to locate the body element. For Akoma Ntoso, this is usually ‘.//akn:body’, while for Formex it is ‘.//ENACTING.TERMS’.

Returns:

Updates the instance’s body attribute with the found body element.

Return type:

None

get_chapters(chapter_xpath: str, num_xpath: str, heading_xpath: str, extract_eId=None, get_headings=None) None

Extracts chapter information from the document.

Parameters:
  • chapter_xpath (str) – XPath expression to locate the chapter elements.

  • num_xpath (str) – XPath expression to locate the chapter number within each chapter element.

  • heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with keys:

  • ‘eId’: Chapter identifier

  • ‘chapter_num’: Chapter number

  • ‘chapter_heading’: Chapter heading text

Return type:

None

abstract get_articles() None

Extracts articles from the body section.

MUST be implemented by all XML parser subclasses. Subclasses should extract articles according to their specific XML format and store them in self.articles.

Returns:

Articles are stored in self.articles attribute

Return type:

None

get_conclusions()

Extracts conclusions from the body section.

Override in subclass if format has conclusions. Default implementation does nothing.

Return type:

None

parse(file: str, **options) XMLParser

Template method that orchestrates the entire parsing workflow.

DO NOT OVERRIDE THIS METHOD. Instead, override individual component extraction methods like get_preface(), get_articles(), etc.

Parameters:
  • file (str) – Path to the XML file to parse.

  • **options (dict) – Optional configuration:

    • schema (str) – Path to the XSD schema file

    • format (str) – Format of the XML file (e.g., ‘Akoma Ntoso’, ‘Formex 4’)

Returns:

Self for method chaining with the parsed data stored in its attributes.

Return type:

XMLParser
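The template-method arrangement can be pictured with a small stand-in. The class names, the three steps, and their order are assumptions chosen for illustration; the actual sequence run by parse() is not specified here.

```python
class TemplateParser:
    """Stand-in showing the template-method shape; not the library class."""

    def __init__(self) -> None:
        self.file = None
        self.options = {}
        self.articles = []
        self.calls = []  # records step order, for illustration only

    def parse(self, file: str, **options) -> "TemplateParser":
        # Fixed workflow: subclasses customise the steps, not the order.
        self.file = file
        self.options = options
        for step in (self.get_preface, self.get_body, self.get_articles):
            step()
        return self

    def get_preface(self) -> None:
        self.calls.append("preface")

    def get_body(self) -> None:
        self.calls.append("body")

    def get_articles(self) -> None:
        raise NotImplementedError


class DemoParser(TemplateParser):
    # Only the format-specific step is overridden; parse() is left alone.
    def get_articles(self) -> None:
        self.calls.append("articles")
        self.articles = [{"eId": "art_1", "text": f"from {self.file}"}]


parser = DemoParser().parse("document.xml", format="Formex 4")
```

Keeping the orchestration in one place means every format runs the same steps in the same order, and a subclass only supplies what differs.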

XML Helpers

XML Helper Utilities Module

This module provides utility classes for common XML operations including XPath-based extraction, validation, and node manipulation. These utilities reduce code duplication across XML-based parsers.

class tulit.parser.xml.helpers.XMLNodeExtractor(namespaces: dict[str, str] | None = None)

Bases: object

Utility class for XPath-based XML node extraction and manipulation.

This class encapsulates common XPath operations and text extraction patterns, reducing duplication and complexity in XML parsers.

namespaces

Dictionary of XML namespaces for XPath queries

Type:

dict

Example

>>> extractor = XMLNodeExtractor({'akn': 'http://...'})
>>> node = extractor.find(root, './/akn:article')
>>> text = extractor.extract_text(node)
__init__(namespaces: dict[str, str] | None = None)

Initialize the node extractor.

Parameters:

namespaces (dict, optional) – Dictionary of namespace prefixes to URIs

find(element: _Element, xpath: str) _Element | None

Find the first element matching the XPath expression.

Parameters:
  • element (lxml.etree._Element) – Root element to search from

  • xpath (str) – XPath expression

Returns:

First matching element or None

Return type:

lxml.etree._Element or None

findall(element: _Element, xpath: str) List[_Element]

Find all elements matching the XPath expression.

Parameters:
  • element (lxml.etree._Element) – Root element to search from

  • xpath (str) – XPath expression

Returns:

List of matching elements

Return type:

list[lxml.etree._Element]

extract_text(element: _Element, strip: bool = True) str

Extract all text content from an element and its descendants.

Parameters:
  • element (lxml.etree._Element) – Element to extract text from

  • strip (bool, optional) – Whether to strip whitespace (default: True)

Returns:

Concatenated text content

Return type:

str

extract_text_from_all(parent: _Element, xpath: str, strip: bool = True) List[str]

Extract text from all elements matching the XPath.

Parameters:
  • parent (lxml.etree._Element) – Parent element to search from

  • xpath (str) – XPath expression

  • strip (bool, optional) – Whether to strip whitespace (default: True)

Returns:

List of extracted text strings

Return type:

list[str]

safe_find(element: _Element, xpath: str, default: _Element | None = None) _Element | None

Safely find an element, returning default if not found.

Parameters:
  • element (lxml.etree._Element) – Root element to search from

  • xpath (str) – XPath expression

  • default (lxml.etree._Element, optional) – Value to return if not found

Returns:

Found element or default value

Return type:

lxml.etree._Element or default

safe_find_text(element: _Element, xpath: str, default: str = '') str

Safely find an element and extract its text.

Parameters:
  • element (lxml.etree._Element) – Root element to search from

  • xpath (str) – XPath expression

  • default (str, optional) – Value to return if not found

Returns:

Extracted text or default value

Return type:

str

remove_nodes(tree: _Element, xpath: str, preserve_tail: bool = True) _Element

Remove nodes matching XPath, optionally preserving tail text.

Parameters:
  • tree (lxml.etree._Element) – Tree to modify

  • xpath (str) – XPath expression for nodes to remove

  • preserve_tail (bool, optional) – Whether to preserve tail text (default: True)

Returns:

Modified tree

Return type:

lxml.etree._Element
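The "safe" lookups combine a namespace-aware search with a default for missing nodes. A standard-library approximation is sketched below; it uses xml.etree.ElementTree instead of lxml, a free function instead of the class, and a placeholder namespace URI that does not correspond to any real schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Placeholder prefix and URI, chosen for this example only.
NS = {"ex": "http://example.org/ns"}


def safe_find_text(element: ET.Element, xpath: str,
                   namespaces: Dict[str, str], default: str = "") -> str:
    node = element.find(xpath, namespaces)
    if node is None:
        return default
    # itertext() also collects text inside nested inline elements.
    return "".join(node.itertext()).strip()


root = ET.fromstring(
    '<doc xmlns="http://example.org/ns">'
    "<article><num>Article 1</num><p>Text with <b>inline</b> markup.</p></article>"
    "</doc>"
)
num = safe_find_text(root, ".//ex:num", NS)
para = safe_find_text(root, ".//ex:article/ex:p", NS)
missing = safe_find_text(root, ".//ex:heading", NS, default="(none)")
```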

class tulit.parser.xml.helpers.XMLValidator

Bases: object

Handles XML schema loading and validation.

This class provides robust schema validation with proper error handling and logging. It supports both XSD and RelaxNG schemas.

Example

>>> validator = XMLValidator()
>>> validator.load_schema('schema.xsd')
>>> is_valid = validator.validate(xml_root)
__init__()

Initialize the XML validator.

load_schema(schema_path: str, schema_type: str = 'xsd') bool

Load an XML schema file.

Parameters:
  • schema_path (str) – Path to the schema file

  • schema_type (str, optional) – Type of schema (‘xsd’ or ‘relaxng’), default: ‘xsd’

Returns:

True if schema loaded successfully

Return type:

bool

validate(xml_tree: _Element) bool

Validate an XML tree against the loaded schema.

Parameters:

xml_tree (lxml.etree._Element) – XML tree to validate

Returns:

True if validation succeeds

Return type:

bool

get_validation_errors() List[str]

Get list of validation error messages.

Returns:

List of error messages from last validation

Return type:

list[str]

Formex Parser

class tulit.parser.xml.formex.Formex4Parser

Bases: XMLParser

A parser for processing and extracting content from Formex XML files.

The parser handles XML documents following the Formex schema for legal documents. It inherits from the XMLParser class and provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.

__init__() None

Initializes the Formex4Parser object with the Formex namespace.

get_preface() None

Extracts the preface from the document. It is assumed that the preface is contained within the TITLE and P elements.

get_preamble() None

Extracts the preamble from the document. It is assumed that the preamble is contained within the PREAMBLE element, while notes are contained within the NOTE elements.

get_formula() None

Extracts the formula from the preamble. The formula is assumed to be contained within the PREAMBLE.INIT element.

Returns:

Formula text from the preamble.

Return type:

str

get_citations() None

Extracts citations from the preamble. Citations are assumed to be contained within the GR.VISA and VISA elements. The citation identifier is set as the index of the citation in the preamble.

Returns:

List of dictionaries containing citation data with keys:

  • ‘eId’: Citation identifier, which is the index of the citation in the preamble

  • ‘text’: Citation text

Return type:

list
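As an illustration of index-based identifiers, the sketch below walks GR.VISA groups and numbers each VISA element in document order. It uses the standard library rather than lxml, and both the function name and the cit_N identifier format with 1-based numbering are assumptions; the parser's actual eId format is not shown in this section.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_citations(preamble: ET.Element) -> List[Dict[str, str]]:
    citations: List[Dict[str, str]] = []
    for group in preamble.iter("GR.VISA"):
        for visa in group.findall("VISA"):
            # Join nested text, then collapse internal whitespace.
            text = " ".join("".join(visa.itertext()).split())
            # Hypothetical identifier: position of the citation in the preamble.
            citations.append({"eId": f"cit_{len(citations) + 1}", "text": text})
    return citations


preamble = ET.fromstring(
    "<PREAMBLE><GR.VISA>"
    "<VISA>Having regard to the Treaty,</VISA>"
    "<VISA>Having regard to the proposal\n from the Commission,</VISA>"
    "</GR.VISA></PREAMBLE>"
)
citations = extract_citations(preamble)
```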

get_recitals() None

Extracts recitals from the preamble. Recitals are assumed to be contained within the GR.CONSID and CONSID elements. The introductory recital is extracted separately. The recital identifier is set as the index of the recital in the preamble.

Returns:

List of dictionaries containing recital text and eId for each recital. Returns None if no recitals are found.

Return type:

list or None

get_preamble_final() None

Extracts the final preamble text from the document. The final preamble text is assumed to be contained within the PREAMBLE.FINAL element.

get_body() None

Extracts the body section from the document. The body is assumed to be contained within the ENACTING.TERMS element.

get_chapters() None

Extracts chapter information from the document. Chapter numbers and headings are assumed to be contained within the TITLE element. The chapter identifier is set as the index of the chapter in the document.

Returns:

List of dictionaries containing chapter data with keys:

  • ‘eId’: Chapter identifier

  • ‘chapter_num’: Chapter number

  • ‘chapter_heading’: Chapter heading text

Return type:

list

get_articles() None

Extracts articles from the ENACTING.TERMS section using FormexArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

Returns:

Articles with identifier and content.

Return type:

list

get_conclusions() None

Extracts conclusions from the document. The conclusion text is assumed to be contained within the FINAL section of the document. The signature details are assumed to be contained within the SIGNATURE element.

Returns:

Dictionary containing the conclusion text and signature details.

Return type:

dict

clean_text(element: _Element) str
parse(file: str, **options) Formex4Parser

Parses a FORMEX XML document and extracts its components using the extraction methods inherited from the XMLParser class. If the input is a directory, the method searches it for the correct XML file (one containing ACT or DECISION tags).

Parameters:
  • file (str) – Path to the FORMEX XML file or directory containing FORMEX files.

  • **options (dict) – Optional configuration options (passed to parent XMLParser)

Returns:

Self for method chaining with parsed data.

Return type:

Formex4Parser
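The directory handling described for parse() might look roughly like the sketch below. The helper name find_formex_act, the ‘*.xml’ glob, and the rule "first file, in sorted order, that contains an ACT or DECISION element anywhere" are assumptions made for illustration; the library's actual selection logic may differ.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_formex_act(path):
    """Return `path` if it is a file, else the first matching XML file (or None)."""
    target = Path(path)
    if target.is_file():
        return target
    for candidate in sorted(target.glob("*.xml")):
        try:
            tags = {element.tag for element in ET.parse(candidate).iter()}
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if tags & {"ACT", "DECISION"}:
            return candidate
    return None


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a_toc.xml").write_text("<PUBLICATION/>", encoding="utf-8")
    Path(tmp, "b_act.xml").write_text("<ACT><TITLE/></ACT>", encoding="utf-8")
    found_name = find_formex_act(tmp).name
```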

Akoma Ntoso Parsers

Akoma Ntoso Base Parser

This module provides the base AkomaNtosoParser class for processing legal documents in the Akoma Ntoso 3.0 format. All variant parsers (AKN4EU, German LegalDocML, Luxembourg) inherit from this base class.

class tulit.parser.xml.akomantoso.base.AkomaNtosoParser

Bases: XMLParser

Base parser for processing Akoma Ntoso 3.0 legal documents.

The parser handles XML documents following the Akoma Ntoso 3.0 schema for legal documents. It provides methods to extract various components like preface, preamble, chapters, articles, and conclusions.

namespaces

Dictionary mapping namespace prefixes to their URIs.

Type:

dict

Example

>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> articles = parser.articles

__init__() None

Initialize the Akoma Ntoso parser with standard namespaces.

get_preface() None

Extract preface information from the document.

The preface is contained within the ‘preface’ element in the XML file.

get_preamble() None

Extract preamble information from the document.

The preamble is contained within the ‘preamble’ element in the XML file.

get_formula() None

Extract formula from the preamble.

The formula is contained within the ‘formula’ element in the XML file. The formula text is extracted from all paragraphs within the formula element.

Returns:

Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.

Return type:

str or None

get_citations() None

Extract citations from the preamble.

The citations are contained within the ‘citations’ element. Each citation is extracted from the ‘citation’ element, with text from all paragraphs.

get_recitals() None

Extract recitals from the preamble.

Recitals are contained within the ‘recitals’ element. Each recital is extracted from the ‘recital’ element, with text from all paragraphs.

get_preamble_final() None

Extract the final part of the preamble.

This is typically the text after citations and recitals, contained in the ‘preamble.final’ block.

get_body() None

Extract the body section from the document.

The body contains the main content including articles, chapters, etc.

get_chapters() None

Extract chapters from the body.

Chapters structure the main content and may contain articles.

extract_eId(element: _Element, index: int | None = None) str

Extract the element ID (eId) from an XML element.

The standard Akoma Ntoso format uses ‘eId’ attribute for element identification. Subclasses may override this for format-specific ID extraction.

Parameters:
  • element (lxml.etree._Element) – XML element to extract ID from

  • index (int, optional) – Index to use if no ID attribute is found

Returns:

The element ID, or formatted index if no ID found

Return type:

str
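A minimal sketch of the "attribute first, index as fallback" behaviour described above. It uses xml.etree.ElementTree instead of lxml, and the ‘element_<index>’ fallback format is hypothetical, since the actual formatting is not stated here.

```python
import xml.etree.ElementTree as ET


def sketch_extract_eid(element, index=None):
    """Prefer the 'eId' attribute; otherwise build an identifier from `index`."""
    eid = element.get("eId")
    if eid is not None:
        return eid
    # Hypothetical fallback format; the real one is not documented here.
    return f"element_{index}"


with_id = ET.fromstring('<article eId="art_1"/>')
without_id = ET.fromstring("<article/>")
```

Calling sketch_extract_eid(with_id, 0) returns the attribute value, while sketch_extract_eid(without_id, 3) falls back to the index-based string.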

get_articles() None

Extract articles from the body using AKNArticleExtractor.

Articles are the main structural units of legal documents. This method uses AKNArticleExtractor to handle the extraction logic. Also handles sections for jurisdictions that use sections instead of articles.

get_conclusions() None

Extract conclusions from the document.

Conclusions contain closing text and signatures.

parse(file: str, **options) AkomaNtosoParser

Parse an Akoma Ntoso document to extract all components.

This method validates the document against the Akoma Ntoso 3.0 schema and extracts all content using the orchestrator pattern.

Parameters:
  • file (str) – Path to the Akoma Ntoso XML file to parse

  • **options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

AkomaNtosoParser

Example

>>> parser = AkomaNtosoParser()
>>> parser.parse('document.xml')
>>> print(len(parser.articles))

AKN4EU Parser

This module provides the AKN4EU parser for European Union legal documents using the Akoma Ntoso for EU (AKN4EU) format.

class tulit.parser.xml.akomantoso.akn4eu.AKN4EUParser

Bases: AkomaNtosoParser

Parser for AKN4EU (Akoma Ntoso for European Union) documents.

This parser handles EU legal documents that use the AKN4EU variant of Akoma Ntoso, which includes EU-specific extensions and conventions.

Key Differences from Standard Akoma Ntoso:

  • Uses XML ‘id’ attribute instead of ‘eId’ for element identification

  • Follows EU-specific document structure conventions

Example

>>> parser = AKN4EUParser()
>>> parser.parse('eu_regulation.xml')
>>> print(parser.preface)

__init__() None

Initialize the AKN4EU parser.

extract_eId(element: _Element, index: int | None = None) str

Extract element ID from XML ‘id’ attribute (AKN4EU convention).

AKN4EU documents use the standard XML ‘id’ attribute from the XML namespace instead of the ‘eId’ attribute.

Parameters:
  • element (lxml.etree._Element) – XML element to extract ID from

  • index (int, optional) – Index to use if no ID attribute is found

Returns:

The element ID from xml:id attribute, or formatted index if not found

Return type:

str
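Because xml:id lives in the XML namespace, the attribute has to be looked up under its namespace-qualified name. The sketch below shows this with xml.etree.ElementTree (lxml uses the same Clark notation); the function name and the ‘element_<index>’ fallback are assumptions.

```python
import xml.etree.ElementTree as ET

# Clark notation for the xml:id attribute (the 'xml' prefix is predeclared).
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def sketch_extract_xml_id(element, index=None):
    """Return the xml:id value, or a hypothetical index-based fallback."""
    xml_id = element.get(XML_ID)
    if xml_id is not None:
        return xml_id
    return f"element_{index}"


article = ET.fromstring('<article xml:id="art_1"/>')
```

Note that article.get("id") would return None for this element: the plain, un-namespaced ‘id’ attribute is a different attribute from xml:id.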

German LegalDocML Parser

This module provides the parser for German LegalDocML documents, which follow the Akoma Ntoso structure but use a German-specific namespace.

class tulit.parser.xml.akomantoso.german.GermanLegalDocMLParser

Bases: AkomaNtosoParser

Parser for German LegalDocML documents.

This parser handles German legal documents that follow the Akoma Ntoso structure but use the German RIS (Rechtsinformationssystem) namespace.

German LegalDocML Namespace: http://Inhaltsdaten.LegalDocML.de/1.8.2/

Key Differences from Standard Akoma Ntoso:

  • Uses German-specific namespace while maintaining AKN structure

  • Schema validation is skipped (German-specific schema variations)

  • All XPath queries work seamlessly due to namespace remapping
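One plausible reading of "namespace remapping" is that the same prefix is bound to a different URI, so that path expressions written against that prefix need no changes. The sketch below illustrates the idea with xml.etree.ElementTree; the ‘akn’ prefix, the helper name, and the sample document are assumptions rather than the parser's actual configuration.

```python
import xml.etree.ElementTree as ET

STANDARD_NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}
GERMAN_NS = {"akn": "http://Inhaltsdaten.LegalDocML.de/1.8.2/"}

GERMAN_DOC = (
    '<akomaNtoso xmlns="http://Inhaltsdaten.LegalDocML.de/1.8.2/">'
    "<act><body><article/><article/></body></act>"
    "</akomaNtoso>"
)


def count_articles(xml_text, namespaces):
    """Count article elements using one fixed, prefix-based path expression."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//akn:article", namespaces))


german_count = count_articles(GERMAN_DOC, GERMAN_NS)
standard_count = count_articles(GERMAN_DOC, STANDARD_NS)
```

With the German mapping the query finds both articles; with the standard mapping the same query finds none, because the elements belong to a different namespace.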

Example

>>> parser = GermanLegalDocMLParser()
>>> parser.parse('german_law.xml')
>>> print(parser.articles)

__init__() None

Initialize the German LegalDocML parser with German namespace.

parse(file: str, **options) GermanLegalDocMLParser

Parse a German LegalDocML document to extract its components.

German LegalDocML follows Akoma Ntoso structure but uses a German-specific namespace and may have schema variations. This method bypasses schema validation and directly extracts the content.

Parameters:
  • file (str) – Path to the German LegalDocML XML file to parse

  • **options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

GermanLegalDocMLParser

Example

>>> parser = GermanLegalDocMLParser()
>>> parser.parse('bgb.xml')

Luxembourg Akoma Ntoso Parser

This module provides the parser for Luxembourg legal documents using the Committee Specification Draft 13 (CSD13) variant of Akoma Ntoso 3.0.

class tulit.parser.xml.akomantoso.luxembourg.LuxembourgAKNParser

Bases: AkomaNtosoParser

Parser for Luxembourg Akoma Ntoso documents (CSD13 variant).

This parser handles Luxembourg Legilux documents which use the Committee Specification Draft 13 (CSD13) namespace variant of Akoma Ntoso 3.0.

Luxembourg Namespace: http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13

Key Differences from Standard Akoma Ntoso:

  • Uses CSD13 namespace variant

  • Uses ‘id’ attribute instead of ‘eId’ for element identification

  • Content is nested in <alinea><content><p> structure

  • Includes Luxembourg-specific metadata namespace (http://www.scl.lu)

Example

>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_law.xml')
>>> print(parser.articles)

__init__() None

Initialize the Luxembourg parser with CSD13 namespace.

extract_eId(element: _Element, index: int | None = None) str

Extract element ID from ‘id’ attribute (Luxembourg convention).

Luxembourg documents use the ‘id’ attribute instead of ‘eId’ for element identification.

Parameters:
  • element (lxml.etree._Element) – XML element to extract ID from

  • index (int, optional) – Index to use if no ID attribute is found

Returns:

The ID value from the ‘id’ attribute, or formatted index if not found

Return type:

str

parse(file: str, **options) LuxembourgAKNParser

Parse a Luxembourg Akoma Ntoso document to extract its components.

Luxembourg documents use the CSD13 variant and may have specific structural differences. This method bypasses schema validation and uses the orchestrator for content extraction.

Parameters:
  • file (str) – Path to the Luxembourg Akoma Ntoso XML file to parse

  • **options (dict) – Additional parsing options passed to the orchestrator

Returns:

Self for method chaining

Return type:

LuxembourgAKNParser

Example

>>> parser = LuxembourgAKNParser()
>>> parser.parse('luxembourg_code.xml')

get_articles() None

Extract articles from the body using AKNArticleExtractor with ‘id’ attribute.

Luxembourg documents use ‘id’ instead of ‘eId’ for element identification.

Akoma Ntoso Utility Functions

This module provides utility functions for detecting Akoma Ntoso formats and creating appropriate parser instances.

tulit.parser.xml.akomantoso.utils.detect_akn_format(file_path: str) str

Automatically detect the Akoma Ntoso format/dialect based on the XML namespace.

This function examines the root element’s namespace to determine which variant of Akoma Ntoso is being used (standard, German LegalDocML, Luxembourg CSD13, or AKN4EU).

Parameters:

file_path (str) – Path to the XML file

Returns:

Format identifier: ‘german’, ‘akn4eu’, ‘luxembourg’, or ‘akn’ (standard)

Return type:

str
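Namespace-based detection can be sketched as below. Only the two namespaces quoted in this documentation are mapped; how AKN4EU is told apart from standard Akoma Ntoso is not described here, so this sketch simply falls back to ‘akn’ for everything else. The function name and the use of a binary stream instead of a file path are assumptions.

```python
import io
import xml.etree.ElementTree as ET

NAMESPACE_TO_FORMAT = {
    "http://Inhaltsdaten.LegalDocML.de/1.8.2/": "german",
    "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13": "luxembourg",
}


def sketch_detect_format(stream):
    """Classify a document by the namespace of its root element."""
    # The first 'start' event belongs to the root element, so we stop there.
    _event, root = next(ET.iterparse(stream, events=("start",)))
    namespace = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    return NAMESPACE_TO_FORMAT.get(namespace, "akn")


luxembourg_doc = io.BytesIO(
    b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"/>'
)
standard_doc = io.BytesIO(
    b'<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0"/>'
)
detected = [sketch_detect_format(luxembourg_doc), sketch_detect_format(standard_doc)]
```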

Example

>>> format_type = detect_akn_format('document.xml')
>>> print(format_type)
'akn4eu'

tulit.parser.xml.akomantoso.utils.create_akn_parser(file_path: str | None = None, format: str | None = None) XMLParser

Factory function to create the appropriate Akoma Ntoso parser.

This function uses the registry pattern to instantiate the correct parser based on either explicit format specification or automatic detection.

Parameters:
  • file_path (str, optional) – Path to the XML file (required for auto-detection)

  • format (str, optional) – Explicitly specify format: ‘german’, ‘akn4eu’, ‘luxembourg’, or ‘akn’. If not provided, the format is auto-detected from file_path.

Returns:

Appropriate parser instance for the detected or specified format

Return type:

XMLParser

Raises:

ValueError – If neither file_path nor format is provided
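The registry-plus-factory idea can be sketched in a few lines. Everything below is hypothetical: the stub classes, the module-level dictionary, and the injected detect callable stand in for the real parser classes, registry, and detection function.

```python
_REGISTRY = {}


def register(name, factory):
    """Associate a format name with a callable that builds a parser."""
    _REGISTRY[name] = factory


class StandardStub:
    """Stand-in for a standard Akoma Ntoso parser."""


class GermanStub:
    """Stand-in for a German LegalDocML parser."""


def sketch_create_parser(file_path=None, format=None, detect=lambda path: "akn"):
    """Build a parser from an explicit format or from auto-detection."""
    if file_path is None and format is None:
        raise ValueError("Either file_path or format must be provided")
    name = format if format is not None else detect(file_path)
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown format: {name!r}") from None


register("akn", StandardStub)
register("german", GermanStub)
explicit = sketch_create_parser(format="german")
detected = sketch_create_parser(file_path="document.xml")
```

An explicit format takes precedence over detection in this sketch, so no file access is needed when the caller already knows the dialect.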

Example

>>> # Auto-detect format
>>> parser = create_akn_parser(file_path='document.xml')
>>>
>>> # Explicitly specify format
>>> parser = create_akn_parser(format='german')

tulit.parser.xml.akomantoso.utils.register_akn_parsers() None

Register all Akoma Ntoso parser variants in the registry.

This function should be called during module initialization to ensure all parser types are available for the factory function.

Helper classes for Akoma Ntoso article and content extraction.

This module provides specialized extractors to reduce duplication across AkomaNtoso parser variants and improve code organization.

class tulit.parser.xml.akomantoso.extractors.AKNArticleExtractor(namespaces: Dict[str, str], id_attr: str = 'eId')

Bases: object

Extracts article information from Akoma Ntoso documents.

Centralizes common article extraction logic used across different AKN parser variants (standard, AKN4EU, German, Luxembourg).

__init__(namespaces: Dict[str, str], id_attr: str = 'eId')

Initialize with namespace configuration.

Parameters:
  • namespaces (dict) – XML namespace mapping for XPath queries.

  • id_attr (str) – The attribute name used for element IDs (default ‘eId’).

extract_article_metadata(article: _Element) Dict[str, str | None]

Extract basic article metadata (eId, num, heading).

Parameters:

article (etree._Element) – The article XML element.

Returns:

Dictionary with ‘eId’, ‘num’, and ‘heading’ keys.

Return type:

dict

extract_paragraphs_by_eid(node: _Element) List[Dict[str, str]]

Extract paragraph text grouped by nearest parent eId.

Parameters:

node (etree._Element) – XML node to process for text extraction.

Returns:

List of dicts with ‘eId’ and ‘text’ keys.

Return type:

list
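Grouping paragraph text by the nearest ancestor that carries an identifier could work along these lines. This is a sketch under assumptions: it uses xml.etree.ElementTree (which has no getparent(), hence the parent map), ignores namespaces for brevity, and joins the texts of one group with a single space.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<article eId="art_1">
  <paragraph eId="art_1__para_1">
    <content><p>First.</p><p>Still first.</p></content>
  </paragraph>
  <paragraph eId="art_1__para_2">
    <content><p>Second.</p></content>
  </paragraph>
</article>
"""


def sketch_paragraphs_by_eid(node, id_attr="eId"):
    """Return [{'eId': ..., 'text': ...}] grouped by nearest ancestor with an ID."""
    parents = {child: parent for parent in node.iter() for child in parent}
    grouped = {}  # dicts keep insertion order, so document order is preserved
    for p in node.iter("p"):
        ancestor = parents.get(p)
        while ancestor is not None and ancestor.get(id_attr) is None:
            ancestor = parents.get(ancestor)
        if ancestor is None:
            continue  # no identified ancestor inside `node`
        text = " ".join("".join(p.itertext()).split())
        grouped.setdefault(ancestor.get(id_attr), []).append(text)
    return [{"eId": eid, "text": " ".join(texts)} for eid, texts in grouped.items()]


paragraphs = sketch_paragraphs_by_eid(ET.fromstring(SAMPLE))
```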

class tulit.parser.xml.akomantoso.extractors.AKNParseOrchestrator(parser)

Bases: object

Orchestrates the parsing workflow for Akoma Ntoso documents.

Implements Template Method pattern to reduce parse() method duplication across different AKN parser variants.

__init__(parser)

Initialize with reference to parser instance.

Parameters:

parser (AkomaNtosoParser) – The parser instance to orchestrate.

execute_parse_step(method_name: str, description: str) None

Execute a single parsing step with error handling and logging.

Parameters:
  • method_name (str) – Name of the parser method to call.

  • description (str) – Human-readable description for logging.

execute_standard_workflow() None

Execute standard AKN parsing workflow.

This is the common sequence used by most AKN parsers: preface -> preamble -> formula -> citations -> recitals -> preamble_final -> body -> chapters -> articles -> conclusions
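The workflow above can be sketched as a loop over method names. The class below is not the library's orchestrator: its name, the logger name, and the choice to log a failure and continue with the next step are assumptions about what "error handling and logging" means here.

```python
import logging

logger = logging.getLogger("akn_sketch")


class SketchOrchestrator:
    """Runs the get_* methods of a parser in a fixed order."""

    STEPS = [
        "preface", "preamble", "formula", "citations", "recitals",
        "preamble_final", "body", "chapters", "articles", "conclusions",
    ]

    def __init__(self, parser):
        self.parser = parser

    def execute_parse_step(self, method_name, description):
        try:
            getattr(self.parser, method_name)()
            logger.debug("Parsed %s", description)
        except Exception as exc:  # assumed policy: log and keep going
            logger.warning("Could not parse %s: %s", description, exc)

    def execute_standard_workflow(self):
        for step in self.STEPS:
            self.execute_parse_step(f"get_{step}", step)


class RecordingParser:
    """Stub parser that records which steps were requested."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the get_* steps.
        def step():
            self.calls.append(name)
            if name == "get_formula":
                raise ValueError("no formula in this document")
        return step


stub = RecordingParser()
SketchOrchestrator(stub).execute_standard_workflow()
```

In this sketch the failing get_formula step does not prevent the later steps from running.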

class tulit.parser.xml.akomantoso.extractors.AKNContentProcessor(namespaces: Dict[str, str])

Bases: object

Processes complex content structures in Akoma Ntoso documents.

Handles lists, tables, and nested structures common across different AKN document types.

__init__(namespaces: Dict[str, str])

Initialize with namespace configuration.

Parameters:

namespaces (dict) – XML namespace mapping for XPath queries.

extract_list_items(parent: _Element) List[Dict[str, str]]

Extract list items from an AKN element.

Parameters:

parent (etree._Element) – Parent element containing list items.

Returns:

List of dicts with ‘eId’ and ‘text’ keys.

Return type:

list

extract_table_content(table: _Element) Dict[str, any]

Extract table content from an AKN table element.

Parameters:

table (etree._Element) – Table element to process.

Returns:

Dictionary with ‘eId’ and ‘rows’ keys.

Return type:

dict

BOE Parser

class tulit.parser.xml.boe.BOEXMLParser

Bases: XMLParser

Parser that converts BOE XML documents to LegalJSON.

Uses BOEArticleStrategy to extract articles, reducing code duplication and improving maintainability.

get_preface() str | None

Extracts paragraphs from the preface section of the document.

Parameters:
  • preface_xpath (str) – XPath expression to locate the preface element.

  • paragraph_xpath (str) – XPath expression to locate the paragraphs within the preface.

Returns:

Updates the instance’s preface attribute with the found preface element.

Return type:

None

get_articles() None

Extract articles using BOEArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

get_chapters() list

Extracts chapter information from the document.

Parameters:
  • chapter_xpath (str) – XPath expression to locate the chapter elements.

  • num_xpath (str) – XPath expression to locate the chapter number within each chapter element.

  • heading_xpath (str) – XPath expression to locate the chapter heading within each chapter element.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s chapters attribute with the found chapter data. Each chapter is a dictionary with keys:

  • ‘eId’: Chapter identifier

  • ‘chapter_num’: Chapter number

  • ‘chapter_heading’: Chapter heading text

Return type:

None

get_citations() list

Extracts citations from the preamble.

Parameters:
  • citations_xpath (str) – XPath to locate the citations section.

  • citation_xpath (str) – XPath to locate individual citations.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s citations attribute with the found citations.

Return type:

None

get_recitals() list

Extracts recitals from the preamble.

Parameters:
  • recitals_xpath (str) – XPath expression to locate the recitals section.

  • recital_xpath (str) – XPath expression to locate individual recitals.

  • text_xpath (str) – XPath expression to locate the text within each recital.

  • extract_intro (function, optional) – Function to handle the extraction of the introductory recital.

  • extract_eId (function, optional) – Function to handle the extraction or generation of eId.

Returns:

Updates the instance’s recitals attribute with the found recitals.

Return type:

None

get_preamble() None

Extracts the preamble section from the document.

Parameters:
  • preamble_xpath (str) – XPath expression to locate the preamble element.

  • notes_xpath (str) – XPath expression to locate notes within the preamble.

Returns:

Updates the instance’s preamble attribute with the found preamble element

Return type:

None

get_formula() None

Extracts formula text from the preamble.

Parameters:
  • formula_xpath (str) – XPath expression to locate the formula element.

  • paragraph_xpath (str) – XPath expression to locate the paragraphs within the formula.

Returns:

Concatenated text from all paragraphs within the formula element. Returns None if no formula is found.

Return type:

str or None

get_preamble_final() None

Extracts the final preamble text from the document.

Parameters:

preamble_final_xpath (str) – XPath expression to locate the final preamble element.

Returns:

Updates the instance’s preamble_final attribute with the found final preamble text.

Return type:

None

get_conclusions() None

Extracts conclusions from the body section.

Override in subclass if format has conclusions. Default implementation does nothing.

Return type:

None

parse(file: str, **options) BOEXMLParser

Parse a BOE XML document.

Parameters:
  • file (str) – Path to the BOE XML file

  • **options (dict) – Optional configuration options

Returns:

Self for method chaining

Return type:

BOEXMLParser

HTML Parsers

Base HTML Parser

class tulit.parser.html.html_parser.HTMLParser

Bases: Parser

Abstract base class for HTML parsers.

Provides common HTML parsing utilities and a template parse() method. Subclasses must implement get_preface() and get_articles(). Optional methods like get_preamble(), get_chapters(), etc. can be overridden.

__init__() None

Initializes the HTML parser and sets up the BeautifulSoup instance.

get_root(file: str) None

Loads an HTML file and parses it with BeautifulSoup.

Parameters:

file (str) – The path to the HTML file.

Returns:

The root element is stored in the parser under the ‘root’ attribute.

Return type:

None

parse(file: str, **options) HTMLParser

Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.

Parameters:
  • file (str) – Path to the HTML file to parse.

  • **options (dict) – Optional configuration options

Returns:

Self for method chaining with the parsed elements stored in the attributes.

Return type:

HTMLParser

Cellar HTML Parsers

class tulit.parser.html.cellar.cellar.CellarHTMLParser

Bases: HTMLParser

get_preface() None

Extracts the preface text from the HTML, if available.

Parameters:

None

Returns:

The extracted preface is stored in the ‘preface’ attribute.

Return type:

None

get_preamble() None

Extracts the preamble text from the HTML, if available.

Parameters:

None

Returns:

The extracted preamble is stored in the ‘preamble’ attribute.

Return type:

None

get_formula() None

Extracts the formula from the HTML, if present.

Parameters:

None

Returns:

The extracted formula is stored in the ‘formula’ attribute.

Return type:

None

get_citations() None

Extracts citations from the HTML.

Parameters:

None

Returns:

The extracted citations are stored in the ‘citations’ attribute

Return type:

None

get_recitals() None

Extracts recitals from the HTML.

Parameters:

None

Returns:

The extracted recitals are stored in the ‘recitals’ attribute.

Return type:

None

get_preamble_final() None

Extracts the final preamble text from the HTML, if available.

Parameters:

None

Returns:

The extracted final preamble is stored in the ‘preamble_final’ attribute.

Return type:

None

get_body() None

Extracts the body content from the HTML.

Parameters:

None

Returns:

The extracted body content is stored in the ‘body’ attribute

Return type:

None

get_chapters() None

Extracts chapters from the HTML, grouping them by their IDs and headings.

get_articles() None

Extracts articles from the HTML. Each <div> with an id starting with “art” is treated as an article (eId). Subsequent subdivisions are processed based on the closest parent with an id.

Returns:

List of articles, each containing its eId and associated content.

Return type:

list[dict]

get_conclusions() None

Extracts conclusions from the HTML, if present.

parse(file: str, **options) CellarHTMLParser

Parses an XHTML document. If the input is a directory, searches for XHTML files.

Parameters:
  • file (str) – Path to the XHTML file or directory containing XHTML files.

  • **options (dict) – Optional configuration options

Returns:

Self for method chaining with extracted content.

Return type:

CellarHTMLParser

class tulit.parser.html.cellar.cellar_standard.CellarStandardHTMLParser

Bases: HTMLParser

Parser for standard HTML format documents from EU Cellar. This format wraps content in <TXT_TE> tags with simple <p> structure, unlike the semantic XHTML format with class-based structure.

get_preface() None

Extract document title/preface. In standard HTML, this is typically in the metadata or first heading.

get_preamble() None

Extract preamble content. In standard HTML, the preamble typically includes the decision-making body, references, and recitals.

get_formula() None

Extract the formula (decision-making body statement). Usually starts with “THE COUNCIL”, “THE COMMISSION”, etc.

get_citations()

Extract citations (legal references). Usually contains phrases like “Having regard to”.

get_recitals()

Extract recitals (whereas clauses). Usually starts with “Whereas:” followed by numbered items.

get_preamble_final()

Extract final preamble statement (e.g., “HAS ADOPTED THIS DECISION:”).
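The phrase-based heuristics described for the formula, citations, recitals, and final preamble statement can be sketched on a plain list of paragraph strings. The function name, the exact marker phrases, and the ‘(n)’ numbering pattern are assumptions; the real parser works on BeautifulSoup elements and may use different rules.

```python
import re


def sketch_split_preamble(paragraphs):
    """Classify preamble paragraphs by their leading phrase."""
    parts = {"formula": None, "citations": [], "recitals": [], "preamble_final": None}
    in_recitals = False
    for text in paragraphs:
        if text.startswith("Having regard to"):
            parts["citations"].append(text)
        elif text.startswith("Whereas"):
            in_recitals = True  # numbered recitals follow this marker
        elif text.startswith(("HAS ADOPTED", "HAVE ADOPTED")):
            parts["preamble_final"] = text
            in_recitals = False
        elif in_recitals and re.match(r"\(\d+\)", text):
            parts["recitals"].append(text)
        elif parts["formula"] is None and text.startswith("THE "):
            parts["formula"] = text
    return parts


parts = sketch_split_preamble([
    "THE COUNCIL OF THE EUROPEAN UNION,",
    "Having regard to the Treaty on the Functioning of the European Union,",
    "Having regard to the proposal from the European Commission,",
    "Whereas:",
    "(1) The first reason.",
    "(2) The second reason.",
    "HAS ADOPTED THIS DECISION:",
])
```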

get_body()

The body is the TXT_TE container itself.

get_chapters()

Extract chapters. In standard HTML, these might be section headings. For most documents, this may not apply.

get_articles()

Extract articles from the document using CellarStandardArticleStrategy.

This method delegates article extraction to the strategy pattern, reducing code duplication and improving testability.

get_conclusions()

Extract conclusion text (e.g., “Done at Brussels, …”).

parse(file_path: str, **options) CellarStandardHTMLParser

Parse a standard HTML document and extract all components. If the input is a directory, searches for HTML files.

Parameters:
  • file_path (str) – Path to the HTML file or directory containing HTML files

  • **options (dict) – Optional configuration. Supported key: validate (bool), whether to validate against the LegalJSON schema (default: False).

Returns:

Self for method chaining with parsed document.

Return type:

CellarStandardHTMLParser

class tulit.parser.html.cellar.proposal.ProposalHTMLParser

Bases: HTMLParser

Parser for European Commission proposal documents (COM documents).

These documents have a different structure than regular EUR-Lex legislative acts. They typically contain:

  • Metadata (institution, date, reference numbers)

  • Proposal status and title

  • Explanatory Memorandum with sections and subsections

  • Sometimes the actual legal act text at the end

get_metadata() None

Extracts metadata from the Commission proposal HTML.

Metadata includes:

  • Institution name (e.g., “EUROPEAN COMMISSION”)

  • Emission date and location

  • Reference numbers (COM number, interinstitutional reference)

  • Proposal status

  • Document type

  • Title/subject

Returns:

The extracted metadata is stored in the ‘metadata’ attribute.

Return type:

None

get_explanatory_memorandum() None

Extracts the Explanatory Memorandum section from the proposal.

The Explanatory Memorandum typically contains:

  • Title (class="Exposdesmotifstitre")

  • Sections with headings (class="li ManualHeading1", "li ManualHeading2", etc.)

  • Numbered paragraphs (class="li ManualNumPar1")

  • Normal text (class="Normal")

Returns:

The extracted content is stored in the ‘explanatory_memorandum’ attribute.

Return type:

None

get_preface() None

For proposals, the preface is the combination of status, document type, and title. This extracts from the SECOND occurrence (the actual legal act), not the first (cover page).

get_preamble() None

Extracts the preamble of the legal act (not the explanatory memorandum). The preamble appears after the explanatory memorandum and contains:

  • Interinstitutional reference

  • Status

  • Document type

  • Title

  • Institution acting

  • Citations (Having regard to…)

  • Recitals (Whereas…)

Returns:

Sets self.preamble to the preamble element

Return type:

None

get_formula() None

Extracts the formula from the preamble (e.g., “THE COUNCIL OF THE EUROPEAN UNION,”).

Returns:

The extracted formula is stored in the ‘formula’ attribute.

Return type:

None

get_citations() None

Extracts citations from the preamble (paragraphs starting with “Having regard to”). Citations appear between the formula and “Whereas:”

Returns:

The extracted citations are stored in the ‘citations’ attribute.

Return type:

None

get_recitals() None

Extracts recitals from the preamble (paragraphs with class “li ManualConsidrant”). Recitals may span multiple content divs.

Returns:

The extracted recitals are stored in the ‘recitals’ attribute.

Return type:

None

get_preamble_final() None

Extracts the final formula of the preamble (e.g., “HAS ADOPTED THIS DECISION:”).

Returns:

The extracted final preamble is stored in the ‘preamble_final’ attribute.

Return type:

None

get_body() None

Extracts the body of the legal act (the enacting terms/articles).

Returns:

Sets self.body to the body element

Return type:

None

get_articles() None

Extracts articles from the body of the legal act.

Note: Due to the complex nested structure of Proposal documents (content divs, list concatenation, nested siblings), the full extraction logic remains in parser helper methods. The strategy pattern provides a consistent interface but delegates to parser-specific methods for the actual complex traversal logic.

Returns:

The extracted articles are stored in the ‘articles’ attribute.

Return type:

None

get_conclusions() None

Extracts conclusions from the legal act (signature section).

Returns:

The extracted conclusions are stored in the ‘conclusions’ attribute.

Return type:

None

parse(file: str) ProposalHTMLParser

Parses a Commission proposal HTML file and extracts all relevant information.

Parameters:

file (str) – Path to the HTML file to parse.

Returns:

The parser object with parsed elements stored in attributes.

Return type:

ProposalHTMLParser

Other HTML Parsers

class tulit.parser.html.veneto.VenetoHTMLParser

Bases: HTMLParser

get_root(file: str) None

Loads an HTML file and parses it with BeautifulSoup.

Parameters:

file (str) – The path to the HTML file.

Returns:

The root element is stored in the parser under the ‘root’ attribute.

Return type:

None

get_preface() None

Extracts the preface text from the HTML, if available.

Parameters:

None

Returns:

The extracted preface is stored in the ‘preface’ attribute.

Return type:

None

get_preamble()

Extracts the preamble text from the HTML, if available.

Parameters:

None

Returns:

The extracted preamble is stored in the ‘preamble’ attribute.

Return type:

None

get_formula()

Extracts the formula from the HTML, if present.

Parameters:

None

Returns:

The extracted formula is stored in the ‘formula’ attribute.

Return type:

None

get_citations()

Extracts citations from the HTML.

Parameters:

None

Returns:

The extracted citations are stored in the ‘citations’ attribute

Return type:

None

get_recitals()

Extracts recitals from the HTML.

Parameters:

None

Returns:

The extracted recitals are stored in the ‘recitals’ attribute.

Return type:

None

get_preamble_final()

Extracts the final preamble text from the HTML, if available.

Parameters:

None

Returns:

The extracted final preamble is stored in the ‘preamble_final’ attribute.

Return type:

None

get_body()

Extracts the body content from the HTML.

Parameters:

None

Returns:

The extracted body content is stored in the ‘body’ attribute

Return type:

None

get_chapters()

Extracts chapters from the HTML, grouping them by their IDs and headings.

get_articles()

Extracts articles from the HTML. Each <h6> is treated as an article heading, and the next <div> contains the article content. Subdivisions are separated by <br> tags and stored as children.
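The heading-then-content pattern described above can be sketched as follows. To stay dependency-free the sketch parses a small well-formed snippet with xml.etree.ElementTree, whereas the real parser uses BeautifulSoup; the function names, the ‘art_<n>’ identifiers, and the use of plain strings for children are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<body>
  <h6>Art. 1 - Finalità</h6>
  <div>1. Primo comma.<br/>2. Secondo <b>comma</b>.</div>
  <h6>Art. 2 - Ambito</h6>
  <div>1. Comma unico.</div>
</body>
"""


def split_on_br(div):
    """Split the text of `div` into parts separated by <br> elements."""
    parts = [div.text or ""]
    for child in div:
        if child.tag == "br":
            parts.append(child.tail or "")  # text after the <br> starts a new part
        else:
            parts[-1] += "".join(child.itertext()) + (child.tail or "")
    return [" ".join(part.split()) for part in parts if part.strip()]


def sketch_veneto_articles(root):
    """Pair each <h6> heading with the <div> that directly follows it."""
    articles = []
    children = list(root)
    for position, element in enumerate(children):
        if element.tag != "h6":
            continue
        following = children[position + 1] if position + 1 < len(children) else None
        has_content = following is not None and following.tag == "div"
        articles.append({
            "eId": f"art_{len(articles) + 1}",  # hypothetical identifier
            "heading": " ".join("".join(element.itertext()).split()),
            "children": split_on_br(following) if has_content else [],
        })
    return articles


articles = sketch_veneto_articles(ET.fromstring(SAMPLE))
```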

get_conclusions()

Extracts conclusions from the HTML, if present.

parse(file)

Parses an HTML file and extracts the preface, preamble, formula, citations, recitals, preamble final, body, chapters, articles, and conclusions.

Parameters:
  • file (str) – Path to the HTML file to parse.

  • **options (dict) – Optional configuration options

Returns:

Self for method chaining with the parsed elements stored in the attributes.

Return type:

HTMLParser

Article Extraction Strategies

Article Extraction Strategy Pattern

This module provides a hierarchy of strategies for extracting articles from different document formats (XML, HTML). It eliminates code duplication across parser classes by centralizing common article extraction logic.

Design Pattern: Strategy Pattern

Purpose: Encapsulate article extraction algorithms and make them interchangeable

class tulit.parser.strategies.article_extraction.ArticleExtractionStrategy

Bases: ABC

Abstract base class for article extraction strategies.

This defines the interface that all concrete extraction strategies must implement. Each strategy encapsulates a specific algorithm for extracting articles from a particular document format.

abstract extract_articles(document: Any, **kwargs) List[Dict[str, Any]]

Extract articles from the given document.

Parameters:
  • document (Any) – The document to extract articles from (XML Element, HTML BeautifulSoup, etc.)

  • **kwargs (dict) – Additional parameters specific to the extraction strategy

Returns:

List of article dictionaries with keys: ‘eId’, ‘num’, ‘heading’, ‘children’

Return type:

List[Dict[str, Any]]
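A minimal, self-contained sketch of how such a strategy interface and a delegating parser fit together. The class names and the toy line-based strategy are hypothetical and exist only to show the delegation; they do not mirror any concrete strategy in this package.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchStrategy(ABC):
    """Interface: turn a document into a list of article dictionaries."""

    @abstractmethod
    def extract_articles(self, document: Any, **kwargs) -> List[Dict[str, Any]]:
        ...


class LineStrategy(SketchStrategy):
    """Toy strategy: every line starting with 'Article' opens a new article."""

    def extract_articles(self, document: str, **kwargs) -> List[Dict[str, Any]]:
        articles: List[Dict[str, Any]] = []
        for line in document.splitlines():
            line = line.strip()
            if line.startswith("Article"):
                articles.append({
                    "eId": f"art_{len(articles) + 1}",
                    "num": line,
                    "heading": None,
                    "children": [],
                })
            elif line and articles:
                articles[-1]["children"].append(line)
        return articles


class SketchParser:
    """Parser that delegates article extraction to an interchangeable strategy."""

    def __init__(self, strategy: SketchStrategy):
        self.strategy = strategy
        self.articles: List[Dict[str, Any]] = []

    def get_articles(self, document: Any) -> None:
        self.articles = self.strategy.extract_articles(document)


parser = SketchParser(LineStrategy())
parser.get_articles("Article 1\nFirst rule.\nArticle 2\nSecond rule.\nThird rule.")
```

Swapping LineStrategy for another SketchStrategy subclass changes the extraction algorithm without touching SketchParser.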

class tulit.parser.strategies.article_extraction.XMLArticleExtractionStrategy(namespaces: Dict[str, str] | None = None)

Bases: ArticleExtractionStrategy

Base strategy for extracting articles from XML documents.

Provides common XML operations like namespace handling, XPath queries, and text extraction.

__init__(namespaces: Dict[str, str] | None = None)

Initialize XML extraction strategy.

Parameters:

namespaces (dict, optional) – XML namespace mappings

class tulit.parser.strategies.article_extraction.HTMLArticleExtractionStrategy(article_pattern: str | None = None)

Bases: ArticleExtractionStrategy

Base strategy for extracting articles from HTML documents.

Provides common HTML operations like element finding, class matching, and text extraction using BeautifulSoup.

__init__(article_pattern: str | None = None)

Initialize HTML extraction strategy.

Parameters:

article_pattern (str, optional) – Regex pattern to identify article markers

class tulit.parser.strategies.article_extraction.FormexArticleStrategy(namespaces: Dict[str, str] | None = None)

Bases: XMLArticleExtractionStrategy

Strategy for extracting articles from Formex XML documents.

Formex uses ARTICLE elements with IDENTIFIER attributes, and content is stored in PARAG, ALINEA, or LIST/ITEM elements.

extract_articles(document: _Element, **kwargs) → List[Dict[str, Any]]

Extract articles from Formex XML document.

Parameters:
  • document (lxml.etree._Element) – The body element containing articles

  • **kwargs (dict) – Optional: ‘remove_notes’ (bool) - whether to remove NOTE elements

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
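A rough sketch of the extraction described above, written against a hand-made Formex-like fragment. It is not the library's implementation: it uses `xml.etree.ElementTree` instead of lxml, the helper names `strip_notes` and `extract_formex_articles` are invented for this example, and the shape of the `children` entries is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

formex_body = """
<ENACTING.TERMS>
  <ARTICLE IDENTIFIER="001">
    <TI.ART>Article 1</TI.ART>
    <PARAG IDENTIFIER="001.001"><ALINEA>First paragraph<NOTE>footnote</NOTE>.</ALINEA></PARAG>
    <PARAG IDENTIFIER="001.002"><ALINEA>Second paragraph.</ALINEA></PARAG>
  </ARTICLE>
</ENACTING.TERMS>
"""


def strip_notes(root: ET.Element) -> None:
    """Remove NOTE elements while keeping the text that follows them."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == "NOTE":
                tail = child.tail or ""
                if previous is not None:
                    previous.tail = (previous.tail or "") + tail
                else:
                    parent.text = (parent.text or "") + tail
                parent.remove(child)
            else:
                previous = child


def extract_formex_articles(body: ET.Element, remove_notes: bool = True) -> List[Dict[str, Any]]:
    if remove_notes:
        strip_notes(body)
    articles = []
    for article in body.iter("ARTICLE"):
        title = next((c for c in article if c.tag == "TI.ART"), None)
        children = [
            {
                "eId": parag.get("IDENTIFIER"),
                # Collapse whitespace in the paragraph's full text content.
                "text": " ".join("".join(parag.itertext()).split()),
            }
            for parag in article.findall("PARAG")
        ]
        articles.append({
            "eId": article.get("IDENTIFIER"),
            "num": "".join(title.itertext()).strip() if title is not None else None,
            "heading": None,  # this fragment has no subtitle element
            "children": children,
        })
    return articles


articles = extract_formex_articles(ET.fromstring(formex_body))
print(articles)
```

Note that `strip_notes` reattaches each note's trailing text to the surrounding content; simply removing the element with ElementTree would also drop the text that follows it.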

class tulit.parser.strategies.article_extraction.BOEArticleStrategy(namespaces: Dict[str, str] | None = None)

Bases: XMLArticleExtractionStrategy

Strategy for extracting articles from Spanish BOE XML documents.

BOE uses <p class="articulo"> for article titles and <p class="parrafo"> for content paragraphs.

extract_articles(document: _Element, **kwargs) → List[Dict[str, Any]]

Extract articles from BOE XML document.

Parameters:
  • document (lxml.etree._Element) – The root element

  • **kwargs (dict) – Not used for BOE

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
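The title/content convention described above suggests a simple sequential grouping: each title paragraph opens a new article, and the content paragraphs that follow attach to it. The sketch below shows that idea on a hand-written fragment using `xml.etree.ElementTree`; the function name `group_boe_articles`, the `art_N` identifiers, and the structure of `children` are assumptions, not the library's actual output.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

boe_fragment = """
<texto>
  <p class="articulo">Artículo 1. Objeto.</p>
  <p class="parrafo">Esta ley tiene por objeto regular la materia.</p>
  <p class="parrafo">Se aplicará en todo el territorio.</p>
  <p class="articulo">Artículo 2. Definiciones.</p>
  <p class="parrafo">A efectos de esta ley se aplican las definiciones siguientes.</p>
</texto>
"""


def group_boe_articles(root: ET.Element) -> List[Dict[str, Any]]:
    articles: List[Dict[str, Any]] = []
    current = None
    for p in root.iter("p"):
        css_class = p.get("class")
        text = " ".join("".join(p.itertext()).split())
        if css_class == "articulo":
            # A title paragraph starts a new article.
            current = {
                "eId": f"art_{len(articles) + 1}",  # assumed identifier scheme
                "num": text,
                "heading": None,
                "children": [],
            }
            articles.append(current)
        elif css_class == "parrafo" and current is not None:
            # Content paragraphs belong to the most recent article.
            current["children"].append({"text": text})
    return articles


articles = group_boe_articles(ET.fromstring(boe_fragment))
print(articles)
```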

class tulit.parser.strategies.article_extraction.CellarStandardArticleStrategy

Bases: HTMLArticleExtractionStrategy

Strategy for extracting articles from Cellar HTML documents (standard format).

Cellar documents use specific paragraph patterns to mark article starts and structure content.

extract_articles(document: Any, **kwargs) → List[Dict[str, Any]]

Extract articles from Cellar HTML document.

Parameters:
  • document (BeautifulSoup element) – The txt_te container element

  • **kwargs (dict) – Optional: ‘stop_markers’ (list) - text patterns that signal end of articles

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
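One way to picture the `stop_markers` option is a scan that halts at the first paragraph matching any marker. The sketch below works on plain strings standing in for paragraph texts and treats each marker as a regular expression; whether the library matches markers as regexes or as plain substrings is not stated here, so that choice is an assumption, as is the helper name `collect_until_stop`.

```python
import re
from typing import List


def collect_until_stop(paragraphs: List[str], stop_markers: List[str]) -> List[str]:
    """Return paragraphs up to, but not including, the first stop marker."""
    collected = []
    for text in paragraphs:
        if any(re.search(marker, text) for marker in stop_markers):
            break
        collected.append(text)
    return collected


paragraphs = [
    "Article 1",
    "This Regulation applies to all operators.",
    "Article 2",
    "This Regulation shall enter into force on the day of its publication.",
    "Done at Brussels, 1 January 2020.",
    "ANNEX I",
]

kept = collect_until_stop(paragraphs, stop_markers=[r"^Done at ", r"^ANNEX"])
print(kept)
```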

class tulit.parser.strategies.article_extraction.ProposalArticleStrategy

Bases: HTMLArticleExtractionStrategy

Strategy for extracting articles from EU Proposal HTML documents.

Proposals use <p class="Titrearticle"> for article headers and various paragraph classes for content.

extract_articles(document: Any, **kwargs) → List[Dict[str, Any]]

Extract articles from Proposal HTML document.

Parameters:
  • document (BeautifulSoup root) – The document root element

  • **kwargs (dict) – Not used for Proposal

Returns:

List of article dictionaries

Return type:

List[Dict[str, Any]]
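For illustration only, the class-based header detection described above can be sketched with the standard library's `html.parser` in place of BeautifulSoup. The `TitreArticleCollector` class and the sample markup are hypothetical, and the sketch collects header texts only; it does not build full article dictionaries.

```python
from html.parser import HTMLParser
from typing import List


class TitreArticleCollector(HTMLParser):
    """Collect the text of every <p> whose class list contains 'Titrearticle'."""

    def __init__(self) -> None:
        super().__init__()
        self.headers: List[str] = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "Titrearticle" in classes:
            self._in_header = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self.headers[-1] += data


collector = TitreArticleCollector()
collector.feed(
    '<p class="Titrearticle">Article 1</p>'
    '<p class="Normal">This Directive lays down rules.</p>'
    '<p class="Titrearticle">Article 2</p>'
)
headers = [" ".join(h.split()) for h in collector.headers]
print(headers)
```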