zugferd-service/AGENTS.md
2026-02-17 11:33:32 +01:00

AGENTS.md - Agent Development Guide

This document provides context and guidelines for coding agents working on the zugferd-service repository.

Project Overview

ZUGFeRD-Service is a REST API for extracting and validating ZUGFeRD/Factur-X invoice data from PDF files. It is built with FastAPI and requires Python 3.11+.

Tech Stack:

  • FastAPI >= 0.109.0 (web framework)
  • Uvicorn >= 0.27.0 (ASGI server)
  • Pydantic >= 2.5.0 (data validation)
  • factur-x >= 2.5 (ZUGFeRD/Factur-X library)
  • pypdf >= 4.0.0 (PDF text extraction)
  • lxml >= 5.0.0 (XML processing)

Commands

Development

# Install dependencies
pip install -e .

# Run the service (default: 0.0.0.0:5000)
python -m src.main
# Or start it through the console entry point
zugferd-service

# With environment variables
HOST=127.0.0.1 PORT=8000 LOG_LEVEL=DEBUG python -m src.main

Testing

# Run all tests
pytest

# Run specific test file
pytest tests/test_extractor.py

# Run specific test function
pytest tests/test_api.py::test_health_check

# Run with coverage
pytest --cov=src

# Run with verbose output
pytest -v

Building

# Docker build
docker build -t zugferd-service .

# Nix build
nix build .#zugferd-service

# Nix development shell
nix develop

Code Style Guidelines

Type Hints (Python 3.11+)

Use modern union syntax (|) instead of Optional or Union:

# Good
field: str | None
numbers: list[int] | None

# Avoid
from typing import Optional, Union
field: Optional[str]
numbers: Union[list[int], None]

All public functions must have type hints:

def extract_zugferd(pdf_bytes: bytes) -> ExtractResponse:
    """Extract ZUGFeRD data from PDF bytes.

    Args:
        pdf_bytes: Raw PDF file content

    Returns:
        ExtractResponse with extraction results
    """

Imports

  • Group imports: standard library, third-party, local modules
  • Use from typing import Any only when needed
  • Avoid star imports (from module import *)

# Standard library
import io
import time
from typing import Any

# Third-party
from fastapi import FastAPI
from lxml import etree
from pydantic import BaseModel

# Local modules
from src.models import ExtractResponse
from src.utils import amounts_match

Naming Conventions

  • Classes: PascalCase (e.g., ExtractionMeta, ValidateRequest)
  • Functions/variables: snake_case (e.g., extract_text_from_pdf, pdf_bytes)
  • Constants: SCREAMING_SNAKE_CASE (e.g., NAMESPACES, UNECE_UNIT_CODES)
  • Private: _leading_underscore (e.g., _parse_internal)

Pydantic Models

All models defined in src/models.py using Pydantic v2:

from pydantic import BaseModel, Field

class Supplier(BaseModel):
    """Supplier/seller information."""

    name: str = Field(description="Supplier name")
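    vat_id: str | None = Field(default=None, description="VAT ID")

  • Use Field() for all fields with descriptions
  • Give optional fields an explicit default=None in addition to the | None type hint (in Pydantic v2, a str | None annotation alone still leaves the field required)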
  • Use default_factory=list for mutable defaults

Error Handling

Custom exceptions for domain-specific errors:

class ExtractionError(Exception):
    """Error during PDF extraction."""

    def __init__(self, error_code: str, message: str, details: str = ""):
        self.error_code = error_code
        self.message = message
        self.details = details
        super().__init__(message)

FastAPI exception handlers defined in src/main.py:

  • ExtractionError → 400 with error code/message
  • HTTPException → preserves status_code
  • Generic Exception → 500 internal error

Raise HTTPException for validation errors:

from fastapi import HTTPException

raise HTTPException(
    status_code=400,
    detail={"error": "invalid_base64", "message": "Invalid base64 encoding"},
)

Docstrings

Use Google-style docstrings:

def parse_supplier(xml_root: etree._Element) -> Supplier:
    """Parse supplier information from XML.

    Args:
        xml_root: XML root element

    Returns:
        Supplier model with parsed data
    """

XML Parsing

Use lxml.etree with namespace-aware XPath:

from lxml import etree

NAMESPACES = {
    "rsm": "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100",
    "ram": "urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100",
}

# Use namespaces in all XPath queries
name = xml_root.xpath(
    "//ram:ApplicableHeaderTradeAgreement/ram:SellerTradeParty/ram:Name/text()",
    namespaces=NAMESPACES,
)

Logging

Structured JSON logging via custom JSONFormatter in src/main.py:

import logging

logger = logging.getLogger(__name__)
logger.info("Extraction completed", extra={"data": {"profile": "EN16931"}})
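
For orientation, a formatter of this kind might look roughly like the sketch below. Only the class name JSONFormatter and the extra={"data": ...} convention come from this guide; the output keys and the logger name are assumptions, and the real implementation in src/main.py may differ.

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra={"data": ...} become attributes on the record.
        data = getattr(record, "data", None)
        if data is not None:
            payload["data"] = data
        # default=str keeps non-JSON types (e.g. Decimal) from raising.
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())

logger = logging.getLogger("zugferd.example")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("Extraction completed", extra={"data": {"profile": "EN16931"}})
```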

Testing

  • Use pytest with pytest-asyncio for async tests
  • Use TestClient from fastapi.testclient for API tests
  • Define fixtures in tests/conftest.py
  • Test PDFs in tests/fixtures/

import pytest
from fastapi.testclient import TestClient
from src.main import app

@pytest.fixture
def client():
    return TestClient(app)

def test_health_check(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

File Structure

src/
├── __init__.py
├── main.py          # FastAPI app, endpoints, exception handlers
├── models.py        # Pydantic models (all requests/responses)
├── extractor.py     # ZUGFeRD XML extraction logic
├── validator.py     # Invoice validation logic
├── pdf_parser.py    # PDF text extraction
└── utils.py         # Utility functions (constants, helpers)

tests/
├── conftest.py      # Pytest fixtures
├── test_api.py      # API endpoint tests
├── test_extractor.py # Extraction logic tests
├── test_validator.py # Validation logic tests
└── fixtures/        # Test PDF files

Validation Checks

Four validation checks supported:

  1. pflichtfelder - Required fields present and non-empty
  2. betraege - Amount calculations correct (tolerance: 0.01)
  3. ustid - VAT ID format (DE, AT, CH)
  4. pdf_abgleich - XML vs PDF text comparison
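
To give a feel for checks 2 and 3, here is a minimal sketch. The amounts_match signature, the inclusive tolerance comparison, and the VAT ID patterns are assumptions for this example rather than the rules implemented in src/validator.py or src/utils.py; the patterns only check the general shape of German, Austrian, and Swiss IDs and do not verify check digits.

```python
import re
from decimal import Decimal

AMOUNT_TOLERANCE = Decimal("0.01")

# Assumed shapes: DE + 9 digits, ATU + 8 digits,
# CHE + 9 digits with an optional MWST/TVA/IVA suffix.
VAT_ID_PATTERNS = {
    "DE": re.compile(r"DE\d{9}"),
    "AT": re.compile(r"ATU\d{8}"),
    "CH": re.compile(r"CHE\d{9}(MWST|TVA|IVA)?"),
}


def amounts_match(
    expected: Decimal, actual: Decimal, tolerance: Decimal = AMOUNT_TOLERANCE
) -> bool:
    """Return True if two amounts differ by at most the tolerance (assumed inclusive)."""
    return abs(expected - actual) <= tolerance


def is_valid_vat_id(vat_id: str) -> bool:
    """Check the format of a DE/AT/CH VAT ID (hypothetical helper)."""
    # Drop spaces, dots, and hyphens so "CHE-123.456.789 MWST" is accepted.
    normalized = re.sub(r"[\s.\-]", "", vat_id).upper()
    pattern = VAT_ID_PATTERNS.get(normalized[:2])
    return pattern is not None and pattern.fullmatch(normalized) is not None


net, tax, gross = Decimal("100.00"), Decimal("19.00"), Decimal("119.01")
totals_ok = amounts_match(net + tax, gross)       # off by exactly 0.01
vat_ok = is_valid_vat_id("CHE-123.456.789 MWST")
```

Using Decimal rather than float keeps the 0.01 comparison exact; with floats, a difference of one cent can land just above or below the tolerance.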

Environment Variables

  • HOST (default: 0.0.0.0)
  • PORT (default: 5000)
  • LOG_LEVEL (default: INFO)

Additional Notes

  • Python 3.11+ required
  • No type suppression (# type: ignore) allowed
  • File size limit: 10 MB for PDF uploads
  • Returns warnings (not errors) for non-critical issues
  • Uses Decimal rounding with ROUND_HALF_UP for monetary values