# AGENTS.md - Agent Development Guide

This document provides context and guidelines for coding agents working on the zugferd-service repository.

## Project Overview

ZUGFeRD-Service is a REST API for extracting and validating ZUGFeRD/Factur-X invoice data from PDF files. It is built with FastAPI and Python 3.11+.

**Tech Stack:**

- FastAPI >= 0.109.0 (web framework)
- Uvicorn >= 0.27.0 (ASGI server)
- Pydantic >= 2.5.0 (data validation)
- factur-x >= 2.5 (ZUGFeRD/Factur-X library)
- pypdf >= 4.0.0 (PDF text extraction)
- lxml >= 5.0.0 (XML processing)

## Commands

### Development

```bash
# Install dependencies
pip install -e .

# Run the service (default: 0.0.0.0:5000)
python -m src.main
zugferd-service  # entry point

# With environment variables
HOST=127.0.0.1 PORT=8000 LOG_LEVEL=DEBUG python -m src.main
```

### Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_extractor.py

# Run specific test function
pytest tests/test_api.py::test_health_check

# Run with coverage
pytest --cov=src

# Run with verbose output
pytest -v
```

### Building

```bash
# Docker build
docker build -t zugferd-service .

# Nix build
nix build .#zugferd-service

# Nix development shell
nix develop
```

## Code Style Guidelines

### Type Hints (Python 3.11+)

Use modern union syntax (`|`) instead of `Optional` or `Union`:

```python
# Good
field: str | None
numbers: list[int] | None

# Avoid
from typing import Optional, Union

field: Optional[str]
numbers: Union[list[int], None]
```

All public functions must have type hints:

```python
def extract_zugferd(pdf_bytes: bytes) -> ExtractResponse:
    """Extract ZUGFeRD data from PDF bytes.

    Args:
        pdf_bytes: Raw PDF file content

    Returns:
        ExtractResponse with extraction results
    """
```

### Imports

- Group imports: standard library, third-party, local modules
- Use `from typing import Any` only when needed
- Avoid star imports (`from module import *`)

```python
# Standard library
import io
import time
from typing import Any

# Third-party
from fastapi import FastAPI
from lxml import etree
from pydantic import BaseModel

# Local modules
from src.models import ExtractResponse
from src.utils import amounts_match
```

### Naming Conventions

- **Classes**: `PascalCase` (e.g., `ExtractionMeta`, `ValidateRequest`)
- **Functions/variables**: `snake_case` (e.g., `extract_text_from_pdf`, `pdf_bytes`)
- **Constants**: `SCREAMING_SNAKE_CASE` (e.g., `NAMESPACES`, `UNECE_UNIT_CODES`)
- **Private**: `_leading_underscore` (e.g., `_parse_internal`)

### Pydantic Models

All models are defined in `src/models.py` using Pydantic v2:

```python
from pydantic import BaseModel, Field


class Supplier(BaseModel):
    """Supplier/seller information."""

    name: str = Field(description="Supplier name")
    vat_id: str | None = Field(default=None, description="VAT ID")
```

- Use `Field()` with a description for every field
- Give optional fields an explicit `default=None`; in Pydantic v2 a `| None` annotation alone still leaves the field required
- Use `default_factory=list` for mutable defaults

### Error Handling

**Custom exceptions** for domain-specific errors:

```python
class ExtractionError(Exception):
    """Error during PDF extraction."""

    def __init__(self, error_code: str, message: str, details: str = ""):
        self.error_code = error_code
        self.message = message
        self.details = details
        super().__init__(message)
```

**FastAPI exception handlers** defined in `src/main.py`:

- `ExtractionError` → 400 with error code/message
- `HTTPException` → preserves status_code
- Generic `Exception` → 500 internal error

**Raise HTTPException for validation errors:**

```python
from fastapi import HTTPException

raise HTTPException(
    status_code=400,
    detail={
        "error": "invalid_base64",
        "message": "Invalid base64 encoding",
    },
)
```

### Docstrings

Use Google-style docstrings:

```python
def parse_supplier(xml_root: etree._Element) -> Supplier:
    """Parse supplier information from XML.

    Args:
        xml_root: XML root element

    Returns:
        Supplier model with parsed data
    """
```

### XML Parsing

Use `lxml.etree` with namespace-aware XPath:

```python
from lxml import etree

NAMESPACES = {
    "rsm": "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100",
    "ram": "urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100",
}

# Use namespaces in all XPath queries
name = xml_root.xpath(
    "//ram:ApplicableHeaderTradeAgreement/ram:SellerTradeParty/ram:Name/text()",
    namespaces=NAMESPACES,
)
```

### Logging

Structured JSON logging via custom `JSONFormatter` in `src/main.py`:

```python
import logging

logger = logging.getLogger(__name__)
logger.info("Extraction completed", extra={"data": {"profile": "EN16931"}})
```

### Testing

- Use `pytest` with `pytest-asyncio` for async tests
- Use `TestClient` from `fastapi.testclient` for API tests
- Define fixtures in `tests/conftest.py`
- Test PDFs in `tests/fixtures/`

```python
import pytest
from fastapi.testclient import TestClient

from src.main import app


@pytest.fixture
def client():
    return TestClient(app)


def test_health_check(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
```

### File Structure

```
src/
├── __init__.py
├── main.py          # FastAPI app, endpoints, exception handlers
├── models.py        # Pydantic models (all requests/responses)
├── extractor.py     # ZUGFeRD XML extraction logic
├── validator.py     # Invoice validation logic
├── pdf_parser.py    # PDF text extraction
└── utils.py         # Utility functions (constants, helpers)

tests/
├── conftest.py        # Pytest fixtures
├── test_api.py        # API endpoint tests
├── test_extractor.py  # Extraction logic tests
├── test_validator.py  # Validation logic tests
└── fixtures/          # Test PDF files
```

## Validation Checks

Four validation checks are supported:

1. **pflichtfelder** - Required fields present and non-empty
2. **betraege** - Amount calculations correct (tolerance: 0.01)
3. **ustid** - VAT ID format (DE, AT, CH)
4. **pdf_abgleich** - XML vs PDF text comparison

## Environment Variables

- `HOST` (default: `0.0.0.0`)
- `PORT` (default: `5000`)
- `LOG_LEVEL` (default: `INFO`)

## Additional Notes

- Python 3.11+ required
- No type suppression (`# type: ignore`) allowed
- File size limit: 10 MB for PDF uploads
- Returns warnings (not errors) for non-critical issues
- Uses Decimal rounding with ROUND_HALF_UP for monetary values
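
The Decimal rounding rule and the 0.01 tolerance of the **betraege** check can be sketched as follows. This is a minimal illustration, not the repository's code: `round_money` is a hypothetical helper, and the signature shown for `amounts_match` is an assumption, since this guide only shows that the name is imported from `src.utils`.

```python
from decimal import Decimal, ROUND_HALF_UP

# Two decimal places; also used as the assumed default comparison tolerance.
CENT = Decimal("0.01")


def round_money(value: Decimal) -> Decimal:
    """Round to two decimal places; ties go away from zero (ROUND_HALF_UP)."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


def amounts_match(expected: Decimal, actual: Decimal, tolerance: Decimal = CENT) -> bool:
    """Return True if the rounded amounts differ by at most `tolerance`.

    Assumed signature; the real helper in src/utils.py may differ.
    """
    return abs(round_money(expected) - round_money(actual)) <= tolerance


print(round_money(Decimal("2.675")))                         # 2.68
print(amounts_match(Decimal("119.00"), Decimal("119.01")))   # True
print(amounts_match(Decimal("119.00"), Decimal("119.02")))   # False
```

Constructing `Decimal` values from strings rather than floats keeps the tie-breaking exact, because a float such as `2.675` is already stored slightly below the half-way point.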
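
For the **ustid** check listed under Validation Checks, a format-only test might look like the sketch below. The patterns are assumptions based on the commonly documented shapes of German, Austrian, and Swiss VAT IDs and are not taken from `src/validator.py`; `VAT_ID_PATTERNS` and `is_valid_vat_id_format` are hypothetical names. No checksum or registry lookup is performed.

```python
import re

# Assumed format patterns per country prefix (hypothetical, format only).
VAT_ID_PATTERNS: dict[str, re.Pattern[str]] = {
    "DE": re.compile(r"DE[0-9]{9}"),
    "AT": re.compile(r"ATU[0-9]{8}"),
    "CH": re.compile(r"CHE[0-9]{9}(MWST|TVA|IVA)?"),
}


def is_valid_vat_id_format(vat_id: str) -> bool:
    """Return True if `vat_id` matches the assumed pattern for its country prefix."""
    # Drop whitespace, dots, and hyphens so "CHE-123.456.789 MWST" normalizes cleanly.
    normalized = re.sub(r"[\s.\-]", "", vat_id).upper()
    pattern = VAT_ID_PATTERNS.get(normalized[:2])
    return pattern is not None and pattern.fullmatch(normalized) is not None


print(is_valid_vat_id_format("DE123456789"))           # True
print(is_valid_vat_id_format("CHE-123.456.789 MWST"))  # True
print(is_valid_vat_id_format("DE12345678"))            # False
```

In line with the note that non-critical issues are returned as warnings, a failed format check would presumably add a warning to the response rather than raise an error.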