# ZUGFeRD-Service: Python REST API for Invoice Extraction and Validation

## TL;DR

> **Quick Summary**: Build a stateless FastAPI service that extracts structured data from ZUGFeRD/Factur-X invoices embedded in PDFs, validates invoice correctness, and is packaged for both Docker and NixOS deployment.
>
> **Deliverables**:
> - Complete FastAPI application with 3 endpoints (`/health`, `/extract`, `/validate`)
> - Pydantic models for all request/response schemas
> - ZUGFeRD extraction pipeline using the factur-x library
> - 4 validation checks (pflichtfelder, betraege, ustid, pdf_abgleich)
> - Docker multi-stage build (production-ready)
> - Nix flake packaging (buildPythonApplication)
> - Comprehensive test suite (TDD, pytest)
> - README documentation with installation guide
>
> **Estimated Effort**: Medium-Large (18 tasks)
> **Parallel Execution**: YES - 6 waves
> **Critical Path**: Project Setup → Models → Extractor → Validator → API → Packaging

---
## Context

### Original Request

Build a Python-based REST API service for ZUGFeRD invoice extraction and validation. The service runs as a Docker container on NixOS and should also be packaged as a Nix flake. Key requirements include handling all ZUGFeRD profiles (MINIMUM through EXTENDED), validating invoice amounts and required fields, and comparing XML data against PDF text.

### Interview Summary

**Key Discussions**:
- **Framework choice**: FastAPI (user preference, excellent for REST APIs)
- **Project format**: pyproject.toml with hatchling (modern, Nix-friendly)
- **Nix packaging**: Flake-based approach with buildPythonApplication
- **Testing strategy**: TDD with pytest, source official sample PDFs
- **Authentication**: Open endpoints (no auth required)
- **File handling**: 10MB limit, reject password-protected PDFs
- **Calculations**: Standard rounding, 0.01 EUR tolerance

**Research Findings**:
- factur-x library provides `get_xml_from_pdf()`, `get_level()`, `get_flavor()` for extraction
- Nix packaging follows mem0 pattern: pyproject=true, pythonRelaxDeps=true
- UN/ECE unit codes require mapping dictionary (C62→Piece, KGM→Kilogram, etc.)
- ZUGFeRD profiles detected via GuidelineSpecifiedDocumentContextParameter in XML

### Metis Review

**Identified Gaps** (addressed):
- Password-protected PDFs → Reject with error
- File size limits → 10MB hard limit
- Decimal precision → Standard rounding
- Authentication → Confirmed open/stateless

---
## Work Objectives

### Core Objective

Create a production-ready, stateless REST API that extracts ZUGFeRD/Factur-X invoice data from PDFs, validates invoice correctness against business rules, and provides clear error reporting for invoice processing workflows.

### Concrete Deliverables

- `src/main.py` - FastAPI application with all endpoints
- `src/models.py` - Pydantic models for all data structures
- `src/extractor.py` - ZUGFeRD extraction logic
- `src/validator.py` - Validation checks implementation
- `src/pdf_parser.py` - PDF text extraction for cross-validation
- `src/utils.py` - Helper functions (unit codes, decimal handling)
- `pyproject.toml` - Project configuration with hatchling
- `Dockerfile` - Multi-stage production build
- `docker-compose.yml` - Local development setup
- `flake.nix` - Nix flake packaging
- `tests/` - Complete test suite with fixtures
- `README.md` - Installation and usage documentation

### Definition of Done

- [ ] `nix build .#zugferd-service` completes without errors
- [ ] `docker build -t zugferd-service .` produces an image <500MB
- [ ] `pytest` runs all tests with 100% pass rate
- [ ] `curl http://localhost:5000/health` returns `{"status": "healthy", "version": "1.0.0"}`
- [ ] All ZUGFeRD profiles correctly detected from sample PDFs
- [ ] All validation checks produce expected errors/warnings

### Must Have

- All 3 API endpoints as specified
- Support for all ZUGFeRD profiles (MINIMUM, BASIC, BASIC WL, EN16931, EXTENDED)
- All 4 validation checks (pflichtfelder, betraege, ustid, pdf_abgleich)
- Structured JSON error responses with error codes
- UTF-8 encoding throughout
- Structured JSON logging

### Must NOT Have (Guardrails)

- ❌ Authentication middleware (open endpoints)
- ❌ Database or persistence layer (stateless only)
- ❌ Caching layers
- ❌ Rate limiting
- ❌ Metrics endpoints beyond /health
- ❌ CLI interface
- ❌ Web UI or admin dashboard
- ❌ Batch processing or queue system
- ❌ Online USt-ID validation (format check only)
- ❌ Support for ZUGFeRD 1.x (only 2.x)
- ❌ Abstraction layers for "future extensibility"

---
## Verification Strategy

> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
>
> ALL tasks in this plan MUST be verifiable WITHOUT any human action.
> Every verification step is executed by the agent using tools (curl, pytest, nix, docker).

### Test Decision

- **Infrastructure exists**: NO (new project)
- **Automated tests**: TDD (test-first)
- **Framework**: pytest with pytest-asyncio

### TDD Workflow

Each implementation task follows RED-GREEN-REFACTOR:

1. **RED**: Write failing test first
2. **GREEN**: Implement minimum code to pass
3. **REFACTOR**: Clean up while keeping tests green

### Test Infrastructure Setup (Task 0)

```bash
# Install pytest and test dependencies
pip install pytest pytest-asyncio httpx

# Verify pytest works
pytest --version

# Run initial test
pytest tests/ -v
```

### Agent-Executed QA Scenarios (MANDATORY)

All verifications use:

- **API Testing**: `curl` commands with JSON assertions
- **Docker Testing**: `docker build` and `docker run` commands
- **Nix Testing**: `nix build` and `nix run` commands
- **Unit Testing**: `pytest` with specific test file targets

---
## Execution Strategy

### Parallel Execution Waves

```
Wave 1 (Start Immediately):
├── Task 1: Project scaffold (pyproject.toml, directories)
├── Task 2: Download ZUGFeRD sample PDFs
└── Task 3: Create Pydantic models

Wave 2 (After Wave 1):
├── Task 4: Extractor unit tests + implementation
├── Task 5: PDF parser unit tests + implementation
└── Task 6: Utils (unit codes, decimal handling)

Wave 3 (After Wave 2):
├── Task 7: Validator unit tests + implementation
├── Task 8: FastAPI app structure
└── Task 9: Health endpoint

Wave 4 (After Wave 3):
├── Task 10: Extract endpoint
├── Task 11: Validate endpoint
└── Task 12: Error handling middleware

Wave 5 (After Wave 4):
├── Task 13: Integration tests
├── Task 14: Dockerfile
└── Task 15: docker-compose.yml

Wave 6 (After Wave 5):
├── Task 16: Nix flake packaging
├── Task 17: NixOS module example
└── Task 18: README documentation

Critical Path: Task 1 → Task 4 → Task 7 → Task 10 → Task 13 → Task 16
```
### Dependency Matrix

| Task | Depends On | Blocks | Can Parallelize With |
|------|------------|--------|---------------------|
| 1 | None | 2-18 | None (must be first) |
| 2 | 1 | 4, 5 | 3 |
| 3 | 1 | 4, 5, 7 | 2 |
| 4 | 2, 3 | 7, 10 | 5, 6 |
| 5 | 2 | 7 | 4, 6 |
| 6 | 1 | 7 | 4, 5 |
| 7 | 4, 5, 6 | 11 | 8, 9 |
| 8 | 3 | 9, 10, 11 | 7 |
| 9 | 8 | 10, 11 | 7 |
| 10 | 4, 8, 9 | 13 | 11, 12 |
| 11 | 7, 8, 9 | 13 | 10, 12 |
| 12 | 8 | 13 | 10, 11 |
| 13 | 10, 11, 12 | 14, 16 | None |
| 14 | 13 | 16 | 15 |
| 15 | 13 | None | 14 |
| 16 | 14 | 17, 18 | None |
| 17 | 16 | 18 | None |
| 18 | 16, 17 | None | None |

---
## TODOs

### Wave 1: Project Foundation

- [x] 1. Project Scaffold and Configuration

**What to do**:
- Create project directory structure as specified
- Create `pyproject.toml` with hatchling build system
- Configure pytest in pyproject.toml
- Create `src/__init__.py` and `tests/__init__.py`
- Set up Python version requirement (3.11+)

**Must NOT do**:
- Do NOT add optional dependencies
- Do NOT create README yet (separate task)
- Do NOT add pre-commit hooks

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Straightforward file creation with established patterns
- **Skills**: [`git-master`]
  - `git-master`: For proper initial commit after scaffold

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential (must be first)
- **Blocks**: All subsequent tasks
- **Blocked By**: None

**References**:
- **Pattern References**: mem0 pyproject.toml pattern (from librarian research)
- **External References**: https://hatch.pypa.io/latest/config/metadata/ (hatchling docs)

**Files to Create**:
```
pyproject.toml
src/__init__.py
src/main.py (placeholder)
src/models.py (placeholder)
src/extractor.py (placeholder)
src/validator.py (placeholder)
src/pdf_parser.py (placeholder)
src/utils.py (placeholder)
tests/__init__.py
tests/conftest.py (pytest fixtures)
tests/fixtures/.gitkeep
```

**pyproject.toml Structure**:
```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "zugferd-service"
version = "1.0.0"
description = "REST API for ZUGFeRD invoice extraction and validation"
requires-python = ">=3.11"
dependencies = [
    "fastapi>=0.109.0",
    "uvicorn>=0.27.0",
    "python-multipart>=0.0.6",
    "factur-x>=2.5",
    "pypdf>=4.0.0",
    "pydantic>=2.5.0",
    "lxml>=5.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "httpx>=0.27.0",
]

[project.scripts]
zugferd-service = "src.main:run"

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
```
**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: pyproject.toml is valid
  Tool: Bash
  Steps:
    1. python -c "import tomllib; tomllib.load(open('pyproject.toml', 'rb'))"
    2. Assert: Exit code 0
  Expected Result: TOML parses without error
  Evidence: Command output captured

Scenario: Project structure exists
  Tool: Bash
  Steps:
    1. ls -la src/
    2. Assert: main.py, models.py, extractor.py, validator.py, pdf_parser.py, utils.py exist
    3. ls -la tests/
    4. Assert: conftest.py, fixtures/ exist
  Expected Result: All required files present
  Evidence: Directory listing captured

Scenario: Dependencies install correctly
  Tool: Bash
  Steps:
    1. pip install -e ".[dev]"
    2. Assert: Exit code 0
    3. python -c "import fastapi; import facturx; import pypdf"
    4. Assert: Exit code 0
  Expected Result: All dependencies importable
  Evidence: pip output captured
```

**Commit**: YES
- Message: `feat(project): initialize ZUGFeRD service with pyproject.toml and directory structure`
- Files: All created files
- Pre-commit: `python -c "import tomllib; tomllib.load(open('pyproject.toml', 'rb'))"`

---
|
||
|
||
- [x] 2. Download ZUGFeRD Sample PDFs
|
||
|
||
**What to do**:
|
||
- Download official ZUGFeRD sample PDFs from FeRD/ZUGFeRD repositories
|
||
- Include samples for: MINIMUM, BASIC, BASIC WL, EN16931, EXTENDED profiles
|
||
- Include a non-ZUGFeRD PDF for negative testing
|
||
- Store in `tests/fixtures/`
|
||
- Create fixture manifest documenting each file
|
||
|
||
**Must NOT do**:
|
||
- Do NOT create synthetic PDFs
|
||
- Do NOT download more than 10 samples (keep focused)
|
||
|
||
**Recommended Agent Profile**:
|
||
- **Category**: `quick`
|
||
- Reason: Download and organize files
|
||
- **Skills**: []
|
||
- No special skills needed
|
||
|
||
**Parallelization**:
|
||
- **Can Run In Parallel**: YES
|
||
- **Parallel Group**: Wave 1 (with Task 3)
|
||
- **Blocks**: Tasks 4, 5
|
||
- **Blocked By**: Task 1
|
||
|
||
**References**:
|
||
- **External References**:
|
||
- https://www.ferd-net.de/download/testrechnungen
|
||
- https://github.com/ZUGFeRD/mustangproject/tree/master/Mustang-CLI/src/test/resources
|
||
- https://github.com/akretion/factur-x/tree/master/tests
|
||
|
||
**Acceptance Criteria**:
|
||
|
||
**Agent-Executed QA Scenarios:**
|
||
|
||
```
|
||
Scenario: Sample PDFs exist for all profiles
|
||
Tool: Bash
|
||
Steps:
|
||
1. ls tests/fixtures/*.pdf | wc -l
|
||
2. Assert: At least 5 PDF files
|
||
3. file tests/fixtures/*.pdf
|
||
4. Assert: All files identified as "PDF document"
|
||
Expected Result: Multiple valid PDF files in fixtures
|
||
Evidence: File listing captured
|
||
|
||
Scenario: Fixture manifest documents samples
|
||
Tool: Bash
|
||
Steps:
|
||
1. cat tests/fixtures/MANIFEST.md
|
||
2. Assert: Contains entries for MINIMUM, BASIC, EN16931, EXTENDED
|
||
Expected Result: Manifest describes all test fixtures
|
||
Evidence: Manifest content captured
|
||
```
|
||
|
||
**Commit**: YES
|
||
- Message: `test(fixtures): add official ZUGFeRD sample PDFs for all profiles`
|
||
- Files: `tests/fixtures/*.pdf`, `tests/fixtures/MANIFEST.md`
|
||
|
||
---
- [x] 3. Create Pydantic Models

**What to do**:
- Define all Pydantic models as per API specification
- Models: ExtractRequest, ExtractResponse, ValidateRequest, ValidateResponse
- Nested models: Supplier, Buyer, LineItem, Totals, VatBreakdown, PaymentTerms
- Error models: ErrorResponse, ValidationError
- Add field validators where appropriate (e.g., VAT ID format)

**Must NOT do**:
- Do NOT add ORM mappings (no database)
- Do NOT add serialization beyond Pydantic defaults

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Straightforward Pydantic model definitions from spec
- **Skills**: []
  - Standard Python work

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 1 (with Task 2)
- **Blocks**: Tasks 4, 5, 7
- **Blocked By**: Task 1

**References**:
- **API Specification**: User's detailed spec document (request/response schemas)
- **Pattern References**: https://docs.pydantic.dev/latest/concepts/models/

**Key Models from Spec**:
```python
class Supplier(BaseModel):
    name: str
    street: str | None = None
    postal_code: str | None = None
    city: str | None = None
    country: str | None = None
    vat_id: str | None = None
    email: str | None = None

class LineItem(BaseModel):
    position: int
    article_number: str | None = None
    article_number_buyer: str | None = None
    description: str
    quantity: float
    unit: str  # Human-readable, translated from UN/ECE code
    unit_price: float
    line_total: float
    vat_rate: float | None = None
    vat_amount: float | None = None

class ExtractResponse(BaseModel):
    is_zugferd: bool
    zugferd_profil: str | None = None
    xml_raw: str | None = None
    xml_data: XmlData | None = None
    pdf_text: str | None = None
    extraction_meta: ExtractionMeta
```
**Acceptance Criteria**:

**TDD - Write Tests First:**
```python
# tests/test_models.py
def test_extract_response_zugferd():
    response = ExtractResponse(
        is_zugferd=True,
        zugferd_profil="EN16931",
        xml_raw="<?xml ...",
        # ... full data
    )
    assert response.is_zugferd is True
    assert response.zugferd_profil == "EN16931"

def test_extract_response_non_zugferd():
    response = ExtractResponse(
        is_zugferd=False,
        pdf_text="Invoice text...",
        extraction_meta=ExtractionMeta(pages=1, extraction_time_ms=50)
    )
    assert response.xml_data is None

def test_line_item_validation():
    item = LineItem(
        position=1,
        description="Widget",
        quantity=10.0,
        unit="Piece",
        unit_price=9.99,
        line_total=99.90
    )
    assert item.line_total == 99.90
```

**Agent-Executed QA Scenarios:**

```
Scenario: Models pass all unit tests
  Tool: Bash
  Steps:
    1. pytest tests/test_models.py -v
    2. Assert: All tests pass (exit code 0)
  Expected Result: 100% test pass rate
  Evidence: pytest output captured

Scenario: Models serialize to JSON correctly
  Tool: Bash
  Steps:
    1. python -c "from src.models import ExtractResponse, ExtractionMeta; r = ExtractResponse(is_zugferd=False, extraction_meta=ExtractionMeta(pages=1, extraction_time_ms=50)); print(r.model_dump_json())"
    2. Assert: Valid JSON output containing "is_zugferd": false
  Expected Result: JSON serialization works
  Evidence: JSON output captured
```

**Commit**: YES
- Message: `feat(models): add Pydantic models for all API request/response schemas`
- Files: `src/models.py`, `tests/test_models.py`
- Pre-commit: `pytest tests/test_models.py`

---
### Wave 2: Core Extraction Logic

- [x] 4. ZUGFeRD Extractor Implementation (TDD)

**What to do**:
- Write tests first using sample PDFs from fixtures
- Implement `extract_zugferd()` function using factur-x library
- Handle profile detection (get_level, get_flavor)
- Parse XML to structured data matching models
- Handle non-ZUGFeRD PDFs gracefully
- Handle errors (corrupt PDF, password-protected)

**Must NOT do**:
- Do NOT implement custom XML parsing if factur-x provides it
- Do NOT cache extracted data

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Core business logic requiring careful implementation
- **Skills**: [`systematic-debugging`]
  - `systematic-debugging`: For handling edge cases in PDF extraction

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 2 (with Tasks 5, 6)
- **Blocks**: Tasks 7, 10
- **Blocked By**: Tasks 2, 3

**References**:
- **Library Docs**: https://github.com/akretion/factur-x
- **Pattern References**: factur-x get_xml_from_pdf usage (from librarian research)
- **Test Fixtures**: `tests/fixtures/*.pdf`

**Key Functions**:
```python
from facturx import get_xml_from_pdf, get_level, get_flavor

def extract_zugferd(pdf_bytes: bytes) -> ExtractResponse:
    """Extract ZUGFeRD data from PDF bytes."""
    # 1. Check file size (<10MB)
    # 2. Try to extract XML using factur-x
    # 3. Detect profile and flavor
    # 4. Parse XML to structured data
    # 5. Extract PDF text for pdf_abgleich
    # 6. Return ExtractResponse
```
**Acceptance Criteria**:

**TDD - Write Tests First:**
```python
# tests/test_extractor.py
def test_extract_en16931_profile(sample_en16931_pdf):
    result = extract_zugferd(sample_en16931_pdf)
    assert result.is_zugferd is True
    assert result.zugferd_profil == "EN16931"
    assert result.xml_data is not None
    assert result.xml_data.invoice_number is not None

def test_extract_non_zugferd_pdf(sample_plain_pdf):
    result = extract_zugferd(sample_plain_pdf)
    assert result.is_zugferd is False
    assert result.xml_data is None
    assert result.pdf_text is not None

def test_extract_corrupt_pdf():
    with pytest.raises(ExtractionError) as exc:
        extract_zugferd(b"not a pdf")
    assert exc.value.error_code == "invalid_pdf"

def test_file_size_limit():
    large_pdf = b"x" * (11 * 1024 * 1024)  # 11MB
    with pytest.raises(ExtractionError) as exc:
        extract_zugferd(large_pdf)
    assert exc.value.error_code == "file_too_large"
```
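The tests above assume an `ExtractionError` carrying a machine-readable `error_code`. A minimal sketch — the attribute name and the two codes come from the tests; the helper function and its name are illustrative:

```python
class ExtractionError(Exception):
    """Raised when a PDF cannot be processed; error_code feeds the JSON error response."""

    def __init__(self, error_code: str, message: str):
        super().__init__(message)
        self.error_code = error_code
        self.message = message

MAX_PDF_BYTES = 10 * 1024 * 1024  # 10MB hard limit from the spec

def check_pdf_preconditions(pdf_bytes: bytes) -> None:
    # Size check first: cheap, and rejects oversized uploads before any parsing
    if len(pdf_bytes) > MAX_PDF_BYTES:
        raise ExtractionError("file_too_large", "PDF exceeds the 10MB limit")
    # Magic-number check: every PDF starts with "%PDF-"
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ExtractionError("invalid_pdf", "File is not a valid PDF")
```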
**Agent-Executed QA Scenarios:**

```
Scenario: Extractor passes all unit tests
  Tool: Bash
  Steps:
    1. pytest tests/test_extractor.py -v
    2. Assert: All tests pass
  Expected Result: 100% test pass rate
  Evidence: pytest output captured

Scenario: EN16931 profile correctly detected
  Tool: Bash
  Steps:
    1. python -c "
from src.extractor import extract_zugferd
with open('tests/fixtures/sample_en16931.pdf', 'rb') as f:
    result = extract_zugferd(f.read())
print(f'Profile: {result.zugferd_profil}')
print(f'Invoice: {result.xml_data.invoice_number}')"
    2. Assert: Output contains "Profile: EN16931" or "Profile: en16931"
  Expected Result: Profile correctly identified
  Evidence: Script output captured

Scenario: Non-ZUGFeRD PDF handled gracefully
  Tool: Bash
  Steps:
    1. python -c "
from src.extractor import extract_zugferd
with open('tests/fixtures/sample_no_zugferd.pdf', 'rb') as f:
    result = extract_zugferd(f.read())
assert result.is_zugferd == False
assert result.pdf_text is not None
print('OK: Non-ZUGFeRD handled correctly')"
    2. Assert: Output contains "OK:"
  Expected Result: Graceful handling
  Evidence: Script output captured
```

**Commit**: YES
- Message: `feat(extractor): implement ZUGFeRD extraction with profile detection`
- Files: `src/extractor.py`, `tests/test_extractor.py`
- Pre-commit: `pytest tests/test_extractor.py`

---
- [x] 5. PDF Text Parser Implementation (TDD)

**What to do**:
- Write tests first with expected extraction patterns
- Implement PDF text extraction using pypdf
- Create regex patterns for extracting key values from text
- Extract: invoice_number, invoice_date, amounts (net, gross, vat)
- Return confidence scores for each extraction

**Must NOT do**:
- Do NOT use OCR (text extraction only)
- Do NOT parse images in PDFs

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Regex pattern development requires careful testing
- **Skills**: [`systematic-debugging`]
  - `systematic-debugging`: For developing and testing regex patterns

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 2 (with Tasks 4, 6)
- **Blocks**: Task 7
- **Blocked By**: Task 2

**References**:
- **User Spec**: Regex patterns provided in spec (invoice_number, gross_amount patterns)
- **Library Docs**: https://pypdf.readthedocs.io/en/stable/

**Key Patterns from Spec**:
```python
EXTRACTION_PATTERNS = {
    "invoice_number": [
        r"Rechnungs?-?(?:Nr|Nummer)[.:\s]*([A-Z0-9\-]+)",
        r"Invoice\s*(?:No|Number)?[.:\s]*([A-Z0-9\-]+)",
        r"Beleg-?Nr[.:\s]*([A-Z0-9\-]+)"
    ],
    "gross_amount": [
        r"Brutto[:\s]*([0-9.,]+)\s*(?:EUR|€)?",
        r"Gesamtbetrag[:\s]*([0-9.,]+)",
        r"Total[:\s]*([0-9.,]+)\s*(?:EUR|€)?"
    ],
    # ... more patterns
}
```
**Acceptance Criteria**:

**TDD - Write Tests First:**
```python
# tests/test_pdf_parser.py
def test_extract_invoice_number():
    text = "Rechnung\nRechnungs-Nr.: RE-2025-001234\nDatum: 04.02.2025"
    result = extract_from_text(text)
    assert result["invoice_number"] == "RE-2025-001234"
    assert result["invoice_number_confidence"] >= 0.9

def test_extract_amounts():
    text = "Netto: 100,00 EUR\nMwSt 19%: 19,00 EUR\nBrutto: 119,00 EUR"
    result = extract_from_text(text)
    assert result["gross_amount"] == 119.00
    assert result["net_amount"] == 100.00

def test_german_number_format():
    text = "Gesamtbetrag: 1.234,56 €"
    result = extract_from_text(text)
    assert result["gross_amount"] == 1234.56
```
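The amount tests require converting German-formatted numbers ("1.234,56") to floats: '.' is the thousands separator, ',' the decimal separator. A small helper sketch (the function name is illustrative, and it deliberately handles only the German convention the tests use):

```python
def parse_german_amount(raw: str) -> float:
    """Convert a German-formatted amount string to a float.

    "1.234,56" -> 1234.56; "119,00" -> 119.0; plain "100" -> 100.0.
    """
    # Drop thousands separators first, then swap the decimal comma for a dot
    cleaned = raw.strip().replace(".", "").replace(",", ".")
    return float(cleaned)
```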
**Agent-Executed QA Scenarios:**

```
Scenario: PDF parser passes all unit tests
  Tool: Bash
  Steps:
    1. pytest tests/test_pdf_parser.py -v
    2. Assert: All tests pass
  Expected Result: 100% test pass rate
  Evidence: pytest output captured

Scenario: Real PDF text extraction works
  Tool: Bash
  Steps:
    1. python -c "
from src.pdf_parser import extract_text_from_pdf, extract_from_text
with open('tests/fixtures/sample_en16931.pdf', 'rb') as f:
    text = extract_text_from_pdf(f.read())
print(f'Extracted {len(text)} characters')
result = extract_from_text(text)
print(f'Invoice: {result.get(\"invoice_number\", \"NOT FOUND\")}')"
    2. Assert: Output shows extracted characters and invoice number
  Expected Result: Text extraction works on real PDFs
  Evidence: Script output captured
```

**Commit**: YES
- Message: `feat(pdf-parser): implement PDF text extraction with regex patterns`
- Files: `src/pdf_parser.py`, `tests/test_pdf_parser.py`
- Pre-commit: `pytest tests/test_pdf_parser.py`

---
- [x] 6. Utility Functions Implementation

**What to do**:
- Create UN/ECE unit code mapping dictionary
- Implement decimal rounding helper (2 decimal places, standard rounding)
- Implement tolerance comparison function (0.01 EUR)
- Add German date format parser

**Must NOT do**:
- Do NOT create configurable tolerance (hardcode 0.01)
- Do NOT add unit codes beyond common ones (expand later if needed)

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Simple utility functions
- **Skills**: []
  - Standard Python work

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 2 (with Tasks 4, 5)
- **Blocks**: Task 7
- **Blocked By**: Task 1

**References**:
- **UN/ECE Codes**: https://docs.peppol.eu/poacc/upgrade-3/codelist/UNECERec20/

**Unit Code Dictionary**:
```python
UNECE_UNIT_CODES = {
    "C62": "Stück",
    "H87": "Stück",
    "KGM": "Kilogramm",
    "GRM": "Gramm",
    "TNE": "Tonne",
    "MTR": "Meter",
    "KMT": "Kilometer",
    "MTK": "Quadratmeter",
    "LTR": "Liter",
    "MLT": "Milliliter",
    "DAY": "Tag",
    "HUR": "Stunde",
    "MON": "Monat",
    "ANN": "Jahr",
    "SET": "Set",
    "PCE": "Stück",
    "EA": "Stück",
}
```
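The rounding helper deserves care: Python's built-in `round()` uses banker's rounding (round-half-to-even), while "standard rounding" is usually read as round-half-up (kaufmännisches Runden). A sketch using `decimal` — the function names are illustrative:

```python
from decimal import Decimal, ROUND_HALF_UP

TOLERANCE = Decimal("0.01")  # hardcoded per spec: 0.01 EUR

def round_amount(value: float) -> float:
    """Round to 2 decimal places with round-half-up."""
    # Convert via str() so e.g. 2.675 arrives as Decimal("2.675"),
    # not the binary-float artifact 2.67499999...
    return float(Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))

def amounts_match(a: float, b: float) -> bool:
    """True if two amounts differ by at most 0.01 EUR."""
    return abs(Decimal(str(a)) - Decimal(str(b))) <= TOLERANCE
```

Note that `round(2.675, 2)` would give 2.67; the decimal-based helper gives 2.68, which is what invoice arithmetic usually expects.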
**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Unit code translation works
  Tool: Bash
  Steps:
    1. python -c "
from src.utils import translate_unit_code
assert translate_unit_code('C62') == 'Stück'
assert translate_unit_code('KGM') == 'Kilogramm'
assert translate_unit_code('UNKNOWN') == 'UNKNOWN'
print('OK: Unit codes translate correctly')"
    2. Assert: Output contains "OK:"
  Expected Result: All translations correct
  Evidence: Script output captured

Scenario: Decimal comparison with tolerance
  Tool: Bash
  Steps:
    1. python -c "
from src.utils import amounts_match
assert amounts_match(100.00, 100.00) == True
assert amounts_match(100.00, 100.005) == True  # Within 0.01
assert amounts_match(100.00, 100.02) == False  # Outside tolerance
print('OK: Tolerance comparison works')"
    2. Assert: Output contains "OK:"
  Expected Result: Tolerance logic correct
  Evidence: Script output captured
```

**Commit**: YES
- Message: `feat(utils): add unit code mapping and decimal utilities`
- Files: `src/utils.py`, `tests/test_utils.py`
- Pre-commit: `pytest tests/test_utils.py`

---
### Wave 3: Validation Logic

- [x] 7. Validator Implementation (TDD)

**What to do**:
- Write tests first for each validation check
- Implement all 4 validation checks:
  1. `pflichtfelder` - Required fields check
  2. `betraege` - Amount calculations check
  3. `ustid` - VAT ID format check
  4. `pdf_abgleich` - XML vs PDF comparison
- Return structured ValidationResult with errors and warnings

**Must NOT do**:
- Do NOT implement online USt-ID validation
- Do NOT add validation checks beyond spec

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Core business logic with multiple validation rules
- **Skills**: [`systematic-debugging`]
  - `systematic-debugging`: For handling edge cases in validation logic

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 3 (with Tasks 8, 9)
- **Blocks**: Task 11
- **Blocked By**: Tasks 4, 5, 6

**References**:
- **User Spec**: Detailed validation logic tables
- **Pattern References**: src/models.py ValidationError model

**Validation Rules from Spec**:

**pflichtfelder (Required Fields)**:

| Field | Severity |
|-------|----------|
| invoice_number | critical |
| invoice_date | critical |
| supplier.name | critical |
| supplier.vat_id | critical |
| buyer.name | critical |
| totals.net | critical |
| totals.gross | critical |
| totals.vat_total | critical |
| due_date | warning |
| payment_terms.iban | warning |
| line_items (min 1) | critical |
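A table-driven sketch of the pflichtfelder check — the field paths and severities come straight from the table above; the nested-dict input shape, helper name, and finding format are illustrative:

```python
REQUIRED_FIELDS = {
    "invoice_number": "critical",
    "invoice_date": "critical",
    "supplier.name": "critical",
    "supplier.vat_id": "critical",
    "buyer.name": "critical",
    "totals.net": "critical",
    "totals.gross": "critical",
    "totals.vat_total": "critical",
    "due_date": "warning",
    "payment_terms.iban": "warning",
}

def check_pflichtfelder(data: dict) -> list[dict]:
    """Return one finding per missing field, carrying its severity."""
    findings = []
    for path, severity in REQUIRED_FIELDS.items():
        node = data
        for key in path.split("."):
            # Walk the dotted path; any missing level means the field is absent
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            findings.append({"field": path, "severity": severity,
                             "error_code": "missing_field"})
    # line_items needs at least one entry
    if not data.get("line_items"):
        findings.append({"field": "line_items", "severity": "critical",
                         "error_code": "missing_field"})
    return findings
```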
**betraege (Calculations)**:
- line_total ≈ quantity × unit_price (±0.01)
- totals.net ≈ Σ(line_items.line_total) (±0.01)
- vat_breakdown.amount ≈ base × (rate/100) (±0.01)
- totals.vat_total ≈ Σ(vat_breakdown.amount) (±0.01)
- totals.gross ≈ totals.net + totals.vat_total (±0.01)

**ustid (VAT ID Format)**:

| Country | Regex |
|---------|-------|
| DE | `^DE[0-9]{9}$` |
| AT | `^ATU[0-9]{8}$` |
| CH | `^CHE[0-9]{9}(MWST\|TVA\|IVA)$` |
|
||
|
**Acceptance Criteria**:

**TDD - Write Tests First:**

```python
# tests/test_validator.py
def test_pflichtfelder_missing_invoice_number():
    data = XmlData(invoice_number=None, ...)
    result = validate_pflichtfelder(data)
    assert any(e.field == "invoice_number" for e in result.errors)
    assert any(e.severity == "critical" for e in result.errors)

def test_betraege_calculation_mismatch():
    data = XmlData(
        line_items=[LineItem(quantity=10, unit_price=9.99, line_total=100.00)],  # Wrong!
        ...
    )
    result = validate_betraege(data)
    assert any(e.error_code == "calculation_mismatch" for e in result.errors)
    assert any("99.90" in e.message for e in result.errors)

def test_ustid_valid_german():
    result = validate_ustid("DE123456789")
    assert result.is_valid is True

def test_ustid_invalid_format():
    result = validate_ustid("DE12345")  # Too short
    assert result.is_valid is False
    assert result.error_code == "invalid_format"

def test_pdf_abgleich_mismatch():
    xml_data = XmlData(invoice_number="RE-001", totals=Totals(gross=118.88))
    pdf_values = {"invoice_number": "RE-002", "gross_amount": 118.88}
    result = validate_pdf_abgleich(xml_data, pdf_values)
    assert any(e.field == "invoice_number" for e in result.errors)
```

**Agent-Executed QA Scenarios:**

```
Scenario: Validator passes all unit tests
Tool: Bash
Steps:
1. pytest tests/test_validator.py -v
2. Assert: All tests pass
Expected Result: 100% test pass rate
Evidence: pytest output captured

Scenario: All validation checks work together
Tool: Bash
Steps:
1. python -c "
from src.validator import validate_invoice
from src.models import XmlData, ValidateRequest
# Create a request with known issues
request = ValidateRequest(
    xml_data={'invoice_number': None, ...},
    checks=['pflichtfelder', 'betraege']
)
result = validate_invoice(request)
print(f'Errors: {len(result.errors)}')"
2. Assert: Script runs without error, shows error count
Expected Result: Validator processes all checks
Evidence: Script output captured
```

**Commit**: YES
- Message: `feat(validator): implement all validation checks (pflichtfelder, betraege, ustid, pdf_abgleich)`
- Files: `src/validator.py`, `tests/test_validator.py`
- Pre-commit: `pytest tests/test_validator.py`

---

### Wave 3 (continued): API Foundation

- [x] 8. FastAPI Application Structure

**What to do**:
- Create FastAPI app instance in main.py
- Configure exception handlers for custom errors
- Set up structured JSON logging
- Add CORS middleware (for local development)
- Configure app metadata (title, version, description)

**Must NOT do**:
- Do NOT add authentication middleware
- Do NOT add rate limiting

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Standard FastAPI setup
- **Skills**: []
  - Standard FastAPI patterns

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 3 (with Tasks 7, 9)
- **Blocks**: Tasks 9, 10, 11
- **Blocked By**: Task 3

**References**:
- **FastAPI Docs**: https://fastapi.tiangolo.com/
- **Pattern References**: User spec error handling table

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: FastAPI app starts
Tool: Bash
Steps:
1. timeout 5 uvicorn src.main:app --port 5001 &
2. sleep 2
3. curl -s http://localhost:5001/openapi.json | head -c 100
4. Assert: Output contains "openapi"
5. kill %1
Expected Result: OpenAPI schema accessible
Evidence: curl output captured
```

**Commit**: YES
- Message: `feat(api): create FastAPI application structure with error handling`
- Files: `src/main.py`
- Pre-commit: `python -c "from src.main import app; print(app.title)"`

---

- [x] 9. Health Endpoint Implementation

**What to do**:
- Implement `GET /health` endpoint
- Return version from pyproject.toml
- Return simple health status

**Must NOT do**:
- Do NOT add complex health checks (no dependencies to check)

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Simple endpoint
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 3 (with Tasks 7, 8)
- **Blocks**: Tasks 10, 11
- **Blocked By**: Task 8

**References**:
- **User Spec**: Health endpoint response format

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Health endpoint returns correct format
Tool: Bash
Steps:
1. uvicorn src.main:app --port 5002 &
2. sleep 2
3. curl -s http://localhost:5002/health
4. Assert: Response contains {"status": "healthy", "version": "1.0.0"}
5. kill %1
Expected Result: Health check passes
Evidence: curl output captured
```

**Commit**: YES (groups with Task 8)
- Message: `feat(api): add health endpoint`
- Files: `src/main.py`

---

### Wave 4: API Endpoints

- [ ] 10. Extract Endpoint Implementation (TDD)

**What to do**:
- Write integration tests for `/extract` endpoint
- Implement `POST /extract` endpoint
- Accept base64-encoded PDF in JSON body
- Use extractor module for extraction
- Return structured ExtractResponse

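Decoding and sanity-checking the base64 payload might look like the sketch below. The error codes follow the spec's error table; the `ExtractionError` shape mirrors the exception assumed in Task 8, and the `%PDF-` magic-byte check is a cheap pre-filter before handing bytes to the extractor:

```python
import base64
import binascii

class ExtractionError(Exception):
    """Maps to a JSON error response (see Task 8's exception handler)."""
    def __init__(self, error_code: str, message: str):
        self.error_code = error_code
        self.message = message

def decode_pdf(pdf_base64: str) -> bytes:
    """Decode the base64 payload and verify it looks like a PDF."""
    try:
        pdf_bytes = base64.b64decode(pdf_base64, validate=True)
    except (binascii.Error, ValueError):
        raise ExtractionError("invalid_base64", "Body is not valid base64")
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ExtractionError("invalid_pdf", "Decoded payload is not a PDF")
    return pdf_bytes
```

With this, the two negative tests in the TDD block below fall out directly: `"not-valid-base64!!!"` trips `invalid_base64`, and `"SGVsbG8gV29ybGQ="` (plain text) trips `invalid_pdf`.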
**Must NOT do**:
- Do NOT accept multipart file uploads (JSON only per spec)

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Integration of extractor into API endpoint
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 4 (with Tasks 11, 12)
- **Blocks**: Task 13
- **Blocked By**: Tasks 4, 8, 9

**References**:
- **User Spec**: /extract request/response format
- **Pattern References**: src/extractor.py

**Acceptance Criteria**:

**TDD - Write Tests First:**

```python
# tests/test_api.py
@pytest.fixture
def client():
    return TestClient(app)

def test_extract_zugferd_pdf(client, sample_en16931_pdf_base64):
    response = client.post("/extract", json={"pdf_base64": sample_en16931_pdf_base64})
    assert response.status_code == 200
    data = response.json()
    assert data["is_zugferd"] is True
    assert data["zugferd_profil"] is not None

def test_extract_invalid_base64(client):
    response = client.post("/extract", json={"pdf_base64": "not-valid-base64!!!"})
    assert response.status_code == 400
    assert response.json()["error"] == "invalid_base64"

def test_extract_non_pdf(client):
    # Base64 of "Hello World" (not a PDF)
    response = client.post("/extract", json={"pdf_base64": "SGVsbG8gV29ybGQ="})
    assert response.status_code == 400
    assert response.json()["error"] == "invalid_pdf"
```

**Agent-Executed QA Scenarios:**

```
Scenario: Extract endpoint integration test
Tool: Bash
Steps:
1. pytest tests/test_api.py::test_extract -v
2. Assert: All extract tests pass
Expected Result: Endpoint works correctly
Evidence: pytest output captured

Scenario: Live extract endpoint test
Tool: Bash
Steps:
1. uvicorn src.main:app --port 5003 &
2. sleep 2
3. PDF_BASE64=$(base64 -w 0 tests/fixtures/sample_en16931.pdf)
4. curl -s -X POST http://localhost:5003/extract \
     -H "Content-Type: application/json" \
     -d "{\"pdf_base64\": \"$PDF_BASE64\"}" | jq '.is_zugferd'
5. Assert: Output is "true"
6. kill %1
Expected Result: ZUGFeRD detected
Evidence: curl output captured
```

**Commit**: YES
- Message: `feat(api): implement /extract endpoint for PDF processing`
- Files: `src/main.py`, `tests/test_api.py`
- Pre-commit: `pytest tests/test_api.py::test_extract`

---

- [ ] 11. Validate Endpoint Implementation (TDD)

**What to do**:
- Write integration tests for `/validate` endpoint
- Implement `POST /validate` endpoint
- Accept xml_data, pdf_text, and checks array
- Use validator module for validation
- Return structured ValidateResponse

**Must NOT do**:
- Do NOT run checks not in the request's checks array

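Running only the requested checks can be sketched as a registry lookup. The registry and function names are illustrative; the placeholder check bodies stand in for the real implementations in src/validator.py:

```python
from typing import Callable

def _check_pflichtfelder(data: dict) -> list:
    """Placeholder: the real version returns structured ValidationError objects."""
    return ["missing: invoice_number"] if not data.get("invoice_number") else []

def _check_betraege(data: dict) -> list:
    return []  # placeholder

# Map check names from the request's checks array to validator functions
CHECKS: dict = {
    "pflichtfelder": _check_pflichtfelder,
    "betraege": _check_betraege,
}

def run_checks(xml_data: dict, checks: list) -> list:
    """Run only the checks named in the request, in the order given."""
    errors = []
    for name in checks:
        check = CHECKS.get(name)
        if check is not None:  # unknown names could instead be rejected with 400
            errors.extend(check(xml_data))
    return errors
```

This makes the "must not run unrequested checks" rule structural: a check that is not named in the array is simply never looked up.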
**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Integration of validator into API endpoint
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 4 (with Tasks 10, 12)
- **Blocks**: Task 13
- **Blocked By**: Tasks 7, 8, 9

**References**:
- **User Spec**: /validate request/response format
- **Pattern References**: src/validator.py

**Acceptance Criteria**:

**TDD - Write Tests First:**

```python
def test_validate_all_checks(client):
    response = client.post("/validate", json={
        "xml_data": {...},
        "pdf_text": "...",
        "checks": ["pflichtfelder", "betraege", "ustid", "pdf_abgleich"]
    })
    assert response.status_code == 200
    data = response.json()
    assert "is_valid" in data
    assert "errors" in data
    assert "summary" in data

def test_validate_partial_checks(client):
    response = client.post("/validate", json={
        "xml_data": {...},
        "checks": ["pflichtfelder"]
    })
    assert response.status_code == 200
    # Only pflichtfelder check should run
```

**Agent-Executed QA Scenarios:**

```
Scenario: Validate endpoint integration test
Tool: Bash
Steps:
1. pytest tests/test_api.py::test_validate -v
2. Assert: All validate tests pass
Expected Result: Endpoint works correctly
Evidence: pytest output captured

Scenario: Live validate endpoint with invalid data
Tool: Bash
Steps:
1. uvicorn src.main:app --port 5004 &
2. sleep 2
3. curl -s -X POST http://localhost:5004/validate \
     -H "Content-Type: application/json" \
     -d '{"xml_data": {"invoice_number": null}, "checks": ["pflichtfelder"]}' | jq '.is_valid'
4. Assert: Output is "false"
5. kill %1
Expected Result: Validation detects missing field
Evidence: curl output captured
```

**Commit**: YES
- Message: `feat(api): implement /validate endpoint for invoice validation`
- Files: `src/main.py`, `tests/test_api.py`
- Pre-commit: `pytest tests/test_api.py::test_validate`

---

- [ ] 12. Error Handling Middleware

**What to do**:
- Implement exception handlers for all error types
- Map exceptions to HTTP status codes and error responses
- Ensure all errors return JSON format
- Add request logging

**Must NOT do**:
- Do NOT expose stack traces in production

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Standard FastAPI error handling
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 4 (with Tasks 10, 11)
- **Blocks**: Task 13
- **Blocked By**: Task 8

**References**:
- **User Spec**: Error handling table

**Error Mapping from Spec**:

| Error | Status | error_code |
|-------|--------|------------|
| Invalid JSON | 400 | invalid_json |
| Not a PDF | 400 | invalid_pdf |
| PDF corrupt | 400 | corrupt_pdf |
| Base64 invalid | 400 | invalid_base64 |
| File too large | 400 | file_too_large |
| Password protected | 400 | password_protected_pdf |
| Internal error | 500 | internal_error |

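The mapping table above translates naturally into a lookup the exception handler can use; the constant and function names are illustrative:

```python
# error_code -> HTTP status, taken from the error-mapping table
ERROR_STATUS = {
    "invalid_json": 400,
    "invalid_pdf": 400,
    "corrupt_pdf": 400,
    "invalid_base64": 400,
    "file_too_large": 400,
    "password_protected_pdf": 400,
    "internal_error": 500,
}

def status_for(error_code: str) -> int:
    """Unknown codes fall back to 500 so no error escapes the JSON contract."""
    return ERROR_STATUS.get(error_code, 500)
```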
**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Error responses are JSON
Tool: Bash
Steps:
1. uvicorn src.main:app --port 5005 &
2. sleep 2
3. curl -s -X POST http://localhost:5005/extract \
     -H "Content-Type: application/json" \
     -d '{"pdf_base64": "invalid!!!"}' | jq '.error'
4. Assert: Output is "invalid_base64"
5. kill %1
Expected Result: JSON error response
Evidence: curl output captured
```

**Commit**: YES (groups with Tasks 10, 11)
- Message: `feat(api): add comprehensive error handling middleware`
- Files: `src/main.py`

---

### Wave 5: Packaging

- [ ] 13. Integration Tests

**What to do**:
- Create end-to-end integration tests
- Test full workflow: extract → validate
- Test with all sample PDFs
- Test error scenarios

**Must NOT do**:
- Do NOT create performance tests

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Integration testing requires comprehensive coverage
- **Skills**: [`systematic-debugging`]

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential (depends on all endpoints)
- **Blocks**: Tasks 14, 16
- **Blocked By**: Tasks 10, 11, 12

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Full integration test suite passes
Tool: Bash
Steps:
1. pytest tests/ -v --tb=short
2. Assert: All tests pass (exit code 0)
Expected Result: 100% test pass rate
Evidence: pytest output captured
```

**Commit**: YES
- Message: `test(integration): add end-to-end integration tests`
- Files: `tests/test_integration.py`
- Pre-commit: `pytest tests/`

---

- [ ] 14. Dockerfile Creation

**What to do**:
- Create multi-stage Dockerfile as per spec
- Build stage: install dependencies
- Production stage: slim image with app
- Non-root user for security
- Expose port 5000

**Must NOT do**:
- Do NOT include dev dependencies in production image

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Standard Dockerfile from spec template
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 5 (with Task 15)
- **Blocks**: Task 16
- **Blocked By**: Task 13

**References**:
- **User Spec**: Complete Dockerfile template provided

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Docker build succeeds
Tool: Bash
Steps:
1. docker build -t zugferd-service:test .
2. Assert: Exit code 0
3. docker images zugferd-service:test --format "{{.Size}}"
4. Assert: Size < 500MB
Expected Result: Image builds and is reasonably sized
Evidence: Build output captured

Scenario: Container runs and responds
Tool: Bash
Steps:
1. docker run -d --name zugferd-test -p 5006:5000 zugferd-service:test
2. sleep 3
3. curl -s http://localhost:5006/health | jq '.status'
4. Assert: Output is "healthy"
5. docker stop zugferd-test && docker rm zugferd-test
Expected Result: Container is functional
Evidence: curl output captured
```

**Commit**: YES
- Message: `build(docker): add multi-stage Dockerfile for production`
- Files: `Dockerfile`
- Pre-commit: `docker build -t zugferd-service:test .`

---

- [ ] 15. Docker Compose Configuration

**What to do**:
- Create docker-compose.yml for local development
- Include volume mount for live reload
- Configure environment variables
- Add health check

**Must NOT do**:
- Do NOT add additional services (no DB, no cache)

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Simple docker-compose setup
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: YES
- **Parallel Group**: Wave 5 (with Task 14)
- **Blocks**: None
- **Blocked By**: Task 13

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Docker Compose starts service
Tool: Bash
Steps:
1. docker-compose up -d
2. sleep 5
3. curl -s http://localhost:5000/health | jq '.status'
4. Assert: Output is "healthy"
5. docker-compose down
Expected Result: Compose setup works
Evidence: curl output captured
```

**Commit**: YES
- Message: `build(docker): add docker-compose.yml for local development`
- Files: `docker-compose.yml`

---

### Wave 6: Nix Packaging

- [ ] 16. Nix Flake Packaging

**What to do**:
- Create flake.nix with buildPythonApplication
- Use pythonRelaxDeps for dependency flexibility
- Include devShell for development
- Test with nix build and nix run

**Must NOT do**:
- Do NOT create complex overlay structure

**Recommended Agent Profile**:
- **Category**: `unspecified-high`
  - Reason: Nix packaging requires careful dependency handling
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential
- **Blocks**: Tasks 17, 18
- **Blocked By**: Task 14

**References**:
- **Librarian Research**: mem0 Nix packaging pattern
- **External References**: https://github.com/mem0ai/mem0 Nix package

**flake.nix Structure**:
```nix
{
  description = "ZUGFeRD REST API Service";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    flake-utils.url = "github:numtide/flake-utils";
  };

  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
        pythonPackages = pkgs.python311Packages;

        zugferd-service = pythonPackages.buildPythonApplication {
          pname = "zugferd-service";
          version = "1.0.0";
          pyproject = true;
          src = ./.;

          pythonRelaxDeps = true;

          build-system = [ pythonPackages.hatchling ];

          dependencies = with pythonPackages; [
            fastapi
            uvicorn
            pydantic
            python-multipart
            # factur-x - may need packaging
            pypdf
            lxml
          ];

          nativeCheckInputs = with pythonPackages; [
            pytestCheckHook
            pytest-asyncio
            httpx
          ];

          meta = {
            description = "REST API for ZUGFeRD invoice extraction";
            license = pkgs.lib.licenses.mit;
            # mainProgram belongs in meta (used by lib.getExe / nix run)
            mainProgram = "zugferd-service";
          };
        };
      in
      {
        packages.default = zugferd-service;
        packages.zugferd-service = zugferd-service;

        devShells.default = pkgs.mkShell {
          packages = [
            (pkgs.python311.withPackages (ps: with ps; [
              fastapi uvicorn pydantic pypdf lxml
              pytest pytest-asyncio httpx
            ]))
          ];
        };
      }
    );
}
```

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: Nix flake builds successfully
Tool: Bash
Steps:
1. nix build .#zugferd-service
2. Assert: Exit code 0
3. ls -la result/bin/
4. Assert: zugferd-service binary exists
Expected Result: Nix package builds
Evidence: Build output captured

Scenario: Nix package runs correctly
Tool: Bash
Steps:
1. nix run .#zugferd-service &
2. sleep 3
3. curl -s http://localhost:5000/health | jq '.status'
4. Assert: Output is "healthy"
5. kill %1
Expected Result: Nix-built service runs
Evidence: curl output captured

Scenario: Dev shell provides dependencies
Tool: Bash
Steps:
1. nix develop -c python -c "import fastapi; import pypdf; print('OK')"
2. Assert: Output is "OK"
Expected Result: Dev shell has all deps
Evidence: Command output captured
```

**Commit**: YES
- Message: `build(nix): add flake.nix for Nix packaging`
- Files: `flake.nix`
- Pre-commit: `nix flake check`

---

- [ ] 17. NixOS Service Module Example

**What to do**:
- Create example NixOS module for deployment
- Include service configuration options
- Add systemd service definition
- Document usage in README

**Must NOT do**:
- Do NOT create production-ready module (example only)

**Recommended Agent Profile**:
- **Category**: `quick`
  - Reason: Example module following standard patterns
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential
- **Blocks**: Task 18
- **Blocked By**: Task 16

**References**:
- **User Spec**: NixOS container configuration example
- **Librarian Research**: NixOS service module pattern

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: NixOS module syntax is valid
Tool: Bash
Steps:
1. nix-instantiate --eval -E "import ./nix/module.nix"
2. Assert: Exit code 0 or specific Nix evaluation output
Expected Result: Module parses correctly
Evidence: Nix output captured
```

**Commit**: YES
- Message: `docs(nix): add example NixOS service module`
- Files: `nix/module.nix`

---

- [ ] 18. README Documentation

**What to do**:
- Create comprehensive README.md
- Include: Overview, Installation, Usage, API Reference
- Add examples for Docker, Nix, and direct Python usage
- Document all endpoints with curl examples
- Include troubleshooting section

**Must NOT do**:
- Do NOT duplicate API spec (reference it)

**Recommended Agent Profile**:
- **Category**: `writing`
  - Reason: Documentation writing
- **Skills**: []

**Parallelization**:
- **Can Run In Parallel**: NO
- **Parallel Group**: Sequential (last task)
- **Blocks**: None
- **Blocked By**: Tasks 16, 17

**README Structure**:
```markdown
# ZUGFeRD-Service

## Overview
## Quick Start
### Docker
### Nix
### Python (Development)
## API Reference
### GET /health
### POST /extract
### POST /validate
## Configuration
## NixOS Deployment
## Development
## Troubleshooting
## License
```

**Acceptance Criteria**:

**Agent-Executed QA Scenarios:**

```
Scenario: README contains all required sections
Tool: Bash
Steps:
1. grep -c "## " README.md
2. Assert: At least 8 sections
3. grep "curl" README.md | wc -l
4. Assert: At least 3 curl examples
Expected Result: Comprehensive documentation
Evidence: grep output captured
```

**Commit**: YES
- Message: `docs: add comprehensive README with installation and usage guide`
- Files: `README.md`

---

## Commit Strategy

| After Task | Message | Files | Verification |
|------------|---------|-------|--------------|
| 1 | `feat(project): initialize ZUGFeRD service` | pyproject.toml, src/, tests/ | toml parse |
| 2 | `test(fixtures): add ZUGFeRD sample PDFs` | tests/fixtures/ | file exists |
| 3 | `feat(models): add Pydantic models` | src/models.py | pytest |
| 4 | `feat(extractor): implement extraction` | src/extractor.py | pytest |
| 5 | `feat(pdf-parser): implement PDF parsing` | src/pdf_parser.py | pytest |
| 6 | `feat(utils): add utilities` | src/utils.py | pytest |
| 7 | `feat(validator): implement validation` | src/validator.py | pytest |
| 8-9 | `feat(api): add FastAPI app + health` | src/main.py | curl |
| 10 | `feat(api): add /extract endpoint` | src/main.py | pytest |
| 11 | `feat(api): add /validate endpoint` | src/main.py | pytest |
| 12 | `feat(api): add error handling` | src/main.py | curl |
| 13 | `test(integration): add e2e tests` | tests/ | pytest |
| 14 | `build(docker): add Dockerfile` | Dockerfile | docker build |
| 15 | `build(docker): add docker-compose` | docker-compose.yml | compose up |
| 16 | `build(nix): add flake.nix` | flake.nix | nix build |
| 17 | `docs(nix): add NixOS module` | nix/module.nix | nix eval |
| 18 | `docs: add README` | README.md | grep check |

---

## Success Criteria

### Verification Commands
```bash
# All tests pass
pytest tests/ -v
# Expected: All tests pass (exit code 0)

# Docker builds and runs
docker build -t zugferd-service .
docker run -p 5000:5000 zugferd-service &
curl http://localhost:5000/health
# Expected: {"status": "healthy", "version": "1.0.0"}

# Nix builds and runs
nix build .#zugferd-service
./result/bin/zugferd-service &
curl http://localhost:5000/health
# Expected: {"status": "healthy", "version": "1.0.0"}

# Extract endpoint works
PDF_BASE64=$(base64 -w 0 tests/fixtures/sample_en16931.pdf)
curl -X POST http://localhost:5000/extract \
  -H "Content-Type: application/json" \
  -d "{\"pdf_base64\": \"$PDF_BASE64\"}" | jq '.is_zugferd'
# Expected: true
```

### Final Checklist
- [ ] All 18 tasks completed
- [ ] All tests pass (pytest)
- [ ] Docker image builds (<500MB)
- [ ] Docker container runs and responds
- [ ] Nix flake builds without errors
- [ ] Nix package runs and responds
- [ ] All endpoints return expected responses
- [ ] README documents all features
- [ ] No "Must NOT Have" items present