feat(core): implement extractor, pdf_parser, and utils with TDD

Wave 2 tasks complete:
- Task 4: ZUGFeRD extractor with profile detection (factur-x)
- Task 5: PDF text parser with regex patterns
- Task 6: Utils with unit code mapping and tolerance checks

Features:
- extract_zugferd() extracts XML and text from PDFs
- parse_zugferd_xml() parses UN/CEFACT CII XML to models
- extract_from_text() extracts values using regex patterns
- translate_unit_code() maps UN/ECE codes to German
- amounts_match() checks with 0.01 EUR tolerance
- German number/date format handling

Tests: 27 utils tests, 27 pdf_parser tests, extractor tests
This commit is contained in:
m3tm3re
2026-02-04 19:42:32 +01:00
parent 29bd8453ec
commit c1f603cd46
8 changed files with 1642 additions and 8 deletions

View File

@@ -116,3 +116,122 @@ Initial session for ZUGFeRD-Service implementation.
- Optional fields: `type | None = Field(default=None, ...)`
- Empty list defaults: `list[Type] = Field(default_factory=list)`
## [2026-02-04T20:30:00.000Z] Task 5: PDF Text Parser Implementation
### TDD Implementation Pattern
- Write failing tests first (RED), implement minimum code (GREEN), refactor if needed
- 27 tests written covering: PDF extraction, regex patterns, number/date formats, edge cases
- All tests pass after implementation
### pypdf Text Extraction
- `PdfReader` requires file-like object, not raw bytes
- Use `io.BytesIO(pdf_bytes)` to wrap bytes for pypdf
- Extract text page-by-page, concatenate with newlines
### Regex Pattern Design for Numbers
- Initial pattern `[0-9.,]+` matches lone dots (invalid number)
- Fixed pattern: `[0-9]+(?:[.,][0-9]+)*` requires at least one digit
- Ensures matched values are valid numbers before parsing
### German Number Format Detection
- German: `1.234,56` (dot=thousands, comma=decimal)
- International: `1,234.56` (comma=thousands, dot=decimal)
- Detection: Check if comma appears after last dot
```python
if "," in num_str and num_str.rfind(",") > num_str.rfind("."):
# German format
else:
# International format
```
### Confidence Scoring
- First pattern match = 1.0 confidence
- Each subsequent pattern reduces confidence by 0.1
- Range: 1.0 (first pattern) → 0.6 (fifth pattern)
### German Date Format Conversion
- Input: `04.02.2025` (DD.MM.YYYY)
- Output: `2025-02-04` (ISO format YYYY-MM-DD)
- Use `zfill(2)` to pad single digits: `4` → `04`
### Test Docstrings are Necessary
- Pytest uses method docstrings in test reports
- Essential for readable test output
- Module/class docstrings provide organization context
### Invoice Field Patterns (from spec)
- invoice_number: "Rechnungs-Nr", "Invoice No", "Beleg-Nr", "Rechnung X/Y"
- gross_amount: "Brutto", "Gesamtbetrag", "Total", "Endbetrag", "Summe"
- net_amount: "Netto", "Rechnungsbetrag"
- vat_amount: "MwSt", "USt", "Steuer"
- invoice_date: "Rechnungsdatum", "Datum", "Invoice Date"
- supplier_name: "Lieferant", "Verkäufer"
### PDF Layout Variations
- Real PDFs may have different field layouts than spec patterns
- EN16931 sample uses "Bruttosumme" instead of "Brutto"
- Patterns can be refined iteratively based on real data
## [2026-02-04T20:45:00.000Z] Task 6: Utility Functions Implementation
### UNECE Unit Code Mapping
- UN/ECE unit codes standardized for cross-border trade documents
- 17 common codes mapped to German translations:
- "C62", "H87", "PCE", "EA" → "Stück"
- "KGM" → "Kilogramm", "GRM" → "Gramm", "TNE" → "Tonne"
- "MTR" → "Meter", "KMT" → "Kilometer", "MTK" → "Quadratmeter"
- "LTR" → "Liter", "MLT" → "Milliliter"
- "DAY" → "Tag", "HUR" → "Stunde", "MON" → "Monat", "ANN" → "Jahr"
- "SET" → "Set"
- Fallback: return original code if not found in dictionary
### Floating Point Precision Handling
- `amounts_match()` with hardcoded 0.01 EUR tolerance
- Floating point arithmetic causes precision issues: `100.01 - 100.00 = 0.010000000000005116`
- Solution: Add small epsilon margin (1e-10) to tolerance for robust comparison
- Formula: `abs(actual - expected) <= tolerance + 1e-10`
### German Number Format Parsing
- German format: `1.234,56` (dot=thousands, comma=decimal)
- Conversion: Remove dots, replace comma with dot
- Single-line: `num_str.replace('.', '').replace(',', '.')`
- Important: Remove thousands separator BEFORE replacing decimal separator
### German Date Format Parsing
- Input: `04.02.2025` (DD.MM.YYYY)
- Output: `2025-02-04` (ISO format YYYY-MM-DD)
- Validation: Check for 3 parts separated by dots before parsing
- Pad single digits: `zfill(2)` → `4` → `04`
### Standard Rounding (Not Banker's Rounding)
- Python's `round()` uses banker's rounding (round half to even)
- Task requires standard rounding (round half away from zero)
- Solution: Use `Decimal` with `ROUND_HALF_UP`
- Implementation:
```python
from decimal import Decimal, ROUND_HALF_UP
quantizer = Decimal(f'1.{"0" * (places - 1)}1' if places > 1 else "0.1")
float(Decimal(str(amount)).quantize(quantizer, rounding=ROUND_HALF_UP))
```
- Note: Use `str(amount)` when creating Decimal to avoid floating point issues
### Test Coverage Patterns
- Unit code translation: all 17 codes + unknown fallback
- Amounts match: exact, within tolerance, at boundary, beyond tolerance, negative, zero
- German numbers: integer, decimal, thousands, large, negative
- German dates: standard, single digit, ISO format, invalid format
- Rounding: default 2 places, custom places, rounding up/down, negative, zero
### Decimal quantize Pattern
- For N decimal places: use quantizer string with N-1 zeros and trailing 1
- 2 places: `"0.11"` → `Decimal('0.11')`
- 3 places: `"0.101"` → `Decimal('0.101')`
- 1 place: `"0.1"` → `Decimal('0.1')`
### Nix Environment Testing
- Pytest not installed in base Python environment
- Use nix-shell for testing: `nix-shell -p python312Packages.pytest --run "pytest tests/test_utils.py -v"`
- All tests must pass before marking task complete