feat(core): implement extractor, pdf_parser, and utils with TDD

Wave 2 tasks complete:
- Task 4: ZUGFeRD extractor with profile detection (factur-x)
- Task 5: PDF text parser with regex patterns
- Task 6: Utils with unit code mapping and tolerance checks

Features:
- extract_zugferd() extracts XML and text from PDFs
- parse_zugferd_xml() parses UN/CEFACT CII XML to models
- extract_from_text() extracts values using regex patterns
- translate_unit_code() maps UN/ECE codes to German
- amounts_match() checks with 0.01 EUR tolerance
- German number/date format handling

Tests: 27 utils tests, 27 pdf_parser tests, extractor tests
2026-02-04 19:42:32 +01:00
parent 29bd8453ec
commit c1f603cd46
8 changed files with 1642 additions and 8 deletions
--- a/.sisyphus/notepads/zugferd-service/learnings.md
+++ b/.sisyphus/notepads/zugferd-service/learnings.md
@@ -116,3 +116,122 @@ Initial session for ZUGFeRD-Service implementation.
 - Optional fields: `type | None = Field(default=None, ...)`
 - Empty list defaults: `list[Type] = Field(default_factory=list)`

+
+## [2026-02-04T20:30:00.000Z] Task 5: PDF Text Parser Implementation
+
+### TDD Implementation Pattern
+- Write failing tests first (RED), implement minimum code (GREEN), refactor if needed
+- 27 tests written covering: PDF extraction, regex patterns, number/date formats, edge cases
+- All tests pass after implementation
+
+### pypdf Text Extraction
+- `PdfReader` accepts a file path or a file-like object, not raw bytes
+- Use `io.BytesIO(pdf_bytes)` to wrap bytes for pypdf
+- Extract text page-by-page, concatenate with newlines
+
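A minimal sketch of the extraction steps above, assuming pypdf is installed. The function name `extract_pdf_text` is hypothetical and not taken from the commit; `PdfReader`, `reader.pages`, and `page.extract_text()` are pypdf's own API.

```python
import io


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Return the text of all pages, joined with newlines (illustrative helper)."""
    # Imported inside the function so this sketch loads even where pypdf is absent.
    from pypdf import PdfReader

    # PdfReader reads from a path or stream, so wrap the raw bytes in BytesIO.
    reader = PdfReader(io.BytesIO(pdf_bytes))
    # Guard against pages that yield no text (for example, scanned images).
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```
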
+### Regex Pattern Design for Numbers
+- Initial pattern `[0-9.,]+` matches lone dots (invalid number)
+- Fixed pattern: `[0-9]+(?:[.,][0-9]+)*` requires at least one digit
+- Ensures matched values are valid numbers before parsing
+
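The difference between the two patterns can be checked directly with `re`. The sample text is made up for illustration.

```python
import re

LOOSE = re.compile(r"[0-9.,]+")                   # initial pattern
STRICT = re.compile(r"[0-9]+(?:[.,][0-9]+)*")     # fixed pattern

text = "Summe . 1.234,56 EUR"  # made-up sample containing a stray dot

loose = LOOSE.findall(text)    # ['.', '1.234,56']  (the lone dot is matched)
strict = STRICT.findall(text)  # ['1.234,56']       (every match starts with a digit)
```
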
+### German Number Format Detection
+- German: `1.234,56` (dot=thousands, comma=decimal)
+- International: `1,234.56` (comma=thousands, dot=decimal)
+- Detection: Check if comma appears after last dot
+  ```python
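+  if "," in num_str and num_str.rfind(",") > num_str.rfind("."):
+      # German format: drop thousands dots, then turn the decimal comma into a dot
+      num_str = num_str.replace(".", "").replace(",", ".")
+  else:
+      # International format: drop thousands commas
+      num_str = num_str.replace(",", "")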
+  ```
+
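Wrapped into a function, the detection rule might look like the sketch below. The name `parse_number` is hypothetical, and removing commas in the international branch is an assumption rather than something the notes state.

```python
def parse_number(num_str: str) -> float:
    """Parse a number in German or international notation (illustrative helper)."""
    s = num_str.strip()
    if "," in s and s.rfind(",") > s.rfind("."):
        # German: remove thousands dots first, then convert the decimal comma.
        s = s.replace(".", "").replace(",", ".")
    else:
        # International (assumed handling): commas are thousands separators.
        s = s.replace(",", "")
    return float(s)
```

Note that the rule cannot resolve values with a single separator and three trailing digits: `"1,234"` is read as German (1.234) and `"1.234"` as international (1.234), even if one thousand two hundred thirty-four was meant.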
+### Confidence Scoring
+- First pattern match = 1.0 confidence
+- Each subsequent pattern reduces confidence by 0.1
+- Range: 1.0 (first pattern) → 0.6 (fifth pattern)
+
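One way to read the scoring rule is as an ordered pattern list where the list position sets the confidence. The regexes and the function name below are hypothetical; only the labels and the 0.1 step come from these notes.

```python
import re

# Hypothetical patterns, ordered from most to least trusted.
INVOICE_NUMBER_PATTERNS = [
    r"Rechnungs-Nr\.?\s*:?\s*(\S+)",
    r"Invoice No\.?\s*:?\s*(\S+)",
    r"Beleg-Nr\.?\s*:?\s*(\S+)",
]


def extract_with_confidence(text, patterns):
    """Return (value, confidence) for the first matching pattern, else None."""
    for index, pattern in enumerate(patterns):
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            # The first pattern scores 1.0; each later one loses 0.1.
            return match.group(1), round(1.0 - 0.1 * index, 1)
    return None
```
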
+### German Date Format Conversion
+- Input: `04.02.2025` (DD.MM.YYYY)
+- Output: `2025-02-04` (ISO format YYYY-MM-DD)
+- Use `zfill(2)` to pad single digits: `4` → `04`
+
+### Test Docstrings are Necessary
+- Pytest's default report shows test node IDs rather than method docstrings
+- Docstrings remain useful for documenting each test's intent
+- Module/class docstrings provide organization context
+
+### Invoice Field Patterns (from spec)
+- invoice_number: "Rechnungs-Nr", "Invoice No", "Beleg-Nr", "Rechnung X/Y"
+- gross_amount: "Brutto", "Gesamtbetrag", "Total", "Endbetrag", "Summe"
+- net_amount: "Netto", "Rechnungsbetrag"
+- vat_amount: "MwSt", "USt", "Steuer"
+- invoice_date: "Rechnungsdatum", "Datum", "Invoice Date"
+- supplier_name: "Lieferant", "Verkäufer"
+
+### PDF Layout Variations
+- Real PDFs may have different field layouts than spec patterns
+- EN16931 sample uses "Bruttosumme" instead of "Brutto"
+- Patterns can be refined iteratively based on real data
+
+
+## [2026-02-04T20:45:00.000Z] Task 6: Utility Functions Implementation
+
+### UNECE Unit Code Mapping
+- UN/ECE unit codes standardized for cross-border trade documents
+- 17 common codes mapped to German translations:
+  - "C62", "H87", "PCE", "EA" → "Stück"
+  - "KGM" → "Kilogramm", "GRM" → "Gramm", "TNE" → "Tonne"
+  - "MTR" → "Meter", "KMT" → "Kilometer", "MTK" → "Quadratmeter"
+  - "LTR" → "Liter", "MLT" → "Milliliter"
+  - "DAY" → "Tag", "HUR" → "Stunde", "MON" → "Monat", "ANN" → "Jahr"
+  - "SET" → "Set"
+- Fallback: return original code if not found in dictionary
+
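The mapping and its fallback can be sketched as a plain dictionary lookup. `translate_unit_code` is named in the commit message, but its signature and the constant name `UNIT_CODES` are assumptions here.

```python
# The 17 codes listed above, mapped to their German labels.
UNIT_CODES = {
    "C62": "Stück", "H87": "Stück", "PCE": "Stück", "EA": "Stück",
    "KGM": "Kilogramm", "GRM": "Gramm", "TNE": "Tonne",
    "MTR": "Meter", "KMT": "Kilometer", "MTK": "Quadratmeter",
    "LTR": "Liter", "MLT": "Milliliter",
    "DAY": "Tag", "HUR": "Stunde", "MON": "Monat", "ANN": "Jahr",
    "SET": "Set",
}


def translate_unit_code(code: str) -> str:
    """Return the German label, or the original code if it is unknown."""
    return UNIT_CODES.get(code, code)
```
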
+### Floating Point Precision Handling
+- `amounts_match()` with hardcoded 0.01 EUR tolerance
+- Floating point arithmetic causes precision issues: `100.01 - 100.00 = 0.010000000000005116`
+- Solution: Add small epsilon margin (1e-10) to tolerance for robust comparison
+- Formula: `abs(actual - expected) <= tolerance + 1e-10`
+
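A short sketch of the comparison, assuming `amounts_match()` takes two floats. The constant names are illustrative.

```python
TOLERANCE = 0.01   # EUR, hardcoded as described above
EPSILON = 1e-10    # absorbs floating point noise at the boundary


def amounts_match(actual: float, expected: float) -> bool:
    """True if the amounts differ by at most 0.01 EUR (plus epsilon)."""
    return abs(actual - expected) <= TOLERANCE + EPSILON


# Without the epsilon, the boundary case fails because of float rounding.
naive_boundary = abs(100.01 - 100.00) <= TOLERANCE   # False
fixed_boundary = amounts_match(100.01, 100.00)       # True
```
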
+### German Number Format Parsing
+- German format: `1.234,56` (dot=thousands, comma=decimal)
+- Conversion: Remove dots, replace comma with dot
+- Single-line: `num_str.replace('.', '').replace(',', '.')`
+- Important: Remove thousands separator BEFORE replacing decimal separator
+
+### German Date Format Parsing
+- Input: `04.02.2025` (DD.MM.YYYY)
+- Output: `2025-02-04` (ISO format YYYY-MM-DD)
+- Validation: Check for 3 parts separated by dots before parsing
+- Pad single digits with `zfill(2)`: `4` → `04`
+
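The conversion steps above could be sketched as follows. The name `parse_german_date` and the choice to return `None` for anything that is not three numeric, dot-separated parts are assumptions; the notes do not say what the real function does with ISO or invalid input.

```python
from typing import Optional


def parse_german_date(date_str: str) -> Optional[str]:
    """Convert DD.MM.YYYY to YYYY-MM-DD; return None if the shape is wrong."""
    parts = date_str.strip().split(".")
    # Validate first: exactly three numeric parts separated by dots.
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return None
    day, month, year = parts
    # zfill(2) pads single-digit days and months.
    return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
```

This sketch checks only the shape of the input, not whether the day and month are valid calendar values.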
+### Standard Rounding (Not Banker's Rounding)
+- Python's `round()` uses banker's rounding (round half to even)
+- Task requires standard rounding (round half away from zero)
+- Solution: Use `Decimal` with `ROUND_HALF_UP`
+- Implementation:
+  ```python
+  from decimal import Decimal, ROUND_HALF_UP
+  quantizer = Decimal(1).scaleb(-places)  # places=2 -> Decimal("0.01"); also valid for places=0
+  float(Decimal(str(amount)).quantize(quantizer, rounding=ROUND_HALF_UP))
+  ```
+- Note: Use `str(amount)` when creating Decimal to avoid floating point issues
+
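As a self-contained illustration of the difference between the two rounding modes, here is a sketch with a hypothetical helper name, `round_standard`. It builds the quantizer with `Decimal.scaleb`, which is one of several equivalent ways to get the required exponent.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_standard(amount: float, places: int = 2) -> float:
    """Round half away from zero to the given number of decimal places."""
    # Decimal(1).scaleb(-2) is Decimal('0.01'); only its exponent matters.
    quantizer = Decimal(1).scaleb(-places)
    # str(amount) avoids carrying the float's binary error into Decimal.
    return float(Decimal(str(amount)).quantize(quantizer, rounding=ROUND_HALF_UP))


builtin_result = round(0.125, 2)         # 0.12 (round half to even)
standard_result = round_standard(0.125)  # 0.13 (round half up)
```
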
+### Test Coverage Patterns
+- Unit code translation: all 17 codes + unknown fallback
+- Amounts match: exact, within tolerance, at boundary, beyond tolerance, negative, zero
+- German numbers: integer, decimal, thousands, large, negative
+- German dates: standard, single digit, ISO format, invalid format
+- Rounding: default 2 places, custom places, rounding up/down, negative, zero
+
+### Decimal quantize Pattern
+- For N decimal places the quantizer needs exponent -N; `quantize()` uses only the exponent of its argument, not its digits
+  - 2 places: `Decimal("0.01")`
+  - 3 places: `Decimal("0.001")`
+  - 1 place: `Decimal("0.1")`
+
+### Nix Environment Testing
+- Pytest not installed in base Python environment
+- Use nix-shell for testing: `nix-shell -p python312Packages.pytest --run "pytest tests/test_utils.py -v"`
+- All tests must pass before marking task complete
+