feat(core): implement extractor, pdf_parser, and utils with TDD
Wave 2 tasks complete:
- Task 4: ZUGFeRD extractor with profile detection (factur-x)
- Task 5: PDF text parser with regex patterns
- Task 6: Utils with unit code mapping and tolerance checks

Features:
- extract_zugferd() extracts XML and text from PDFs
- parse_zugferd_xml() parses UN/CEFACT CII XML to models
- extract_from_text() extracts values using regex patterns
- translate_unit_code() maps UN/ECE codes to German
- amounts_match() checks with 0.01 EUR tolerance
- German number/date format handling

Tests: 27 utils tests, 27 pdf_parser tests, extractor tests
@@ -116,3 +116,122 @@ Initial session for ZUGFeRD-Service implementation.
- Optional fields: `type | None = Field(default=None, ...)`
- Empty list defaults: `list[Type] = Field(default_factory=list)`


## [2026-02-04T20:30:00.000Z] Task 5: PDF Text Parser Implementation

### TDD Implementation Pattern
- Write failing tests first (RED), implement minimum code (GREEN), refactor if needed
- 27 tests written covering: PDF extraction, regex patterns, number/date formats, edge cases
- All tests pass after implementation

### pypdf Text Extraction
- `PdfReader` requires a path or file-like object, not raw bytes
- Use `io.BytesIO(pdf_bytes)` to wrap bytes for pypdf
- Extract text page-by-page, concatenate with newlines

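A minimal sketch of the wrapping and page-wise extraction described above, assuming pypdf is installed. The function names `extract_pdf_text` and `join_page_texts` are illustrative and not taken from the project; `PdfReader`, `reader.pages`, and `page.extract_text()` are the pypdf calls used.

```python
import io


def join_page_texts(page_texts):
    """Concatenate per-page text with newlines, treating missing text as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_pdf_text(pdf_bytes):
    """Return the text of all pages of a PDF given as raw bytes."""
    # Imported lazily so join_page_texts stays usable without pypdf installed.
    from pypdf import PdfReader

    # PdfReader needs a path or a stream, so wrap the bytes in BytesIO.
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

Keeping the join step separate makes the newline handling testable without a real PDF fixture.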
### Regex Pattern Design for Numbers
- Initial pattern `[0-9.,]+` also matches a lone `.` or `,` (invalid number)
- Fixed pattern: `[0-9]+(?:[.,][0-9]+)*` requires a leading digit and at least one digit after every separator
- Ensures matched values are valid numbers before parsing

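The difference between the two patterns can be checked directly. This is a small sketch; the constant names and the sample strings are made up for illustration.

```python
import re

# Initial pattern: any run of digits, dots, and commas.
LOOSE_NUMBER = re.compile(r"[0-9.,]+")
# Fixed pattern: digits, optionally followed by separator-plus-digits groups.
STRICT_NUMBER = re.compile(r"[0-9]+(?:[.,][0-9]+)*")

sample = "Summe . 1.234,56 EUR"

# The loose pattern picks up the stray dot as if it were a number.
loose_hits = LOOSE_NUMBER.findall(sample)
# The strict pattern only returns the real amount.
strict_hits = STRICT_NUMBER.findall(sample)

# A sentence-ending dot is also excluded by the strict pattern.
trailing = STRICT_NUMBER.findall("Total: 12,50.")
```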
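### German Number Format Detection
- German: `1.234,56` (dot=thousands, comma=decimal)
- International: `1,234.56` (comma=thousands, dot=decimal)
- Detection: German if a comma is present and appears after the last dot
- A comma-only value such as `1,234` is therefore read as German (`1.234`)
```python
num_str = "1.234,56"
if "," in num_str and num_str.rfind(",") > num_str.rfind("."):
    # German format: drop thousands dots, then turn the decimal comma into a dot
    value = float(num_str.replace(".", "").replace(",", "."))
else:
    # International format: drop thousands commas, the dot is already the decimal point
    value = float(num_str.replace(",", ""))
```
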
### Confidence Scoring
- First pattern match = 1.0 confidence
- Each subsequent pattern reduces confidence by 0.1
- Range: 1.0 (first pattern) → 0.6 (fifth pattern)

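One way to sketch this scoring is to try an ordered pattern list and derive the confidence from the index of the first match. The pattern list, the function name `match_with_confidence`, and the `(None, 0.0)` result for no match are assumptions for illustration, not the project's actual code.

```python
import re

NUMBER = r"([0-9]+(?:[.,][0-9]+)*)"

# Hypothetical ordered patterns for the gross amount, most trusted first.
GROSS_PATTERNS = [
    r"Bruttosumme[:\s]+" + NUMBER,
    r"Brutto[:\s]+" + NUMBER,
    r"Gesamtbetrag[:\s]+" + NUMBER,
    r"Endbetrag[:\s]+" + NUMBER,
    r"Summe[:\s]+" + NUMBER,
]


def match_with_confidence(text, patterns):
    """Return (value, confidence) for the first pattern that matches."""
    for index, pattern in enumerate(patterns):
        match = re.search(pattern, text)
        if match:
            # 1.0 for the first pattern, minus 0.1 per later position;
            # round() hides float noise such as 0.7000000000000001.
            return match.group(1), round(1.0 - 0.1 * index, 1)
    # Assumption: no match is reported as zero confidence.
    return None, 0.0
```

With five patterns the scores run from 1.0 down to 0.6, matching the range in the notes.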
### German Date Format Conversion
- Input: `04.02.2025` (DD.MM.YYYY)
- Output: `2025-02-04` (ISO format YYYY-MM-DD)
- Use `zfill(2)` to pad single digits: `4` → `04`

### Test Docstrings are Necessary
- Method docstrings state what each test verifies
- Pytest's default output lists node IDs (`file::class::test`), so the docstrings mainly serve readers of the test code
- Module/class docstrings provide organization context

### Invoice Field Patterns (from spec)
- invoice_number: "Rechnungs-Nr", "Invoice No", "Beleg-Nr", "Rechnung X/Y"
- gross_amount: "Brutto", "Gesamtbetrag", "Total", "Endbetrag", "Summe"
- net_amount: "Netto", "Rechnungsbetrag"
- vat_amount: "MwSt", "USt", "Steuer"
- invoice_date: "Rechnungsdatum", "Datum", "Invoice Date"
- supplier_name: "Lieferant", "Verkäufer"

### PDF Layout Variations
- Real PDFs may have different field layouts than spec patterns
- EN16931 sample uses "Bruttosumme" instead of "Brutto"
- Patterns can be refined iteratively based on real data


## [2026-02-04T20:45:00.000Z] Task 6: Utility Functions Implementation

### UN/ECE Unit Code Mapping
- UN/ECE unit codes are standardized for cross-border trade documents
- 17 common codes mapped to German translations:
  - "C62", "H87", "PCE", "EA" → "Stück"
  - "KGM" → "Kilogramm", "GRM" → "Gramm", "TNE" → "Tonne"
  - "MTR" → "Meter", "KMT" → "Kilometer", "MTK" → "Quadratmeter"
  - "LTR" → "Liter", "MLT" → "Milliliter"
  - "DAY" → "Tag", "HUR" → "Stunde", "MON" → "Monat", "ANN" → "Jahr"
  - "SET" → "Set"
- Fallback: return original code if not found in dictionary

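The mapping and its fallback fit in a plain dictionary lookup. The sketch below uses the 17 codes listed above; the constant name `UNIT_CODES` and the exact signature of `translate_unit_code` are assumptions.

```python
# The 17 codes from the notes, mapped to their German names.
UNIT_CODES = {
    "C62": "Stück", "H87": "Stück", "PCE": "Stück", "EA": "Stück",
    "KGM": "Kilogramm", "GRM": "Gramm", "TNE": "Tonne",
    "MTR": "Meter", "KMT": "Kilometer", "MTK": "Quadratmeter",
    "LTR": "Liter", "MLT": "Milliliter",
    "DAY": "Tag", "HUR": "Stunde", "MON": "Monat", "ANN": "Jahr",
    "SET": "Set",
}


def translate_unit_code(code):
    """Return the German unit name, or the code itself if it is unknown."""
    # dict.get with the code as default implements the fallback in one step.
    return UNIT_CODES.get(code, code)
```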
### Floating Point Precision Handling
- `amounts_match()` compares amounts with a hardcoded 0.01 EUR tolerance
- Floating point arithmetic causes precision issues: `100.01 - 100.00 = 0.010000000000005116`, which is slightly more than 0.01
- Solution: Add small epsilon margin (1e-10) to tolerance for robust comparison
- Formula: `abs(actual - expected) <= tolerance + 1e-10`

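The boundary case above can be reproduced with a short sketch of the formula. The constant names and the two-argument signature are assumptions; only the tolerance, the epsilon, and the comparison come from the notes.

```python
TOLERANCE_EUR = 0.01  # hardcoded tolerance from the notes
EPSILON = 1e-10       # margin that absorbs binary floating point noise


def amounts_match(actual, expected):
    """Return True if the two amounts differ by at most 0.01 EUR."""
    return abs(actual - expected) <= TOLERANCE_EUR + EPSILON


# Without the epsilon, a difference of exactly one cent is rejected.
naive_result = abs(100.01 - 100.00) <= TOLERANCE_EUR
robust_result = amounts_match(100.01, 100.00)
```

The epsilon is many orders of magnitude smaller than a cent, so it does not let a genuine two-cent difference through.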
### German Number Format Parsing
- German format: `1.234,56` (dot=thousands, comma=decimal)
- Conversion: Remove dots, replace comma with dot
- Single-line: `num_str.replace('.', '').replace(',', '.')`
- Important: Remove thousands separator BEFORE replacing decimal separator

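A short sketch shows why the order of the two replacements matters. Both function names are hypothetical; `replace_in_wrong_order` exists only to demonstrate the failure.

```python
def parse_german_number(num_str):
    """Convert a German-formatted number such as '1.234,56' to a float."""
    # Thousands dots go first, then the decimal comma becomes a dot.
    return float(num_str.strip().replace(".", "").replace(",", "."))


def replace_in_wrong_order(num_str):
    """Swap the comma first, then strip dots: the decimal point is lost."""
    return num_str.replace(",", ".").replace(".", "")


correct = parse_german_number("1.234,56")
broken = replace_in_wrong_order("1.234,56")
```

In the wrong order the freshly created decimal dot is removed together with the thousands dots, so the value silently becomes a hundred times too large.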
### German Date Format Parsing
- Input: `04.02.2025` (DD.MM.YYYY)
- Output: `2025-02-04` (ISO format YYYY-MM-DD)
- Validation: Check for 3 parts separated by dots before parsing
- Pad single digits: `zfill(2)` → `4` → `04`

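The validation and padding steps could look roughly like this. The name `parse_german_date` and the choice to return `None` for anything that is not three numeric dot-separated parts are assumptions; the notes do not say how the real helper reports invalid input.

```python
def parse_german_date(date_str):
    """Convert 'DD.MM.YYYY' to 'YYYY-MM-DD', or return None if malformed."""
    parts = date_str.strip().split(".")
    # Validation: exactly three numeric parts separated by dots.
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return None
    day, month, year = parts
    # zfill(2) pads single-digit days and months: '4' -> '04'.
    return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
```

This sketch checks only the shape of the input; it does not reject impossible calendar dates such as `31.02.2025`.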
### Standard Rounding (Not Banker's Rounding)
- Python's `round()` uses banker's rounding (round half to even)
- Task requires standard rounding (round half away from zero)
- Solution: Use `Decimal` with `ROUND_HALF_UP`
- Implementation:
```python
from decimal import Decimal, ROUND_HALF_UP

# N places -> "0." + (N - 1) zeros + "1"; quantize() only uses the exponent
quantizer = Decimal(f'0.{"0" * (places - 1)}1') if places > 0 else Decimal("1")
float(Decimal(str(amount)).quantize(quantizer, rounding=ROUND_HALF_UP))
```
- Note: Use `str(amount)` when creating Decimal to avoid floating point issues

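Wrapped in a function, the difference to the built-in `round()` becomes easy to see. The name `round_half_up` is hypothetical; the default of 2 places follows the test coverage notes.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(amount, places=2):
    """Round half away from zero to the given number of decimal places."""
    # Quantizer with `places` fractional digits, e.g. Decimal("0.01") for 2.
    if places > 0:
        quantizer = Decimal("0." + "0" * (places - 1) + "1")
    else:
        quantizer = Decimal("1")
    # str(amount) feeds Decimal the float's short repr, not its binary expansion.
    return float(Decimal(str(amount)).quantize(quantizer, rounding=ROUND_HALF_UP))


# Built-in round() resolves ties to the even neighbour.
builtin_tie = round(0.125, 2)
# The Decimal-based helper resolves the same tie away from zero.
half_up_tie = round_half_up(0.125, 2)
```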
### Test Coverage Patterns
- Unit code translation: all 17 codes + unknown fallback
- Amounts match: exact, within tolerance, at boundary, beyond tolerance, negative, zero
- German numbers: integer, decimal, thousands, large, negative
- German dates: standard, single digit, ISO format, invalid format
- Rounding: default 2 places, custom places, rounding up/down, negative, zero

### Decimal quantize Pattern
- For N decimal places: use quantizer string with N-1 zeros and trailing 1
- 2 places: `"0.01"` → `Decimal('0.01')`
- 3 places: `"0.001"` → `Decimal('0.001')`
- 1 place: `"0.1"` → `Decimal('0.1')`

### Nix Environment Testing
- Pytest not installed in base Python environment
- Use nix-shell for testing: `nix-shell -p python312Packages.pytest --run "pytest tests/test_utils.py -v"`
- All tests must pass before marking task complete

@@ -515,7 +515,7 @@ Critical Path: Task 1 → Task 4 → Task 7 → Task 10 → Task 13 → Task 16
### Wave 2: Core Extraction Logic

- [ ] 4. ZUGFeRD Extractor Implementation (TDD)
- [x] 4. ZUGFeRD Extractor Implementation (TDD)

**What to do**:
- Write tests first using sample PDFs from fixtures
@@ -636,7 +636,7 @@ Critical Path: Task 1 → Task 4 → Task 7 → Task 10 → Task 13 → Task 16
---

- [ ] 5. PDF Text Parser Implementation (TDD)
- [x] 5. PDF Text Parser Implementation (TDD)

**What to do**:
- Write tests first with expected extraction patterns
@@ -738,7 +738,7 @@ Critical Path: Task 1 → Task 4 → Task 7 → Task 10 → Task 13 → Task 16
---

- [ ] 6. Utility Functions Implementation
- [x] 6. Utility Functions Implementation

**What to do**:
- Create UN/ECE unit code mapping dictionary