feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides

Sync template from 29ac25b → v0.9.0 (29 template commits). Due to template's
_subdirectory migration, new files were manually rendered rather than
auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:44:48 +01:00
parent b76e87a0b6
commit 18ee24818b
14 changed files with 1084 additions and 2 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -0,0 +1,106 @@
+# CLAUDE.md — Padelnomics
+
+This file tells Claude Code how to work in this repository.
+
+## Project Overview
+
+Padelnomics is a SaaS application built with Quart (async Python), HTMX, and SQLite.
+It includes a full data pipeline:
+
+```
+External APIs → extract → landing zone → SQLMesh transform → DuckDB → web app
+```
+
+**Packages** (uv workspace):
+- `web/` — Quart + HTMX web application (auth, billing, dashboard)
+- `extract/padelnomics_extract/` — data extraction to local landing zone
+- `transform/sqlmesh_padelnomics/` — 4-layer SQL transformation (raw → staging → foundation → serving)
+- `src/padelnomics/` — CLI utilities, export_serving helper
+
+## Skills: invoke these for domain tasks
+
+### Working on extraction or transformation?
+
+Use the **`data-engineer`** skill for:
+- Designing or reviewing SQLMesh model logic
+- Adding a new data source (extract + raw + staging models)
+- Performance tuning DuckDB queries
+- Data modeling decisions (dimensions, facts, aggregates)
+- Understanding the 4-layer architecture
+
+```
+/data-engineer  (or ask Claude to invoke it)
+```
+
+### Working on the web app UI or frontend?
+
+Use the **`frontend-design`** skill for UI components, templates, or dashboard layouts.
+
+### Working on payments or subscriptions?
+
+Use the **`paddle-integration`** skill for billing, webhooks, and subscription logic.
+
+## Key commands
+
+```bash
+# Install all dependencies
+uv sync --all-packages
+
+# Lint & format
+ruff check .
+ruff format .
+
+# Run tests
+uv run pytest tests/ -v
+
+# Dev server
+./scripts/dev_run.sh
+
+# Extract data
+LANDING_DIR=data/landing uv run extract
+
+# SQLMesh plan + run (from repo root)
+uv run sqlmesh -p transform/sqlmesh_padelnomics plan
+uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod
+
+# Export serving tables (run after SQLMesh)
+DUCKDB_PATH=local.duckdb SERVING_DUCKDB_PATH=analytics.duckdb \
+    uv run python -m padelnomics.export_serving
+```
+
+## Architecture documentation
+
+| Topic | File |
+|-------|------|
+| Extraction patterns, state tracking, adding new sources | `extract/padelnomics_extract/README.md` |
+| 4-layer SQLMesh architecture, materialization strategy | `transform/sqlmesh_padelnomics/README.md` |
+| Two-file DuckDB architecture (SQLMesh lock isolation) | `src/padelnomics/export_serving.py` docstring |
+
+## Pipeline data flow
+
+```
+data/landing/
+  └── padelnomics/{year}/{etag}.csv.gz   ← extraction output
+
+local.duckdb                ← SQLMesh exclusive (raw → staging → foundation → serving)
+
+analytics.duckdb            ← serving tables only, web app read-only
+  └── serving.*             ← atomically replaced by export_serving.py
+```
+
+## Environment variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `LANDING_DIR` | `data/landing` | Landing zone root (extraction writes here) |
+| `DUCKDB_PATH` | `local.duckdb` | SQLMesh pipeline DB (exclusive write) |
+| `SERVING_DUCKDB_PATH` | `analytics.duckdb` | Read-only DB for web app |
+
+## Coding philosophy
+
+- **Simple and procedural** — functions over classes, no "Manager" patterns
+- **Idempotent operations** — running twice produces the same result
+- **Explicit assertions** — assert preconditions at function boundaries
+- **Bounded operations** — set timeouts, page limits, buffer sizes
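As a rough illustration of the idempotency and assertion bullets above, the sketch below writes one landing-zone file using the `padelnomics/{year}/{etag}.csv.gz` layout shown under "Pipeline data flow". The function name `write_landing_file` and the rule that an identical etag implies identical content are assumptions made for this example, not the extractor's actual code.

```python
import gzip
import os
import tempfile
from pathlib import Path


def write_landing_file(landing_dir: Path, year: int, etag: str, payload: bytes) -> Path:
    """Write payload to {landing_dir}/padelnomics/{year}/{etag}.csv.gz at most once."""
    # Explicit assertions: reject programmer errors at the function boundary.
    assert len(etag) > 0, "etag must not be empty"
    assert len(payload) > 0, "payload must not be empty"

    target = landing_dir / "padelnomics" / str(year) / f"{etag}.csv.gz"
    if target.exists():
        # Assumption: same etag means same content, so a second run is a no-op.
        return target

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename, so a reader
    # never observes a half-written file under the final name.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
        gz.write(payload)
    os.replace(tmp_name, target)
    return target
```

Calling the function twice with the same arguments leaves exactly one file on disk, which is the "running twice produces the same result" property.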
+
+Read `coding_philosophy.md` (if present) for the full guide.
--- a/.claude/coding_philosophy.md
+++ b/.claude/coding_philosophy.md
@@ -0,0 +1,542 @@
+# Coding Philosophy & Engineering Principles
+
+This document defines the coding philosophy and engineering principles that guide all agent work. All agents should internalize and follow these principles.
+
+Influenced by Casey Muratori, Jonathan Blow, and [TigerStyle](https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md) (adapted for Python/SQL).
+
+<core_philosophy>
+**Simple, Direct, Procedural Code**
+
+- Solve the actual problem, not the general case
+- Understand what the computer is doing
+- Explicit is better than clever
+- Code should be obvious, not impressive
+- Do it right the first time — feature gaps are acceptable, but what ships must meet design goals
+</core_philosophy>
+
+<code_style>
+
+<functions_over_classes>
+**Prefer:**
+- Pure functions that transform data
+- Simple procedures that do clear things
+- Explicit data structures (dicts, lists, named tuples)
+
+**Avoid:**
+- Classes that are just namespaces for functions
+- Objects hiding behavior behind methods
+- Inheritance hierarchies
+- "Manager" or "Handler" classes
+
+**Example - Good:**
+```python
+def calculate_user_metrics(events: list[dict]) -> dict:
+    """Calculate metrics from event list."""
+    total = len(events)
+    unique_sessions = len(set(e['session_id'] for e in events))
+
+    return {
+        'total_events': total,
+        'unique_sessions': unique_sessions,
+        'events_per_session': total / unique_sessions if unique_sessions > 0 else 0
+    }
+```
+
+**Example - Bad:**
+```python
+class UserMetricsCalculator:
+    def __init__(self):
+        self._events = []
+
+    def add_events(self, events: list[dict]):
+        self._events.extend(events)
+
+    def calculate(self) -> UserMetrics:
+        return UserMetrics(
+            total=self._calculate_total(),
+            sessions=self._calculate_sessions()
+        )
+```
+</functions_over_classes>
+
+<data_oriented_design>
+**Think about the data:**
+- What's the shape of the data?
+- How does it flow through the system?
+- What transformations are needed?
+- What's the memory layout?
+
+**Data is just data:**
+- Use simple structures (dicts, lists, tuples)
+- Don't hide data behind getters/setters
+- Make data transformations explicit
+- Consider performance implications
+
+**Example - Good:**
+```python
+# Data is data, functions transform it
+users = [
+    {'id': 1, 'name': 'Alice', 'active': True},
+    {'id': 2, 'name': 'Bob', 'active': False},
+]
+
+def filter_active(users: list[dict]) -> list[dict]:
+    return [u for u in users if u['active']]
+
+active_users = filter_active(users)
+```
+
+**Example - Bad:**
+```python
+# Data hidden behind objects
+class User:
+    def __init__(self, id, name, active):
+        self._id = id
+        self._name = name
+        self._active = active
+
+    def get_name(self):
+        return self._name
+
+    def is_active(self):
+        return self._active
+
+users = [User(1, 'Alice', True), User(2, 'Bob', False)]
+active_users = [u for u in users if u.is_active()]
+```
+</data_oriented_design>
+
+<keep_it_simple>
+**Simple control flow:**
+- Straightforward if/else over clever tricks
+- Explicit loops over list comprehensions when clearer
+- Early returns to reduce nesting
+- Avoid deeply nested logic
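A small sketch of the early-return style described in the list above. The order fields, the free-shipping threshold, and the flat fee are hypothetical values chosen only to make the control flow visible.

```python
FREE_SHIPPING_THRESHOLD_CENTS = 5000  # hypothetical threshold
FLAT_SHIPPING_FEE_CENTS = 499  # hypothetical fee


def shipping_cost_cents(order: dict) -> int:
    """Return the shipping cost in cents for a hypothetical order dict."""
    # Each special case exits immediately, so the main path is never nested.
    if order["item_count"] == 0:
        return 0
    if order["is_pickup"]:
        return 0
    if order["total_cents"] >= FREE_SHIPPING_THRESHOLD_CENTS:
        return 0

    return FLAT_SHIPPING_FEE_CENTS
```

The same logic written as nested `if`/`else` blocks would push the common case three levels deep.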
+
+**Simple naming:**
+- Descriptive variable names (`user_count` not `uc`)
+- Function names that say what they do (`calculate_total` not `process`)
+- No abbreviations unless universal (`id`, `url`, `sql`)
+- Include units in names: `timeout_seconds`, `size_bytes`, `latency_ms` — not `timeout`, `size`, `latency`
+- Place qualifiers last in descending significance: `latency_ms_max` not `max_latency_ms` (aligns related variables)
+
+**Simple structure:**
+- Functions should do one thing
+- Keep functions short (20-50 lines, hard limit ~70 — must fit on screen without scrolling)
+- If it's getting complex, break it up
+- But don't break it up "just because"
+</keep_it_simple>
+
+<minimize_variable_scope>
+**Declare variables close to where they're used:**
+- Don't introduce variables before they're needed
+- Remove them when no longer relevant
+- Minimize the number of variables in scope at any point
+- Reduces probability of stale-state bugs (check something in one place, use it in another)
+
+**Don't duplicate state:**
+- One source of truth for each piece of data
+- Don't create aliases or copies that can drift out of sync
+- If you compute a value, use it directly — don't store it in a variable you'll use 50 lines later
+</minimize_variable_scope>
+
+</code_style>
+
+<architecture_principles>
+
+<build_minimum_that_works>
+**Start simple:**
+- Solve the immediate problem
+- Don't build for imagined future requirements
+- Add complexity only when actually needed
+- Prefer obvious solutions over clever ones
+
+**Avoid premature abstraction:**
+- Duplication is okay early on
+- Abstract only when pattern is clear
+- Three examples before abstracting
+- Question every layer of indirection
+
+**Zero technical debt:**
+- Do it right the first time
+- A problem solved in design costs less than one solved in implementation, which costs less than one solved in production
+- Feature gaps are acceptable; broken or half-baked code is not
+</build_minimum_that_works>
+
+<explicit_over_implicit>
+**Be explicit about:**
+- Where data comes from
+- What transformations happen
+- Error conditions and handling
+- Dependencies and side effects
+
+**Avoid magic:**
+- Framework conventions that hide behavior
+- Implicit configuration
+- Action-at-a-distance
+- Metaprogramming tricks
+- Relying on library defaults — pass options explicitly at call site
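One way to read the last bullet, sketched with the standard-library `csv` module: every format option is passed at the call site instead of being inherited from defaults. The function name `rows_to_csv_text` is an assumption for this example.

```python
import csv
import io


def rows_to_csv_text(rows: list[tuple[str, int]]) -> str:
    """Serialize rows to CSV text with every format option spelled out."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",  # the csv module's default is "\r\n"
    )
    writer.writerows(rows)
    return buffer.getvalue()
```

A reader of this call does not need to remember the library defaults to know what the output looks like.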
+</explicit_over_implicit>
+
+<set_limits_on_everything>
+**Nothing should run unbounded:**
+- Set max retries on network calls
+- Set timeouts on all external requests
+- Bound loop iterations where data size is unknown
+- Set max page counts on paginated API fetches
+- Cap queue/buffer sizes
+
+**Why:** Unbounded operations cause tail latency spikes, resource exhaustion, and silent hangs. A system that fails loudly at a known limit is better than one that degrades mysteriously.
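The limits above can be sketched as a paginated fetch with a hard page cap and a hard retry cap. `fetch_page` is a hypothetical callable standing in for a real API client, and it is assumed to set its own request timeout; the constants are illustrative, not project settings.

```python
import time
from typing import Callable

PAGES_MAX = 100  # illustrative limit
RETRIES_MAX = 3  # illustrative limit


def fetch_all_pages(
    fetch_page: Callable[[int], list[dict]],
    pages_max: int = PAGES_MAX,
    retries_max: int = RETRIES_MAX,
    backoff_seconds: float = 0.0,
) -> list[dict]:
    """Collect rows page by page until an empty page, within hard limits."""
    assert pages_max > 0, "pages_max must be positive"
    assert retries_max > 0, "retries_max must be positive"

    rows: list[dict] = []
    for page_number in range(1, pages_max + 1):
        page = None
        for attempt in range(1, retries_max + 1):
            try:
                page = fetch_page(page_number)
                break
            except ConnectionError:
                if attempt == retries_max:
                    raise  # fail loudly at the known retry limit
                time.sleep(backoff_seconds * attempt)
        if not page:
            return rows  # an empty page marks the end of the data
        rows.extend(page)

    # Every allowed page was non-empty: stop instead of looping forever.
    raise RuntimeError(f"pagination exceeded pages_max={pages_max}")
```

Hitting either limit raises instead of returning partial data, so a misbehaving API shows up as a visible failure.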
+</set_limits_on_everything>
+
+<question_dependencies>
+**Before adding a library:**
+- Can I write this simply myself?
+- What's the complexity budget?
+- Am I using 5% of a large framework?
+- Is this solving my actual problem?
+
+**Prefer:**
+- Standard library when possible
+- Small, focused libraries
+- Direct solutions
+- Understanding what code does
+
+**Approved dependencies (earn their place):**
+- `msgspec` — struct types and validation at system boundaries (external APIs, user input,
+  inter-process data). Use `msgspec.Struct` instead of dataclasses when you need: fast
+  encode/decode, built-in validation, or typed containers for boundary data.
+  **Rule:** use Structs at boundaries (API responses, HAR entries, MCP tool I/O) —
+  keep internal plumbing as plain dicts/tuples.
+</question_dependencies>
+
+</architecture_principles>
+
+<performance_consciousness>
+
+<think_about_the_computer>
+**Understand:**
+- Memory layout matters
+- Cache locality matters
+- Allocations have cost
+- Loops over data can be fast or slow
+
+**Common issues:**
+- N+1 queries (database or API)
+- Nested loops over large data
+- Copying large structures unnecessarily
+- Loading entire datasets into memory
+</think_about_the_computer>
+
+<design_phase_performance>
+**Think about performance upfront during design, not just after profiling:**
+- The largest wins (100-1000x) happen in the design phase
+- Back-of-envelope sketch: estimate load across network, disk, memory, CPU
+- Optimize for the slowest resource first (typically network, then disk, then memory, then CPU)
+- Compensate for frequency — a cheap operation called 10M times can dominate
+
+**Batching:**
+- Amortize costs via batching (network calls, disk writes, database inserts)
+- One batch insert of 1000 rows beats 1000 individual inserts
+- Distinguish control plane (rare, can be slow) from data plane (hot path, must be fast)
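A minimal sketch of the batching idea using the standard-library `sqlite3` module. The `events` table, its columns, and the batch size are assumptions for illustration.

```python
import sqlite3

BATCH_ROWS_MAX = 1000  # illustrative batch size


def insert_events_batched(conn: sqlite3.Connection, events: list[tuple[str, str]]) -> int:
    """Insert (user_id, event) rows in bounded batches; return rows inserted."""
    inserted = 0
    for start in range(0, len(events), BATCH_ROWS_MAX):
        batch = events[start : start + BATCH_ROWS_MAX]
        with conn:  # one transaction per batch, committed on success
            conn.executemany(
                "INSERT INTO events (user_id, event) VALUES (?, ?)", batch
            )
        inserted += len(batch)

    assert inserted == len(events), "batching must not drop rows"
    return inserted
```

Each batch costs one transaction rather than one per row, and the batch size cap keeps memory and lock time bounded.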
+
+**But don't prematurely optimize implementation details:**
+- Design for performance, then measure before micro-optimizing
+- Make it work, then make it fast
+- Optimize the hot path, not everything
+</design_phase_performance>
+
+</performance_consciousness>
+
+<assertions_and_invariants>
+
+<use_assertions_as_documentation>
+**Assert preconditions, postconditions, and invariants — especially in data pipelines:**
+
+```python
+def normalize_prices(prices: list[dict], currency: str) -> list[dict]:
+    assert len(prices) > 0, "prices must not be empty"
+    assert currency in ("USD", "EUR", "BRL"), f"unsupported currency: {currency}"
+
+    result = [convert_price(p, currency) for p in prices]
+
+    assert len(result) == len(prices), "normalization must not drop rows"
+    assert all(r['currency'] == currency for r in result), "all prices must be in target currency"
+    return result
+```
+
+**Guidelines:**
+- Assert function arguments and return values at boundaries
+- Assert data quality: row counts, non-null columns, expected ranges
+- Use assertions to document surprising or critical invariants
+- Split compound assertions: `assert a; assert b` not `assert a and b` (clearer error messages)
+- Assertions catch programmer errors — they should never be used for expected runtime conditions (use if/else for those)
+</use_assertions_as_documentation>
+
+</assertions_and_invariants>
+
+<sql_and_data>
+
+<keep_logic_in_sql>
+**Good:**
+```sql
+-- Logic is clear, database does the work
+SELECT
+    user_id,
+    COUNT(*) as event_count,
+    COUNT(DISTINCT session_id) as session_count,
+    MAX(event_time) as last_active
+FROM events
+WHERE event_time >= CURRENT_DATE - 30
+GROUP BY user_id
+HAVING COUNT(*) >= 10
+```
+
+**Bad:**
+```python
+# Pulling too much data, doing work in Python
+events = db.query("SELECT * FROM events WHERE event_time >= CURRENT_DATE - 30")
+user_events = {}
+for event in events:  # Could be millions of rows!
+    if event.user_id not in user_events:
+        user_events[event.user_id] = []
+    user_events[event.user_id].append(event)
+
+results = []
+for user_id, events in user_events.items():
+    if len(events) >= 10:
+        results.append({'user_id': user_id, 'count': len(events)})
+```
+</keep_logic_in_sql>
+
+<sql_best_practices>
+**Write readable SQL:**
+- Use CTEs for complex queries
+- One concept per CTE
+- Descriptive CTE names
+- Comments for non-obvious logic
+
+**Example:**
+```sql
+WITH active_users AS (
+    -- Users who logged in within last 30 days
+    SELECT DISTINCT user_id
+    FROM login_events
+    WHERE login_time >= CURRENT_DATE - 30
+),
+
+user_activity AS (
+    -- Count events for active users
+    SELECT
+        e.user_id,
+        COUNT(*) as event_count
+    FROM events e
+    INNER JOIN active_users au ON e.user_id = au.user_id
+    GROUP BY e.user_id
+)
+
+SELECT
+    user_id,
+    event_count,
+    event_count / 30.0 as avg_daily_events
+FROM user_activity
+ORDER BY event_count DESC
+```
+</sql_best_practices>
+
+</sql_and_data>
+
+<error_handling>
+
+<be_explicit_about_errors>
+**Handle errors explicitly:**
+```python
+def get_user(user_id: str) -> dict | None:
+    """Get user by ID. Returns None if not found."""
+    result = db.query("SELECT * FROM users WHERE id = ?", [user_id])
+    return result[0] if result else None
+
+def process_user(user_id: str) -> dict | None:
+    user = get_user(user_id)
+    if user is None:
+        logger.warning(f"User {user_id} not found")
+        return None
+
+    # Process user... (placeholder: the record is returned unchanged)
+    return user
+```
+
+**Don't hide errors:**
+```python
+# Bad - silently catches everything
+try:
+    result = do_something()
+except:
+    result = None
+
+# Good - explicit about what can fail
+try:
+    result = do_something()
+except ValueError as e:
+    logger.error(f"Invalid value: {e}")
+    raise
+except ConnectionError as e:
+    logger.error(f"Connection failed: {e}")
+    return None
+```
+</be_explicit_about_errors>
+
+<fail_fast>
+- Validate inputs at boundaries
+- Check preconditions early
+- Return early on error conditions
+- Don't let bad data propagate
+- All errors must be handled: in one study of production failures in distributed systems (Yuan et al., OSDI 2014), 92% of catastrophic failures resulted from incorrect handling of non-fatal errors
+</fail_fast>
+
+</error_handling>
+
+<anti_patterns>
+
+<over_engineering>
+- Repository pattern for simple CRUD
+- Service layer that just calls the database
+- Dependency injection containers
+- Abstract factories for concrete things
+- Interfaces with one implementation
+</over_engineering>
+
+<framework_magic>
+- ORM hiding N+1 queries
+- Decorators doing complex logic
+- Metaclass magic
+- Convention over configuration (when it hides behavior)
+</framework_magic>
+
+<premature_abstraction>
+- Creating interfaces "for future flexibility"
+- Generics for specific use cases
+- Configuration files for hardcoded values
+- Plugin systems for known features
+</premature_abstraction>
+
+<unnecessary_complexity>
+- Class hierarchies for classification
+- Design patterns "just because"
+- Microservices for a small app
+- Message queues for synchronous operations
+</unnecessary_complexity>
+
+</anti_patterns>
+
+<testing_philosophy>
+
+<test_behavior_not_implementation>
+**Focus on:**
+- What the function does (inputs → outputs)
+- Edge cases and boundaries
+- Error conditions
+- Data transformations
+
+**Don't test:**
+- Private implementation details
+- Framework internals
+- External libraries
+- Simple property access
+</test_behavior_not_implementation>
+
+<keep_tests_simple>
+```python
+def test_user_aggregation():
+    # Arrange - simple, clear test data
+    events = [
+        {'user_id': 'u1', 'event': 'click'},
+        {'user_id': 'u1', 'event': 'view'},
+        {'user_id': 'u2', 'event': 'click'},
+    ]
+
+    # Act - call the function
+    result = aggregate_user_events(events)
+
+    # Assert - check the behavior
+    assert result == {'u1': 2, 'u2': 1}
+```
+</keep_tests_simple>
+
+<test_both_spaces>
+**Test positive and negative space:**
+- Test valid inputs produce correct outputs (positive space)
+- Test invalid inputs are rejected or handled correctly (negative space)
+- For data pipelines: test with realistic data samples AND with malformed/missing data
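A sketch of what testing both spaces might look like for a small parser. `parse_price_cents` is a hypothetical helper written for this example; under pytest, the negative-space check would more commonly use `pytest.raises`.

```python
def parse_price_cents(text: str) -> int:
    """Parse a decimal price like '12.50' into integer cents (hypothetical helper)."""
    whole, _, fraction = text.strip().partition(".")
    if not whole.isdecimal():
        raise ValueError(f"invalid price: {text!r}")
    if fraction and not fraction.isdecimal():
        raise ValueError(f"invalid price: {text!r}")
    if len(fraction) > 2:
        raise ValueError(f"too many decimal places: {text!r}")
    return int(whole) * 100 + int(fraction.ljust(2, "0"))


def test_parse_price_positive_space():
    # Valid inputs produce the correct output.
    assert parse_price_cents("12.50") == 1250
    assert parse_price_cents("3") == 300
    assert parse_price_cents(" 0.5 ") == 50


def test_parse_price_negative_space():
    # Malformed inputs are rejected instead of silently becoming numbers.
    for bad in ["", "abc", "1.234", "-5", "1,50"]:
        try:
            parse_price_cents(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```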
+</test_both_spaces>
+
+<integration_tests_often_more_valuable>
+- Test with real database (DuckDB is fast)
+- Test actual SQL queries
+- Test end-to-end flows
+- Use realistic data samples
+</integration_tests_often_more_valuable>
+
+</testing_philosophy>
+
+<comments_and_documentation>
+
+<when_to_comment>
+**Comment the "why":**
+```python
+# Use binary search because list is sorted and can be large (1M+ items)
+index = binary_search(sorted_items, target)
+
+# Cache for 5 minutes - balance freshness vs database load
+@cache(ttl=300)
+def get_user_stats(user_id):
+    ...
+```
+
+**Don't comment the "what":**
+```python
+# Bad - code is self-explanatory
+# Increment the counter
+counter += 1
+
+# Good - code is clear on its own
+counter += 1
+```
+
+**Always motivate decisions:**
+- Explain why you wrote code the way you did
+- Code alone isn't documentation — the reasoning matters
+- Comments are well-written prose, not margin scribblings
+</when_to_comment>
+
+<self_documenting_code>
+- Use descriptive names
+- Keep functions focused
+- Make data flow obvious
+- Structure for readability
+</self_documenting_code>
+
+</comments_and_documentation>
+
+<summary>
+**Key Principles:**
+1. **Simple, direct, procedural** — functions over classes
+2. **Data-oriented** — understand the data and its flow
+3. **Explicit over implicit** — no magic, no hiding
+4. **Build minimum that works** — solve actual problems, zero technical debt
+5. **Performance conscious** — design for performance, then measure before micro-optimizing
+6. **Keep logic in SQL** — let the database do the work
+7. **Handle errors explicitly** — no silent failures, all errors handled
+8. **Assert invariants** — use assertions to document and enforce correctness
+9. **Set limits on everything** — nothing runs unbounded
+10. **Question abstractions** — every layer needs justification
+
+**Ask yourself:**
+- Is this the simplest solution?
+- Can someone else understand this?
+- What is the computer actually doing?
+- Am I solving the real problem?
+- What are the bounds on this operation?
+
+When in doubt, go simpler.
+</summary>