feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides

Sync template from 29ac25b → v0.9.0 (29 template commits). Due to
template's _subdirectory migration, new files were manually rendered
rather than auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes CLAUDE.md gitignore to root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Deeman
Date: 2026-02-22 15:44:48 +01:00
Parent: b76e87a0b6
Commit: 18ee24818b
14 changed files with 1084 additions and 2 deletions

.claude/CLAUDE.md (new file)

@@ -0,0 +1,106 @@
# CLAUDE.md — Padelnomics
This file tells Claude Code how to work in this repository.
## Project Overview
Padelnomics is a SaaS application built with Quart (async Python), HTMX, and SQLite.
It includes a full data pipeline:
```
External APIs → extract → landing zone → SQLMesh transform → DuckDB → web app
```
**Packages** (uv workspace):
- `web/` — Quart + HTMX web application (auth, billing, dashboard)
- `extract/padelnomics_extract/` — data extraction to local landing zone
- `transform/sqlmesh_padelnomics/` — 4-layer SQL transformation (raw → staging → foundation → serving)
- `src/padelnomics/` — CLI utilities, export_serving helper
## Skills: invoke these for domain tasks
### Working on extraction or transformation?
Use the **`data-engineer`** skill for:
- Designing or reviewing SQLMesh model logic
- Adding a new data source (extract + raw + staging models)
- Performance tuning DuckDB queries
- Data modeling decisions (dimensions, facts, aggregates)
- Understanding the 4-layer architecture
```
/data-engineer (or ask Claude to invoke it)
```
### Working on the web app UI or frontend?
Use the **`frontend-design`** skill for UI components, templates, or dashboard layouts.
### Working on payments or subscriptions?
Use the **`paddle-integration`** skill for billing, webhooks, and subscription logic.
## Key commands
```bash
# Install all dependencies
uv sync --all-packages
# Lint & format
ruff check .
ruff format .
# Run tests
uv run pytest tests/ -v
# Dev server
./scripts/dev_run.sh
# Extract data
LANDING_DIR=data/landing uv run extract
# SQLMesh: plan changes in dev, then apply to prod (from repo root)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod
# Export serving tables (run after SQLMesh)
DUCKDB_PATH=local.duckdb SERVING_DUCKDB_PATH=analytics.duckdb \
uv run python -m padelnomics.export_serving
```
## Architecture documentation
| Topic | File |
|-------|------|
| Extraction patterns, state tracking, adding new sources | `extract/padelnomics_extract/README.md` |
| 4-layer SQLMesh architecture, materialization strategy | `transform/sqlmesh_padelnomics/README.md` |
| Two-file DuckDB architecture (SQLMesh lock isolation) | `src/padelnomics/export_serving.py` docstring |
## Pipeline data flow
```
data/landing/
└── padelnomics/{year}/{etag}.csv.gz    ← extraction output

local.duckdb        ← SQLMesh exclusive (raw → staging → foundation → serving)
analytics.duckdb    ← serving tables only, web app read-only
└── serving.*       ← atomically replaced by export_serving.py
```
## Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Landing zone root (extraction writes here) |
| `DUCKDB_PATH` | `local.duckdb` | SQLMesh pipeline DB (exclusive write) |
| `SERVING_DUCKDB_PATH` | `analytics.duckdb` | Read-only DB for web app |
## Coding philosophy
- **Simple and procedural** — functions over classes, no "Manager" patterns
- **Idempotent operations** — running twice produces the same result
- **Explicit assertions** — assert preconditions at function boundaries
- **Bounded operations** — set timeouts, page limits, buffer sizes
Read `coding_philosophy.md` (if present) for the full guide.

.claude/coding_philosophy.md (new file)

@@ -0,0 +1,542 @@
# Coding Philosophy & Engineering Principles
This document defines the coding philosophy and engineering principles that guide all agent work. All agents should internalize and follow these principles.
Influenced by Casey Muratori, Jonathan Blow, and [TigerStyle](https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md) (adapted for Python/SQL).
<core_philosophy>
**Simple, Direct, Procedural Code**
- Solve the actual problem, not the general case
- Understand what the computer is doing
- Explicit is better than clever
- Code should be obvious, not impressive
- Do it right the first time — feature gaps are acceptable, but what ships must meet design goals
</core_philosophy>
<code_style>
<functions_over_classes>
**Prefer:**
- Pure functions that transform data
- Simple procedures that do clear things
- Explicit data structures (dicts, lists, named tuples)
**Avoid:**
- Classes that are just namespaces for functions
- Objects hiding behavior behind methods
- Inheritance hierarchies
- "Manager" or "Handler" classes
**Example - Good:**
```python
def calculate_user_metrics(events: list[dict]) -> dict:
    """Calculate metrics from event list."""
    total = len(events)
    unique_sessions = len(set(e['session_id'] for e in events))
    return {
        'total_events': total,
        'unique_sessions': unique_sessions,
        'events_per_session': total / unique_sessions if unique_sessions > 0 else 0,
    }
```
**Example - Bad:**
```python
class UserMetricsCalculator:
    def __init__(self):
        self._events = []

    def add_events(self, events: list[dict]):
        self._events.extend(events)

    def calculate(self) -> UserMetrics:
        return UserMetrics(
            total=self._calculate_total(),
            sessions=self._calculate_sessions(),
        )
```
</functions_over_classes>
<data_oriented_design>
**Think about the data:**
- What's the shape of the data?
- How does it flow through the system?
- What transformations are needed?
- What's the memory layout?
**Data is just data:**
- Use simple structures (dicts, lists, tuples)
- Don't hide data behind getters/setters
- Make data transformations explicit
- Consider performance implications
**Example - Good:**
```python
# Data is data, functions transform it
users = [
    {'id': 1, 'name': 'Alice', 'active': True},
    {'id': 2, 'name': 'Bob', 'active': False},
]

def filter_active(users: list[dict]) -> list[dict]:
    return [u for u in users if u['active']]

active_users = filter_active(users)
```
**Example - Bad:**
```python
# Data hidden behind objects
class User:
    def __init__(self, id, name, active):
        self._id = id
        self._name = name
        self._active = active

    def get_name(self):
        return self._name

    def is_active(self):
        return self._active

users = [User(1, 'Alice', True), User(2, 'Bob', False)]
active_users = [u for u in users if u.is_active()]
```
</data_oriented_design>
<keep_it_simple>
**Simple control flow:**
- Straightforward if/else over clever tricks
- Explicit loops over list comprehensions when clearer
- Early returns to reduce nesting
- Avoid deeply nested logic
**Simple naming:**
- Descriptive variable names (`user_count` not `uc`)
- Function names that say what they do (`calculate_total` not `process`)
- No abbreviations unless universal (`id`, `url`, `sql`)
- Include units in names: `timeout_seconds`, `size_bytes`, `latency_ms` — not `timeout`, `size`, `latency`
- Place qualifiers last in descending significance: `latency_ms_max` not `max_latency_ms` (aligns related variables)
**Simple structure:**
- Functions should do one thing
- Keep functions short (20-50 lines, hard limit ~70 — must fit on screen without scrolling)
- If it's getting complex, break it up
- But don't break it up "just because"
</keep_it_simple>
<minimize_variable_scope>
**Declare variables close to where they're used:**
- Don't introduce variables before they're needed
- Remove them when no longer relevant
- Minimize the number of variables in scope at any point
- Reduces probability of stale-state bugs (check something in one place, use it in another)
**Don't duplicate state:**
- One source of truth for each piece of data
- Don't create aliases or copies that can drift out of sync
- If you compute a value, use it directly — don't store it in a variable you'll use 50 lines later
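A small illustrative sketch (`count_active_sessions` is a hypothetical function, not from this codebase):

```python
def count_active_sessions(events: list[dict]) -> int:
    # Good: each value is computed right where it's used
    active_ids = {e['session_id'] for e in events if e['active']}
    return len(active_ids)

# Bad: the same set computed at the top of a long function, mutated in
# one branch, then read 50 lines later; by then nothing guarantees it
# still reflects the current state.
```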
</minimize_variable_scope>
</code_style>
<architecture_principles>
<build_minimum_that_works>
**Start simple:**
- Solve the immediate problem
- Don't build for imagined future requirements
- Add complexity only when actually needed
- Prefer obvious solutions over clever ones
**Avoid premature abstraction:**
- Duplication is okay early on
- Abstract only when pattern is clear
- Three examples before abstracting
- Question every layer of indirection
**Zero technical debt:**
- Do it right the first time
- A problem solved in design costs less than one solved in implementation, which costs less than one solved in production
- Feature gaps are acceptable; broken or half-baked code is not
</build_minimum_that_works>
<explicit_over_implicit>
**Be explicit about:**
- Where data comes from
- What transformations happen
- Error conditions and handling
- Dependencies and side effects
**Avoid magic:**
- Framework conventions that hide behavior
- Implicit configuration
- Action-at-a-distance
- Metaprogramming tricks
- Relying on library defaults — pass options explicitly at call site
</explicit_over_implicit>
<set_limits_on_everything>
**Nothing should run unbounded:**
- Set max retries on network calls
- Set timeouts on all external requests
- Bound loop iterations where data size is unknown
- Set max page counts on paginated API fetches
- Cap queue/buffer sizes
**Why:** Unbounded operations cause tail latency spikes, resource exhaustion, and silent hangs. A system that fails loudly at a known limit is better than one that degrades mysteriously.
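A minimal sketch of a bounded pagination loop; `fetch_page` is a stand-in for any paginated client call (hypothetical name, not a real API):

```python
MAX_PAGES = 100        # hard bound on pagination
TIMEOUT_SECONDS = 30   # per-request timeout, passed explicitly

def fetch_all_pages(fetch_page) -> list[dict]:
    """Collect paginated results with an explicit upper bound."""
    rows: list[dict] = []
    for page in range(MAX_PAGES):
        batch = fetch_page(page, timeout=TIMEOUT_SECONDS)
        if not batch:  # empty page means the source is exhausted
            return rows
        rows.extend(batch)
    # Failing loudly at a known limit beats hanging mysteriously
    raise RuntimeError(f"exceeded MAX_PAGES={MAX_PAGES}; raise the limit deliberately")
```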
</set_limits_on_everything>
<question_dependencies>
**Before adding a library:**
- Can I write this simply myself?
- What's the complexity budget?
- Am I using 5% of a large framework?
- Is this solving my actual problem?
**Prefer:**
- Standard library when possible
- Small, focused libraries
- Direct solutions
- Understanding what code does
**Approved dependencies (earn their place):**
- `msgspec` — struct types and validation at system boundaries (external APIs, user input,
inter-process data). Use `msgspec.Struct` instead of dataclasses when you need: fast
encode/decode, built-in validation, or typed containers for boundary data.
**Rule:** use Structs at boundaries (API responses, HAR entries, MCP tool I/O) —
keep internal plumbing as plain dicts/tuples.
</question_dependencies>
</architecture_principles>
<performance_consciousness>
<think_about_the_computer>
**Understand:**
- Memory layout matters
- Cache locality matters
- Allocations have cost
- Loops over data can be fast or slow
**Common issues:**
- N+1 queries (database or API)
- Nested loops over large data
- Copying large structures unnecessarily
- Loading entire datasets into memory
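The N+1 shape, sketched with stdlib sqlite3 and hypothetical `users`/`orders` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Bad: N+1, one round trip per user
totals_slow = {}
for (user_id,) in conn.execute("SELECT id FROM users"):
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id = ?", (user_id,)
    ).fetchone()
    totals_slow[user_id] = total

# Good: one query, the database aggregates
totals_fast = dict(conn.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id"
))
```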
</think_about_the_computer>
<design_phase_performance>
**Think about performance upfront during design, not just after profiling:**
- The largest wins (100-1000x) happen in the design phase
- Back-of-envelope sketch: estimate load across network, disk, memory, CPU
- Optimize for the slowest resource first (network > disk > memory > CPU)
- Compensate for frequency — a cheap operation called 10M times can dominate
**Batching:**
- Amortize costs via batching (network calls, disk writes, database inserts)
- One batch insert of 1000 rows beats 1000 individual inserts
- Distinguish control plane (rare, can be slow) from data plane (hot path, must be fast)
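Batching sketched with stdlib sqlite3 (illustrative table, not from this project):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT)")

def insert_events_batched(rows: list[tuple[str, str]]) -> None:
    """One executemany in one transaction, not N single-row commits."""
    with conn:  # a single commit amortizes the per-transaction cost
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

insert_events_batched([("u1", "click"), ("u1", "view"), ("u2", "click")])
```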
**But don't prematurely optimize implementation details:**
- Design for performance, then measure before micro-optimizing
- Make it work, then make it fast
- Optimize the hot path, not everything
</design_phase_performance>
</performance_consciousness>
<assertions_and_invariants>
<use_assertions_as_documentation>
**Assert preconditions, postconditions, and invariants — especially in data pipelines:**
```python
def normalize_prices(prices: list[dict], currency: str) -> list[dict]:
    assert len(prices) > 0, "prices must not be empty"
    assert currency in ("USD", "EUR", "BRL"), f"unsupported currency: {currency}"

    result = [convert_price(p, currency) for p in prices]

    assert len(result) == len(prices), "normalization must not drop rows"
    assert all(r['currency'] == currency for r in result), "all prices must be in target currency"
    return result
```
**Guidelines:**
- Assert function arguments and return values at boundaries
- Assert data quality: row counts, non-null columns, expected ranges
- Use assertions to document surprising or critical invariants
- Split compound assertions: `assert a; assert b` not `assert a and b` (clearer error messages)
- Assertions catch programmer errors — they should never be used for expected runtime conditions (use if/else for those)
</use_assertions_as_documentation>
</assertions_and_invariants>
<sql_and_data>
<keep_logic_in_sql>
**Good:**
```sql
-- Logic is clear, database does the work
SELECT
user_id,
COUNT(*) as event_count,
COUNT(DISTINCT session_id) as session_count,
MAX(event_time) as last_active
FROM events
WHERE event_time >= CURRENT_DATE - 30
GROUP BY user_id
HAVING COUNT(*) >= 10
```
**Bad:**
```python
# Pulling too much data, doing work in Python
events = db.query("SELECT * FROM events WHERE event_time >= CURRENT_DATE - 30")
user_events = {}
for event in events:  # Could be millions of rows!
    if event.user_id not in user_events:
        user_events[event.user_id] = []
    user_events[event.user_id].append(event)

results = []
for user_id, events in user_events.items():
    if len(events) >= 10:
        results.append({'user_id': user_id, 'count': len(events)})
```
</keep_logic_in_sql>
<sql_best_practices>
**Write readable SQL:**
- Use CTEs for complex queries
- One concept per CTE
- Descriptive CTE names
- Comments for non-obvious logic
**Example:**
```sql
WITH active_users AS (
    -- Users who logged in within last 30 days
    SELECT DISTINCT user_id
    FROM login_events
    WHERE login_time >= CURRENT_DATE - 30
),

user_activity AS (
    -- Count events for active users
    SELECT
        e.user_id,
        COUNT(*) as event_count
    FROM events e
    INNER JOIN active_users au ON e.user_id = au.user_id
    GROUP BY e.user_id
)

SELECT
    user_id,
    event_count,
    event_count / 30.0 as avg_daily_events
FROM user_activity
ORDER BY event_count DESC
```
</sql_best_practices>
</sql_and_data>
<error_handling>
<be_explicit_about_errors>
**Handle errors explicitly:**
```python
def get_user(user_id: str) -> dict | None:
    """Get user by ID. Returns None if not found."""
    result = db.query("SELECT * FROM users WHERE id = ?", [user_id])
    return result[0] if result else None

def process_user(user_id: str):
    user = get_user(user_id)
    if user is None:
        logger.warning(f"User {user_id} not found")
        return None
    # Process user...
    return user
```
**Don't hide errors:**
```python
# Bad - silently catches everything
try:
    result = do_something()
except:
    result = None

# Good - explicit about what can fail
try:
    result = do_something()
except ValueError as e:
    logger.error(f"Invalid value: {e}")
    raise
except ConnectionError as e:
    logger.error(f"Connection failed: {e}")
    return None
```
</be_explicit_about_errors>
<fail_fast>
- Validate inputs at boundaries
- Check preconditions early
- Return early on error conditions
- Don't let bad data propagate
- All errors must be handled — 92% of catastrophic system failures come from incorrect handling of non-fatal errors
</fail_fast>
</error_handling>
<anti_patterns>
<over_engineering>
- Repository pattern for simple CRUD
- Service layer that just calls the database
- Dependency injection containers
- Abstract factories for concrete things
- Interfaces with one implementation
</over_engineering>
<framework_magic>
- ORM hiding N+1 queries
- Decorators doing complex logic
- Metaclass magic
- Convention over configuration (when it hides behavior)
</framework_magic>
<premature_abstraction>
- Creating interfaces "for future flexibility"
- Generics for specific use cases
- Configuration files for hardcoded values
- Plugins systems for known features
</premature_abstraction>
<unnecessary_complexity>
- Class hierarchies for classification
- Design patterns "just because"
- Microservices for a small app
- Message queues for synchronous operations
</unnecessary_complexity>
</anti_patterns>
<testing_philosophy>
<test_behavior_not_implementation>
**Focus on:**
- What the function does (inputs → outputs)
- Edge cases and boundaries
- Error conditions
- Data transformations
**Don't test:**
- Private implementation details
- Framework internals
- External libraries
- Simple property access
</test_behavior_not_implementation>
<keep_tests_simple>
```python
def test_user_aggregation():
    # Arrange - simple, clear test data
    events = [
        {'user_id': 'u1', 'event': 'click'},
        {'user_id': 'u1', 'event': 'view'},
        {'user_id': 'u2', 'event': 'click'},
    ]

    # Act - call the function
    result = aggregate_user_events(events)

    # Assert - check the behavior
    assert result == {'u1': 2, 'u2': 1}
```
</keep_tests_simple>
<test_both_spaces>
**Test positive and negative space:**
- Test valid inputs produce correct outputs (positive space)
- Test invalid inputs are rejected or handled correctly (negative space)
- For data pipelines: test with realistic data samples AND with malformed/missing data
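For example, with a hypothetical boundary parser (`parse_price` is illustrative, not from this codebase):

```python
def parse_price(raw: str) -> float:
    """Boundary parser: valid input returns a float, invalid input raises."""
    value = float(raw)  # raises ValueError on malformed input
    if value < 0:
        raise ValueError(f"price must be non-negative, got {value}")
    return value

def test_positive_space():
    assert parse_price("12.50") == 12.5

def test_negative_space():
    for bad in ("-1", "abc", ""):
        try:
            parse_price(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"{bad!r} must be rejected")

test_positive_space()
test_negative_space()
```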
</test_both_spaces>
<integration_tests_often_more_valuable>
- Test with real database (DuckDB is fast)
- Test actual SQL queries
- Test end-to-end flows
- Use realistic data samples
</integration_tests_often_more_valuable>
</testing_philosophy>
<comments_and_documentation>
<when_to_comment>
**Comment the "why":**
```python
# Use binary search because list is sorted and can be large (1M+ items)
index = binary_search(sorted_items, target)

# Cache for 5 minutes - balance freshness vs database load
@cache(ttl=300)
def get_user_stats(user_id):
    ...
```
**Don't comment the "what":**
```python
# Bad - code is self-explanatory
# Increment the counter
counter += 1
# Good - code is clear on its own
counter += 1
```
**Always motivate decisions:**
- Explain why you wrote code the way you did
- Code alone isn't documentation — the reasoning matters
- Comments are well-written prose, not margin scribblings
</when_to_comment>
<self_documenting_code>
- Use descriptive names
- Keep functions focused
- Make data flow obvious
- Structure for readability
</self_documenting_code>
</comments_and_documentation>
<summary>
**Key Principles:**
1. **Simple, direct, procedural** — functions over classes
2. **Data-oriented** — understand the data and its flow
3. **Explicit over implicit** — no magic, no hiding
4. **Build minimum that works** — solve actual problems, zero technical debt
5. **Performance conscious** — design for performance, then measure before micro-optimizing
6. **Keep logic in SQL** — let the database do the work
7. **Handle errors explicitly** — no silent failures, all errors handled
8. **Assert invariants** — use assertions to document and enforce correctness
9. **Set limits on everything** — nothing runs unbounded
10. **Question abstractions** — every layer needs justification
**Ask yourself:**
- Is this the simplest solution?
- Can someone else understand this?
- What is the computer actually doing?
- Am I solving the real problem?
- What are the bounds on this operation?
When in doubt, go simpler.
</summary>

copier-answers.yml

@@ -1,5 +1,5 @@
# Changes here will be overwritten by Copier; NEVER EDIT MANUALLY
-_commit: 29ac25b
+_commit: v0.9.0
_src_path: /home/Deeman/Projects/quart_saas_boilerplate
author_email: ''
author_name: ''

.gitignore (vendored)

@@ -1,5 +1,5 @@
# Personal / project-root
-CLAUDE.md
+/CLAUDE.md
.bedrockapikey
.live-slot
.worktrees/

CHANGELOG.md

@@ -7,6 +7,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
## [Unreleased]
### Added
- Template sync: copier update from `29ac25b` → `v0.9.0` (29 template commits)
- `.claude/CLAUDE.md`: project-specific Claude Code instructions (skills, commands, architecture)
- `.claude/coding_philosophy.md`: engineering principles guide
- `extract/padelnomics_extract/README.md`: extraction patterns & state tracking docs
- `extract/padelnomics_extract/src/padelnomics_extract/utils.py`: SQLite state tracking
(`open_state_db`, `start_run`, `end_run`, `get_last_cursor`) + file I/O helpers
(`landing_path`, `content_hash`, `write_gzip_atomic`)
- `transform/sqlmesh_padelnomics/README.md`: 4-layer SQLMesh architecture guide
- Per-layer model READMEs (raw, staging, foundation, serving)
- `infra/supervisor/`: systemd service + supervisor script for pipeline orchestration
- Copier answers file now includes `enable_daas`, `enable_cms`, `enable_directory`, `enable_i18n`
toggles (prevents accidental deletion on future copier updates)
- Expanded programmatic SEO city coverage from 18 to 40 cities (+22 cities across ES, FR,
IT, NL, AT, CH, SE, PT, BE, AE, AU, IE) — generates 80 articles (40 cities × EN + DE)
- `scripts/refresh_from_daas.py`: syncs template_data rows from DuckDB `planner_defaults`

extract/padelnomics_extract/README.md (new file)

@@ -0,0 +1,90 @@
# Padelnomics Extraction
Fetches raw data from external sources to the local landing zone. The pipeline then reads from the landing zone — extraction and transformation are fully decoupled.
## Running
```bash
# One-shot (most recent data only)
LANDING_DIR=data/landing uv run extract
# First-time full backfill (add your own backfill entry point)
LANDING_DIR=data/landing uv run python -m padelnomics_extract.execute
```
## Design: filesystem as state
The landing zone is an append-only store of raw files. Each file is named by its content fingerprint (etag or SHA256 hash), so:
- **Idempotency**: running twice writes nothing if the source hasn't changed
- **Debugging**: every historical raw file is preserved — reprocess any window by re-running transforms
- **Safety**: extraction never mutates existing files, only appends new ones
### Etag-based dedup (preferred)
When the source provides an `ETag` header, use it as the filename:
```
data/landing/padelnomics/{year}/{month:02d}/{etag}.csv.gz
```
The file existing on disk means the content matches the server's current version. No content download needed.
### Hash-based dedup (fallback)
When the source has no etag (static files that update in-place), download the content and use its SHA256 prefix as the filename:
```
data/landing/padelnomics/{year}/{date}_{sha256[:8]}.csv.gz
```
Two runs that produce identical content produce the same hash → same filename → skip.
## State tracking
Every run writes one row to `data/landing/.state.sqlite`. Query it to answer operational questions:
```bash
# When did extraction last succeed?
sqlite3 data/landing/.state.sqlite \
"SELECT extractor, started_at, status, files_written, files_skipped, cursor_value
FROM extraction_runs ORDER BY run_id DESC LIMIT 10"
# Did anything fail in the last 7 days?
sqlite3 data/landing/.state.sqlite \
"SELECT * FROM extraction_runs WHERE status = 'failed'
AND started_at > datetime('now', '-7 days')"
```
State table schema:
| Column | Type | Description |
|--------|------|-------------|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `padelnomics`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp, NULL if still running |
| `status` | TEXT | `running` → `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present (content unchanged) |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Last successful cursor (date, etag, page, etc.) |
| `error_message` | TEXT | Exception message if status = `failed` |
## Adding a new extractor
1. Add a function in `execute.py` following the same pattern as `extract_file_by_etag()` or `extract_file_by_hash()`
2. Call it from `extract_dataset()` with its own `extractor` name in `start_run()`
3. Store files under a new subdirectory: `landing_path(LANDING_DIR, "my_new_source", year)`
4. Add a new SQLMesh `raw/` model that reads from the new subdirectory glob
## Landing zone structure
```
data/landing/
├── .state.sqlite                # extraction run history
└── padelnomics/                 # one subdirectory per source
    └── {year}/
        └── {month:02d}/
            └── {etag}.csv.gz    # immutable, content-addressed files
```

extract/padelnomics_extract/src/padelnomics_extract/utils.py (new file)

@@ -0,0 +1,129 @@
"""Extraction utilities: SQLite state tracking, file I/O helpers.
These are inline equivalents of the extract_core library used in larger
multi-extractor pipelines. For a single-package project they live here;
if you add multiple data sources, extract them to a shared workspace package.
"""
import gzip
import hashlib
import sqlite3
from pathlib import Path
# ---------------------------------------------------------------------------
# State tracking (SQLite — transactional, stdlib, no extra dependency)
# ---------------------------------------------------------------------------
_CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS extraction_runs (
run_id INTEGER PRIMARY KEY AUTOINCREMENT,
extractor TEXT NOT NULL,
started_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
finished_at TEXT,
status TEXT NOT NULL DEFAULT 'running',
files_written INTEGER DEFAULT 0,
files_skipped INTEGER DEFAULT 0,
bytes_written INTEGER DEFAULT 0,
cursor_value TEXT,
error_message TEXT
)
"""
def open_state_db(landing_dir: str | Path) -> sqlite3.Connection:
"""Open (or create) .state.sqlite inside landing_dir.
WAL mode allows concurrent reads while a run is in progress.
Caller is responsible for conn.close().
"""
db_path = Path(landing_dir) / ".state.sqlite"
db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(db_path))
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL")
conn.execute(_CREATE_TABLE_SQL)
conn.commit()
return conn
def start_run(conn: sqlite3.Connection, extractor: str) -> int:
"""Insert a 'running' row. Returns run_id."""
cur = conn.execute(
"INSERT INTO extraction_runs (extractor, status) VALUES (?, 'running')",
(extractor,),
)
conn.commit()
return cur.lastrowid
def end_run(
conn: sqlite3.Connection,
run_id: int,
*,
status: str,
files_written: int = 0,
files_skipped: int = 0,
bytes_written: int = 0,
cursor_value: str | None = None,
error_message: str | None = None,
) -> None:
"""Update the run row to its final state."""
assert status in ("success", "failed")
conn.execute(
"""
UPDATE extraction_runs
SET finished_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now'),
status = ?,
files_written = ?,
files_skipped = ?,
bytes_written = ?,
cursor_value = ?,
error_message = ?
WHERE run_id = ?
""",
(status, files_written, files_skipped, bytes_written, cursor_value, error_message, run_id),
)
conn.commit()
def get_last_cursor(conn: sqlite3.Connection, extractor: str) -> str | None:
"""Return the cursor_value from the most recent successful run, or None."""
row = conn.execute(
"""
SELECT cursor_value FROM extraction_runs
WHERE extractor = ? AND status = 'success' AND cursor_value IS NOT NULL
ORDER BY run_id DESC LIMIT 1
""",
(extractor,),
).fetchone()
return row["cursor_value"] if row else None
# ---------------------------------------------------------------------------
# File I/O helpers
# ---------------------------------------------------------------------------
def landing_path(landing_dir: str | Path, *parts: str) -> Path:
"""Return path to a subdirectory of landing_dir, creating it if absent."""
path = Path(landing_dir).joinpath(*parts)
path.mkdir(parents=True, exist_ok=True)
return path
def content_hash(data: bytes, prefix_bytes: int = 8) -> str:
"""SHA256 content fingerprint — used as idempotency key in filenames."""
assert data, "data must not be empty"
return hashlib.sha256(data).hexdigest()[:prefix_bytes]
def write_gzip_atomic(path: Path, data: bytes) -> int:
"""Gzip compress data and write to path atomically via .tmp sibling.
Returns bytes written. Atomic write means readers never see a partial file.
"""
assert data, "data must not be empty"
compressed = gzip.compress(data)
tmp = path.with_suffix(path.suffix + ".tmp")
tmp.write_bytes(compressed)
tmp.rename(path)
return len(compressed)


@@ -0,0 +1,24 @@
[Unit]
Description=Padelnomics Supervisor — Pipeline Orchestration
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/padelnomics
ExecStart=/opt/padelnomics/infra/supervisor/supervisor.sh
Restart=always
RestartSec=10
EnvironmentFile=/opt/padelnomics/.env
Environment=LANDING_DIR=/data/padelnomics/landing
Environment=DUCKDB_PATH=/data/padelnomics/lakehouse.duckdb
LimitNOFILE=65536
StandardOutput=journal
StandardError=journal
SyslogIdentifier=padelnomics-supervisor
[Install]
WantedBy=multi-user.target

infra/supervisor/supervisor.sh (new file)

@@ -0,0 +1,47 @@
#!/bin/sh
# Padelnomics Supervisor — continuous pipeline orchestration.
# Inspired by TigerBeetle's CFO supervisor: simple, resilient, easy to understand.
# https://github.com/tigerbeetle/tigerbeetle/blob/main/src/scripts/cfo_supervisor.sh
#
# Environment variables (set in systemd EnvironmentFile or .env):
# LANDING_DIR — local path for extracted landing data
# DUCKDB_PATH — path to DuckDB lakehouse file
# ALERT_WEBHOOK_URL — optional ntfy.sh / Slack / Telegram webhook for failures
set -eu
readonly REPO_DIR="/opt/padelnomics"
while true
do
    (
        if ! [ -d "$REPO_DIR/.git" ]; then
            echo "Repository not found at $REPO_DIR — bootstrap required!"
            exit 1
        fi
        cd "$REPO_DIR"

        # Pull latest code
        git fetch origin master
        git switch --discard-changes --detach origin/master
        uv sync

        # Extract
        LANDING_DIR="${LANDING_DIR:-/data/padelnomics/landing}" \
            DUCKDB_PATH="${DUCKDB_PATH:-/data/padelnomics/lakehouse.duckdb}" \
            uv run --package padelnomics_extract extract

        # Transform
        LANDING_DIR="${LANDING_DIR:-/data/padelnomics/landing}" \
            DUCKDB_PATH="${DUCKDB_PATH:-/data/padelnomics/lakehouse.duckdb}" \
            uv run --package sqlmesh_padelnomics sqlmesh run --select-model "serving.*"
    ) || {
        if [ -n "${ALERT_WEBHOOK_URL:-}" ]; then
            curl -s -d "Padelnomics pipeline failed at $(date)" \
                "$ALERT_WEBHOOK_URL" 2>/dev/null || true
        fi
        sleep 600  # back off 10 min on failure
    }
done

transform/sqlmesh_padelnomics/README.md (new file)

@@ -0,0 +1,107 @@
# Padelnomics Transform (SQLMesh)
4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.
## Running
```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan
# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod
# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test
# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```
## 4-layer architecture
```
landing/                    <- raw files (extraction output)
+-- padelnomics/
    +-- {year}/{etag}.csv.gz

raw/                        <- reads files verbatim
+-- raw.padelnomics

staging/                    <- type casting, deduplication
+-- staging.stg_padelnomics

foundation/                 <- business logic, dimensions, facts
+-- foundation.dim_category

serving/                    <- pre-aggregated for web app
+-- serving.padelnomics_metrics
```
### raw/ — verbatim source reads
- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
- Naming: `raw.<source>`
### staging/ — type casting and cleansing
- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`
### foundation/ — business logic
- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`
### serving/ — analytics-ready aggregates
- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`
## Adding a new data source
1. Add a landing zone directory in the extraction package
2. Add a glob macro in `macros/__init__.py`:
```python
import os

from sqlmesh import macro

@macro()
def my_source_glob(evaluator) -> str:
    landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
    return f"'{landing_dir}/my_source/**/*.csv.gz'"
```
3. Add a raw model: `models/raw/raw_my_source.sql`
4. Add a staging model: `models/staging/stg_my_source.sql`
5. Join into foundation or serving models as needed
## Model materialization
| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |
For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.
## Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |
The web app reads from a **separate** `analytics.duckdb` file via `export_serving.py`.
Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file —
SQLMesh holds an exclusive write lock during plan/run.


@@ -0,0 +1,6 @@
# foundation
Business logic layer: dimensions, facts, conformed metrics.
May join across staging models from different sources.
Naming convention: `foundation.dim_<entity>`, `foundation.fact_<event>`


@@ -0,0 +1,6 @@
# raw
Read raw landing zone files directly with `read_csv(..., all_varchar=true)`.
No transformations — schema as-is from source.
Naming convention: `raw.<source>_<dataset>`


@@ -0,0 +1,6 @@
# serving
Analytics-ready views consumed by the web app and programmatic SEO.
Query these from `analytics.py` via DuckDB read-only connection.
Naming convention: `serving.<purpose>` (e.g. `serving.city_market_profile`)


@@ -0,0 +1,6 @@
# staging
Type casting, deduplication, null handling on top of raw models.
One staging model per raw model.
Naming convention: `staging.<source>_<dataset>`