feat: copier update v0.9.0 — extraction docs, state tracking, architecture guides
Sync template from 29ac25b → v0.9.0 (29 template commits). Due to template's
_subdirectory migration, new files were manually rendered rather than
auto-merged by copier.

New files:
- .claude/CLAUDE.md + coding_philosophy.md (agent instructions)
- extract utils.py: SQLite state tracking for extraction runs
- extract/transform READMEs: architecture & pattern documentation
- infra/supervisor: systemd service + orchestration script
- Per-layer model READMEs (raw, staging, foundation, serving)

Also fixes copier-answers.yml (adds 4 feature toggles, removes stale
payment_provider key) and scopes the CLAUDE.md gitignore entry to the repo root only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

106  .claude/CLAUDE.md  Normal file
@@ -0,0 +1,106 @@
# CLAUDE.md — Padelnomics

This file tells Claude Code how to work in this repository.

## Project Overview

Padelnomics is a SaaS application built with Quart (async Python), HTMX, and SQLite.
It includes a full data pipeline:

```
External APIs → extract → landing zone → SQLMesh transform → DuckDB → web app
```

**Packages** (uv workspace):
- `web/` — Quart + HTMX web application (auth, billing, dashboard)
- `extract/padelnomics_extract/` — data extraction to local landing zone
- `transform/sqlmesh_padelnomics/` — 4-layer SQL transformation (raw → staging → foundation → serving)
- `src/padelnomics/` — CLI utilities, export_serving helper

## Skills: invoke these for domain tasks

### Working on extraction or transformation?

Use the **`data-engineer`** skill for:
- Designing or reviewing SQLMesh model logic
- Adding a new data source (extract + raw + staging models)
- Performance tuning DuckDB queries
- Data modeling decisions (dimensions, facts, aggregates)
- Understanding the 4-layer architecture

```
/data-engineer (or ask Claude to invoke it)
```

### Working on the web app UI or frontend?

Use the **`frontend-design`** skill for UI components, templates, or dashboard layouts.

### Working on payments or subscriptions?

Use the **`paddle-integration`** skill for billing, webhooks, and subscription logic.

## Key commands

```bash
# Install all dependencies
uv sync --all-packages

# Lint & format
ruff check .
ruff format .

# Run tests
uv run pytest tests/ -v

# Dev server
./scripts/dev_run.sh

# Extract data
LANDING_DIR=data/landing uv run extract

# SQLMesh plan + run (from repo root)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Export serving tables (run after SQLMesh)
DUCKDB_PATH=local.duckdb SERVING_DUCKDB_PATH=analytics.duckdb \
  uv run python -m padelnomics.export_serving
```

## Architecture documentation

| Topic | File |
|-------|------|
| Extraction patterns, state tracking, adding new sources | `extract/padelnomics_extract/README.md` |
| 4-layer SQLMesh architecture, materialization strategy | `transform/sqlmesh_padelnomics/README.md` |
| Two-file DuckDB architecture (SQLMesh lock isolation) | `src/padelnomics/export_serving.py` docstring |

## Pipeline data flow

```
data/landing/
└── padelnomics/{year}/{etag}.csv.gz   ← extraction output

local.duckdb       ← SQLMesh exclusive (raw → staging → foundation → serving)

analytics.duckdb   ← serving tables only, web app read-only
└── serving.*      ← atomically replaced by export_serving.py
```

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Landing zone root (extraction writes here) |
| `DUCKDB_PATH` | `local.duckdb` | SQLMesh pipeline DB (exclusive write) |
| `SERVING_DUCKDB_PATH` | `analytics.duckdb` | Read-only DB for web app |

## Coding philosophy

- **Simple and procedural** — functions over classes, no "Manager" patterns
- **Idempotent operations** — running twice produces the same result
- **Explicit assertions** — assert preconditions at function boundaries
- **Bounded operations** — set timeouts, page limits, buffer sizes

Read `.claude/coding_philosophy.md` for the full guide.

542  .claude/coding_philosophy.md  Normal file
@@ -0,0 +1,542 @@
# Coding Philosophy & Engineering Principles

This document defines the coding philosophy and engineering principles that guide all agent work. All agents should internalize and follow these principles.

Influenced by Casey Muratori, Jonathan Blow, and [TigerStyle](https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TIGER_STYLE.md) (adapted for Python/SQL).

<core_philosophy>
**Simple, Direct, Procedural Code**

- Solve the actual problem, not the general case
- Understand what the computer is doing
- Explicit is better than clever
- Code should be obvious, not impressive
- Do it right the first time — feature gaps are acceptable, but what ships must meet design goals
</core_philosophy>

<code_style>

<functions_over_classes>
**Prefer:**
- Pure functions that transform data
- Simple procedures that do clear things
- Explicit data structures (dicts, lists, named tuples)

**Avoid:**
- Classes that are just namespaces for functions
- Objects hiding behavior behind methods
- Inheritance hierarchies
- "Manager" or "Handler" classes

**Example - Good:**
```python
def calculate_user_metrics(events: list[dict]) -> dict:
    """Calculate metrics from event list."""
    total = len(events)
    unique_sessions = len(set(e['session_id'] for e in events))

    return {
        'total_events': total,
        'unique_sessions': unique_sessions,
        'events_per_session': total / unique_sessions if unique_sessions > 0 else 0
    }
```

**Example - Bad:**
```python
class UserMetricsCalculator:
    def __init__(self):
        self._events = []

    def add_events(self, events: list[dict]):
        self._events.extend(events)

    def calculate(self) -> UserMetrics:
        return UserMetrics(
            total=self._calculate_total(),
            sessions=self._calculate_sessions()
        )
```
</functions_over_classes>

<data_oriented_design>
**Think about the data:**
- What's the shape of the data?
- How does it flow through the system?
- What transformations are needed?
- What's the memory layout?

**Data is just data:**
- Use simple structures (dicts, lists, tuples)
- Don't hide data behind getters/setters
- Make data transformations explicit
- Consider performance implications

**Example - Good:**
```python
# Data is data, functions transform it
users = [
    {'id': 1, 'name': 'Alice', 'active': True},
    {'id': 2, 'name': 'Bob', 'active': False},
]

def filter_active(users: list[dict]) -> list[dict]:
    return [u for u in users if u['active']]

active_users = filter_active(users)
```

**Example - Bad:**
```python
# Data hidden behind objects
class User:
    def __init__(self, id, name, active):
        self._id = id
        self._name = name
        self._active = active

    def get_name(self):
        return self._name

    def is_active(self):
        return self._active

users = [User(1, 'Alice', True), User(2, 'Bob', False)]
active_users = [u for u in users if u.is_active()]
```
</data_oriented_design>

<keep_it_simple>
**Simple control flow:**
- Straightforward if/else over clever tricks
- Explicit loops over list comprehensions when clearer
- Early returns to reduce nesting
- Avoid deeply nested logic

**Simple naming:**
- Descriptive variable names (`user_count` not `uc`)
- Function names that say what they do (`calculate_total` not `process`)
- No abbreviations unless universal (`id`, `url`, `sql`)
- Include units in names: `timeout_seconds`, `size_bytes`, `latency_ms` — not `timeout`, `size`, `latency`
- Place qualifiers last in descending significance: `latency_ms_max` not `max_latency_ms` (aligns related variables)

**Simple structure:**
- Functions should do one thing
- Keep functions short (20-50 lines, hard limit ~70 — must fit on screen without scrolling)
- If it's getting complex, break it up
- But don't break it up "just because"
</keep_it_simple>

<minimize_variable_scope>
**Declare variables close to where they're used:**
- Don't introduce variables before they're needed
- Remove them when no longer relevant
- Minimize the number of variables in scope at any point
- Reduces probability of stale-state bugs (check something in one place, use it in another)

**Don't duplicate state:**
- One source of truth for each piece of data
- Don't create aliases or copies that can drift out of sync
- If you compute a value, use it directly — don't store it in a variable you'll use 50 lines later
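
A minimal sketch (the function and numbers are illustrative, not from the codebase):

```python
def summarize_run(files_written: int, bytes_written: int) -> str:
    """Narrow scope: each value appears right where it is used and nowhere else."""
    if files_written == 0:
        return "no new files"

    megabytes = bytes_written / 1_000_000  # introduced only on the branch that needs it
    return f"{files_written} files, {megabytes:.1f} MB"
```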
</minimize_variable_scope>

</code_style>

<architecture_principles>

<build_minimum_that_works>
**Start simple:**
- Solve the immediate problem
- Don't build for imagined future requirements
- Add complexity only when actually needed
- Prefer obvious solutions over clever ones

**Avoid premature abstraction:**
- Duplication is okay early on
- Abstract only when pattern is clear
- Three examples before abstracting
- Question every layer of indirection

**Zero technical debt:**
- Do it right the first time
- A problem solved in design costs less than one solved in implementation, which costs less than one solved in production
- Feature gaps are acceptable; broken or half-baked code is not
</build_minimum_that_works>

<explicit_over_implicit>
**Be explicit about:**
- Where data comes from
- What transformations happen
- Error conditions and handling
- Dependencies and side effects

**Avoid magic:**
- Framework conventions that hide behavior
- Implicit configuration
- Action-at-a-distance
- Metaprogramming tricks
- Relying on library defaults — pass options explicitly at call site
</explicit_over_implicit>

<set_limits_on_everything>
**Nothing should run unbounded:**
- Set max retries on network calls
- Set timeouts on all external requests
- Bound loop iterations where data size is unknown
- Set max page counts on paginated API fetches
- Cap queue/buffer sizes

**Why:** Unbounded operations cause tail latency spikes, resource exhaustion, and silent hangs. A system that fails loudly at a known limit is better than one that degrades mysteriously.
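
A minimal sketch of a bounded fetch loop (the URL and paging scheme are placeholders, not a real endpoint):

```python
import urllib.request

MAX_PAGES = 50          # hard ceiling on pagination
MAX_RETRIES = 3
TIMEOUT_SECONDS = 30    # every external call gets a timeout

def fetch_all_pages(base_url: str) -> list[bytes]:
    """Fails loudly at a known limit instead of hanging or spinning forever."""
    pages: list[bytes] = []
    for page in range(MAX_PAGES):
        for attempt in range(MAX_RETRIES):
            try:
                with urllib.request.urlopen(f"{base_url}?page={page}", timeout=TIMEOUT_SECONDS) as resp:
                    body = resp.read()
                break
            except OSError:
                if attempt == MAX_RETRIES - 1:
                    raise
        if not body:
            return pages  # an empty page marks the end of the dataset
        pages.append(body)
    raise RuntimeError(f"source returned more than {MAX_PAGES} pages — raise the limit deliberately")
```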
</set_limits_on_everything>

<question_dependencies>
**Before adding a library:**
- Can I write this simply myself?
- What's the complexity budget?
- Am I using 5% of a large framework?
- Is this solving my actual problem?

**Prefer:**
- Standard library when possible
- Small, focused libraries
- Direct solutions
- Understanding what code does

**Approved dependencies (earn their place):**
- `msgspec` — struct types and validation at system boundaries (external APIs, user input,
  inter-process data). Use `msgspec.Struct` instead of dataclasses when you need: fast
  encode/decode, built-in validation, or typed containers for boundary data.
  **Rule:** use Structs at boundaries (API responses, HAR entries, MCP tool I/O) —
  keep internal plumbing as plain dicts/tuples.
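
A minimal sketch of the boundary rule (the payload shape here is invented for illustration):

```python
import msgspec

class PriceRow(msgspec.Struct):
    """Boundary type: decode and validate the external payload once, at the edge."""
    report_date: str
    price_eur: float
    city: str

raw = b'[{"report_date": "2025-01-31", "price_eur": 12.5, "city": "Madrid"}]'
rows = msgspec.json.decode(raw, type=list[PriceRow])  # raises on shape/type mismatch

# Internal plumbing stays plain — hand the data onwards as simple tuples/dicts.
prices = [(r.city, r.price_eur) for r in rows]
```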
</question_dependencies>

</architecture_principles>

<performance_consciousness>

<think_about_the_computer>
**Understand:**
- Memory layout matters
- Cache locality matters
- Allocations have cost
- Loops over data can be fast or slow

**Common issues:**
- N+1 queries (database or API) — see the sketch below
- Nested loops over large data
- Copying large structures unnecessarily
- Loading entire datasets into memory
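
A small, self-contained illustration of the N+1 shape versus a single aggregate query (SQLite in-memory, toy data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, session_id TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [("u1", "s1"), ("u1", "s2"), ("u2", "s1")])

# N+1 shape: one round-trip per user — fine for 3 users, painful for 30,000.
user_ids = [row[0] for row in conn.execute("SELECT DISTINCT user_id FROM events")]
counts_slow = {
    u: conn.execute("SELECT COUNT(*) FROM events WHERE user_id = ?", (u,)).fetchone()[0]
    for u in user_ids
}

# One aggregate query: a single round-trip, the database does the work.
counts_fast = dict(conn.execute("SELECT user_id, COUNT(*) FROM events GROUP BY user_id"))
assert counts_slow == counts_fast == {"u1": 2, "u2": 1}
```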
</think_about_the_computer>

<design_phase_performance>
**Think about performance upfront during design, not just after profiling:**
- The largest wins (100-1000x) happen in the design phase
- Back-of-envelope sketch: estimate load across network, disk, memory, CPU
- Optimize for the slowest resource first (network > disk > memory > CPU)
- Compensate for frequency — a cheap operation called 10M times can dominate

**Batching:**
- Amortize costs via batching (network calls, disk writes, database inserts)
- One batch insert of 1000 rows beats 1000 individual inserts (see the sketch below)
- Distinguish control plane (rare, can be slow) from data plane (hot path, must be fast)
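
A toy illustration of the batching point (SQLite in-memory; the table and rows are made up):

```python
import sqlite3

rows = [("padelnomics", f"2025-01-{day:02d}") for day in range(1, 31)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (extractor TEXT, run_date TEXT)")

# One batched statement instead of 30 separate INSERT calls: the per-statement
# overhead (parse, round-trip, commit) is paid once, not per row.
conn.executemany("INSERT INTO runs VALUES (?, ?)", rows)
conn.commit()
```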

**But don't prematurely optimize implementation details:**
- Design for performance, then measure before micro-optimizing
- Make it work, then make it fast
- Optimize the hot path, not everything
</design_phase_performance>

</performance_consciousness>

<assertions_and_invariants>

<use_assertions_as_documentation>
**Assert preconditions, postconditions, and invariants — especially in data pipelines:**

```python
def normalize_prices(prices: list[dict], currency: str) -> list[dict]:
    assert len(prices) > 0, "prices must not be empty"
    assert currency in ("USD", "EUR", "BRL"), f"unsupported currency: {currency}"

    result = [convert_price(p, currency) for p in prices]

    assert len(result) == len(prices), "normalization must not drop rows"
    assert all(r['currency'] == currency for r in result), "all prices must be in target currency"
    return result
```

**Guidelines:**
- Assert function arguments and return values at boundaries
- Assert data quality: row counts, non-null columns, expected ranges
- Use assertions to document surprising or critical invariants
- Split compound assertions: `assert a; assert b` not `assert a and b` (clearer error messages)
- Assertions catch programmer errors — they should never be used for expected runtime conditions (use if/else for those)
</use_assertions_as_documentation>

</assertions_and_invariants>

<sql_and_data>

<keep_logic_in_sql>
**Good:**
```sql
-- Logic is clear, database does the work
SELECT
    user_id,
    COUNT(*) as event_count,
    COUNT(DISTINCT session_id) as session_count,
    MAX(event_time) as last_active
FROM events
WHERE event_time >= CURRENT_DATE - 30
GROUP BY user_id
HAVING COUNT(*) >= 10
```

**Bad:**
```python
# Pulling too much data, doing work in Python
events = db.query("SELECT * FROM events WHERE event_time >= CURRENT_DATE - 30")
user_events = {}
for event in events:  # Could be millions of rows!
    if event.user_id not in user_events:
        user_events[event.user_id] = []
    user_events[event.user_id].append(event)

results = []
for user_id, events in user_events.items():
    if len(events) >= 10:
        results.append({'user_id': user_id, 'count': len(events)})
```
</keep_logic_in_sql>

<sql_best_practices>
**Write readable SQL:**
- Use CTEs for complex queries
- One concept per CTE
- Descriptive CTE names
- Comments for non-obvious logic

**Example:**
```sql
WITH active_users AS (
    -- Users who logged in within last 30 days
    SELECT DISTINCT user_id
    FROM login_events
    WHERE login_time >= CURRENT_DATE - 30
),

user_activity AS (
    -- Count events for active users
    SELECT
        e.user_id,
        COUNT(*) as event_count
    FROM events e
    INNER JOIN active_users au ON e.user_id = au.user_id
    GROUP BY e.user_id
)

SELECT
    user_id,
    event_count,
    event_count / 30.0 as avg_daily_events
FROM user_activity
ORDER BY event_count DESC
```
</sql_best_practices>

</sql_and_data>

<error_handling>

<be_explicit_about_errors>
**Handle errors explicitly:**
```python
def get_user(user_id: str) -> dict | None:
    """Get user by ID. Returns None if not found."""
    result = db.query("SELECT * FROM users WHERE id = ?", [user_id])
    return result[0] if result else None


def process_user(user_id: str):
    user = get_user(user_id)
    if user is None:
        logger.warning(f"User {user_id} not found")
        return None

    # Process user...
    return result
```

**Don't hide errors:**
```python
# Bad - silently catches everything
try:
    result = do_something()
except:
    result = None

# Good - explicit about what can fail
try:
    result = do_something()
except ValueError as e:
    logger.error(f"Invalid value: {e}")
    raise
except ConnectionError as e:
    logger.error(f"Connection failed: {e}")
    return None
```
</be_explicit_about_errors>

<fail_fast>
- Validate inputs at boundaries (see the sketch below)
- Check preconditions early
- Return early on error conditions
- Don't let bad data propagate
- All errors must be handled — 92% of catastrophic system failures come from incorrect handling of non-fatal errors
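
A minimal sketch of failing fast at a boundary (the report format is an invented example):

```python
import csv
from pathlib import Path

def load_report(path_str: str) -> list[dict]:
    """Reject bad input at the boundary instead of letting it propagate."""
    path = Path(path_str)
    if not path.exists():
        raise FileNotFoundError(f"report not found: {path}")
    if path.suffix != ".csv":
        raise ValueError(f"expected a .csv file, got: {path.name}")

    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"report is empty: {path}")
    return rows
```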
</fail_fast>

</error_handling>

<anti_patterns>

<over_engineering>
- Repository pattern for simple CRUD
- Service layer that just calls the database
- Dependency injection containers
- Abstract factories for concrete things
- Interfaces with one implementation
</over_engineering>

<framework_magic>
- ORM hiding N+1 queries
- Decorators doing complex logic
- Metaclass magic
- Convention over configuration (when it hides behavior)
</framework_magic>

<premature_abstraction>
- Creating interfaces "for future flexibility"
- Generics for specific use cases
- Configuration files for hardcoded values
- Plugin systems for known features
</premature_abstraction>

<unnecessary_complexity>
- Class hierarchies for classification
- Design patterns "just because"
- Microservices for a small app
- Message queues for synchronous operations
</unnecessary_complexity>

</anti_patterns>

<testing_philosophy>

<test_behavior_not_implementation>
**Focus on:**
- What the function does (inputs → outputs)
- Edge cases and boundaries
- Error conditions
- Data transformations

**Don't test:**
- Private implementation details
- Framework internals
- External libraries
- Simple property access
</test_behavior_not_implementation>

<keep_tests_simple>
```python
def test_user_aggregation():
    # Arrange - simple, clear test data
    events = [
        {'user_id': 'u1', 'event': 'click'},
        {'user_id': 'u1', 'event': 'view'},
        {'user_id': 'u2', 'event': 'click'},
    ]

    # Act - call the function
    result = aggregate_user_events(events)

    # Assert - check the behavior
    assert result == {'u1': 2, 'u2': 1}
```
</keep_tests_simple>

<test_both_spaces>
**Test positive and negative space:**
- Test valid inputs produce correct outputs (positive space)
- Test invalid inputs are rejected or handled correctly (negative space)
- For data pipelines: test with realistic data samples AND with malformed/missing data
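
A small example of both spaces, using `content_hash` from the extraction utils added alongside this guide:

```python
import pytest

from padelnomics_extract.utils import content_hash

def test_content_hash_positive_space():
    # Same content → same fingerprint — the dedup property the pipeline relies on.
    assert content_hash(b"hello") == content_hash(b"hello")
    assert len(content_hash(b"hello")) == 8

def test_content_hash_negative_space():
    # An empty payload is a programmer error and must be rejected.
    with pytest.raises(AssertionError):
        content_hash(b"")
```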
</test_both_spaces>

<integration_tests_often_more_valuable>
- Test with real database (DuckDB is fast)
- Test actual SQL queries
- Test end-to-end flows
- Use realistic data samples
</integration_tests_often_more_valuable>

</testing_philosophy>

<comments_and_documentation>

<when_to_comment>
**Comment the "why":**
```python
# Use binary search because list is sorted and can be large (1M+ items)
index = binary_search(sorted_items, target)

# Cache for 5 minutes - balance freshness vs database load
@cache(ttl=300)
def get_user_stats(user_id):
    ...
```

**Don't comment the "what":**
```python
# Bad - code is self-explanatory
# Increment the counter
counter += 1

# Good - code is clear on its own
counter += 1
```

**Always motivate decisions:**
- Explain why you wrote code the way you did
- Code alone isn't documentation — the reasoning matters
- Comments are well-written prose, not margin scribblings
</when_to_comment>

<self_documenting_code>
- Use descriptive names
- Keep functions focused
- Make data flow obvious
- Structure for readability
</self_documenting_code>

</comments_and_documentation>

<summary>
**Key Principles:**
1. **Simple, direct, procedural** — functions over classes
2. **Data-oriented** — understand the data and its flow
3. **Explicit over implicit** — no magic, no hiding
4. **Build minimum that works** — solve actual problems, zero technical debt
5. **Performance conscious** — design for performance, then measure before micro-optimizing
6. **Keep logic in SQL** — let the database do the work
7. **Handle errors explicitly** — no silent failures, all errors handled
8. **Assert invariants** — use assertions to document and enforce correctness
9. **Set limits on everything** — nothing runs unbounded
10. **Question abstractions** — every layer needs justification

**Ask yourself:**
- Is this the simplest solution?
- Can someone else understand this?
- What is the computer actually doing?
- Am I solving the real problem?
- What are the bounds on this operation?

When in doubt, go simpler.
</summary>

copier-answers.yml

@@ -1,5 +1,5 @@
 # Changes here will be overwritten by Copier; NEVER EDIT MANUALLY
-_commit: 29ac25b
+_commit: v0.9.0
 _src_path: /home/Deeman/Projects/quart_saas_boilerplate
 author_email: ''
 author_name: ''

2  .gitignore  vendored
@@ -1,5 +1,5 @@
 # Personal / project-root
-CLAUDE.md
+/CLAUDE.md
 .bedrockapikey
 .live-slot
 .worktrees/

13  CHANGELOG.md
@@ -7,6 +7,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
## [Unreleased]

### Added
- Template sync: copier update from `29ac25b` → `v0.9.0` (29 template commits)
- `.claude/CLAUDE.md`: project-specific Claude Code instructions (skills, commands, architecture)
- `.claude/coding_philosophy.md`: engineering principles guide
- `extract/padelnomics_extract/README.md`: extraction patterns & state tracking docs
- `extract/padelnomics_extract/src/padelnomics_extract/utils.py`: SQLite state tracking
  (`open_state_db`, `start_run`, `end_run`, `get_last_cursor`) + file I/O helpers
  (`landing_path`, `content_hash`, `write_gzip_atomic`)
- `transform/sqlmesh_padelnomics/README.md`: 4-layer SQLMesh architecture guide
- Per-layer model READMEs (raw, staging, foundation, serving)
- `infra/supervisor/`: systemd service + supervisor script for pipeline orchestration
- Copier answers file now includes `enable_daas`, `enable_cms`, `enable_directory`, `enable_i18n`
  toggles (prevents accidental deletion on future copier updates)

- Expanded programmatic SEO city coverage from 18 to 40 cities (+22 cities across ES, FR,
  IT, NL, AT, CH, SE, PT, BE, AE, AU, IE) — generates 80 articles (40 cities × EN + DE)
- `scripts/refresh_from_daas.py`: syncs template_data rows from DuckDB `planner_defaults`

90  extract/padelnomics_extract/README.md  Normal file
@@ -0,0 +1,90 @@
# Padelnomics Extraction

Fetches raw data from external sources to the local landing zone. The pipeline then reads from the landing zone — extraction and transformation are fully decoupled.

## Running

```bash
# One-shot (most recent data only)
LANDING_DIR=data/landing uv run extract

# First-time full backfill (add your own backfill entry point)
LANDING_DIR=data/landing uv run python -m padelnomics_extract.execute
```

## Design: filesystem as state

The landing zone is an append-only store of raw files. Each file is named by its content fingerprint (etag or SHA256 hash), so:

- **Idempotency**: running twice writes nothing if the source hasn't changed
- **Debugging**: every historical raw file is preserved — reprocess any window by re-running transforms
- **Safety**: extraction never mutates existing files, only appends new ones

### Etag-based dedup (preferred)

When the source provides an `ETag` header, use it as the filename:

```
data/landing/padelnomics/{year}/{month:02d}/{etag}.csv.gz
```

The file existing on disk means the content matches the server's current version. No content download needed.
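
A minimal sketch of the etag flow using the helpers in `utils.py` (the URL and the year/month values are placeholders):

```python
import urllib.request

from padelnomics_extract.utils import landing_path, write_gzip_atomic

SOURCE_URL = "https://example.com/padelnomics/2025.csv"  # placeholder, not the real source

# 1. Ask the server for its current ETag (no body download).
req = urllib.request.Request(SOURCE_URL, method="HEAD")
with urllib.request.urlopen(req, timeout=30) as resp:
    etag = (resp.headers.get("ETag") or "").strip('"')
assert etag, "source did not return an ETag — fall back to hash-based dedup"

# 2. The etag is the filename: if it already exists, the landing zone is up to date.
target = landing_path("data/landing", "padelnomics", "2025", "01") / f"{etag}.csv.gz"
if not target.exists():
    with urllib.request.urlopen(SOURCE_URL, timeout=60) as resp:
        write_gzip_atomic(target, resp.read())
```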

### Hash-based dedup (fallback)

When the source has no etag (static files that update in-place), download the content and use its SHA256 prefix as the filename:

```
data/landing/padelnomics/{year}/{date}_{sha256[:8]}.csv.gz
```

Two runs that produce identical content produce the same hash → same filename → skip.
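
A matching sketch of the hash fallback (again with a placeholder URL): identical content hashes to the identical filename, so the write is skipped.

```python
import urllib.request
from datetime import date

from padelnomics_extract.utils import content_hash, landing_path, write_gzip_atomic

SOURCE_URL = "https://example.com/static/padelnomics.csv"  # placeholder

with urllib.request.urlopen(SOURCE_URL, timeout=60) as resp:
    data = resp.read()

fingerprint = content_hash(data)  # first 8 hex chars of the SHA256
today = date.today()
target_dir = landing_path("data/landing", "padelnomics", str(today.year))
target = target_dir / f"{today.isoformat()}_{fingerprint}.csv.gz"
if not target.exists():
    write_gzip_atomic(target, data)
```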

## State tracking

Every run writes one row to `data/landing/.state.sqlite`. Query it to answer operational questions:

```bash
# When did extraction last succeed?
sqlite3 data/landing/.state.sqlite \
  "SELECT extractor, started_at, status, files_written, files_skipped, cursor_value
   FROM extraction_runs ORDER BY run_id DESC LIMIT 10"

# Did anything fail in the last 7 days?
sqlite3 data/landing/.state.sqlite \
  "SELECT * FROM extraction_runs WHERE status = 'failed'
   AND started_at > datetime('now', '-7 days')"
```

State table schema:

| Column | Type | Description |
|--------|------|-------------|
| `run_id` | INTEGER | Auto-increment primary key |
| `extractor` | TEXT | Extractor name (e.g. `padelnomics`) |
| `started_at` | TEXT | ISO 8601 UTC timestamp |
| `finished_at` | TEXT | ISO 8601 UTC timestamp, NULL if still running |
| `status` | TEXT | `running` → `success` or `failed` |
| `files_written` | INTEGER | New files written this run |
| `files_skipped` | INTEGER | Files already present (content unchanged) |
| `bytes_written` | INTEGER | Compressed bytes written |
| `cursor_value` | TEXT | Last successful cursor (date, etag, page, etc.) |
| `error_message` | TEXT | Exception message if status = `failed` |
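
How a run is recorded, roughly (an illustrative lifecycle sketch — the counts and cursor value are placeholders; `execute.py` wires this around the real fetch loop):

```python
from padelnomics_extract.utils import end_run, get_last_cursor, open_state_db, start_run

conn = open_state_db("data/landing")
run_id = start_run(conn, extractor="padelnomics")
try:
    cursor = get_last_cursor(conn, "padelnomics")  # e.g. last date fetched, or None on first run
    # ... fetch everything newer than `cursor`, write files to the landing zone ...
    end_run(conn, run_id, status="success", files_written=1, cursor_value="2025-01-31")
except Exception as exc:
    end_run(conn, run_id, status="failed", error_message=str(exc))
    raise
finally:
    conn.close()
```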

## Adding a new extractor

1. Add a function in `execute.py` following the same pattern as `extract_file_by_etag()` or `extract_file_by_hash()`
2. Call it from `extract_dataset()` with its own `extractor` name in `start_run()`
3. Store files under a new subdirectory: `landing_path(LANDING_DIR, "my_new_source", year)`
4. Add a new SQLMesh `raw/` model that reads from the new subdirectory glob

## Landing zone structure

```
data/landing/
├── .state.sqlite              # extraction run history
└── padelnomics/               # one subdirectory per source
    └── {year}/
        └── {month:02d}/
            └── {etag}.csv.gz  # immutable, content-addressed files
```

129  extract/padelnomics_extract/src/padelnomics_extract/utils.py  Normal file
@@ -0,0 +1,129 @@
"""Extraction utilities: SQLite state tracking, file I/O helpers.

These are inline equivalents of the extract_core library used in larger
multi-extractor pipelines. For a single-package project they live here;
if you add multiple data sources, extract them to a shared workspace package.
"""

import gzip
import hashlib
import sqlite3
from pathlib import Path

# ---------------------------------------------------------------------------
# State tracking (SQLite — transactional, stdlib, no extra dependency)
# ---------------------------------------------------------------------------

_CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS extraction_runs (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    extractor TEXT NOT NULL,
    started_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    finished_at TEXT,
    status TEXT NOT NULL DEFAULT 'running',
    files_written INTEGER DEFAULT 0,
    files_skipped INTEGER DEFAULT 0,
    bytes_written INTEGER DEFAULT 0,
    cursor_value TEXT,
    error_message TEXT
)
"""


def open_state_db(landing_dir: str | Path) -> sqlite3.Connection:
    """Open (or create) .state.sqlite inside landing_dir.

    WAL mode allows concurrent reads while a run is in progress.
    Caller is responsible for conn.close().
    """
    db_path = Path(landing_dir) / ".state.sqlite"
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(_CREATE_TABLE_SQL)
    conn.commit()
    return conn


def start_run(conn: sqlite3.Connection, extractor: str) -> int:
    """Insert a 'running' row. Returns run_id."""
    cur = conn.execute(
        "INSERT INTO extraction_runs (extractor, status) VALUES (?, 'running')",
        (extractor,),
    )
    conn.commit()
    return cur.lastrowid


def end_run(
    conn: sqlite3.Connection,
    run_id: int,
    *,
    status: str,
    files_written: int = 0,
    files_skipped: int = 0,
    bytes_written: int = 0,
    cursor_value: str | None = None,
    error_message: str | None = None,
) -> None:
    """Update the run row to its final state."""
    assert status in ("success", "failed")
    conn.execute(
        """
        UPDATE extraction_runs
        SET finished_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now'),
            status = ?,
            files_written = ?,
            files_skipped = ?,
            bytes_written = ?,
            cursor_value = ?,
            error_message = ?
        WHERE run_id = ?
        """,
        (status, files_written, files_skipped, bytes_written, cursor_value, error_message, run_id),
    )
    conn.commit()


def get_last_cursor(conn: sqlite3.Connection, extractor: str) -> str | None:
    """Return the cursor_value from the most recent successful run, or None."""
    row = conn.execute(
        """
        SELECT cursor_value FROM extraction_runs
        WHERE extractor = ? AND status = 'success' AND cursor_value IS NOT NULL
        ORDER BY run_id DESC LIMIT 1
        """,
        (extractor,),
    ).fetchone()
    return row["cursor_value"] if row else None


# ---------------------------------------------------------------------------
# File I/O helpers
# ---------------------------------------------------------------------------

def landing_path(landing_dir: str | Path, *parts: str) -> Path:
    """Return path to a subdirectory of landing_dir, creating it if absent."""
    path = Path(landing_dir).joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


def content_hash(data: bytes, prefix_bytes: int = 8) -> str:
    """SHA256 content fingerprint — used as idempotency key in filenames."""
    assert data, "data must not be empty"
    return hashlib.sha256(data).hexdigest()[:prefix_bytes]


def write_gzip_atomic(path: Path, data: bytes) -> int:
    """Gzip compress data and write to path atomically via .tmp sibling.

    Returns bytes written. Atomic write means readers never see a partial file.
    """
    assert data, "data must not be empty"
    compressed = gzip.compress(data)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_bytes(compressed)
    tmp.rename(path)
    return len(compressed)

24  infra/supervisor/padelnomics-supervisor.service  Normal file
@@ -0,0 +1,24 @@
[Unit]
Description=Padelnomics Supervisor — Pipeline Orchestration
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/padelnomics
ExecStart=/opt/padelnomics/infra/supervisor/supervisor.sh
Restart=always
RestartSec=10
EnvironmentFile=/opt/padelnomics/.env
Environment=LANDING_DIR=/data/padelnomics/landing
Environment=DUCKDB_PATH=/data/padelnomics/lakehouse.duckdb

LimitNOFILE=65536

StandardOutput=journal
StandardError=journal
SyslogIdentifier=padelnomics-supervisor

[Install]
WantedBy=multi-user.target

47  infra/supervisor/supervisor.sh  Normal file
@@ -0,0 +1,47 @@
#!/bin/sh
# Padelnomics Supervisor — continuous pipeline orchestration.
# Inspired by TigerBeetle's CFO supervisor: simple, resilient, easy to understand.
# https://github.com/tigerbeetle/tigerbeetle/blob/main/src/scripts/cfo_supervisor.sh
#
# Environment variables (set in systemd EnvironmentFile or .env):
#   LANDING_DIR        — local path for extracted landing data
#   DUCKDB_PATH        — path to DuckDB lakehouse file
#   ALERT_WEBHOOK_URL  — optional ntfy.sh / Slack / Telegram webhook for failures

set -eu

readonly REPO_DIR="/opt/padelnomics"

while true
do
    (
        if ! [ -d "$REPO_DIR/.git" ]; then
            echo "Repository not found at $REPO_DIR — bootstrap required!"
            exit 1
        fi

        cd "$REPO_DIR"

        # Pull latest code
        git fetch origin master
        git switch --discard-changes --detach origin/master
        uv sync

        # Extract
        LANDING_DIR="${LANDING_DIR:-/data/padelnomics/landing}" \
        DUCKDB_PATH="${DUCKDB_PATH:-/data/padelnomics/lakehouse.duckdb}" \
            uv run --package padelnomics_extract extract

        # Transform
        LANDING_DIR="${LANDING_DIR:-/data/padelnomics/landing}" \
        DUCKDB_PATH="${DUCKDB_PATH:-/data/padelnomics/lakehouse.duckdb}" \
            uv run --package sqlmesh_padelnomics sqlmesh run --select-model "serving.*"

    ) || {
        if [ -n "${ALERT_WEBHOOK_URL:-}" ]; then
            curl -s -d "Padelnomics pipeline failed at $(date)" \
                "$ALERT_WEBHOOK_URL" 2>/dev/null || true
        fi
        sleep 600  # back off 10 min on failure
    }
done

107  transform/sqlmesh_padelnomics/README.md  Normal file
@@ -0,0 +1,107 @@
# Padelnomics Transform (SQLMesh)

4-layer SQL transformation pipeline using SQLMesh + DuckDB. Reads from the landing zone, produces analytics-ready tables consumed by the web app.

## Running

```bash
# From repo root — plan all changes (shows what will run)
uv run sqlmesh -p transform/sqlmesh_padelnomics plan

# Apply to production
uv run sqlmesh -p transform/sqlmesh_padelnomics plan prod

# Run model tests
uv run sqlmesh -p transform/sqlmesh_padelnomics test

# Format SQL
uv run sqlmesh -p transform/sqlmesh_padelnomics format
```

## 4-layer architecture

```
landing/                      <- raw files (extraction output)
+-- padelnomics/
    +-- {year}/{etag}.csv.gz

raw/                          <- reads files verbatim
+-- raw.padelnomics

staging/                      <- type casting, deduplication
+-- staging.stg_padelnomics

foundation/                   <- business logic, dimensions, facts
+-- foundation.dim_category

serving/                      <- pre-aggregated for web app
+-- serving.padelnomics_metrics
```

### raw/ — verbatim source reads

- Reads landing zone files directly with `read_csv(..., all_varchar=true)`
- No transformations, no business logic
- Column names match the source exactly
- Uses a macro (`@padelnomics_glob()`) so new landing files are picked up automatically
- Naming: `raw.<source>`

### staging/ — type casting and cleansing

- One model per raw model (1:1)
- Cast all columns to correct types: `TRY_CAST(report_date AS DATE)`
- Deduplicate if source produces duplicates
- Minimal renaming — only where raw names are genuinely unclear
- Naming: `staging.stg_<source>`

### foundation/ — business logic

- Dimensions (`dim_*`): slowly changing attributes, one row per entity
- Facts (`fact_*`): events and measurements, one row per event
- May join across multiple staging models from different sources
- Surrogate keys: `MD5(business_key)` for stable joins
- Naming: `foundation.dim_<entity>`, `foundation.fact_<event>`

### serving/ — analytics-ready aggregates

- Pre-aggregated for specific web app query patterns
- These are the only tables the web app reads
- Queried from `analytics.py` via `fetch_analytics()`
- Named to match what the frontend expects
- Naming: `serving.<purpose>`

## Adding a new data source

1. Add a landing zone directory in the extraction package
2. Add a glob macro in `macros/__init__.py`:
   ```python
   @macro()
   def my_source_glob(evaluator) -> str:
       landing_dir = evaluator.var("LANDING_DIR") or os.environ.get("LANDING_DIR", "data/landing")
       return f"'{landing_dir}/my_source/**/*.csv.gz'"
   ```
3. Add a raw model: `models/raw/raw_my_source.sql`
4. Add a staging model: `models/staging/stg_my_source.sql`
5. Join into foundation or serving models as needed

## Model materialization

| Layer | Default kind | Rationale |
|-------|-------------|-----------|
| raw | FULL | Always re-reads all files; cheap with DuckDB parallel scan |
| staging | FULL | 1:1 with raw; same cost |
| foundation | FULL | Business logic rarely changes; recompute is fast |
| serving | FULL | Small aggregates; web app needs latest at all times |

For large historical tables, switch to `kind INCREMENTAL_BY_TIME_RANGE` with a time partition column. SQLMesh handles the incremental logic automatically.

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LANDING_DIR` | `data/landing` | Root of the landing zone |
| `DUCKDB_PATH` | `local.duckdb` | DuckDB file (SQLMesh exclusive write access) |

The web app reads from a **separate** `analytics.duckdb` file via `export_serving.py`.
Never point `DUCKDB_PATH` and `SERVING_DUCKDB_PATH` to the same file —
SQLMesh holds an exclusive write lock during plan/run.

6  transform/sqlmesh_padelnomics/models/foundation/README.md  Normal file

@@ -0,0 +1,6 @@
# foundation

Business logic layer: dimensions, facts, conformed metrics.
May join across staging models from different sources.

Naming convention: `foundation.dim_<entity>`, `foundation.fact_<event>`

6  transform/sqlmesh_padelnomics/models/raw/README.md  Normal file
@@ -0,0 +1,6 @@
# raw

Read raw landing zone files directly with `read_csv_auto()`.
No transformations — schema as-is from source.

Naming convention: `raw.<source>_<dataset>`

6  transform/sqlmesh_padelnomics/models/serving/README.md  Normal file
@@ -0,0 +1,6 @@
# serving

Analytics-ready views consumed by the web app and programmatic SEO.
Query these from `analytics.py` via DuckDB read-only connection.

Naming convention: `serving.<purpose>` (e.g. `serving.city_market_profile`)
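
The shape of a read-only serving query (a minimal sketch; `fetch_analytics()` in `analytics.py` is the real entry point):

```python
import duckdb

# read_only=True: the web app never takes a write lock on analytics.duckdb.
con = duckdb.connect("analytics.duckdb", read_only=True)
rows = con.execute("SELECT * FROM serving.city_market_profile LIMIT 10").fetchall()
con.close()
```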

6  transform/sqlmesh_padelnomics/models/staging/README.md  Normal file
@@ -0,0 +1,6 @@
# staging

Type casting, deduplication, null handling on top of raw models.
One staging model per raw model.

Naming convention: `staging.<source>_<dataset>`