RetailGuard Data Platform
Zero-cost local retail data platform with incremental Bronze extraction, protected Silver data, blocking quality checks, and DuckDB warehouse evidence.
- Project type: Retail Data Engineering / Quality Platform
- Core stack: Python, PySpark, DuckDB
- Delivery: Case study
Case Study
The problem, implementation decisions, measured evidence, and next improvements.
Overview
A local retail analytics platform spanning source seeding, incremental extraction, privacy-aware transformation, quality gates, warehouse loading, and evidence reporting.
Problem
Retail data pipelines need reproducible ingestion, privacy controls, quality gates, and reviewer-visible evidence without requiring cloud accounts or billing.
Solution
A local-first pipeline extracts from deterministic PostgreSQL and FastAPI sources into Bronze Parquet, transforms Bronze into Silver with PySpark, runs blocking quality checks, and loads a DuckDB star schema with serving views.
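The incremental Bronze extraction can be pictured with a watermark, one common way to pull only changed rows. The case study does not say which mechanism the repository uses, so the `updated_at` column, the `extract_incremental` helper, and the in-memory source below are all hypothetical; this is a minimal plain-Python sketch, not the project's code.

```python
from datetime import datetime


def extract_incremental(rows, watermark):
    """Return rows changed after `watermark`, plus the watermark for the next run.

    Assumption: each source row carries an `updated_at` timestamp.
    """
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    next_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, next_watermark


# Hypothetical stand-in for the PostgreSQL source table.
source = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 2)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 3)},
]

# First run picks up everything newer than the stored watermark.
first_batch, wm = extract_incremental(source, datetime(2024, 1, 1))
# A second run with the advanced watermark finds nothing new.
second_batch, wm2 = extract_incremental(source, wm)
```

Persisting the returned watermark between runs is what would keep a rerun from extracting the same rows twice.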
Technical Decisions
- Local-first defaults keep the portfolio review path free of required cloud infrastructure.
- Raw personal fields are removed before Silver; email is retained only as a hash and phone only in masked form.
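The privacy step in the last decision above could look roughly like the following. The case study does not name the hash algorithm, the mask format, or the column names, so SHA-256, the demo salt, the "last two digits visible" mask, and the `email_hash` / `phone_masked` fields are assumptions chosen for illustration. Plain Python is used here in place of PySpark to keep the sketch self-contained.

```python
import hashlib

# Raw fields that must not reach Silver (list taken from the case study).
RAW_PII_FIELDS = ("name", "email", "address", "phone")


def hash_email(email: str, salt: str = "demo-salt") -> str:
    """Hash a normalized email. SHA-256 and the salt are assumptions."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def mask_phone(phone: str, visible: int = 2) -> str:
    """Keep only the last `visible` digits; replace the other digits with '*'."""
    digits = [c for c in phone if c.isdigit()]
    if len(digits) <= visible:
        return "*" * len(digits)
    return "*" * (len(digits) - visible) + "".join(digits[-visible:])


def to_silver(record: dict) -> dict:
    """Drop raw PII columns and add the protected derivatives."""
    silver = {k: v for k, v in record.items() if k not in RAW_PII_FIELDS}
    silver["email_hash"] = hash_email(record["email"])
    silver["phone_masked"] = mask_phone(record["phone"])
    return silver


# Hypothetical Bronze record with synthetic values.
bronze = {
    "customer_id": 7,
    "name": "Ada Example",
    "email": "Ada@Example.com ",
    "address": "1 Main St",
    "phone": "+1 (555) 010-1234",
}
silver = to_silver(bronze)
```

Normalizing the email before hashing means the same address always maps to the same hash, which keeps the hashed column usable as a join key.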
Outcome
The repository produces a local portfolio report with KPIs, quality status, privacy controls, reconciliation, idempotency proof, layer counts, and DuckDB objects.
What It Proves
Data engineering, PySpark transformations, warehouse modeling, privacy controls, quality gates, idempotency, Docker, and local-first reviewer workflows.
Key Features
- Default review path is fully local and requires no cloud account, billing account, free trial, or hosted service.
- Blocking checks cover keys, values, relationships, reconciliation, volume, and raw PII leakage.
- Two-run idempotency proof and a deliberately failing fixture make reliability visible to reviewers.
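A blocking gate of the kind listed above might be sketched as a function that collects failures and raises before any warehouse load. The check names follow the categories in the feature list (keys, values, relationships, raw PII leakage); the `QualityGateError` class, the column names, and the fixtures are hypothetical, and the repository's actual checks may differ.

```python
class QualityGateError(Exception):
    """Raised when any blocking check fails; the caller must not load the warehouse."""


RAW_PII_COLUMNS = {"name", "email", "address", "phone"}


def run_quality_gate(orders, customers):
    failures = []

    # Keys: order_id must be present and unique.
    order_ids = [o.get("order_id") for o in orders]
    if None in order_ids or len(order_ids) != len(set(order_ids)):
        failures.append("keys: order_id must be non-null and unique")

    # Values: amount must be present and non-negative.
    if any(o.get("amount") is None or o["amount"] < 0 for o in orders):
        failures.append("values: amount must be present and non-negative")

    # Relationships: every order must reference a known customer.
    known_customers = {c["customer_id"] for c in customers}
    if any(o.get("customer_id") not in known_customers for o in orders):
        failures.append("relationships: order references unknown customer")

    # Raw PII leakage: no raw personal column may appear in the customer rows.
    leaked = RAW_PII_COLUMNS & {col for c in customers for col in c}
    if leaked:
        failures.append(f"pii: raw columns present: {sorted(leaked)}")

    if failures:
        raise QualityGateError("; ".join(failures))
    return True


customers = [{"customer_id": 1, "email_hash": "ab12"}]
good_orders = [{"order_id": 10, "customer_id": 1, "amount": 25.0}]
# Deliberately failing fixture: unknown customer and negative amount.
bad_orders = [{"order_id": 10, "customer_id": 99, "amount": -5.0}]

gate_passed = run_quality_gate(good_orders, customers)
try:
    run_quality_gate(bad_orders, customers)
    blocked = False
except QualityGateError as exc:
    blocked = True
    failure_message = str(exc)
```

Collecting every failure before raising gives a reviewer the full list of broken checks in one run rather than only the first.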
Architecture
- 01 PostgreSQL source
- 02 FastAPI campaign source
- 03 Bronze Parquet
- 04 PySpark Silver
- 05 Quality gate
- 06 DuckDB star schema
- 07 Markdown evidence report
Tech Stack
- Python
- PySpark
- DuckDB
- PostgreSQL
- FastAPI
- Docker
- Data quality
Verification
- Local demo writes data/evidence/local_portfolio_report.md
- Quality gates block invalid fixtures before warehouse load
- Two-run demo proves deterministic seeding and warehouse idempotency
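The two-run proof in the last item can be illustrated by loading the same deterministic seed twice and comparing row counts and a content fingerprint. The dict-backed warehouse, the key-based upsert, and the helper names are assumptions standing in for the DuckDB load, which the case study does not describe in detail.

```python
import hashlib
import json


def seed_orders(n=3):
    """Deterministic seed: the same input always yields the same rows."""
    return [{"order_id": i, "amount": i * 10} for i in range(1, n + 1)]


def load_warehouse(warehouse, rows):
    """Upsert by primary key, so reloading the same rows does not duplicate them."""
    for row in rows:
        warehouse[row["order_id"]] = row
    return warehouse


def fingerprint(warehouse):
    """Stable hash of the warehouse contents, independent of insertion order."""
    payload = json.dumps([warehouse[k] for k in sorted(warehouse)], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


warehouse = {}

# Run 1
load_warehouse(warehouse, seed_orders())
first = (len(warehouse), fingerprint(warehouse))

# Run 2 with the same seed
load_warehouse(warehouse, seed_orders())
second = (len(warehouse), fingerprint(warehouse))

# The runs are idempotent if both the row count and the fingerprint are unchanged.
idempotent = first == second
```

An append-only load would fail this comparison on the second run, because the row count would double.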
Security & Privacy
- Default workflow uses deterministic local sources rather than private retail data.
- Raw name, email, address, and phone are removed before Silver; only a hashed email and a masked phone are retained.
Limitations
- The portfolio links to repository evidence rather than claiming a public live deployment.
- Synthetic local data demonstrates the pipeline contract, not production retail volume.
Future Improvements
- Add approved real-world retail sources when available.
- Expand operational monitoring around quality trends.
Claims are limited to the public repository and the documented local portfolio review path.