Case study
Data engineering / PySpark / DuckDB

RetailGuard Data Platform

Zero-cost local retail data platform with incremental Bronze extraction, protected Silver data, blocking quality checks, and DuckDB warehouse evidence.

Project type
Retail Data Engineering / Quality Platform
Core stack
Python, PySpark, DuckDB
Delivery
Case study

Case Study

The problem, implementation decisions, measured evidence, and next improvements.

Overview

A local retail analytics platform spanning source seeding, incremental extraction, privacy-aware transformation, quality gates, warehouse loading, and evidence reporting.

Problem

Retail data pipelines need reproducible ingestion, privacy controls, quality gates, and reviewer-visible evidence without requiring cloud accounts or billing.

Solution

Built a local-first pipeline that extracts from deterministic PostgreSQL and FastAPI sources into Bronze Parquet, transforms the data into Silver with PySpark, runs blocking quality checks, and loads a DuckDB star schema with serving views.
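The case study does not say how the Bronze extraction is made incremental. One common approach is a high-water mark on a change timestamp, sketched below in plain Python. The `updated_at` column, the `extract_incremental` helper, and the sample rows are assumptions for illustration, not names taken from the repository.

```python
from datetime import datetime, timezone


def extract_incremental(rows, watermark):
    """Return rows changed after `watermark` and the advanced watermark.

    Hypothetical helper: assumes every source row carries an `updated_at`
    timestamp that only moves forward.
    """
    fresh = [row for row in rows if row["updated_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((row["updated_at"] for row in fresh), default=watermark)
    return fresh, new_watermark


# Illustrative source rows (not real project data).
SOURCE_ROWS = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
```

With a watermark of 1 January, the first call returns orders 2 and 3; a second call with the advanced watermark returns nothing, so re-running the extraction does not write the same rows to Bronze twice.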

Technical Decisions

  • Local-first defaults keep the portfolio review path free of required cloud infrastructure.
  • Raw personal fields are removed before Silver, with email hashed and phone masked.

Outcome

The repository produces a local portfolio report with KPIs, quality status, privacy controls, reconciliation, idempotency proof, layer counts, and DuckDB objects.

What It Proves

Data engineering, PySpark transformations, warehouse modeling, privacy controls, quality gates, idempotency, Docker, and local-first reviewer workflows.

Key Features

  • Default review path is fully local and requires no cloud account, billing account, free trial, or hosted service.
  • Blocking checks cover keys, values, relationships, reconciliation, volume, and raw PII leakage.
  • Two-run idempotency proof and a deliberately failing fixture make reliability visible to reviewers.
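The blocking checks listed above can be pictured as a single gate that collects failures and raises before any warehouse load. This is a minimal sketch in plain Python: the field names (`order_id`, `customer_id`, `amount`), the `min_rows` threshold, and the email pattern are assumptions, and the repository's actual checks may be defined differently.

```python
import re

# Loose pattern for spotting a raw email address in any string field.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class QualityGateError(Exception):
    """Raised when at least one blocking check fails."""


def collect_failures(orders, customers, source_order_count, min_rows=1):
    """Return the names of failed checks (empty list means the gate passes)."""
    failures = []
    order_ids = [o["order_id"] for o in orders]
    # Keys: primary key must be non-null and unique.
    if None in order_ids or len(set(order_ids)) != len(order_ids):
        failures.append("keys")
    # Values: amounts must not be negative.
    if any(o["amount"] < 0 for o in orders):
        failures.append("values")
    # Relationships: every order must point at a known customer.
    known_customers = {c["customer_id"] for c in customers}
    if any(o["customer_id"] not in known_customers for o in orders):
        failures.append("relationships")
    # Reconciliation: row count must match what the source reported.
    if len(orders) != source_order_count:
        failures.append("reconciliation")
    # Volume: reject suspiciously small batches.
    if len(orders) < min_rows:
        failures.append("volume")
    # Raw PII leakage: no string field may still look like an email.
    if any(isinstance(v, str) and EMAIL_PATTERN.search(v)
           for c in customers for v in c.values()):
        failures.append("pii_leakage")
    return failures


def run_quality_gate(orders, customers, source_order_count):
    """Raise QualityGateError so the caller cannot proceed to the load step."""
    failures = collect_failures(orders, customers, source_order_count)
    if failures:
        raise QualityGateError("blocking checks failed: " + ", ".join(failures))
```

Raising an exception rather than logging a warning is what makes the gate blocking: a deliberately failing fixture stops the run at this step.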

Architecture

  1. PostgreSQL source
  2. FastAPI campaign source
  3. Bronze Parquet
  4. PySpark Silver
  5. Quality gate
  6. DuckDB star schema
  7. Markdown evidence report

Tech Stack

  • Python
  • PySpark
  • DuckDB
  • PostgreSQL
  • FastAPI
  • Docker
  • Data quality

Verification

  • Local demo writes data/evidence/local_portfolio_report.md
  • Quality gates block invalid fixtures before warehouse load
  • Two-run demo proves deterministic seeding and warehouse idempotency
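The two-run proof rests on two properties: seeding with a fixed seed yields identical rows, and loading by key overwrites rather than appends. The sketch below illustrates both in plain Python. A dictionary stands in for a DuckDB table, and `seed_orders`, `load_idempotent`, `fingerprint`, and `two_run_proof` are hypothetical names, not the repository's functions.

```python
import hashlib
import json
import random


def seed_orders(seed, n=5):
    """Generate the same synthetic orders for the same seed."""
    rng = random.Random(seed)  # local generator, no global state
    return [{"order_id": i, "amount": rng.randint(1, 500)} for i in range(1, n + 1)]


def load_idempotent(warehouse, rows):
    """Upsert by primary key, so reloading the same rows changes nothing."""
    for row in rows:
        warehouse[row["order_id"]] = row
    return warehouse


def fingerprint(warehouse):
    """Stable SHA-256 digest of the table contents, independent of load order."""
    ordered = sorted(warehouse.values(), key=lambda r: r["order_id"])
    payload = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def two_run_proof(seed=42):
    """Run seed + load twice and return the (row count, digest) after each run."""
    warehouse = {}
    load_idempotent(warehouse, seed_orders(seed))
    first = (len(warehouse), fingerprint(warehouse))
    load_idempotent(warehouse, seed_orders(seed))
    second = (len(warehouse), fingerprint(warehouse))
    return first, second
```

If the row count and digest match after both runs, the second run neither duplicated nor altered any rows.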

Security & Privacy

  • Default workflow uses deterministic local sources rather than private retail data.
  • Raw name, email, address, and phone are removed before Silver; email is hashed and phone is masked.
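A minimal sketch of that privacy step, written in plain Python rather than PySpark so it runs without a Spark session. The choice of SHA-256, the optional salt, the four visible phone digits, and the output field names `email_hash` and `phone_masked` are all assumptions; the case study does not specify them.

```python
import hashlib

# Raw fields that must not reach Silver, as listed in the case study.
RAW_PII_FIELDS = ("name", "email", "address", "phone")


def hash_email(email, salt=""):
    """Hash a normalized email so it can still serve as a join key."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def mask_phone(phone, visible=4):
    """Keep only the last `visible` digits; replace the rest with '*'."""
    digits = [ch for ch in phone if ch.isdigit()]
    cut = max(len(digits) - visible, 0)
    return "*" * cut + "".join(digits[cut:])


def protect_customer(record):
    """Drop raw PII fields and add the hashed and masked replacements."""
    protected = {k: v for k, v in record.items() if k not in RAW_PII_FIELDS}
    protected["email_hash"] = hash_email(record["email"])
    protected["phone_masked"] = mask_phone(record["phone"])
    return protected
```

Normalizing the email before hashing means `Ada@Example.com` and `ada@example.com` map to the same digest, which keeps the hashed column usable for deduplication and joins.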

Limitations

  • The portfolio links to repository evidence rather than claiming a public live deployment.
  • Synthetic local data demonstrates the pipeline contract, not production retail volume.

Future Improvements

  • Add approved real-world retail sources when available.
  • Expand operational monitoring around quality trends.

Claims are limited to the public repository and the documented local portfolio review path.