Revenue Metrics Pipeline Migration
Migrated the revenue metrics pipeline from single-machine pandas to distributed PySpark, resolving critical implementation bugs to unblock daily ML scoring.
Distributed
Execution Model
GCP Dataproc
Infrastructure
The Problem
The revenue metrics pipeline was built on single-machine pandas processing large CRM datasets — inherently unsuitable for daily production scale. A partial PySpark migration existed but had critical bugs causing incorrect metric outputs that blocked downstream ML scoring pipelines.
What Was Built
Rewrote the pipeline in PySpark using broadcast joins, window function aggregations, and filter pushdown to leverage distributed execution on GCP Dataproc. Diagnosed and resolved the existing PySpark implementation bugs — including incorrect cohort boundary logic and partition misalignment — that were producing silent metric errors in production.
Business Impact
Delivered a correct, production-stable distributed pipeline that unblocked daily ML scoring and dashboard refresh. Replaced a fundamentally unscalable single-machine approach with distributed execution suited to the data volume.
Tech Stack
Domain Tags
Details
- Role
- Primary Owner
- Status
- Production
- Tier
- Tier 1
- Period
- 2022 – 2024
- Employment
- ADA Asia