Data EngineeringADA Asia2022 – 2024Production

Revenue Metrics Pipeline Migration

Migrated the revenue metrics pipeline from single-machine pandas to distributed PySpark, resolving critical implementation bugs to unblock daily ML scoring.

Distributed

Execution Model

GCP Dataproc

Infrastructure

The Problem

The revenue metrics pipeline was built on single-machine pandas processing large CRM datasets — inherently unsuitable for daily production scale. A partial PySpark migration existed but had critical bugs causing incorrect metric outputs that blocked downstream ML scoring pipelines.

What Was Built

Rewrote the pipeline in PySpark using broadcast joins, window function aggregations, and filter pushdown to leverage distributed execution on GCP Dataproc. Diagnosed and resolved the existing PySpark implementation bugs — including incorrect cohort boundary logic and partition misalignment — that were producing silent metric errors in production.

Business Impact

Delivered a correct, production-stable distributed pipeline that unblocked daily ML scoring and dashboard refresh. Replaced a fundamentally unscalable single-machine approach with distributed execution suited to the data volume.

Tech Stack

PySparkSpark SQLGCP DataprocBigQueryParquet

Domain Tags

PySparkDistributed ComputingPipeline MigrationData Engineering

Details

Role
Primary Owner
Status
Production
Tier
Tier 1
Period
2022 – 2024
Employment
ADA Asia