PATHSDATA
Data Engineering · Lakehouse · Apache Iceberg

National Enterprise Client

Unified Data Lakehouse — Eliminating Vendor Dependency

A national enterprise was trapped in a fragmented vendor ecosystem where data collection, aggregation, and reporting were handled by three different vendors. Reports took 5-7 days to generate and frequently contained errors with no clear accountability.

  • Report Delivery Time: 5-7 days → minutes
  • Cost Reduction: 60%
  • Data Lineage Visibility: 100%
  • Vendor Dependencies: zero

The Problem

A classic case of "too many cooks" — but with data vendors instead of chefs.

Fragmented Vendor Ecosystem

Data collection, aggregation, and reporting were handled by three different vendors with no unified ownership.

Report Delays

Business reports took 5-7 days to generate due to coordination between multiple vendors and manual handoffs.

Data Quality Issues

Reports frequently contained errors due to inconsistent transformations and lack of data validation between vendor systems.

High Costs & Finger-Pointing

When issues arose, vendors blamed each other. Troubleshooting required expensive coordination across multiple contracts.

The Vendor Chaos

Vendor A

Data Collection

  • Collected raw data from multiple source systems and APIs
  • No visibility into data quality at source
  • Different data formats with no standardization

Vendor B

Data Aggregation & ETL

  • Transformed and aggregated data from Vendor A
  • Black-box transformations with no documentation
  • Batch processing only — no real-time capabilities

Vendor C

Reporting & Analytics

  • Built dashboards and reports from Vendor B data
  • Limited to pre-built reports with no self-service
  • Couldn't trace errors back to source

The Solution

We designed and built a consolidated data platform that eliminated vendor dependencies and gave the client full ownership of their data pipeline.

1. ECS-Based Ingestion Layer

Containerized ingestion jobs running on Amazon ECS for both real-time streaming and batch data sources. Scalable, maintainable, and fully managed.

Amazon ECS · ECR · AWS Glue · Step Functions · EventBridge
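The orchestration pattern behind this layer can be sketched as a Step Functions state machine that runs a containerized ingestion job as an ECS task, triggered on an EventBridge schedule. The sketch below builds the state-machine definition as a Python dict; every ARN, cluster, and task-definition name is a placeholder, not the client's actual configuration:

```python
import json

# Minimal sketch: one EventBridge-scheduled Step Functions state machine
# that runs a containerized ingestion job as a Fargate ECS task.
# All ARNs and names below are illustrative placeholders.
ingestion_state_machine = {
    "Comment": "Run one containerized ingestion job on Fargate",
    "StartAt": "RunIngestionTask",
    "States": {
        "RunIngestionTask": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the task to finish.
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:REGION:ACCOUNT:cluster/ingestion",
                "TaskDefinition": "arn:aws:ecs:REGION:ACCOUNT:task-definition/ingest-source-a",
            },
            # Retry transient task failures before surfacing an error.
            "Retry": [
                {"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "IntervalSeconds": 60}
            ],
            "End": True,
        }
    },
}

print(json.dumps(ingestion_state_machine, indent=2))
```

Because each source gets its own task definition, adding a new feed is a matter of registering a new container image in ECR and pointing a schedule at it, with no shared batch window to coordinate.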

2. Open Lakehouse Architecture

Apache Iceberg on S3 as the foundation — open table format that prevents vendor lock-in while providing ACID transactions, time travel, and schema evolution.

Apache Iceberg · Amazon S3 · AWS Glue Data Catalog · Lake Formation
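The schema evolution and time travel that Iceberg provides are visible in plain SQL through Athena's Iceberg integration. The statements below are held as Python strings for illustration; the database, table, and column names are hypothetical, not the client's schema:

```python
# Illustrative Athena SQL against an Iceberg table. Table and column
# names (lakehouse.silver_orders, channel) are invented for this sketch.

# Schema evolution: add a column without rewriting the underlying data files.
schema_evolution = "ALTER TABLE lakehouse.silver_orders ADD COLUMNS (channel string)"

# Time travel: query the table exactly as it looked at an earlier point,
# e.g. to reproduce a report as it was generated on a given date.
time_travel = (
    "SELECT count(*) FROM lakehouse.silver_orders "
    "FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'"
)
```

Because the table format is open, the same tables remain readable by Spark, Trino, or any other Iceberg-aware engine, which is exactly what keeps the door open for switching tools later.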

3. Polars & PyIceberg Processing

High-performance data processing using Polars for lightning-fast transformations and PyIceberg for native Iceberg table operations. Rust-powered performance without the overhead of a Spark cluster.

Polars · PyIceberg · Python · Rust-native

4. Self-Service Analytics

Business users can now build their own reports and explore data without waiting for IT or vendors. Real-time dashboards with drill-down capabilities.

Amazon QuickSight · Athena · QuickSight Q (natural-language query)
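Programmatic access follows the same path the dashboards use: a SQL query executed through Athena against the gold-layer tables. The sketch below builds such a query; the table and column names are hypothetical, and the commented boto3 call shows how it would be executed (requires AWS credentials):

```python
import textwrap

def build_report_query(table: str, start_date: str, end_date: str) -> str:
    """Build a drill-down report query over a gold-layer table.

    Table and column names here (region, order_ts, sales) are illustrative.
    """
    return textwrap.dedent(f"""
        SELECT region, date_trunc('day', order_ts) AS day, sum(sales) AS total
        FROM {table}
        WHERE order_ts BETWEEN timestamp '{start_date}' AND timestamp '{end_date}'
        GROUP BY region, date_trunc('day', order_ts)
        ORDER BY day, region
    """).strip()

query = build_report_query("gold.daily_sales", "2024-01-01 00:00:00", "2024-01-31 23:59:59")
print(query)

# To execute against Athena (requires AWS credentials and an S3 results bucket):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=query,
#     WorkGroup="primary",
#     ResultConfiguration={"OutputLocation": "s3://my-results-bucket/"},
# )
```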

Architecture Overview

Sources (APIs, Databases, Files, Streaming, Third-Party)
  → Ingestion (ECS Jobs, ECR Containers, Step Functions, EventBridge)
  → Storage (S3 Data Lake, Apache Iceberg, Bronze/Silver/Gold)
  → Processing (Polars, PyIceberg, Data Quality, Glue Catalog)
  → Serving (Athena, QuickSight, API Gateway, Lake Formation)
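The data-quality step between the bronze and silver layers is what makes errors traceable back to their source. A minimal sketch of that promotion logic, with invented field names and rules: records that fail validation are kept alongside a reason instead of silently dropped.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    valid: list      # records promoted to the silver layer
    rejected: list   # (record, reason) pairs kept for triage

def promote_to_silver(bronze_records: list[dict]) -> ValidationResult:
    """Validate bronze-layer records before writing them to silver.

    Field names (order_id, amount) and rules are illustrative only.
    """
    valid, rejected = [], []
    for rec in bronze_records:
        if not rec.get("order_id"):
            rejected.append((rec, "missing order_id"))
        elif not isinstance(rec.get("amount"), (int, float)) or rec["amount"] < 0:
            rejected.append((rec, "invalid amount"))
        else:
            valid.append(rec)
    return ValidationResult(valid=valid, rejected=rejected)

result = promote_to_silver(
    [
        {"order_id": "A1", "amount": 10.0},
        {"order_id": "", "amount": 5.0},
        {"order_id": "A3", "amount": -2.0},
    ]
)
print(len(result.valid), len(result.rejected))  # prints: 1 2
```

Surfacing rejects with reasons at the layer boundary is what replaces the old cross-vendor finger-pointing: a bad report row now points to a specific record and a specific rule.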

Key Benefits

Single Source of Truth

All data flows through one platform with consistent definitions, eliminating discrepancies between vendor systems.

No Vendor Lock-In

Apache Iceberg's open format means data is portable. The client owns their data and can switch tools anytime.

Real-Time + Batch

Unified architecture handles both streaming data and batch feeds in the same pipeline with ECS-based jobs.

Full Transparency

Every data transformation is documented, version-controlled, and traceable from source to report.

Self-Service Analytics

Business users build their own reports without IT bottlenecks or vendor dependencies.

Rust-Powered Performance

Polars delivers 10-100x faster processing than Pandas, enabling rapid iteration and cost savings.

Project Timeline

1. Discovery & Design (3 weeks)

Audit existing vendor systems, map data flows, design target architecture

2. Foundation Build (4 weeks)

Set up AWS infrastructure, Iceberg tables, ingestion pipelines

3. Migration & Integration (6 weeks)

Migrate data sources, build transformations, implement data quality checks

4. Analytics & Handoff (3 weeks)

Deploy QuickSight dashboards, train users, deliver documentation

Total Project Duration: ~16 weeks

Technology Stack

Ingestion

  • Amazon ECS
  • ECR
  • Step Functions
  • EventBridge

Storage

  • Amazon S3
  • Apache Iceberg
  • Parquet
  • Lake Formation

Processing

  • Polars
  • PyIceberg
  • Python
  • AWS Glue

Analytics

  • Amazon Athena
  • QuickSight
  • QuickSight Q

Trapped in Vendor Dependency?

Let's discuss how a unified data lakehouse can give you control over your data and eliminate costly vendor coordination.