Course Code: CS-315 Cloud-Computing | NUST

Autoscaling Stream ETL & Window Aggregation using Spark Structured Streaming

A real-time pipeline engineered to solve the "Provisioning Dilemma" using Apache Spark Dynamic Allocation and Kafka.

01 | The Problem

Static cloud infrastructure forces a choice between high cost (over-provisioning) and high latency (under-provisioning). This project implements infrastructure elasticity: the cluster grows and shrinks its compute resources in response to live data pressure.

02 | System Architecture

SOURCE: Python Producer
INGESTION: Apache Kafka
PROCESSING: Spark Master with executors W1, W2, W3, W4

Executors W2-W4 scale up when a 1-second backlog is detected in the 'cs_student_logs' topic.
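
The scaling behavior described above maps naturally onto Spark's dynamic allocation settings. The following is a minimal sketch, not the project's actual configuration. It assumes that the 1s backlog trigger corresponds to `spark.dynamicAllocation.schedulerBacklogTimeout`, that W1 is the always-on executor, and that W4 is the ceiling. The helper `to_submit_args` and the idle timeout value are hypothetical.

```python
# Hypothetical dynamic-allocation settings; values are assumptions inferred
# from the architecture above, not taken from the project's real config.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # Allows dynamic allocation without an external shuffle service (Spark 3.0+).
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",  # assumed: W1 stays up
    "spark.dynamicAllocation.maxExecutors": "4",  # assumed: W1-W4
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",  # assumed; Spark default
}


def to_submit_args(conf):
    """Flatten a config dict into spark-submit style ``--conf key=value`` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

Keeping the settings in one dictionary makes the scaling bounds easy to review and change; the same key/value pairs could equally be passed to a `SparkSession` builder.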

03 | Technology Stack

Category     Technology          Role
Ingestion    Apache Kafka        High-throughput message buffer
Processing   Spark 3.5.0         10s Tumbling Window Aggregation
Elasticity   Dynamic Allocation  Real-time Executor scaling
Environment  Docker, Linux       Containerized cluster management
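
To make the 10s tumbling window concrete, here is a small pure-Python sketch of the bucketing logic: each event falls into exactly one non-overlapping 10-second window, and events are counted per window. In PySpark the same grouping is typically expressed with `pyspark.sql.functions.window` over an event-time column. The function names and sample events below are hypothetical and only illustrate the semantics; they are not the project's code.

```python
from collections import Counter

WINDOW_SECONDS = 10  # tumbling window length from the technology table


def window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start (in seconds) of the tumbling window containing event_ts."""
    return int(event_ts // size) * size


def count_per_window(event_timestamps, size=WINDOW_SECONDS):
    """Count events per half-open window [start, start + size)."""
    counts = Counter(window_start(ts, size) for ts in event_timestamps)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical event times in seconds.
    events = [0.5, 3.2, 9.9, 10.0, 14.7, 25.1]
    for start, n in count_per_window(events).items():
        print(f"[{start}, {start + WINDOW_SECONDS}) -> {n} events")
```

Because the windows do not overlap, an event at exactly 10.0 s belongs to the second window, not the first.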

04 | Performance Impact

62.5% Cost Reduction
300% Spike Capacity

In testing, compute usage fell from 96 instance-hours/day under a static cluster strategy to 36 instance-hours/day with dynamic allocation, while maintaining sub-second latency.
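
The headline cost figure follows directly from the two instance-hour numbers above. A short check of the arithmetic, where the function name is illustrative:

```python
def percent_reduction(baseline, observed):
    """Percentage saved relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - observed) / baseline * 100


if __name__ == "__main__":
    static_hours = 96   # instance-hours/day, static cluster (from the text)
    elastic_hours = 36  # instance-hours/day, with dynamic allocation (from the text)
    print(f"{percent_reduction(static_hours, elastic_hours):.1f}% reduction")
```

The 96 instance-hours/day baseline is consistent with four instances running around the clock (4 x 24), though the source does not state how the baseline was derived.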