Apache Spark

← Back to Batch Processing

Open source unified analytics engine with in-memory processing. Core abstractions: RDDs, DataFrames, and SparkSQL. Significantly faster than Hadoop MapReduce for iterative workloads due to in-memory DAG execution.

Key Properties

How It Works

Builds a DAG (directed acyclic graph) of transformations that are lazily evaluated and optimized before execution. Data is kept in memory between stages rather than written to disk, making iterative algorithms (ML, graph) orders of magnitude faster than MapReduce. Catalyst optimizer handles query planning for DataFrames and SparkSQL.

Ecosystem

Spark SQL — structured data queries via DataFrames and SQL
Spark Streaming / Structured Streaming — micro-batch and continuous stream processing
MLlib — distributed machine learning library
GraphX — graph computation

Apache Hadoop (predecessor, disk-based MapReduce)
Apache Beam (portable abstraction that can run on Spark)
Apache Flink (alternative with native streaming)
Trino (SQL query engine, complementary for interactive queries)

data-pipelines batch spark

Software Engineering KB

Explorer

Apache Spark

Apache Spark

Key Properties

How It Works

Ecosystem

Graph View

Table of Contents

Backlinks

Software Engineering KB

Explorer

Apache Spark

Apache Spark

Key Properties

How It Works

Ecosystem

Related

Graph View

Table of Contents

Backlinks