The Spark Stack


• Spark SQL: This is Spark’s module for working with structured data, and it is designed to support workloads that combine familiar SQL database
queries with more complicated, algorithm-based analytics. Spark SQL supports the open source Hive project, and its SQL-like HiveQL query syntax.
Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses and business
intelligence tools. JDBC connectors can also be used to integrate with Apache Drill, opening up access to an even broader range of data sources.
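Spark SQL's own API is richer than this, but the core idea described above — combining familiar declarative SQL over structured data with further programmatic analytics — can be sketched with Python's standard-library sqlite3 module standing in for a Spark SQL/JDBC data source (the table and column names here are illustrative, not part of any Spark API):

```python
import sqlite3

# In-memory database standing in for a structured data source
# (in Spark SQL this would be a DataFrame or Hive table; names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, duration REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 3.0), ("bob", 5.0), ("alice", 7.0)],
)

# A familiar SQL aggregation query...
rows = conn.execute(
    "SELECT user, AVG(duration) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...whose results feed directly into programmatic analysis in the host language.
averages = {user: avg for user, avg in rows}
print(averages)  # {'alice': 5.0, 'bob': 5.0}
```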

• Spark Streaming: This module supports scalable and fault-tolerant processing of streaming data, and can integrate with established sources of
data streams like Flume (optimized for data logs) and Kafka (optimized for distributed messaging). Spark Streaming’s design, and its use of
Spark’s RDD abstraction, are meant to ensure that applications written for streaming data can be repurposed to analyze batches of historical data
with little modification.
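The reuse property mentioned above — the same logic serving both streaming and batch workloads — can be illustrated with a plain-Python sketch (Spark Streaming's real API operates on DStreams/RDDs, not lists; this toy micro-batch loop only mirrors the idea):

```python
from collections import Counter
from typing import Iterable

# One analysis function, written once.
def word_count(lines: Iterable[str]) -> Counter:
    return Counter(word for line in lines for word in line.split())

history = ["spark streams data", "spark scales", "data flows"]

# Batch mode: analyze all historical data at once.
batch_result = word_count(history)

# "Streaming" mode: the same function applied to micro-batches,
# with results merged incrementally -- mirroring how Spark Streaming
# lets batch logic be repurposed for streams.
stream_result = Counter()
for micro_batch in ([history[0]], [history[1]], [history[2]]):
    stream_result += word_count(micro_batch)

assert batch_result == stream_result  # identical answers either way
```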

• MLlib: This is Spark’s scalable machine learning library, which implements a set of commonly used machine learning and statistical algorithms.
These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
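As a concrete example of one statistic in that list, here is a from-scratch Pearson correlation on toy data — MLlib's actual API call differs, and its value is running such computations at scale; this sketch only shows what is being computed:

```python
import math

# Plain-Python Pearson correlation coefficient.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related data gives a correlation of 1.0.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```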

• GraphX: This module began life as a separate UC Berkeley research project, which was eventually donated to the Apache Spark project.
GraphX supports analysis of and computation over graphs of data, and supports a version of graph processing’s Pregel API. GraphX includes a
number of widely understood graph algorithms, including PageRank.
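PageRank, mentioned above, is simple enough to sketch in plain Python. GraphX runs this kind of computation over large distributed graphs; the tiny graph and damping factor below are purely illustrative:

```python
# Minimal iterative PageRank on a toy graph with no dangling nodes.
def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node keeps a "teleport" share, plus contributions
        # from every node that links to it.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

# a -> b, a -> c, b -> c, c -> a
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)  # ranks sum to ~1.0; "c", with the most inlinks, scores highest
```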

• SparkR: This module was added in the 1.4 release of Apache Spark, providing data scientists and statisticians using R with a lightweight mechanism for calling upon Spark’s capabilities.

Storage Options for Apache Spark

• MapR (file system and database)
• Google Cloud
• Amazon S3
• Apache Cassandra
• Apache Hadoop (HDFS)
• Apache HBase
• Apache Hive
• Berkeley’s Tachyon project

Spark Deployment Options

• Running Spark on YARN
• Running Spark on Mesos
• Running Spark on EC2

Programming languages supported by Spark

• Python
• Scala
• Java
• R

Spark Examples

Speed of Spark: sorting 100 terabytes in 23 minutes.

Spark wins Daytona Gray Sort 100TB Benchmark

We are proud to announce that Spark won the 2014 Gray Sort Benchmark (Daytona 100TB category). A team from Databricks including Spark committers Reynold Xin, Xiangrui Meng, and Matei Zaharia entered the benchmark using Spark. Spark tied with the Themis team from UCSD, and the two jointly set a new world record in sorting.

They used Spark and sorted 100TB of data using 206 EC2 i2.8xlarge machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2100 nodes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.
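The headline ratios in the paragraph above follow directly from the reported figures, as a quick arithmetic check shows:

```python
# Figures from the 2014 Daytona GraySort result described above.
spark_minutes, spark_nodes = 23, 206
hadoop_minutes, hadoop_nodes = 72, 2100

speedup = hadoop_minutes / spark_minutes      # ~3.1x faster
machine_ratio = hadoop_nodes / spark_nodes    # ~10.2x fewer machines

# The per-node throughput advantage compounds the two ratios.
per_node_advantage = speedup * machine_ratio  # ~31.9x
print(round(speedup, 1), round(machine_ratio, 1), round(per_node_advantage, 1))
```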

Outperforming large Hadoop MapReduce clusters on sorting not only validates the vision and work done by the Spark community, but also demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes.

For more information, see the Databricks blog article written by Reynold Xin.