Introduction to Apache Spark

This livecoding session introduces Apache Spark and is aimed at seasoned developers with an interest in understanding the streaming data pipelines that power today’s real-time analytics engines.

Apache Spark is an open-source cluster computing framework that has largely displaced Hadoop MapReduce for many workloads in recent years. It features in-memory processing and streaming capabilities, as well as a SQL interface and a mature set of libraries for machine learning and graph processing.

We’ll first take a look at how to build a few basic static pipelines using Spark’s new Dataset API. Towards the end, we’ll examine a relatively complex Kafka-Spark-Cassandra streaming pipeline that more closely mimics a real-life, high-load production setting.

October 9, 14:35 to 15:10 (35′)

Dan Serban
