This live-coding session introduces Apache Spark and is aimed at seasoned developers who want to understand the streaming data pipelines that power today’s real-time analytics engines.
Apache Spark is an open-source cluster computing framework that has largely replaced Hadoop MapReduce in recent years. It offers in-memory processing and streaming capabilities, a SQL interface, and a mature set of tools for machine learning and graph processing workloads.
We’ll first look at how to build a few basic static pipelines using Spark’s new Dataset API. Towards the end, we’ll examine a relatively complex Kafka-Spark-Cassandra streaming pipeline that more closely mimics a real-life, high-load production setting.
October 9 @ 14:35
14:35 — 15:10 (35′)
Dan Serban