Dan Serban is a data engineer who is passionate about large-scale distributed systems and cares about producing clean, elegant, maintainable, robust, well-tested Scala code.
This livecoding session introduces Apache Spark and is aimed at seasoned developers with an interest in understanding the streaming data pipelines that power today’s real-time analytics engines.
Apache Spark is the open-source cluster computing framework that has largely replaced Hadoop MapReduce in recent years. It features in-memory processing and streaming capabilities as well as a SQL interface and a mature set of tools for machine learning and graph processing workloads.
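To give a flavor of the SQL interface mentioned above, here is a hedged, minimal sketch (the data, view name, and query are invented for illustration and are not taken from the session itself):

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-demo")
      .master("local[*]") // local mode, for demonstration only
      .getOrCreate()
    import spark.implicits._

    // Register an in-memory DataFrame as a temporary SQL view
    Seq(("alice", 3), ("bob", 5)).toDF("user", "visits")
      .createOrReplaceTempView("visits")

    // Query it with plain SQL
    spark.sql("SELECT user, visits FROM visits WHERE visits > 4").show()

    spark.stop()
  }
}
```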
We’ll first take a look at how to build a few basic static pipelines using Spark’s new Dataset API. Towards the end, we’ll examine a relatively complex Kafka-Spark-Cassandra streaming pipeline that more closely mimics a real-life high-load production setting.
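A static pipeline of the kind described above might look like the following sketch, which uses the typed Dataset API to filter, group, and aggregate a small in-memory dataset (the case class, field names, and sample data are hypothetical placeholders, not material from the session):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for real input data
case class Click(userId: String, url: String, durationMs: Long)

object BasicPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("basic-dataset-pipeline")
      .master("local[*]") // local mode, for demonstration only
      .getOrCreate()
    import spark.implicits._

    val clicks = Seq(
      Click("alice", "/home", 1200L),
      Click("bob",   "/docs",  800L),
      Click("alice", "/docs", 2400L)
    ).toDS()

    // Typed transformations: keep long visits, then sum durations per user
    val totalsPerUser = clicks
      .filter(_.durationMs > 1000L)
      .groupByKey(_.userId)
      .mapValues(_.durationMs)
      .reduceGroups(_ + _)

    totalsPerUser.show()
    spark.stop()
  }
}
```

Because the transformations operate on the `Click` case class rather than on untyped rows, field-name mistakes surface at compile time rather than at job runtime.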
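The shape of a Kafka-Spark-Cassandra streaming pipeline can be sketched with Structured Streaming as below. This is only an assumed outline, not the session's actual code: it presumes the `spark-sql-kafka-0-10` and `spark-cassandra-connector` packages are on the classpath, and the broker address, topic, keyspace, table, and checkpoint path are all invented:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamingPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-spark-cassandra")
      .getOrCreate()

    // Read a continuous stream of events from a Kafka topic
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "events")                       // hypothetical topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Persist each micro-batch to Cassandra via the connector
    val query = events.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "analytics") // hypothetical keyspace
          .option("table", "events")       // hypothetical table
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/tmp/checkpoints/events")
      .start()

    query.awaitTermination()
  }
}
```

The checkpoint location is what lets the job recover its Kafka offsets after a failure, which is one of the concerns that distinguishes a production-grade streaming pipeline from a toy one.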