Applying the Lambda Architecture with Spark, Kafka, and Cassandra

This course introduces how to build robust, scalable, real-time big data systems using a variety of Apache Spark's APIs, including the Streaming, DataFrame, SQL, and Data Sources APIs, integrated with Apache Kafka, HDFS, and Apache Cassandra.
Course info
Rating: (84)
Level: Beginner
Updated: Nov 15, 2016
Duration: 6h 4m
Table of contents
Course Overview
A Modern Big Data Architecture
Batch Layer with Apache Spark
Speed Layer with Spark Streaming
Advanced Streaming Operations
Streaming Ingest with Kafka and Spark Streaming
Persisting with Cassandra
Description

This course aims to get beyond all the hype in the big data world and focus on what really works for building robust, highly scalable batch and real-time systems. In this course, Applying the Lambda Architecture with Spark, Kafka, and Cassandra, you'll string together technologies that fit well together, designed by companies with some of the most demanding data requirements (such as Facebook, Twitter, and LinkedIn) and by teams leading the way in the design of data processing frameworks, like Apache Spark, which plays an integral role throughout this course. You'll look at each individual component and work out the architectural details that make it a good fit for building a system based on the Lambda Architecture. You'll then build out a full application from scratch, starting with a small application that simulates the production of data in a stream, all the way to addressing global state, non-associative calculations, application upgrades and restarts, and finally presenting real-time and batch views in Cassandra. When you're finished with this course, you'll be ready to hit the ground running with these technologies and build better data systems than ever.

About the author

Ahmad is a Data Architect specializing in the implementation of high-performance data warehouses and BI systems, and he enjoys speaking at user groups and conferences.

Section Introduction Transcripts

Course Overview
Hi! My name is Ahmad Alkilani, and welcome to my course, Applying the Lambda Architecture with Spark, Kafka, and Cassandra. We see big data discussed every day, whether you're in the field actively working on big data projects, hearing about the scale of problems companies like LinkedIn, Facebook, and Twitter have to deal with on a daily basis, or simply listening to the radio about some initiative where big data enabled the analysis and discovery of new insights. In this course, our focus will be on building real-time systems that can handle real-time data at scale, with robustness and fault tolerance as first-class citizens, using tools like Apache Spark, Kafka, Cassandra, and Hadoop. We'll look at how thoughtful design of your big data applications allows you to combine low-latency streaming data with batch workloads. We'll design and build an application from scratch using Apache Spark, Spark DataFrames, and Spark SQL, in addition to Spark's Data Sources API to load, store, and manipulate data. We'll also look at Spark Streaming and Spark-Kafka integration techniques for reliability and speed. We'll write a Kafka data producer to simulate the real-time data stream feeding into our streaming application. And as we dive deeper into the course, we'll look at how you can preserve global state and use memory efficiently with approximate algorithms as we build a stateful Spark Streaming application. A production application isn't complete without the ability to handle errors and code updates, so we'll cover those as well. We'll also learn how to use a scalable NoSQL database, persisting your data to Cassandra and HDFS. By the end of this course, you'll feel comfortable building your own fault-tolerant, scalable, real-time big data systems that act on streaming and batch data, with Spark, Kafka, Cassandra, and HDFS as the backbone of the lambda architecture. Before we begin, you should be familiar with a programming language, preferably Java, Scala, or C#. But you certainly don't have to be a master of any of these, as we'll walk you through a gentle introduction to get you going. I look forward to you joining me on this journey to learn about lambda architectures with the Applying the Lambda Architecture with Spark, Kafka, and Cassandra course, at Pluralsight.
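To give a flavor of the data producer mentioned above, here is a minimal sketch of a Kafka producer in Scala that emits simulated log lines. The broker address, the topic name weblogs, the object name, and the tab-separated message format are illustrative assumptions, not the course's exact code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch extends App {
  // Minimal producer configuration; the broker address is an assumption.
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  val topic = "weblogs" // hypothetical topic name

  // Emit one simulated, tab-separated log line per second.
  for (i <- 1 to 100) {
    val line = s"visitor-${i % 10}\tpage-${i % 5}\t${System.currentTimeMillis()}"
    producer.send(new ProducerRecord[String, String](topic, line))
    Thread.sleep(1000)
  }
  producer.close()
}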

Speed Layer with Spark Streaming
Hi! This is Ahmad Alkilani, and welcome to this module. In this module, our focus will shift to building the speed layer of the lambda architecture using Spark Streaming. Let's have a look at what we're going to cover. Most of this module is dedicated to Spark Streaming, so we'll cover Spark Streaming fundamentals and give you a solid understanding of how streaming works, so that when you're working with a streaming application, you know how to navigate it and relate it to what you've learned earlier with RDDs. We'll also show you how you can combine Spark SQL and DataFrames with your streaming application. And we'll briefly introduce the streaming receiver model and how Spark collects data. This will become critical when we start working with Kafka, especially when we discuss different approaches to receiving data. However, as we're not covering Kafka just yet, we'll modify our log producer to simulate streaming data into files, and then we'll work with those files directly. As we continue to build our streaming application and its aggregations in a streaming fashion, we'll also work on the overall structure of our program, separating the business logic from spin-up and teardown procedures for cleaner code. Finally, we'll look at how we can run and test our application using Zeppelin and what it takes to get a streaming application to work with Apache Zeppelin. In the next clip, we'll start off with how Spark Streaming works and introduce DStreams, so let's get started.
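As a rough sketch of the file-based approach described above, the following Spark Streaming application watches a directory for new files and runs a simple per-batch aggregation. The batch interval, input directory, and tab-separated record layout (with the page in the second field) are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpeedLayerSketch extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("SpeedLayerSketch")
  val ssc = new StreamingContext(conf, Seconds(4)) // 4-second micro-batches (assumed)

  // Each new file dropped into this directory becomes part of the next batch.
  val lines = ssc.textFileStream("file:///tmp/streaming-input")

  // Count hits per page within each batch.
  val pageCounts = lines
    .map(line => (line.split("\t")(1), 1L))
    .reduceByKey(_ + _)
  pageCounts.print()

  ssc.start()
  ssc.awaitTermination()
}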

Advanced Streaming Operations
Hi! My name is Ahmad Alkilani, and welcome to this module on Advanced Streaming Operations with Apache Spark. In this module, we'll introduce more interesting things we can do with Spark Streaming in combination with other libraries. We'll start off with a proper introduction to checkpointing in Spark. Then we'll introduce window operations, where you can use Spark Streaming to run calculations over sliding or tumbling windows. And we'll answer one of the questions we raised in the previous module: how can we use stateful transformations to store and maintain state across streaming batches, and in what different ways can we act on that state and use it? We'll end this module with a look at how we can re-introduce the unique_visitors calculation and compute cardinality in a streaming fashion. The trick is to use a technique that yields a good estimate of uniqueness without straining memory and system resources by keeping every single source record. This module really starts putting what we've learned into practice, so let's get started.
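To make the checkpointing, windowing, and state ideas concrete, here is a hedged sketch. The checkpoint directory, window lengths, function name, and the visitors DStream of (visitorId, 1L) pairs are all illustrative assumptions:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Assumes an existing StreamingContext `ssc` and a DStream[(String, Long)]
// `visitors` carrying (visitorId, 1L) pairs.
def advancedOpsSketch(ssc: StreamingContext, visitors: DStream[(String, Long)]): Unit = {
  // Stateful and windowed operations require a checkpoint directory.
  ssc.checkpoint("file:///tmp/spark-checkpoints")

  // Sliding window: counts over the last 60 seconds, recomputed every 10 seconds.
  val windowedCounts = visitors.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
  windowedCounts.print()

  // Global state maintained across batches: a running total per visitor.
  val runningTotals = visitors.updateStateByKey[Long] { (newValues, state) =>
    Some(state.getOrElse(0L) + newValues.sum)
  }
  runningTotals.print()
}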

Persisting with Cassandra
Hi! My name is Ahmad Alkilani, and welcome to this module on Apache Cassandra. In this module, we'll introduce Apache Cassandra, a distributed database management system with unique performance and availability characteristics. We'll discuss some use cases where Cassandra can be a great fit. And we'll look at Cassandra's data model and the design decisions you're going to want to make while creating your Cassandra tables. We'll also look at how we can use the Spark Cassandra Connector with Spark DataFrames and RDDs to communicate with a Cassandra cluster, as we create Cassandra tables to represent the batch and real-time views for our lambda architecture and modify our batch and streaming jobs, wrapping up this course. So without further ado, let's get started.
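As a rough illustration of one common pattern with the connector's RDD API, here is a minimal sketch. The keyspace lambda, table batch_views, column names, and function name are hypothetical, and the SparkConf is assumed to have spark.cassandra.connection.host set:

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// Assumes a table created ahead of time, e.g.:
//   CREATE TABLE lambda.batch_views (page text PRIMARY KEY, visits bigint);
def saveBatchViewSketch(pageVisits: RDD[(String, Long)]): Unit = {
  // Tuple fields map positionally onto the listed columns.
  pageVisits.saveToCassandra("lambda", "batch_views", SomeColumns("page", "visits"))
}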