Google: Professional Cloud Data Engineer

Authors: Janani Ravi, Vitthal Srinivasan

This skill path covers all the objectives needed to be a Data Engineer on Google Cloud. You will learn in depth how to use the products and services as well as how to complete the...

Beginner

In this beginning section of the path you'll learn how to use Dataproc, Cloud Composer, Dataflow, and Cloud Pub/Sub for stream processing. You'll be architecting solutions and beginning to build out the pipelines for your data projects. After this section you'll be ready for more intermediate topics like incorporating machine learning models.

Architecting Big Data Solutions Using Google Dataproc

by Janani Ravi

Nov 1, 2018 / 2h 17m

Description

When organizations plan their move to the Google Cloud Platform, Dataproc offers the same features as their on-premises Hadoop clusters, along with additional powerful paradigms such as the separation of compute and storage. Dataproc allows you to lift and shift your Hadoop processing jobs to the cloud and store your data separately in Cloud Storage buckets, effectively eliminating the requirement to keep your clusters always running. In this course, Architecting Big Data Solutions Using Google Dataproc, you'll learn to work with managed Hadoop on the Google Cloud and the best practices to follow for migrating your on-premises jobs to Dataproc clusters. First, you'll delve into creating a Dataproc cluster and configuring firewall rules to enable you to access the cluster manager UI from your local machine. Next, you'll discover how to use the Spark distributed analytics engine on your Dataproc cluster. Then, you'll explore how to write code that integrates your Spark jobs with BigQuery and Cloud Storage buckets using connectors. Finally, you'll learn how to use your Dataproc cluster to perform extract, transform, and load operations using Pig as a scripting language and to work with Hive tables. By the end of this course, you'll have the necessary knowledge to work with Google's managed Hadoop offering and a sound idea of how to migrate the jobs and data on your on-premises Hadoop cluster to the Google Cloud.
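
For a flavor of the Spark work covered here, below is a minimal PySpark word-count sketch of the kind you might submit to a Dataproc cluster; the bucket names are placeholders, and it assumes the cluster's built-in Cloud Storage connector resolves the gs:// paths.

    # Minimal PySpark word count; submit with "gcloud dataproc jobs submit pyspark".
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

    # Dataproc clusters include the Cloud Storage connector, so gs:// paths work directly.
    lines = spark.read.text("gs://my-input-bucket/logs/*.txt")
    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.toDF(["word", "count"]).write.csv("gs://my-output-bucket/wordcounts", mode="overwrite")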

Table of contents
  1. Course Overview 2m
  2. Introducing Google Dataproc for Big Data on the Cloud 42m
  3. Running Hadoop MapReduce Jobs on Google Dataproc 49m
  4. Working with Apache Spark on Google Dataproc 24m
  5. Working with Pig and Hive on Google Dataproc 18m

Architecting Serverless Big Data Solutions Using Google Dataflow

by Janani Ravi

Dec 14, 2018 / 2h 15m

Description

Dataflow allows developers to process and transform data using easy, intuitive APIs. Dataflow is built on the Apache Beam architecture and unifies batch as well as stream processing of data. In this course, Architecting Serverless Big Data Solutions Using Google Dataflow, you will be exposed to the full potential of Cloud Dataflow and its radically innovative programming model. You will start this course off with a basic understanding of how Dataflow provides serverless compute for data pipelines. You'll study the Apache Beam API used to build pipelines and understand what data sources, sinks, and transformations are. You'll study the stages in a Dataflow pipeline and visualize them as a directed acyclic graph. Next, you'll use the Apache Beam APIs to build pipelines for data transformations in both Java and Python and execute these pipelines locally and on the cloud. You'll integrate your pipelines with other GCP services such as BigQuery and see how you can monitor and debug slow pipeline stages. Additionally, you'll study different pipeline architectures, such as branching and pipelines using side inputs. You'll also see how you can apply windowing operations to perform aggregations on your data. Finally, you'll work with Dataflow without writing any code, using the pre-built Dataflow templates that Google offers for common operations. At the end of this course, you should be comfortable using Dataflow pipelines to transform and process your data and integrating your pipelines with other Google services.
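
To make the Beam programming model concrete, here is a minimal word-count pipeline sketch in Python; the bucket paths are placeholders, and passing runner, project, and region options to PipelineOptions is what would send this same pipeline to Dataflow rather than the local runner.

    # A minimal Apache Beam pipeline; runs locally by default (DirectRunner).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # add runner/project/region flags here to run on Dataflow

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
         | "Write" >> beam.io.WriteToText("gs://my-bucket/wordcounts"))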

Table of contents
  1. Course Overview 1m
  2. Introducing Dataflow 49m
  3. Understanding and Using the Apache Beam APIs 39m
  4. Creating and Using PCollections and Side Inputs 28m
  5. Creating Pipelines from Google Templates 16m

Architecting Stream Processing Solutions Using Google Cloud Pub/Sub

by Vitthal Srinivasan

Jan 8, 2019 / 1h 44m

Description

As data warehousing and analytics become more and more integrated into the business models of companies, the need for real-time analytics and data processing has grown. Stream processing has quickly gone from being a nice-to-have to a must-have. In this course, Architecting Stream Processing Solutions Using Google Cloud Pub/Sub, you will gain the ability to ingest and process streaming data on the Google Cloud Platform, including the ability to take snapshots and replay messages. First, you will learn the basics of a publisher-subscriber architecture. Publishers are apps that send out messages; these messages are organized into topics. Topics are associated with subscriptions, and subscribers listen in on subscriptions. Each subscription is a message queue, and messages are held in that queue until at least one subscriber per subscription has acknowledged the message; this is why Pub/Sub is said to be a reliable messaging system. Next, you will discover how to create topics as well as push and pull subscriptions. As their names suggest, push and pull subscriptions differ in who controls the delivery of messages to the subscriber. Finally, you will explore how to leverage advanced features of Pub/Sub, such as creating snapshots and seeking to a specific timestamp, either in the past or in the future. You will also learn the precise semantics of creating snapshots and the implications of turning on the “retain acknowledged messages” option on a subscription. When you’re finished with this course, you will have the skills and knowledge of Google Cloud Pub/Sub needed to effectively and reliably process streaming data on the GCP.
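
To make the publisher side of this architecture concrete, here is a short sketch using the Cloud Pub/Sub Python client library; the project and topic IDs are placeholders, and the topic is assumed to already exist.

    # Publishing a message to a Pub/Sub topic.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder IDs

    # Message data must be bytes; extra keyword arguments become message attributes.
    future = publisher.publish(topic_path, b"order received", origin="checkout-service")
    print(f"Published message ID: {future.result()}")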

Table of contents
  1. Course Overview 2m
  2. Getting Started with Cloud Pub/Sub 29m
  3. Configuring Publishers, Subscribers, and Topics 38m
  4. Using the Cloud Pub/Sub Client Library 33m

Intermediate

In this section you'll begin incorporating more intermediate products and functions in Google Cloud, such as BigQuery and BigQuery ML, Cloud SQL instances, Datastore, and Bigtable. This is the database-heavy portion of the path, and you'll also spend time developing repositories. After this section you'll be ready for the advanced section, which dives deeper into designing machine learning solutions and working with APIs.

Architecting Data Warehousing Solutions Using Google BigQuery

by Janani Ravi

Oct 15, 2018 / 2h 48m

Description

Organizations store massive amounts of data collated from a wide variety of sources. BigQuery supports fast querying at petabyte scale, with serverless functionality and autoscaling. BigQuery also supports streaming data, works with visualization tools, and interacts seamlessly with Python scripts running from Datalab notebooks. In this course, Architecting Data Warehousing Solutions Using Google BigQuery, you'll learn how you can work with BigQuery on huge datasets with little to no administrative overhead related to cluster and node provisioning. First, you'll start off with an overview of the suite of storage products on the Google Cloud and the unique position that BigQuery holds. You'll see how BigQuery compares with Cloud SQL, Bigtable, and Datastore on the GCP and how it differs from Amazon Redshift, the data warehouse on AWS. Next, you'll create datasets in BigQuery, which are the equivalent of databases in RDBMSes, and create tables within datasets, where the actual data is stored. You'll work with BigQuery using the web console as well as the command line. You'll load data into BigQuery tables using the CSV, JSON, and Avro formats and see how you can execute and manage jobs. Finally, you'll wrap up by exploring advanced analytical queries that use nested and repeated fields. You'll run aggregate operations on your data and use advanced windowing functions as well. You'll programmatically access BigQuery using client libraries in Python and visualize your data using Data Studio. At the end of this course, you'll be comfortable working with huge datasets stored in BigQuery, executing analytical queries, performing analysis, and building charts and graphs for your reports.
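
As a small preview of the client-library work in this course, the sketch below runs an aggregate query from Python against one of BigQuery's public datasets; only the project ID is a placeholder.

    # Running an analytical query with the BigQuery Python client library.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    for row in client.query(query).result():
        print(row.name, row.total)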

Table of contents
  1. Course Overview 2m
  2. Understanding BigQuery in the GCP Service Taxonomy 42m
  3. Using Datasets, Tables, and Views in BigQuery 30m
  4. Getting Data in and out of BigQuery 27m
  5. Performing Advanced Analytical Queries in BigQuery 46m
  6. Programmatically Accessing BigQuery from Client Programs 19m

Building Machine Learning Models in SQL Using BigQuery ML

by Janani Ravi

Nov 20, 2018 / 1h 27m

Description

This course demonstrates how to build and train machine learning models for linear and logistic regression using SQL commands on BigQuery, the Google Cloud Platform's serverless data warehouse. In this course, Building Machine Learning Models in SQL Using BigQuery ML, you'll learn how to build and train machine learning models and how to employ those models for prediction - all with just simple SQL commands on data stored in BigQuery. First, you'll understand the different choices available on the GCP for building and training models and see how you can make the right choice between these services for your specific use case. Then, you'll work with some real-world datasets stored in BigQuery to build linear regression and binary classification models. Because BigQuery allows you to specify training parameters to build and train your model in SQL, machine learning is made accessible even to those who are not familiar with high-level programming languages. Last, you'll study how to analyze the models you build using evaluation and feature inspection functions in BigQuery, and run BigQuery commands on Cloud Datalab using a Jupyter notebook that is hosted on the GCP and closely integrated with all of GCP's services. By the end of this course, you'll have a good understanding of how you can use BigQuery ML to extract insights from your data by applying linear and logistic regression models.
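
Here is a minimal sketch of the SQL-only workflow this course teaches, issued through the BigQuery Python client; the dataset, table, and column names are hypothetical.

    # Training and evaluating a logistic regression model with BigQuery ML.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # CREATE MODEL trains entirely in SQL; the column aliased AS label is the target.
    client.query("""
        CREATE OR REPLACE MODEL `my_dataset.churn_model`
        OPTIONS (model_type = 'logistic_reg') AS
        SELECT tenure, monthly_charges, churned AS label
        FROM `my_dataset.customers`
    """).result()

    # ML.EVALUATE reports metrics such as precision, recall, and ROC AUC.
    evaluation = client.query(
        "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.churn_model`)").result()
    for row in evaluation:
        print(dict(row))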

Table of contents
  1. Course Overview 2m
  2. Introducing Google BigQuery ML 24m
  3. Building Regression and Classification Models 39m
  4. Analyzing Models Using Evaluation and Feature Inspection Functions 21m

Creating and Administering Google Cloud SQL Instances

by Janani Ravi

Sep 24, 2018 / 2h 29m

Description

An important component of an organization's on-premises solution is the relational database. Cloud SQL is an RDBMS offering on the GCP which makes the operational and administrative aspects of databases very easy to handle. In this course, Creating and Administering Google Cloud SQL Instances, you will learn how to create, work with, and manage Cloud SQL instances on the GCP. First, you will assess the range of data storage services on the GCP and understand when you would choose to use Cloud SQL over other technologies. Then, you will create Cloud SQL instances, connect to them using a simple MySQL client, and configure and administer these instances using the web console as well as the gcloud command line utility. Next, you will focus on how Cloud SQL can work in high-availability mode. After that, you will configure failover replicas for high availability and simulate an outage event to see how the failover replica kicks in. Finally, you will see how to use read replicas for increased read throughput and how data can be migrated into Cloud SQL instances using a SQL dump or from CSV files. At the end of this course, you will be comfortable creating, connecting to, and administering Cloud SQL instances to manage relational databases on the Google Cloud Platform.
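
Once an instance is up, connecting looks like ordinary MySQL. The sketch below uses the PyMySQL driver with a placeholder IP and hypothetical table names, and assumes your client's address has been authorized for the instance (or that the Cloud SQL proxy is listening locally).

    # Querying a Cloud SQL (MySQL) instance like any other MySQL server.
    import pymysql

    connection = pymysql.connect(
        host="203.0.113.10",       # instance public IP, or 127.0.0.1 via the Cloud SQL proxy
        user="root",
        password="your-password",  # placeholder credentials
        database="inventory",
    )
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, name FROM products LIMIT 5")
        for row in cursor.fetchall():
            print(row)
    connection.close()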

Table of contents
  1. Course Overview 1m
  2. Understanding Cloud SQL in the GCP Service Taxonomy 40m
  3. Creating Cloud SQL Instances 59m
  4. Replication and Data Management 46m

Architecting Big Data Solutions Using Google Bigtable

by Janani Ravi

Dec 4, 2018 / 2h 2m

Description

Bigtable is Google's proprietary storage service that offers extremely fast read and write speeds. It uses a sophisticated internal architecture which learns access patterns and moves your data around to mitigate the issue of hot-spotting. In this course, Architecting Big Data Solutions Using Google Bigtable, you'll learn both the conceptual and practical aspects of working with Bigtable. You'll learn how best to design your schema to enable fast read and write speeds and discover how data in Bigtable can be accessed using the command line as well as client libraries. First, you'll study the internal architecture of Bigtable and how data is stored within it using the four-dimensional data model. You'll also discover how Bigtable clusters, nodes, and instances work and how Bigtable works behind the scenes with Colossus, Google's proprietary storage system. Next, you'll access Bigtable using both the HBase shell and cbt, Google's command line utility. Later, you'll create and manage tables while practicing exporting and importing data using sequence files. Finally, you'll study how manual failovers can be handled when single-cluster routing is enabled. At the end of this course, you'll be comfortable working with Bigtable using both the command line and client libraries.
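
As a glimpse of the client-library access covered here, this sketch writes and reads a single row with the Cloud Bigtable Python client; the project, instance, table, and column-family names are placeholders.

    # Writing and reading one row with the Cloud Bigtable Python client.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-readings")

    # Prefixing the row key with a sensor ID (rather than a bare timestamp)
    # spreads writes across nodes and helps avoid hot-spotting.
    row = table.direct_row("sensor-42#20190108T1200")
    row.set_cell("readings", b"temp_c", b"21.5")
    row.commit()

    result = table.read_row("sensor-42#20190108T1200")
    print(result.cells["readings"][b"temp_c"][0].value)  # b'21.5'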

Table of contents
  1. Course Overview 2m
  2. Introducing Cloud Bigtable 57m
  3. Interacting with Cloud Bigtable Using cbt and the HBase API 36m
  4. Managing Cloud Bigtable Instances, Clusters, and Nodes 26m

Advanced

In this section you'll cover machine-learning-heavy topics such as working with AutoML, ML Engine, and designing data architectures that are specific to Google Cloud. After this section you will have learned the critical functions and services you need on Google Cloud to work on the job as a Data Engineer.

Designing and Implementing Solutions Using Google Machine Learning APIs

by Janani Ravi

Oct 19, 2018 / 1h 37m

Description

The Google Cloud Platform makes a wide range of machine learning (ML) services available as a part of Google Cloud AI. The Google Cloud Machine Learning APIs are the most accessible and lightweight of these services, making powerful ML models available to even novice programmers through simple, intuitive APIs. In this course, Designing and Implementing Solutions Using Google Machine Learning APIs, you'll learn how you can use and work with the Google Machine Learning APIs, which expose powerful models pre-trained on Google's datasets. First, you'll delve into an overview of the machine learning services suite available on the Google Cloud and understand the features of each so you can make the right choice about which service makes sense for your use case. Next, you'll discover the speech-based APIs, which allow you to convert speech to text and text to speech, with additional emphasis support using SSML, and see how you can call these REST APIs using simple Python libraries. Then, you'll learn about the Natural Language APIs and see how they can be used for sentiment analysis and for language translation. Finally, you'll explore the Vision and Video Intelligence APIs in order to perform face and label detection on images. By the end of this course, you'll have the necessary knowledge to choose the right ML API for your use case and to use multiple APIs together to build more complex features for your product.
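
To show how lightweight these APIs are in practice, here is a label-detection sketch using the Cloud Vision client library for Python; the image file name is a placeholder.

    # Label detection with the Cloud Vision API client library.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open("photo.jpg", "rb") as image_file:
        image = vision.Image(content=image_file.read())

    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, round(label.score, 2))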

Table of contents
  1. Course Overview 2m
  2. Introducing the Google Cloud ML APIs 22m
  3. Working with Speech and Text Using the Cloud ML APIs 31m
  4. Working with Language Using the Cloud ML APIs 18m
  5. Working with Images and Videos Using the Cloud ML APIs 22m

Designing and Implementing Solutions Using Google Cloud AutoML

by Janani Ravi

Oct 12, 2018 / 1h 41m

Description

Most organizations want to harness the power of machine learning in order to improve their products, but they may not always have the expertise available in-house. In this course, Designing and Implementing Solutions Using Google Cloud AutoML, you'll learn how you can train custom machine learning models on your dataset with just a few clicks on the UI or a few commands in a terminal window. This course will also show how engineers and analysts can harness the power of ML for common use cases by using AutoML to build their own model, trained on their own data, without needing any specific machine learning expertise. First, you'll see an overview of the suite of machine learning services available on the Google Cloud and understand the features of each so you can make the right choice of service for your use case. You'll learn about the basic concepts underlying AutoML, which uses neural architecture search and transfer learning to find the best neural network for your custom use case. Next, you'll explore AutoML's translation model and feed in sentence pairs in the TMX format to perform German-English translation. You'll use your custom model for prediction from the UI, from the command line, and by using the Python APIs. You'll also learn the significance of the BLEU score in analyzing the quality of your translation model. Finally, you'll use the natural language APIs that AutoML offers to build a model for sentiment analysis of reviews and work with AutoML for image classification using the AutoML Vision APIs. You'll finish up by learning the basic requirements of the data needed to train this model and develop a classifier that can identify fruits. At the end of this course, you will be very comfortable choosing the right ML API for your use case and using AutoML to build complex neural networks trained on your own dataset for common problems.
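
Once a custom model is trained and deployed, prediction is a short client call. The sketch below targets a hypothetical AutoML Translation model; the project, region, and model IDs are placeholders.

    # Requesting a prediction from a deployed AutoML Translation model.
    from google.cloud import automl

    client = automl.PredictionServiceClient()
    model_path = client.model_path("my-project", "us-central1", "TRL0000000000")

    payload = {"text_snippet": {"content": "Das ist ein Test.", "mime_type": "text/plain"}}
    response = client.predict(name=model_path, payload=payload)
    print(response.payload[0].translation.translated_content.content)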

Table of contents
  1. Course Overview 2m
  2. Introducing Google Cloud AutoML 20m
  3. Performing Custom Translation Using AutoML Translation 38m
  4. Working with Language Using AutoML Natural Language 23m
  5. Working with Images Using AutoML Vision 16m

What you will learn

  • Dataproc
  • Dataflow and Apache Beam
  • GCP Pub/Sub
  • BigQuery
  • GCP Cloud SQL
  • GCP Cloud Spanner
  • Cloud Datastore
  • Firestore
  • Bigtable
  • Datalab
  • ML Engine
  • Machine Learning APIs
  • Data Architecture on GCP

Prerequisites

Learners should be familiar with cloud computing and the Google Cloud Platform. It is also assumed that learners are already data and ML professionals who are learning to complete their projects on the Google Cloud Platform.