vlr training
Azure Data Engineer

 

 

Azure Data Factory is a cloud-based data integration service for creating data-driven workflows that orchestrate and automate data movement and data transformation.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

Azure Data Engineer (Azure Data Factory and Spark) Online Training Course Details

Course Duration: 45 days, Mon-Fri, 8 am to 9 am (IST)

Mode of Training: Online

Azure Data Engineer Online Training Course Curriculum

Basics

  • Introduction to Data Engineering
  • Python, SQL, and Azure Portal access


Introduction to ADF V2

  • Why ADF?
  • Terminology
  • How to create a data factory?

Practice Basics of ADF

  • Practice creating linked services, datasets, global parameters, and a simple pipeline
  • Understand navigation of the pages in the ADF portal

Activities Sessions

  • Move and Transform
  • General
  • Iteration & Conditionals

Scenario-Based Pipeline Building

  • Building pipelines using the above-mentioned set of activities

Debugging

  • Debugging pipeline failures

Triggers in Pipeline

  • Theory and practice

Real-Time Project Showcase

  • Walkthrough of the pipelines from a real-time project

Basics of Big Data

  • What is Big Data?
  • How to process Big Data?
  • What is MapReduce?
  • What is Apache Spark?
  • Differences between MapReduce and Spark
  • How to practice Spark?

Spark Basics

  • What is Apache Spark?
  • Architecture of Apache Spark
  • Spark's language APIs
  • Spark's APIs
  • SparkContext and SQLContext
  • RDD definition
  • First hands-on program in Spark (see the sketch below)
  • SparkSession
  • DataFrames
  • Partitions
  • Lazy evaluation
  • Transformations
  • Actions
  • Spark UI
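
A first PySpark program of the kind covered in this module might look like the minimal sketch below (assuming a local installation, e.g. pip install pyspark; the data and names are illustrative):

    from pyspark.sql import SparkSession

    # SparkSession is the entry point to the DataFrame API
    spark = SparkSession.builder.appName("FirstSparkProgram").getOrCreate()

    # Build a small DataFrame from in-memory data
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

    # Transformations (filter) are lazy; show() is an action that
    # actually triggers execution -- watch the job appear in the Spark UI
    df.filter(df.age > 40).show()

    # Inspect how the data is split across partitions
    print(df.rdd.getNumPartitions())

    spark.stop()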

Spark Structured API

  • DataFrames, Spark SQL, and Datasets
  • PySpark introduction
  • Overview of Structured API execution
  • Understand different file formats for data processing
  • How to practice Spark in Google Colab
  • Practice creating DataFrames for various file formats
  • Selecting columns of a DataFrame in various ways, including selectExpr
  • Applying custom datatypes (a schema) to the data in the DataFrame
  • Filter data in the DataFrame based on a condition
  • Group by and the collect_list() function
  • Creating a new column from an existing column
  • Renaming existing columns
  • Removing columns from a DataFrame
  • Casting datatypes on the data in the DataFrame
  • Extract distinct values from a DataFrame (several of these operations are sketched below)
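
To give a feel for these operations, here is a rough PySpark sketch (runnable in Google Colab after pip install pyspark; employees.csv and its columns are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Read a CSV with an explicit (custom) schema instead of inferring it
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("dept", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])
    df = spark.read.csv("employees.csv", header=True, schema=schema)

    # Selecting columns in different ways, including selectExpr
    df.select("name", "salary").show()
    df.selectExpr("name", "salary * 1.1 AS revised_salary").show()

    # Filter rows on a condition
    df.filter(F.col("salary") > 50000).show()

    # Group by with collect_list()
    df.groupBy("dept").agg(F.collect_list("name").alias("members")).show()

    # New column from an existing one, then rename, cast, and drop
    df = (df.withColumn("bonus", F.col("salary") * 0.1)
            .withColumnRenamed("dept", "department")
            .withColumn("salary", F.col("salary").cast("double"))
            .drop("bonus"))

    # Distinct values of a column
    df.select("department").distinct().show()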

Spark For ML

  • Random sampling and random splits (optional; see the sketch below)
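
A quick sketch of both, assuming an existing SparkSession and a toy DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)  # toy DataFrame with a single `id` column

    # Random sampling: draw roughly 10% of the rows
    sample_df = df.sample(fraction=0.1, seed=42)

    # Random split: the usual way to produce train/test sets for ML
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
    print(train_df.count(), test_df.count())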

Spark Structured API (continued)

  • Concatenating data (rows) in DataFrames
  • Join multiple DataFrames
  • Sort data in DataFrames
  • The limit function in Spark
  • Repartition and coalesce in Spark
  • Collecting rows to the driver node using collect()
  • Cache and persist DataFrames
  • Usage of the lit() function
  • Working with dates and timestamps
  • Handling null values in data
  • Convert a Spark DataFrame into a pandas DataFrame
  • Creating user-defined functions (UDFs) in Spark
  • Write the data of a DataFrame to external storage
  • Create partitions while writing the data to external storage
  • Create buckets on the data written to the data lake
  • Pull data from a database directly using Spark (see the combined sketch below)
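
The sketch below strings several of these operations together; table names, output paths, and JDBC connection details are placeholders, not a definitive setup (the pandas conversion assumes pandas is installed, and the JDBC read assumes the matching driver is on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    emp = spark.createDataFrame(
        [(1, "Alice", None), (2, "Bob", 45)], ["dept_id", "name", "age"])
    dept = spark.createDataFrame(
        [(1, "Sales"), (2, "HR")], ["dept_id", "dept_name"])

    # Concatenate rows (union), join, sort, and limit
    doubled = emp.union(emp)
    joined = emp.join(dept, on="dept_id", how="inner")
    joined.orderBy(F.col("name").desc()).limit(10).show()

    # repartition() does a full shuffle; coalesce() only reduces partitions
    emp_small = emp.repartition(8).coalesce(2)

    # collect() pulls all rows to the driver -- only safe on small results
    rows = joined.collect()

    # lit() wraps a constant value as a column
    emp = emp.withColumn("country", F.lit("India"))

    # Dates/timestamps and null handling
    emp = emp.withColumn("load_date", F.current_date()).fillna({"age": 0})

    # Cache the DataFrame, convert to pandas, and apply a Python UDF
    emp.cache()
    pdf = emp.toPandas()
    upper_udf = F.udf(lambda s: s.upper() if s else None)
    emp.withColumn("name_upper", upper_udf("name")).show()

    # Write with partitioning; bucketing requires saveAsTable()
    emp.write.mode("overwrite").partitionBy("country").parquet("/tmp/emp")
    emp.write.mode("overwrite").bucketBy(4, "dept_id").saveAsTable("emp_bkt")

    # Read directly from a database over JDBC
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://host:5432/db")
               .option("dbtable", "public.employees")
               .option("user", "user").option("password", "secret")
               .load())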

Spark SQL

  • Introduction to Spark SQL
  • Different types of views in Spark SQL
  • Creating Spark tables/views
  • Practice all the above concepts using Spark SQL (see the sketch below)
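
A brief sketch of temp views, global temp views, and tables (names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", "Sales"), ("Bob", "HR")], ["name", "dept"])

    # Temp view: visible only to this SparkSession
    df.createOrReplaceTempView("employees")
    spark.sql("SELECT dept, count(*) AS n FROM employees GROUP BY dept").show()

    # Global temp view: shared across sessions via the global_temp database
    df.createOrReplaceGlobalTempView("employees_g")
    spark.sql("SELECT * FROM global_temp.employees_g").show()

    # Persist as a managed Spark table
    df.write.mode("overwrite").saveAsTable("employees_tbl")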

Databricks

  • Introduction
  • How to create clusters, and the cluster types
  • What is a data lake?
  • How to mount a data lake to Databricks
  • Creating notebooks and running code on Databricks
  • Databricks utilities and magic commands
  • Widgets in Databricks
  • Lakehouse architecture
  • The Delta file format
  • Creating Delta tables
  • Inserting data into Delta tables
  • Time travel with Delta tables (see the sketch below)
  • Uses of the Delta format
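
As a taste of this module, a sketch that would run inside a Databricks notebook (dbutils and the VERSION AS OF syntax are Databricks/Delta-specific; paths and table names are illustrative):

    # Notebook utilities and widgets
    dbutils.fs.ls("/mnt/datalake")          # assumes a mounted data lake
    dbutils.widgets.text("env", "dev")      # widget with a default value
    env = dbutils.widgets.get("env")

    # Create a Delta table and insert data
    df = spark.range(100).withColumnRenamed("id", "order_id")
    df.write.format("delta").mode("overwrite").saveAsTable("orders_delta")
    spark.sql("INSERT INTO orders_delta VALUES (101)")

    # Time travel: query an earlier version of the table
    spark.sql("SELECT * FROM orders_delta VERSION AS OF 0").show()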

Optimisation techniques

  • Broadcast joins
  • Spark shuffle partitions
  • The VACUUM command
  • The OPTIMIZE command
  • Repartition and coalesce on DataFrames
  • Using cache and persist when required (see the sketch below)
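
A compact sketch of these techniques; the OPTIMIZE and VACUUM statements assume a Delta table such as the orders_delta table from the previous module:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()
    large = spark.range(1_000_000).withColumnRenamed("id", "key")
    small = spark.range(100).withColumnRenamed("id", "key")

    # Broadcast join: ship the small table to every executor, avoiding a shuffle
    joined = large.join(broadcast(small), on="key")

    # Tune the number of shuffle partitions (the default is 200)
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # Cache/persist only DataFrames that are reused several times
    joined.cache()

    # Delta maintenance: OPTIMIZE compacts small files,
    # VACUUM removes old files no longer referenced by the table
    spark.sql("OPTIMIZE orders_delta")
    spark.sql("VACUUM orders_delta RETAIN 168 HOURS")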

Improving performance of Spark Jobs

  • Using the multiprocessing package (see the sketch below)
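
One common pattern uses multiprocessing.pool.ThreadPool (threads rather than processes, because a SparkSession cannot be shared across processes) to trigger several independent Spark jobs concurrently on the same cluster; the table names below are hypothetical:

    from multiprocessing.pool import ThreadPool
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    tables = ["orders", "customers", "products"]

    def count_rows(table):
        # Each call launches an independent Spark job on the shared session
        return spark.table(table).count()

    with ThreadPool(3) as pool:
        counts = pool.map(count_rows, tables)
    print(dict(zip(tables, counts)))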

ADF – Databricks Linkage

  • Call Databricks notebooks from Data Factory
  • Pass parameters to Databricks notebooks (notebook side sketched below)
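
On the notebook side, parameters sent from the ADF Databricks Notebook activity (its "base parameters") arrive as widgets; a minimal sketch with illustrative names and paths:

    # Declare the widget with a default, then read the value passed by ADF
    dbutils.widgets.text("load_date", "")
    load_date = dbutils.widgets.get("load_date")

    # Use the parameter in the load (path and table are hypothetical)
    df = spark.read.parquet(f"/mnt/datalake/raw/{load_date}")
    df.write.format("delta").mode("append").saveAsTable("staging_orders")

    # Optionally return a value that ADF can read from the activity output
    dbutils.notebook.exit(str(df.count()))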

End-to-End project

  • A real-time project built using ADF, Databricks, PySpark, and Spark SQL

A quick review

  • Doubts & questions

Azure Data Engineer Online Training demo videos by Srilkanth

Register Now for Azure Data Engineer Online Training