Course DP-203

 

My Samples

https://tinyurl.com/cbmctsamples

 

General Delivery Hints

Only some of the review questions are in the slide deck. Use the learn.microsoft.com content to review each module.

 

General Lab Hints

Use the T symbol to send text into Notepad, then copy and paste from there into the various tools.

The Skillable labs do not have checkboxes. Take careful note of where you are up to.

 

Slide Deck 01: Get started with data engineering on Azure

https://email.lakefs.io/axoletter-jun23

 

Module: Introduction to data engineering on Azure

http://craigb-mct.blogspot.com/2020/10/big-data-architecture-with-azure.html

 

Module: Introduction to Azure Data Lake Storage Gen2

The speaker in the Channel 9 video introducing Azure Data Lake Storage said that this didn't emulate a Hadoop file system; it was a HDFS.

 

Module: Introduction to Azure Synapse Analytics

Note that the Serverless SQL pool in Azure Synapse Analytics supports only external and temporary tables.

Lab 01: Explore Synapse

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes).

 

Slide Deck 02: Build data analytics solutions using Azure Synapse Analytics serverless SQL pools

 

Module: Use Azure Synapse serverless SQL pool to query files in a data lake

 

Lab 02: Serverless SQL (Demo)

Delivery Hint: I've loaded these data files and queries into honestcraigs.com.

 

Module: Use Azure Synapse serverless SQL pools to transform data in a data lake

 

Lab 03: Transform Data with SQL

Note that naming a user-defined procedure with a "sp_" prefix is extremely bad practice. Creating an object without specifying a schema name is just bad practice.

 

Module: Create a lake database in Azure Synapse Analytics

 

Lab 04: Lake Database (Demo)

 

Slide Deck 03: Perform data engineering with Azure Synapse Apache Spark Pools

 

Links


Notes

Splitting statements across lines

Generally, you should use parens.

https://stackoverflow.com/questions/53162/how-can-i-do-a-line-break-line-continuation-in-python

Execution flow

There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function than can accept parameters and can return values).

https://stackoverflow.com/questions/57577106/execution-of-cmd-cells-in-databricks-notebook-based-on-some-condition

One execution flow tool is Databricks Workflows.

Dataframe

So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.

https://www.databricks.com/glossary/what-are-dataframes

https://www.databricks.com/glossary/what-is-rdd

 

Module: Analyze data with Apache Spark in Azure Synapse Analytics

 

Lab 05: Synapse Spark (Demo)

 

Module: Transform data with Spark in Azure Synapse Analytics

 

Lab 06: Transform Spark

Delivery Hint: Would work well as an instructor-led lab.

 

Module: Use Delta Lake in Azure Synapse Analytics

 

Lab 07: Spark Delta Lake

 

Slide Deck 04: Work with data warehouses using Azure Synapse Analytics

 

Module: Analyze data in a relational data warehouse

Integrity Constraints

Data warehouses still have keys, but they may not be implemented at the database layer, especially in massively parallel situations.

Why do Data warehouses don't have a Primary Key check that is enforced? (r/dataengineering)

Replication

Relevant to review question 3.

Consider using a hash-distributed table when:

  • The table size on disk is more than 2 GB.
  • The table has frequent insert, update, and delete operations.

Consider using the round-robin distribution for your table in the following scenarios:

  • When getting started as a simple starting point since it is the default
  • If there is no obvious joining key
  • If there is no good candidate column for hash distributing the table
  • If the table does not share a common join key with other tables
  • If the join is less significant than other joins in the query
  • When the table is a temporary staging table

From https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#what-is-a-distributed-table.

Backup


Lab 08: Data Warehouse (Demo)

 

Module: Load data into a relational data warehouse

Slowly Changing Dimension

https://en.wikipedia.org/wiki/Slowly_changing_dimension includes types 4, 5 and 6, as well as combinational approaches.

 

Lab 09: Load Data Warehouse

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~30 minutes). Note that the lab will probably take less time to do than to set up.

 

Slide Deck 05: Transfer and transform data with Azure Synapse Analytics Pipelines

Notes

https://stackoverflow.com/questions/76200692/possible-to-create-a-pipeline-run-of-synapse-from-power-automate 

 

Module: Build a data pipeline in Azure Synapse Analytics

Delivery Hint: Show the View JSON button.

 

Lab 10: Synapse Pipeline

Delivery Hint: Instructor-led lab?

 

Module: Use Spark Notebooks in an Azure Synapse Pipeline

 

Lab 11: Pipeline Notebook

 

Slide Deck 06: Work with hybrid transactional and analytical processing (HTAP) Solutions using Azure Synapse Analytics

 

Module: Plan hybrid transactional and analytical processing using Azure Synapse Analytics

Cosmos DB

For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.

Dataverse

What is Microsoft Dataverse?

 

Module: Implement Azure Synapse Link with Azure Cosmos DB

Links

https://learn.microsoft.com/en-us/azure/cosmos-db/secure-access-to-data?tabs=using-primary-key#primary-keys


Lab 14: Synapse Link (Cosmos DB)

 

Module: Implement Azure Synapse Link for SQL

This is supported by the Enterprise, Standard and Web editions of SQL Server 2022.

https://learn.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-2022?view=sql-server-ver16#ACS

 

Lab 15: Synapse Link (SQL) (Demo)

 

Slide Deck 07: Implement a data streaming solution with Azure Stream Analytics

Module: Get started with Azure Stream Analytics

Windowing Functions

https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

14. Data Engineering on MS Azure DP 203 - Stream Analytics windowing functions - YouTube

What are windowing functions in stream analytics - YouTube

 

Lab 17: Stream Analytics (Demo)

 

Module: Ingest streaming data using Azure Stream Analytics and Azure Synapse Analytics

 

Lab 18: Stream Analytics and Synapse

In the Create an output for the SQL table section, replace "Table: FactOrder" with "Table: dbo.FactOrder". *sigh*

 

Module: Visualize real-time data with Azure Stream Analytics and Power BI

 

Lab 19: Stream Analytics and Power BI (Demo)

 

Slide Deck 08: Govern data across an enterprise

A nice example of lack of consistency. The title of the Purview portal is "Microsoft Purview ► purviewzzzzzzz"; the title of the Synapse portal is "Microsoft Azure | Synapse Analytics ► synapsezzzzzzz".

Also, don't confude the Microsoft Purview compliance portal (compliance.microsoft.com) with the Microsoft Purview governance portal (web.purview.azure.com).

 

Module: Introduction to Microsoft Purview

 

Module: Integrate Microsoft Purview and Azure Synapse Analytics

 

Lab 22: Synapse and Purview

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes). Maybe schedule a break during this lab - the scan can take 10+ minutes.

 

Slide Deck 09: Data Engineering with Azure Databricks

Module: Explore Azure Databricks

Databricks extension for Visual Studio Code
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/tutorial

Run a Databricks notebook from another notebook

Lab 23: Explore Databricks (Demo)

 

Module: Use Apache Spark in Azure Databricks

 

Lab 24:Databricks Spark

 

Lab 25: Databricks Delta Lake (Optional Demo, not in slide deck)

 

Lab 26: Databricks SQL (Optional Demo, not in slide deck)

 

Module: Run Azure Databricks Notebooks with Azure Data Factory

 

Lab 27:Databricks ADF (Demo)

 

More Information

Data engineering with Azure Databricks - Training | Microsoft Learn

(The link is on the last slide of the M09 deck.)