Course DP-203

 

My Samples

https://tinyurl.com/cbmctsamples

 

General Delivery Hints

Only some of the review questions are in the slide deck. Use the learn.microsoft.com content to review each module.

 

General Lab Hints

Use the T symbol to send text into Notepad, then copy and paste from there into the various tools.

The Skillable labs do not have checkboxes. Take careful note of where you are up to.

Cloud Shell setup

 

1. Learning Path 1 (Slide Deck 1): Get started with data engineering on Azure

 

Module: Introduction to data engineering on Azure

http://craigb-mct.blogspot.com/2020/10/big-data-architecture-with-azure.html

What is the medallion lakehouse architecture?

https://email.lakefs.io/axoletter-jun23

 

Module: Introduction to Azure Data Lake Storage Gen2

The speaker in the Channel 9 video introducing Azure Data Lake Storage said that it didn't emulate a Hadoop file system; it was an HDFS.

 

Module: Introduction to Azure Synapse Analytics

Note that the Serverless SQL pool in Azure Synapse Analytics supports only external and temporary tables.

 

Lab 01: Explore Synapse

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes).

 

2. Learning Path 2 (Slide Deck 2): Build data analytics solutions using Azure Synapse Analytics serverless SQL pools

 

Links

https://www.red-gate.com/simple-talk/blogs/synapse-serverless-and-dedicated-pool-the-differences-no-one-told-you-about/#:~:text=The%20basic%20differences%20between%20Synapse,we%20choose%20and%20a%20constant

 

Module: Use Azure Synapse serverless SQL pool to query files in a data lake

 

Lab 02: Serverless SQL (demo)

Delivery Hint: I've loaded these data files and queries into honestcraigs.com.

 

Module: Use Azure Synapse serverless SQL pools to transform data in a data lake

 

Lab 03: Transform Data with SQL

Note that naming a user-defined procedure with an "sp_" prefix is extremely bad practice: the prefix is reserved for system procedures, so SQL Server checks the master database first when resolving the name. Creating an object without specifying a schema name is just bad practice.

 

Module: Create a lake database in Azure Synapse Analytics

Note that Lake Databases do not support Delta format files.
https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/concepts-lake-database#data-storage

 

Lab 04: Lake Database (demo)

 

Module: Secure data and manage users in Azure Synapse serverless SQL pools (not in slide deck)

 

3. Learning Path 3 (Slide Deck 3): Perform data engineering with Azure Synapse Apache Spark Pools

 

Notes

Splitting statements across lines

Generally, you should use parentheses (implicit line continuation) rather than backslashes.

https://stackoverflow.com/questions/53162/how-can-i-do-a-line-break-line-continuation-in-python
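A minimal illustration of the parenthesized style (the values and names here are invented):

```python
# Implicit line continuation: inside (), [] or {} Python lets a
# statement span multiple lines without backslashes.
total_revenue = (
    1_200.50
    + 890.25
    + 310.00
)

# The same idea applied to a chained-call style, as commonly seen in
# Spark notebook code (the whole method chain sits inside parentheses):
cleaned = (
    "  Widget,  Gadget , Gizmo  "
    .strip()
    .replace(" ", "")
    .split(",")
)

print(total_revenue)  # 2400.75
print(cleaned)        # ['Widget', 'Gadget', 'Gizmo']
```

The chained style is worth showing to students because it is exactly how multi-step DataFrame transformations are usually laid out.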

Execution flow

There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function that can accept parameters and return values).

https://stackoverflow.com/questions/57577106/execution-of-cmd-cells-in-databricks-notebook-based-on-some-condition

One execution flow tool is Databricks Workflows.

Dataframe

So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.

https://www.databricks.com/glossary/what-are-dataframes

https://www.databricks.com/glossary/what-is-rdd
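The 2-D table idea can be shown without Spark at all. A toy, pure-Python sketch (not a real DataFrame API - Spark and pandas add schemas, lazy evaluation and distribution on top of the same table shape; all names and data are invented):

```python
# Toy column-oriented "dataframe": a dict mapping column names to
# equal-length lists. Each index across the lists is one row.
df = {
    "product": ["Widget", "Gadget", "Gizmo"],
    "price":   [19.99, 34.50, 7.25],
}

def filter_rows(table, predicate):
    """Keep only the rows for which predicate(row_dict) is True."""
    n = len(next(iter(table.values())))                       # row count
    rows = [{col: table[col][i] for col in table} for i in range(n)]
    kept = [row for row in rows if predicate(row)]
    return {col: [row[col] for row in kept] for col in table}

cheap = filter_rows(df, lambda r: r["price"] < 20)
print(cheap["product"])  # ['Widget', 'Gizmo']
```

The point for students: a DataFrame is just this structure plus an engine that can run the `filter_rows`-style operations in parallel across a cluster.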

 

Module: Analyze data with Apache Spark in Azure Synapse Analytics

 

Lab 05: Synapse Spark (demo)

 

Module: Transform data with Spark in Azure Synapse Analytics

 

Lab 06: Transform Spark

Delivery Hint: Would work well as an instructor-led lab.

 

Module: Use Delta Lake in Azure Synapse Analytics

 

Lab 07: Spark Delta Lake

 

4. Learning Path 6 (Slide Deck 4): Work with data warehouses using Azure Synapse Analytics

 

Module: Analyze data in a relational data warehouse

Integrity Constraints

Data warehouses still have keys, but they may not be enforced at the database layer, especially in massively parallel processing (MPP) systems.

"Why do Data warehouses don't have a Primary Key check that is enforced?" (r/dataengineering)
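Because the engine won't reject duplicates for you, the load process typically checks key uniqueness itself (in SQL: GROUP BY the key, HAVING COUNT(*) > 1). A plain-Python sketch of that check, with invented rows:

```python
from collections import Counter

# Rows staged for a dimension load; CustomerKey is the intended key.
staged = [
    {"CustomerKey": 1, "Name": "Contoso"},
    {"CustomerKey": 2, "Name": "Fabrikam"},
    {"CustomerKey": 1, "Name": "Contoso (duplicate)"},
]

# Count occurrences of each key and report any that appear more than once.
counts = Counter(row["CustomerKey"] for row in staged)
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)  # [1]
```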

Replication

Relevant to review question 3.

Consider using a hash-distributed table when:

  • The table size on disk is more than 2 GB.
  • The table has frequent insert, update, and delete operations.

Consider using the round-robin distribution for your table in the following scenarios:

  • When getting started as a simple starting point since it is the default
  • If there is no obvious joining key
  • If there is no good candidate column for hash distributing the table
  • If the table does not share a common join key with other tables
  • If the join is less significant than other joins in the query
  • When the table is a temporary staging table

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#what-is-a-distributed-table
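The two strategies above can be sketched in a few lines of Python (60 distributions, as in a dedicated SQL pool; Synapse uses its own internal hash function, md5 here is just a stand-in, and all keys are invented):

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool always has 60 distributions

def hash_distribution(key) -> int:
    """Hash distribution: rows with the same key always land on the same
    distribution, so joins/aggregations on that key avoid data movement."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % DISTRIBUTIONS

def round_robin(row_number: int) -> int:
    """Round-robin: rows are spread evenly regardless of content -
    no skew, but no join locality either."""
    return row_number % DISTRIBUTIONS

# Same key -> same distribution under hashing:
print(hash_distribution("cust-42") == hash_distribution("cust-42"))  # True

# Round-robin spreads 120 rows perfectly evenly (2 per distribution):
spread = Counter(round_robin(i) for i in range(120))
print(set(spread.values()))  # {2}
```

This also makes the skew risk obvious: if one hash key dominates the data, one of the 60 distributions ends up doing most of the work.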

Backup & Restore

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/backup-and-restore

 

Lab 08: Data Warehouse (demo)

 

Module: Load data into a relational data warehouse

Slowly Changing Dimension

https://en.wikipedia.org/wiki/Slowly_changing_dimension includes types 4, 5 and 6, as well as combinational approaches.
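A Type 2 change (the variant the labs focus on) expires the current row and inserts a new version, so history is preserved. A minimal sketch with invented columns (real implementations would also use a surrogate key distinct from the business key):

```python
from datetime import date

# Dimension table as a list of dicts; IsCurrent flags the active version.
dim_customer = [
    {"CustomerKey": 1, "City": "Perth", "StartDate": date(2020, 1, 1),
     "EndDate": None, "IsCurrent": True},
]

def scd2_update(dim, customer_key, new_city, effective):
    """SCD Type 2: close the current row, then append a new version."""
    for row in dim:
        if row["CustomerKey"] == customer_key and row["IsCurrent"]:
            row["EndDate"] = effective
            row["IsCurrent"] = False
    dim.append({"CustomerKey": customer_key, "City": new_city,
                "StartDate": effective, "EndDate": None, "IsCurrent": True})

scd2_update(dim_customer, 1, "Sydney", date(2024, 6, 1))
print(len(dim_customer))                                    # 2 rows now
print([r["City"] for r in dim_customer if r["IsCurrent"]])  # ['Sydney']
```

Contrast with Type 1, which would simply overwrite "Perth" with "Sydney" and lose the history.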

 

Lab 09: Load Data Warehouse

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~30 minutes). Note that the lab will probably take less time to do than to set up.

 

Module: Manage and monitor data warehouse activities in Azure Synapse Analytics (not in slide deck)

 

Module: Secure a data warehouse in Azure Synapse Analytics (not in slide deck)

 

5. Learning Path 4 (Slide Deck 5): Transfer and transform data with Azure Synapse Analytics Pipelines

 

Notes

https://stackoverflow.com/questions/76200692/possible-to-create-a-pipeline-run-of-synapse-from-power-automate

 

Module: Build a data pipeline in Azure Synapse Analytics

Delivery Hint: Show the View JSON button.

 

Lab 10: Synapse Pipeline

Delivery Hint: Instructor-led lab?

 

Module: Use Spark Notebooks in an Azure Synapse Pipeline

 

Lab 11: Pipeline Notebook

 

6. Learning Path 7 (Slide Deck 6): Work with hybrid transactional and analytical processing (HTAP) Solutions using Azure Synapse Analytics

 

Module: Plan hybrid transactional and analytical processing using Azure Synapse Analytics

Cosmos DB

For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.

Dataverse

What is Microsoft Dataverse?

 

Module: Implement Azure Synapse Link with Azure Cosmos DB

https://learn.microsoft.com/en-us/azure/cosmos-db/secure-access-to-data?tabs=using-primary-key#primary-keys

 

Lab 14: Synapse Link (Cosmos DB)

 

Module: Implement Azure Synapse Link for SQL

This is supported by the Enterprise, Standard and Web editions of SQL Server 2022.

https://learn.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-2022?view=sql-server-ver16#ACS

 

Lab 15: Synapse Link (SQL) (demo)

Several of the columns have user-defined types. These will not be replicated, and will generate a warning, "Table contains unsupported columns…".

SELECT * FROM sys.types WHERE is_user_defined = 1;

 

7. Learning Path 8 (Slide Deck 7): Implement a data streaming solution with Azure Stream Analytics

 

Module: Get started with Azure Stream Analytics

Windowing Functions

https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

14. Data Engineering on MS Azure DP 203 - Stream Analytics windowing functions - YouTube

What are windowing functions in stream analytics - YouTube
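The tumbling window (fixed-size, non-overlapping, every event in exactly one window) can be simulated in a few lines of plain Python; the events are invented:

```python
from collections import defaultdict

# (timestamp_seconds, value) events, e.g. sensor readings
events = [(1, 10), (4, 20), (7, 5), (11, 8), (14, 2), (21, 9)]

WINDOW = 10  # seconds - like TumblingWindow(second, 10) in Stream Analytics

# Integer division assigns each event to exactly one window:
# [0,10), [10,20), [20,30), ...
counts = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[window_start] += 1

print(dict(counts))  # {0: 3, 10: 2, 20: 1}
```

A useful classroom contrast: in a hopping window the windows overlap, so one event can be counted in several windows; in a sliding window, output is produced only when the content of the window actually changes.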

 

Lab 17: Stream Analytics (demo)

 

Module: Ingest streaming data using Azure Stream Analytics and Azure Synapse Analytics

 

Lab 18: Stream Analytics and Synapse

Create an output for the SQL table

In practice you wouldn't use a sysadmin account here. Right…?

Run the streaming job to aggregate order data

To emphasize the ingestion of the streaming data, I suggest adding an ORDER BY clause to the last query.

SELECT TOP (100) *
FROM OPENROWSET(
    BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/2024/**',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result]
ORDER BY StartTime DESC;

 

Module: Visualize real-time data with Azure Stream Analytics and Power BI

 

Lab 19: Stream Analytics and Power BI (demo)

 

8. Learning Path 9 (Slide Deck 9): Implement a data lakehouse analytics solution with Azure Databricks

 

FKA: Data Engineering with Azure Databricks

 

Module: Explore Azure Databricks

Databricks extension for Visual Studio Code
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/tutorial

Run a Databricks notebook from another notebook

Lab 23: Explore Databricks (demo)

 

Module: Perform data analysis with Azure Databricks (not in slide deck)

 

Module: Use Apache Spark in Azure Databricks

 

Lab 24: Databricks Spark

 

Module: Manage data with Delta Lake (not in slide deck)

 

Lab 25: Databricks Delta Lake (optional demo, not in slide deck)

 

Module: Build data pipelines with Delta Live Tables (not in slide deck)

 

Module: Deploy workloads with Azure Databricks Workflows (not in slide deck)

 

Lab 26: Databricks SQL (optional demo, not in slide deck)

20 Dec 2024: the instructions in GitHub were updated two weeks ago; Skillable still has an old version.

The instructions in the lab are for creating a legacy dashboard. These go out of service on April 7, 2025.

AI/BI dashboards (FKA Lakeview dashboards) can't be created from saved queries, only from database objects. To load the products data, use the Catalog option in the Data tab to load the products table (Legacy > hive_metastore > adventureworks > products).

https://docs.databricks.com/en/sql/get-started/visualize-data-tutorial.html

Note that CREATE DATABASE is an alias for CREATE SCHEMA, and that SCHEMA is preferred.
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-database.html

 

Module: Run Azure Databricks Notebooks with Azure Data Factory (in slide deck but not in learn.microsoft.com)

 

Lab 27: Databricks ADF (demo)

 

More Information

Data engineering with Azure Databricks - Training | Microsoft Learn

(The link is on the last slide of the M09 deck.)

 

?. Learning Path 5 (doesn't have a slide deck!): Implement a Data Analytics Solution with Azure Synapse Analytics

 

Instructor note: I don't understand the point of this learning path! It's the "Intro to Synapse" LP from another course, plonked into the middle of DP-203. Why? If it is intended to be the intro module, then why isn't it at the top?

 

Module: Introduction to Azure Synapse Analytics

 

Module: Use Azure Synapse serverless SQL pool to query files in a data lake

 

Module: Analyze data with Apache Spark in Azure Synapse Analytics

 

Module: Use Delta Lake in Azure Synapse Analytics

 

Module: Analyze data in a relational data warehouse

 

Module: Build a data pipeline in Azure Synapse Analytics

 

?. (Slide Deck 8): Govern data across an enterprise (has been removed from learn.microsoft.com?)

A nice example of lack of consistency. The title of the Purview portal is "Microsoft Purview ► purviewzzzzzzz"; the title of the Synapse portal is "Microsoft Azure | Synapse Analytics ► synapsezzzzzzz".

Also, don't confuse the Microsoft Purview compliance portal (compliance.microsoft.com) with the Microsoft Purview governance portal (web.purview.azure.com). Or the new Purview portal (purview.microsoft.com). :-)

 

Module: Introduction to Microsoft Purview

 

Module: Integrate Microsoft Purview and Azure Synapse Analytics

 

Lab 22: Synapse and Purview

Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes). Maybe schedule a break during this lab - the scan can take 10+ minutes.