My Samples
https://tinyurl.com/cbmctsamples
General Delivery Hints
Only some of the review questions are in the slide deck. Use the learn.microsoft.com content to review each module.
General Lab Hints
Use the T symbol to send text into Notepad, then copy and paste from there into the various tools.
The Skillable labs do not have checkboxes. Take careful note of where you are up to.
Cloud Shell setup
1. Learning Path 1 (Slide Deck 1): Get started with data engineering on Azure
Module: Introduction to data engineering on Azure
http://craigb-mct.blogspot.com/2020/10/big-data-architecture-with-azure.html
What is the medallion lakehouse architecture?
https://email.lakefs.io/axoletter-jun23
Module: Introduction to Azure Data Lake Storage Gen2
The speaker in the Channel 9 video introducing Azure Data Lake Storage said that this didn't emulate a Hadoop file system; it was a HDFS.
Module: Introduction to Azure Synapse Analytics
Note that the Serverless SQL pool in Azure Synapse Analytics supports only external and temporary tables.
Lab 01: Explore Synapse
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes).
2. Learning Path 2 (Slide Deck 2): Build data analytics solutions using Azure Synapse Analytics serverless SQL pools
Links
Module: Use Azure Synapse serverless SQL pool to query files in a data lake
Lab 02: Serverless SQL (demo)
Delivery Hint: I've loaded these data files and queries into honestcraigs.com.
Module: Use Azure Synapse serverless SQL pools to transform data in a data lake
Lab 03: Transform Data with SQL
Note that naming a user-defined procedure with a "sp_" prefix is extremely bad practice. Creating an object without specifying a schema name is just bad practice.
Module: Create a lake database in Azure Synapse Analytics
Note that Lake Databases do not support Delta format files.
https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/concepts-lake-database#data-storage
Lab 04: Lake Database (demo)
Module: Secure data and manage users in Azure Synapse serverless SQL pools (not in slide deck)
3. Learning Path 3 (Slide Deck 3): Perform data engineering with Azure Synapse Apache Spark Pools
Notes
Splitting statements across lines
Generally, you should use parens.
https://stackoverflow.com/questions/53162/how-can-i-do-a-line-break-line-continuation-in-python
Execution flow
There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function than can accept parameters and can return values).
One execution flow tool is Databricks Workflows.
Dataframe
So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.
https://www.databricks.com/glossary/what-are-dataframes
https://www.databricks.com/glossary/what-is-rdd
Module: Analyze data with Apache Spark in Azure Synapse Analytics
Lab 05: Synapse Spark (demo)
Module: Transform data with Spark in Azure Synapse Analytics
Lab 06: Transform Spark
Delivery Hint: Would work well as an instructor-led lab.
Module: Use Delta Lake in Azure Synapse Analytics
Lab 07: Spark Delta Lake
4. Learning Path 6 (Slide Deck 4): Work with data warehouses using Azure Synapse Analytics
Module: Analyze data in a relational data warehouse
Integrity Constraints
Data warehouses still have keys, but they may not be implemented at the database layer, especially in massively parallel situations.
Why do Data warehouses don't have a Primary Key check that is enforced? (r/dataengineering)
Replication
Relevant to review question 3.
Consider using a hash-distributed table when:
- The table size on disk is more than 2 GB.
- The table has frequent insert, update, and delete operations.
Consider using the round-robin distribution for your table in the following scenarios:
- When getting started as a simple starting point since it is the default
- If there is no obvious joining key
- If there is no good candidate column for hash distributing the table
- If the table does not share a common join key with other tables
- If the join is less significant than other joins in the query
- When the table is a temporary staging table
Backup & Restore
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/backup-and-restore
Lab 08: Data Warehouse (demo)
Module: Load data into a relational data warehouse
Slowly Changing Dimension
https://en.wikipedia.org/wiki/Slowly_changing_dimension includes types 4, 5 and 6, as well as combinational approaches.
Lab 09: Load Data Warehouse
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~30 minutes). Note that the lab will probably take less time to do than to set up.
Module: Manage and monitor data warehouse activities in Azure Synapse Analytics (not in slide deck)
Module: Secure a data warehouse in Azure Synapse Analytics (not in slide deck)
5. Learning Path 4 (Slide Deck 5): Transfer and transform data with Azure Synapse Analytics Pipelines
Notes
Module: Build a data pipeline in Azure Synapse Analytics
Delivery Hint: Show the View JSON button.
Lab 10: Synapse Pipeline
Delivery Hint: Instructor-led lab?
Module: Use Spark Notebooks in an Azure Synapse Pipeline
Lab 11: Pipeline Notebook
6. Learning Path 7 (Slide Deck 6): Work with hybrid transactional and analytical processing (HTAP) Solutions using Azure Synapse Analytics
Module: Plan hybrid transactional and analytical processing using Azure Synapse Analytics
Cosmos DB
For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.
Dataverse
Module: Implement Azure Synapse Link with Azure Cosmos DB
Lab 14: Synapse Link (Cosmos DB)
Module: Implement Azure Synapse Link for SQL
This is supported by the Enterprise, Standard and Web editions of SQL Server 2022.
Lab 15: Synapse Link (SQL) (demo)
Several of the columns have user-defined types. These will not be replicated, and will generate a warning, "Table contains unsupported columns…".
SELECT * FROM sys.types WHERE is_user_defined = 1;
7. Learning Path 8 (Slide Deck 7): Implement a data streaming solution with Azure Stream Analytics
Module: Get started with Azure Stream Analytics
Windowing Functions
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
14. Data Engineering on MS Azure DP 203 - Stream Analytics windowing functions - YouTube
What are windowing functions in stream analytics - YouTube
Lab 17: Stream Analytics (demo)
Module: Ingest streaming data using Azure Stream Analytics and Azure Synapse Analytics
Lab 18: Stream Analytics and Synapse
Create an output for the SQL table
In practice you wouldn't use a sysadmin account here. Right…?
Run the streaming job to aggregate order data
To emphasize the ingestion of the streaming data, I suggest adding an ORDER BY clause to the last query.
SELECT TOP (100)
*
FROM
OPENROWSET(
BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/2024/**',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS [result]
ORDER BY
StartTime DESC;
Module: Visualize real-time data with Azure Stream Analytics and Power BI
Lab 19: Stream Analytics and Power BI (demo)
8. Learning Path 9 (Slide Deck 9): Implement a data lakehouse analytics solution with Azure Databricks
FKA: Data Engineering with Azure Databricks
Module: Explore Azure Databricks
Databricks extension for Visual Studio Code
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/tutorial
Run a Databricks notebook from another notebook
Lab 23: Explore Databricks (demo)
Module: Perform data analysis with Azure Databricks (not in slide deck)
Module: Use Apache Spark in Azure Databricks
Lab 24: Databricks Spark
Module: Manage data with Delta Lake (not in slide deck)
Lab 25: Databricks Delta Lake (optional demo, not in slide deck)
Module: Build data pipelines with Delta Live Tables (not in slide deck)
Module: Deploy workloads with Azure Databricks Workflows (not in slide deck)
Lab 26: Databricks SQL (optional demo, not in slide deck)
20 Dec 2024. The instructions in Github were updated two weeks ago, Skillable just has an old version.
The instructions in the lab are for creating a legacy dashboard. These go out-of-service in April 7, 2025.
AI/BI dashboards (FKA Lakeview dashboards) can't be created on saved queries; only on database objects. To load the products data, use the Catalog option in the Data tab to load the products table ( Legacy > hive_metastore > adventureworks > products).
https://docs.databricks.com/en/sql/get-started/visualize-data-tutorial.html
Note that CREATE DATABASE
is an alias for CREATE SCHEMA
and that SCHEMA is preferred
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-database.html
Module: Run Azure Databricks Notebooks with Azure Data Factory (in slide deck but not in learn.microsoft.com)
Lab 27: Databricks ADF (demo)
More Information
Data engineering with Azure Databricks - Training | Microsoft Learn
(The link is on the last slide of the M09 deck.)
?. Learning Path 5 (doesn't have a slide deck!): Implement a Data Analytics Solution with Azure Synapse Analytics
Instructor note: I don't uderstand the point of this learning path! It's the "Intro to Synapse" LP from another course, plonked into the middle of DP-203. Why? If it is intended to be the intro module then why isn't it at the top?
Module: Introduction to Azure Synapse Analytics
Module: Use Azure Synapse serverless SQL pool to query files in a data lake
Module: Analyze data with Apache Spark in Azure Synapse Analytics
Module: Use Delta Lake in Azure Synapse Analytics
Module: Analyze data in a relational data warehouse
Module: Build a data pipeline in Azure Synapse Analytics
?. (Slide Deck 8): Govern data across an enterprise (has been removed from learn.microsoft.com?)
A nice example of lack of consistency. The title of the Purview portal is "Microsoft Purview ► purviewzzzzzzz"; the title of the Synapse portal is "Microsoft Azure | Synapse Analytics ► synapsezzzzzzz".
Also, don't confuse the Microsoft Purview compliance portal (compliance.microsoft.com) with the Microsoft Purview governance portal (web.purview.azure.com). Or the new Purview portal (purview.microsoft.com). :-)
Module: Introduction to Microsoft Purview
Module: Integrate Microsoft Purview and Azure Synapse Analytics
Lab 22: Synapse and Purview
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes). Maybe schedule a break during this lab - the scan can take 10+ minutes.