My Samples
https://tinyurl.com/cbmctsamples
General Delivery Hints
Only some of the review questions are in the slide deck. Use the learn.microsoft.com content to review each module.
General Lab Hints
Use the T symbol to send text into Notepad, then copy and paste from there into the various tools.
The Skillable labs do not have checkboxes. Take careful note of where you are up to.
Slide Deck 01: Get started with data engineering on Azure
https://email.lakefs.io/axoletter-jun23
Module: Introduction to data engineering on Azure
http://craigb-mct.blogspot.com/2020/10/big-data-architecture-with-azure.html
Module: Introduction to Azure Data Lake Storage Gen2
The speaker in the Channel 9 video introducing Azure Data Lake Storage said that this didn't emulate a Hadoop file system; it was a HDFS.
Module: Introduction to Azure Synapse Analytics
Note that the Serverless SQL pool in Azure Synapse Analytics supports only external and temporary tables.
Lab 01: Explore Synapse
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes).
Slide Deck 02: Build data analytics solutions using Azure Synapse Analytics serverless SQL pools
Module: Use Azure Synapse serverless SQL pool to query files in a data lake
Lab 02: Serverless SQL (Demo)
Delivery Hint: I've loaded these data files and queries into honestcraigs.com.
Module: Use Azure Synapse serverless SQL pools to transform data in a data lake
Lab 03: Transform Data with SQL
Note that naming a user-defined procedure with a "sp_" prefix is extremely bad practice. Creating an object without specifying a schema name is just bad practice.
Module: Create a lake database in Azure Synapse Analytics
Lab 04: Lake Database (Demo)
Slide Deck 03: Perform data engineering with Azure Synapse Apache Spark Pools
Links
Notes
Splitting statements across lines
Generally, you should use parens.
https://stackoverflow.com/questions/53162/how-can-i-do-a-line-break-line-continuation-in-python
Execution flow
There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function than can accept parameters and can return values).
One execution flow tool is Databricks Workflows.
Dataframe
So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.
https://www.databricks.com/glossary/what-are-dataframes
https://www.databricks.com/glossary/what-is-rdd
Module: Analyze data with Apache Spark in Azure Synapse Analytics
Lab 05: Synapse Spark (Demo)
Module: Transform data with Spark in Azure Synapse Analytics
Lab 06: Transform Spark
Delivery Hint: Would work well as an instructor-led lab.
Module: Use Delta Lake in Azure Synapse Analytics
Lab 07: Spark Delta Lake
Slide Deck 04: Work with data warehouses using Azure Synapse Analytics
Module: Analyze data in a relational data warehouse
Integrity Constraints
Data warehouses still have keys, but they may not be implemented at the database layer, especially in massively parallel situations.
Why do Data warehouses don't have a Primary Key check that is enforced? (r/dataengineering)
Replication
Relevant to review question 3.
Consider using a hash-distributed table when:
- The table size on disk is more than 2 GB.
- The table has frequent insert, update, and delete operations.
Consider using the round-robin distribution for your table in the following scenarios:
- When getting started as a simple starting point since it is the default
- If there is no obvious joining key
- If there is no good candidate column for hash distributing the table
- If the table does not share a common join key with other tables
- If the join is less significant than other joins in the query
- When the table is a temporary staging table
Backup
Lab 08: Data Warehouse (Demo)
Module: Load data into a relational data warehouse
Slowly Changing Dimension
https://en.wikipedia.org/wiki/Slowly_changing_dimension includes types 4, 5 and 6, as well as combinational approaches.
Lab 09: Load Data Warehouse
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~30 minutes). Note that the lab will probably take less time to do than to set up.
Slide Deck 05: Transfer and transform data with Azure Synapse Analytics Pipelines
Notes
Module: Build a data pipeline in Azure Synapse Analytics
Delivery Hint: Show the View JSON button.
Lab 10: Synapse Pipeline
Delivery Hint: Instructor-led lab?
Module: Use Spark Notebooks in an Azure Synapse Pipeline
Lab 11: Pipeline Notebook
Slide Deck 06: Work with hybrid transactional and analytical processing (HTAP) Solutions using Azure Synapse Analytics
Module: Plan hybrid transactional and analytical processing using Azure Synapse Analytics
Cosmos DB
For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.
Dataverse
Module: Implement Azure Synapse Link with Azure Cosmos DB
Links
Lab 14: Synapse Link (Cosmos DB)
Module: Implement Azure Synapse Link for SQL
This is supported by the Enterprise, Standard and Web editions of SQL Server 2022.
Lab 15: Synapse Link (SQL) (Demo)
Slide Deck 07: Implement a data streaming solution with Azure Stream Analytics
Module: Get started with Azure Stream Analytics
Windowing Functions
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
14. Data Engineering on MS Azure DP 203 - Stream Analytics windowing functions - YouTube
What are windowing functions in stream analytics - YouTube
Lab 17: Stream Analytics (Demo)
Module: Ingest streaming data using Azure Stream Analytics and Azure Synapse Analytics
Lab 18: Stream Analytics and Synapse
In the Create an output for the SQL table section, replace "Table: FactOrder" with "Table: dbo.FactOrder". *sigh*
Module: Visualize real-time data with Azure Stream Analytics and Power BI
Lab 19: Stream Analytics and Power BI (Demo)
Slide Deck 08: Govern data across an enterprise
A nice example of lack of consistency. The title of the Purview portal is "Microsoft Purview ► purviewzzzzzzz"; the title of the Synapse portal is "Microsoft Azure | Synapse Analytics ► synapsezzzzzzz".
Also, don't confude the Microsoft Purview compliance portal (compliance.microsoft.com) with the Microsoft Purview governance portal (web.purview.azure.com).
Module: Introduction to Microsoft Purview
Module: Integrate Microsoft Purview and Azure Synapse Analytics
Lab 22: Synapse and Purview
Delivery Hint: Start the lab at the beginning of the lecture, do the lecture while the provisioning step is running (~20 minutes). Maybe schedule a break during this lab - the scan can take 10+ minutes.
Slide Deck 09: Data Engineering with Azure Databricks
Module: Explore Azure Databricks
Databricks extension for Visual Studio Code
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/vscode-ext/tutorial
Run a Databricks notebook from another notebook
Lab 23: Explore Databricks (Demo)
Module: Use Apache Spark in Azure Databricks
Lab 24:Databricks Spark
Lab 25: Databricks Delta Lake (Optional Demo, not in slide deck)
Lab 26: Databricks SQL (Optional Demo, not in slide deck)
Module: Run Azure Databricks Notebooks with Azure Data Factory
Lab 27:Databricks ADF (Demo)
More Information
Data engineering with Azure Databricks - Training | Microsoft Learn
(The link is on the last slide of the M09 deck.)