My Samples
https://tinyurl.com/cbmctsamples
General Delivery Hints
On the first morning, start the Skillable environment, activate the Azure sponsored trial, and do the setup lab (module 0) as early as possible.
General Lab Hints
Don't close any command windows. We will probably need to copy text from them.
Copy the names of your Azure resources (at least the suffix, the resource group name, and the data lake storage account name) and paste them into a text file saved to the desktop. You are going to be copying and pasting these names a lot during the labs.
Do not pause the Synapse Dedicated Pool unless it is the end of the day.
Bookmark Synapse Studio.
Module 0 - Lab environment setup with a pre-installed virtual machine
Before running any of the PowerShell commands, remove the AzureRM modules.
Get-Module -ListAvailable -Name "azurerm*" | Uninstall-Module
Before running dp-203-setup-Part01.ps1, edit it in PowerShell ISE. On line 156 (between the Write-Host and New-AzResourceGroupDeployment cmdlets) add a line as follows.
Start-Sleep -Seconds 120
Once dp-203-setup-Part01.ps1 has finished, copy the "Values from the first setup script here" section and save it in a text file.
Module 1: Explore compute and storage options for data engineering workloads
Delivery hint: Run this as an instructor-led lab, and fill the stretches where we wait for scripts to complete with some lecture.
A note on case: Python's official style guide (PEP 8) recommends snake_case. This course uses a horrible mixture of Systems and Apps Hungarian notation in camelCase.
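A hypothetical illustration of the difference (the names are made up, not taken from the lab code):
# PEP 8 snake_case
user_profile_count = 10
# Hungarian-flavoured camelCase as seen in the labs
intUserProfileCount = 10   # Systems Hungarian: prefix encodes the data type
dfUserProfiles = None      # Apps Hungarian: prefix encodes what kind of thing it is (a dataframe)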
Bronze, silver, gold? See https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html.
Lab 1 - Explore compute and storage options for data engineering workloads
Before running dp-203-setup-Part02.ps1, edit the file and check that the "Values from the first setup script here" section matches the one you saved.
Once dp-203-setup-Part02.ps1 has finished, copy the "Values from the first setup script here" section and save it in a text file.
Module 2: Run interactive queries using serverless SQL pools
Lab 2 - Run interactive queries using serverless SQL pools
Exercise 2, tasks 2 and 3: Both adding RBAC role assignments and adding accounts to groups can take a while to take effect.
Exercise 2, task 5: We are writing the file to the Year-2016 folder. The screenshots showing 2019 are in error.
At the end of the lab, your account must be a member of two groups. If not, the OPENROWSET queries in later labs will not run.
Module 3: Data Exploration and Transformation in Azure Databricks
Execution flow
There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function that can accept parameters and can return values). Databricks Workflows are one execution flow tool.
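A minimal sketch of that function-like behaviour, assuming a child notebook called child-notebook with a text widget named input_path (both names are invented for illustration):
# In the child notebook: read a parameter and return a value
dbutils.widgets.text("input_path", "")          # declares the parameter
path = dbutils.widgets.get("input_path")
dbutils.notebook.exit(f"processed {path}")      # the notebook's "return value"

# In the calling notebook (or a Databricks Workflow task):
result = dbutils.notebook.run("./child-notebook", 600, {"input_path": "/mnt/raw/sales"})
print(result)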
Dataframe
So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.
https://www.databricks.com/glossary/what-are-dataframes
https://www.databricks.com/glossary/what-is-rdd
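A quick sketch of creating and inspecting a small dataframe (the data is made up, just to show the two-dimensional shape):
# Each tuple becomes a row; the list of column names gives the table its second dimension
df = spark.createDataFrame(
    [(1, "Bronze"), (2, "Silver"), (3, "Gold")],
    ["layer_id", "layer_name"]
)
df.printSchema()   # the column structure
df.show()          # the rows, rendered as a table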
Lab 3 - Data exploration and transformation in Azure Databricks
Notebook 1.Describe-a-dataframe, fun with dbutils.fs.ls
Hint for the Distinct Articles exercise (highlight it to show).
(spark
.read
.parquet(path)
.printSchema()
)
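For the dbutils.fs.ls part, a small sketch (the path is just an example, not one from the notebook):
# dbutils.fs.ls returns a list of FileInfo objects with path, name and size attributes
files = dbutils.fs.ls("/databricks-datasets")
for f in files:
    print(f.name, f.size)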
Module 4: Explore, transform, and load data into the Data Warehouse using Apache Spark
Ingest data with Apache Spark notebooks in Azure Synapse Analytics
Yes, "magic command" is a technical term. "There are some who call me… Tim."
Links:
https://docs.microsoft.com/en-us/azure/azure-monitor/overview
Lab 4 - Explore, transform, and load data into the Data Warehouse using Apache Spark
Note that some of the cells are supposed to error.
Lab setup and pre-requisites, Task 2: The following error may occur and can safely be ignored:
Invoke-SqlCmd: Principal 'user@domain.onmicrosoft.com' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.
Lab setup and pre-requisites, Task 2: The following error may occur and can safely be ignored:
07-create-wwi-perf-sale-heap with label CTAS : Sale_Heap.
Cannot index into a null array.
Exercise 1, Task 2, Step 5: There are some narrow spaces in the markdown code (in the TOP line, for example) which will come up with an orange underline. Replace them with normal spaces.
Exercise 1, Task 2, Steps 10 and 11: Do NOT open the bad csv file in Visual Studio Code. It will bring the machine to its knees.
Exercise 3, task 1: We could use Python f-strings here as well.
df = (spark
    .read
    .option('inferSchema', 'true')
    .json(f'abfss://wwi-02@{datalake}.dfs.core.windows.net/online-user-profiles-02/*.json', multiLine=True)
)
Also note that we don't need both backslashes and parentheses - either on its own works for splitting a Python statement across multiple lines. I don't know why the author used both.
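A quick illustration of the two continuation styles (the file path is just an example):
# Implicit continuation: anything inside parentheses can span multiple lines
df = (spark
    .read
    .option('header', 'true')
    .csv('/tmp/example.csv'))

# Explicit continuation: a trailing backslash does the same job without the parentheses
df = spark \
    .read \
    .option('header', 'true') \
    .csv('/tmp/example.csv')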
Module 5: Ingest and load data into the Data Warehouse
Scaling
Lab 5 - Ingest and load data into the data warehouse
Prereq: Lab 5 requires lab 4 to be completed.
Before running dp-203-setup-Part03.ps1, edit the file and check that the "Values from the first setup script here" section matches the one you saved.
Module 6: Transform Data with Azure Data Factory or Azure Synapse Pipelines
Cosmos DB
For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.
Lab 6 - Transform data with Azure Data Factory or Azure Synapse Pipelines
Prereq: Lab 6 requires labs 4 and 5 to be completed.
Lab Hint: Careful with the ADF pipeline instructions. The ADF web editor is, shall we say, unforgiving.
Lab Hint: Exercise 2, Task 1, Step 21: In the UserPreferredProducts Flatten transform, make sure you rename the output column "preferredProductId" not "preferredProducts".
Portal Change: Exercise 3, Task 1, Step 7: Setting Compute Size to Small sets the type to Basic and the core count to 4.
Sample item for Cosmos DB database OnlineUserProfile01.
{
    "userId": 85601,
    "cartId": "d9e9c98c-9dd9-4b20-807a-00cb3b876676",
    "preferredProducts": [
        3706,
        2945,
        3962,
        1801,
        2338,
        4880,
        4719
    ],
    "productReviews": [
        {
            "productId": 718,
            "reviewText": "This Awesome Metal Shoes, does exactly what it's suppose to do.",
            "reviewDate": "2017-06-09T09:27:41.1779877+03:00"
        },
        {
            "productId": 4868,
            "reviewText": "It only works when I'm Norway.",
            "reviewDate": "2016-12-03T15:31:56.7074622+02:00"
        },
        {
            "productId": 1622,
            "reviewText": "My terrier loves to play with it.",
            "reviewDate": "2020-04-10T16:17:02.0999133+03:00"
        }
    ]
}
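To tie this back to the querying link above, a minimal sketch using the azure-cosmos Python SDK; the endpoint, key, and the CustomerProfile database name are assumptions, not values taken from the lab scripts:
from azure.cosmos import CosmosClient

# Placeholders - substitute your own account endpoint, key, and database/container names
client = CosmosClient("https://<account-name>.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("CustomerProfile").get_container_client("OnlineUserProfile01")

# Cosmos DB SQL query against items shaped like the sample above
query = "SELECT c.userId, ARRAY_LENGTH(c.preferredProducts) AS productCount FROM c WHERE c.userId = 85601"
for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item)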
Module 7: Integrate Data from Notebooks with Azure Data Factory or Azure Synapse Pipelines
Lab 7 - Integrate data from Notebooks with Azure Data Factory or Azure Synapse Pipelines
Prereq: Lab 7 requires labs 4, 5 and 6 to be completed.
Portal Change: Exercise 1, Task 2, Step 5: Setting Compute Size to Small sets the type to Basic and the core count to 4.
Module 8: End-to-end security with Azure Synapse Analytics
Lab 8 - End-to-end security with Azure Synapse Analytics
Prereq: Lab 8 requires lab 4 to be completed.
Module 9: Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link
https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link-frequently-asked-questions
Note that Synapse Link for Azure SQL and SQL Server 2022 recently went into public preview.
https://docs.microsoft.com/en-us/azure/synapse-analytics/synapse-link/sql-synapse-link-overview
Lab 9 - Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link
Prereq: Lab 9 requires labs 4, 5 and 6 to be completed.
Delivery Hint: Enabling Azure Synapse Link takes some time. Do exercise 1, task 1 of the lab at the beginning of the lecture.
Lab Hint: Exercise 3, Task 1, Step 5: YOUR_ACCOUNT_NAME is the name of the Cosmos DB account without the ".documents.azure.com" suffix.
Module 10: Real-time Stream Processing with Stream Analytics
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
Lab 10 - Real-time Stream Processing with Stream Analytics
Prereq: Lab 10 requires lab 4 to be completed.
Getting an error "To start the job you need to create an input" when trying to start the Stream Analytics job? Close the browser tab, open a new tab, open the Azure Portal page, and try again.
Module 11: Create a Stream Processing Solution with Event Hubs and Azure Databricks
Lab 11 - Create a stream processing solution with Event Hubs and Azure Databricks
If you are using Skillable labs then you will have to install the library mentioned in cell 2 of the Streaming-With-Event-Hubs-Demo notebook.
Exercise 3, Task 2. We are copying the connection string from the Event Hub Namespace's shared access policy, not from the Event Hub's policy (which is the opposite of the last lab).