Course DP-203 (old)


My Samples

https://tinyurl.com/cbmctsamples


General Delivery Hints

On the first morning, start the Skillable environment, activate the Azure sponsored trial, and do the setup lab (Module 0) as early as possible.


General Lab Hints

Don't close any command windows. We will probably need to copy text from them.

Copy the names of your Azure resources (at least the suffix, the resource group name, and the data lake storage account name) and paste them into a text file saved to the desktop. You are going to be copying and pasting these names a lot during the labs.

Do not pause the Synapse Dedicated Pool unless it is the end of the day.

Bookmark Synapse Studio.


Module 0 - Lab environment setup with a pre-installed virtual machine

Before running any of the PowerShell commands, remove the AzureRM modules.
Get-Module -ListAvailable -Name "azurerm*" | Uninstall-Module

Before running dp-203-setup-Part01.ps1, edit it in PowerShell ISE. On line 156 (between the Write-Host and New-AzResourceGroupDeployment cmdlets) add a line as follows.
Start-Sleep -Seconds 120

Once dp-203-setup-Part01.ps1 has finished, copy the "Values from the first setup script here" section and save it in a text file.


Module 1: Explore compute and storage options for data engineering workloads

Delivery hint: Run this as an instructor-led lab, and fill the bits where we wait for scripts to complete with some lecture.

A note on case: PEP 8, on Python's official website, says to use snake_case for function and variable names. This course uses a horrible mixture of Systems and Apps Hungarian notation and camelCase.

Bronze, silver, gold? See https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html.
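
If you want something concrete to show, here is a minimal PySpark/Delta Lake sketch of the idea: raw data lands in bronze, cleaned data goes to silver, and reporting aggregates go to gold. The paths and column names below are made up for illustration, not taken from the labs.

# Hypothetical data lake paths.
bronze_path = "/delta/sales/bronze"
silver_path = "/delta/sales/silver"
gold_path   = "/delta/sales/gold"

# Bronze: land the raw files as-is.
raw_df = spark.read.json("/raw/sales/*.json")
raw_df.write.format("delta").mode("append").save(bronze_path)

# Silver: clean and conform.
silver_df = (spark.read.format("delta").load(bronze_path)
    .dropDuplicates(["orderId"])
    .filter("quantity > 0"))
silver_df.write.format("delta").mode("overwrite").save(silver_path)

# Gold: business-level aggregates for reporting.
gold_df = silver_df.groupBy("productId").sum("quantity")
gold_df.write.format("delta").mode("overwrite").save(gold_path)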


Lab 1 - Explore compute and storage options for data engineering workloads

Before running dp-203-setup-Part02.ps1, edit the file and check that the "Values from the first setup script here" section matches the one you saved.

Once dp-203-setup-Part02.ps1 has finished, copy the "Values from the first setup script here" section and save it in a text file. 


Module 2: Run interactive queries using serverless SQL pools


Lab 2 - Run interactive queries using serverless SQL pools

Exercise 2, tasks 2 and 3: Both adding RBAC role assignments and adding accounts to groups can take a while to take effect.

Exercise 2, task 5: We are writing the file to the Year-2016 folder. The screenshots showing 2019 are in error.

At the end of the lab, your account must be a member of two groups. If not, the OPENROWSET queries in later labs will not run.


Module 3: Data Exploration and Transformation in Azure Databricks

Execution flow

There is no flow control inside a notebook - the notebook is the unit of execution (think of it as a function that can accept parameters and can return values). Databricks Workflows are one execution flow tool.

https://stackoverflow.com/questions/57577106/execution-of-cmd-cells-in-databricks-notebook-based-on-some-condition

https://www.databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html

https://docs.databricks.com/notebooks/notebook-workflows.html
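
A minimal sketch of chaining notebooks with dbutils.notebook - the notebook path and the parameter name are made up:

# Caller notebook: run a child notebook with a parameter and capture its return value.
result = dbutils.notebook.run("/Shared/child-notebook", 600, {"year": "2016"})
print(result)

# Child notebook: read the parameter and hand a value back to the caller.
year = dbutils.widgets.get("year")
dbutils.notebook.exit(f"processed {year}")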

Dataframe

So, what actually is a dataframe? It's a data structure that organises data into a 2-dimensional table, much like a spreadsheet.

https://www.databricks.com/glossary/what-are-dataframes

https://www.databricks.com/glossary/what-is-rdd
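
For example, a tiny PySpark DataFrame built from made-up data:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df = spark.createDataFrame([("Ada", 36), ("Grace", 45)], schema)
df.show()         # the rows and columns, like a small spreadsheet
df.printSchema()  # the column names and types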


Lab 3 - Data exploration and transformation in Azure Databricks

Notebook 1.Describe-a-dataframe, fun with dbutils.fs.ls

# Cmd 8
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

fslsSchema = StructType(
  [
    StructField('path', StringType()),
    StructField('name', StringType()),
    StructField('size', LongType()),
    StructField('modtime', LongType())
  ]
)

path = source + "…"
files = dbutils.fs.ls(path)   # dbutils.fs.ls returns a list of FileInfo objects
df_files = spark.createDataFrame(files, fslsSchema)
df_files.display()

https://stackoverflow.com/questions/66166411/create-a-dataframe-out-of-dbutils-fs-ls-output-in-databricks

Hint for the Distinct Articles exercise (highlight it to show).

(spark
  .read
  .parquet(path)
  .printSchema()
)


Module 4: Explore, transform, and load data into the Data Warehouse using Apache Spark

Ingest data with Apache Spark notebooks in Azure Synapse Analytics

Yes, "magic command" is a technical term. "There are some who call me… Tim."

Links:
https://docs.microsoft.com/en-us/azure/azure-monitor/overview


Lab 4 - Explore, transform, and load data into the Data Warehouse using Apache Spark

Note that some of the cells are supposed to produce errors.

Lab setup and pre-requisites, Task 2: The following error may occur and can safely be ignored:
Invoke-SqlCmd: Principal 'user@domain.onmicrosoft.com' could not be created. Only connections established with Active Directory accounts can create other Active Directory users.

Lab setup and pre-requisites, Task 2: The following error may occur and can safely be ignored:
07-create-wwi-perf-sale-heap with label CTAS : Sale_Heap.
Cannot index into a null array.

Exercise 1, Task 2, Step 5: There are some narrow spaces in the markdown code (in the TOP line, for example) which will come up with an orange underline. Replace them with normal spaces.

Exercise 1, Task 2, Steps 10 and 11: Do NOT open the bad csv file in Visual Studio Code. It will bring the machine to its knees.

Exercise 3, task 1: We could use Python f-strings here as well.
df = (spark
    .read
    .option('inferSchema', 'true')
    .json(f'abfss://wwi-02@{datalake}.dfs.core.windows.net/online-user-profiles-02/*.json', multiLine=True)
)

Also note that we don't need both backslashes and parentheses - either one works for splitting a Python statement across multiple lines. I don't know why the author used both.
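
For reference, here is the same kind of chained read written each way (assuming path is already defined):

# Parentheses only (implicit continuation):
df = (spark
    .read
    .json(path))

# Backslashes only (explicit continuation):
df = spark \
    .read \
    .json(path)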


Module 5: Ingest and load data into the Data Warehouse

Scaling

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview


Lab 5 - Ingest and load data into the data warehouse

Prereq: Lab 5 requires lab 4 to be completed.

Before running dp-203-setup-Part03.ps1, edit the file and check that the "Values from the first setup script here" section matches the one you saved.


Module 6: Transform Data with Azure Data Factory or Azure Synapse Pipelines

CosmosDB

For an introduction to querying, see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/getting-started.


Lab 6 - Transform data with Azure Data Factory or Azure Synapse Pipelines

Prereq: Lab 6 requires labs 4 and 5 to be completed.

Lab Hint: Careful with the ADF pipeline instructions. The ADF web editor is, shall we say, unforgiving.

Lab Hint: Exercise 2, Task 1, Step 21: In the UserPreferredProducts Flatten transform, make sure you rename the output column "preferredProductId" not "preferredProducts".

Portal Change: Exercise 3, Task 1, Step 7: Setting Compute Size to Small sets the type to Basic and the core count to 4.

Sample item for Cosmos DB database OnlineUserProfile01.

{ "userId": 85601, "cartId": "d9e9c98c-9dd9-4b20-807a-00cb3b876676", "preferredProducts": [ 3706, 2945, 3962, 1801, 2338, 4880, 4719 ], "productReviews": [ { "productId": 718, "reviewText": "This Awesome Metal Shoes, does exactly what it's suppose to do.", "reviewDate": "2017-06-09T09:27:41.1779877+03:00" }, { "productId": 4868, "reviewText": "It only works when I'm Norway.", "reviewDate": "2016-12-03T15:31:56.7074622+02:00" }, { "productId": 1622, "reviewText": "My terrier loves to play with it.", "reviewDate": "2020-04-10T16:17:02.0999133+03:00" } ] }


Module 7: Integrate Data from Notebooks with Azure Data Factory or Azure Synapse Pipelines


Lab 7 - Integrate data from Notebooks with Azure Data Factory or Azure Synapse Pipelines

Prereq: Lab 7 requires labs 4, 5 and 6 to be completed.

Portal Change: Exercise 1, Task 2, Step 5: Setting Compute Size to Small sets the type to Basic and the core count to 4.


Module 8: End-to-end security with Azure Synapse Analytics


Lab 8 - End-to-end security with Azure Synapse Analytics

Prereq: Lab 8 requires lab 4 to be completed.


Module 9: Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link

https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link-frequently-asked-questions

Note that Synapse Link for Azure SQL and SQL Server 2022 recently went into public preview.
https://docs.microsoft.com/en-us/azure/synapse-analytics/synapse-link/sql-synapse-link-overview
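
For the Cosmos DB side, reading the analytical store from a Synapse Spark notebook boils down to something like this sketch - the linked service and container names are placeholders, not the lab's:

# Requires a workspace linked service pointing at a Cosmos DB account
# whose container has the analytical store (Synapse Link) enabled.
df = (spark.read
    .format("cosmos.olap")
    .option("spark.synapse.linkedService", "<CosmosDbLinkedService>")
    .option("spark.cosmos.container", "<container>")
    .load())

df.printSchema()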


Lab 9 - Support Hybrid Transactional Analytical Processing (HTAP) with Azure Synapse Link

Prereq: Lab 9 requires labs 4, 5 and 6 to be completed.

Delivery Hint: Enabling Azure Synapse Link takes some time. Do exercise 1, task 1 of the lab at the beginning of the lecture.

Lab Hint: Exercise 3, Task 1, Step 5: YOUR_ACCOUNT_NAME is the name of the Cosmos DB account without the ".documents.azure.com" suffix.


Module 10: Real-time Stream Processing with Stream Analytics

https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions


Lab 10 - Real-time Stream Processing with Stream Analytics

Prereq: Lab 10 requires lab 4 to be completed.

Getting an error "To start the job you need to create an input" when trying to start the Stream Analytics job? Close the browser tab, open a new tab, open the Azure Portal page, and try again.


Module 11: Create a Stream Processing Solution with Event Hubs and Azure Databricks


Lab 11 - Create a stream processing solution with Event Hubs and Azure Databricks

If you are using the Skillable labs, you will have to install the library mentioned in cell 2 of the Streaming-With-Event-Hubs-Demo notebook.

Exercise 3, Task 2: We are copying the connection string from the Event Hubs namespace's shared access policy, not from the event hub's policy (which is the opposite of the last lab).
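
For reference, the streaming read in the notebook boils down to something like this. It is only a sketch - the exact cell may differ, and the connection string below is a placeholder you replace with the namespace-level one from the previous hint.

# Assumes the com.microsoft.azure:azure-eventhubs-spark Maven library is installed on the cluster.
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=<key>;EntityPath=<event-hub>"

eh_conf = {
    # Newer versions of the library expect the connection string to be encrypted.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

stream_df = (spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load())

# The payload arrives as binary; cast it to a string to inspect it.
display(stream_df.withColumn("body", stream_df["body"].cast("string")))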