My Samples
https://tinyurl.com/cbmctsamples
Book Recommendations
Mastering Azure Analytics, by Zoiner Tejada.
Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud, by Robert Ilijason.
Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions, by Sudhir Rawat and Abhishek Narain.
Architecture
There are a number of different high-level architecture diagrams available for big data processing, with various names for the phases.
The most common version has nine phases: Data Sources, Data Storage, Real-Time Message Ingestion, Batch Processing, Stream Processing, Machine Learning, Analytical Data Store, Analytics & Reporting, and Orchestration.
https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
Some Microsoft docs simplify it into four phases: Load & Ingest, Store, Process, Serve. Annoyingly, they often name them differently. For example, course DP-201 and its exam use the terms Ingestion, Data Storage, Analysis, and Virtualization. Except where they use Ingest, Process, Store, and Analyse/Report… Course DP-200 and its exam use Ingest, Store, Prep & Train, and Model & Serve. Sheesh.
Choice of Batch Processing services
Azure Data Lake Analytics, Azure Databricks and Azure HDInsight have a lot of overlap in their use cases (the Batch Processing section of the big picture).
This course only covers Databricks (possibly because it is the newest?).
Note that Data Lake Analytics hasn't seen any updates for a couple of years (and its query language, U-SQL, doesn't support Data Lake Storage Gen2). Does it have a future?
https://www.clearpeaks.com/cloud-analytics-on-azure-databricks-vs-hdinsight-vs-data-lake-analytics/
https://stackoverflow.com/questions/50679909/azure-data-lake-vs-azure-hdinsight
https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/
We could also mention Azure Batch, though it is more an HPC service than a BI service.
https://azure.microsoft.com/en-us/services/batch/
Module 1, Surveying the Azure Data Platform, Azure Data Lake Storage
There is no in-place upgrade available. Customers must migrate data from Gen1 to Gen2.
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2
For ingesting data, there is also ADLCopy.exe, but only for Data Lake Storage Gen1.
Module 2, Module Introduction, Create a Data Lake Storage Account (Gen2)
Enabling Hierarchical namespace can only be done at creation time.
Module 2, Module Introduction, Big Data Use Cases
Azure Analysis Services is not discussed in this course. It is mentioned briefly in course DP-201.
In short, it is for building semantic model cubes. It is a PaaS equivalent of SQL Server Analysis Services (tabular models, the engine behind Power Pivot). In the big picture, it goes in the Analytical Data Store section, though sometimes it is placed in the Analytics & Reporting section.
Lab 2
Note that the various resources you create are required in later labs. Do not delete them at the end of each lab.
Refer to the class resources handout.
Module 3, Introduction to Azure Databricks, Enterprise Security
RBAC requires the Premium pricing tier.
Module 3, Performing Transformations with Azure Databricks, ETL process
Message brokers.
https://www.softkraft.co/aws-kinesis-vs-kafka-comparison/
https://dzone.com/articles/evaluating-message-brokers-kafka-vs-kinesis-vs-sqs
https://epsagon.com/development/kafka-rabbitmq-or-kinesis-solution-comparison/
File types.
IDEs
It is possible to connect IDEs like Visual Studio and Eclipse to an Azure Databricks cluster using Databricks Connect.
https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
Lab 3, Exercise 3, Task 5, Step 6.
LODS markdown error: The Scala code should be:
// Show result of reading the JSON file
df.show()
Module 4, Create an Azure Cosmos DB database built to scale, Create an Azure Cosmos DB account in the Azure portal
Note that Cosmos DB supports reserved capacity pricing.
You cannot change the data model of an account after creation. Additionally, you cannot query a particular model with another model's API.
Module 4, Create an Azure Cosmos DB database built to scale, How to Choose a Partition Key
Replace:
"The storage space for the data associated with each partition key cannot exceed 10 GB,"
With:
"The storage space for the data associated with each partition key cannot exceed 20 GB,"
https://docs.microsoft.com/en-us/azure/cosmos-db/concepts-limits
Module 4, Create an Azure Cosmos DB database built to scale, Creating a Database and a Collection in Cosmos DB
The general terms (and the ones the Azure Portal uses) are Database Account, Database, Container, Item.
The container is the unit of scalability. It is horizontally partitioned and replicated across multiple regions. Items added to a container are automatically distributed across a set of logical partitions based on the partition key.
The different models have specific terms for containers and items.
SQL: Container, Item
Cassandra: Table, Row
MongoDB: Collection, Document
Gremlin: Graph, Node or Edge
Table: Table, Item
https://docs.microsoft.com/en-us/azure/cosmos-db/databases-containers-items
Module 4, Create an Azure Cosmos DB database built to scale, Creating a Database and a Collection in Cosmos DB
Note the Serverless option for simplicity and cost. Note that Serverless accounts can only run in one region (no georeplication).
How to choose between provisioned throughput and serverless
Module 4, Insert and query data in your Azure Cosmos DB database, Insert and query data in your Azure Cosmos DB database
https://docs.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started
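For orientation, a query against the lab's Products container (see Lab 4) reads like ordinary SQL over JSON properties; the manufacturer value below is made up for illustration:
SELECT p.id, p.manufacturer, p.description
FROM Products p
WHERE p.manufacturer = "Contoso"
ORDER BY p.id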
Module 4, Insert and query data in your Azure Cosmos DB database, Running Complex Operations on your Data
All the database operations within the scope of a container's logical partition are transactionally executed within the database engine that is hosted by the replica of the partition.
https://docs.microsoft.com/en-us/azure/cosmos-db/database-transactions-optimistic-concurrency
Module 4, Insert and query data in your Azure Cosmos DB database, Distribute your Data Globally with Azure Cosmos DB
https://docs.microsoft.com/en-us/azure/cosmos-db/high-availability
Note that Cosmos DB can use Availability Zones.
https://docs.microsoft.com/en-us/azure/cosmos-db/high-availability#availability-zone-support
Module 4, Insert and query data in your Azure Cosmos DB database, Cosmos DB Data Consistency Levels
https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Note the bullet points listing the differences between same/different regions and single/multi-master accounts.
https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels-tradeoffs
Cosmos accounts configured for multi-master cannot be configured for strong consistency as it is not possible for a distributed system to provide an RPO of zero and an RTO of zero. Additionally, there are no write latency benefits for using strong consistency with multi-master as any write into any region must be replicated and committed to all configured regions within the account. This results in the same write latency as a single master account.
Definitions: Linearizability, Monotonic reads, monotonic writes, read-your-writes, write-follows-reads.
The strong and bounded staleness consistency levels consume approximately two times more RUs while performing read operations when compared to that of other relaxed consistency levels.
https://docs.microsoft.com/en-us/azure/cosmos-db/request-units
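For example, a point read of a 1 KB item costs about 1 RU at session consistency, so the same read costs roughly 2 RUs at strong or bounded staleness.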
Lab 4, Exercise 1, Task 3, Step 4.
LODS markdown error: The SQL code should be:
SELECT p.id, p.manufacturer, p.description FROM Products p WHERE p.id ="1"
Module 5, Azure SQL Database, Creating an Azure SQL Database
The Azure SQL logical server is the container for SQL Databases and Dedicated SQL Pools (formerly SQL DW).
Module 5, Azure Synapse Analytics, Massively Parallel Processing(MPP) Concepts
https://redmondmag.com/articles/2020/12/03/azure-purview-azure-synapse-analytics.aspx?m=1
Module 5, Azure Synapse Analytics, Azure Synapse Analytics Table Geometries
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/cheat-sheet
Module 5, Azure Synapse Analytics, Azure Synapse Analytics Table Geometries, Hash-distributed tables
Note the single-column requirement. Hash distribution on multiple columns is on the roadmap (as of June 2020).
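A minimal T-SQL sketch of a hash-distributed table (the table and column names are invented for illustration):
CREATE TABLE dbo.FactSale
(
    SaleKey     BIGINT NOT NULL,
    CustomerKey INT NOT NULL,
    Amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey), -- exactly one column
    CLUSTERED COLUMNSTORE INDEX
);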
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Upload text Data into Azure Blob Store
The maximum row size is 1MB.
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Obtain the Azure Storage URL and Key
You can use a storage key, a service principal, or a managed identity. TODO: SAS tokens are unsupported.
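A minimal T-SQL sketch of the storage-key route (the credential, data source and placeholder values are invented for illustration):
CREATE MASTER KEY; -- once per database, if not already present
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';
CREATE EXTERNAL DATA SOURCE LabAzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);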
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Review Questions, Question 1
The ideal number of compressed files is the maximum number of data reader processes per compute node. In SQL Server and Parallel Data Warehouse, the maximum number of data reader processes is 8 per node except Azure SQL Data Warehouse Gen2 which is 20 readers per node.
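For example, a Gen2 pool at DW1000c runs on two compute nodes, so splitting the load into 2 x 20 = 40 roughly equal compressed files keeps every reader busy.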
Lab 5, Exercise 2, Task 1
Replace all references to Azure Synapse Analytics (formerly SQL DW) with Dedicated SQL pool (formerly SQL DW).
The screenshots in the lab show a much older portal, with a heading of "SQL Data Warehouse". The live portal will be headed "Create dedicated SQL pool (formerly SQL DW)".
Lab 5, Exercise 2, Task 2
Typo: Replace steps 1 and 2 with:
1. In the Azure portal, click Resource groups, click awrgstudxx, and then click on dwhservicexx.
Module 6, Data Ingestion with Event Hubs
There are some really good questions in the FAQ, including firewall and domain whitelist configuration.
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-faq
Module 6, Data Ingestion with Event Hubs, Introducing Event Hubs, Pricing
Dedicated gives you the option of creating an Event Hubs cluster: a single-tenant deployment for customers with the most demanding streaming needs.
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-dedicated-cluster-create-portal
Lab 6, Exercise 3, Task 1
Gotcha: In the .config file, the connection string (line 6) must include "Endpoint=" at the beginning. It must not have a semicolon at the end.
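For reference, an entity-level Event Hubs connection string generally has this shape (placeholders only):
Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<event hub name>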
Lab 6, Exercise 4, Task 2
When creating the stream input:
- Input alias: phonestream
- Select Event Hub from your subscription: Selected
- Subscription: <your subscription name>
- Event Hub namespace: xx-phoneanalysis-ehn
- Event hub name: Use existing, xx-phoneanalysis-eh
- Event hub consumer group: Use existing, $Default
- Authentication mode: Connection string
- Event hub policy name: Use existing, xx-phoneanalysis-eh-sap
- Other items: (Leave as default)
Lab 7 Hint
I suggest running this lab on the desktop machine. That way, you can use all the pixels on your screen. Using the browser fullscreen (F11) helps.
Download the GitHub files for DP-200. That way you have the moviesDB.csv file.
Lab 7, Before the lab
Start the dedicated SQL pool.
1. In the portal, select the wrkspcxx Synapse workspace.
2. In the Firewalls blade, set Allow Azure services and resources to access this workspace to on and click Save.
3. In the SQL Pools blade, select the DWDB pool.
4. In the Overview blade, click Resume.
Lab 7, Exercise 2
The goal is to have a file called moviesdb.csv as the only file in the data/output folder in the Data Lake Storage account.
For the pipeline sink, open the sink dataset.
File Path: data / output / moviesdb.csv
First row as header: SELECTED
Import Schema: From sample file
If exercise 2 didn't work then use Storage Explorer or the Azure Portal to upload Labfiles\Starter\DP-200.7\Samplefiles\moviesDB.csv into the storage account.
Lab 7, Exercise 3, Task 1
Note that the Data Flow activity used to be called a Mapping Data Flow.
Lab 7, Exercise 3, Task 3
iif( locate( "|", genres ) > 1, left( genres, locate( "|", genres ) - 1 ), genres )
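This derived-column expression returns everything before the first | in genres (i.e. the primary genre), or the whole value when there is no |.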
Lab 7, Exercise 3, Task 4, step f
After setting the key columns on the Settings tab, select the Mapping tab and disable Auto mapping.
Lab 7, Exercise 3, Task 5
The PolyBase accordion is now called staging.
Lab 7, Exercise 4, Task 3 and Task 5 (there is no Task 4)
The ADF Workspace editor has changed since these instructions were written. The left-hand menu now has Author, Monitor and Manage as three separate sections.
Select the Manage item in the left hand menu to create the linked service (steps 3 to 6 in task 3). Select the Author item to create the pipeline (task 5).
Module 8, Introduction to Security, Azure Government
The government Azure regions offer significant subsets of Azure services and functionality. If something in Azure isn't compliant with government regulations then it is not offered in the government regions.
Only US federal, state, local, and tribal governments and their partners have access to this dedicated instance with operations controlled by screened US citizens.
Microsoft has another set of physically isolated Azure regions driven by regulatory requirements: Azure Germany. Unlike the US Government regions, access to the Azure Germany regions is open to private organisations.
https://docs.microsoft.com/en-us/azure/germany/
Module 8, Securing Storage Accounts and Data Lake Storage, Shared Access Signatures
Best practice is to use a service-level SAS associated with a stored access policy. Why? Because ad hoc shared access signatures are immutable: they can't be changed or revoked after creation (they can't even be listed; Azure does not store them anywhere). A SAS tied to a stored access policy can be revoked by modifying or deleting the policy.
https://docs.microsoft.com/en-us/rest/api/storageservices/define-stored-access-policy
Module 8, Securing Storage Accounts and Data Lake Storage, Shared Access Signatures
https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas
Module 8, Securing Data Stores, Dynamic Data Masking
As of Sep 2020, available in Azure SQL Database, Azure Synapse Analytics, and Azure SQL Managed Instance.
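A minimal T-SQL sketch of masking an existing column (the table, column and user names are invented for illustration):
ALTER TABLE dbo.Customer
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');
-- Users without the UNMASK permission see values like aXX@XXXX.com
GRANT UNMASK TO DataAnalystUser;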
Lab 8, Exercise 4, Task 1
Typo: Replace steps 1 and 2 with:
"1. In the Azure portal, click Resource groups, click awrgstudxx, and then click on AdventureWorksLT."