My Samples
https://tinyurl.com/cbmctsamples
Book Recommendations
Mastering Azure Analytics, by Zoiner Tejada.
Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud, by Robert Ilijason.
Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions, by Sudhir Rawat and Abhishek Narain.
Architecture
There are a number of different high-level architecture diagrams available for big data processing, with various names for the phases.
The most common version has nine phases: Data Sources, Data Storage, Real-Time Message Ingestion, Batch Processing, Stream Processing, Machine Learning, Analytical Data Store, Analytics & Reporting, and Orchestration.
https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
Some Microsoft docs simplify it into four phases: Load & Ingest, Store, Process, Serve. Annoyingly, they often name them differently. For example, course DP-201 and its exam use the terms Ingestion, Data Storage, Analysis, and Virtualization. Except where they use Ingest, Process, Store, and Analyse/Report… Course DP-200 and its exam use Ingest, Store, Prep & Train, and Model & Serve. Sheesh.
Choice of Batch Processing services
Azure Data Lake Analytics, Azure Databricks and Azure HDInsight have a lot of overlap in their use cases (the Batch Processing section of the big picture).
This course only covers Databricks (possibly because it is the newest?).
Note that Data Lake Analytics hasn't seen any updates for a couple of years (and its query language, U-SQL, doesn't support Data Lake Storage Gen2). Does it have a future?
https://www.clearpeaks.com/cloud-analytics-on-azure-databricks-vs-hdinsight-vs-data-lake-analytics/
https://stackoverflow.com/questions/50679909/azure-data-lake-vs-azure-hdinsight
https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/
We could also mention Azure Batch, though it is more an HPC service than a BI service.
https://azure.microsoft.com/en-us/services/batch/
Module 1, Surveying the Azure Data Platform, Azure Data Lake Storage
There is no in-place upgrade available. Customers must migrate data from Gen1 to Gen2.
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2
For ingesting data, there is also ADLCopy.exe, but only for Data Lake Storage Gen1.
Module 2, Module Introduction, Create a Data Lake Storage Account (Gen2)
Enabling Hierarchical namespace can only be done at creation time.
Module 2, Module Introduction, Big Data Use Cases
Azure Analysis Services is not discussed in this course. It is mentioned briefly in course DP-201.
In short, it is for building semantic model cubes. It is a PaaS equivalent of SQL Server Analysis Services (tabular models, the engine behind Power Pivot). In the big picture, it goes in the Analytical Data Store section, though sometimes it is placed in the Analytics & Reporting section.
Lab 2
Note that the various resources you create are required in later labs. Do not delete them at the end of each lab.
Refer to the class resources handout.
Module 3, Introduction to Azure Databricks, Enterprise Security
RBAC requires the Premium pricing tier.
Module 3, Performing Transformations with Azure Databricks, ETL process
Message brokers.
https://www.softkraft.co/aws-kinesis-vs-kafka-comparison/
https://dzone.com/articles/evaluating-message-brokers-kafka-vs-kinesis-vs-sqs
https://epsagon.com/development/kafka-rabbitmq-or-kinesis-solution-comparison/
File types.
IDEs
It is possible to connect IDEs like Visual Studio and Eclipse to an Azure Databricks cluster using Databricks Connect.
https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
Lab 3, Exercise 3, Task 5, Step 6.
LODS markdown error: The Scala code should be:
// Show result of reading the JSON file
df.show()
Module 4, Create an Azure Cosmos DB database built to scale, Create an Azure Cosmos DB account in the Azure portal
Note that Cosmos DB supports reserved capacity pricing.
You cannot change the data model of an account after creation. Additionally, you cannot query a particular model with another model's API.
Module 4, Create an Azure Cosmos DB database built to scale, How to Choose a Partition Key
Replace:
"The storage space for the data associated with each partition key cannot exceed 10 GB,"
With:
"The storage space for the data associated with each partition key cannot exceed 20 GB,"
https://docs.microsoft.com/en-us/azure/cosmos-db/concepts-limits
Module 4, Create an Azure Cosmos DB database built to scale, Creating a Database and a Collection in Cosmos DB
The general terms (and the ones the Azure Portal uses) are Database Account, Database, Container, Item.
The container is the unit of scalability. It is horizontally partitioned and replicated across multiple regions. Items added to a container are automatically distributed across a set of logical partitions based on the partition key.
The different models have specific terms for containers and items.
SQL: Container, Item
Cassandra: Table, Row
MongoDB: Collection, Document
Gremlin: Graph, Node or Edge
Table: Table, Item
https://docs.microsoft.com/en-us/azure/cosmos-db/databases-containers-items
Module 4, Create an Azure Cosmos DB database built to scale, Creating a Database and a Collection in Cosmos DB
Note the Serverless option for simplicity and cost. Note that Serverless accounts can only run in one region (no georeplication).
How to choose between provisioned throughput and serverless
Module 4, Insert and query data in your Azure Cosmos DB database, Insert and query data in your Azure Cosmos DB database
https://docs.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started
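For orientation, a query against the lab's Products container (see Lab 4) reads like ordinary SQL over JSON properties; the manufacturer value below is made up for illustration:
SELECT p.id, p.manufacturer, p.description
FROM Products p
WHERE p.manufacturer = "Contoso"
ORDER BY p.id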
Module 4, Insert and query data in your Azure Cosmos DB database, Running Complex Operations on your Data
All the database operations within the scope of a container's logical partition are transactionally executed within the database engine that is hosted by the replica of the partition.
https://docs.microsoft.com/en-us/azure/cosmos-db/database-transactions-optimistic-concurrency
Module 4, Insert and query data in your Azure Cosmos DB database, Distribute your Data Globally with Azure Cosmos DB
https://docs.microsoft.com/en-us/azure/cosmos-db/high-availability
Note that Cosmos DB can use Availability Zones.
https://docs.microsoft.com/en-us/azure/cosmos-db/high-availability#availability-zone-support
Module 4, Insert and query data in your Azure Cosmos DB database, Cosmos DB Data Consistency Levels
https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Note the bullet points listing the differences between same/different regions and single/multi-master accounts.
https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels-tradeoffs
Cosmos accounts configured for multi-master cannot be configured for strong consistency as it is not possible for a distributed system to provide an RPO of zero and an RTO of zero. Additionally, there are no write latency benefits for using strong consistency with multi-master as any write into any region must be replicated and committed to all configured regions within the account. This results in the same write latency as a single master account.
Definitions: Linearizability, Monotonic reads, monotonic writes, read-your-writes, write-follows-reads.
The strong and bounded staleness consistency levels consume approximately two times more RUs while performing read operations when compared to that of other relaxed consistency levels.
https://docs.microsoft.com/en-us/azure/cosmos-db/request-units
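For example, a point read of a 1 KB item costs about 1 RU at session consistency, so the same read costs roughly 2 RUs at strong or bounded staleness.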
Lab 4, Exercise 1, Task 3, Step 4.
LODS markdown error: The SQL code should be:
SELECT p.id, p.manufacturer, p.description FROM Products p WHERE p.id ="1"
Module 5, Azure SQL Database, Creating an Azure SQL Database
The Azure SQL logical server is the container for SQL Databases and Dedicated SQL Pools (formerly SQL DW).
Module 5, Azure Synapse Analytics, Massively Parallel Processing(MPP) Concepts
https://redmondmag.com/articles/2020/12/03/azure-purview-azure-synapse-analytics.aspx?m=1
Module 5, Azure Synapse Analytics, Azure Synapse Analytics Table Geometries
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/cheat-sheet
Module 5, Azure Synapse Analytics, Azure Synapse Analytics Table Geometries, Hash-distributed tables
Note the single-column requirement. Hash distribution on multiple columns is on the roadmap (as of June 2020).
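A minimal T-SQL sketch of a hash-distributed table (the table and column names are invented for illustration):
CREATE TABLE dbo.FactSale
(
    SaleKey     BIGINT NOT NULL,
    CustomerKey INT NOT NULL,
    Amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey), -- exactly one column
    CLUSTERED COLUMNSTORE INDEX
);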
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Upload text Data into Azure Blob Store
The maximum row size is 1MB.
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Obtain the Azure Storage URL and Key
You can use a storage key, a service principal, or a managed identity. TODO: SAS tokens are unsupported.
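A minimal T-SQL sketch of the storage-key route (the credential, data source and placeholder values are invented for illustration):
CREATE MASTER KEY; -- once per database, if not already present
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';
CREATE EXTERNAL DATA SOURCE LabAzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);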
Module 5, Using PolyBase to Load Data into Azure Synapse Analytics, Review Questions, Question 1
The ideal number of compressed files is the maximum number of data reader processes per compute node. In SQL Server and Parallel Data Warehouse, the maximum number of data reader processes is 8 per node except Azure SQL Data Warehouse Gen2 which is 20 readers per node.
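For example, a Gen2 pool at DW1000c runs on two compute nodes, so splitting the load into 2 x 20 = 40 roughly equal compressed files keeps every reader busy.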
Lab 5, Exercise 2, Task 1
Replace all references to Azure Synapse Analytics (formerly SQL DW) with Dedicated SQL pool (formerly SQL DW).
The screenshots in the lab show a much older portal, with a heading of "SQL Data Warehouse". The live portal will be headed "Create dedicated SQL pool (formerly SQL DW)".
Lab 5, Exercise 2, Task 2
Typo: Replace steps 1 and 2 with:
1. In the Azure portal, click Resource groups, click awrgstudxx, and then click on dwhservicexx.
Module 6, Data Ingestion with Event Hubs
There are some really good questions in the FAQ, including firewall and domain whitelist configuration.
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-faq
Module 6, Data Ingestion with Event Hubs, Introducing Event Hubs, Pricing
Dedicated gives you the option of creating an Event Hubs cluster: a single-tenant deployment for customers with the most demanding streaming needs.
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-dedicated-cluster-create-portal
Lab 6, Exercise 3, Task 1
Gotcha: In the .config file, the connection string (line 6) must include "Endpoint=" at the beginning. It must not have a semicolon at the end.
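For reference, an entity-level Event Hubs connection string generally has this shape (placeholders only):
Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<event hub name>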
Lab 6, Exercise 4, Task 2
When creating the stream input:
- Input alias: phonestream
- Select Event Hub from your subscription: Selected
- Subscription: <your subscription name>
- Event Hub namespace: xx-phoneanalysis-ehn
- Event hub name: Use existing, xx-phoneanalysis-eh
- Event hub consumer group: Use existing, $Default
- Authentication mode: Connection string
- Event hub policy name: Use existing, xx-phoneanalysis-eh-sap
- Other items: (Leave as default)
Lab 7 Hint
I suggest running this lab on the desktop machine. That way, you can use all the pixels on your screen. Using the browser fullscreen (F11) helps.
Download the GitHub files for DP-200. That way you have the moviesDB.csv file.
Lab 7, Before the lab
Start the dedicated SQL pool.
1. In the portal, select the wrkspcxx Synapse workspace.
2. In the Firewalls blade, set Allow Azure services and resources to access this workspace to on and click Save.
3. In the SQL Pools blade, select the DWDB pool.
4. In the Overview blade, click Resume.
Lab 7, Exercise 2
The goal is to have a file called moviesdb.csv as the only file in the data/output folder in the Data Lake Storage account.
For the pipeline sink, open the sink dataset.
File Path: data / output / moviesdb.csv
First row as header: SELECTED
Import Schema: From sample file
If exercise 2 didn't work then use Storage Explorer or the Azure Portal to upload Labfiles\Starter\DP-200.7\Samplefiles\moviesDB.csv into the storage account.
Lab 7, Exercise 3, Task 1
Note that the Data Flow activity used to be called a Mapping Data Flow.
Lab 7, Exercise 3, Task 3
iif( locate( "|", genres ) > 1, left( genres, locate( "|", genres ) - 1 ), genres )
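This derived-column expression returns everything before the first | in genres (i.e. the primary genre), or the whole value when there is no |.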
Lab 7, Exercise 3, Task 4, step f
After setting the key columns on the Settings tab, select the Mapping tab and disable Auto mapping.
Lab 7, Exercise 3, Task 5
The PolyBase accordion is now called staging.
Lab 7, Exercise 4, Task 3 and Task 5 (there is no Task 4)
The ADF Workspace editor has changed since these instructions were written. The left-hand menu now has Author, Monitor and Manage as three separate sections.
Select the Manage item in the left hand menu to create the linked service (steps 3 to 6 in task 3). Select the Author item to create the pipeline (task 5).
Module 8, Introduction to Security, Azure Government
The government Azure regions offer significant subsets of Azure services and functionality. If something in Azure isn't compliant with government regulations then it is not offered in the government regions.
Only US federal, state, local, and tribal governments and their partners have access to this dedicated instance with operations controlled by screened US citizens.
Microsoft has another set of physically isolated Azure regions driven by regulatory requirements: Azure Germany. Unlike the US Government regions, access to the Azure Germany regions is open to private organisations.
https://docs.microsoft.com/en-us/azure/germany/
Module 8, Securing Storage Accounts and Data Lake Storage, Shared Access Signatures
Best practice is to use a service-level SAS associated with a stored access policy. Why? Because ad hoc shared access signatures are immutable: they can't be changed or revoked after creation (they can't even be listed; Azure does not store them anywhere). A SAS tied to a stored access policy can be revoked by modifying or deleting the policy.
https://docs.microsoft.com/en-us/rest/api/storageservices/define-stored-access-policy
Module 8, Securing Storage Accounts and Data Lake Storage, Shared Access Signatures
https://docs.microsoft.com/en-us/rest/api/storagerp/storageaccounts/listaccountsas
Module 8, Securing Data Stores, Dynamic Data Masking
As of Sep 2020, available in Azure SQL Database, Azure Synapse Analytics, and Azure SQL Managed Instance.
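A minimal T-SQL sketch of masking an existing column (the table, column and user names are invented for illustration):
ALTER TABLE dbo.Customer
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');
-- Users without the UNMASK permission see values like aXX@XXXX.com
GRANT UNMASK TO DataAnalystUser;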
Lab 8, Exercise 4, Task 1
Typo: Replace steps 1 and 2 with:
"1. In the Azure portal, click Resource groups, click awrgstudxx, and then click on AdventureWorksLT."