Course DP-201T01-A — Course and Lab Notes


My Samples

https://tinyurl.com/cbmctsamples

Architecture

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
https://docs.microsoft.com/en-us/azure/architecture/example-scenario/dataplate2e/data-platform-end-to-end

Module 1, Azure Architecture Considerations

The Microsoft Azure Well-Architected Framework lists five pillars: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability (which includes resilience and availability), and Security.

https://docs.microsoft.com/en-us/azure/architecture/framework/

Module 1, Design for Security

See the CIA Triad: Confidentiality, Integrity, and Availability.

Module 1, Design for Security, Defense in Depth

There are two versions of the table in common use on the web.

First, the original: Data, App, Host, Net, Perimeter, Physical, People.

Second, more recent: Data, App, Compute, Net, Perimeter, Identity & Access, Physical, People.

Modules 3 and 4, Lambda architecture

It looks like the diagram's key was cut out at some point, probably when it was split across modules 2 and 3.

Quick summary (numbers refer to the picture).
1. All data is pushed into both the batch layer and speed layer.
2. The batch layer has a master dataset (immutable, append-only set of raw data) and pre-computes the batch views.
3. The serving layer has batch views for fast queries.
4. The speed layer compensates for the batch layer's processing latency (the delay before new data reaches the serving layer) and deals with recent data only.
5. All queries can be answered by merging results from the batch views and real-time views, or by querying either individually.
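The flow above can be sketched in a few lines of Python. The class and method names below are illustrative only (nothing here comes from an Azure SDK); the "views" are just word counts to keep the example small.

```python
from collections import defaultdict

class LambdaPipeline:
    """Toy lambda architecture: an immutable master dataset, a
    precomputed batch view, and a speed layer covering recent data."""

    def __init__(self):
        self.master = []                       # batch layer: append-only raw events
        self.batch_view = defaultdict(int)     # serving layer: precomputed view
        self.realtime_view = defaultdict(int)  # speed layer: recent data only

    def ingest(self, event):
        # 1. All data is pushed into both the batch layer and the speed layer.
        self.master.append(event)              # immutable, append-only
        self.realtime_view[event["key"]] += 1  # incremental, low latency

    def run_batch(self):
        # 2./3. Pre-compute the batch view from the full master dataset.
        self.batch_view = defaultdict(int)
        for event in self.master:
            self.batch_view[event["key"]] += 1
        # 4. Once the batch view is current, the speed layer can drop
        # the recent data it was covering.
        self.realtime_view.clear()

    def query(self, key):
        # 5. Answer queries by merging the batch and real-time views.
        return self.batch_view[key] + self.realtime_view[key]
```

A query issued between batch runs still sees fresh data, because the speed layer fills the gap until the next recomputation folds those events into the batch view.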

https://azure.microsoft.com/en-in/blog/lambda-architecture-using-azure-cosmosdb-faster-performance-low-tco-low-devops/

https://databricks.com/glossary/lambda-architecture

https://en.wikipedia.org/wiki/Lambda_architecture

Module 2, Design an Enterprise Business Intelligence Architecture, Ingestion and Data Storage

The reason the course says that direct transfer only works well for small volumes of data is that it doesn't leverage the MPP capabilities of products like Azure Synapse Analytics.

Using SSIS to feed data directly from, say, Microsoft SQL Server 2019 into an Azure Synapse dedicated SQL pool is not a bad answer, but might not be the most performant answer.

The paragraph about bcp and azcopy worries me deeply. Those are not big data technologies.

I wouldn't call high watermark "the most common" approach; just "a common" approach.
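Common or not, the high-watermark pattern is easy to sketch. The table and column names below are hypothetical, and SQLite stands in for the source system:

```python
import sqlite3

def incremental_load(conn, last_watermark):
    """Fetch only the rows modified since the stored watermark, and
    return them together with the new watermark value.

    Assumes a hypothetical source_table(id, payload, modified_at)
    where modified_at is a monotonically increasing change marker."""
    rows = conn.execute(
        "SELECT id, payload, modified_at FROM source_table "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the highest change marker seen so far;
    # if nothing changed, keep the old one.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```

Each run picks up where the last one left off, so only changed rows cross the wire; the caveat is that the pattern relies on the change column never moving backwards.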

Module 4, Security Design Considerations, Infrastructure Protection

To discuss: What is the difference between data loss and data leakage?

Module 4, Security Design Considerations, Encryption

This would be a great place to link Laurentiu Cristofor's blog "Who Needs Encryption", but Microsoft's web site team deleted it. So, let's ask the Wayback Machine. :-)

https://web.archive.org/web/20061206060557/http://blogs.msdn.com/lcris/archive/2006/11/30/who-needs-encryption.aspx

Module 4, Security Design Considerations, Encryption

"To use or read the encrypted data, it must be decrypted, which requires the use of a secret key."

This needs to be stressed: Encrypted data on its own is unusable! Any person or process that needs to use the data needs the key. This is not negotiable.

Encryption does not eliminate the need to keep data secret; it just means the data you have to keep secret (the key) is a lot smaller.
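A toy illustration of both points (XOR with a random keystream — deliberately not real cryptography, and unlike a real cipher this keystream is as long as the data; production systems use a vetted cipher with a short fixed-size key):

```python
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the key. Applying it twice with the
    same key returns the original data. Pedagogical only -- use a
    vetted cryptographic library for real systems."""
    return bytes(b ^ k for b, k in zip(data, key))

plaintext = b"quarterly revenue: 4.2M"
key = secrets.token_bytes(len(plaintext))  # the secret you now have to protect

ciphertext = xor_cipher(plaintext, key)
recovered = xor_cipher(ciphertext, key)    # no key, no recovery
```

Without the key, the ciphertext is indistinguishable from random noise; with it, decryption is trivial. The security problem hasn't disappeared, it has moved to key management.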