Design 46: AI & Machine Learning (MLOps)

Summary

This design deploys Azure Machine Learning (AML) as a private, network-isolated MLOps platform for training and serving models.

Topology: The AML Workspace is in a Spoke VNet. Compute Clusters (Training) and Inference Endpoints (AKS) run in this spoke, peered to the Hub.

1. Key Design Decisions (ADR)

ADR-01: Security

Decision: Private Link Workspace.
Rationale: Data Scientists connect via VPN/Bastion. No public access to models or data.

ADR-02: Compute

Decision: Compute Clusters.
Rationale: Clusters auto-scale with demand and drop to 0 nodes when no training job is running, so idle compute is not billed.

2. High-Level Design (HLD)

+--------------+           +--------------------------+           +--------------+
|  Data        |           |        HUB VNet          |           |  SPOKE VNet  |
|  Scientist   |           |      (Bastion/VPN)       |           |  (Training)  |
+------+-------+           +------------+-------------+           +------+-------+
       |                                |                                |
       v                                | (Peering)                      |
+------+-------+                        v                                v
|  Laptop      |           +------------+-------------+           +------+-------+
|  (VS Code)   |---------->| Private DNS Zone         |<--------->|  AML         |
+--------------+           | (privatelink.api.azureml)|           |  Workspace   |
                           +--------------------------+           +------+-------+
                                                                         |
                                                                         v
                                                                  +--------------+
                                                                  |  Compute     |
                                                                  |  Cluster     |
                                                                  |  (GPU VMs)   |
                                                                  +--------------+

3. Low-Level Design (LLD)

                               PRIMARY REGION (East US)
+-----------------------------------------------------------------------+
| HUB VNet: vnet-hub (10.0.0.0/16)                                      |
|   +-----------------------+                                           |
|   | Private DNS Zone      |                                           |
|   +-----------|-----------+                                           |
|               |                                                       |
|               v (Peering)                                             |
+---------------|-------------------------------------------------------+
                |
+---------------|-------------------------------------------------------+
| SPOKE VNet: vnet-ml-spoke (10.1.0.0/16)                               |
|   +-----------------------+       +-----------------------+           |
|   | Subnet: AML           |       | Subnet: Training      |           |
|   | [AML Workspace]       |------>| [Compute Cluster]     |           |
|   | [Private Endpoint]    |       | (10.1.2.0/24)         |           |
|   +-----------------------+       +-----------|-----------+           |
+-----------------------------------------------|-----------------------+
                                                |
                                                v
                                    +-----------------------+
                                    | Storage / Registry    |
                                    | (Model Artifacts)     |
                                    +-----------------------+

                                      |
                                      | (Geo-Replication)
                                      v

                               SECONDARY REGION (West US)
+-----------------------------------------------------------------------+
| DR SPOKE VNet                                                         |
|   +-----------------------+                                           |
|   | AML Workspace (DR)    |                                           |
|   +-----------------------+                                           |
+-----------------------------------------------------------------------+

4. Component Rationale

Container Registry (ACR): Stores the Docker images for the models.
Key Vault: Stores credentials.

5. Strategy: High Availability (HA)

Compute: Clusters auto-heal.

6. Strategy: Disaster Recovery (DR)

Implementation: Manual Re-creation.
Process:

* Code and Data are replicated (Git + GRS Storage).

* In a disaster, run Terraform to create a new Workspace in West US.

* Re-run training pipelines.
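
The failover step can be scripted so that it is repeatable under pressure. The sketch below only builds the Terraform command lines and runs them on request. The working directory, the input variable name `location`, and the function names are hypothetical and depend on how the Terraform module for this design is written.

```python
import subprocess

def dr_commands(region="westus", var_name="location"):
    """Terraform commands to stand up the workspace in the DR region.

    `var_name` is a hypothetical input variable of the Terraform module.
    """
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "apply", "-input=false", "-auto-approve",
         f"-var={var_name}={region}"],
    ]

def run_dr(workdir, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for cmd in dr_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the runbook at the first failing command
            subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `run_dr("path/to/module")` with the default `dry_run=True` prints the commands for review without touching any infrastructure. Re-running the training pipelines remains a separate step.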

7. Strategy: Backup

Models: Stored in Azure Blob Storage (GRS).

8. Strategy: Security

Identity: Managed Identity for Compute Clusters to access Data Lake.
Network: "Disable Public Access" on Workspace.

9. Well-Architected Framework Analysis

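Reliability: High. Clusters auto-heal within the region; regional recovery is a manual rebuild (Section 6).
Security: High. Private Link workspace with public access disabled; Managed Identity for data access.
Cost Optimization: High. Set "Min Nodes = 0" on clusters.
Operational Excellence: High. MLOps (DevOps for ML).
Performance Efficiency: Excellent. GPU VM sizes are available for training.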
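
The effect of "Min Nodes = 0" is easy to quantify. The sketch below compares a cluster that keeps one node running all month with one that scales to zero. It is a deliberately simplified model (one node, no startup time), and the hourly rate and usage figures are placeholders, not Azure prices.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

def monthly_cost(rate_per_hour, busy_hours, jobs, idle_seconds=120, min_nodes=0):
    """Estimate the monthly cost of a single-node workload.

    With min_nodes=0 the node is billed for the busy hours plus the idle
    window after each job; with min_nodes>=1 it is billed around the clock.
    """
    if min_nodes >= 1:
        return round(rate_per_hour * HOURS_PER_MONTH * min_nodes, 2)
    idle_hours = jobs * idle_seconds / 3600
    return round(rate_per_hour * (busy_hours + idle_hours), 2)

if __name__ == "__main__":
    rate = 1.00  # placeholder USD/hour, not a real price
    print("always on:    ", monthly_cost(rate, busy_hours=40, jobs=30, min_nodes=1))
    print("scale to zero:", monthly_cost(rate, busy_hours=40, jobs=30))
```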

10. Detailed Traffic Flow

1. Connect: Scientist connects to VPN.

2. Access: Opens Azure ML Studio (Private IP).

3. Code: Submits training job (Python).

4. Scale: AML scales up Compute Cluster.

5. Train: Cluster mounts data, trains model.

6. Register: Saves model to Registry.

7. Scale Down: Cluster shuts down.

11. Runbook: Deployment Guide (Azure Portal)

Phase 1: Create Spoke VNet

1. Search: "Virtual networks" -> + Create.

2. Resource Group: rg-ml-spoke.

3. Name: vnet-ml-spoke.

4. Region: East US.

5. Subnets:

* snet-aml: 10.1.1.0/24.

* snet-training: 10.1.2.0/24.

6. Create.

7. Peer to vnet-hub.
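
The address plan above can be sanity-checked before anything is created. The sketch below uses Python's standard `ipaddress` module to confirm that both subnets sit inside the spoke range and that the spoke does not overlap the hub, which VNet peering requires. The CIDRs are the ones from this design; the function name `check_address_plan` is illustrative only.

```python
import ipaddress

def check_address_plan(hub_cidr, spoke_cidr, subnets):
    """Return a list of problems found in the address plan (empty list = OK)."""
    hub = ipaddress.ip_network(hub_cidr)
    spoke = ipaddress.ip_network(spoke_cidr)
    problems = []

    # Peered VNets must not have overlapping address spaces.
    if hub.overlaps(spoke):
        problems.append(f"spoke {spoke} overlaps hub {hub}")

    # Every subnet must fall inside the spoke's address space.
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(spoke):
            problems.append(f"{name} ({net}) is outside spoke {spoke}")

    # Subnets inside one VNet must not overlap each other either.
    names = sorted(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems

if __name__ == "__main__":
    plan = {"snet-aml": "10.1.1.0/24", "snet-training": "10.1.2.0/24"}
    print(check_address_plan("10.0.0.0/16", "10.1.0.0/16", plan))
```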

Phase 2: Create AML Workspace

1. Search: "Machine Learning" -> + Create.

2. Resource Group: rg-ml-spoke.

3. Name: aml-corp-[uniqueid].

4. Region: East US.

5. Storage account: Create new staml[uniqueid].

6. Key vault: Create new kv-aml-[uniqueid].

7. Application insights: Create new appi-aml-[uniqueid].

8. Container registry: Create new acraml[uniqueid].

9. Networking:

* Connectivity method: Private with Internet Outbound.

* Private endpoint: + Add.

* Name: pe-aml.

* Subnet: snet-aml.

* Integrate with private DNS zone: Yes (privatelink.api.azureml.ms and privatelink.notebooks.azure.net).

10. Create.
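
Once the private endpoint and DNS zone exist, a quick check from the jumpbox or a VPN-connected laptop is to confirm that the workspace hostname resolves to a private address rather than a public one. The sketch below uses only the standard library. The hostname to pass in is whatever FQDN the private endpoint's DNS configuration lists for your workspace, and the helper names are illustrative.

```python
import ipaddress
import socket

def all_private(addresses):
    """True if the list is non-empty and every address is in a private range."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )

def resolve(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    return sorted({info[4][0] for info in infos})

# Example (run from inside the network):
#   all_private(resolve("<workspace FQDN from the private endpoint>"))
```

A `False` result from inside the network usually points at DNS rather than the endpoint itself, for example a private DNS zone that is not linked to the VNet doing the lookup.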

Phase 3: Create Compute Cluster

1. Access: You must access AML Studio from a VM in the Hub (or VPN) because Public Access is disabled.

2. Log in to the jumpbox. Open a browser -> ml.azure.com.

3. Select your workspace.

4. Go to Compute (Left Menu) -> Compute clusters -> + New.

5. Location: East US.

6. Virtual Machine Tier: Dedicated.

7. Virtual Machine Type: CPU (e.g., Standard_DS3_v2) or GPU (e.g., Standard_NC6; confirm the size is still offered in your region, since older GPU series get retired).

8. Settings:

* Compute name: cpu-cluster.

* Min nodes: 0 (Cost saving).

* Max nodes: 4.

* Idle seconds before scale down: 120.

9. Advanced Settings:

* Virtual Network: vnet-ml-spoke.

* Subnet: snet-training.

* *Critical: This puts the compute nodes inside your VNet.*

10. Create.
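
The same cluster can also be created from code instead of the Studio UI. The sketch below assumes the Azure ML SDK v1 (`azureml-core`), the same SDK used in Phase 4, and mirrors the portal values from this phase. The SDK import sits inside the function so the settings can be reviewed without the package installed. `cluster_settings` and `create_cluster` are illustrative names, and `vnet_resourcegroup_name` assumes the VNet lives in rg-ml-spoke as in Phase 1.

```python
def cluster_settings():
    """Portal values from Phase 3, expressed as SDK keyword arguments."""
    return {
        "vm_size": "STANDARD_DS3_V2",
        "vm_priority": "dedicated",
        "min_nodes": 0,                        # scale to zero when idle
        "max_nodes": 4,
        "idle_seconds_before_scaledown": 120,
        "vnet_resourcegroup_name": "rg-ml-spoke",
        "vnet_name": "vnet-ml-spoke",
        "subnet_name": "snet-training",        # nodes join the spoke VNet
    }

def create_cluster(ws, name="cpu-cluster"):
    """Create the cluster in workspace `ws` and block until provisioning ends."""
    # Imported here so cluster_settings() works without azureml-core installed.
    from azureml.core.compute import AmlCompute, ComputeTarget

    config = AmlCompute.provisioning_configuration(**cluster_settings())
    cluster = ComputeTarget.create(ws, name, config)
    cluster.wait_for_completion(show_output=True)
    return cluster
```

As with the Studio steps, this has to run from a machine that can reach the workspace over the private endpoint, for example `create_cluster(Workspace.from_config())` on the jumpbox.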

Phase 4: Run a Job (Hello World)

1. Go to Notebooks.

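2. Create two files: train.py, the training script (for a first test, a single print statement is enough), and submit_job.py, which submits train.py to the cluster. Keeping them separate prevents the job from re-submitting itself. Put the following in submit_job.py:

```python
# submit_job.py: submits train.py to the cluster (Azure ML SDK v1, azureml-core)
from azureml.core import Workspace, Experiment, ScriptRunConfig

# Loads the workspace from a config.json in the current or a parent directory
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment')

# compute_target must match the cluster name from Phase 3
config = ScriptRunConfig(source_directory='.', script='train.py', compute_target='cpu-cluster')
run = experiment.submit(config)
run.wait_for_completion(show_output=True)  # stream logs until the job finishes
```

3. Run submit_job.py.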

4. Watch the Compute Cluster scale from 0 -> 1 node.
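
The contents of the training script itself are not shown in this runbook. A minimal stand-in for a first test could look like the following. It uses only the standard library, and the file name outputs/model.json and its contents are placeholders. With the v1 SDK, files written to ./outputs are uploaded to the run record when the job finishes.

```python
import json
import os
import platform

def main(output_dir="outputs"):
    """Stand-in for a training job: report where it ran and save a dummy artifact."""
    # The node name in the log confirms the job ran on the cluster.
    print(f"Hello from {platform.node()}")

    os.makedirs(output_dir, exist_ok=True)
    model_path = os.path.join(output_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"weights": [0.0], "note": "placeholder model"}, f)
    return model_path

if __name__ == "__main__":
    main()
```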