This design deploys Azure Machine Learning (AML) in a hub-and-spoke network with private access only.
Topology: The AML Workspace sits in a Spoke VNet that is peered to the Hub. Compute Clusters (Training) and Inference Endpoints (AKS) run in the same spoke.
+--------------+           +--------------------------+           +--------------+
|     Data     |           |         HUB VNet         |           |  SPOKE VNet  |
|  Scientist   |           |      (Bastion/VPN)       |           |  (Training)  |
+------+-------+           +------------+-------------+           +------+-------+
       |                                |                                |
       v                                | (Peering)                      |
+------+-------+                        v                                v
|    Laptop    |           +------------+-------------+           +------+-------+
|  (VS Code)   |---------->|     Private DNS Zone     |<--------->|     AML      |
+--------------+           | (privatelink.api.azureml)|           |  Workspace   |
                           +--------------------------+           +------+-------+
                                                                         |
                                                                         v
                                                                  +--------------+
                                                                  |   Compute    |
                                                                  |   Cluster    |
                                                                  |  (GPU VMs)   |
                                                                  +--------------+
PRIMARY REGION (East US)
+-----------------------------------------------------------------------+
| HUB VNet: vnet-hub (10.0.0.0/16)                                      |
|   +-----------------------+                                           |
|   | Private DNS Zone      |                                           |
|   +-----------|-----------+                                           |
|               |                                                       |
|               v (Peering)                                             |
+---------------|-------------------------------------------------------+
                |
+---------------|-------------------------------------------------------+
| SPOKE VNet: vnet-ml-spoke (10.1.0.0/16)                               |
|   +-----------------------+       +-----------------------+           |
|   | Subnet: AML           |       | Subnet: Training      |           |
|   | [AML Workspace]       |------>| [Compute Cluster]     |           |
|   | [Private Endpoint]    |       | (10.1.2.0/24)         |           |
|   +-----------------------+       +-----------|-----------+           |
+-----------------------------------------------|-----------------------+
                                                |
                                                v
                                    +-----------------------+
                                    | Storage / Registry    |
                                    | (Model Artifacts)     |
                                    +-----------------------+
                                                |
                                                | (Geo-Replication)
                                                v
SECONDARY REGION (West US)
+-----------------------------------------------------------------------+
| DR SPOKE VNet                                                         |
|   +-----------------------+                                           |
|   | AML Workspace (DR)    |                                           |
|   +-----------------------+                                           |
+-----------------------------------------------------------------------+
* Code and data are replicated (Git for code, GRS Storage for data).
* In a disaster, run Terraform to create a new Workspace in West US.
* Re-run the training pipelines there to rebuild the models.
1. Connect: Scientist connects to VPN.
2. Access: Opens Azure ML Studio; workspace traffic resolves to the private endpoint's IP.
3. Code: Submits training job (Python).
4. Scale: AML scales up Compute Cluster.
5. Train: Cluster mounts data, trains model.
6. Register: Saves model to Registry.
7. Scale Down: Cluster scales back to 0 nodes once idle.
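Step 6 can be sketched with the Azure ML SDK v1 `Run` object, the same SDK family the submission snippet in this guide uses. This is a minimal sketch, not the guide's own method: the helper name `register_if_completed`, the model name `day1-model` and the path `outputs/model.json` are assumptions made for illustration.

```python
def register_if_completed(run, model_name="day1-model", model_path="outputs/model.json"):
    """Register a file from a finished run as a named model.

    `run` is expected to behave like azureml.core.Run: it must offer
    get_status() and register_model(). The default model_name and
    model_path are hypothetical values, not taken from the guide.
    """
    status = run.get_status()
    if status != "Completed":
        # Avoid registering artifacts from failed or cancelled runs.
        raise RuntimeError(f"run ended with status {status!r}; nothing registered")
    return run.register_model(model_name=model_name, model_path=model_path)
```

Guarding on the run status keeps half-written artifacts out of the registry; whether that check is wanted depends on how your pipelines handle failures.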
1. Search: "Virtual networks" -> + Create.
2. Resource Group: rg-ml-spoke.
3. Name: vnet-ml-spoke (address space 10.1.0.0/16).
4. Region: East US.
5. Subnets:
   * snet-aml: 10.1.1.0/24.
   * snet-training: 10.1.2.0/24.
6. Create.
7. Peer to vnet-hub (10.0.0.0/16). Peered VNets must not have overlapping address spaces.
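Before creating the VNet, it can help to sanity-check the address plan. The sketch below uses only the Python standard library and the CIDR ranges from this guide; the function names `validate_plan` and `usable_hosts` are illustrative. It relies on the fact that Azure reserves 5 addresses in every subnet.

```python
import ipaddress

HUB = ipaddress.ip_network("10.0.0.0/16")    # vnet-hub
SPOKE = ipaddress.ip_network("10.1.0.0/16")  # vnet-ml-spoke
SUBNETS = {
    "snet-aml": ipaddress.ip_network("10.1.1.0/24"),
    "snet-training": ipaddress.ip_network("10.1.2.0/24"),
}


def validate_plan(hub, spoke, subnets):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    if hub.overlaps(spoke):
        problems.append(f"hub {hub} overlaps spoke {spoke}; peering would fail")
    for name, net in subnets.items():
        if not net.subnet_of(spoke):
            problems.append(f"{name} {net} is outside spoke {spoke}")
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_hosts(net):
    """Addresses left for resources after the 5 that Azure reserves per subnet."""
    return net.num_addresses - 5


print(validate_plan(HUB, SPOKE, SUBNETS))
print(usable_hosts(SUBNETS["snet-training"]))
```

A /24 training subnet leaves far more room than the 4-node cluster in this guide needs, so the plan has headroom if the cluster is resized later.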
1. Search: "Machine Learning" -> + Create.
2. Resource Group: rg-ml-spoke.
3. Name: aml-corp-[uniqueid].
4. Region: East US.
5. Storage account: Create new staml[uniqueid].
6. Key vault: Create new kv-aml-[uniqueid].
7. Application insights: Create new appi-aml-[uniqueid].
8. Container registry: Create new acraml[uniqueid].
9. Networking:
   * Connectivity method: Private with Internet Outbound.
   * Private endpoint: + Add.
     * Name: pe-aml.
     * Subnet: snet-aml.
     * Integrate with private DNS zone: Yes (privatelink.api.azureml.ms; notebooks additionally use privatelink.notebooks.azure.net).
10. Create.
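Once the private endpoint exists, the workspace hostname should resolve to an address inside snet-aml (10.1.1.0/24) when queried from the VNet or over the VPN. A minimal sketch of that check follows, using only the Python standard library. The hostname is a placeholder you supply (copy the FQDN from the private endpoint's DNS configuration), and the function names are illustrative.

```python
import ipaddress
import socket

AML_SUBNET = "10.1.1.0/24"  # snet-aml from the VNet steps


def in_subnet(address, cidr=AML_SUBNET):
    """True when the IP address belongs to the given CIDR range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


def resolve_ipv4(hostname):
    """Return the sorted unique IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname):
    """True when every resolved address sits inside snet-aml."""
    addresses = resolve_ipv4(hostname)
    return bool(addresses) and all(in_subnet(a) for a in addresses)
```

Run from the jumpbox, `resolves_privately("<workspace FQDN>")` would be expected to return True. A False result usually points at the private DNS zone not being linked to the VNet the client resolves through.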
1. Access: Because Public Access is disabled, you must open AML Studio from a VM in the Hub (or over the VPN).
2. Log in to the Jumpbox. Open a browser -> ml.azure.com.
3. Select your workspace.
4. Go to Compute (Left Menu) -> Compute clusters -> + New.
5. Location: East US.
6. Virtual Machine Tier: Dedicated.
7. Virtual Machine Type: CPU (e.g., Standard_DS3_v2) or GPU (e.g., Standard_NC6), subject to regional availability and quota.
8. Settings:
   * Compute name: cpu-cluster.
   * Min nodes: 0 (Cost saving).
   * Max nodes: 4.
   * Idle seconds before scale down: 120.
9. Advanced Settings:
   * Virtual Network: vnet-ml-spoke.
   * Subnet: snet-training.
   * *Critical: This puts the compute nodes inside your VNet.*
10. Create.
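The same cluster can be defined in code instead of the Studio form. The sketch below uses the Azure ML SDK v1 (`azureml.core`, the package the submission snippet in this guide imports) and assumes the SDK is installed and a workspace `config.json` is reachable. The helper names `cluster_settings` and `create_cluster` are illustrative; the values mirror the portal steps above.

```python
def cluster_settings():
    """Portal settings expressed as keyword arguments for AmlCompute.provisioning_configuration."""
    return {
        "vm_size": "Standard_DS3_v2",
        "vm_priority": "dedicated",
        "min_nodes": 0,  # scale to zero when idle to save cost
        "max_nodes": 4,
        "idle_seconds_before_scaledown": 120,
        # Placing the nodes in the spoke subnet keeps training traffic private.
        "vnet_resourcegroup_name": "rg-ml-spoke",
        "vnet_name": "vnet-ml-spoke",
        "subnet_name": "snet-training",
    }


def create_cluster(name="cpu-cluster"):
    """Create the cluster in the workspace described by config.json and wait for it."""
    # Imported lazily so cluster_settings() can be inspected without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.compute import AmlCompute, ComputeTarget

    ws = Workspace.from_config()
    config = AmlCompute.provisioning_configuration(**cluster_settings())
    cluster = ComputeTarget.create(ws, name, config)
    cluster.wait_for_completion(show_output=True)
    return cluster
```

Keeping the settings in a plain dictionary makes them easy to review or reuse when the workspace is rebuilt in the secondary region.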
1. Go to Notebooks.
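2. Create the training script train.py, then put the submission code below in a separate file (for example submit_job.py) or a notebook cell. If it lived in train.py itself, every run would resubmit itself.
```python
# Submission code (Azure ML SDK v1); keep it in a separate file from train.py.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()  # loads workspace details from config.json
experiment = Experiment(workspace=ws, name='day1-experiment')
# Run train.py from the current directory on the cluster created earlier.
config = ScriptRunConfig(source_directory='.', script='train.py', compute_target='cpu-cluster')
run = experiment.submit(config)
run.wait_for_completion(show_output=True)  # stream logs until the run finishes
```
3. Run the submission code.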
4. Watch the Compute Cluster scale from 0 -> 1 node.
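The guide does not show the contents of train.py. For a first smoke test, a placeholder along these lines would be enough to exercise the cluster. It is a hypothetical example using only the standard library: the toy data, the `fit_line` helper and the `model.json` file name are all assumptions. It writes to `./outputs`, the folder that SDK v1 runs upload automatically.

```python
# train.py: hypothetical placeholder training script (not from the original guide).
import json
import os


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept on plain lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def main(output_dir="outputs"):
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1
    slope, intercept = fit_line(xs, ys)
    os.makedirs(output_dir, exist_ok=True)
    # Save the "model" where the run can pick it up as an artifact.
    with open(os.path.join(output_dir, "model.json"), "w") as f:
        json.dump({"slope": slope, "intercept": intercept}, f)
    return slope, intercept


if __name__ == "__main__":
    print(main())
```

Replace the toy fit with real training code once the scale-up from 0 to 1 node has been confirmed.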