Creating an Azure Machine Learning Workspace and Datastores using Bicep

This article explains how to create an Azure Machine Learning Workspace and Datastores using Bicep. Azure Machine Learning is a cloud service that accelerates and manages the machine learning project lifecycle. The core components of the Azure Machine Learning service include the Workspace, Managed Resources, Linked Services, Assets, and Dependencies.

Before creating an Azure Machine Learning Workspace, it helps to know what each of these components does. The Workspace is the top-level resource that ties the others together. Managed Resources include compute instances, which serve as development environments, and compute clusters, which run submitted training jobs. Linked Services include Datastores and Compute targets. Assets include environments, experiments, pipelines, datasets, models, and endpoints. Dependencies are the Azure resources the Workspace needs in order to run properly, such as a storage account, a key vault, and an Application Insights instance. The Azure Machine Learning Architecture is shown below:

Azure Machine Learning Workspace and Datastores

Bicep, a domain-specific language (DSL) for deploying Azure resources declaratively, simplifies the process of creating an Azure Machine Learning Workspace with multiple datastores. A Datastore maps an actual storage resource, such as a blob container, to the Workspace and gives the Workspace a consistent interface to your storage accounts. A Dataset is an asset in your Workspace that connects to the data in your storage service and makes that data available to your machine learning experiments.

When you create a dataset in Azure Machine Learning, you create a reference to the data in your storage service. Azure does not copy your data, so creating a dataset incurs no additional storage cost. In other words, a dataset is a pointer to data that lives on a storage resource. Datasets simplify access to data across your team: you register the data once and then reuse it across different experiments. You can also pass datasets directly as inputs to your scripts or pipelines, and they help you track where data has been used.

Datasets can be created in the Azure Machine Learning Studio portal from local files, a datastore, a database, or Open Datasets. Before you can use the Studio portal, a Workspace must exist.

To create an Azure Machine Learning Workspace with multiple datastores, you will need Bicep installed on your local machine, Azure PowerShell or Azure CLI, an active Azure subscription, a resource group, and a user account with the Owner or Contributor role on the subscription.
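
The Workspace also relies on the Dependencies mentioned earlier. If a storage account, key vault, and Application Insights instance do not exist yet, they could be created with a Bicep file along the lines of the sketch below. The resource names, the `suffix` variable, the API versions, and the output names are illustrative assumptions rather than part of the original walkthrough, and the Log Analytics workspace is included only because a workspace-based Application Insights resource needs one.

```bicep
// Sketch: dependent resources for an Azure Machine Learning Workspace.
// All names and API versions here are illustrative assumptions.
param location string = resourceGroup().location

// uniqueString() returns a 13-character hash, which keeps names short and unique
var suffix = uniqueString(resourceGroup().id)

resource storage 'Microsoft.Storage/storageAccounts@2022-09-01' = {
  name: 'st${suffix}'
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
}

resource keyVault 'Microsoft.KeyVault/vaults@2022-07-01' = {
  name: 'kv-${suffix}'
  location: location
  properties: {
    tenantId: subscription().tenantId
    sku: {
      family: 'A'
      name: 'standard'
    }
    // Start with no access policies; add the ones your team needs
    accessPolicies: []
  }
}

resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-${suffix}'
  location: location
  properties: {
    sku: {
      name: 'PerGB2018'
    }
  }
}

resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'appi-${suffix}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalytics.id
  }
}

// Resource IDs that a Workspace deployment can take as input
output storageAccountId string = storage.id
output keyVaultId string = keyVault.id
output applicationInsightsId string = appInsights.id
```

The three outputs are the resource IDs that a Workspace resource can reference in its `storageAccount`, `keyVault`, and `applicationInsights` properties.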

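The following Bicep file creates a new Azure Machine Learning Workspace with multiple datastores. It assumes that the storage account, key vault, and Application Insights resources the Workspace depends on already exist and are passed in by resource ID:

// Values supplied by the parameters file
param workspaceName string
param location string
// Resource IDs of the existing resources the Workspace depends on
param storageAccountId string
param keyVaultId string
param applicationInsightsId string
// Storage account that both datastores map to
param storageAccountName string
@secure()
param storageAccountKey string

resource mlWorkspace 'Microsoft.MachineLearningServices/workspaces@2023-04-01' = {
  name: workspaceName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    storageAccount: storageAccountId
    keyVault: keyVaultId
    applicationInsights: applicationInsightsId
  }
}

resource datastore1 'Microsoft.MachineLearningServices/workspaces/datastores@2023-04-01' = {
  parent: mlWorkspace
  name: 'my_datastore_1'
  properties: {
    datastoreType: 'AzureBlob'
    accountName: storageAccountName
    containerName: 'my-container'
    credentials: {
      credentialsType: 'AccountKey'
      secrets: {
        secretsType: 'AccountKey'
        key: storageAccountKey
      }
    }
  }
}

resource datastore2 'Microsoft.MachineLearningServices/workspaces/datastores@2023-04-01' = {
  parent: mlWorkspace
  name: 'my_datastore_2'
  properties: {
    datastoreType: 'AzureBlob'
    accountName: storageAccountName
    containerName: 'my-container'
    credentials: {
      credentialsType: 'AccountKey'
      secrets: {
        secretsType: 'AccountKey'
        key: storageAccountKey
      }
    }
  }
}

The following parameters file passes the parameters:

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "workspaceName": {
            "value": "my-ml-workspace"
        },
        "location": {
            "value": "eastus"
        },
        "storageAccountId": {
            "value": "<your-storage-account-resource-id>"
        },
        "keyVaultId": {
            "value": "<your-key-vault-resource-id>"
        },
        "applicationInsightsId": {
            "value": "<your-application-insights-resource-id>"
        },
        "storageAccountName": {
            "value": "<your-storage-account-name>"
        },
        "storageAccountKey": {
            "value": "<your-storage-account-key>"
        }
    }
}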
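
When more than a couple of datastores are needed, repeating a near-identical resource block for each one becomes tedious. A resource loop is one way to generate a datastore per container. The sketch below assumes the 2023-04-01 datastore schema (`datastoreType`, `accountName`, `containerName`, `credentials`) and an account key for authentication; the parameter names and the container names are hypothetical.

```bicep
// Sketch: one datastore per blob container, created with a resource loop.
param workspaceName string
param storageAccountName string

@secure()
param storageAccountKey string

// Hypothetical container names; each one becomes its own datastore
param containerNames array = [
  'raw-data'
  'curated-data'
]

// Reference a Workspace that already exists in this resource group
resource mlWorkspace 'Microsoft.MachineLearningServices/workspaces@2023-04-01' existing = {
  name: workspaceName
}

resource datastores 'Microsoft.MachineLearningServices/workspaces/datastores@2023-04-01' = [for containerName in containerNames: {
  parent: mlWorkspace
  // Datastore names use letters, digits, and underscores, so swap the hyphens
  name: replace(containerName, '-', '_')
  properties: {
    datastoreType: 'AzureBlob'
    accountName: storageAccountName
    containerName: containerName
    credentials: {
      credentialsType: 'AccountKey'
      secrets: {
        secretsType: 'AccountKey'
        key: storageAccountKey
      }
    }
  }
}]
```

Adding a datastore then only requires appending a container name to the `containerNames` array.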

To deploy the Bicep file to a resource group in your Azure subscription, run the following Azure PowerShell commands:

# Build a deployment name that includes today's date
$date = Get-Date -Format "MM-dd-yyyy"
$deploymentName = "MyMLDeployment$date"
# -Confirm previews the changes and asks for confirmation before deploying
New-AzResourceGroupDeployment -Name $deploymentName -ResourceGroupName MyResourceGroup -TemplateFile .\azuredeploy.bicep -TemplateParameterFile .\azuredeploy.parameters.json -Confirm

After you confirm the previewed changes and the deployment completes, you can open the Machine Learning Studio portal and create additional datastores or new datasets, either with Bicep or in the portal. In the portal, go to the Datastores option to see the datastores that the Bicep file just created.
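
As an illustration of creating a dataset with Bicep, the sketch below registers a folder in an existing datastore. It assumes the 2023-04-01 API, in which datasets are modeled as versioned data assets (`workspaces/data` and `workspaces/data/versions`); the asset name, datastore name, and folder path are hypothetical placeholders.

```bicep
// Sketch: register a folder in an existing datastore as a versioned data asset.
param workspaceName string

// Hypothetical datastore and folder; replace with your own values
param datastoreName string = 'my_datastore_1'
param folderPath string = 'training-data/'

resource mlWorkspace 'Microsoft.MachineLearningServices/workspaces@2023-04-01' existing = {
  name: workspaceName
}

// Container that groups all versions of the data asset
resource dataContainer 'Microsoft.MachineLearningServices/workspaces/data@2023-04-01' = {
  parent: mlWorkspace
  name: 'training_data'
  properties: {
    dataType: 'uri_folder'
  }
}

// Version 1 points at the folder; the data itself stays in the storage account
resource dataVersion 'Microsoft.MachineLearningServices/workspaces/data/versions@2023-04-01' = {
  parent: dataContainer
  name: '1'
  properties: {
    dataType: 'uri_folder'
    dataUri: 'azureml://subscriptions/${subscription().subscriptionId}/resourcegroups/${resourceGroup().name}/workspaces/${workspaceName}/datastores/${datastoreName}/paths/${folderPath}'
  }
}
```

Because the data asset stores only a URI, deploying it does not copy or move any files.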

In summary, Bicep lets you automate the creation of an Azure Machine Learning Workspace with multiple datastores, which streamlines deployment and makes it repeatable. By using Bicep for Infrastructure-as-Code in Azure, you can manage the lifecycle of your machine learning projects and the resources they need in one place. Additionally, datasets in Azure Machine Learning simplify data access across your team, avoid extra storage costs because data is referenced rather than copied, and help you track data usage.