In Azure Machine Learning, datastores are abstractions for cloud data sources. They encapsulate the information required to connect to data sources. You can access datastores directly in code by using the Azure Machine Learning SDK, and use them to upload or download data. Azure Machine Learning supports the creation of datastores for multiple kinds of Azure data source, including Azure Storage (blob and file containers), Azure Data Lake stores, Azure SQL Database, and Azure Databricks file system (DBFS).
Every workspace has two built-in datastores (an Azure Storage blob container, and an Azure Storage file container) that are used as system storage by Azure Machine Learning. There’s also a third datastore that gets added to your workspace if you make use of the open datasets provided as samples.
In most machine learning projects, you will likely need to work with data sources of your own - either because you need to store larger volumes of data than the built-in datastores support, or because you need to integrate your machine learning solution with data from existing applications.
To add a datastore to your workspace, you can register it using the graphical interface in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. For example, the following code registers an Azure Storage blob container as a datastore named blob_data:
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()
# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='123456abcde789…')
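Other kinds of data source are registered with analogous register_* methods. As a hedged sketch (the file share name, datastore name, and account key placeholder here are hypothetical), the following registers an Azure Storage file share as a datastore:
# Register an Azure Storage file share as a datastore (hypothetical names)
file_share_ds = Datastore.register_azure_file_share(workspace=ws,
                                                    datastore_name='file_data',
                                                    file_share_name='data_share',
                                                    account_name='az_store_acct',
                                                    account_key='<account-key>')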
You can view and manage datastores in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. For example, the following code lists the names of each datastore in the workspace:
for ds_name in ws.datastores:
    print(ds_name)
You can get a reference to any datastore by using the Datastore.get() method:
blob_store = Datastore.get(ws, datastore_name='blob_data')
The workspace always includes a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object:
default_store = ws.get_default_datastore()
When planning for datastores, consider the following guideline: you can access any datastore by name, but you may want to change the default datastore (which is initially the built-in workspaceblobstore datastore). To change the default datastore, use the set_default_datastore() method:
ws.set_default_datastore('blob_data')
Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring. Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create two types of dataset: tabular datasets, in which the data is read from the dataset as a table; and file datasets, in which the dataset presents a list of file paths that can be read as though from the file system.
To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class. The paths can include wildcards (for example, /files/*.csv), making it possible to encapsulate data from a large number of files in a single dataset. The dataset in this example includes data from two file paths within the default datastore (the current_data.csv file in the data/files folder and all .csv files in the data/files/archive/ folder). After creating the dataset, the code registers it in the workspace with the name csv_table:
from azureml.core import Dataset
blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')
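As a quick sanity check (a minimal sketch, not part of the original example), you can preview the first few rows of the new tabular dataset:
# Preview the first three rows of the tabular dataset
print(tab_ds.take(3).to_pandas_dataframe())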
To create a file dataset using the SDK, use the from_files method of the Dataset.File class. The dataset in this example includes all .jpg files in the data/files/images path within the default datastore. After creating the dataset, the code registers it in the workspace with the name img_files:
from azureml.core import Dataset
blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')
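If you want to inspect the files locally rather than in an experiment run, a minimal sketch (the target folder here is an assumption) is to download them:
# Download the dataset's files to a local folder (hypothetical path)
local_paths = file_ds.download(target_path='./local_images', overwrite=True)
print(local_paths)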
After registering a dataset, you can retrieve it by using either of the following techniques: the datasets dictionary attribute of a Workspace object, or the get_by_name or get_by_id method of the Dataset class.
import azureml.core
from azureml.core import Workspace, Dataset
# Load the workspace from the saved config file
ws = Workspace.from_config()
# Get a dataset from the workspace datasets collection
ds1 = ws.datasets['csv_table']
# Get a dataset by name from the datasets class
ds2 = Dataset.get_by_name(ws, 'img_files')
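As a related sketch (not part of the original example), you can also enumerate all of the datasets registered in the workspace through the same datasets collection:
# List the names of the registered datasets in the workspace
for dataset_name in ws.datasets:
    print(dataset_name)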
Datasets can be versioned, enabling you to track historical versions of datasets that were used in experiments, and reproduce those experiments with data in the same state. You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property. In this example, the .png files in the images folder have been added to the definition of the img_paths dataset example used in the previous topic.
img_paths = [(blob_ds, 'data/files/images/*.jpg'),
             (blob_ds, 'data/files/images/*.png')]
file_ds = Dataset.File.from_files(path=img_paths)
file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)
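If you want to confirm which version was just created, a minimal sketch is to print the registered dataset's name and version properties:
# Check the name and version of the newly registered dataset
print(file_ds.name, 'version', file_ds.version)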
You can retrieve a specific version of a dataset by specifying the version parameter in the get_by_name method of the Dataset class:
img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)
You can read data directly from a tabular dataset by converting it into a Pandas or Spark dataframe:
df = tab_ds.to_pandas_dataframe()
# code to work with dataframe goes here, for example:
print(df.head())
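Similarly, in an environment where a Spark session is available (for example, Azure Synapse or Azure Databricks), a sketch of the Spark equivalent is:
# Convert the tabular dataset to a Spark dataframe (requires a Spark environment)
spark_df = tab_ds.to_spark_dataframe()
spark_df.show(5)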
When you need to access a dataset in an experiment script, you must pass the dataset to the script using one of the following approaches (a sketch of submitting the resulting configuration appears after these examples):
Using a script argument for a tabular dataset. When you take this approach, the argument received by the script is the unique ID for the dataset in your workspace. In the script, you can then get the workspace from the run context and use it to retrieve the dataset by its ID. Here are examples of the ScriptRunConfig and Script, respectively:
from azureml.core import Environment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds],
                                environment=env)
import argparse
from azureml.core import Run, Dataset

# Get the dataset ID passed as a script argument
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='dataset_id')
args = parser.parse_args()

# Get the workspace from the run context and retrieve the dataset by its ID
run = Run.get_context()
ws = run.experiment.workspace
dataset = Dataset.get_by_id(ws, id=args.dataset_id)
data = dataset.to_pandas_dataframe()
Using a named input for a tabular dataset. In this approach, you use the as_named_input method of the dataset to specify a name for the dataset. Then in the script, you can retrieve the dataset by name from the run context's input_datasets collection without needing to retrieve it from the workspace. Note that if you use this approach, you still need to include a script argument for the dataset, even though you don't actually use it to retrieve the dataset. Here are examples of the ScriptRunConfig and Script, respectively:
from azureml.core import Environment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds.as_named_input('my_dataset')],
                                environment=env)
import argparse
from azureml.core import Run

# The script argument is still required, even though the dataset is retrieved by name
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_id')
args = parser.parse_args()

# Retrieve the dataset by name from the run context's input_datasets collection
run = Run.get_context()
dataset = run.input_datasets['my_dataset']
data = dataset.to_pandas_dataframe()
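Whichever approach you use, the ScriptRunConfig is submitted as an experiment run in the usual way. Here's a minimal sketch (the experiment name 'my_experiment' is an assumption):
from azureml.core import Experiment

# Submit the configured script as an experiment run (hypothetical experiment name)
experiment = Experiment(workspace=ws, name='my_experiment')
run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=True)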
You can use the to_path() method to return a list of the file paths encapsulated by the dataset:
for file_path in file_ds.to_path():
    print(file_path)
Just as with a Tabular dataset, there are two ways you can pass a file dataset to a script:
Using a script argument for a file dataset. Unlike with a tabular dataset, you must specify a mode for the file dataset argument, which can be as_download or as_mount. This provides an access point that the script can use to read the files in the dataset. In most cases, you should use as_download, which copies the files to a temporary location on the compute where the script is being run. However, if you are working with a large amount of data for which there may not be enough storage space on the experiment compute, use as_mount to stream the files directly from their source (see the as_mount sketch after the following examples). Here are examples of the ScriptRunConfig and Script, respectively:
from azureml.core import Environment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_download()],
                                environment=env)
import argparse
import glob
from azureml.core import Run

# Get the path to the downloaded files from the script argument
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_ref')
args = parser.parse_args()

run = Run.get_context()
imgs = glob.glob(args.ds_ref + "/*.jpg")
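If the data is too large to copy onto the experiment compute, a sketch of the same configuration using as_mount instead of as_download (only the arguments line changes):
# Mount the file dataset instead of downloading it
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_mount()],
                                environment=env)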
Using a named input for a file dataset. In this approach, you use the as_named_input method of the dataset to specify a name before specifying the access mode. Then in the script, you can retrieve the dataset by name from the run context's input_datasets collection and read the files from there. As with tabular datasets, if you use a named input, you still need to include a script argument for the dataset, even though you don't actually use it to retrieve the dataset. Here are examples of the ScriptRunConfig and Script, respectively:
from azureml.core import Environment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_named_input('my_ds').as_download()],
                                environment=env)
import argparse
import glob
from azureml.core import Run

# The script argument is still required, even though the dataset is retrieved by name
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_ref')
args = parser.parse_args()

# Read the files from the named input dataset's download location
run = Run.get_context()
dataset = run.input_datasets['my_ds']
imgs = glob.glob(dataset + "/*.jpg")
In this module, you learned how to create and use datastores, and how to create and use datasets in Azure Machine Learning experiments.
Source: Microsoft Learn