reading-notes

Module 3 - Build and operate machine learning solutions with Azure Machine Learning

Work with Data in Azure Machine Learning

Introduction to datastores

In Azure Machine Learning, datastores are abstractions for cloud data sources. They encapsulate the information required to connect to data sources, and you can access them directly in code by using the Azure Machine Learning SDK to upload or download data. Azure Machine Learning supports the creation of datastores for multiple kinds of Azure data source, including:

- Azure Storage (blob and file containers)
- Azure Data Lake stores
- Azure SQL Database
- Azure Databricks file system (DBFS)

Every workspace has two built-in datastores (an Azure Storage blob container, and an Azure Storage file container) that are used as system storage by Azure Machine Learning. There’s also a third datastore that gets added to your workspace if you make use of the open datasets provided as samples.

In most machine learning projects, you will likely need to work with data sources of your own - either because you need to store larger volumes of data than the built-in datastores support, or because you need to integrate your machine learning solution with data from existing applications.


Using and managing datastores

To add a datastore to your workspace, you can register it using the graphical interface in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. For example, the following code registers an Azure Storage blob container as a datastore named blob_data:

  from azureml.core import Workspace, Datastore

  ws = Workspace.from_config()

  # Register a new datastore
  blob_ds = Datastore.register_azure_blob_container(workspace=ws, 
                                                    datastore_name='blob_data', 
                                                    container_name='data_container',
                                                    account_name='az_store_acct',
                                                    account_key='123456abcde789…')
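
Similar registration methods exist for the other supported datastore types. As a minimal sketch (the share name here is illustrative), registering an Azure Storage file share looks like this:

  # Register an Azure Storage file share as a datastore
  # ('data_share' is an illustrative placeholder)
  file_store = Datastore.register_azure_file_share(workspace=ws,
                                                   datastore_name='file_data',
                                                   file_share_name='data_share',
                                                   account_name='az_store_acct',
                                                   account_key='123456abcde789…')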

You can view and manage datastores in Azure Machine Learning studio, or you can use the Azure Machine Learning SDK. For example, the following code lists the name of each datastore in the workspace:

  for ds_name in ws.datastores:
      print(ds_name)

You can get a reference to any datastore by using the Datastore.get() method: blob_store = Datastore.get(ws, datastore_name='blob_data').
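
Once you have a reference to a datastore, you can use it to transfer data. As a minimal sketch (the local folder and target path here are illustrative), uploading files to the blob datastore looks like this:

  # Upload the contents of a local folder to a target path in the datastore
  # ('./local_data' and 'data/files' are illustrative placeholders)
  blob_store.upload(src_dir='./local_data',
                    target_path='data/files',
                    overwrite=True)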

The workspace always includes a default datastore (initially, this is the built-in workspaceblobstore datastore), which you can retrieve by using the get_default_datastore() method of a Workspace object: default_store = ws.get_default_datastore().

When planning for datastores, consider the following guidelines:

- When using Azure blob storage, premium tier storage may provide improved I/O performance for large datasets, though it increases cost and may limit replication options for data redundancy.
- When working with data files, Parquet format generally results in better performance than the more common CSV format.
- You can access any datastore by name, but you may want to change the default datastore, which is initially the built-in workspaceblobstore datastore.

To change the default datastore, use the set_default_datastore() method: ws.set_default_datastore('blob_data').


Introduction to datasets

Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring. Datasets are typically based on files in a datastore, though they can also be based on URLs and other sources. You can create the following types of dataset:

- Tabular: the data is read from the dataset as a table. Use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.
- File: the dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process it at the file level (for example, to train a convolutional neural network from a set of image files).

To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class. The paths can include wildcards (for example, /files/*.csv), making it possible to encapsulate data from a large number of files in a single dataset. The dataset in this example includes data from two file paths within the default datastore (the current_data.csv file in the data/files folder and all .csv files in the data/files/archive/ folder). After creating the dataset, the code registers it in the workspace with the name csv_table:

  from azureml.core import Dataset

  blob_ds = ws.get_default_datastore()
  csv_paths = [(blob_ds, 'data/files/current_data.csv'),
               (blob_ds, 'data/files/archive/*.csv')]
  tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
  tab_ds = tab_ds.register(workspace=ws, name='csv_table')

To create a file dataset using the SDK, use the from_files method of the Dataset.File class. The dataset in this example includes all .jpg files in the data/files/images path within the default datastore. After creating the dataset, the code registers it in the workspace with the name img_files:

  from azureml.core import Dataset

  blob_ds = ws.get_default_datastore()
  file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
  file_ds = file_ds.register(workspace=ws, name='img_files')

After registering a dataset, you can retrieve it by using any of the following techniques:

- The datasets dictionary attribute of a Workspace object
- The get_by_name or get_by_id method of the Dataset class
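
For example, a quick sketch of both retrieval techniques:

  from azureml.core import Dataset

  # Retrieve the dataset from the workspace's datasets dictionary
  csv_ds = ws.datasets['csv_table']

  # Or retrieve it by name with the Dataset class
  img_ds = Dataset.get_by_name(workspace=ws, name='img_files')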

Datasets can be versioned, enabling you to track historical versions of datasets that were used in experiments and to reproduce those experiments with data in the same state. You can create a new version of a dataset by registering it with the same name as a previously registered dataset and specifying the create_new_version property. In this example, the .png files in the images folder have been added to the definition of the img_files dataset from the previous example:

  img_paths = [(blob_ds, 'data/files/images/*.jpg'),
               (blob_ds, 'data/files/images/*.png')]
  file_ds = Dataset.File.from_files(path=img_paths)
  file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)

You can retrieve a specific version of a dataset by specifying the version parameter in the get_by_name method of the Dataset class: img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2).


Working with Tabular datasets

You can read data directly from a tabular dataset by converting it into a Pandas or Spark dataframe:

  df = tab_ds.to_pandas_dataframe()
  # code to work with dataframe goes here, for example:
  print(df.head())

When you need to access a dataset in an experiment script, you must pass the dataset to the script in one of two ways:

- As a script argument. With this approach, the argument received by the script is the unique ID of the dataset in your workspace; in the script, you can then get the workspace from the run context and use it to retrieve the dataset by ID.
- As a named input. With this approach, you use the as_named_input method of the dataset to specify a name for it; in the script, you can then retrieve the dataset by that name from the run context's input_datasets collection.
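
For example, here is a minimal sketch of the named input approach (the source directory, script name, and environment object env are illustrative placeholders):

  from azureml.core import ScriptRunConfig

  # Pass the tabular dataset to the experiment script as a named input
  # ('my_dir', 'training.py', and env are illustrative placeholders)
  script_config = ScriptRunConfig(source_directory='my_dir',
                                  script='training.py',
                                  arguments=['--ds', tab_ds.as_named_input('csv_data')],
                                  environment=env)

  # In training.py, the dataset can then be retrieved from the run context:
  #   from azureml.core import Run
  #   dataset = Run.get_context().input_datasets['csv_data']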


Working with File datasets

You can use the to_path() method to return a list of the file paths encapsulated by the dataset:

  for file_path in file_ds.to_path():
      print(file_path)

Just as with a Tabular dataset, there are two ways you can pass a file dataset to a script: as a script argument or as a named input. With a file dataset, however, you must also specify a mode for the dataset argument, which can be as_download or as_mount. In most cases you should use as_download, which copies the files to a temporary location on the compute where the script is run; if you are working with a large amount of data for which there may not be enough storage space on the compute, use as_mount to stream the files directly from their source.
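
For example, a minimal sketch passing the file dataset as a named input in download mode (again, the source directory, script name, and environment object env are illustrative placeholders):

  from azureml.core import ScriptRunConfig

  # Pass the file dataset as a named input, downloading the files to the compute
  # ('my_dir', 'training.py', and env are illustrative placeholders)
  script_config = ScriptRunConfig(source_directory='my_dir',
                                  script='training.py',
                                  arguments=['--ds', file_ds.as_named_input('img_data').as_download()],
                                  environment=env)

  # In training.py, the local path to the downloaded files is available as:
  #   from azureml.core import Run
  #   data_path = Run.get_context().input_datasets['img_data']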


Summary

In this module, you learned how to:

- Create and use datastores
- Create and use datasets

Source: Microsoft Learn