Azure Databricks is a data analytics and machine learning platform based on Apache Spark. The first tasks to perform before using Azure Databricks for any kind of data exploration and machine learning are to create a Databricks workspace and a cluster.
The following Python functions were developed to enable the automated provisioning and deployment of an Azure Databricks workspace and cluster. The functions use a number of Azure third-party and standard Python libraries to accomplish these tasks. They have come in handy whenever there's a need to quickly provision a Databricks workspace environment in Azure for testing, research, and data exploration.
The first function in the Python script, read_yaml_vars_file(yaml_file), takes a YAML variables file path as an argument, reads the file, and returns the variables and values required to authenticate against the designated Azure subscription.
```python
"""Function calls that create an Azure Databricks workspace."""
import base64
import json
import logging
import time

import requests
import yaml
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import DeploymentMode

YAML_VARS_FILE = "Workspace-DB50\\databricks_workspace_vars.yaml"
TEMPLATE_PATH = "Workspace-DB50\\databricks_premium_workspaceLab.json"
RESOURCE_GROUP_PARAMS = {"location": "eastus"}
RESOURCE_GROUP_NAME = "RGDatabricks"
JSON_REQUEST_PATH = "Workspace-DB50\\deploy_databricks_cluster_restapi.json"


def read_yaml_vars_file(yaml_file):
    """Read the YAML file for credential values."""
    with open(yaml_file) as databricks_workspace_file:
        databricks_ws_config = yaml.safe_load(databricks_workspace_file)
    # Azure subscription and service principal values
    subscription_id = databricks_ws_config["databricks-ws-config"]["subscription_id"]
    clientid = databricks_ws_config["databricks-ws-config"]["clientid"]
    key = databricks_ws_config["databricks-ws-config"]["key"]
    tenantid = databricks_ws_config["databricks-ws-config"]["tenantid"]
    api_endpoint = databricks_ws_config["databricks-ws-config"]["api-endpoint"]
    return subscription_id, clientid, key, tenantid, api_endpoint
```
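For reference, the YAML variables file read by this function might look like the following sketch. All values here are hypothetical placeholders; only the key names are taken from the function above, and the endpoint URL is an illustrative example of a Databricks Clusters API URL.

```yaml
# databricks_workspace_vars.yaml -- placeholder values, not real credentials
databricks-ws-config:
  subscription_id: "00000000-0000-0000-0000-000000000000"
  clientid: "11111111-1111-1111-1111-111111111111"
  key: "<service-principal-secret>"
  tenantid: "22222222-2222-2222-2222-222222222222"
  api-endpoint: "https://eastus.azuredatabricks.net/api/2.0/clusters/create"
```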
The second function, get_auth_credentials(subscriptionid, client_id, client_key, tenant_id), instantiates a ServicePrincipalCredentials object and uses it to create a ResourceManagementClient for authenticating against the Azure subscription.
```python
def get_auth_credentials(subscriptionid, client_id, client_key, tenant_id):
    """Build a ResourceManagementClient authenticated with a service principal."""
    credentials = ServicePrincipalCredentials(
        client_id=client_id,  # Databricks service principal / application ID
        secret=client_key,    # application key
        tenant=tenant_id      # Azure tenant ID
    )
    client_obj = ResourceManagementClient(credentials, subscriptionid)
    return client_obj
```
The deploy_databricks_workspace(client, template_path, resource_group_name, resource_group_params) function does the heavy lifting: it accepts the Resource Manager client object, the path to the Azure Databricks workspace ARM template, and the resource group name and location parameters, and provisions the Databricks workspace.
```python
def deploy_databricks_workspace(client, template_path, resource_group_name,
                                resource_group_params):
    """Create the resource group and deploy the Databricks workspace template."""
    try:
        # Read the ARM template file
        with open(template_path) as template_file:
            template = json.load(template_file)

        # Define template deployment properties
        deployment_properties = {
            'mode': DeploymentMode.incremental,
            'template': template,
        }

        # Create the resource group, then deploy the workspace into it
        print('Creating Resource Group')
        client.resource_groups.create_or_update(
            resource_group_name, resource_group_params)
        deployment_async_operation = client.deployments.create_or_update(
            resource_group_name,
            'databrickswsdeployment',
            deployment_properties
        )
        print("Beginning the deployment... \n\n")

        # Wait for the template deployment to complete
        deployment_async_operation.wait()
        print("Done deploying!\n\n")
    except Exception:
        logging.debug("Exception occurred:", exc_info=True)
```
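The ARM template referenced by TEMPLATE_PATH is not reproduced in this post. As a rough sketch, a minimal template for a premium-SKU Databricks workspace could look like the following; the workspace name, API version, and managed resource group name are illustrative assumptions, not the actual contents of the template file.

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Databricks/workspaces",
      "apiVersion": "2018-04-01",
      "name": "databricksws50",
      "location": "eastus",
      "sku": { "name": "premium" },
      "properties": {
        "managedResourceGroupId": "[concat(subscription().id, '/resourceGroups/RGDatabricks-managed')]"
      }
    }
  ]
}
```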
After the Azure Databricks workspace is provisioned successfully, the remaining Python functions accept the workspace token as a console input value. This token argument is used in the cluster_post_req_args(token, json_request_path) function to build the requests library POST arguments: the headers and the message data. The message data is loaded from the REST API JSON file defined in the JSON_REQUEST_PATH constant at the top of the script. The function returns the headers and message data, which are then used to submit the create-cluster REST API request.
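The JSON request file itself is not shown in this post. A minimal sketch of a request body for the Databricks Clusters API 2.0 create call might look like the following; the cluster name, Spark version, node type, and sizing values are illustrative placeholders rather than the file's actual contents.

```json
{
  "cluster_name": "demo-cluster",
  "spark_version": "5.3.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 60
}
```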
The token_console_input() function pauses script execution for as long as it takes to manually create the initial Databricks workspace token in the Databricks UI. The token string is then entered at the console to continue the script and provision the cluster. This first token must be created through the Databricks UI; after that, token generation for specific users can be automated with the Databricks Token REST API.
```python
def token_console_input():
    """Prompt for the Databricks workspace token."""
    ws_token = input("Enter Databricks Token: ")
    return ws_token


def cluster_post_req_args(token, json_request_path):
    """Build the POST request headers and message data."""
    workspace_token = token.encode("ASCII")
    headers = {
        'Authorization': b"Basic " + base64.standard_b64encode(
            b"token:" + workspace_token)
    }
    with open(json_request_path) as json_request_file:
        data = json.load(json_request_file)
    return headers, data


def create_cluster_req(api_endpoint, headers_config, data):
    """Provision a new cluster via the Databricks REST API."""
    try:
        response = requests.post(
            api_endpoint,
            headers=headers_config,
            json=data
        )
        return response
    except Exception:
        logging.debug("Exception occurred with create_cluster_req:", exc_info=True)
```
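To illustrate what cluster_post_req_args builds, the standalone snippet below mirrors its header construction: the token is prefixed with "token:" and base64-encoded into a Basic authorization header. The token string used here is made up for demonstration, not a real Databricks token.

```python
import base64


def basic_auth_header(token):
    """Mirror the Authorization header built in cluster_post_req_args."""
    ws_token = token.encode("ASCII")
    return {
        'Authorization': b"Basic " + base64.standard_b64encode(
            b"token:" + ws_token)
    }


# A made-up example token for demonstration purposes only
headers = basic_auth_header("dapi0123456789abcdef")
print(headers['Authorization'])
```

Decoding the base64 portion of the header value recovers the original "token:&lt;token&gt;" string, which is how the requests library transmits it for HTTP Basic authentication.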
I aim to continue expanding and updating this script to serve additional use cases as they arise. The code can also be adapted to run in an Azure Automation Account Python runbook. Python support for Azure Automation is now generally available, though only for Python 2; hopefully support for Python 3 (on which this code is based) arrives in the near term. The full source code, JSON request, and YAML files are available in my GitHub repository.
Please do not use this code in a production environment. It was developed as a proof of concept example for training and research.