Programmatically Provision an Azure Databricks Workspace and Cluster Using Python Functions

Azure Databricks is a data analytics and machine learning platform based on Apache Spark. Before Azure Databricks can be used for any kind of data exploration or machine learning work, the first set of tasks is to create a Databricks workspace and cluster.

The following Python functions were developed to enable automated provisioning and deployment of an Azure Databricks workspace and cluster. The functions use a number of third-party Azure Python libraries as well as standard libraries to accomplish these tasks. They have come in handy whenever there is a need to quickly provision a Databricks workspace environment in Azure for testing, research, and data exploration.

The first function in the Python script, read_yaml_vars_file(yaml_file), takes the path of a YAML variables file as an argument, reads the file, and returns the variables and values required to authenticate against the designated Azure subscription.

"""Function calls that
create an Azure Databricks workspace."""

import json
import logging
import yaml
import base64
import requests
import time
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import DeploymentMode

YAML_VARS_FILE = "Workspace-DB50\\databricks_workspace_vars.yaml"
TEMPLATE_PATH = "Workspace-DB50\\databricks_premium_workspaceLab.json"
RESOURCE_GROUP_PARAMS = {"location": "eastus"}
RESOURCE_GROUP_NAME = "RGDatabricks"
JSON_REQUEST_PATH = "Workspace-DB50\\deploy_databricks_cluster_restapi.json"

def read_yaml_vars_file(yaml_file):
"""Read yaml
file for cred values"""

# Azure Subscription ID
with open(yaml_file) as databricks_workspace_file:
databricks_ws_config = yaml.safe_load(databricks_workspace_file)

subscription_id = databricks_ws_config["databricks-ws-config"]["subscription_id"]
clientid = databricks_ws_config["databricks-ws-config"]["clientid"]
key = databricks_ws_config["databricks-ws-config"]["key"]
tenantid = databricks_ws_config["databricks-ws-config"]["tenantid"]
api_endpoint = databricks_ws_config["databricks-ws-config"]["api-endpoint"]

# Manage Resource Group Parameters
return subscription_id, clientid, key, tenantid, api_endpoint
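The YAML variables file itself is not reproduced here, but based on the keys the function reads, its structure would look something like the following. The values are placeholders, and the api-endpoint shown is only an example of a regional Databricks cluster-create URL:

# databricks_workspace_vars.yaml (illustrative placeholder values)
databricks-ws-config:
  subscription_id: "00000000-0000-0000-0000-000000000000"
  clientid: "11111111-1111-1111-1111-111111111111"
  key: "<service-principal-secret>"
  tenantid: "22222222-2222-2222-2222-222222222222"
  api-endpoint: "https://eastus.azuredatabricks.net/api/2.0/clusters/create"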

The second function, get_auth_credentials(subscriptionid, client_id, client_key, tenant_id), instantiates a ResourceManagementClient object, built from service principal credentials, to authenticate against the Azure subscription.

def get_auth_credentials(subscriptionid, client_id, client_key, tenant_id):
    """Build service principal credentials and return a ResourceManagementClient."""

    # Databricks Service Principal / Application ID
    CLIENTID = client_id

    # Databricks Application Key
    KEY = client_key

    # Azure Tenant ID
    TENANTID = tenant_id

    credentials = ServicePrincipalCredentials(
        client_id=CLIENTID,
        secret=KEY,
        tenant=TENANTID
    )
    client_obj = ResourceManagementClient(credentials, subscriptionid)
    return client_obj

The deploy_databricks_workspace(client, template_path, resource_group_name, resource_group_params) function does the heavy lifting: it accepts the Resource Manager client object, the path to the Azure Databricks workspace ARM template, the resource group name, and the location parameters, and uses them to provision the Databricks workspace.

def deploy_databricks_workspace(client, template_path, resource_group_name, resource_group_params):
    """Create the resource group and deploy the Databricks workspace ARM template."""
    try:
        # Read the ARM template file
        with open(template_path) as template_file:
            template = json.load(template_file)

        # Define the template deployment properties
        deployment_properties = {
            'mode': DeploymentMode.incremental,
            'template': template,
        }

        # Create the resource group that will hold the Databricks workspace
        print('Creating Resource Group')
        client.resource_groups.create_or_update(
            resource_group_name, resource_group_params)

        # Deploy the template
        deployment_async_operation = client.deployments.create_or_update(
            resource_group_name,
            'databrickswsdeployment',
            deployment_properties
        )
        print("Beginning the deployment... \n\n")

        # Wait for the deployment to complete
        deployment_async_operation.wait()
        print("Done deploying!!\n\n")
    except Exception:
        logging.debug("Exception occurred:", exc_info=True)
After successful provisioning of the Azure Databricks workspace, I wrote the rest of the Python functions to accept the workspace token as a console input value. This token argument is used in the cluster_post_req_args(token, json_request_path) function to create the arguments for the requests library POST call: the headers and the message data. The message data is built from the REST API JSON file defined in the JSON_REQUEST_PATH constant at the top of the script. The function returns the headers and message data, which are then used to submit the create-cluster REST API request.
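The JSON request file is likewise available in the repository; a minimal body for the Databricks Clusters API (clusters/create) would look something like this, with the cluster name, Spark version, node type, and sizes being illustrative values only:

{
  "cluster_name": "demo-cluster",
  "spark_version": "5.3.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 60
}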

The token_console_input() function pauses the script's execution for as long as it takes to manually create the primary Databricks workspace token from the Databricks UI. The token string is then entered at the console to continue the script execution and provision the cluster. The primary token needs to be created through the Databricks UI before token creation can be automated with the Databricks Token REST API to generate tokens for specific users.

def token_console_input():
    """Prompt for the Databricks workspace token at the console."""

    ws_token = input("Enter Databricks Token: ")

    return ws_token

def cluster_post_req_args(token, json_request_path):
    """Build the POST request headers and body for the cluster create call."""

    WORKSPACETOKEN = token.encode("ASCII")
    headers = {'Authorization': b"Basic " +
               base64.standard_b64encode(b"token:" + WORKSPACETOKEN)}
    with open(json_request_path) as json_request_file:
        json_request = json.load(json_request_file)
    data = json_request
    return headers, data

def create_cluster_req(api_endpoint, headers_config, data):
    """Submit the create-cluster request to the Databricks REST API."""
    try:
        response = requests.post(
            api_endpoint,
            headers=headers_config,
            json=data
        )
        return response
    except Exception:
        logging.debug("Exception occurred with create_cluster_req:", exc_info=True)

I aim to continue expanding and updating this script to serve multiple use cases as they arise. The Python code can also be adapted to work within an Azure Automation Account Python runbook. Python support for Azure Automation is now generally available, though only for Python 2; hopefully, support for Python 3 (this code is based on Python 3) will become available in the near term. The full source code, JSON request, and YAML files are available in my GitHub repository.

Please do not use this code in a production environment. It was developed as a proof-of-concept example for training and research.


Chinny Chukwudozie, Cloud Solutions.

Passion for all things Cloud Technology.