Write Data from Azure Databricks to Azure Synapse Analytics (formerly SQL DW) using ADLS Gen 2.

In this post, I capture the steps taken to load data from Azure Databricks deployed with VNET Injection (network isolation) into an instance of Azure Synapse Data Warehouse deployed within a custom VNET and configured with a private endpoint and private DNS.

Deploying these services, including Azure Data Lake Storage Gen 2, behind private endpoints within a custom VNET creates a very secure Azure environment because access to them can be tightly restricted.

The drawback is that this security design adds extra layers of configuration: first to enable integration between Azure Databricks and Azure Synapse, and then to allow Synapse to import and export data through a staging directory in Azure Data Lake Gen 2 using PolyBase and COPY statements. PolyBase and the COPY statement are commonly used to load data into Azure Synapse Analytics from Azure Storage accounts for high-throughput data ingestion.

A quick overview of how the connection works:

Access from a Databricks PySpark application to Azure Synapse is facilitated by the Azure Synapse Spark connector. The connector uses ADLS Gen 2 and the COPY statement in Azure Synapse to transfer large volumes of data efficiently between a Databricks cluster and an Azure Synapse instance.

Both the Databricks cluster and the Azure Synapse instance access a common ADLS Gen 2 container to exchange data between these two systems. In Databricks, Apache Spark applications read data from and write data to the ADLS Gen 2 container using the Synapse connector. On the Azure Synapse side, data loading and unloading operations performed by PolyBase are triggered by the Azure Synapse connector through JDBC. In Databricks Runtime 7.0 and above, COPY is used by default to load data into Azure Synapse by the Azure Synapse connector through JDBC because it provides better performance.
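For orientation, the connector also supports reads through the same mechanism. The following is a minimal sketch of the read direction, assuming the JDBC url, dbtable and tempDir values built in Step 6 below:

# Sketch: read an Azure Synapse table into a Spark Dataframe via the Synapse connector,
# reusing the url/dbtable variables and staging container from Step 6 below.
df_synapse = (spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", url)
    .option("useAzureMSI", "true")
    .option("dbtable", dbtable)
    .option("tempDir", "abfss://tempcontainer@adls77.dfs.core.windows.net/temp")
    .load())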

Implementation.

Step 1: Configure Access from Databricks to ADLS Gen 2 for Dataframe APIs.

a. The first step in setting up access between Databricks and Azure Synapse Analytics is to configure OAuth 2.0 with a Service Principal for direct access to ADLS Gen 2. The container that serves as the permanent source location for the data to be ingested by Azure Databricks must be set with RWX ACL permissions for the Service Principal (using the SPN object id). This can be achieved using Azure PowerShell or Azure Storage Explorer.

Get the SPN object id:
Get-AzADServicePrincipal -ApplicationId dekf7221-2179-4111-9805-d5121e27uhn2 | fl Id
Id : 4037f752-9538-46e6-b550-7f2e5b9e8n83

Configure the OAuth 2.0 account credentials in the Databricks notebook session:

# Configure OAuth 2.0 creds.
spark.conf.set("fs.azure.account.auth.type.adls77.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.adls77.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.adls77.dfs.core.windows.net", dbutils.secrets.get("akvscope", key="spnid"))
spark.conf.set("fs.azure.account.oauth2.client.secret.adls77.dfs.core.windows.net", dbutils.secrets.get("akvscope", key="spnsecret"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.adls77.dfs.core.windows.net", "https://login.microsoftonline.com/{}/oauth2/token".format(dbutils.secrets.get("akvscope", key="tenantid")))

b. The same SPN also needs to be granted RWX ACLs on the temp/intermediate container to be used as a temporary staging location for loading/writing data to Azure Synapse Analytics.

Step 2: Use Azure PowerShell to register the Azure Synapse server with Azure AD and generate an identity for the server.
Set-AzSqlServer -ResourceGroupName rganalytics -ServerName dwserver00 -AssignIdentity

Step 3: Assign RBAC and ACL permissions to the Azure Synapse Analytics server’s managed identity:

a. Assign the Storage Blob Data Contributor Azure role to the Azure Synapse Analytics server’s managed identity generated in Step 2 above, on the ADLS Gen 2 storage account. This can be achieved in the Azure portal by navigating to the Access control (IAM) menu of the storage account, or it can be done using PowerShell.

b. In addition, the temp/intermediate container in the ADLS Gen 2 storage account, which acts as an intermediary to store bulk data when writing to Azure Synapse, must be set with RWX ACL permissions granted to the Azure Synapse Analytics server’s managed identity. This can also be done using PowerShell or Azure Storage Explorer.
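For teams that prefer scripting the ACLs over Storage Explorer, the following is a hedged sketch using the azure-storage-file-datalake Python SDK; the credential values, container name and managed identity object id are placeholders, and the same approach can be adapted for the Service Principal ACLs in Step 1.

# Sketch: append RWX (and default) ACL entries for the Synapse managed identity
# on the temp container root. All identifiers below are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>")
service = DataLakeServiceClient(account_url="https://adls77.dfs.core.windows.net", credential=credential)
temp_root = service.get_file_system_client("tempcontainer").get_directory_client("/")

mi_object_id = "<synapse-managed-identity-object-id>"
current_acl = temp_root.get_access_control()["acl"]
temp_root.set_access_control(acl=current_acl + ",user:{0}:rwx,default:user:{0}:rwx".format(mi_object_id))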

Step 4: Using SSMS (SQL Server Management Studio), log in to the Synapse DW to configure credentials.
a. Run the following SQL query to create a database scoped credential with Managed Service Identity that references the identity generated in Step 2:
CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';

b. A master key should be created. In my case I had already created a master key earlier. The following query creates a master key in the DW:
CREATE MASTER KEY

Azure Synapse DW does not require a password to be specified for the master key. If you do use a password, record it and store it in Azure Key Vault.
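If you do use a password, a minimal sketch of pulling it back out of the Key Vault-backed secret scope from the notebook (the secret name here is hypothetical):

# Hypothetical secret name; retrieve the master key password from the Key Vault-backed scope
master_key_pwd = dbutils.secrets.get("akvscope", key="dwmasterkeypwd")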

c. Run the next SQL query to create an external data source pointing to the ADLS Gen 2 intermediate container:
CREATE EXTERNAL DATA SOURCE ext_datasource_with_abfss WITH (TYPE = HADOOP, LOCATION = 'abfss://tempcontainer@adls77.dfs.core.windows.net/', CREDENTIAL = msi_cred);

Step 5: Read data from the ADLS Gen 2 datasource location into a Spark Dataframe.

source_path="abfss://datacontainer@adls77.dfs.core.windows.net/sampledir/sampledata.csv"

df = spark.read.format("csv").options(header=True, inferschema=True).load(source_path)

Step 6: Build the Synapse DW Server connection string and write to the Azure Synapse DW.

I have configured the Azure Synapse instance with a Managed Service Identity credential. For this scenario, I must set useAzureMSI to true in my Spark Dataframe write configuration options. Based on this config, the Synapse connector will specify IDENTITY = 'Managed Service Identity' and no SECRET for the database scoped credential.

sql_pwd = dbutils.secrets.get("akvscope", key="sqlpwd")
dbtable = "dwtable"
url="jdbc:sqlserver://dwserver00.database.windows.net:1433;database=dwsqldb;user=dwuser@dwserver00;password=" +sql_pwd + ";encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

df.write.format("com.databricks.spark.sqldw").option("useAzureMSI", "true").mode("append").option("url", url).option("dbtable", dbtable).option("tempDir", "abfss://tempcontainer@adls77.dfs.core.windows.net/temp").save()

The following screenshot shows the notebook code:

Summary.
As stated earlier, these services have been deployed within a custom VNET with private endpoints and private DNS. The abfss URI scheme is a secure scheme that encrypts all communication between the storage account and Azure Synapse DW.

The Managed Service Identity allows you to create a more secure credential that is bound to the logical server, so user details, secrets or storage keys no longer need to be shared for credentials to be created.

Storage account security is also streamlined: we grant RBAC permissions to the managed identity of the logical server, and ACL permissions on the intermediate (temp) container so that the staging data exchanged with Databricks can be read and written.


Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

Use Case.
In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based data ingestion and processing ETL pipeline. Implementing an ETL pipeline to incrementally process only new files as they land in a Data Lake in near real time (periodically, every few minutes/hours) can be complicated, since the pipeline needs to account for notifications, message queueing, proper tracking and fault tolerance, to mention a few requirements. Manually designing and building out these services to work properly can be complex.

My initial architecture implementation involved using Azure Data Factory event-based trigger or tumbling window trigger to track the raw files, based on new blob (file) creation or last modified time attribute, then initiate a copy operation of the new files to a staging container.

The staging files become the source for an Azure Databricks notebook to read into an Apache Spark Dataframe, run specified transformations and output to the defined sink. A Delete activity is then used to clean up the processed files from the staging container.

This works, but it has a few drawbacks, two of which stood out for me:
1. The changed-files copy operation could take hours depending on the number and size of the raw files, which increases the end-to-end latency of the pipeline.
2. There is no easy or clear way to log or track the processing of the copied files to help with retries after a failure.

The ideal solution, in my view, should include the following:

1. Enable the processing of incremental files at the source (Data Lake).
2. Eliminate the time consuming and resource intensive copy operation.
3. Maintain reliable fault tolerance.
4. Reduce overall end-to-end pipeline latency.

Enter Databricks Autoloader.

Proposed Solution.
To address the above drawbacks, I decided on Azure Databricks Autoloader and the Apache Spark Structured Streaming API. Autoloader is a Databricks feature that enables the incremental processing and transformation of new files as they arrive in the Data Lake. It’s an exciting new feature that automatically provisions and configures the backend Event Grid and Azure Queue services it requires to work efficiently.

Autoloader uses a new Structured Streaming source called cloudFiles. As one of its configuration options, it accepts an input source directory in the Data Lake and automatically processes new files as they land, with the option of also processing existing files in that directory. You can read more about the Autoloader feature here and here. But for the rest of this post, I will talk about my implementation of the feature.

Implement Autoloader.

Modifying my existing architecture for Autoloader was not really complicated. I had to rewrite the data ingestion Spark code to use the Spark Structured Streaming API instead of the batch Structured API. My architecture looks like this:

The entire implementation was accomplished within a single Databricks Notebook. In the first notebook cell shown below, I configured the Spark session to access the Azure Data Lake Gen 2 Storage via direct access using a Service Principal and OAuth 2.0.

# Import required modules
from pyspark.sql import functions as f
from datetime import datetime
from pyspark.sql.types import StringType
# Configure SPN direct access to ADLS Gen 2
spark.conf.set("fs.azure.account.auth.type.dstore.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.dstore.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.dstore.dfs.core.windows.net", dbutils.secrets.get("myscope", key="clientid"))
spark.conf.set("fs.azure.account.oauth2.client.secret.dstore.dfs.core.windows.net", dbutils.secrets.get("myscope", key="clientsecret"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.dstore.dfs.core.windows.net", "https://login.microsoftonline.com/{}/oauth2/token".format(dbutils.secrets.get("myscope", key="tenantid")))

Read a single source csv file into a Spark Dataframe to retrieve the current schema. Then use the schema to configure the Autoloader readStream code segment.

df = spark.read.format("csv").option("inferSchema", True).option("header", True).load("abfss://rawcontainer@dstore.dfs.core.windows.net/mockcsv/file1.csv")

dataset_schema = df.schema

As mentioned earlier, Autoloader provides a new cloudFiles source with configuration options to enable the incremental processing of new files as they arrive in the Data Lake. The following options need to be set for Autoloader to work properly:

1. “cloudFiles.clientId”: Provision a Service Principal with Contributor access to the ADLS Gen 2 storage account resource group to enable Autoloader to create an Event Grid service and Azure Queue storage service in the background. Autoloader creates these services automatically without any manual input.
2. “cloudFiles.includeExistingFiles”: This option determines whether to include existing files in the input path in the streaming processing, versus only processing new files that arrive after the notifications are set up. This option is respected only when you start a stream for the first time; changing its value on stream restart has no effect.
3. “cloudFiles.format”: This option specifies the input dataset file format.
4. “cloudFiles.useNotifications”: This option specifies whether to use file notification mode to determine when there are new files. If false, use directory listing mode.

The next code segment demonstrates reading the source files into a Spark Dataframe using the Streaming API, with the cloudFiles options:

connection_string = dbutils.secrets.get("myscope", key="storageconnstr")

df_stream_in = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", True).option("cloudFiles.format", "csv")\
            .option("cloudFiles.connectionString", connection_string)\
            .option("cloudFiles.resourceGroup", "rganalytics")\
            .option("cloudFiles.subscriptionId", dbutils.secrets.get("myscope", key="subid"))\
            .option("cloudFiles.tenantId", dbutils.secrets.get("myscope", key="tenantid"))\
            .option("cloudFiles.clientId", dbutils.secrets.get("myscope", key="clientid"))\
            .option("cloudFiles.clientSecret", dbutils.secrets.get("myscope", key="clientsecret"))\
            .option("cloudFiles.region", "eastus")\
            .schema(dataset_schema)\
            .option("cloudFiles.includeExistingFiles", True).load("abfss://rawcontainer@dstore.dfs.core.windows.net/mockcsv/")

Define a simple custom user defined function to implement a transformation on the input stream dataframes. For this demo, I used a simple function that adds a column with constant values to the dataframe. Also, define a second function that will be used to write the streaming dataframe to an Azure SQL Database output sink.

# Custom function
def time_col():
  return datetime.now().strftime("%d/%m/%Y %H:%M:%S")

# UDF Function
time_col_udf = spark.udf.register("time_col_sql_udf", time_col, StringType())

# Custom function to output streaming dataframe to Azure SQL
def output_sqldb(batch_df, batch_id):
#   Set Azure SQL DB Properties and Conn String
  sql_pwd = dbutils.secrets.get(scope = "myscope", key = "sqlpwd")
  dbtable = "staging"
  jdbc_url = "jdbc:sqlserver://sqlserver09.database.windows.net:1433;database=demodb;user=sqladmin@sqlserver09;password="+ sql_pwd +";encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
  # write the micro-batch dataframe batch_df to the Azure SQL DB staging table
  batch_df.write.format("jdbc").mode("append").option("url", jdbc_url).option("dbtable", dbtable).save()

# Add new constant column using the UDF above to the streaming Dataframe
df_transform = df_stream_in.withColumn("time_col", f.lit(time_col_udf()))

Use the writeStream Streaming API method to output the transformed Dataframe into a file output sink as JSON files and into an Azure SQL DB staging table. The foreachBatch() method sets the output of the streaming query to be processed using the provided function.

# Output Dataframe to JSON files
df_out = df_transform.writeStream\
  .trigger(once=True)\
  .format("json")\
  .outputMode("append")\
  .option("checkpointLocation", "abfss://checkpointcontainer@dstore.dfs.core.windows.net/checkpointcsv")\
  .start("abfss://rawcontainer@dstore.dfs.core.windows.net/autoloaderjson")

# Output Dataframe to Azure SQL DB
df_sqldb = df_transform.writeStream\
  .trigger(once=True)\
  .foreachBatch(output_sqldb)\
  .outputMode("append")\
  .option("checkpointLocation", "abfss://checkpointcontainer@dstore.dfs.core.windows.net/checkpointdb")\
  .start()

The following screenshots show the successful write of the csv files as JSON and into the Azure SQL DB table:

Output JSON File Sink:

Output Staging Table:

Key Points.
1. A critical point of note in this pipeline configuration for my use case is the Trigger once option, which runs the streaming query once and then stops. This means that I can schedule a job that runs the notebook as often as required, as a batch job. The cost savings for this option are huge.

I don’t have to leave a cluster running 24/7 just to perform a short amount of processing hourly or once a day. This is the best of both worlds: I code my pipeline as a streaming ETL while running it as a batch process to save cost. And there’s no data copy operation involved to add latency to the pipeline.

2. The checkpoint option in the writeStream query is also exciting. This checkpoint location preserves all of the essential information that uniquely identifies a query. Hence, each query must have a different checkpoint location, and multiple queries should never have the same location. It is a part of the fault tolerance functionality built into the pipeline.

Big data processing applications have grown complex enough that the line between real-time processing and batch processing has blurred significantly. The ability to use the Streaming API to simplify batch-like ETL pipeline workflows is awesome and a welcome development.


Build a Jar file for the Apache Spark SQL and Azure SQL Server Connector Using SBT.

The Apache Spark Azure SQL Connector is a huge upgrade to the built-in JDBC Spark connector. It is more than 15x faster than the generic JDBC connector for writing to SQL Server. In this short post, I articulate the steps required to build a JAR file from the Apache Spark connector for Azure SQL that can be installed in a Spark cluster and used to read and write Spark Dataframes to and from source and sink platforms.

1. Clone the microsoft/sql-spark-connector GitHub repository.

2. Download the SBT Scala Build Tool.

Download SBT here.

3. Open a shell console, navigate to the root folder of the cloned repository and start the sbt shell as shown in the following screen shot:

4. Build the code files into the jar package using the following command:

sbt:spark-mssql-connector> package

5. Navigate to the “target” subfolder to locate the built jar file. Using the Azure Databricks Cluster Library GUI, upload the jar file as a spark library.
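Once the jar is attached to the cluster, the following is a minimal sketch of writing a Dataframe through the connector; the server, database, table, user and secret names are placeholders:

# Sketch: write a Spark Dataframe to Azure SQL using the connector's
# com.microsoft.sqlserver.jdbc.spark format (all names below are placeholders).
sql_pwd = dbutils.secrets.get("myscope", key="sqlpwd")
df.write.format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", "jdbc:sqlserver://sqlserver09.database.windows.net:1433;database=demodb") \
    .option("dbtable", "dbo.staging") \
    .option("user", "sqladmin") \
    .option("password", sql_pwd) \
    .save()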


Configure a Databricks Cluster-scoped Init Script in Visual Studio Code.

Databricks is a distributed data analytics and processing platform designed to run in the cloud. The platform is built on Apache Spark, which at the time of this writing is at version 2.4.4.

In this post, I will demonstrate the deployment and installation of custom R-based machine learning packages into Azure Databricks Clusters using Cluster Init Scripts. So, what is a Cluster Init Script?

Cluster Initialization Scripts:

These are shell scripts that are run and processed during startup for each Databricks cluster node before the Spark driver or worker JVM starts. Some of the tasks performed by init scripts include:

1) Automate the installation of custom packages and libraries not included in the Databricks runtime. This can also include the installation of custom-built packages that are not available in public package repositories.
2) Automate the setting of system properties and environment variables used by the JVM.
3) Modify Spark configuration parameters.
4) Modify the JVM system classpath in special cases.

Note: The Databricks CLI can also be used to install libraries/packages so long as the package file format is supported by Databricks. To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip (this path is subject to change) to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install.

Cluster Init Script Types:

There are two types of cluster scripts currently supported by Databricks:
1) Cluster-scoped scripts
2) Global Init scripts.

For this blog post, I will be focusing on Cluster-scoped init scripts. I will note that Global init scripts are developed to run on every cluster in a workspace.

Databricks clusters are provisioned with a specific set of libraries and packages included. For custom R packages to be installed and available on the cluster nodes, they have to be compressed using the tar.gzip file format. At the time of this writing, this file format is not supported for installing packages on a Databricks cluster, as confirmed using the Databricks CLI libraries install subcommand:

Based on the above screen shot, we can see that for now, the supported package formats using the Databricks CLI are: jar, egg, whl, maven-coordinates, pypi-package and cran-package.

Creating and Deploying an example Cluster Init Script:

This sample cluster-init script was developed on a Windows 10 machine using the Visual Studio Code editor. Before the custom package can be installed on the cluster nodes, it has to be deployed to the Databricks File System (DBFS) in the Databricks workspace, together with the cluster-init script file. To deploy the script to DBFS for execution, I’ll use the Databricks CLI tool (which is a REST API wrapper). The following shell script first installs the public forecast package from the CRAN repository using the install.packages() R function, then implements a for loop to install the custom packages defined in the MY_R_PACKAGES list. The install.packages() R function is run from the cluster nodes’ Ubuntu R interactive shell:

#!/bin/bash

sudo R --vanilla -e 'install.packages("forecast", repos="http://cran.us.r-project.org")'
MY_R_PACKAGES="ensembleR.tar.gzip"
for i in $MY_R_PACKAGES
do
  sudo R CMD INSTALL --no-lock /dbfs/custom_libraries/$i
done

The following Databricks CLI commands create the DBFS directory locations for the cluster-init script and the custom packages, then copy the files:

databricks.exe --profile labprofile fs mkdirs dbfs:/custom_libraries
databricks.exe fs mkdirs dbfs:/databricks/packages_libraries
databricks.exe --profile labprofile fs cp .\my_cluster_init.sh dbfs:/custom_libraries
databricks.exe --profile labprofile fs cp .\ensembleR.tar.gzip dbfs:/databricks/packages_libraries/

Accessing and Viewing Cluster Init Script Logs:

In the event of an exception in the Cluster-init script execution, the Cluster creation operation fails and the Databricks workspace UI event log does not reveal much as to the cause of the failure.

The cluster init script logs located at dbfs:/databricks/init_scripts can be downloaded via the Databricks CLI as part of the debugging process.
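From a notebook attached to a running cluster, the same logs can also be browsed with dbutils; a minimal sketch, assuming the default log path (the folder and file names below are placeholders):

# List the init script log folders, then print the contents of a chosen log file
display(dbutils.fs.ls("dbfs:/databricks/init_scripts/"))
print(dbutils.fs.head("dbfs:/databricks/init_scripts/<cluster-id>_<node-ip>/<timestamp>_my_cluster_init.sh.stderr.log"))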

Define the Cluster-scoped Init script in a Cluster creation API request:

{
    "cluster_name": "Demo-Cluster",
    "spark_version": "5.0.x-ml-scala2.11",
    "node_type_id": "Standard_D3_v2",
    "cluster_log_conf": {
        "dbfs" : {
          "destination": "dbfs:/cluster-logs"
        }
      },
      "init_scripts": [ {
        "dbfs": {
          "destination": "dbfs:/custom_libraries/my_cluster_init.sh"
        }
      } ],
    "spark_conf":{
          "spark.databricks.cluster.profile":"serverless",
          "spark.databricks.repl.allowedLanguages":"sql,python,r"
       },
     "custom_tags":{
          "ResourceClass":"Serverless"
       },
         "autoscale":{
          "min_workers":2,
          "max_workers":3
       }
  }
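A minimal sketch of submitting this JSON to the Clusters Create REST API with Python requests; the workspace URL, token and local file name are placeholders:

# Sketch: POST the cluster spec above to the Databricks Clusters Create API
import json
import requests

workspace_url = "https://eastus.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"                      # placeholder token
with open("databricks_cluster_with_init.json") as f:   # the JSON shown above, saved locally
    cluster_spec = json.load(f)

resp = requests.post(workspace_url + "/api/2.0/clusters/create",
                     headers={"Authorization": "Bearer " + token},
                     json=cluster_spec)
print(resp.status_code, resp.text)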

Databricks Workspace GUI Reference of the Cluster Init Script Location:

Confirm Installed Custom Packages:

To confirm successful installation of the R package, create a new Databricks notebook, attach the notebook to the newly created cluster and run the following R script:

%r
ip <- as.data.frame(installed.packages()[,c(1,3:4)])
rownames(ip) <- NULL
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
print(ip, row.names=FALSE)

Based on the screen shot above, the package was successfully deployed via the cluster init script. In a future post, I will explore techniques for debugging a failed Cluster Init script deployment and steps for resolving any exceptions during the cluster creation process.


Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.

In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant part of the tasks are targeted towards data quality. Data quality in this context mostly refers to having data that is free of errors, inconsistencies, redundancies, poor formatting and other problems that might prevent the data from being used readily for either building distributed models or data visualization. Some of these tasks involve:

a) ensuring that data exploration and clean up are performed so that table column/property data types are correctly defined,
b) verifying schema length,
c) removal or replacement of “null”/missing values in table records/rows where necessary,
d) removing trailing and leading spaces in dataset values as they are migrated to Apache Spark on Azure Databricks,
e) verifying record counts and performing a comparison of source/target output datasets,
f) handling data duplication.

In this write-up, I demonstrate the use of the Apache Spark Python API (PySpark) for developing code that accomplishes some of these tasks.

I also walk through the steps to show how Databricks-Connect, a Spark client library, enables developing Apache Spark application code locally in Jupyter Notebooks and executing it remotely, server-side, on an Azure Databricks cluster. I’ll show a quick demonstration of installing and configuring Databricks Connect on a Windows 10 machine. Please note that all code samples and tokens used in this post are for testing purposes only and are not to be used directly in production.

In other sections of this post, I’ll be showing code samples for:

1) Configuring Databricks-Connect to enable my local Apache Spark setup to interact with a remote Azure Databricks Cluster.
2) Creating a CSV file dataset on a remote Azure Databricks Workspace using the DBUtils PySpark utility on my local machine.
3) Ingest the csv dataset and create a Spark Dataframe from the dataset.
4) Create a Database by persisting the Dataframe to an Azure Databricks Delta table on the remote Azure Databricks workspace.
5) Verify the table schema, data types and schema count.
6) Use PySpark functions to display quotes around string characters to better identify whitespaces.
7) Explore PySpark functions that enable the changing or casting of a dataset schema data type in an existing Dataframe to a different data type.
8) Use PySpark to handle missing or null data and handle trailing spaces for string values.
9) Run a comparison between two supposedly identical datasets.

Toolset:

1) I’m currently running Python version 3.7.2 in a virtual environment on a Windows 10 machine. The Python virtual environment was set up using the Python venv module.
2) Jupyter notebook version 4.4.0. This can be installed using the pip install jupyter command. I have configured the Jupyter notebook Kernel to run against my local Python virtual environment.
3) Databricks-Connect 5.3 PyPI Spark client library. This is an awesome tool that allows me to write my Spark application locally while executing it on a remote Azure Databricks cluster.

Install and configure Databricks Connect in a Python virtual environment on a Windows 10 machine.

Open a Powershell console, navigate to the Python virtual environment path and activate:

PS C:\virtualenv\dbconnect> .\Scripts\activate

Install Databricks Connect:

(dbconnect) PS C:\virtualenv\dbconnect> pip install databricks-connect

Configure Databricks Connect with remote Azure Databricks Cluster and Workspace parameters. The parameters displayed in the screen shot were provisioned in a lab workspace and have since been deprovisioned:

Create a SparkSession in my Jupyter Notebook and import the required PySpark dbutils library:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

Use the DBUtils PySpark utility to create a folder and csv file sample dataset in the remote Databricks Workspace:

dbutils.fs.rm("/demo_dir", True)
dbutils.fs.mkdirs("/demo_dir")

dbutils.fs.put("/demo_dir/type-conversion-example.csv", """
ItemID,Name,Desc,Owner,Price,Condition,DateRegistered,LocPrefCode,RentScoreCode,Usable,RentDate
1,Lawnmower,Small Hover mower,,150.785,Excellent,01/05/2012,0 ,0,2,
2,Lawnmower,Ride-on mower,Mike,370.388,Fair,04/01/2012,0,0 ,1,01/05/2018
3,Bike,BMX bike,Joe,200.286,Good,03/22/2013,0,0 ,1,03/22/2017
4,Drill,Heavy duty hammer,Rob,100.226,Good,10/28/2013,0,0,2,02/22/2018
5,Scarifier,"Quality, stainless steel",,200.282,,09/14/2013,0,0 ,2,01/05/2018
6,Sprinkler,Cheap but effective,Fred,180.167,,01/06/2014,,0,1,03/18/2019
7,Sprinkler,Cheap but effective,Fred,180.167,,01/06/2014,,0,1,03/18/2019

""", True)

Create PySpark Dataframe from the ingested CSV source file:

#Create Dataframe from the ingested CSV source file
df = spark.read.format("csv").options(header=True, inferschema=False).load("/demo_dir/type-conversion-example.csv")

Create a database and write the tools dataframe to a “toolsettable” table in the remote Azure Databricks hive metastore:

Here we use a combo of Spark SQL and the PySpark saveAsTable function to create a database and Databricks Delta table.

spark.sql("drop database if exists demodb cascade")
spark.sql("create database if not exists demodb")
spark.sql("use demodb")
df.write.format("delta").mode("overwrite").saveAsTable("demodb.toolsettable")

#Confirm Dataframe schema
df.printSchema()

#Confirm Dataframe schema length
len(df.schema)

The following screenshot displays the table in the remote Azure Databricks hive store:

Show the first three rows of the tools table:
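For reference, one way to do this from the notebook, either through the Dataframe or with Spark SQL:

# Display the first three rows of the ingested dataset and of the Delta table created above
df.show(3, truncate=False)
spark.sql("SELECT * FROM demodb.toolsettable LIMIT 3").show(truncate=False)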

Use the PySpark select and take functions to display specific columns and reveal the trailing and leading whitespaces in the values of selected columns:

df.take(3)

Using the take() function alone leaves the displayed rows cluttered and hard to read. I used the select() function to filter out unnecessary fields and make the displayed rows more readable as shown below.

df.select(df.ItemID, df.Price, df.DateRegistered, df.LocPrefCode, df.RentScoreCode, df.Usable).take(6)

Use the PySpark Cast function to change the data type of selected columns:

The Price column was originally ingested as a String data type, but in this section we use the Cast() function to change the data type to a decimal.
The DateRegistered column is changed from a String type to a date type using the to_date() PySpark function. The Usable column is changed to a Boolean data type, and the RentDate column is reformatted from a String to Unix time using the unix_timestamp and from_unixtime functions.

"""
Cast Price(double) to Decimal, cast DateRegistered field of a given pattern/format of type string to date using the PySpark to_date function,
cast RentDate(String type) to TimestampType
and cast the Usable field of type string to boolean.

"""
from pyspark.sql.types import DecimalType, StringType, TimestampType, DateType, BooleanType
from pyspark.sql.functions import unix_timestamp, from_unixtime, to_date #(Converts a Column of pyspark.sql.types.StringType into pyspark.sql.types.DateType using the optionally specified format.)
spark.conf.set("spark.sql.session.timeZone", "America/Chicago")
#https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_date
df_data_prep = df.select(df.ItemID, df.Name, df.Owner, df.Price.cast(DecimalType(6,3)), df.Condition, to_date("DateRegistered", "MM/dd/yyyy").alias("DateRegistered"), df.LocPrefCode, df.RentScoreCode, df.Usable.cast(BooleanType()), from_unixtime((unix_timestamp("RentDate", "MM/dd/yyyy")),"MM/dd/yyyy").alias("RentDate_UnixTime"))
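A quick check that the casts took effect on the prepared Dataframe:

# Confirm the new column data types and inspect a few rows
df_data_prep.printSchema()
df_data_prep.show(5, truncate=False)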

Use the withColumn function to clean up trailing spaces and update values in the LocPrefCode and RentScoreCode columns:

Also, use na.fill() to replace null/missing values with “777”.

#Use the withColumn function to clean up trailing spaces and update values in the LocPrefCode and RentScoreCode columns
from pyspark.sql.functions import when
df_data_prep = df_data_prep.withColumn("RentScoreCode", when(df_data_prep.RentScoreCode == "0 ",'0').otherwise(df_data_prep.RentScoreCode))
df_data_prep = df_data_prep.withColumn("LocPrefCode", when(df_data_prep.LocPrefCode == "0 ",'0').otherwise(df_data_prep.LocPrefCode))
df_data_prep = df_data_prep.na.fill('777')

Use the datacompy library to compare both datasets. This library works with Spark Dataframes:

#Use the datacompy library to compare both datasets. This library works with Spark Dataframes and runs a comparison on all relevant fields.
# All other fields can be dropped from the dataframes before comparing.
import datacompy
compare_object = datacompy.SparkCompare(spark, df_data_prep, df, join_columns=[('ItemID', 'ItemID')],rel_tol=0.001)
compare_object.report()


Setting Up Jupyter Notebook to Run in a Python Virtual Environment.

1) Install Jupyter on the local machine outside of any existing Python Virtual environment:
pip install jupyter --no-cache-dir
2) Create a Python Virtual environment.
mkdir virtualenv
cd virtualenv
python.exe -m venv dbconnect
3) Change directory into the virtual environment and activate
.\scripts\activate
4) Install ipykernel package in the virtual environment
pip install ipykernel
5) Use the following example to install a Jupyter kernel into the virtual environment
ipython.exe kernel install --user --name=dbconnect
This command returns the location of the Jupyter kernel.json file.
Open the file using your editor to confirm the config:
{
  "argv": [
    "c:\\virtualenv\\dbconnect\\scripts\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "dbconnect",
  "language": "python"
}
6) Change directory outside the virtual environment and use the following command to verify that the kernel config
completed successfully in the Jupyter folder:
PS C:\> jupyter-kernelspec.exe list
Available kernels:
dbconnect C:\Users\chinny\AppData\Roaming\jupyter\kernels\dbconnect
python3 c:\python3\share\jupyter\kernels\python3
7) Verify that Java is installed.


Automate Azure Databricks Job Execution using Custom Python Functions.

Introduction

Thanks to a recent Azure Databricks project, I’ve gained insight into some of the configuration components, issues and key elements of the platform. Let’s take a look at this project to see what’s involved in successfully developing, testing, and deploying artifacts and executing models.

One note: This post is not meant to be an exhaustive look into all the issues and required Databricks elements. Instead, let’s focus on a custom Python script I developed to automate model/Job execution using the Databricks Jobs REST APIs. The script will be deployed to extend the functionality of the current CICD pipeline.

Background of the Databricks Project.

I’ve been involved in an Azure Databricks project for a few months now. The objective was to investigate the Databricks platform and determine best practice configuration and recommendations that will enable the client to successfully migrate and execute their machine learning models.

We started with a proof of concept phase to demonstrate some of the features and capabilities of the Databricks platform, then moved on to a Pilot phase.

The objective of the Pilot phase was to migrate the execution of machine learning models and their datasets from the Azure HDInsight and Apache Hive platform to Azure Databricks and Spark SQL. A few cons with the HDInsight platform had become obvious, some of which included:

a) Huge administrative overhead and time to configure and provision HDInsight clusters which can run into hours.
b) Limitations in managing the state of HDInsight Clusters which can’t really be shut down without deleting the cluster. The autoscaling and auto-termination features in Azure Databricks play a big role in cost control and overall Cluster management.
c) Model execution times are considerably longer on Azure HDInsight than on Azure Databricks.

Key Databricks Elements, Tools and Issues.

Some of the issues faced are related to:

Data Quality: We had to develop Spark code (using the PySpark API) to implement data clean-up, handle trailing and leading spaces and replace null string values. These were mostly due to the data type differences between HDInsight Hive and Spark on Databricks, e.g. char types on Hive and String types on Spark.

End-to-end CICD Tooling: We configured Azure Pipeline and Repos to cover specific areas of the CICD pipeline to the Databricks platform, but this was not sufficient for automating Model execution on Databricks without writing custom python code to implement this functionality.

Development: Implementing RStudio Server deployment on a Databricks Cluster to help with the development and debugging of models.

Azure Data Factory: We explored version 2, but at the time of initial testing, version control integration was not supported. At the time of this writing though, it is supported. While the configuration works and I confirmed connectivity to my Github repository, it appears the current integration allows for an ADF Pipeline template to be pushed to the defined repository root folder.

I have not been able to import existing notebook code from my repo, to be used as a Notebook activity. Maybe this use case is currently not supported/required or I’m yet to figure it out. Will continue to review.

Azure Databricks VNET Peering: Connecting the Azure Databricks workspace with existing production VNET was not possible at the start of this engagement. VNET peering is now currently supported.

Other Databricks elements and tools include:

Databricks Hive Metastore: Databricks’ central hive metastore that allows for the persistence of table data and metadata.

Databricks File System (DBFS): The DBFS is a distributed file system that is a layer over Azure Blob Storage. Files in DBFS persist to Azure Storage Account or AWS S3 bucket, so there’s no data loss even after a Cluster termination.

Cluster Init Scripts: These are critical for the successful installation of the custom R package dependencies on Databricks Clusters, keeping in mind that R does not appear to have as much feature and functionality support on Databricks as Python and Scala. At least at this time.

Databricks CLI: This is a Python-based command-line tool built on top of the Databricks REST API.

Databricks Notebooks: These enable collaboration, in-line multi-language support via magic commands, and data exploration during testing, which in turn reduces code rewrites. I’m able to write PySpark and Spark SQL code and test it out before formally integrating it in Spark jobs.

Databricks-Connect: This is a Python-based Spark client library that lets us connect our IDE (Visual Studio Code, IntelliJ, Eclipse, PyCharm, etc.) to Databricks clusters and run Spark code. With this tool, I can write jobs using Spark native APIs like dbutils and have them execute remotely on a Databricks cluster instead of in the local Spark session.

DevOps: The steps involved in integrating Databricks Notebooks and RStudio on Databricks with version control are straightforward. A number of version control solutions are currently supported: Bitbucket, GitHub and Azure DevOps. This brings me to the focus of this post. At the time of this writing, there is currently no CICD solution (that we know of) capable of implementing an end-to-end CICD pipeline for Databricks. We used the Azure DevOps Pipeline and Repos services to cover specific phases of the CICD pipeline, but I had to develop a custom Python script to deploy existing artifacts to the Databricks File System (DBFS) and automatically execute a job on a Databricks jobs cluster on a predefined schedule or run on submit.

MLFlow on Databricks: This new tool is described as an open source platform for managing the end-to-end machine learning lifecycle. It has three primary components: Tracking, Models, and Projects. It’s still in beta and I haven’t reviewed it in detail. I plan to do so in the coming weeks.

Solution.

Databricks Jobs are the mechanism for submitting Spark application code for execution on a Databricks cluster. In this custom script, I use standard and third-party Python libraries to create HTTPS request headers and message data, configure the Databricks token on the build server, check for the existence of specific DBFS-based folders/files and Databricks workspace directories and notebooks (deleting them if necessary while creating required folders), copy existing artifacts and cluster init scripts for custom package installation on the Databricks cluster, and finally create, submit and run a Databricks Job using the REST APIs. In the next section, I look at specific parts of the Python script.

Configure header and message data:

import sys
import base64
import subprocess
import json
import requests
import argparse
import logging
import yaml
from pprint import pprint

JOB_SRC_PATH = "c:/bitbucket/databricks_repo/Jobs/job_demo.ipynb"
JOB_DEST_PATH = "/notebooks/jobs_demo"
INIT_SCRIPT_SRC_PATH = "c:/bitbucket/databricks_repo/Workspace-DB50/cluster-init1a.sh"
INIT_SCRIPT_DEST_PATH = "dbfs:/databricks/rstudio/"
RSTUDIO_FS_PATH = "dbfs:/databricks/rstudio"
WORKSPACE_PATH = "/notebooks"
JOB_JSON_PATH = "c:/BitBucket/databricks_repo/Jobs/databricks_new_cluster_job.json"

with open("c:\\bitbucket\\databricks_repo\\jobs\\databricks_vars_file.yaml") as config_yml_file:
    databricks_vars = yaml.safe_load(config_yml_file)

JOBSENDPOINT = databricks_vars["databricks-config"]["host"]
DATABRICKSTOKEN = databricks_vars["databricks-config"]["token"]

def create_api_post_args(token, job_jsonpath):
    WORKSPACETOKEN = token.encode("ASCII")
    headers = {'Authorization': b"Basic " +
           base64.standard_b64encode(b"token:" + WORKSPACETOKEN)}
    with open(job_jsonpath) as json_read:
        json_data = json.load(json_read)
    data = json_data
    return headers, data

Configure Databricks Token on Build server:

#Configure token function
def config_databricks_token(token):
    """Configure
        databricks token"""
    try:
        databricks_config_path = "c:/Users/juser/.databrickscfg"
        with open(databricks_config_path, "w") as f:
            clear_file = f.truncate()
            databricks_config = ["[DEFAULT]\n","host = https://eastus.azuredatabricks.net\n","token = {}\n".format(token),"\n"]
            f.writelines(databricks_config)
    except Exception as err:
        logging.debug("Exception occured:", exc_info = True)

Create the job using the Jobs REST API, then run it using the Python subprocess module, which calls the databricks-cli external tool:

def create_job(job_endpoint, header_config, data):
    """Create
        Azure Databricks Spark Notebook Task Job"""
    try:
        response = requests.post(
            job_endpoint,
            headers=header_config,
            json=data
        )
        return response
    except Exception as err:
        logging.debug("Exception occured with create_job:", exc_info = True)
        

def run_job(job_id):
    """Use the passed job id to run a job.
      To be used with the create jobs api only, not the run-submit jobs api"""
    try:
        subprocess.call("databricks jobs run-now --job-id {}".format(job_id))
    except subprocess.CalledProcessError as err:
        logging.debug("Exception occured with create_job:", exc_info = True)

The actual Job task is executed as a notebook task, as defined in the Job JSON file. The notebook task, which contains sample PySpark ETL code, was used in order to demonstrate the preferred method for running an R-based model at this time. The following Job tasks are currently supported in Databricks: notebook_task, spark_jar_task, spark_python_task, spark_submit_task. None of the defined tasks can be used to run R-based code/models except the notebook_task. The spark_submit_task accepts --jars and --py-files to add Java and Python libraries, but not R packages, which are packaged as “tar.gzip” files.

In the custom functions, I used the subprocess python module in combination with the databricks-cli tool to copy the artifacts to the remote Databricks workspace. The Create Jobs API was used instead of the Runs-Submit API because the former makes the Spark UI available after job completion, to view and investigate the job stages in the event of a failure. In addition, email notification is enabled using the Create Jobs REST API, which is not available with the Runs-Submit Jobs API. A screen shot of the Spark UI is attached:

Job Status:

Spark Job View:

Spark Job Stages:

A snippet of the JSON request code for the job showing the notebook_task section is as follows:


{...
"timeout_seconds": 3600,
"max_retries": 1,
"notebook_task": {
"_comment": "path param||notebook_path||notebooks/python_notebook",
"notebook_path": "/notebooks/jobs_demo",
"base_parameters": {
"argument1": "value 1",
"argument2": "value 2"
}
}
}

As stated at the start of this post, I developed this Python script to extend the existing features of a CICD pipeline for Azure Databricks. I’m continuously updating it to take advantage of new features being introduced into the Azure Databricks ecosystem, and possibly to integrate it with tools still in development that could potentially improve the end-to-end Azure Databricks/machine learning workflow.
The complete code can be found here.


Provisioning a Jenkins Instance Container with Persistent Volume in Azure Kubernetes Service.

In this post, I want to write about my experience testing and using Azure Kubernetes service to deploy a Jenkins Instance solution that is highly available and resilient. With the Kubernetes persistent volume feature, an Azure disk can be dynamically provisioned and attached to a Jenkins Instance container deployment. In another scenario, an existing Azure Disk containing data related to a software team’s Jenkins projects could be attached to a new Jenkins deployment in a kubernetes cluster, to maintain continuity. Also, data loss in the event of a Pod failure can be averted since the persistent volume storage has a lifecycle independent of any individual pod that uses the persistent volume and can be managed as a separate cluster resource.

Jenkins automation server is one of the most in-demand DevOps tools today. Therefore, deploying a Jenkins instance for software development and engineering teams in a fast, flexible and highly available way is key to maintaining an efficient and smooth-running CI/CD and testing process. With Azure Kubernetes, we are able to deploy multiple Jenkins instances customized for each team on a centrally controlled cluster.

Prerequisites.

All the Kubernetes related work in this post was done on my Windows 10 machine. The tools are available for both Windows and Linux.

The following tools need to be installed and configured on the working machine before creating and configuring an Azure Kubernetes cluster.
1) Azure CLI : Azure command line tool for creating Kubernetes Clusters.
2) Kubectl : Follow the link to download the Kubectl executable and put kubectl.exe somewhere in your system PATH.
3) Azure Subscription.

I’ll also mention that I used the awesome Kompose.exe tool to convert my existing Docker-Compose files to Kubernetes-compatible yaml files. Obviously, the yaml files needed to be edited after the conversion.

Login to Azure and create a resource group for the Kubernetes services.

In this and the following sections, I will detail the steps I followed to set up and deploy a highly available Jenkins deployment in Azure Kubernetes using first a) a dynamic Azure Disk and b) a static existing Azure Disk with data.

az login --username jenkins.deployment@cicd.com --password passw04rd

Set a subscription to be the current active subscription:

az account set --subscription 0fr96513-fea6-4bca-ad18-311920der789

Create a resource group:
az group create --resource-group rgaks --location centralus

Create an Azure Kubernetes Cluster in the resource group above.

At this time, using PowerShell to create an Azure Kubernetes cluster is not supported. I suspect this will change in the future. For now, I will use the Azure CLI az aks command to create a new cluster in the new resource group:

az aks create --resource-group rgaks --name akscluster0 --node-count 2 --node-vm-size Standard_D2s_v3 --ssh-key-value C:\SSH_KubeCluster\sshkey-kubecluster.pub

The one line of az aks create code provisions a Kubernetes service object in the rgaks resource group. In addition, it also automatically creates another resource group containing a two-node cluster with all corresponding resources, as shown in the following screen shot:

After creating the Kubernetes cluster, I’ll connect to the cluster from my PowerShell console by running the az aks get-credentials command to set the current security context in the /.kube/config file, and use the kubectl command to verify the state of the Kubernetes cluster nodes:

az aks get-credentials --name akscluster0 --resource-group rgaks

kubectl.exe get nodes

Setup Persistent Storage.

At this point, the cluster nodes are ready to host a Jenkins instance container in a Kubernetes Pod. Azure Kubernetes clusters are created with two default storage classes, as displayed in the screen shot below. Both of these classes are configured to work with Azure disks. A storage class is used to define how a unit of storage is dynamically created. This saves me the step of having to create a storage class manifest file to be used by my persistent volume claim yaml file. The default storage class provisions a standard Azure disk. The managed-premium storage class provisions a premium Azure disk. I verified this using the kubectl get storageclasses command:

I selected the managed-premium class for my persistent volume claim configuration. To provision persistent storage to be used for this deployment, I created a yaml file of kind: PersistentVolumeClaim to request a storage unit of capacity 5Gi based on the managed-premium storage class. The PVC will be referenced in my Jenkins deployment yaml file to attach/map the automatically created Azure Disk volume to the Jenkins data home folder path. The following yaml file defines the dynamic creation of a 5Gi unit of storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: managed-premium
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Use the Persistent volume to deploy a Jenkins Instance.

I merged the persistent volume claim yaml file with the Jenkins application deployment and service yaml files to automatically provision the storage, provision the Jenkins deployment, map the persistent storage as a volume to the Jenkins instance, and create a service of type LoadBalancer to enable me to access the Jenkins instance from outside the Kubernetes cluster virtual network. The full and correctly indented yaml file can be found here on my GitHub page. Using the kubectl.exe tool, I can run a one line script that provisions the deployment, services and persistent volumes from a single yaml file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: C:\Kompose\kompose.exe convert -f .\docker-compose.yml
    kompose.service.type: LoadBalancer
    kompose.version: 1.13.0 (84fa826)
  creationTimestamp: null
  labels:
    io.kompose.service: jenkinsbox
  name: jenkinsbox
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: jenkinsbox
    spec:
      containers:
      - image: jenkinsci/blueocean
        name: jenkins-container
        volumeMounts:
        - mountPath: "/var/jenkins_home"
          name: volume
        ports:
        - containerPort: 8080
        - containerPort: 50000
        resources: {}
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: azure-managed-disk
      initContainers:
      - name: permissionsfix
        image: alpine:latest
        command: ["/bin/sh", "-c"]
        args:
        - chown 1000:1000 /var/jenkins_home;
        volumeMounts:
        - name: volume # Or you can replace with any name
          mountPath: /var/jenkins_home # Must match the mount path in the args line
      restartPolicy: Always
status: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    kompose.cmd: C:\Kompose\kompose.exe convert -f .\docker-compose.yml
    kompose.service.type: LoadBalancer
    kompose.version: 1.13.0 (84fa826)
  creationTimestamp: null
  labels:
    io.kompose.service: jenkinsbox
  name: jenkinsbox
spec:
  type: LoadBalancer
  ports:
  - name: "jenkinsport"
    port: 8080
    targetPort: 8080
  selector:
    io.kompose.service: jenkinsbox
status:
  loadBalancer: {}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: managed-premium
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

The screen shot displays the one line kubectl script and the deployment, services and persistent volumes created:

kubectl.exe apply -f .\jenkinsbox-k8s-all-in-one.yaml

Check the status of the deployments, services and pods.

Start the initial configuration of the Jenkins instance in the Kubernetes container.

Use the following command to retrieve the default initial admin password for Jenkins:

kubectl.exe logs jenkinsbox-55f58fcbcb-2ltqx

Use the public IP address from the Kubernetes service in the above screen shot to access the Jenkins initial setup page:

After initial configuration of Jenkins, I created a sample pipeline job:

Simulate failure and recovery.

To verify failure and recovery, I’ll delete the pod using the following command in the screen shot:

kubectl.exe delete pods --all

The screen shot indicates the pod deletion and the immediate automatic creation of a new pod with the same storage volume, to match the number of replicas defined in the deployment yaml file. In the next screen shot, I simply log in to the Jenkins instance without going through the setup wizard of a new instance. I can also confirm that the pipeline job created in the preceding steps is still available:

Verify recovery by deleting the current Jenkins deployment and using the existing Azure Disk as a static persistent volume in a new deployment.

In this example, I delete the deployment and services created in the preceding steps and develop a yaml config to deploy a new Jenkins instance using the existing disk provisioned above as a static Azure Disk persistent storage volume mapped to the new deployment.

Create a new deployment using the new yaml file:

kubectl.exe apply -f .\jenkinsbox-deployment-static-disk.yaml

The following screen shot confirms that all the deployment components and pods have been successfully provisioned with the existing persistent storage volume mapped to the new pod.

As soon as the new deployment and pod are running, I can log in to the existing Jenkins instance without a setup prompt (since the static disk contains data from the initial deployment in the preceding steps) and confirm that the existing pipeline job is available, because the existing disk is mapped to the new Kubernetes pod.

My new yaml file does not have a persistent volume claim section. I simply reference the diskURI of the existing managed disk in the deployment section. I used my jenkinsbox-deployment-static-disk.yaml file on GitHub to map the existing Azure disk to the new deployment. The following snippet displays the section of the yaml file that maps the existing Azure disk:

apiVersion: extensions/v1beta1
kind: Deployment
.......
      volumes:
      - name: azure
        azureDisk:
          kind: Managed
          diskName: kubernetes-dynamic-pvc-2d4066e6-7f34-11e8-a8dc-0a58ac1f067c
          diskURI: /subscriptions/0c696513-fea6-4bca-ad18-3119de65acef/resourceGroups/MC_rgaks_akscluster0_centralus/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-2d4066e6-7f34-11e8-a8dc-0a58ac1f067c
      restartPolicy: Always
........

Blockers/Issues encountered.

A problem I experienced during the initial deployment was a failed container. After digging into the container logs using the kubectl.exe get pods, kubectl.exe logs and kubectl.exe describe pod commands, I noticed the pod was stuck in a “ContainerCreating” loop. I also observed the following Kubernetes event log messages: “MountVolume.SetUp failed for volume” and “do not have required permission”.
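
For reference, these are the kinds of commands used to dig into the failure (the pod name is illustrative; substitute the one returned by get pods):

kubectl.exe get pods
kubectl.exe describe pod jenkinsbox-55f58fcbcb-2ltqx
kubectl.exe logs jenkinsbox-55f58fcbcb-2ltqx
kubectl.exe get events --sort-by='.metadata.creationTimestamp'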

The pod could not attach the disk volume. This is similar to an issue that occurs when using bind mounts in Docker. Resolving the issue in Docker is easy, but I could not immediately determine how to resolve it within a Kubernetes deployment environment. The issue occurs because, by default, non-root users do not have write permission on the volume mount path for NFS-backed storage. Some common app images, such as Jenkins and Nexus3, specify a non-root user that owns the mount path in the Dockerfile. When a container is created from this Dockerfile, the creation of the container fails because the non-root user does not have the required permissions on the mount path.

After some research, I found a solution that uses an init container in my deployment to give the non-root user specified in my Dockerfile write permission on the volume mount path inside the container.

The init container creates the volume mount path inside the container, changes the ownership of the mount path to the correct (non-root) user, and then exits. My Jenkins container then starts as the non-root user that needs to write to the mount path. Because the path is already owned by the non-root user, writing to the mount path succeeds. The full example of using the init container can also be found here. The following is the section of my deployment yaml file that defines the init container:

      initContainers:
      - name: permissionsfix
        image: alpine:latest
        command: ["/bin/sh", "-c"]
        args:
        - chown 1000:1000 /var/jenkins_home;
        volumeMounts:
        - name: azure # Or you can replace with any name
          mountPath: /var/jenkins_home # Must match the mount path in the args line

Limitations

Having tested both Azure Disks and Azure File Share as persistent volume options for deploying the Jenkins container in AKS, I have observed that the Azure Disk option provides noticeably faster performance and response times for the application.

The drawback for me, though, is that the access mode for Azure Disk persistent volumes is ReadWriteOnce, which means an Azure disk can be attached to only one cluster node at a time. In the event of a node failure or update, it can take anywhere from one to five minutes for the Azure disk to be detached and reattached to the next available node, so a pod/container/application may incur a short outage while Kubernetes schedules it onto another node. This scenario could slightly impact the high availability offering of AKS. Hopefully, Azure will improve on this behavior soon.
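
The access mode of the provisioned volume can be confirmed on the cluster itself (names assumed from the yaml above; RWO is short for ReadWriteOnce):

kubectl.exe get pv
kubectl.exe get pvc azure-managed-disk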


PowerShell function to Provision a Windows Server EC2 Instance in AWS Cloud.

Introduction.
Amazon recently updated the AWSPowerShell module to better enable cloud administrators to manage and provision resources in the AWS cloud using the same familiar PowerShell tool. At last count, the AWSPowerShell module contains almost four thousand cmdlets:
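
The cmdlet count shown in the screen shot can be reproduced with a one-liner once the module has been installed and imported (installation is covered in the next step):

#Count the cmdlets exported by the AWSPowerShell module
(Get-Command -Module AWSPowerShell).Count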

This reflects a continued commitment to PowerShell as a robust tool for managing both the Azure and Amazon cloud platforms.
In this post, I want to quickly demonstrate how to provision an AWS EC2 instance using PowerShell. The following steps accomplish this objective.

Install the AWSPowerShell Module.
For this post, I’ll be using version 5.1.16299.98 of Windows PowerShell, as indicated in the following screen shot:

I’ll find and install the AWSPowerShell module by piping Find-Module to Install-Module:

PS C:\Scripts> Find-Module -Name AWSPowerShell | Install-Module -Force

Configure AWS Credential Profile.
During initial signup for an AWS account, a root account is created with full administrative access. According to AWS best practices, a sub user account with a corresponding access key ID and secret key should be created for making API calls and programmatically managing resources with PowerShell. This way, if the keys are compromised, the associated user can be disabled instead of risking the compromise of the root account and all the resources associated with it.

Use the Users tab of the IAM (Identity and Access Management) console in the AWS portal to create a subuser and generate the corresponding access key ID and secret key.

After generating the keys, I’ll use the Set-AWSCredential cmdlet to save and persist the credential keys to my local AWS SDK store for use across multiple PowerShell sessions. The Initialize-AWSDefaultConfiguration cmdlet then sets the new profile and region as active within the PowerShell session. The following script accomplishes this task. Please note that the AccessKey and SecretKey parameter values are represented by variables:

#Set and Initialize AWS Credential and Profile for login and authentication
Set-AWSCredential -AccessKey $AccessKey -SecretKey $SecretKey -StoreAs AWSDemoProfile
Initialize-AWSDefaultConfiguration -ProfileName AWSDemoProfile -Region us-west-1
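
To confirm that the profile was persisted to the local SDK store, the stored profiles can be listed (a quick check; the -ListProfileDetail switch is available in recent AWSPowerShell versions):

#List the credential profiles saved in the local AWS SDK store
Get-AWSCredential -ListProfileDetail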

Create an EC2 Key Pair.
Use the New-EC2KeyPair cmdlet to create an EC2 key pair. This cmdlet calls the Amazon Elastic Compute Cloud CreateKeyPair API and creates a 2048-bit RSA key pair with the specified name. Amazon EC2 stores the public key and returns the private key to be saved to a file. The private key is returned as an unencrypted, PEM-encoded PKCS#1 private key and is used later to decrypt the administrator password when logging on to the virtual machine. If a key with the specified name already exists, Amazon EC2 returns an error. Up to five thousand key pairs can be created per region, and a key pair is available only in the region in which it is created. In the following script, I create the key, assign the key pair object to a variable and save the KeyMaterial property of the key pair object locally to a file:

#Create Keypair for decrypting login creds
$awskey = New-EC2KeyPair -KeyName demo1key
$awskey.KeyMaterial | Out-File -FilePath C:\AWSCred\mykeypair.pem
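
As a quick sanity check before launching the instance, the key pair can be queried back from the region (key name assumed from the step above):

#Confirm the key pair is registered in the current region
Get-EC2KeyPair -KeyName demo1key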

Provision a Non-Default Virtual Private Cloud (VPC).
When I first created my AWS account, a default VPC was provisioned with a private IP address scheme. For the purpose of this post, I would prefer to create a custom non-default VPC with an address range of my choice. Unlike the default VPC, a non-default VPC does not have internet connectivity, so some extra configuration is needed to enable it.
The following tasks are accomplished by the PowerShell script to enable internet connectivity for the custom non-default VPC:
Create the non-default vpc and enable dns hostnames
Tag the vpc with a friendly name
Create a custom subnet for the vpc and tag it
Create an internet gateway and attach it to the custom vpc
Create a custom route table for internet access and associate it with the custom subnet


#Create non default virtual private cloud/virtual network, enable dns hostnames and tag the vpc resource
$Ec2Vpc = New-EC2Vpc -CidrBlock "10.0.0.0/16" -InstanceTenancy default
Edit-EC2VpcAttribute -VpcId $Ec2Vpc.VpcId -EnableDnsHostnames $true
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "MyVPC"
New-EC2Tag -Resource $Ec2Vpc.VpcId -Tag $Tag

#Create non default subnet and tag the subnet resource
$Ec2subnet = New-EC2Subnet -VpcId $Ec2Vpc.VpcId -CidrBlock "10.0.0.0/24"
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "MySubnet"
New-EC2Tag -Resource $Ec2subnet.SubnetId -Tag $Tag
#Edit-EC2SubnetAttribute -SubnetId $ec2subnet.SubnetId -MapPublicIpOnLaunch $true

#Create Internet Gateway and attach it to the VPC
$Ec2InternetGateway = New-EC2InternetGateway
Add-EC2InternetGateway -InternetGatewayId $Ec2InternetGateway.InternetGatewayId -VpcId $ec2Vpc.VpcId
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "MyInternetGateway"
New-EC2Tag -Resource $Ec2InternetGateway.InternetGatewayId -Tag $Tag

#Create custom route table with route to the internet and associate it with the subnet
$Ec2RouteTable = New-EC2RouteTable -VpcId $ec2Vpc.VpcId
New-EC2Route -RouteTableId $Ec2RouteTable.RouteTableId -DestinationCidrBlock "0.0.0.0/0" -GatewayId $Ec2InternetGateway.InternetGatewayId
Register-EC2RouteTable -RouteTableId $Ec2RouteTable.RouteTableId -SubnetId $ec2subnet.SubnetId

Create Security Group.
In this section, I’ll create a security group with a rule to enable remote desktop access to the EC2Instance VM.
#Create Security group and firewall rule for RDP
$SecurityGroup = New-EC2SecurityGroup -Description "Non Default RDP Security group for AWS VM" -GroupName "RDPSecurityGroup" -VpcId $ec2Vpc.VpcId
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "RDPSecurityGroup"
New-EC2Tag -Resource $securityGroup -Tag $Tag
$iprule = New-Object Amazon.EC2.Model.IpPermission
$iprule.ToPort = 3389
$iprule.FromPort = 3389
$iprule.IpProtocol = "tcp"
$iprule.IpRanges.Add('0.0.0.0/0')
Grant-EC2SecurityGroupIngress -GroupId $securityGroup -IpPermission $iprule -Force

Get the AMI (Amazon Machine Image) and create an Elastic Public IP Address to be attached to the EC2 instance after initialization.
#Retrieve Amazon Machine Image Id property for Windows Server 2016
$imageid = (Get-EC2ImageByName -Name WINDOWS_2016_BASE).ImageId

#Allocate an Elastic IP Address for use with an instance VM
$Ec2Address = New-EC2Address -Domain vpc
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "MyElasticIP"
New-EC2Tag -Resource $Ec2Address.AllocationId -Tag $Tag

Launch or Provision the EC2 Instance Virtual machine.
#Launch EC2Instance Virtual Machine
$ec2instance = New-EC2Instance -ImageId $imageid -MinCount 1 -MaxCount 1 -InstanceType t2.micro -KeyName demo1key -SecurityGroupId $securityGroup -Monitoring_Enabled $true -SubnetId $ec2subnet.SubnetId
$Tag = New-Object Amazon.EC2.Model.Tag
$Tag.Key = "Name"
$Tag.Value = "MyVM"
$InstanceId = $ec2instance.Instances | Select-Object -ExpandProperty InstanceId
New-EC2Tag -Resource $InstanceId -Tag $Tag

Associate the Elastic Public IP Address to the EC2 Instance.
#Assign Elastic IP Address to the EC2 Instance VM
$DesiredState = "Running"
while ($true) {
    $State = (Get-EC2Instance -InstanceId $InstanceId).Instances.State.Name.Value
    if ($State -eq $DesiredState) {
        break
    }
    "$(Get-Date) Current State = $State, Waiting for Desired State=$DesiredState"
    Start-Sleep -Seconds 5
}
Register-EC2Address -AllocationId $Ec2Address.AllocationId -InstanceId $InstanceId

Display EC2 Instance Properties.
#Display VM instance properties
(Get-EC2Instance -InstanceId $InstanceId).Instances | Format-List

Remove or Terminate EC2 Instances.
#Clean up and Terminate the EC2 Instance
Get-EC2Instance | Remove-EC2Instance -Force
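
Optionally, the Elastic IP allocated earlier can also be released once the instance has terminated (a sketch; assumes the $Ec2Address variable from the allocation step is still in scope):

#Release the Elastic IP Address allocated earlier
Remove-EC2Address -AllocationId $Ec2Address.AllocationId -Force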

Logon to the EC2 instance using the Remote Desktop Protocol.
Login to the EC2 instance virtual machine can be initiated from the AWS EC2 Dashboard. The private key portion of the key pair is used to decrypt the administrator password for logging in to the virtual machine, as indicated in the following screen shots:

Select the EC2 Instance and click on the Connect button. On the Connect to your Instance page, click on the Get Password button.

On the Get Password page, copy and paste the private key from the keypair file into the content field and click to decrypt the key.

Copy the displayed password, download the RDP file and log in to the EC2 instance. It is recommended to change the password and create a new local user after logon.
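
As an alternative to the console workflow, the administrator password can also be retrieved and decrypted directly from PowerShell using the key pair file saved earlier (a sketch; password data may be empty for a few minutes after launch while the instance initializes):

#Retrieve and decrypt the local administrator password with the saved key pair file
Get-EC2PasswordData -InstanceId $InstanceId -PemFile C:\AWSCred\mykeypair.pem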

The full PowerShell script can be found in my Github repository.
