Chinny Chukwudozie, Cloud Solutions.

Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

September 30, 2020

Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based Data ingestion and processing ETL pipeline. Implementing an ETL pipeline to incrementally process only new files as they land in a Data Lake in near real time (periodically, every few minutes/hours) can be complicated. Since…

Build a Jar file for the Apache Spark SQL and Azure SQL Server Connector Using SBT.

June 29, 2020

The Apache Spark Azure SQL Connector is a huge upgrade to the built-in JDBC Spark connector. It is more than 15x faster than the generic JDBC connector for writing to SQL Server. In this short post, I articulate the steps required to build a JAR file from the Apache Spark connector for Azure SQL that can…

Configure a Databricks Cluster-scoped Init Script in Visual Studio Code.

March 2, 2020

Databricks is a distributed data analytics and processing platform designed to run in the cloud. The platform is built on Apache Spark, which is currently at version 2.4.4. In this post, I will demonstrate the deployment and installation of custom R-based machine learning packages into Azure Databricks clusters using cluster init scripts. So, what…

Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.

March 1, 2020

In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant part of the tasks are targeted towards Data Quality. Data quality in this context mostly refers to having data that is free of errors, inconsistencies, redundancies, poor formatting and other…

Setting Up Jupyter Notebook to Run in a Python Virtual Environment.

May 20, 2019

1) Install Jupyter on the local machine outside of any existing Python virtual environment: pip install jupyter --no-cache-dir 2) Create a Python virtual environment: mkdir virtualenv cd virtualenv python.exe -m venv dbconnect 3) Change directory into the virtual environment and activate it: .\scripts\activate 4) Install the ipykernel package in the virtual environment: pip install ipykernel 5) Use…

Programmatically Provision an Azure Databricks Workspace and Cluster using Python Functions.

May 16, 2019

Azure Databricks is a data analytics and machine learning platform based on Apache Spark. The first set of tasks to be performed before using Azure Databricks for any kind of Data exploration and machine learning execution is to create a Databricks workspace and Cluster. The following Python functions were developed to enable the automated provision…

Automate Azure Databricks Job Execution using Custom Python Functions.

March 23, 2019

Introduction. Thanks to a recent Azure Databricks project, I’ve gained insight into some of the configuration components, issues and key elements of the platform. Let’s take a look at this project to give you some insight into successfully developing, testing, and deploying artifacts and executing models. One note: This post is not meant to be…

Provisioning a Jenkins Instance Container with Persistent Volume in Azure Kubernetes Service.

July 3, 2018

In this post, I want to write about my experience testing and using Azure Kubernetes Service to deploy a Jenkins instance that is highly available and resilient. With the Kubernetes persistent volume feature, an Azure disk can be dynamically provisioned and attached to a Jenkins instance container deployment. In another scenario, an existing Azure…

PowerShell function to Provision a Windows Server EC2 Instance in AWS Cloud.

January 25, 2018

Introduction. Microsoft just updated the AWSPowerShell module to better enable cloud administrators to manage and provision cloud resources in the AWS cloud space while using the same familiar PowerShell tool. As of the last count today, the AWSPowerShell module contains almost four thousand cmdlets. This means Microsoft is committed to expanding on PowerShell functionality as a robust…

Thoughts on the Meltdown and Spectre Processor Vulnerabilities.

January 5, 2018

Summary: A new class of security vulnerabilities referred to as “speculative execution side-channel attacks”, also known as “Meltdown and Spectre”, was publicly disclosed by cybersecurity researchers this week. Given the gravity of these flaws, many concerns have rightly been raised. In this article I will cover their impact as well as the Microsoft Cloud’s…