Chinny Chukwudozie, AI Architecture.

AI Solutions and Agentic Engineering.

Tag: Azure Databricks

Designing and Implementing a Modern Data Architecture on Azure Cloud.

I just completed work on the digital transformation, design, development, and delivery of a cloud-native data solution for one of the biggest professional sports organizations in North America. In this post, I want to share some thoughts on the selected architecture and why we settled on it. This architecture was chosen to meet the…

jbernec

May 22, 2022

Modern Data Architecture, Uncategorized

Azure, Azure Data Factory, Azure Databricks, Azure Key Vault, Azure Synapse, Delta Lake, Modern Data Architecture, Power BI, Private Endpoint
Ingest Azure Event Hub Telemetry Data with Apache PySpark Structured Streaming on Databricks.

Overview. Ingesting, storing, and processing millions of telemetry events from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hub. Most documented implementations of Azure Databricks ingestion from Azure Event Hub data are based on…

jbernec

May 17, 2021

Azure Databricks, Azure Event Hub

Azure Data Factory, Azure Databricks, Azure Event Hub, Databricks, Databricks Notebooks, Databricks REST API 2.0, EntityPath, Event Hub Apache Connector, Event Hub Connection String, IoT, PySpark, Python, Telemetry
Write Data from Azure Databricks to Azure Dedicated SQL Pool (formerly SQL DW) using ADLS Gen 2.

In this post, I will attempt to capture the steps taken to load data from Azure Databricks deployed with VNET Injection (Network Isolation) into an instance of Azure Synapse DataWarehouse deployed within a custom VNET and configured with a private endpoint and private DNS. Deploying these services, including Azure Data Lake Storage Gen 2 within…

jbernec

November 13, 2020

Apache Spark, Azure Synapse DW

ADLS Gen 2, Apache Spark, Azure Databricks, Azure Key Vault, Azure SQL DataWarehouse, Azure Synapse Analytics, Azure Synapse Connector, Database Scoped Credential, formerly Azure SQL DataWarehouse, Managed Service Identity, SQL
Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based data ingestion and processing ETL pipeline. Implementing an ETL pipeline that incrementally processes only new files as they land in a data lake in near real time (periodically, every few minutes or hours) can be complicated. Since…

jbernec

September 30, 2020

Azure Databricks

Analytics, Apache Spark, Apache Spark Connector, Apache Spark JDBC Connector, Autoloader, Azure Data Factory, Azure Data Lake Gen 2, Azure Databricks, Azure Event Grid, Azure SQL DB, Big Data, cloudFiles, CSV, Data, ETL, Ingestion, JSON, Pipeline, PySpark, Python, Queue Service, schema, StructType, Structured Streaming API, udf, Unified Analytics
Build a Jar file for the Apache Spark SQL and Azure SQL Server Connector Using SBT.

The Apache Spark Azure SQL Connector is a huge upgrade to the built-in JDBC Spark connector. It is up to 15x faster than the generic JDBC connector for writing to SQL Server. In this short post, I articulate the steps required to build a JAR file from the Apache Spark connector for Azure SQL that can…

jbernec

June 29, 2020

Unified Analytics

Apache Spark, Azure Databricks, Azure Databricks Cluster, Microsoft, sbt, Spark, sql-spark-connector, Unified Analytics
Configure a Databricks Cluster-scoped Init Script in Visual Studio Code.

Databricks is a distributed data analytics and processing platform designed to run in the cloud. The platform is built on Apache Spark, which at the time of writing is at version 2.4.4. In this post, I will demonstrate the deployment and installation of custom R-based machine learning packages into Azure Databricks clusters using cluster init scripts. So, what…

jbernec

March 2, 2020

Apache Spark, Bash, Cluster Init Scripts, Databricks Notebooks, Install.packages(), Logs, R, Shell

Apache Spark, Azure Databricks, Bash, Cluster Init Scripts, Databricks CLI, Databricks Notebooks, Install.packages(), Logs, R
Programmatically Provision an Azure Databricks Workspace and Cluster using Python Functions.

Azure Databricks is a data analytics and machine learning platform based on Apache Spark. Before using Azure Databricks for any kind of data exploration or machine learning work, the first tasks are to create a Databricks workspace and cluster. The following Python functions were developed to enable the automated provision…

jbernec

May 16, 2019

Apache Spark, Azure Automation Account, Azure Databricks, Python

ARM Templates, Automation, Azure Automation, Azure Databricks, Azure Databricks Cluster, Create Cluster API, Databricks REST API 2.0, Python3, yaml