Chinny Chukwudozie, AI Architecture.

AI Solutions and Agentic Engineering.

Tag: Python

Ingest Azure Event Hub Telemetry Data with Apache PySpark Structured Streaming on Databricks.

Overview. Ingesting, storing and processing millions of telemetry events from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hub. Most documented implementations of Azure Databricks ingestion from Azure Event Hub are based on…

jbernec

May 17, 2021

Azure Databricks, Azure Event Hub

Azure Data Factory, Azure Databricks, Azure Event Hub, Databricks, Databricks Notebooks, Databricks REST API 2.0, EntityPath, Event Hub Apache Connector, Event Hub Connection String, IoT, PySpark, Python, Telemetry

Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based data ingestion and processing ETL pipeline. Implementing an ETL pipeline that incrementally processes only new files as they land in a Data Lake in near real time (periodically, every few minutes or hours) can be complicated. Since…

jbernec

September 30, 2020

Azure Databricks

Analytics, Apache Spark, Apache Spark Connector, Apache Spark JDBC Connector, Autoloader, Azure Data Factory, Azure Data Lake Gen 2, Azure Databricks, Azure Event Grid, Azure SQL DB, Big Data, cloudFiles, CSV, Data, ETL, Ingestion, JSON, Pipeline, PySpark, Python, Queue Service, schema, StructType, Structured Streaming API, udf, Unified Analytics

Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.

In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant share of the tasks target data quality. Data quality in this context mostly refers to having data that is free of errors, inconsistencies, redundancies, poor formatting and other…

jbernec

March 1, 2020

Azure Databricks

Apache Spark, APIs, CSV, Databricks-Connect, DataCompy, Dataframes, Jupyter Notebook, PySpark, Python, Python Virtual Environment, Venv

Automate Azure Databricks Job Execution using Custom Python Functions.

Introduction. Thanks to a recent Azure Databricks project, I’ve gained insight into some of the configuration components, issues and key elements of the platform. Let’s take a look at this project to give you some insight into successfully developing, testing, and deploying artifacts and executing models. One note: This post is not meant to be…

jbernec

March 23, 2019

Apache Spark, Azure Databricks, Cluster Init Scripts, Databricks Notebooks, Python

Azure Data Factory, Databricks, Databricks CLI, Git, Jobs API, Jobs REST API, Logging module, MLFlow, Python, Subprocess module, Version Control
