Tag Archives: PySpark

Ingest Azure Event Hub Telemetry Data with Apache PySpark Structured Streaming on Databricks.

Overview. Ingesting, storing and processing millions of telemetry data from a plethora of remote IoT devices and Sensors has become common place. One of the primary Cloud services used to process streaming telemetry events at scale is Azure Event Hub. … Continue reading

Posted in Azure Databricks, Azure Event Hub | Tagged , , , , , , , , , , , , | 1 Comment

Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based Data ingestion and processing ETL pipeline. Implementing an ETL pipeline to incrementally process only new files as they land in … Continue reading

Posted in Azure Databricks | Tagged , , , , , , , , , , , , , , , , , , , , , , , , , | Leave a comment

Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.

In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant part of the tasks are targeted towards Data Quality. Data quality in this context mostly … Continue reading

Posted in Azure Databricks | Tagged , , , , , , , , , , | Leave a comment