Tag: Apache Spark
-
Write Data from Azure Databricks to Azure Dedicated SQL Pool(formerly SQL DW) using ADLS Gen 2.
In this post, I will attempt to capture the steps taken to load data from Azure Databricks deployed with VNET Injection (Network Isolation) into an instance of Azure Synapse DataWarehouse deployed within a custom VNET and configured with a private endpoint and private DNS. Deploying these services, including Azure Data Lake Storage Gen 2 within…
-
Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.
Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that hugely simplified a batch-based Data ingestion and processing ETL pipeline. Implementing an ETL pipeline to incrementally process only new files as they land in a Data Lake in near real time (periodically, every few minutes/hours) can be complicated. Since…
-
Build a Jar file for the Apache Spark SQL and Azure SQL Server Connector Using SBT.
The Apache Spark Azure SQL Connector is a huge upgrade to the built-in JDBC Spark connector. It is more than 15x faster than generic JDBC connector for writing to SQL Server. In this short post, I articulate the steps required to build a JAR file from the Apache Spark connector for Azure SQL that can…
-
Configure a Databricks Cluster-scoped Init Script in Visual Studio Code.
Databricks is a distributed data analytics and processing platform designed to run in the Cloud. This platform is built on Apache Spark which is currently at version 2.4.4. In this post, I will demonstrate the deployment and installation of custom R based machine learning packages into Azure Databricks Clusters using Cluster Init Scripts. So, what…
-
Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.
In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant part of the tasks are targeted towards Data Quality. Data quality in this context mostly refers to having data that is free of errors, inconsistencies, redundancies, poor formatting and other…