Chinny Chukwudozie, Ai architecture.

AI Solutions and Agentic Engineering.

  • Publish PySpark Streaming Query Metrics to Azure Log Analytics using the Data Collector REST API.

    Overview. At the time of this writing, there doesn’t seem to be built-in support for writing PySpark Structured Streaming query metrics from Azure Databricks to Azure Log Analytics. After some research, I found a workaround that enables capturing the streaming query metrics as a Python dictionary object from within a notebook session and publishing…

    jbernec

    November 27, 2020
    PySpark Streaming Logs
    Azure Databricks Cluster, Azure Log Analytics, Azure Monitor, HTTP Data Collector API, PySpark Application Logs, PySpark Streaming Logs, Python Wheel Package, setup.py
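    The Data Collector API authenticates each POST with an HMAC-SHA256 signature over the request headers. A minimal sketch of that signing step, using placeholder workspace credentials and an illustrative metrics dictionary shaped like the fields a streaming query reports:

    ```python
    import base64
    import datetime
    import hashlib
    import hmac
    import json

    def build_signature(workspace_id, shared_key, date, content_length,
                        method="POST", content_type="application/json",
                        resource="/api/logs"):
        """Build the SharedKey Authorization header for the Azure Log
        Analytics HTTP Data Collector API."""
        string_to_hash = (f"{method}\n{content_length}\n{content_type}\n"
                          f"x-ms-date:{date}\n{resource}")
        decoded_key = base64.b64decode(shared_key)
        hashed = hmac.new(decoded_key, string_to_hash.encode("utf-8"),
                          digestmod=hashlib.sha256).digest()
        return f"SharedKey {workspace_id}:{base64.b64encode(hashed).decode()}"

    # Illustrative metrics payload (field names mirror streaming progress output).
    metrics = {"batchId": 42, "inputRowsPerSecond": 120.5}
    body = json.dumps(metrics)
    rfc1123_date = datetime.datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S GMT")
    auth = build_signature("my-workspace-id",
                           base64.b64encode(b"fake-shared-key").decode(),
                           rfc1123_date, len(body))
    ```

    The resulting header value would accompany the JSON body in a POST to `https://{workspace_id}.ods.opinsights.azure.com/api/logs`, alongside `Log-Type` and `x-ms-date` headers.
    
    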
  • Write Data from Azure Databricks to Azure Dedicated SQL Pool (formerly SQL DW) using ADLS Gen 2.

    In this post, I will attempt to capture the steps taken to load data from Azure Databricks deployed with VNET Injection (Network Isolation) into an instance of Azure Synapse Data Warehouse deployed within a custom VNET and configured with a private endpoint and private DNS. Deploying these services, including Azure Data Lake Storage Gen 2 within…

    jbernec

    November 13, 2020
    Apache Spark, Azure Synapse DW
    ADLS Gen 2, Apache Spark, Azure Databricks, Azure Key Vault, Azure SQL DataWarehouse, Azure Synapse Analytics, Azure Synapse Connector, Database Scoped Credential, formerly Azure SQL DataWarehouse, Managed Service Identity, SQL
  • Incrementally Process Data Lake Files Using Azure Databricks Autoloader and Spark Structured Streaming API.

    Use Case. In this post, I will share my experience evaluating an Azure Databricks feature that greatly simplified a batch-based data ingestion and processing ETL pipeline. Implementing an ETL pipeline to incrementally process only new files as they land in a Data Lake in near real time (periodically, every few minutes/hours) can be complicated. Since…

    jbernec

    September 30, 2020
    Azure Databricks
    Analytics, Apache Spark, Apache Spark Connector, Apache Spark JDBC Connector, Autoloader, Azure Data Factory, Azure Data Lake Gen 2, Azure Databricks, Azure Event Grid, Azure SQL DB, Big Data, cloudFiles, CSV, Data, ETL, Ingestion, JSON, Pipeline, PySpark, Python, Queue Service, schema, StructType, Structured Streaming API, udf, Unified Analytics
  • Build a Jar file for the Apache Spark SQL and Azure SQL Server Connector Using SBT.

    The Apache Spark Azure SQL Connector is a huge upgrade to the built-in JDBC Spark connector. It is more than 15x faster than the generic JDBC connector for writing to SQL Server. In this short post, I articulate the steps required to build a JAR file from the Apache Spark connector for Azure SQL that can…

    jbernec

    June 29, 2020
    Unified Analytics
    Apache Spark, Azure Databricks, Azure Databricks Cluster, Microsoft, sbt, Spark, sql-spark-connector, Unified Analytics
  • Configure a Databricks Cluster-scoped Init Script in Visual Studio Code.

    Databricks is a distributed data analytics and processing platform designed to run in the Cloud. This platform is built on Apache Spark, which is currently at version 2.4.4. In this post, I will demonstrate the deployment and installation of custom R based machine learning packages into Azure Databricks Clusters using Cluster Init Scripts. So, what…

    jbernec

    March 2, 2020
    Apache Spark, Bash, Cluster Init Scripts, Databricks Notebooks, Install.packages(), Logs, R, Shell
    Apache Spark, Azure Databricks, Bash, Cluster Init Scripts, Databricks CLI, Databricks Notebooks, Install.packages(), Logs, R
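    A cluster-scoped init script is just a bash file that runs on every node at cluster start. A minimal sketch that generates such a script for installing R packages (the path, package names, and CRAN mirror here are illustrative; on Databricks the file would be uploaded, e.g. with the Databricks CLI, and referenced in the cluster configuration):

    ```shell
    # Generate an init script that installs R ML packages on each cluster node.
    cat > /tmp/install-r-packages.sh <<'EOF'
    #!/bin/bash
    set -euo pipefail
    # Install custom R machine learning packages at cluster startup.
    Rscript -e 'install.packages(c("caret", "randomForest"), repos = "https://cloud.r-project.org")'
    EOF
    chmod +x /tmp/install-r-packages.sh
    # Syntax-check the generated script before uploading it.
    bash -n /tmp/install-r-packages.sh
    ```

    The `bash -n` check is a cheap safeguard: a syntax error in an init script would otherwise surface only as a failed cluster launch.
    
    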
  • Data Preparation of PySpark Dataframes in Azure Databricks Cluster using Databricks Connect.

    In my limited experience with processing big data workloads on the Azure Databricks platform powered by Apache Spark, it has become obvious that a significant part of the work is targeted towards Data Quality. Data quality in this context mostly refers to having data that is free of errors, inconsistencies, redundancies, poor formatting and other…

    jbernec

    March 1, 2020
    Azure Databricks
    Apache Spark, APIs, CSV, Databricks-Connect, DataCompy, Dataframes, Jupyter Notebook, PySpark, Python, Python Virtual Environment, Venv
  • Setting Up Jupyter Notebook to Run in a Python Virtual Environment.

    1) Install Jupyter on the local machine outside of any existing Python virtual environment: pip install jupyter --no-cache-dir
    2) Create a Python virtual environment: mkdir virtualenv; cd virtualenv; python.exe -m venv dbconnect
    3) Change directory into the virtual environment and activate it: .\scripts\activate
    4) Install the ipykernel package in the virtual environment: pip install ipykernel
    5) Use…

    jbernec

    May 20, 2019
    Python
    Jupyter Kernel, Jupyter Notebook, Python Virtual Environment, Venv
  • Programmatically Provision an Azure Databricks Workspace and Cluster using Python Functions.

    Azure Databricks is a data analytics and machine learning platform based on Apache Spark. The first set of tasks to be performed before using Azure Databricks for any kind of Data exploration and machine learning execution is to create a Databricks workspace and Cluster. The following Python functions were developed to enable the automated provision…

    jbernec

    May 16, 2019
    Apache Spark, Azure Automation Account, Azure Databricks, Python
    ARM Templates, Automation, Azure Automation, Azure Databricks, Azure Databricks Cluster, Create Cluster API, Databricks REST API 2.0, Python3, yaml
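    Cluster creation goes through the Databricks REST API 2.0 `clusters/create` endpoint, which takes a JSON body with the cluster name, Spark version, node type, and worker count. A minimal sketch that assembles (but does not send) such a request; the host, token, Spark version, and VM size are placeholders:

    ```python
    import json
    import urllib.request

    def create_cluster_request(host, token, cluster_name,
                               spark_version="6.4.x-scala2.11",
                               node_type_id="Standard_DS3_v2",
                               num_workers=2):
        """Build a clusters/create request for the Databricks REST API 2.0.
        The request is returned unsent so the payload can be inspected."""
        payload = {
            "cluster_name": cluster_name,
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        }
        return urllib.request.Request(
            url=f"https://{host}/api/2.0/clusters/create",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
            method="POST",
        )

    req = create_cluster_request("adb-1234.azuredatabricks.net",
                                 "dapiXXXX", "demo-cluster")
    ```

    Sending the request (e.g. with `urllib.request.urlopen(req)`) would require a real workspace URL and a personal access token.
    
    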
  • Automate Azure Databricks Job Execution using Custom Python Functions.

    Introduction Thanks to a recent Azure Databricks project, I’ve gained insight into some of the configuration components, issues and key elements of the platform. Let’s take a look at this project to give you some insight into successfully developing, testing, and deploying artifacts and executing models. One note: This post is not meant to be…

    jbernec

    March 23, 2019
    Apache Spark, Azure Databricks, Cluster Init Scripts, Databricks Notebooks, Python
    Azure Data Factory, Databricks, Databricks CLI, Git, Jobs API, Jobs REST API, Logging module, MLFlow, Python, Subprocess module, Version Control
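    One way to trigger a job run from Python is to shell out to the Databricks CLI, which wraps the Jobs REST API `run-now` endpoint. A minimal sketch that assembles the CLI invocation via the subprocess module without executing it (the job ID and notebook parameters are placeholders, and running it for real would require a configured Databricks CLI):

    ```python
    import json
    import subprocess

    def databricks_run_now_cmd(job_id, notebook_params):
        """Assemble a Databricks CLI 'jobs run-now' call as an argument list
        suitable for subprocess.run(); nothing is executed here."""
        return ["databricks", "jobs", "run-now",
                "--job-id", str(job_id),
                "--notebook-params", json.dumps(notebook_params)]

    cmd = databricks_run_now_cmd(101, {"run_date": "2019-03-23"})
    # subprocess.run(cmd, check=True)  # uncomment with a configured CLI
    ```

    Building the argument list separately from executing it keeps the call easy to log and unit test before it ever touches a live workspace.
    
    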
  • Provisioning a Jenkins Instance Container with Persistent Volume in Azure Kubernetes Service.

    In this post, I want to write about my experience testing and using Azure Kubernetes Service to deploy a Jenkins Instance solution that is highly available and resilient. With the Kubernetes persistent volume feature, an Azure disk can be dynamically provisioned and attached to a Jenkins Instance container deployment. In another scenario, an existing Azure…

    jbernec

    July 3, 2018
    Azure Kubernetes, Kubernetes
    AKS, azure cli, Azure Kubernetes, Container, Deployments, Dynamic Azure Disk, Jenkins, Kompose.exe, kubectl, Kubernetes, Persistent Volume, Persistent Volume Claim, Pods, Services, Static Azure Disk, yaml
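    Dynamic provisioning hinges on a PersistentVolumeClaim that references an Azure-disk-backed storage class; Kubernetes then creates and attaches the managed disk for the Jenkins pod. A minimal config sketch, where the claim name, size, and the `managed-premium` storage class (commonly available on AKS clusters) are illustrative:

    ```yaml
    # PVC requesting a dynamically provisioned Azure managed disk
    # for the Jenkins home directory.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: jenkins-home
    spec:
      accessModes:
        - ReadWriteOnce        # Azure disks attach to a single node at a time
      storageClassName: managed-premium
      resources:
        requests:
          storage: 10Gi
    ```

    The Jenkins deployment would then mount this claim (typically at `/var/jenkins_home`) via a `volumes`/`volumeMounts` pair, so the controller's state survives pod rescheduling.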
