Processing Event Hubs Capture Files (Avro Format) with Spark on Azure Databricks and Saving Them as Parquet or CSV

from pyspark.sql import SparkSession

# Do not skip input files that lack the .avro extension
spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro: partition 0; the wildcards match the date/time folders created by Capture
in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'
# On Spark 2.4 and later, the built-in "avro" format can be used instead
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: Body holds the event payload as bytes, so cast it to a string first
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
# The schema is inferred here; in production it is better to specify a schema for the JSON
data = spark.read.json(jsonRdd)

# Do whatever you want with `data`
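
The snippet above stops once `data` is loaded, so here is a minimal sketch of the two remaining steps: reading the JSON with an explicit schema instead of inferring it, and saving the result as Parquet or CSV. The helper names `read_body` and `save_events`, the field names in `BODY_SCHEMA`, and any output path you pass in are hypothetical and not part of the original code; adapt them to your own payload and storage mount.

```python
# Hypothetical schema for the JSON payload, written as a Spark DDL string.
# The field names are assumptions; replace them with the fields your devices send.
BODY_SCHEMA = "deviceId STRING, temperature DOUBLE, humidity DOUBLE"


def read_body(spark, avro_df, schema=BODY_SCHEMA):
    """Parse the Capture `Body` column as JSON using an explicit schema."""
    json_rdd = avro_df.select(avro_df.Body.cast("string")).rdd.map(lambda row: row[0])
    # DataFrameReader.schema accepts a DDL string, so no schema inference pass is needed
    return spark.read.schema(schema).json(json_rdd)


def save_events(df, out_path, fmt="parquet"):
    """Write `df` to `out_path` as Parquet or CSV, replacing any existing output."""
    if fmt not in ("parquet", "csv"):
        raise ValueError(f"unsupported format: {fmt!r}")
    writer = df.write.mode("overwrite")
    if fmt == "parquet":
        writer.parquet(out_path)
    else:
        # CSV cannot hold nested columns, so flatten structs and arrays beforehand
        writer.option("header", "true").csv(out_path)
```

With the `spark` session and `avroDf` from the snippet above, this could be used as `data = read_body(spark, avroDf)` followed by `save_events(data, out_path)` or `save_events(data, out_path, fmt="csv")`, where `out_path` is a folder of your choice. Parquet is usually the safer default because it keeps column types and nested fields, while CSV is handy when the output has to be opened in other tools.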

Senior Cloud Solution Architect and Data Scientist @microsoft | PhD Student @unicomplutense (Opinions are my own)

Caio Moreno
