Processing Event Hubs Capture files (AVRO Format) using Spark (Azure Databricks), save to Parquet or CSV format

In this tutorial I will demonstrate how to process your Event Hubs Capture files (Avro format) stored in Azure Data Lake Store using Azure Databricks (Spark).

This tutorial is based on an article by Itay Shakury.

The *.avro wildcard at the end of the path matches every capture file in the folder.

I used /*/*/*/*/*/* because Event Hubs Capture organizes files into /YEAR/MONTH/DAY/HOUR/MINUTE/SECOND folders.

I am only reading partition ID 0.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro
in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in the real world it's better to specify a schema for the JSON

# do whatever you want with `data`

Source code:
https://gist.github.com/caiomsouza/12e246d3be85f9bc0c060cd20b729016

Preview data

Save the results to Parquet format

Save the results to CSV format

Senior Cloud Solution Architect and Data Scientist @microsoft | PhD Student @unicomplutense (Opinions are my own)