Processing Event Hubs Capture files (AVRO Format) using Spark (Azure Databricks), save to Parquet or CSV format

Prof. Dr. Caio Moreno
2 min read · Oct 22, 2018

In this tutorial I will demonstrate how to process your Event Hubs Capture (Avro files) located in your Azure Data Lake Store using Azure Databricks (Spark).

This tutorial is based on an article by Itay Shakury.

If you need to read all the Avro files, use *.avro as the file name pattern.

I used /*/*/*/*/*/* because Event Hubs Capture organizes the files under /YEAR/MONTH/DAY/HOUR/MINUTE/SECOND.

I am only reading files from partition 0 (PartitionId 0).
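To make the wildcard pattern concrete, here is a small sketch of how it maps onto the Capture folder layout, assuming the default Capture naming convention ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}); the base path below is just the one used in this tutorial:

```python
# Base path of the Capture output in the mounted data lake (example from this tutorial)
base = "/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse"
partition_id = "0"

# Five directory wildcards (year, month, day, hour, minute)
# plus "*.avro" matching the per-second file name
glob_pattern = "/".join([base, partition_id] + ["*"] * 5) + "/*.avro"
print(glob_pattern)
```

Spark expands this glob and reads every Avro file for partition 0, regardless of when it was captured.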

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# Capture files for partition 0; the wildcards cover /YEAR/MONTH/DAY/HOUR/MINUTE/SECOND
in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'

# Read the Avro files written by Event Hubs Capture
avroDf ="com.databricks.spark.avro").load(in_path)

# The event payload is in the Body column as bytes; cast it to string
jsonRdd ="string")) x: x[0])

# In the real world it's better to specify a schema for the JSON
data =

# do whatever you want with `data`

Source code:

Preview data

Save the results to Parquet format

Save the results to CSV format




Solutions Architect and Data Scientist @databricks | Adjunct Professor at @IEuniversity | PhD @unicomplutense (Opinions are my own)