Processing Event Hubs Capture files (Avro format) using Spark (Azure Databricks) and saving to Parquet or CSV format
Oct 22, 2018
In this tutorial, I will demonstrate how to process your Event Hubs Capture files (Avro format) stored in Azure Data Lake Store using Azure Databricks (Spark). This tutorial is based on an article by Itay Shakury.
Event Hubs Capture writes files into /YEAR/MONTH/DAY/HOUR/MINUTE/SECOND folders, which is why the input path below uses /*/*/*/*/*/*: one wildcard per level. The final *.avro wildcard picks up every Avro file in those folders. Note that I am only reading partition 0 here.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'

# storage -> avro
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: the Avro Body column holds the original event payload as bytes
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in the real world it's better to specify a schema for the JSON

# do whatever you want with `data`
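The comment above deserves a word of explanation: letting spark.read.json infer the schema costs Spark an extra pass over the data and can mis-type sparse fields. Here is a minimal sketch of supplying an explicit schema; the field names (deviceId, temperature, timestamp) are hypothetical stand-ins for whatever your devices actually send.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

# Hypothetical schema for the telemetry payload -- replace the fields
# with the ones in your own JSON messages
schema = StructType([
    StructField("deviceId", StringType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("timestamp", LongType(), True)
])

# With an explicit schema, Spark skips the inference pass over jsonRdd
data = spark.read.json(jsonRdd, schema=schema)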
Source code:
https://gist.github.com/caiomsouza/12e246d3be85f9bc0c060cd20b729016
Preview data
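For example, you can print the schema Spark derived from the JSON and look at the first few records:

# Inspect the parsed structure and a sample of rows
data.printSchema()
data.show(10, truncate=False)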
Save the results to Parquet format
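A minimal sketch; the output path below is hypothetical, so point it at a folder in your own mounted storage. Parquet preserves the schema, so it is a good fit for data you will query again from Spark.

# Parquet keeps column types and compresses well; overwrite replaces earlier runs
out_path_parquet = '/mnt/iotsmarthousedatalake/rawdata/sandbox/output/parquet'  # hypothetical path
data.write.mode("overwrite").parquet(out_path_parquet)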
Save the results to CSV format
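Again a sketch with a hypothetical output path. Keep in mind that CSV is a flat format, so if your JSON has nested fields you will need to flatten them before writing.

# header=True writes the column names as the first row of each file
out_path_csv = '/mnt/iotsmarthousedatalake/rawdata/sandbox/output/csv'  # hypothetical path
data.write.mode("overwrite").option("header", "true").csv(out_path_csv)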