
Processing Event Hubs Capture files (Avro format) using Spark (Azure Databricks) and saving to Parquet or CSV format

2 min read · Oct 22, 2018

In this tutorial I will demonstrate how to process your Event Hubs Capture files (Avro) stored in Azure Data Lake Store using Azure Databricks (Spark).

This tutorial is based on an article by Itay Shakury.

If you need to read multiple capture files, end the path with the *.avro wildcard.

I used /*/*/*/*/*/* because Event Hubs Capture organizes the files into a /YEAR/MONTH/DAY/HOUR/MINUTE/SECOND folder structure.
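For example, with the mount used in the code below, a single capture file would sit at a path like this (the date and time segments are just an illustration):

/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/2018/10/22/09/15/30.avro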


I am only reading partition 0 (the 0 segment in the path); to read every partition, replace it with another * wildcard.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro
in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: the original event payload lives in the Body column
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in the real world it's better to specify a schema for the JSON

# do whatever you want with `data`
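As the comment above says, production jobs should pass an explicit schema rather than letting Spark infer one from the JSON. A minimal sketch, assuming a hypothetical IoT payload with deviceId, temperature and humidity fields (replace these with the fields your events actually carry):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for the smart-house events; adjust to your payload
schema = StructType([
    StructField("deviceId", StringType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True),
])

data = spark.read.schema(schema).json(jsonRdd)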

Source code:
https://gist.github.com/caiomsouza/12e246d3be85f9bc0c060cd20b729016

Preview data

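printSchema and show work in any Spark environment; display renders an interactive table in a Databricks notebook:

# Inspect the inferred schema and a few rows
data.printSchema()
data.show(5, truncate=False)

# Databricks notebooks only: interactive table
display(data)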

Save the results to Parquet format

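A minimal sketch of the Parquet write, assuming a hypothetical output folder on the same mount:

# Hypothetical output path; overwrite replaces any previous run
out_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/parquet'
data.write.mode("overwrite").parquet(out_path)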

Save the results to CSV format

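The CSV write is just as short; the header option writes the column names as the first row (output path again hypothetical):

# Hypothetical output path
csv_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/csv'
data.write.mode("overwrite").option("header", "true").csv(csv_path)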


Written by Caio Moreno

Solutions Architect @databricks | Professor | PhD | Ex-Microsoft | Ex-Avanade/Accenture | Ex-Pentaho/Hitachi | Ex-AOL | Ex-IT4biz CEO. (Opinions are my own)
