Processing Event Hubs Capture files (Avro format) using Spark (Azure Databricks) and saving them in Parquet or CSV format

from pyspark.sql import SparkSession

# 'false' tells the Avro reader not to skip input files that lack the .avro extension
spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()

# storage -> avro: glob over the date/time subfolders under partition 0
in_path = '/mnt/iotsmarthousedatalake/rawdata/sandbox/eventhubiotsmarthouse/eventhubiotsmarthouse/eventhubiotsmarthouse/0/*/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)

# avro -> json: the event payload sits in the binary Body column, so cast it to a string
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in the real world it's better to specify a schema for the JSON

# do whatever you want with `data`
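
The title mentions saving to Parquet or CSV, a step the snippet above stops short of. Here is a minimal sketch of that last step, assuming `data` is the DataFrame built above. The helper name `save_as_parquet_and_csv`, the output folder layout, and the overwrite mode are my own assumptions for illustration, not part of the original code.

```python
def save_as_parquet_and_csv(df, out_dir):
    """Write a Spark DataFrame to <out_dir>/parquet and <out_dir>/csv.

    Hypothetical helper: the subfolder names and the "overwrite" mode
    are assumptions, adjust them to your own storage layout.
    """
    base = out_dir.rstrip("/")
    parquet_path = base + "/parquet"
    csv_path = base + "/csv"

    # Parquet is columnar and keeps the column types, including nested fields
    df.write.mode("overwrite").parquet(parquet_path)

    # CSV is flat text; the header option writes the column names as the first line
    df.write.mode("overwrite").option("header", "true").csv(csv_path)

    return parquet_path, csv_path
```

It could be called as `save_as_parquet_and_csv(data, '/mnt/<your-mount>/output')`, where the path is a placeholder. Note that Spark's CSV writer does not support struct, array, or map columns, so if the JSON payload is nested, the CSV step would likely need the nested fields selected out into flat columns first.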

Senior Cloud Solution Architect and Data Scientist @microsoft | PhD Student @unicomplutense (Opinions are my own)

Caio Moreno
