Why Data Lake and Databricks?

2 min read · Feb 19, 2019

I was asked "Why Data Lake and Databricks?" and wrote the text below. Maybe it can help you as well.

A Data Lake gives you the capability to store different types of data (unstructured, semi-structured and structured).

Definition:

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. … The term data lake is often associated with Hadoop-oriented object storage.

The term was coined some years ago by James Dixon, Pentaho CTO.

The first data lakes were built on Hadoop. Microsoft's Azure Data Lake Store is not Hadoop itself but a Hadoop-compatible store offered as a managed service, so tools written for HDFS can work with it, and it comes with some very nice features and integrations with Azure services in general.

With Azure Data Lake Store you can have a Hadoop-compatible data lake on the Microsoft cloud (PaaS) very easily. Storage scales practically without limits; you just pay more as you store more. That’s great. Believe me, managing an on-premises Hadoop cluster is a big pain, and Microsoft made this easy and simple.

You can use a Data Lake as the raw area, where you keep your data without modification. You can also transform your data and sink the results back to the data lake. Both can be part of your Modern Data Warehouse. This gives you the flexibility to onboard data much more easily and to handle data in many different formats and sizes.
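To make the raw area and the transform-and-sink idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: a local temporary folder stands in for the data lake, and the zone names (`raw`, `curated`), the file name `sales.jsonl`, and the helper functions are all hypothetical. In a real setup you would more likely do this with Databricks or Azure Data Factory against Azure Data Lake Store.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, name: str, payload: str) -> Path:
    """Store the source payload unchanged in the raw zone."""
    raw_path = lake / "raw" / name
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(payload, encoding="utf-8")
    return raw_path


def transform_to_curated(lake: Path, raw_path: Path) -> Path:
    """Read a raw JSON-lines file and sink a cleaned CSV to the curated zone."""
    curated_path = lake / "curated" / (raw_path.stem + ".csv")
    curated_path.parent.mkdir(parents=True, exist_ok=True)
    lines = raw_path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    with curated_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["customer", "amount"])
        writer.writeheader()
        for row in rows:
            # Clean up names and convert amounts to numbers.
            writer.writerow({
                "customer": row["customer"].strip().title(),
                "amount": float(row["amount"]),
            })
    return curated_path


# Hypothetical example data; a temp folder stands in for the lake root.
lake = Path(tempfile.mkdtemp())
payload = (
    '{"customer": " alice ", "amount": "10.5"}\n'
    '{"customer": "BOB", "amount": "4"}\n'
)
raw_file = land_raw(lake, "sales.jsonl", payload)
curated_file = transform_to_curated(lake, raw_file)
```

The raw file is never touched by the transformation, so you can always re-run or change the transformation later without going back to the source system.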

Databricks is a unified tool for Data Engineers and Data Scientists: one tool to handle many different use cases (transformations, big data, and machine learning).

For aggregate reports, you will usually need a database such as Azure SQL DB (SMP, symmetric multiprocessing) or Azure SQL Data Warehouse (MPP, massively parallel processing). Power BI can connect to Azure Data Lake, but for management reports with aggregate data, you usually connect Power BI to Azure SQL DB or Azure SQL DW.
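As a rough sketch of what "aggregate data in a database" means here, the snippet below pre-aggregates detail rows into a small summary table that a reporting tool would query. It uses Python's built-in `sqlite3` purely as a stand-in for Azure SQL DB or Azure SQL DW; the table and column names (`sales`, `sales_by_region`) and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite is only a stand-in for Azure SQL DB / Azure SQL DW.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 70.0)],
)

# Pre-aggregate the detail rows into the table the report reads from.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM sales
    GROUP BY region
    """
)

report = conn.execute(
    "SELECT region, total_amount, order_count "
    "FROM sales_by_region ORDER BY region"
).fetchall()
print(report)
```

The report then reads a handful of summary rows instead of scanning every detail record in the lake.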

SSIS is an older technology, but if you really want to keep your SSIS packages, you can run them in Azure Data Factory v2 using the Azure-SSIS integration runtime.


Written by Caio Moreno
