The important role of Azure Data Lake

Caio Moreno
3 min read · Oct 17, 2018

Every company should build a data lake. If you want to become a data-driven company, you need one, and these days, with the power of the cloud, there is no excuse not to implement it.

With that in mind, in this tutorial I would like to explain some concepts and show how to create an Azure Data Lake Store.

What is a Data Lake?

According to Wikipedia:

A data lake is a system or repository of data stored in its natural format, usually, object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
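To make the definition more concrete, here is a minimal Python sketch of the idea of keeping data "in its natural format". It is not Azure code: a local folder stands in for the lake, and the names `CATEGORY_BY_EXTENSION`, `categorize`, `ingest_raw`, and the `raw/<source_system>/` layout are hypothetical choices made only for illustration. The extension list is likewise an assumption, loosely following the categories in the quote above.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical mapping from file extension to the categories named in the
# definition above. The extensions chosen here are illustrative assumptions.
CATEGORY_BY_EXTENSION = {
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".jpg": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def categorize(filename: str) -> str:
    """Return the data category for a file name, or 'unknown'."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")


def ingest_raw(source: Path, lake_root: Path, source_system: str) -> Path:
    """Copy a file into the lake unchanged, under raw/<source_system>/."""
    target_dir = lake_root / "raw" / source_system
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # raw copy: the content is not transformed
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        sample = tmp_path / "orders.json"
        sample.write_text('{"order_id": 1, "total": 9.5}')
        stored = ingest_raw(sample, tmp_path / "lake", "webshop")
        print(stored.relative_to(tmp_path), categorize(stored.name))
```

The point of the sketch is that ingestion does not parse or reshape anything. The file lands as-is, and any transformation for reporting or machine learning happens later, on top of the raw copy.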

In the Azure world, we can use Azure Data Lake Store (now named Azure Data Lake Storage Gen1) to implement a cloud data lake.

Azure Data Lake Storage Gen1 is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.

For those who do not know, the term data lake is credited to James Dixon, then CTO of Pentaho.

According to Wikipedia:

James Dixon, then chief technology officer at Pentaho, allegedly coined the term to contrast it with data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers said that data lakes could “put an end to data silos.” In their study on data lakes, they noted that enterprises were “starting to extract and place data for analytics into a single, Hadoop-based repository.”

Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Cloudera, and Amazon now all have data lake offerings.

Why do I need a Data Lake?

We are living in a world awash with expanding amounts of data. Some of it has been generated by business intelligence workloads, and some of it is less structured content that’s produced during manufacturing processes, or by retail point-of-sale devices and an ever-growing number of mobile, intelligent devices. Then, of course, there is the Internet of Things, with its growing number of connected devices continuously streaming out increasing volumes of structured and unstructured data.

This huge wave of data is overwhelming many existing enterprise storage infrastructures, regardless of whether the intent is to store and process the data locally, in a cloud service provider’s data center, or in some combination of the two. “Data lakes” are designed to address this data storage challenge, making the data more useful and accessible, and still allowing enterprises to meet their security, privacy and data governance needs.

Azure Data Lake

Azure Data Lake includes all of the capabilities required to make it easy for developers, data scientists and analysts to store data of any size and shape and at any speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all your data while making it faster to get up and running with batch, streaming and interactive analytics. Azure Data Lake works with existing IT investments for identity, management and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so that you can extend current data applications. We’ve drawn on the experience of working with enterprise customers and running some of the largest-scale processing and analytics in the world for Microsoft businesses such as Office 365, Xbox Live, Azure, Windows, Bing and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that’s ready to meet your current and future business needs.

Written by Caio Moreno

Solutions Architect @databricks | Professor | PhD | Ex-Microsoft | Ex-Avanade/Accenture | Ex-Pentaho/Hitachi | Ex-AOL | Ex-IT4biz CEO. (Opinions are my own)