
What Is a Data Lake and Why Is It Essential for Big Data?

November 30, 2018
by Devin Pickell

If you’re someone like me, you enjoy structure, neatness, and simplicity. 

But in some cases, it’s best to step back and allow organized chaos to unfold. This is the basis of something called a data lake.

What is a data lake?

A data lake is a vast repository that holds data in its raw, native format until it’s needed. In simpler terms, all types of data generated by both humans and machines can be loaded into a data lake for classification and analysis later on.

Data warehouses, on the other hand, require data to be properly structured before any work can get done.

To get a deeper understanding of data lakes and why they’re the optimal candidate for housing big data, it’s important to dive into what makes them so different from data warehouses.


Data lake vs. data warehouse

Both data lakes and data warehouses are repositories for data. That’s about the only similarity between the two. Now, let’s touch on some of the key differences:

  1. Data lakes are designed to support all types of data, whereas data warehouses make use of highly structured data – in most cases.
  2. Data lakes store all data that may or may not be analyzed at some point in the future. This principle doesn’t apply to data warehouses since irrelevant data is typically eliminated due to limited storage.
  3. The scale between data lakes and data warehouses is drastically different due to our previous points. Supporting all types of data and storing that data (even if it’s not immediately useful) means data lakes need to be highly scalable.
  4. Thanks to metadata (data about data), users working with a data lake can gain basic insight about the data quickly – a sample metadata record is sketched after this list. In data warehouses, accessing the data often requires a member of the development team, which can create a bottleneck.
  5. Lastly, the intense data management required for data warehouses means they’re typically more expensive to maintain compared to data lakes.
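
To make the metadata point concrete, here is a minimal sketch of what a single metadata record for one raw file in a lake might look like. The field names and values are purely illustrative and not tied to any particular tool:

```python
# A single metadata record ("data about data") describing one raw object in the lake.
# All field names and values below are illustrative assumptions.
call_recording_metadata = {
    "object_key": "raw/call-recordings/2018/11/29/agent-1042.wav",
    "source_system": "support-call-center",
    "format": "wav",
    "size_bytes": 48213902,
    "ingested_at": "2018-11-29T14:07:31Z",
    "tags": ["customer-support", "audio", "unprocessed"],
}
```

A catalog of records like this is what lets analysts browse the lake and find relevant data without asking a developer to dig through the underlying storage.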

James Dixon, founder and Chief Technology Officer of Pentaho, coined the term “data lake” after providing an analogy differentiating data lakes from data warehouses.

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state,” said Dixon. “The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”


Data lake architecture

So, how are data lakes capable of storing such vast and diverse amounts of data? What is the underlying architecture of these massive repositories?

Data lakes are built upon a schema-on-read data model. A schema is essentially the skeleton of a database outlining its model and how data will be structured within it. Think of a blueprint.

The schema-on-read data model means you can load your data in the lake as-is without having to worry about its structure. This allows for much more flexibility.

Data warehouses, on the other hand, are built on schema-on-write data models. This is a much more traditional model for databases.

Every set of data, every relationship, and every index in the schema-on-write data model must be clearly defined ahead of time. This limits flexibility, especially when adding in new sets of data or features that could potentially create gaps within the database.
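
To illustrate the difference, here is a minimal sketch using PySpark. The file paths and column names are hypothetical; the point is that schema-on-read defers structure until the data is queried, while schema-on-write declares it before anything is stored:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-read: land the raw JSON as-is and let Spark work out the structure
# at the moment the data is read. Nothing had to be defined up front.
raw_events = spark.read.json("s3://example-data-lake/raw/clickstream/")  # hypothetical path
raw_events.printSchema()

# Schema-on-write: the structure is declared ahead of time, the way a warehouse
# table would require, and the data is written out to match it.
events_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("event_time", TimestampType(), False),
    StructField("page", StringType(), True),
    StructField("duration_sec", DoubleType(), True),
])
curated = spark.read.schema(events_schema).json("s3://example-data-lake/raw/clickstream/")
curated.write.mode("overwrite").parquet("s3://example-warehouse/clickstream/")  # hypothetical path
```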


The schema-on-read data model acts as the backbone of a data lake, but the processing framework (or engine) is how data actually gets loaded into one.

Below are the two processing frameworks that “ingest” data into data lakes:

  • Batch processing – Millions of blocks of data processed over long periods of time (hours to days). This is the least time-sensitive method for processing big data.
  • Stream processing – Small batches of data processed in real time. Stream processing is becoming increasingly valuable for businesses that harness real-time analytics.

Hadoop, Apache Spark, and Apache Storm are among the more commonly used big data processing tools that are capable of either batch or stream processing.

Some tools are particularly useful for processing unstructured data such as sensor activity, images, social media posts, and internet clickstream activity. Other tools prioritize processing speed and usefulness with machine learning programs.
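
As a rough sketch of the two ingestion styles, here is what a batch load and a streaming load of the same log data could look like in PySpark. The paths and column names are again hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# Batch processing: read a full day's dump of logs in one job, e.g. a nightly load.
daily_logs = spark.read.json("s3://example-data-lake/raw/logs/2018-11-29/")  # hypothetical path
daily_logs.groupBy("status").count().show()

# Stream processing: treat new files landing in the lake as a continuous source
# and process them in small micro-batches as they arrive.
live_logs = (
    spark.readStream
    .schema(daily_logs.schema)  # streaming file sources need an explicit schema
    .json("s3://example-data-lake/raw/logs/incoming/")
)
query = (
    live_logs.groupBy("status").count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```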

Once the data is processed and ingested into the data lake, it’s time to make use of it.

What are data lakes used for?

Data warehouses rely on structure and clean data, whereas data lakes allow data to remain in its most natural form. This is because advanced analytics tools and mining software can take in raw data and transform it into useful insights.

Big data analytics

Big data analytics dives into a data lake to uncover patterns, market trends, and customer preferences that help businesses make informed predictions faster. This is done through four types of analysis.

  • Descriptive analysis – A retrospective analysis looking at “where” a problem may have occurred for a business. Most big data analytics today are actually descriptive because they can be generated quickly.
  • Diagnostic analysis – Another retrospective analysis looking at “why” a specific problem may have occurred for a business. This is slightly more in-depth than descriptive analytics.
  • Predictive analysis – When AI and machine learning software are applied, this analysis can provide an organization with predictive models of what may occur next (a minimal sketch follows this list). Because of the complexity of generating predictive analyses, this approach isn't widely adopted yet.
  • Prescriptive analysis – The future of big data analytics is prescriptive analysis, which not only assists in decision-making efforts but may even provide an organization with a set of answers. These analyses rely heavily on machine learning.
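
As a minimal illustration of the predictive case, here is what training a simple churn model on an extract pulled from the lake might look like with scikit-learn. The file name and column names are assumptions made for the sake of the example:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical extract from the lake: past customer behavior plus a known churn label.
history = pd.read_parquet("customer_history.parquet")  # assumed file name
X = history[["monthly_visits", "support_tickets", "days_since_last_purchase"]]
y = history["churned"]

# Learn from past outcomes, then check how well the model predicts unseen customers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```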

Data mining

Data mining is defined as “knowledge discovery in databases,” and is how data scientists uncover previously unseen patterns and truths through various models.

For example, a clustering analysis is a data mining technique that can be applied to a data set within a data lake, grouping large amounts of data together based on similarity.
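
A minimal sketch of that idea with scikit-learn might look like the following; the file name and feature names are made up for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical extract pulled from the lake: one row per customer.
customers = pd.read_parquet("customer_activity.parquet")  # assumed file name
feature_cols = ["monthly_visits", "avg_order_value", "days_since_last_purchase"]

# Scale the features so no single column dominates, then group customers
# into a handful of segments based on similarity.
scaled = StandardScaler().fit_transform(customers[feature_cols])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)

# Inspect the average behavior of each discovered segment.
print(customers.groupby("segment")[feature_cols].mean())
```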

Through data visualization tools, data mining helps clear up the chaotic nature of unstructured, raw forms of data.

Data lake challenges

Data lakes may be flexible, scalable, and quick to load, but those advantages come at a price.

Ingesting unstructured data as-is often means a lack of data governance and of the processes that ensure the right data is being looked at. For most businesses – especially those that have yet to adopt big data – having unorganized, uncleaned data isn’t an option.

Mismanaging the metadata or the processes that keep the data lake in check can actually lead to something called a data swamp. You wouldn’t go swimming in a swamp, would you?

There’s also the issue of data security.

Data lakes are a fairly new concept in IT, which means some of the tools are still working out the security kinks. One of these kinks is ensuring only the right people have access to sensitive data loaded into the lake.

But like any new technology, these issues will resolve with time.

TIP: Ready to take a deeper dive into the data world? Learn the basics of master data management (MDM) and why it's important for businesses.

The role of data lakes with big data

Despite some of the challenges of data lakes, the fact remains that more than 80 percent of all data is unstructured. As more businesses turn to big data for future opportunities, the application of data lakes will rise.

Unstructured data like social media posts, phone call recordings, and clickstream activity contains valuable information that cannot readily be housed in data warehouses.

While data warehouses are strong in structure and security, big data simply needs to be unconfined so it can flow freely into data lakes.

Check out our full guide on structured vs unstructured data for a more in-depth explanation or read up on the importance of big data engineering.


Devin Pickell

Devin is a former senior content specialist at G2. Prior to G2, he helped scale early-stage startups out of Chicago's booming tech scene. Outside of work, he enjoys watching his beloved Cubs, playing baseball, and gaming. (he/him/his)