Hadoop data lake
A Hadoop data lake is a data management platform comprising one or more Hadoop clusters used principally to process and store non-relational data such as log files, internet clickstream records, sensor data, JSON objects, images and social media posts. Such systems can also hold transactional data pulled from relational databases, but they're designed to support analytics applications, not to handle transaction processing themselves.
While the data lake concept can be applied more broadly to include other types of systems, it most frequently involves storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes based on commodity server hardware.
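HDFS tolerates failures of individual commodity servers by replicating each file block across multiple data nodes. A typical `hdfs-site.xml` fragment controls this behavior; the values below are illustrative defaults, not a recommended production configuration:

```xml
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is copied to three data nodes -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB blocks, the common default -->
  </property>
</configuration>
```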
Because of the use of commodity hardware and Hadoop’s standing as an open source technology, proponents claim that Hadoop data lakes provide a less expensive repository for analytics data than traditional data warehouses do. In addition, their ability to hold a diverse mix of structured, unstructured and semi-structured information can make them a more suitable platform for big data management and analytics applications than data warehouses based on relational software are.
However, a Hadoop data lake can be used to complement an enterprise data warehouse rather than to supplant it entirely. A Hadoop cluster can offload some data processing work from an EDW and host new analytics applications; filtered data sets or summarized results can then be sent to the data warehouse for further analysis by business users.
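The offload pattern described above can be sketched in plain Python. The record fields and the aggregation are hypothetical; in practice this summarization step would run on the cluster itself, in MapReduce or Spark, before the results are loaded into the warehouse:

```python
from collections import defaultdict

# Hypothetical raw clickstream records as they might sit in the data lake.
raw_events = [
    {"user": "u1", "page": "/home", "date": "2015-05-01"},
    {"user": "u2", "page": "/home", "date": "2015-05-01"},
    {"user": "u1", "page": "/cart", "date": "2015-05-02"},
]

def summarize(events):
    """Reduce raw events to daily page-view counts -- the kind of
    filtered, summarized result set sent on to the data warehouse."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["date"], e["page"])] += 1
    return dict(counts)

summary = summarize(raw_events)
# e.g. {("2015-05-01", "/home"): 2, ("2015-05-02", "/cart"): 1}
```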
The contents of a Hadoop data lake need not be immediately incorporated into a formal database schema or consistent data structure, enabling users to store raw data as is; information can then either be analyzed in its raw form or prepared for specific analytics uses as needed. As a result, data lake systems tend to employ extract, load and transform (ELT) methods for collecting and integrating data, instead of the extract, transform and load (ETL) approaches typically used in data warehouses. Data can be extracted and processed outside of HDFS using MapReduce, Spark and other data processing frameworks.
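This "schema on read" idea, which stores raw data now and imposes structure only at query time, is the heart of the ELT approach. A minimal stdlib-only sketch, with invented sensor records for illustration:

```python
import json

# "Load": raw JSON lines land in the lake untouched; no schema is enforced.
raw_lines = [
    '{"sensor": "t-17", "temp_c": 21.5, "ts": 1431475200}',
    '{"sensor": "t-17", "temp_c": "bad", "ts": 1431475260}',  # malformed value kept as-is
]

# "Transform" happens at read time, per analytics use case.
def read_temperatures(lines):
    """Parse on read; skip records that don't fit this query's schema."""
    for line in lines:
        rec = json.loads(line)
        if isinstance(rec.get("temp_c"), (int, float)):
            yield rec["sensor"], rec["temp_c"]

print(list(read_temperatures(raw_lines)))  # [('t-17', 21.5)]
```

A different analytics job could reread the same raw lines with its own parsing rules, which is why the malformed record is kept rather than discarded at load time.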
Potential uses are varied. For example, Hadoop data lakes can pool varied legacy data sources, collect network data from multiple remote locations and serve as a way station for data that is overloading another system.
The Hadoop data lake isn't without its critics or challenges for users. Spark, as well as the Hadoop framework itself, can support file architectures other than HDFS. Meanwhile, data warehouse advocates contend that similar architectures — for example, the data mart — have a long lineage and that Hadoop and related open source technologies still need to mature significantly to match the functionality and reliability of data warehousing environments. Even experienced Hadoop data lake users say that a successful implementation requires a strong architecture and disciplined data governance policies; without those things, they warn, data lake systems can become out-of-control dumping grounds.
This was last updated in May 2015