Organizations are modernizing their data architectures which ensures agility increases, time to insight as soon as data gets generated is shortened and importantly it provides a scalable platform for storage and processing. Data lakes are becoming a corner stone in their journey towards digital transformation.
An Enterprise Data Warehouse (EDW) is used to store important business data. The EDW is designed to capture the essence of the business from other enterprise systems such as customer relationship management (CRM), inventory, and sales transactions systems, and allow analysts and business users to gain insight and make important business decisions from that data. But new technologies—including streaming and social data from the Web or from connected devices on the Internet of things (IoT)—is driving much greater data volumes, higher expectations from users, and a rapid globalization of economies. Organizations are realizing that traditional EDW technologies can’t meet their new business needs. As a result, many organizations are turning to Data Lakes !
A data lake is a central location in which to store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools—typically tools in the extended Hadoop family—to extract value quickly and inform key organizational decisions. Because all data is welcome, data lakes are an emerging and powerful approach to the challenges of data integration in a traditional EDW, especially as organizations turn to mobile and cloud-based applications and the IoT.
Functions of a Data Lake
- Scalable / extensible to capture both batch and streaming data
- Provide capability to business logic, filters, validation, data quality, routing etc. business requirements
Storage & Retention
- Depending on the requirements data is placed into a repository
- Metadata management
- Policy based Data retention is provided
- Processing is provided for both batch and near-real time use cases
- Provisioning workflows for repeatable data processing
- Provide late data arrival handling
- Dashboard and applications that provide valuable business insights
- Data is made available to consumers using API, MQ feed, DB access
The main advantage of this architecture is that data can come into the data lake from anywhere, including online transaction processing (OLTP) or operational data store (ODS) systems, an EDW, logs or other machine data, or from cloud services. These source systems include many different formats, such as file data, database data, ETL, streaming data and even data coming in through APIs.
The data is first loaded into a transient loading zone, where basic data quality checks are performed using MapReduce or Spark by leveraging the Hadoop cluster. Once the quality checks have been performed, the data is loaded into Hadoop in the raw data zone, and sensitive data can be redacted so it can be accessed without revealing sensitive or vulnerable data.
Data scientists and business analysts alike dip into this raw data zone for sets of data to discover. An organization can, if desired, perform standard data cleansing and data validation methods and place the data in the trusted zone. This trusted repository contains both master data and reference data. Master data is the basic data sets that have been cleansed and validated. For example, a healthcare organization may have master data sets that contain basic member information (names, addresses,) and members’ additional attributes (dates of birth, social security numbers). An organization needs to ensure that this reference data kept in the trusted zone is up to date using change data capture (CDC) mechanisms. Reference data, on the other hand, is considered the single source of truth for more complex, blended data sets. For example, that healthcare organization might have a reference data set that merges information from multiple source tables in the master data store, such as the member basic information and member additional attributes to create a single source of truth for member data. Anyone in the organization who needs member data can access this reference data and know they can depend on it.
From the trusted area, data moves into the discovery sandbox, for wrangling, discovery, and exploratory analysis by users and data scientists. Finally, the consumption zone exists for business analysts, researchers, and data scientists to dip into the data lake to run reports, do “what if ” analytics, and otherwise consume the data to come up with business insights for informed decision-making.
Most importantly, underlying all of this must be an integration platform that manages, monitors, and governs the metadata, the data quality, the data catalog, and security. Although companies can vary in how they structure the integration platform, in general, governance must be a part of the solution.
Metadata is extraordinarily important to managing your data lake. An integrated data lake management platform makes metadata creation and maintenance an integral part of the data lake processes. This is essential, as without effective metadata, data dumped into a data lake may never be seen again. Metadata is critical for making sure data is leveraged to its fullest. Whether manually collected or automatically created during data ingestion, metadata allows your users to locate the data they want to analyze. It also provides clues for future users to understand the contents of a data set and how it could be reused.
Sometimes you need to see what data sets exist in the data lake, the properties of those data sets, the ingestion history of the data set, the data quality, and the key performance indicators (KPIs) of the data as it was ingested. You should also see the data profile, and all the metadata attributes, including those that are business, technical, and operational. All of these things need to be abstracted to a level to where the user can understand them, and use that data effectively—this is where the data lake catalog comes in. Your management platform should make it easy to create a data catalog, and to provide that catalog to business users, so they can easily search it—whether searching for source system, schema attributes, subject area, or time range. With a data catalog, users can find data sets that are curated, so that they don’t spend time cleaning up and preparing the data. This has already been done for them, particularly in cases of data that has made it to the trusted area. Users are thus able to select the data sets they want for model building without involving IT, which shortens the analytics timeline.
An important part of the data lake architecture is to first put data in a transitional or staging area before moving it to the raw data repository. It is from this staging area that all possible data sources, external or internal, are either moved into Hadoop or discarded. As with the visibility of the data, a managed ingestion process enforces governance rules that apply to all data that is allowed to enter the data lake. Governance rules can include any or all of the following:
If data needs to be protected by encryption—if its visibility is a concern—it must be encrypted before it enters the data lake.
Provenance and lineage
It is particularly important for the analytics applications that business analysts and data scientists will use down the road that the data provenance and lineage is recorded. You may even want to create rules to prevent data from entering the data lake if its provenance is unknown.
A managed ingestion process allows you to set governance rules that capture the metadata on all data before it enters the data lake’s raw repository.
You can also set data cleansing standards that are applied as the data is ingested in order to ensure only clean data makes it into the data lake.
Enterprises are also looking ahead. They see that to be truly valuable, the data lake can’t be a silo, but must be one of several platforms in a carefully considered end-to-end modern enterprise data architecture. Just as you must think of metadata from an enterprise-wide perspective, you need to be able to integrate your data lake with external tools that are part of your enterprise-wide data view. Only then will you be able to build a data lake that is open, extensible, and easy to integrate into your other business-critical platforms.
Reporting – Dashboarding & Business Use Case Implementation
Once data has been provisioned in the Data Lake, you will then need to start deriving value of the data from the platform that has been implemented. This is the time when Data Analysts / Data Scientists will derive value of the data either from the Refined Zone or Sandbox to meet business goals.
Various third-party reporting tools such as Tableau, QlikView, Power BI etc. can be used for reporting & dashboarding purposes. The tools can connect through an ODBC Connection with the on-premise or cloud data lake to display appealing visualizations as shown below:
Data Scientists however can utilize the data from any of the zones to implement a business use case. One of the common use cases is that of Customer 3600 which is described below:
Customer Master Data from disparate data sources is ingested into the Data Lake, transformed as per business rules, mastered using Machine Learning (ML) techniques to form golden records, stitched with data coming in from transactional systems to form a Customer Data Hub.
Vice President – Technology
Parag is heading the Big Data, Data Warehousing and BI initiatives across the USA, India, Middle East and Africa. Parag carries more than 25 years of experience in the IT Industry and has PMP, Oracle Hyperion and SAP Supply Chain Certifications to his credit.