Secure Data Sharing with Snowflake

OLAP System: With operational databases, data practitioners store data in the OLTP (Online Transaction Processing) database to facilitate transaction with consumer applications but not to analyze it. With OLAP systems, when a user wants to analyze the data and create data visualizations to show it in the dashboard, this helps different LOB’s (Line of Business) in the company to make better decisions.

With growing needs of data platforms, organizations started hiring different sets of data professionals:

  • Data Engineers: Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. Their goal is to make data accessible so that organizations can use it to evaluate and optimize their performance.
  • Data Scientists: A data scientist is responsible for collecting, analyzing, and interpreting extremely large amounts of data. They use advanced analytics technologies, including machine learning and predictive modeling for making better decision and useful suggestion for future marketing.
  • Data Analyst: A data analyst collects, cleans, and interprets data sets to answer a question or solve a problem. They work in many industries, including business, finance, criminal justice, science, medicine, and government.

To generate the Dashboard faster for data visualization and quick decision making for business, data needs to be store in such a way, which provide performance for above processes. Also, OLTP system should not have performance issues due to large analytical queries. So, the need is to build a system with analytics query in mind, OLAP (Online Analytical Processing) systems addresses these needs. This specialized platform/system called ‘Data Warehouses,’ which come with huge costs for storage and compute system.

But operational data is siloed in the OLTP system, and we need data in OLAP system, so to load data in OLAP system we need to follow below processes:

  • Export data from OLTP system
  • Use Logic to transform them into the format for analytical need.
  • Load transform data in OLTP system
  • Data Engineers create and maintain these ETL (Export, Transform, Load) tools while, Data Scientist and Data Analysts can make use of up-to-date data in OLAP system.

Introduction of Data Lake:

Now the problem is, unstructured data (Audio, chat, video, email etc.), which cannot be loaded in OLAP system. Also, cost is the biggest challenge when we need performance for large-scale analytics.

We need intermediary staging area where we can store all structured, semi-structured and un-structured data, which is known as Data Lake.

Data Lake and data warehouse had to be on-premise, so organization had to spend large amount to buy powerful computer and faster storage to power up Analytics System, which added cost on data warehouse licenses and support. Now, organization started using open-source Hadoop with lesser cost. Nowadays data lake and data warehouse storage come with main cloud providers AWS, Azure and GCP.

Data Engineer started using ETL to store data from OLAP to Data Lake. Semi-structured and un-structured data are also coming and storing into Data Lake and do another ETL to load sub-set of mission critical data in data warehouse.

 

 

Now we have three copies of Data

  • The original data in OLTP System
  • Copy in our Data Lake
  • Now data in OLAP System

This is the biggest challenge for Data Engineer to keep track of all the Data copy and keep in sync with real time data.
Another problem is, you cannot give all access of all your data to all consumers to maintain governance, regulatory and security compliances of the enterprise.
Data Engineer create Data Mart, which is sub-set of data warehouses. But this adding up more storage cost and more copies of same data.

Data Analysts extract the data and export it in their local computers/laptops for faster performance on their favorite dashboard. Now we have another copy of data that Data Engineer is not aware about. If you have more Data Engineer in your organization, more copy of data which may not in sync.

Data Scientists also extract data from Data Lake and keep the data in their local system for faster performance, so, more copy of data which Data Engineer are not aware about. This problem with growth of data copies is known as Data Drift.

In the current digital era, enterprises everywhere need to share data near to real time.

Industry Impact:

Healthcare providers create a single source of patient data, including electronic health records along with clinical, administrative, operational, and Internet of Medical Things (IoMT) data. Similarly, Payers share your claims, billing, and other data with providers, and share formulary and diagnostics data with pharmacy and lab partners. Life sciences companies Share your clinical, sales, marketing, and other data across business units and externally with research collaborators. Research institutions: Aggregate and share health-outcomes data collected from providers, life sciences companies, and other partners.

Financial services companies use the Data Cloud to centralize their data for deeper insights into their institution, securely access second and third-party data for broader customer and investor views, reduce fraud and risk exposure, and digitize and automate processes to focus on delivering high-value services. Financial institutions, financial services partners, financial data providers and the customer experience shared data among themselves and create a 360-degree view for a customer.

Securely shared data for the retail supply chain:

The Data Cloud enables retailers to centralize their data and securely share live data with supply chain partners while seeking to optimize pricing and inventory strategies, increase margins, and ensure consumer privacy. Retailers centralize siloed data across your supply chain, inventory, point of sale, CRM, customer loyalty, and marketing analytics systems. CPG organizations access transaction data from your retailers, combine it with other data, analyze it, and provide back brand management insights. Suppliers access omnichannel sales data from retailers to predict product quantities and minimize out-of-inventory or overstock events. The customer experience: Deliver seamless and personalized experiences to customers by enabling access to data across retail and CPG businesses.

Secure Data Sharing through Snowflake resolves the above-mentioned issues, Users can also share selected objects in their database with other Snowflake or external accounts.

Shared data, namely Snowflake objects encapsulate the complete information that must be shared from database. Each share consists of:

  • Privileges granting access to the database(s)
  • Schema containing objects to share.
  • Privileges granting access to specific objects in the database.
  • Details of consumer accounts with which the database and its objects can be shared.

With a flexible architecture, Snowflake creates a network of providers that can share data with multiple consumers (including internal users) and consumers that can access shared data from multiple providers:

In the shared database, all objects are accessible to users in the consumer account:

 

Shares are secure, configurable, and 100% controlled by the provider account:

  • New objects added to a share are available to all consumers thus, providing real-time access to shared data.
  • Access to a share (or any objects from the shared database) can be revoked at any time.

Data can be shared only between Snowflake accounts. to the data provider may share data with a consumer who does not already have a Snowflake account and/or is not a licensed Snowflake customer.

To facilitate sharing data with these consumers, Snowflake creates reader accounts to support providers. Reader accounts (formerly known as “read-only accounts”) offer quick, easy, and cost-effective methods to share data without consumers becoming a Snowflake customer.

Each reader account belongs to the provider account that created it. Like the standard consumer accounts, the provider account uses shares to share databases with reader accounts.  A reader account consumes data received only from the provider account that created it:

 

 

Users in a reader account can raise a query on the share data but cannot perform any of the DML tasks, which are available in a full account (data loading, insert, update, etc.).

Source: https://docs.snowflake.com/en/user-guide/data-sharing-intro.html

Technical demonstration will continue in next article.