Data Warehouse

What is Business Intelligence?

BI is the act of transforming raw/operational data into useful information for business analysis

How Does it Work?

  1. BI based on Data Warehouse technology extracts information from a company’s operational systems
  2. The data is transformed (cleaned and integrated), and loaded into Data Warehouse
  3. Since this data is credible, it is used for business insights.

Why Data Warehouse?

  1. Data collected from various sources and stored in various databases (Oracle, SQL server, MySQL…) cannot be directly visualized
  2. The data first needs to be integrated and then processed before visualization takes place.

What is Data Warehouseing?

  1. A central location where consolidated data from multiple locations (databases) are stored
  2. DWH is maintained separately from an organization’s operational database. (DWH is another copy)
  3. End uses access it whenever any information is needed.
  4. Data Warehouse is not loaded every time new data is added to databases.

What are The Advantages of a Data Warehouse?

  1. Strategic questions can be answered by studying trends (from past data).
  2. Data Warehousing is faster and more accurate
  3. Data Warehouse is not a product that a company can go and purchase, it needs to be designed and depends entirely on the company’s requirement.

End User
=> Take the data from opeartional systems
=> Integrate the data from multiple sources
=> Standardize the data and remove inconsistencies
=> Store the data in format suitable for easy access
=> Return result to end user

Properties of a Data Warehouse

A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of managemnet’s decision-making process

  1. Subject-oriendted: Data is categorized and stored by business subject rather than by application. (data in data warehouse are suitable for business requirements)
  2. Integrated: Data on a given subject is collected from disparate sources and stored in a single place
  3. Time-variant: Data is stored as a series of snapshots, each representing a period of time
  4. Non-volatile: Typically data in the data warehouse is not updated or deleted

Key Terminologies Related to DWH Architechture

OLTP (Online Transaction Processing) vs OLAP (Online Analytical Processing)

Relational Database (OLTP) Analytical Data Warehouse (OLAP)
Contains current data Contains historical data
Useful in running the business Useful in analyzing the business
Based on Entity Relationship Model Based on Star, Snowflake and Fact Constellation Schema
Provides primitive and highly detailed data Provides summarized and consolidated data
Used for writing data into the database Used for reading data from data warehouse
Database size ranges from 100MB to 1GB Data warehouse size ranges from 100GB to 1TB
Fast, provides high performance Highly flexible, but not fast
Number of records accessed is in tens Number of records accessed is in millions
Example: All bank transactions made by a customer Example: Bank transactions made by a customer at a particular time

OLTP Examples:

  1. A supermarket server which records every single product purchased at that market
  2. A back server which records every time a transaction is made for a particular account
  3. A railway reservation server which records the transactions of a passenger

OLAP Examples:

  1. Bank manager wants to know how many customers are utilizing the ATM of his branch. Based on this he may take a call whether to continue with the ATM or relocate it.
  2. An insurance company wants to know the number of policies each agent has sold. This will help in better performance management of agents.

ETL

ETL is the process of extracting the data from various sources, transforming this data to meet your requirement and then loading it into a target data warehouse. (Tools: Talend, Informatica…)

Data Mart

  1. Data mart is a smaller version of the data warehouse which deals with a single subject.
  2. Data mart are focused on one area, hence, they draw data from a limited number of sources
  3. Time taken to build data marts is very less compared to the time taken to build a data warehouse
Data Warehouse Data Marts
Enterprise wide data Department wide data
Multiple subject areas Single subject data
Multiple data sources Limited data sources
Occupies large memory Occupies limited memory
Longer time to implement Shorter time to implement

Types of Data Mart

  • Dependent Data Mart
  1. The Data is first extracted from the OLTP systems and them populated in the central DWH
  2. From the DWH, the data travels to the Data Mart
  • Independent Data Mart
  1. The data is directly received from the source system
  2. This is suitable for small organizations or small groups within an organization
  • Hybrid Data Mart
  1. The data is fed both from OLTP systems as well as the Data Warehouse

Metadata

  1. Metadata is defined as data about data
  2. Matadata in a DWH defines the source data. (Flat file, Relational Database and other objects)
  3. Matadata is used to define which table is source and target, and which concept is used to build business logic called transformation to the actual output.

DWH Architecture

DWH Architecture