In Data Analytics
0
41 Views

What is Data Warehouse?

A Data Warehouse is a collective bunch of data of corporate world and these retrieve data from the external source and other operational systems. It handles the enterprise’s daily transaction operations which helps in making the business decision due to reporting and analysis tools at different levels. Furthermore, data is inserted into the warehouse database through ETL process (data extraction, data transformation, and data loading ) by using OLTP applications like external providers and mainframe applications etc.

The whole concept of DW was established in the 1980s by IBM employees “Paul Murphy and Barry Devlin”. They formally enhanced “Business Data Warehouse”. Therefore, this concept is specially formed to provide the data flow from external operational programs to make the better decision for organizations perspective.

Basically, Data warehouse provides consolidated data, transactional data, and historical data from different sources. Data warehouse segregates the workload analysis and enhances the organization data from other external links.

Why the need of Data Warehouse?

A data warehouse is a domain system of OLTP (Online transaction processing). So, data can be stored in the data warehouse through data warehouse appliance for months and years to make the historical analysis. A warehouse appliance is a mixture of software and hardware tools. Therefore, these tools store the data in the database of a data warehouse. Basically, the storage capacity of data warehouse comes around in petabyte and terabyte ranges.

So, let’s understand the practical meaning of data warehousing with an example. So, let’s assume a big organization which stores data (information) related to its employees, their date of joining, date of leaving, their salaries, product estimation, client and customer information, invoices and sales etc. Furthermore, the head of an organization may want any information related to the measurement of cost-reduction. As a result, the answer will rely on historical analysis of the data. This is the main advantage of data warehousing.

Business Analysis Framework

This framework tells the actual working of a data warehouse. A (Business Analyst) BA retrieves the valuable information from the DW (Data Warehouse). Further, it identifies the performance and makes crucial decision to win with other market stakeholders. For this reason, a data warehouse increases the organization’s productivity. It also provides the consistent workflow to make the relationship strong with customers.

Properties: Data warehouse

There are some features of the data warehouse as follows.

Integrated: It consolidates the data from external sources like relational DBMS, extended DBMS, and Flat files etc. It also provides the effective data analysis in the data warehouse.

Subject Oriented: This is the main feature of the data warehouse. Data warehousing is the subject-oriented system because that provides the data of a subject instead of company’s ongoing operations. These subject can be the product, branch information, sales, and revenue etc. A simple data warehousing does not require any ongoing operations. It also provides data analysis and modeling for decision making.

Non-Volatile: In this situation, old data is not deleted while the new record is added to the database. Old data is merged with new data by using SQL JOINS query. A data warehouse is apart from the external operational sources that do not reflect the in the database of the data warehouse.

Time-Variant: Extracted or collected data is analyzed for a particular period of time. Here, data provides in historical form.

ETL Process in Data Warehouse

ETL is a combination structure of extraction of data, the transformation of data and loading of data. These ETL process help in preparation of data in the database management. It also extracts data from external sources, then transforms the data by using concatenation, data aggregation, and calculations rules. Later, it loads the information in the data warehouse. As a result, it simply analyzes the business data to make the crucial business decision. So, let us understand the ETL process flow in brief as follows.

Extraction

Extract process reads the data from unknown external database and extracts in the desired form to make the data easily accessible for business objectives. On another hand, the major point to extract data is to fetch the relevant information from the system with minimum resources. As a result, this step does not affect the data.

Transformation

This function work like filtration and data cleaning. This step is applied to some important lookup rules and formulas on extracted data as per the business requirements. Then, it converts the data into the desired table or a database. Therefore, this phase of ETL contains verification, validation, and rejection of data. First of all, it analyzes the extracted data through verification that defines the extracted data is in the correct format or not.

Secondly, It validates the extracted data means make sure that data is extracted correctly what we required for. Here, you can use SQL during the validation process it will help you to find the correct data. You can even find the duplicate data, sorting of data, data conversion, and consistency of the data in the database.

After validation process, the query result will give you a final data. Now, apply set of rules to get the duplicate and unwanted data. As a result, it automatically will reject the data. This is how transformation process works.

Loading

This is the final stage of ETL process. This phase defines the resultant of data extraction and transformation. After two steps of EPL data then loads into the database of the data warehouse. You can also take help of SELECT command while data loading into the database. As a result, once you load the data then refresh the database.

Data Warehouse Architecture Model (Three-Tier Architecture)

A data warehouse is an RDBMS (Relational Database management System), that is well developed for query analysis. A data warehouse maintains huge data in the database to make the changes in historical analysis. In this session, I will discuss the three-tier architecture model of the data warehouse. In conclusion, this three-tier architecture contains three layers as follows.

  • Bottom-Tier Layer
  • Middle-Tier Layer
  • Top-Tier Layer

Bottom-Tier Layer

This is the bottom tier layer contain all of the database servers in the data warehouse which relates to the relational database system. Data is extracted from external and operational databases due to some backends tool. Then, feed data into the database of the warehouse by performing ETL process (Extract, Transform, and Loading). Basically, we use ETL process to load data into the bottom layer of the data warehouse architecture model. In addition, the bottom layer contains several entities like as

  • Data warehouse
  • Data Marts
  • Metadata Repository
  • Monitoring & Administration

1. Data Warehouse

A data warehouse is a sophisticated form of operational database that contains only required information and allow to get fast access to the information.

2. Data Marts

Data Mart is an optimized version of the data warehouse which relies on a particular subjected area. A data mart maintains access layer to collect the relevant information from the organizational data store. Therefore, data marts only focus on specific business architecture to build the warehouse in a large size that takes less time and generate organization revenue. Basically, it divides the data into small portion like sales information, customers information, and items information etc. In addition, some key points need to remember of data mart as follows.

  • Data Marts are Windows, Unix and Linux based servers that create data marts. It also contains low cost to create data marts.
  • Implementation of data marts cycles cannot be stored for a long time means it only valid for weeks not for months and years.
  • Data marts are much smaller in size but more flexible. Data marts can customize the data by their departments.
  • Because of the user understanding, data mart requires less implementation as compared to the data warehouse. As a result, data mart can increase business process.

Types: Data Mart

Data marts contain its three type as follows

  • Dependent data mart directly sourced from the database of a data warehouse.
  • Independent data mart sourced with more than one data warehouse.
  • Hybrid data mart converts the data from data warehouse to operational database.

So, let’s understand the data marts types in brief.

Dependent Data Mart

A dependent data mart access data source directly from the single data warehouse database. This type of data mart depends on only enterprise data warehouse or centralized data warehouse. You can access only one database at a time. But, if you want more than one then you have to individually configure them is called dependent data mart.

Dependent data mart provides a way to end-users that can easily access data warehouse and data marts together, but it will depend on the need. As a result, this data mart works from top to bottom as seen in the image above.

Independent Data Mart

This type of data mart works without the use of any centralized system or data enterprise warehouse. Independent data mart divides the data into small portion within an organization. Therefore, independent data mart does not contain any relation value with data enterprise warehouse and other operational data marts. Hence, it works independently from the operational system. For this reason, you need to develop new centralized store where an end-user can analyze the data multiple times.

Hybrid Data Mart

A hybrid data mart is a mixture of multiple inputs from the external sources in the data warehouse. Basically, it works on the ad-hoc environment. It is designed to work on more than a database and also provide the fastest implementation for any big organization. It requires less cleaning process and supports big database. Hybrid data mart contains more flexibility than dependent and independent data marts for small data applications.

Implementation process of Data Mart

Data mart contains five implementation process as follows.

  • Designing
  • Constructing
  • Populating
  • Accessing
  • Managing

Designing

Designing is the 1st phase of implementation in the data mart, that covers all actions between the initial request of data mart to collective data through the basic data mart requirements. It also discovers both physical & logical model of the data mart. In addition, this phase contains some tasks as follows.

  • Collect the required technical and business requirement.
  • It verifies the source of data and selects the proper subset of data.
  • It helps in making the physical and logical architecture of the data mart.

Constructing

Constructing is the 2nd phase of datamart, it creates the logical and physical architecture of the data mart that provides fast access to data. In addition, this phase contains some tasks as follows.

  • An RDBMS stores the constructed data which contains the addition of the data & deletion of the data.
  • By using SQL query, you can perform several functions like indexes, views, and tables.
  • Constructed data remains secure by using SQL security libraries.

Populating

This is the third phase of data mart process, that covers all the steps which are used in above two phase of data mart like getting the data from the external source, cleaning of the data and modifying of the data. In addition, this phase contains some tasks as follows.

  • Select the target sources for getting the data.
  • Extraction of data from the sources.
  • Data cleaning and data transformation.
  • Loading and storing the data (metadata) into the database.

Accessing

This is the 4th phase of data mart implementation, this phase makes the data usable for data analysis, querying the data, creating the chart, graphs, and reports to publish them into the external market for business objectives. Furthermore, an end-user submits the set of queries and display the final result of the queries.

Managing

This is the fifth last phase of data mart implementation, that covers all of the tasks which relate to management like as.

  • Provide user access authentication.
  • Provide secure, clean, and accurate data to the data warehouse.
  • Optimize the system for better result and performance.
  • Provide the backup in case of system failure.

3. Metadata Repository

Metadata is another form of simple data. A simple data which represent different data format call as metadata. Metadata segregate the especially relevant information of the data. It also provides the detailed information of the data. So, let’s assume an index of the notebook consumes as a metadata for the internal content of the notebook. It also defines the warehouse objects and roadmap in the data warehouse. Now, understand the metadata repository, which is an internal part of data warehouse system. As a result, it contains some metadata categories as follows.

Business Metadata: This type of metadata covers business information like business goals, owner information, changing policies information and business definition etc.

Operational Metadata: It contains lineage of data and data currency. Data currency defines the active mode of data, purged and archived. Data lineage refers to the data transformation and migration rules.

Mapping of data from operating system to Data warehouse: This type of metadata contains sources database, the content of the data sources, extraction of data, data partition, cleaning of data, the transformation of data, purging of data and refreshing of data rules etc.

Summarization Algorithms: It contains data aggregation, dimension algorithms, data granularity and data summarization etc.

4. Monitoring & Administration

This element is basically used for data refreshment and disaster recovery. It set the limitation in size of data warehouse and provide data in synchronization. Further, it controls the ranges of query and numbers format. It helps to increase performance better of the database.

Middle-Tier Layer

In the middle-tier, there is an OLAP server that works as a single processing combination of ROLAP and MOLAP database systems. This OLAP server provides the multidimensional way to get the data model. For this reason, analysts and managers can provide consistent and fastest way to get the information.

ROLAP stands for Relational On-Line Analytical Process that manipulate the stored data in the relational database. ROLAP model can easily maintain the large data. MOLAP stands for Multidimensional OLAP system that maintains dynamic multidimensional operations and the data. Data can be stored in the multidimensional cube. It provides fastest data retrieval due to MOLAP cubes. The whole structure of middle layer is to represent the abstracted view of the data warehouse. As a result, this layer contains common behavior between database and end-user.

Components of OLAP Server

There are three components in OLAP system as follows.

  • Relational OLAP
  • Multidimensional OLAP
  • Hybrid OLAP

Relational OLAP

Basically, ROLAP works on the stored data in the relational database. ROLAP server defines the relationship between front-end(client) tools and relational back-end tools. ROLAP server provides the way to manage and store the data by using relational and extended DBMS. Therefore, this server can easily handle a large number of data. But, it contains a drawback that it has some limited SQL functionalities to maintain the database and aggregate tables.

Multidimensional OLAP

Multidimensional OLAP system handles the multidimensional data through array-based storage systems. It also provides an optimal solution for dice and slice operations. MOLAP perform the fastest access of the data in comparison with relational OLAP. It can also handle complex calculations. But, it contains some drawbacks like data aggregation cannot perform while changing the dimension of cubes. MOLAP maintains a small number of data.

Hybrid OLAP

Hybrid is a complete mixture of MOLAP & ROLAP servers. A hybrid server provides the fastest accessibility than MOLAP & ROLAP servers. It contains better scalability and computation. It also maintains a big number of data (More than a terabyte).

Note: OLAP contains additional type as “Specialized SQL servers”, that allows to execute advanced functions of SQL query and provide better processing support than Snowflake and Star schemas.

So, let’s look at the difference between OLAP and OLTP. This is the most important part in data warehouse, which could be asked in the interviews for data warehousing profile.

OLAP vs OLTP

OLAP

OLTP

On-line Analytical Processing system

On-line Transaction Processing System

It contains a functionality of query management.

OLTP contains a functionality of modifying the database.

It performs single modification due to only SELECT command.

This performs several modifications due to INSERT, UPDATE & DELETE

OLAP use only data warehousing process

It uses only extended DBMS

OLAP provides data analysis and retrieval of the data.

OLTP provides data manipulation and modification.

Tables do not perform data normalization

Tables perform data normalization

It creates seconds to minutes response time

It creates in a millisecond.

Data is not organized in the database

Data is organized in the database.

It contains the read-only process.

It contains both read and write process.

OLAP belongs to customer oriented system.

OLTP belongs to the market-oriented system.

It solves complex query due to data aggregation rules.

Here, the query is simple and standardized form.

OLAP requires back up time to time but is not mandatory.

It requires full back up.

It is subject oriented. Example: DB changes with the subject like marketing, purchase, and sale etc.

It is application oriented. Example: Retails, Airline, and banking etc.

OLAP is used by only Users like Managers, CEO, MD etc.

OLTP is used by Database professionals and DBA.

Hundreds of users can perform actions in OLAP

Thousands of users can perform actions in OLTP.

OLAP help to increase business productivity.

It helps in user’s productivity,

A user can create the view by using spreadsheets

It can easily maintain the view.

It provides historical data processing

OLTP provides day to day data processing

It mainly focuses on Data out.

It mainly focuses on Data In.

Totaly based on Snowflake and Star Schema.

Purely based on ER Model (Entity Relationship Model)

Top-Tier Layer

Top-Tier is a front-end client-side layer in a three-tier model of the data warehouse. This contains reporting & query tools, data mining & data analysis tools. Here, reporting tools contain report writers & productive reporting tools. Analysis tools are also used to make charts based on data mining result. Data mining tools define the especially relevant information from the hidden patterns. Now, we have completed the three-architecture model of the data warehouse.

Conclusion

I hope you guys enjoyed this tour of the data warehouse. A simple data warehouse is a hub where the especially relevant information resides in the database. There are many databases for storing the value but I explained the relational and extended database management system. Because they both contain maximum data storage than other databases. This is a complete tour since definition to its conclusion. In addition, this article will help to learn what exactly data warehouse is? And why is this so important?

I also explained the whole concept of the data warehouse from the scratch which will help to learn for freshers in their interviews and the exams. In the data warehouse architecture, metadata plays a vital role as it species the values, the point of connection, source, features and usage of the data warehouse. In addition, last but not the least the main topic of this article is the difference between OLAP and OLTP. So, most of the companies who hire a candidate or employee for data warehousing profile they must ask these differences.

Thank you!

RECOMMENDED POSTS

Start typing and press Enter to search