In this blog, we are going to learn about the ETL, which is a process to load data into the database. Data, which is the most basic term of today’s Internet world is actually not that simple as it sounds.
It is basic but not simple, there are many technologies wandering around this word “Data”. Data Integration, ETL, BIG DATA, Relational Database, Data Mining, Data Warehousing, SEO are the fewer one but there are many more things related to data.
Data Integration: ETL Resultant
The Data Integration (DI) is the basic need of Data warehouse. DI refers to collect raw data from a number of sources and keep it in a single place. Most of the organizations maintain a separate unit or department to handle their database. The main reason behind this is that each data unit has its own importance and it is used to carry out important factors related to:
- Market research of the product
- Sale prediction in the upcoming year
- Employee productivity ratio
- Companies yearly revenue etc.
DI enables you to maintain a different type of data in a single schema, which is searchable, maintainable and retrievable. Below are few other examples of few other requirements of data integration:
- Co-relationship of different types of data.
- Making information more use-able or share-able
- Avoiding multiple references to single information
ETL: Basic Concept
ETL is the basic process to do Data Integration and maintain a Data warehouse from number source systems. It refers to a process that load data in Data warehouse from a number of flats files. These files can be a spreadsheet or .CSV files. In ETL, E stands for Extract, T stands for Transform and L stands for Load.
While processing data through ETL process, the transformation is also done through different calculations such as arithmetic function on numeric values, Date function on Date values. Joining different fields, removing unexpected symbols and incorrect data field can also be done.
Applying various rules to clean the data during the transformation of data is generally known as a Cleansing process. After transformation of data, data is finally loaded into the data warehouse. There are many transaction systems such as Oracle, Microsoft that use Data warehouse that keep data for analysis and reporting applications.
Difference between ETL and Database Testing
ETL processes to load data into a data warehouse but it is not a database. Although Database testing and ETL testing both are related to data validation testing ETL testing is completely different from database testing. In Database testing, testing is performed on the data which transfers from one system to a transaction system but ETL testing is a testing to validate data in the data warehouse.
Some other major differences between Database testing and ETL testing are as follows:
- Database testing is to test data on the basis of Normalization, which is the most important Database factor. For example, Primary key and foreign key is maintained or not. ETL testing is for maintaining movement of data from Source to Target system. The main aim of ETL testing is to maintain Data Warehouse, which is especially used for BI applications.
- Database testing is generally performed in a transactional system, which must have some business flow and rules to be work on. ETL testing is performed on historical data. It is not mandatory that there is a business flow persist on data tested through ETL.
- Database testing is generally used in OLTP system whereas ETL testing is applied on OLAP systems.
- Database testing is based on the Entity-Relationship model but there is no such model followed in ETL testing.
Common terms used in ETL
There are a number of terms used in ETL process. Whenever, we are talking about ETL, we must be aware of the basic terms used in ETL. Some of the most common ETL terms are as follows:
- Source System: It refers to a system from where data need to be fetched for keeping it in Datawarehouse. It can be a database, Application, flat files, spreadsheets.
- Mapping: It refers to a relationship between data available in source and to be transferred to another destination.
- Staging Area: This refers to a place where data get processed just before inserting it into Data warehouse.
- Metadata: It refers to the data of data means few properties of data such as its structure, objects, business rule, type etc.
- Cleansing: It refers to a process to remove anomalies from the raw data.
Common challenges faced during ETL testing
ETL testing is one of the most demanding testing profile. ETL tester must have very good knowledge of database and also be very skillful to find out cleansing algorithm required to transform the data.
Below is the list of common challenges faced during ETL testing:
- Data Loss: Many times data gets lost due to improper file handling or storage while processing a number of raw files.
- Inaccurate Data: Due to insufficient cleansing process, sometimes data got corrupt and causes incorrect, incomplete or duplicate data.
- Lengthy and complex ETL for historical data: Data warehouse generally stores a large volume of old data. To test his historical data, ETL process becomes too complex and sometimes lengthy.
- Limited access to ETL jobs: Tester who is responsible for ETL testing don’t have access to ETL Jobs, that directly impact the BI report.
- Limited information on business requirement: ETL testers don’t have complete information related to required report and actual business rule for which ETL process is getting executed.
- Involvement of complex SQL concepts: ETL tester must have very good knowledge of SQL queries, which is used to validate data in the target system.
- Less information of Source-to-target mapping: In many cases, ETL tester doesn’t have complete information on source-to-target mapping.
There is nothing new to learning ETL testing. A manual tester can also become an ETL tester, the only thing is to learn is SQL, Database and Shell script.
A manual tester is very skillful in finding out the scenario if his skill merges with the concepts of Database and then a person can easily become an ETL tester. Future of ETL testing is very bright as it is base of Big data application!