
I am working in the testing space on data warehousing. In scope I have newly created dimensions and facts which should be validated. Based on my knowledge and the information gathered while browsing, I would cover the following:

  • Schema validation of fact and dimension tables as per the spec
  • Duplicate-data checks for fact and dimension tables
  • Lookup validation for dimension tables

Is there anything else that I can verify here?

In addition, I am just curious how I can check whether data is correctly populated into the fact table (row counts, correct surrogate keys, etc.). From a developer's point of view, do they use DML scripts to load the data?

Shabar

2 Answers


Testing the Database

The database is tested in the following three ways:

  1. Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
  2. Testing database features - Here is the list of features that we have to test:

    • Querying in parallel
    • Creating indexes in parallel
    • Loading data in parallel

  3. Testing database performance - Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, go through the user requirement document and understand the business completely. Take time to test the most awkward queries the business is likely to ask against different index and aggregation strategies. (Source: http://www.tutorialspoint.com/dwh/dwh_testing.htm)
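For the fixed query set, one simple approach (a minimal sketch in Oracle syntax, assuming a hypothetical fact table) is to capture the execution plan of each query under every index and aggregation strategy and compare them:

-- Capture the optimizer's plan for one fixed report query
EXPLAIN PLAN FOR
SELECT dim_1, SUM(amount) FROM fact GROUP BY dim_1;

-- Display the captured plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);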

You can also use ETL testing (Extract, Transform, and Load). ETL testing techniques (a sketch of checks 1 and 3 follows the list):

  1. Verify that data is transformed correctly according to the various business requirements and rules.
  2. Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
  3. Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
  4. Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm performance and scalability.
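A minimal sketch of checks 1 and 3, assuming hypothetical staging (stg_sales), fact (fact_sales), and reject (etl_rejects) tables and a hypothetical currency conversion rule:

-- Check 1: rows where the transformation rule was not applied correctly (expect zero rows)
SELECT s.order_id
FROM stg_sales s
JOIN fact_sales f ON f.order_id = s.order_id
WHERE f.amount_usd <> ROUND(s.amount * s.fx_rate, 2);

-- Check 3: invalid rows should land in the reject table, not in the fact table
SELECT reject_reason, COUNT(*)
FROM etl_rejects
WHERE load_date = TRUNC(SYSDATE)
GROUP BY reject_reason;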

Apart from these four main ETL testing techniques, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.

You can also test the schedule, backup recovery, the operational environment, the application, and the logistics of the test. For more information about ETL testing / data warehouse testing, please visit http://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/

UPD:

Creating Indexes in Parallel

Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.

The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index, you may want to place it in the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery; alternatively, operating system mirroring can be used. Because a lost index can simply be rebuilt rather than recovered from the redo log, the NOLOGGING option of the index creation statement may be attractive for a data warehouse.

Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.
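As an illustration (a minimal sketch with hypothetical datafile paths and sizes), the per-partition index tablespaces referenced by the index statement below could be created like this:

-- One tablespace per index partition, each on its own disk
CREATE TABLESPACE Tsidx1 DATAFILE '/disk1/tsidx1.dbf' SIZE 500M;
CREATE TABLESPACE Tsidx2 DATAFILE '/disk2/tsidx2.dbf' SIZE 500M;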

Partitioned indexes are created in parallel using partition granules, so the maximum degree of parallelism (DOP) possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher DOP is used. The following statement could be used to create a local index on the fact table:

CREATE INDEX I ON fact (dim_1, dim_2, dim_3) LOCAL
  (PARTITION jan95 TABLESPACE Tsidx1,  -- one index partition per table partition
   PARTITION feb95 TABLESPACE Tsidx2,
   ...)
PARALLEL (DEGREE 12) NOLOGGING;

To back up or restore January data, you only need to manage tablespace Tsidx1.

Parallel Query Tuning

The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics:

  • symmetric multiprocessors (SMP), clusters, or massively parallel systems
  • high I/O bandwidth (that is, many datafiles on many different disk drives)
  • underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
  • sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers

If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.
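For example (a minimal sketch using Oracle's PARALLEL hint against the hypothetical fact table above), a query can request a specific degree of parallelism explicitly:

-- Ask the optimizer to scan and aggregate fact with a DOP of 8
SELECT /*+ PARALLEL(f, 8) */ dim_1, SUM(amount)
FROM fact f
GROUP BY dim_1;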

Here you can read about this in more detail: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqo.htm#1559

I hope these resources will be helpful for you: https://wiki.postgresql.org/wiki/Parallel_Query_Execution

https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx

http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67775/ch24_pex.htm#1978

QArea
  • Thanks for your reply. Could you please elaborate on the following points, i.e. querying in parallel, creating indexes in parallel, and loading data in parallel, with regard to how we do those? – Shabar Mar 07 '15 at 22:54
  • In view of system testing, do we need to go into that level or are those part of unit testing? – Shabar Mar 15 '15 at 02:17

I am an ETL tester. For data validation and data quality testing in a data warehouse, follow the checks below:

1) Metadata testing - testing the structure of the underlying tables against the design document. 2) Data validation - here you test the mapping transformations using SQL and PL/SQL. We generally test using source and target table counts, Source minus Target, Source intersect Target, and Target minus Source (see the sketch below).
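A minimal sketch of the count and minus checks, assuming a hypothetical source table employee1 and target table master_emp:

-- Count check: source and target row counts should match
SELECT COUNT(*) FROM employee1;
SELECT COUNT(*) FROM master_emp;

-- Source minus Target: rows missing from or loaded incorrectly into the target (expect zero rows)
SELECT emp_id, emp_name FROM employee1
MINUS
SELECT rep_id, rep_name FROM master_emp;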

3) Duplicate check - to ensure there is no redundancy in the data warehouse. 4) Loading strategy check - to check whether your target table is SCD or delete-on-reload (depends on the requirements). A sketch of both checks follows below.
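A minimal sketch of checks 3 and 4, again against the hypothetical master_emp target (current_flag is an assumed SCD Type 2 column):

-- Duplicate check (for a non-SCD target): natural keys appearing more than once (expect zero rows)
SELECT rep_id, COUNT(*)
FROM master_emp
GROUP BY rep_id
HAVING COUNT(*) > 1;

-- SCD Type 2 check: at most one current row per natural key (expect zero rows)
SELECT rep_id
FROM master_emp
WHERE current_flag = 'Y'
GROUP BY rep_id
HAVING COUNT(*) > 1;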

anurag
  • Is there a particular reason you are using `INTERSECT` rather than `INNER JOIN`? – Shabar Apr 20 '15 at 05:17
  • You can use an inner join instead, but while preparing the test queries we generally prepare the source and target queries separately, so it is easier to do an intersect between the prepared queries rather than putting an inner join between them. E.g. source query: select emp_id,emp_name from employee1; target query: select rep_id,rep_name from master_emp; the intersect query is easy to read and understand, as it will be select emp_id,emp_name from employee1 INTERSECT select rep_id,rep_name from master_emp – anurag Oct 01 '15 at 08:30
  • But the issue is, when using the INTERSECT operator, the number and order of the columns must be the same in all queries, and the data types must be compatible. – Shabar Oct 02 '15 at 02:16
  • Exactly... that's why we use the INTERSECT operator, so that the data types of all columns in the target table can be tested. – anurag Oct 04 '15 at 10:03