• Employee Number: 135379, 117560
• Employee Name: Ashish Ranjan, Hossein Sadiq
• Name of the Project: CDW-MAC QA
• Location: SJM Towers, Bangalore
• Designation: ASE, AST
• Contact Number: 080 – 6660 – 6262/6134
• White Paper Topic: DWH-ETL Testing Approach
• E-mail ID: [email protected], [email protected]
DWH-ETL Testing Approach

Ashish Ranjan, [email protected], Tata Consultancy Services Ltd.
Hossein Sadiq, [email protected], Tata Consultancy Services Ltd.

Abstract
This paper describes how best practices for manual and automated Data Warehouse (DWH)-ETL testing give Tata Consultancy Services (TCS) a competitive edge in the IT services industry. Its focal point is streamlining manual DWH-ETL testing to increase test coverage, reduce time to market, and pave the way for incremental automation. The DWH-ETL automation framework presented at the end of the paper shows how this incremental approach can ultimately grow into a framework for end-to-end testing of DWH-ETL applications. Incremental automation rests on the philosophy that the ROI of any automation solution should start delivering benefits as soon as the first component is built. To ensure this, each component should be capable of running independently while remaining flexible enough to fit into the overall framework. This reduces operational costs and manages the consumption of technology resources to maximize business value. The paper also explains the value addition and cost efficiencies that these best practices and the automation approach bring to customers. Relevant case studies from TCS are included to illustrate the concepts.
1.0 Introduction
Many organizations today are challenged to do more with fewer resources and to make cost reduction a strategic priority. Most testing teams constantly face the challenge of innovating solutions that add value to their customers, who in turn are looking to reduce costs and increase coverage. DWH-ETL testing is one such area; in general, its maturity with respect to testing methodology is very low.
Typically a team should first move towards standardisation of its manual testing processes; once those processes are standardised, the team should move towards automation.
2.0 What is a Data Warehouse?
A data warehouse is the main repository of an organization's historical data, its corporate memory. For example, a credit card company would use the information stored in its data warehouse to find out in which months of the year its customers have a particularly high rate of defaulting on their credit card payments, or the spending habits of different segments of society and age groups. In other words, the data warehouse contains the raw material for management's decision support system.
3.0 What is DWH-ETL?
Extract, transform, and load (ETL) is a process in data warehousing that involves:
• Extracting data from outside sources,
• Transforming it to fit business needs, and ultimately
• Loading it into the data warehouse.
ETL can in fact generally refer to a process that loads any database.
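The three steps above can be sketched as a minimal ETL pipeline. The example below is purely illustrative (the table name, field names and the trivial transformation are invented), using an in-memory SQLite database as a stand-in for the warehouse:

```python
import sqlite3

def extract(rows):
    """Extract: pull raw records from an outside source (here, an in-memory list)."""
    return list(rows)

def transform(rows):
    """Transform: fit the data to business needs (trim/uppercase names,
    normalise amounts to integer cents)."""
    return [(name.strip().upper(), int(round(amount * 100))) for name, amount in rows]

def load(conn, rows):
    """Load: write the transformed records into the warehouse table and
    return the loaded row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_payment (customer TEXT, amount_cents INTEGER)")
    conn.executemany("INSERT INTO fact_payment VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM fact_payment").fetchone()[0]

source = [(" alice ", 10.5), ("bob", 20.0)]
conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(source)))
```

In a real DWH the source would be flat files or operational databases and the load would go through the ETL tool, but the extract-transform-load shape stays the same.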
4.0 DWH-ETL Testing
There is a need for more focused testing of DWH-ETL processes. When testing ETL:
• Validate that all the specified data gets extracted.
• Check that the transformation and cleansing processes are working correctly.
• Test the data loaded, and its counts, into each target field to ensure correct execution of business rules such as valid-value constraints, cleansing and calculations.
• If there are different jobs in an application, verify the job dependencies as well.
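The count and valid-value checks in the list above can be illustrated with a small sketch; the record layout and the `status` field are hypothetical stand-ins for whatever the Source to Target Mapping document specifies:

```python
def validate_load(source_rows, target_rows, valid_statuses):
    """Post-load checks: the record counts match, and every loaded row
    satisfies a valid-value constraint on its status field."""
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(f"count mismatch: source={len(source_rows)}, target={len(target_rows)}")
    for i, row in enumerate(target_rows):
        if row["status"] not in valid_statuses:
            issues.append(f"target row {i}: invalid status {row['status']!r}")
    return issues

source = [{"id": 1}, {"id": 2}]
target = [{"id": 1, "status": "ACTIVE"}, {"id": 2, "status": "??"}]
problems = validate_load(source, target, valid_statuses={"ACTIVE", "CLOSED"})
```

An empty `problems` list would mean both checks passed; each entry otherwise points at a specific count or valid-value violation.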
ETL applications can be language based or tool based. If the capability of the ETL tool is fully utilised, it removes some of the testing effort that would otherwise be required in a language-based application.
4.1 Manual Testing Strategy for DWH-ETL
This section discusses the manual System Testing and Regression Testing strategies in detail. Most of the processes mentioned below have been streamlined to make the transition to automation as smooth as possible.
For System Testing:
At a very high level, any DWH-ETL application has inputs which undergo a transformation and result in an output. Hence for system testing, the general steps to follow for success and coverage are:
1) Create scenarios based on the documents
2) Prepare test data based on the scenarios
3) Derive expected results based on the transformation and the test data
4) Run the application using the test data
5) Capture the actual output
6) Analyse the results after comparing the expected output to the actual

Pre-requisites:
1) Business Requirement document
2) Functional Requirement document
3) Technical Requirement document
4) Source to Target Mapping document
5) Code diff
6) Traceability Matrix
How to come up with the Scenario Sheet?
Based on the Functional Requirement document, Technical Requirement document and Source to Target Mapping document, scenarios should be created to check the new functionality going in. Scenarios should cover both positive and negative cases. Typically scenarios can be created in an Excel workbook, with one worksheet holding them in plain English.

If it is a complex application where intermediate files are created, the application should be broken into sub-applications. A sub-application is any set of jobs clubbed together that has a physical input and a physical output. A scenario sheet should be created at sub-application level and then merged into a final application-level scenario sheet. Each scenario should have the corresponding source file/table, source field/column names and target file/table, target field/column names, even at the sub-application level. The values put in the source fields/columns can be hypothetical, but they should be unique and should have correct corresponding values (after transformation) in the target fields/columns.

These scenarios should be discussed with the Business Analyst, Development Manager and Test Lead with respect to coverage, duplication and relevancy. After all stakeholders agree, two new worksheets should be created in the same workbook: Input and Expected Output. The Input worksheet should have real-world values, with field-level details, which can be processed through the application. The Expected Output worksheet should have the expected target values if the Input worksheet values are passed through the application.

How to create Test Data?
If the inputs used by the application are files, then based on the Input worksheet and the file layout, the input files used by the application should be mocked up.
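As a sketch of the worksheet structure described above, the fragment below models the Scenario, Input and Expected Output worksheets as plain Python lists of dicts. The scenario IDs, field names and the "blank amount is rejected" transformation are all invented for illustration:

```python
# Scenario worksheet: plain-English cases, each with a unique id so its row
# can be traced from Input through to Expected Output.
scenarios = [
    {"id": "S01", "desc": "positive: standard payment", "amount": "100.00"},
    {"id": "S02", "desc": "negative: missing amount",   "amount": ""},
]

def build_input_rows(scenarios):
    """Input worksheet: one real-world record per scenario, tagged with the scenario id."""
    return [{"txn_id": s["id"], "amount": s["amount"]} for s in scenarios]

def build_expected_rows(scenarios):
    """Expected Output worksheet: the target values after the (assumed)
    transformation - valid amounts pass through, blank amounts are rejected."""
    out = []
    for s in scenarios:
        if s["amount"]:
            out.append({"txn_id": s["id"], "amount": s["amount"], "status": "LOADED"})
        else:
            out.append({"txn_id": s["id"], "amount": None, "status": "REJECTED"})
    return out

input_rows = build_input_rows(scenarios)
expected_rows = build_expected_rows(scenarios)
```

In practice the three structures would live as three worksheets in the one Excel workbook; the unique scenario key is what lets a discrepancy in the output be traced back to its plain-English scenario.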
If the input used by the application comes from a table, then again based on the Input worksheet, the corresponding values should be inserted into the columns of the concerned tables.

How to ensure the mocked-up test data file gets picked up when the application runs?
At a very high level there are three ways to ensure it, depending on the application:
1. Renaming the file
2. Changing the header and trailer
3. Getting the name from the Control DB and naming the file based on that

How to capture the Actual Output for analysis?
The best way is to import the data from the output file or table into a new Excel sheet. Excel has built-in functionality for importing delimited or fixed-column-width flat text files. Also, most DB querying tools such as Toad, SQL Navigator and SQL Developer can save results to Excel. Storing the output this way simplifies the comparison between expected and actual, and also keeps the results in a readable format for future reference.

How to analyse the results?
A field-to-field comparison should be done between the expected and the actual output, and any discrepancies highlighted. First, the scenario showing the difference should be double-checked for its expected value; once that is confirmed, a defect should be raised for that scenario.
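The field-to-field analysis step can be sketched as below; the `txn_id` key column and the row dictionaries are illustrative stand-ins for the Expected and Actual worksheets:

```python
def compare_rows(expected, actual, key):
    """Field-to-field comparison between Expected and Actual worksheets.
    Rows are matched on a key column; every field present in the expected
    row is checked, and each difference is reported as (key, message)."""
    diffs = []
    actual_by_key = {row[key]: row for row in actual}
    for exp in expected:
        act = actual_by_key.get(exp[key])
        if act is None:
            diffs.append((exp[key], "missing in actual"))
            continue
        for field, value in exp.items():
            if act.get(field) != value:
                diffs.append((exp[key], f"{field}: expected {value!r}, got {act.get(field)!r}"))
    return diffs

expected = [{"txn_id": "S01", "amount": "100.00"}, {"txn_id": "S02", "amount": "0.00"}]
actual   = [{"txn_id": "S01", "amount": "100.00"}, {"txn_id": "S02", "amount": "0.01"}]
diffs = compare_rows(expected, actual, key="txn_id")
```

Each entry in `diffs` corresponds to a scenario whose expected value should be double-checked before a defect is raised.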
For Regression Testing:
Regression testing of DWH-ETL applications generally seems a daunting and time-consuming task. Typically, due to resource/time constraints:
• Random sampling of unchanged, high-criticality functionality is targeted
• System testing of the randomly sampled functionality is done
This leaves plenty of code going into production with a very low confidence level:
• Most of the unchanged medium- and low-criticality functionality never gets tested, release after release.
• Sometimes, due to time constraints and the high number of critical functions in the application, some of the critical functions never get verified.
Regression testing is done to ensure that the new code did not affect anything other than what it was supposed to affect. This can be achieved by taking a snapshot of the application's output before the code change is implemented and a snapshot after it is implemented. A difference of the two snapshots will show what changed; anything which was not supposed to change will also be highlighted. The important thing to remember is that this only works if the input remains constant. Thus, for regression testing, the test file should remain constant.

There are two ways in which a regression input file can be created:
1) By taking a sample of a production input file and desensitising it.
2) By incrementally adding the System Test input data to a regression test file with each release. Progressively, with each release, the regression test file improves in scope and coverage.

Once that is in place, to regression test a release, the same file should be run through the application before and after the code is implemented. The output should be captured for both runs and stored, preferably in an Excel workbook in different worksheets. A simple comparison will highlight the differences between the two worksheets. Differences will exist either because:
1) They are supposed to be different - the new change for this release, or
2) The new code broke something it was not supposed to affect.
Segregating the type-2 differences from the type-1 differences and analysing the type-2s will pinpoint what the new code broke. This is typically what any regression testing aims for.
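The before/after snapshot comparison can be sketched as follows; the `acct` key column and the balance rows are invented for illustration:

```python
def snapshot_diff(before, after, key):
    """Diff two output snapshots produced from the same frozen regression
    input: one taken before the release code, one after. Returns which
    keyed rows changed, appeared, or disappeared."""
    b = {row[key]: row for row in before}
    a = {row[key]: row for row in after}
    return {
        "changed": sorted(k for k in b if k in a and a[k] != b[k]),
        "added":   sorted(k for k in a if k not in b),
        "removed": sorted(k for k in b if k not in a),
    }

before = [{"acct": "A1", "bal": 100}, {"acct": "A2", "bal": 50}]
after  = [{"acct": "A1", "bal": 100}, {"acct": "A2", "bal": 75}, {"acct": "A3", "bal": 5}]
report = snapshot_diff(before, after, key="acct")
```

The entries in the report still have to be segregated by hand into intended release changes and unintended breakage, as described above; the tool only surfaces every difference.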
4.2 Incremental Automation Test Strategy for DWH-ETL
Most of the manual activities mentioned above can be effectively converted into independent automated components. To start the automation initiative, a testing project need not wait for all the components to be developed; as time and resource considerations permit, they can be developed incrementally, and benefits start trickling in right after the first component is built. Later in the paper a complete automation framework is discussed; the same framework, with minor modifications, can be used to automate most backend application testing. Below are the activities which can be automated.
For System Testing:
How to create Test Data?
If the input file used by the application is a serial file with a specific format, a tool can be built which reads from the Excel worksheet and creates a serial file according to that file format. If the input file is an MFS file, the flat file created as above can be FTPed to the Unix box and another ETL component can create it.

How to ensure the mocked-up test data file gets picked up when the application runs?
A component can be built which, based on the application requirement, can:
1. FTP the serial test data file from Windows to the application's Unix directory
2. Rename the file so that it can be picked up. For Control-DB-based applications, the component can query the DB for the file name value and rename the file accordingly.

How to capture the Actual Output for analysis?
Two components can be built:
1. One that converts the different file formats to a serial comma-delimited file.
2. One that connects to the DB, queries for the table updates done by the application, and extracts them into an Excel sheet.

How to analyse the results?
A simple comparison tool can be built which takes the Expected Result worksheet and the final output as inputs and compares them to show differences at field level.
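The first component (worksheet rows rendered as a fixed-width serial file) can be sketched as below; the layout, taken here as a list of (field, width) pairs, is a stand-in for the real file-layout document:

```python
def to_serial(rows, layout):
    """Render worksheet rows as fixed-width serial records.
    layout is a list of (field_name, width) pairs; each value is
    left-justified and truncated to the layout width."""
    return "\n".join(
        "".join(str(row[field]).ljust(width)[:width] for field, width in layout)
        for row in rows
    )

layout = [("txn_id", 6), ("amount", 10)]
rows = [{"txn_id": "S01", "amount": "100.00"}]
serial = to_serial(rows, layout)
```

The resulting text would then be written to a file and handed to the FTP/rename component so the application picks it up.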
This tool can be used for Regression Testing too.
For Regression Testing:
It was mentioned earlier that for regression testing, "The output should be captured for before and after runs and stored preferably in an Excel Sheet in different worksheets. A simple comparison will highlight the differences between the 2 worksheets." The comparison tool built for System Testing can be used equally well here; small modifications may be required based on the project requirements.
4.3 DWH-ETL Automation Framework Development
Framework development plays a key role in giving the tool the flexibility to extend its capabilities for future use. Careful thought has to be put into developing the framework so as to arrive at the generic components that constitute it. The component model for the framework under discussion is shown in the schematic below. It is specific to Ab Initio/Tivoli, but all the components are flexible and can be combined to work effectively based on the system under test and the testing requirements. The framework also supports scalability to incorporate other similar modules/applications. The components explained below are described at a very high level; components or functionality can be added, removed or modified to customise the framework based on the application and test automation requirements.
Framework component model
Component 1 - Converts an Excel tab sheet to ASCII files
o The user-defined Excel file should be pulled from a user-defined location.
o Should be able to select any user-defined tab in any user-defined workbook and convert the cells in a specified column and row range into an ASCII text file.
o The ASCII file should be given a user-defined name.
o The ASCII file should be placed in a user-defined location.
Component 2 - FTPs files from the Windows platform to the Unix box
o The component should pick the user-defined file.
o The component should be able to FTP the picked file to a user-defined Unix box (different applications reside on different servers).
Component 3 - Changes ASCII to MFS, SFS or EBCDIC depending on the application
o Sub-components can be defined which are each responsible for only one type of conversion.
o The component should correctly map the fields from one format to another rather than just convert the format (DML dependency should be considered).
o The kind of format change and the DML should be user defined.
o The component should also place the converted file in the user-defined landing directories of the applications so they can pick it up.
o The file should be renamed to a user-defined name.
Component 4 - Updates the Control DB to pick our files, using the Framework DB
o The Control DB should be updated with the correct file name so that the application can pick it up when running.
o It should validate whether the values in the Control DB for the application are correct.
o If they are not, it should raise an alert and stop.
Component 5 - Runs jobs one after another in the order specified in the Framework DB
o It should be able to mock the way Tivoli runs.
o If any job fails, it should raise an alert and stop.
Component 6 - Changes MFS, SFS or EBCDIC to ASCII depending on the application
o Sub-components can be defined which are each responsible for only one type of conversion.
o The component should correctly map the fields from one format to another rather than just convert the format (DML dependency should be considered).
o The kind of format change and the DML should be user defined.
o The component should be able to pick up the file for conversion from the user-defined output directories of the applications.
Component 7 - FTPs output files from the Unix box
o The component should pick the user-defined file from a user-defined location and user-defined Unix box.
o The component should be able to FTP the picked file to a user-defined Windows location and box.
Component 8 - Extracts table loads in application tables to an ASCII file
o Should be able to extract values from a user-defined table and DB using a user-defined query.
o Should extract and place the result in a CSV file in a Windows directory.
Component 9 - Compares Expected Output to Actual Output
o Should be able to locate the output ASCII file and the Expected Output file among all the files and pick them for comparison.
o The 256-column limitation of Excel needs to be addressed.
Component 10 - Updates Pass or Fail in QC
o Based on pass or fail, QC/TD should be updated.
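Component 8 can be sketched with Python's standard library, using an in-memory SQLite table as a stand-in for the real application DB (the table, column names and query are illustrative):

```python
import csv
import io
import sqlite3

def extract_to_csv(conn, query):
    """Component 8 sketch: run a user-defined query against the DB and
    render the result set as CSV text (header row + data rows)."""
    cur = conn.execute(query)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names as header
    writer.writerows(cur.fetchall())
    return buf.getvalue()

# Stand-in for the application's table load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_load (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO app_load VALUES (?, ?)", [(1, "alpha"), (2, "beta")])
csv_text = extract_to_csv(conn, "SELECT id, name FROM app_load ORDER BY id")
```

Writing `csv_text` out to a file in a Windows directory would complete the component; for a real warehouse the `sqlite3` connection would be replaced by the appropriate database driver.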
5.0 Success Stories
Case Studies:
1. Problem description: For a leading investment bank, the overall objective was to identify areas for process improvement to increase overall test efficiency. The challenges were:
o Limited test coverage (2%-5%) - due to time/resource constraints, only a few records/interfaces were validated
o A tedious and time-consuming process
o Prone to human error
Solution: The TCS team provided an optimal solution with the above framework in mind: a tool which connects to the different databases of the investment bank's various data warehousing applications, retrieves the baseline data and the input data, compares the two sets of data against each other and reports any differences.

2. Problem description: A bank's DWH testing division was looking for ways to improve manual testing efficiency and leverage automation. Their confidence in any complete automation solution was very low; they wanted to try automating a few things, but without a huge investment of effort and time.
Solution: The TCS team provided an optimal solution by suggesting the above incremental automation approach. It was readily adopted, and with each component developed, the cost-saving realisations started trickling in. TCS then suggested the DWH Automation Framework, and that too has been adopted.
6.0 Lessons Learnt
Establish clear and reasonable expectations:
o Establish what percentage of your tests are good candidates for automation.
o Eliminate overly complex or one-of-a-kind tests as candidates.
o Get a clear understanding of the automation testing requirements.
o Have technical personnel available to develop and use the tool.
o An effective manual testing process must exist before automation is possible; "ad hoc" testing cannot be automated. You should have:
o Detailed, repeatable test cases which contain exact expected results
o A standalone test environment with a restorable database
Managing resistance to change:
The tool does not replace the testers. It helps them by:
o Performing the boring, repeatable tasks involved in testing
o Freeing up some of their time so that they can create better, more effective test cases
Specific application changes will still be tested manually; some of these tests may be automated afterwards for regression testing.
o Everyone need not be trained to code; all they have to learn is a different method of testing.
Staffing requirements:
One area that organisations desiring to automate testing seem to consistently miss is staffing. Automated test tools use "scripts" which execute test cases automatically. As mentioned earlier in this paper, these "test scripts" are programs, written in whatever scripting language the tool uses. Since they are programs, they must be managed in the same way that application code is managed.
7.0 Conclusion
The DWH manual testing process can be streamlined, making it easier to test the key areas and increase coverage. It will also make testing backend applications easier. With respect to automation, an incremental-automation-based test strategy can be easy, cost effective and efficient to implement. A framework that has the flexibility of scaling up, modularity and data dependency handling can deliver great benefits to the team and the organisation.

Authors
Ashish Ranjan, TCS-JPMC Relationship, Email: [email protected]
Hossein Sadiq, TCS-JPMC Relationship, Email: [email protected]