You can use either of the following methods to delete more than one port at a time: repeatedly click the Cut button, or highlight several ports and then click the Cut button.
Editing Expressions
Follow either of these steps to expedite validation of a newly created expression: Click on the
To perform the following actions, press:
Add a new field or port: Alt + F
Copy a row: Alt + O
Cut a row: Alt + C
Move current row down: Alt + W
Move current row up: Alt + U
Paste a row: Alt + P
Validate the default value in a transformation: Alt + V
Open the Expression Editor from the expression field: F2, then press F3
Start the debugger: F9
Repository Object Shortcuts
A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the referenced object). Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing objects allows sharing complex mappings, mapplets, or reusable transformations across folders, saves space in the repository, and reduces maintenance. Follow these steps to create a repository object shortcut:
1. Expand the shared folder.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon appears along with a small curve, indicating that a shortcut will be created.
4. A dialog box appears to confirm that you want to create a shortcut. To copy an object from a shared folder instead of creating a shortcut, hold down the Ctrl key as you drag the object into the workspace.
Workflow Manager
Navigating the Workspace
When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the operation you are performing:
Create links: Press Ctrl+F2 to select the first task you want to link, press Tab to select the rest of the tasks you want to link, then press Ctrl+F2 again to link all the tasks you selected.
Edit task name in the workspace: F2
Expand a selected node and all its children: SHIFT + * (use the asterisk on the numeric keypad)
Move across to select tasks in the workspace: Tab
Select multiple tasks: Ctrl + mouse click
Repository Object Shortcuts
Mappings that reside in a shared folder can be reused within workflows by creating shortcut mappings. A set of workflow logic can be reused within workflows by creating a reusable worklet.
Last updated: 13-Feb-07 17:25
Working with the Java Transformation Object
Challenge
Occasionally, special processing of data is required that is not easy to accomplish using existing PowerCenter transformation objects. Tasks such as looping through data an arbitrary number of times are not native to the existing PowerCenter transformation objects. For these situations, the Java Transformation provides the ability to develop Java code, opening up a wide range of transformation capabilities. This Best Practice addresses questions that are commonly raised about using the JTX and how to make effective use of it, and supplements the existing PowerCenter documentation on the JTX.
Description
The Java Transformation (JTX), introduced in PowerCenter 8.0, provides a uniform means of entering and maintaining Java program code that is executed for every record processed during a session run. The Java code is entered, viewed, and maintained within the PowerCenter Designer. Below is a summary of typical questions about the JTX.
Is a JTX a passive or an active transformation?
A JTX can be either passive or active. When defining a JTX you must choose one or the other type. Once you make this choice, you cannot change it without deleting the JTX, saving the repository, and recreating the object.
Hint: If you are working with a versioned repository, you will have to purge the deleted JTX from the repository before you can recreate it with the same name.
What parts of a typical Java class can be used in a JTX?
The following standard features can be used in a JTX:
"static" initialization blocks can be defined on the Helper Code tab.
"import" statements can be listed on the Import Packages tab.
"static" variables of the Java class as a whole (e.g., counters for instances of the class), as well as non-static member variables (one per instance), can be defined on the Helper Code tab.
Auxiliary member functions and "static" functions can be declared and defined on the Helper Code tab.
"static final" variables can be defined on the Helper Code tab. However, they are private by nature; no object of any other Java class will be able to utilize them.
Important Note: Before starting a session that relies on additional "import" clauses in the Java code, make sure that the CLASSPATH environment variable contains the necessary .jar files or directories before the PowerCenter Integration Service is started.
All non-static member variables declared on the Helper Code tab are automatically available to every partition of a partitioned session without any precautions. In other words, one object of the Java class generated by PowerCenter is instantiated for every instance of the JTX and for every session partition. For example, if you utilize two instances of the same reusable JTX and run the session with three partitions, six individual objects of that Java class will be instantiated for the session run.
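As an illustration only, the following sketch shows how these pieces might be laid out for a hypothetical JTX. The variable and method names (classLoadedAt, rowsSeen, DEFAULT_CODE, normalize, loadStamp) are illustrative assumptions, not generated names; the fragments are class members that the Designer inserts into the generated class.

    // Import Packages tab (illustrative):
    import java.util.Date;

    // Helper Code tab (illustrative):
    // static variable shared by all objects generated for this JTX
    private static long classLoadedAt;

    // static initialization block, executed once when the generated class is loaded
    static {
        classLoadedAt = System.currentTimeMillis();
    }

    // "static final" constant; private to the generated class
    private static final String DEFAULT_CODE = "UNKNOWN";

    // non-static member variable; one copy per JTX instance and per session partition
    private long rowsSeen = 0;

    // auxiliary member function callable from the On Input Row code
    private String normalize(String value) {
        return (value == null) ? DEFAULT_CODE : value.trim().toUpperCase();
    }

    // auxiliary static function
    private static String loadStamp() {
        return new Date(classLoadedAt).toString();
    }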
What parts of a typical Java class cannot be utilized in a JTX?
The following standard features of Java are not available in a JTX:
Standard and user-defined constructors
Standard and user-defined destructors
Any kind of direct user interface, whether a Swing GUI or a console-based user interface
What else cannot be done in a JTX?
You cannot retrieve, change, or utilize an existing database connection in a JTX (such as a source connection, a target connection, or the relational connection of a Lookup transformation). If you need to establish a database connection, use JDBC within the JTX, and make sure that you provide the necessary connection parameters by other means.
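A minimal JDBC sketch is shown below for illustration only. The URL, credentials, table, and column names are placeholder assumptions; in practice, pass connection parameters into the JTX as input ports or mapping parameters rather than hard-coding them, and remember that the JDBC driver .jar must be on the CLASSPATH before the Integration Service starts.

    // Import Packages tab:
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Helper Code tab:
    private Connection conn = null;

    // Looks up a description for a code; opens the connection once and reuses it.
    private String lookupDescription(String code, String url, String user, String pwd)
            throws SQLException {
        if (conn == null) {
            conn = DriverManager.getConnection(url, user, pwd);
        }
        PreparedStatement stmt =
            conn.prepareStatement("SELECT descr FROM ref_codes WHERE code = ?");
        stmt.setString(1, code);
        ResultSet rs = stmt.executeQuery();
        String result = rs.next() ? rs.getString(1) : null;
        rs.close();
        stmt.close();
        return result;
    }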
How can I substitute constructors and the like in a JTX?
User-defined constructors are mainly used to pass certain initialization values to a Java class that you want to process only once. The only way to get this work done in a JTX is to pass those parameters into the JTX as normal ports and then to define a boolean member variable (with an initial value of "true"), for example named "constructMissing", on the Helper Code tab. The very first block in the On Input Row code will then look like this:
if (constructMissing) {
    … // do whatever you would do in the constructor
    constructMissing = false;
}
Interaction with users is mainly done to provide input values to some member functions of a class. This usually is not appropriate in a JTX because all input values should be provided by means of input records. If there is a need to enable immediate interaction with a user for one, several, or all input records, use an inter-process communication (IPC) mechanism to establish communication between the Java class associated with the JTX and an environment available to a user. For example, if the actual check to be performed can only be determined at runtime, you might establish a JavaBeans communication between the JTX and the classes performing the actual checks. Beware, however, that this sort of mechanism causes considerable overhead and may decrease performance dramatically. In many cases, such requirements indicate that the analysis process and the mapping design process have not been executed optimally.
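To make the pattern concrete, here is a hedged sketch assuming a hypothetical input port in_threshold (String) that carries a one-time initialization value; the variable names are illustrative.

    // Helper Code tab:
    private boolean constructMissing = true;
    private int threshold = 0;

    // On Input Row tab (first block):
    if (constructMissing) {
        // one-time "constructor" work, driven by the value delivered with the first row
        threshold = Integer.parseInt(in_threshold);
        constructMissing = false;
    }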
How do I choose between an active and a passive JTX?
Use the following guidelines to identify whether you need an active or a passive JTX in your mapping:
As a general rule of thumb, a passive JTX will usually execute faster than an active JTX.
If one input record equals one output record of the JTX, you will probably want to use a passive JTX.
If you have to produce a varying number of output records per input record (i.e., for some input values the JTX generates one output record, for some values no output records, and for some values two or more output records), you have to utilize an active JTX. There is no other choice.
If you have to accumulate one or more input records before generating one or more output records, you have to utilize an active JTX. There is no other choice.
If you have to do some initialization work before processing the first input record, this in no way determines whether to utilize an active or a passive JTX.
If you have to do some cleanup work after having processed the last input record, this in no way determines whether to utilize an active or a passive JTX.
If you have to generate one or more output records after the last input record has been processed, you have to use an active JTX. There is no other choice, apart from changing the mapping to produce these additional records by other means.
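For illustration, the sketch below shows On Input Row code for a hypothetical active JTX that emits one output row per comma-separated token of an input string. The port names in_list and o_item are assumptions; generateRow() emits the current values of the output ports as one output row.

    // On Input Row tab of a hypothetical active JTX:
    if (!isNull("in_list")) {
        String[] tokens = in_list.split(",");
        for (int i = 0; i < tokens.length; i++) {
            o_item = tokens[i].trim();   // set the output port value
            generateRow();               // emit one output row per token
        }
    } else {
        setNull("o_item");
        generateRow();                   // emit a single row carrying a NULL
    }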
How do I set up a JTX and use it in a mapping?
As with most standard transformations, you can either define a reusable JTX or create an instance directly within a mapping. The following example describes how to define a JTX in a mapping. For this example, assume that the JTX has one input port of data type String and three output ports of types String, Integer, and Smallint.
Note: As of version 8.1.1, the PowerCenter Designer is extremely sensitive regarding the port structure of a JTX. Make sure you read and understand the Notes section below before designing your first JTX; otherwise you will encounter issues when trying to run a session associated with your mapping.
1. Click the button showing the Java icon, then click on the background in the main window of the Mapping Designer. Choose whether to generate a passive or an active JTX (see "How do I choose between an active and a passive JTX" above). Remember, you cannot change this setting later.
2. Rename the JTX accordingly (i.e., rename it to "JTX_SplitString").
3. Go to the Ports tab; define all input-only ports in the Input Group, and define all output-only and input-output ports in the Output Group. Make sure that every output-only and every input-output port is defined correctly.
4. Make sure you define the port structure correctly from the outset, as changing data types of ports after the JTX has been saved to the repository will not always work.
5. Click Apply.
6. On the Properties tab you may want to change certain properties. For example, the setting "Is Partitionable" is mandatory if the session will be partitioned. Follow the hints in the lower part of the screen form that explain the selection lists in detail.
7. Activate the Java Code tab. Enter code pieces where necessary. Be aware that all ports marked as input-output ports on the Ports tab are automatically processed as pass-through ports by the Integration Service; you do not have to (and should not) enter any code referring to pass-through ports. See the Notes section below for more details.
8. Click the Compile link near the lower right corner of the screen form to compile the Java code you have entered. Check the output window at the lower border of the screen form for compilation errors and work through each error message encountered; then click Compile again. Repeat this step as often as necessary until the Java code compiles without any error messages.
9. Click OK.
10. Connect only ports of the same data type to every input-only or input-output port of the JTX. Connect output-only and input-output ports of the JTX only to ports of the same data type in downstream transformations. If any downstream transformation expects a different data type than the type of the respective output port of the JTX, insert an EXP to convert data types. Refer to the Notes below for more detail.
11. Save the mapping.
Notes:
The primitive Java data types available in a JTX that can be used for ports connecting to other transformations are Integer, Double, and Date/Time. Date/time values are delivered to or by a JTX as a Java "long" value indicating the difference of the respective date/time value from midnight, Jan 1st, 1970 (the so-called epoch) in milliseconds; to interpret this value, utilize the appropriate methods of the Java class GregorianCalendar. Smallint values cannot be delivered to or by a JTX.
The Java object data types available in a JTX that can be used for ports are String, byte arrays (for Binary ports), and BigDecimal (for Decimal values of arbitrary precision).
In a JTX you check whether an input port has a NULL value by calling the function isNull("name_of_input_port"). If an input value is NULL, you should explicitly set all dependent output ports to NULL by calling setNull("name_of_output_port"). Both functions take the name of the respective input or output port as a string.
You retrieve the value of an input port (provided the port is not NULL, see the previous paragraph) simply by referring to the name of this port in your Java source code. For example, if you have two input ports i_1 and i_2 of type Integer and one output port o_1 of type String, you might set the output value with a statement like this:
o_1 = "First value = " + i_1 + ", second value = " + i_2;
In contrast to a Custom Transformation, it is not possible to retrieve the names, data types, and/or values of pass-through ports unless these pass-through ports have been defined on the Ports tab in advance. In other words, it is impossible for a JTX to adapt to its port structure at runtime (which would be necessary, for example, for something like a Sorter JTX).
If you have to transfer 64-bit values into a JTX, deliver them to the JTX as a string representing the 64-bit number and convert this string into a Java "long" variable using the static method Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to downstream transformations, convert the "long" variable to a string output port of the JTX (e.g., using the statement o_Int64 = "" + myLongVariable ).
As of version 8.1.1, the PowerCenter Designer is very sensitive regarding the data types of ports connected to a JTX. Supplying a JTX with not exactly the expected data types, or connecting output ports to other transformations expecting other data types (i.e., a string instead of an integer), may cause the Designer to invalidate the mapping such that the only
remedy is to delete the JTX, save the mapping, and re-create the JTX.
Initialization Properties and Metadata Extensions can neither be defined nor retrieved in a JTX.
The code entered on the Java Code sub-tab "On Input Row" is inserted into other generated code; only this complete code constitutes the method "execute()" of the resulting Java class associated with the JTX (see the output of the "View Code" link near the lower-right corner of the Java Code screen form). The same holds true for the code entered on the "On End Of Data" and "On Receiving Transaction" tabs with regard to their methods. This fact has a couple of implications, which are explained in more detail below.
If you connect input and/or output ports to transformations with differing data types, you might get error messages during mapping validation. One such error message that occurs quite often indicates that the byte code of the class cannot be retrieved from the repository. In this case, rectify the port connections to all input and/or output ports of the JTX, edit the Java code (inserting one blank comment line usually suffices), and recompile the Java code.
The JTX does not currently allow pass-through ports. They have to be simulated by splitting them into one input port and one output port, and then assigning the value of each input port to the respective output port. The key here is that the input port of every pair has to be in the Input Group while the respective output port has to be in the Output Group. If you do not do this, there is no warning in the Designer, but the transformation will not function correctly.
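A short sketch tying these notes together follows; the port names (in_name, o_name, in_id64, o_id64) are hypothetical.

    // On Input Row tab:
    if (isNull("in_name")) {
        setNull("o_name");                  // propagate NULLs explicitly
    } else {
        o_name = in_name;                   // simulated pass-through port pair
    }

    if (!isNull("in_id64")) {
        long id = Long.parseLong(in_id64);  // 64-bit value arrives as a String
        o_id64 = "" + (id + 1);             // and is delivered back as a String
    }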
Where and how do I insert pieces of Java code into a JTX?
A JTX always contains a code skeleton that is generated by the Designer. Every piece of code written by a mapping designer is inserted into this skeleton at designated places. Because these code pieces do not constitute the sole content of the respective functions, there are certain rules and recommendations on how to write such code.
As mentioned previously, a mapping designer can neither write his or her own constructor nor insert any code into the default constructor or destructor generated by the Designer. Initialization work can be done in either of two ways: as part of the "static{}" initialization block, or by inserting code that in a standalone class would be part of the constructor into the On Input Row tab. Code that in a standalone class would be part of the destructor belongs on the On End Of Data tab.
The latter case (constructor code being part of the On Input Row code) requires a little trick: constructor code is supposed to be executed only once, namely before the first method is called. To resemble this behavior, follow these steps:
1. On the Helper Code tab, define a boolean variable (i.e., "constructorMissing") and initialize it to "true".
2. At the beginning of the On Input Row code, insert code that looks like the following:
if( constructorMissing) {
    … // do whatever the constructor should have done
    constructorMissing = false;
}
This ensures that this piece of code is executed only once, directly before the very first input row is processed.
The code pieces on the On Input Row, On End Of Data, and On Receiving Transaction tabs are embedded in other code. There is code that runs before the code entered here executes, and there is more code that follows; for example, exceptions raised within code written by a developer are caught there. As a mapping developer you cannot change this order, so you need to be aware of the following important implication.
Suppose you are writing a Java class that performs some checks on an input record and, if the checks fail, issues an error message and then skips processing to the next record. Such a piece of code might look like this:
if (!(firstCheckPerformed( inputRecord) && secondCheckPerformed( inputRecord))) {
    logMessage( "ERROR: one of the two checks failed!");
    return;
}
// else
insertIntoTarget( inputRecord);
countOfSucceededRows ++;
This code will not compile in a JTX because it would lead to unreachable code: the "return" at the end of the "if" block would allow the generated method (in this case named "execute()") to skip the subsequent code that is part of the framework created by the Designer. To make this code work in a JTX, change it to look like this:
if (!(firstCheckPerformed( inputRecord) && secondCheckPerformed( inputRecord))) {
    logMessage( "ERROR: one of the two checks failed!");
} else {
    insertIntoTarget( inputRecord);
    countOfSucceededRows ++;
}
The same principle (never use "return" in these code pieces) applies to all three tabs: On Input Row, On End Of Data, and On Receiving Transaction. Another important point is that the code entered on the On Input Row tab is embedded in a try-catch block, so never include any try-catch code of your own on this tab.
How fast does a JTX perform?
A JTX communicates with PowerCenter by means of JNI (the Java Native Interface), a mechanism defined by Sun Microsystems to allow Java code to interact with dynamically linkable libraries. Although JNI has been designed to perform fast, it still adds some overhead to a session because of:
the additional process switches between the PowerCenter Integration Service and the Java Virtual Machine (JVM), which executes as a separate operating system process;
Java being compiled not to machine code but to portable byte code interpreted by the JVM (although this has been largely remedied in recent years by Just-In-Time compilers);
the inherent complexity of the object model in Java (except for most number types and characters, everything in Java is an object that occupies space and execution time).
So a JTX cannot perform as fast as, for example, a carefully written Custom Transformation. The rule of thumb is that a simple JTX requires approximately 50% more total running time than an EXP of comparable functionality. Java code utilizing several of the fairly complex standard classes can be expected to need even more total runtime compared to an EXP performing the same tasks.
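As a worked illustration of this rule of thumb (the figures are hypothetical), a transformation step that takes roughly 10 minutes in an EXP could be expected to take on the order of 15 minutes in a simple JTX of comparable functionality, and longer still if the Java code relies heavily on complex standard classes.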
When should I use a JTX and when not?
As with any other standard transformation, a JTX has its advantages as well as its disadvantages. The most significant disadvantages are:
The Designer is very sensitive with regard to the data types of ports that are connected to the ports of a JTX. However, most of the trouble arising from this sensitivity can be remedied rather easily by recompiling the Java code.
Working with "long" values representing days and time (for example, via the GregorianCalendar class) can be cumbersome and demanding in terms of runtime resources (memory, execution time). Date/time ports in PowerCenter are far easier to use, so it is advisable to split date/time ports into their individual components, such as year, month, and day, and to process these individual attributes within a JTX if needed.
In general, a JTX can reduce performance simply by the nature of its architecture. Only use a JTX when necessary.
A JTX always has exactly one input group and one output group. For example, it is impossible to write a Joiner as a JTX.
Significant advantages of using a JTX are:
Java knowledge and experience are generally easier to find than comparable skills in other languages.
Prototyping with a JTX can be very fast. For example, setting up a simple JTX that calculates the calendar week and calendar year for a given date takes approximately 10 to 20 minutes, whereas writing a Custom Transformation (even for easy tasks) can take several hours.
Not every data integration environment has access to a C compiler for compiling Custom Transformations written in C. Because PowerCenter is installed with its own JDK, this problem does not arise with a JTX.
In Summary
If you need a transformation that adapts its processing behavior to its ports, a JTX is not the way to go; in such a case, write a Custom Transformation in C, C++, or Java to perform the necessary tasks. The CT API is considerably more complex than the JTX API, but it is also far more flexible. Use a JTX whenever a task cannot easily be completed using other standard options in PowerCenter, as long as performance requirements do not dictate otherwise. If performance measurements are slightly below expectations, try optimizing the Java code and the remainder of the mapping to increase processing speed.
Last updated: 04-Jun-08 19:14
Error Handling Process
Challenge
For an error handling strategy to be implemented successfully, it must be integral to the load process as a whole. The method of implementation will vary depending on the data integration requirements of each project. The resulting error handling process should, however, always involve the following three steps:
1. Error identification
2. Error retrieval
3. Error correction
This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.
Description
A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:
Relational database error logging
Email notification of workflow failures
Session error thresholds
The reporting capabilities of PowerCenter Data Analyzer
Data profiling
These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart below:
Error Identification
The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, enablement of relational error logging in PowerCenter, and referential integrity constraints at the database. This approach ensures that row-level issues such as database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called are captured in relational error logging tables. Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter
repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors, transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.
Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a Data Analyzer report, or through an email alert that notifies a user when a certain threshold is crossed (such as "number of errors is greater than zero").
Error Correction
The final step in the error handling process is error correction. Because PowerCenter automates error identification and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be exported to various file formats, including Microsoft Excel, Adobe PDF, and CSV. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports emailing a report directly through the web-based interface to make the process even easier. For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a pre-defined schedule.
After the developer or DBA identifies the condition that caused the error, a fix can be implemented. The exact method of data correction depends on factors such as the number of records in error, data availability requirements per SLA, the criticality of the data to the business unit(s), and the type of error that occurred. Considerations during error correction include:
The 'owner' of the data should always fix the data errors. For example, if the source data comes from an external system, the errors should be sent back to the source system to be fixed.
In some situations, a simple re-execution of the session will reprocess the data.
Does partial data that has already been loaded into the target systems need to be backed out in order to avoid duplicate processing of rows?
Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data can be exported from the Data Analyzer error reports to Microsoft Excel or CSV format and corrected in a spreadsheet. The corrected data can then be manually inserted into the target table using a SQL statement.
Any approach to correcting erroneous data should be precisely documented and followed as a standard. If data errors occur frequently, the reprocessing can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.
Data Profiling Option
For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports can be delivered to users through the same easy-to-use application.
Integrating Error Handling, Load Management, and Metadata
Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for
successful operation and management. The flow chart below illustrates this in the end-to-end load process.
Error handling underpins the data integration system from end-to-end. Each of the load components performs validation checks, the results of which must be reported to the operational team. These components are not just PowerCenter processes such as
business rule and field validation, but cover the entire data integration architecture, for example:
Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?
Source File Validation. Is the source file datestamp later than the previous load?
File Check. Does the number of rows successfully loaded match the source rows read?
Last updated: 09-Feb-07 13:42
Error Handling Strategies - B2B Data Transformation
Challenge
The challenge for B2B Data Transformation (B2B DT) based solutions is to create efficient, accurate processes for transforming data to appropriate intermediate data formats and to subsequently transform data from those formats to correct output formats. Error handling strategies are a core part of assuring the accuracy of any transformation process. Error handling strategies in B2B Data Transformation solutions should address the following two needs:
1. Detection of errors in the transformation, leading to successive refinement of the transformation logic during an iterative development cycle.
2. Correct error detection, retrieval, and handling in production environments.
In general, errors can be characterized as either expected or unexpected. An expected error is an error condition that we anticipate will occur periodically. For example, a printer running out of paper is an expected error. In a B2B scenario this may correspond to a partner company sending a file in an incorrect format; although it is an error condition, it is expected from time to time. Usually, processing of an expected error is part of normal system functionality and does not constitute a failure of the system to perform as designed. Unexpected errors typically occur when the designers of a system believe a particular scenario is handled but, due to logic flaws or some other implementation fault, it is not. These errors might include hardware failures, out-of-memory situations, or unexpected situations due to software bugs.
Errors can also be classified by severity (e.g., warning errors and fatal errors). For unexpected fatal errors, the transformation process is often unable to complete and may result in a loss of data. In these cases, the emphasis is on prompt discovery and reporting of the error and support of any troubleshooting process. Often the appropriate action for fatal unexpected errors is addressed not at the individual B2B Data Transformation translation level but at the level of the calling process.
This Best Practice describes various strategies for handling expected and unexpected errors from both production and development troubleshooting points of view, and discusses the error handling features included in Informatica's B2B Data Transformation 8.x.
Description
This Best Practice is intended to help designers of B2B DT solutions decide which error handling strategies to employ in their solutions and to familiarize them with new features in Informatica B2B DT.
Terminology
B2B DT is used as a generic term for the parsing, transformation, and serialization technologies provided in Informatica's B2B DT products. These technologies have been made available through the Unstructured Data Option for PowerCenter and as standalone products known as B2B Data Transformation and PowerExchange for Complex Data.
Note: Informatica's B2B DT was previously known as PowerExchange for Complex Data Exchange (CDE) or Itemfield Content Master (CM).
Errors in B2B Data Transformation Solutions
There are several types of errors possible in a B2B data transformation. The common types of errors that should be handled during design and development are:
Logic errors
Errors in structural aspects of inbound data (missing syntax, etc.)
Value errors
Errors reported by downstream components (i.e., legacy components in data hubs)
Data-type errors for individual fields
Unrealistic values (e.g., impossible dates)
Business rule breaches
Production Errors vs. Flaws in the Design: Production errors are those where the source data or the environmental setup does not conform to the specifications for the development, whereas flaws in design occur when the development does not conform to the specification. For example, a production error can be an incorrect source file format that does not conform to the specification layout given for development. A flaw in design could be as trivial as defining an element to be mandatory where the possibility of non-occurrence of the element cannot be ruled out completely.
Unexpected Errors vs. Expected Errors: Expected errors are those that can be anticipated for a solution scenario based upon experience (e.g., the EDI message file does not conform to the latest EDI specification). Unexpected errors are most likely caused by environment setup issues or unknown bugs in the program (e.g., a corrupted file system).
Severity of Errors: Not all the errors in a system are equally important. Some errors may require that the process be halted until they are corrected (e.g., an incorrect format of source files); these are termed critical/fatal errors. In other cases, a description field may be longer than the specified field length, but the truncation does not affect the process; these are termed warnings. The severity of a particular error can only be defined with respect to the impact it has on the business process it supports. In B2B DT the severity of errors is classified into the following categories:
Information: A normal operation performed by B2B Data Transformation.
Warning: A warning about a possible error. For example, B2B Data Transformation generates a warning event if an operation overwrites the existing content of a data holder. The execution continues.
Failure: A component failed. For example, an anchor fails if B2B Data Transformation cannot find it in the source document. The execution continues.
Optional Failure: An optional component, configured with the optional property, failed. For example, an optional anchor is missing from the source document. The execution continues.
Fatal error: A serious error occurred; for example, a parser has an illegal configuration. B2B Data Transformation halts the execution.
Unknown: The event status cannot be determined.
Error Handling in Data Integration Architecture
The Error Handling Framework in the context of B2B DT defines a basic infrastructure and the mechanisms for building more reliable and fault-tolerant data transformations. It integrates error handling facilities into the overall data integration architecture. How do you integrate the necessary error handling into the data integration architecture?
User interaction: Even in erroneous situations the data transformation should behave in a controlled way, and the user should be informed appropriately about the system's state. User interaction with the error handling must be designed to avoid cyclic dependencies.
Robustness: The error handling should be simple. All additional code for handling error situations makes the transformation more complex, which itself increases the probability of errors. Thus the error handling should provide some basic mechanism for handling internal errors. However, it is even more important for the error handling code to be correct and to avoid any nested error situations.
Separation of error handling code: Without any separation, the normal code will be cluttered with error handling code. This makes code less readable, error prone, and more difficult to maintain.
Specific error handling versus complexity: Errors must be classified precisely in order to handle them effectively and to take measures tailored to specific errors.
Detailed error information versus complexity: Whenever the transformation terminates due to an error, suitable information is needed to analyze it. Otherwise, it is not feasible to investigate the original fault that caused the error.
Performance: Error handling should not cost very much during normal operation.
Reusability: The services of the error handling component should be designed for reuse, because it is a basic component useful for a number of transformations.
Error Handling Mechanisms in B2B DT
The common techniques that can help a B2B DT designer in designing an error handling strategy are summarized below.
Debug: This method of error handling relies on the built-in capabilities of B2B DT for the most basic errors. A B2B DT parser or serializer can be debugged in multiple ways:
Highlight the selection of an element on the example source file.
Use a Writeval component along with disabling automatic output in the project properties.
Use the disable and enable feature for each of the components.
Run the parser/serializer and browse the event log for any failures.
All debug components should be removed before deploying the service to production.
Schema Modification: This method demonstrates one way to communicate an erroneous record once it is identified. The erroneous data can be captured at different levels (e.g., at field level or at record level). The XML schema approach adds additional XML elements to the schema structure to hold the error data and error message. This allows the developer to validate each of the elements against the business rules; if any element or record does not conform to the rules, that data and a corresponding error message can be stored in the XML structure.
Error Data in a Different File: This method stores the erroneous records or elements in a separate file rather than in the output data stream. It is useful when a business-critical timeline for data processing cannot be compromised for a couple of erroneous records, since processing of the correct records can continue while the erroneous records are inspected and corrected as a separate stream. In this approach, the business validations are performed for each of the elements against the specified rules, and any element or record that fails to conform is directed to a predefined error file. The path to the file is generally passed in the output file for further investigation, or the path is static and a script is executed to send the error files to operations for correction.
Design Time Tools for Error Handling in B2B DT
A failure is an event that prevents a component from processing data in the expected way. An anchor might fail if it searches for text that does not exist in the source document. A transformer or action might fail if its input is empty or has an inappropriate data type. A failure can be a perfectly normal occurrence. For example, a source document might contain an optional date, and a parser contains a Content anchor that processes the date if it exists. If the date does not exist in a particular source document, the Content anchor fails. By configuring the transformation appropriately, you can control the result of a failure. In the example, you might configure the parser to ignore the missing data and continue processing.
B2B Data Transformation offers the following mechanisms for error handling during design time:
B2B DT event log: This is a B2B DT specific event generation mechanism where each event corresponds to an action taken by a transformation, such as recognizing a particular lexical sequence. It is useful in the troubleshooting of work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2B DT products and from the OS-based event system.
Custom events can be generated within transformation scripts. Event-based failures are reported as exceptions or other errors in the calling environment.
B2B DT trace files: Trace files are controlled by the B2B DT configuration application. Automated strategies may be applied for the recycling of trace files.
Custom error information: At the simplest level, custom errors can be generated as B2B DT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms, and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2B DT.
The event log is the main troubleshooting tool in B2B DT solutions. It captures all of the details in an event log file when an error occurs in the system. These files can be generated when testing in a development Studio environment or when running a service in the engine. They reside in the CM_Reports directory specified in the CM_Config file under the installation directory of B2B DT; in the Studio environment the default location is Results/events.cme in the project folder. The error messages appearing in the event log file are either system generated or user defined (accomplished by adding the AddEvent action). The AddEvent action enables the developer to pass a user-defined error message to the event log file when a specific error condition occurs.
Overall, the B2B DT event mechanism is the simplest to implement. But for large or high-volume production systems, the event mechanism can create very large event files, and it offers no integration with popular enterprise software administration platforms. Informatica recommends using B2B DT events for troubleshooting purposes during development only.
In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, and producing a formatted error report can be time consuming. In some cases operator interaction may be required that could potentially block a B2B DT transformation from completing. Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2B DT to avoid performance bottlenecks. When using custom error schemes, consider the following:
Multiple invocations of the same transformation may execute in parallel.
Don't hardwire error file paths.
Don't assume a single error output file.
Avoid the use of the B2B DT event log for production systems (especially when processing Excel files).
The trace files capture the state of the system along with the process ID and failure messages. They record each error along with a time stamp and capture details about the system in different category areas, including file system, environment, networking, etc. They give details about the process ID and the thread ID that was executing, which aids in identifying system-level errors (if there are any). The name of the trace file can be modified in the Configuration wizard, and the maximum size of the trace file can be limited in the CMConfiguration editor. If the Data Transformation Engine runs under multiple user accounts, the user logs may overwrite each other, or it may be difficult to identify the logs belonging to a particular user. Prevent this by configuring users with different log locations.
In addition to the logs of service events, there is an Engine initialization event log. This log records problems that occur when the Data Transformation Engine starts, without reference to any service or input data. View this log to diagnose installation problems such as missing environment variables. The initialization log is located in the CMReports\Init directory.
New Error Handling Features in B2B DT 8.x
Using the Optional Property to Handle Failures
If the optional property of a component is not selected, a failure of the component causes its parent to fail. If the parent is also non-optional, its own parent fails, and so forth. For example, suppose that a Parser contains a Group, the Group contains a Marker, and all the components are non-optional. If the Marker does not exist in the source document, the Marker fails. This causes the Group to fail, which in turn causes the Parser to fail.
If the optional property of a component is selected, a failure of the component does not bubble up to the parent. For example, suppose that a Parser contains a Group, the Group contains a Marker, and the Group is optional. The failed Marker causes the Group to fail, but the Parser does not fail.
Note, however, that certain components lack the optional property because they never fail, regardless of their input. An example is the Sort action. If the Sort action finds no data to sort, it simply does nothing; it does not report a failure.
Rollback
If a component fails, its effects are rolled back. For example, suppose that a Group contains three non-optional Content anchors that store values in data holders. If the third Content anchor fails, the Group fails. Data Transformation rolls back the effects of the first two Content anchors; the data that they already stored in data holders is removed.
The rollback applies only to the main effects of a transformation, such as a parser storing values in data holders or a serializer writing to its output. The rollback does not apply to side effects. In the above example, if the Group contains an ODBCAction that performs an INSERT query on a database, the record that the action added to the database is not deleted.
Writing a Failure Message to the User Log
A component can be configured to output failure events to a user-defined log. For example, if an anchor fails to find text in the source document, it can write a message in the user log. This can occur even if the anchor is defined as optional, so that the failure does not terminate the transformation processing. The user log can contain the following types of information:
Failure level: Information, Warning, or Error
Name of the component that failed
Failure description
Location of the failed component in the IntelliScript
Additional information about the transformation status (such as the values of data holders)
CustomLog
The CustomLog component can be used as the value of the on_fail property. In the event of a failure, the CustomLog component runs a serializer that prepares a log message, and the system writes the message to a specified location. Its properties are:
run_serializer: A serializer that prepares the log message.
output: The output location. The options include:
MSMQOutput. Writes to an MSMQ queue.
OutputDataHolder. Writes to a data holder.
OutputFile. Writes to a file.
ResultFile. Writes to the default results file of the transformation.
OutputCOM. Uses a custom COM component to output the data.
Additional choices:
OutputPort. The name of an AdditionalOutputPort where the data is written.
StandardErrorLog. Writes to the user log.
Error Handling in B2B DT with PowerCenter Integration
In a B2B DT solution, both expected and unexpected errors can occur, whether caused by a production issue or by a flaw in the design. If the right error handling processes are not in place, then when an error occurs the processing aborts with a description of the error in the log (event file). This can also result in data loss if the erroneous records are not captured and reported correctly, and it fails the program the transformation is called from. For example, if the B2B Data Transformation service is invoked through PowerCenter UDO/B2B DT, an error causes the PowerCenter session to fail. This section focuses on how to orchestrate PowerCenter and B2B DT when the B2B DT services are called from a PowerCenter mapping. Below are the most common ways of orchestrating the error trapping and error handling mechanism.
1. Use PowerCenter's Robustness and Reporting Functions: In general, the PowerCenter engine is robust and powerful enough to handle complex error scenarios. Thus the usual practice is to perform any business validation or valid-values comparison in PowerCenter. This enables error records to be directed to the already established Bad Files or Reject Tables in PowerCenter. This feature also allows the repository to store information about the number of records loaded and the number of records rejected, and thus aids in easier reporting of errors.
2. Output the Error in an XML Tag: When complex parsing validations are involved, B2B DT is more powerful than PowerCenter in handling them (e.g., string functions and regular expressions). In these scenarios, the validations are performed in the B2B DT engine and the schema is redesigned to capture the error information in an associated tag of the XML. When this XML is parsed in a PowerCenter mapping, the error tags are stored in custom-built error reporting tables from which the errors can be reported. The design of the custom-built error tables depends on the design of the error handling XML schema; generally these tables correspond one-to-one with the XML structure, with a few additional metadata fields such as processing date, source system, etc.
3. Output to the PowerCenter Log Files: If an unexpected error occurs in the B2B DT processing, the error descriptions and details are stored in the log file directory specified in CMconfig.xml. The path to the file and the fatal errors are reported to the PowerCenter log so that operators can quickly detect problems. This unexpected-error handling can also be used, with care, for user-defined errors in the B2B DT transformation by adding the AddEvent action and marking the error type as "Failure".
Best Practices for Handling Errors in Production
In a production environment, the turnaround time of the processes should be as short as possible and the processes should be as automated as possible. Using B2B DT integration with PowerCenter, these requirements can be met seamlessly, without intervention from IT professionals, for error reporting, the correction of the data file, and the reprocessing of data.
Example Scenario 1 – HIPAA Error Reporting
Example Scenario 2 – Emailing Error Files to Operator
Below is a case study from an implementation at a major financial client. The solution was implemented with total automation of the sequence of error trapping, error reporting, correction, and reprocessing of data. The high-level solution steps are:
1. An analyst receives a loan tape via email from a dealer.
2. The analyst saves the file to a designated file share.
3. A J2EE server monitors the file share for new files and pushes them to PowerCenter.
4. PowerCenter invokes B2B DT to process the file (passing an XML data fragment, supplying the path to the loan tape file and other parameters).
5. Upon a successful outcome, PowerCenter saves the data to the target database.
6. PowerCenter notifies the analyst via email.
7. On failure, PowerCenter emails the XLS error file containing the original data and errors.
Last updated: 24-Feb-09 16:41
Error Handling Strategies - Data Warehousing
Challenge
A key requirement for any successful data warehouse or data integration project is that it attains credibility within the user community. At the same time, it is imperative that the warehouse be as up-to-date as possible, since the more recent the information derived from it, the more relevant it is to the business operations of the organization, providing the best opportunity to gain an advantage over the competition.
Transactional systems can manage to function even with a certain amount of error, since the impact of an individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information. Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e., before the business tries to use the information in error).
The types of error to consider include:
Source data structures
Sources presented out-of-sequence
'Old' sources re-presented in error
Incomplete source files
Data-type errors for individual fields
Unrealistic values (e.g., impossible dates)
Business rule breaches
Missing mandatory data
O/S errors
RDBMS errors
These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field- or column-related) concerns.
Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that every source element is populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and privileges change.
Realistically, however, the operational applications are rarely able to cope with every possible business scenario or combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.
Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current version of the warehouse can be published). Preferably, error data should be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no resources are wasted trying to load it.
As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the warehouse process; source
INFORMATICA CONFIDENTIAL
BEST PRACTICES
408 of 818
data staff understand that their professionalism directly affects the quality of the reports, and end-users become owners of their data. As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load management, data quality and key management, and operational processes and procedures. Load management processes record at a high-level if a load is unsuccessful; error management records the details of why the failure occurred. Quality management defines the criteria whereby data can be identified as in error; and error management identifies the specific error(s), thereby allowing the source data to be corrected. Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors, perhaps indicating a failure in operational procedure. Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level flow chart below:
Error Management Considerations

High-Level Issues

From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (in so far as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica's best practices.

Process dependency checks in load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected. Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load, noting the source instance, the load affected, and when and why the load was aborted.

Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row. Source table structures can be compared to expectations; typically this is done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by simply running a 'describe' command against the table (again comparing to a pre-stored version in metadata). Control file totals (for file sources) and row counts (for table sources) are also used to determine whether files have been corrupted or truncated during transfer, or whether tables have no new data in them (suggesting a fault in an operational application).

In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and any other relevant process-level details.
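Where the warehouse runs on Oracle, for example, the table-structure comparison might be expressed as a query against the RDBMS catalogue. This is a minimal sketch only: the EXPECTED_STRUCTURE metadata table and the 'STAGING' owner are illustrative assumptions, not part of any prescribed model.

-- Sketch: report columns whose actual definition in the catalogue differs
-- from the expected structure held as metadata (EXPECTED_STRUCTURE is a
-- hypothetical table holding TABLE_NAME, COLUMN_NAME, DATA_TYPE).
SELECT e.table_name,
       e.column_name,
       e.data_type AS expected_type,
       a.data_type AS actual_type
FROM   expected_structure e
       LEFT OUTER JOIN all_tab_columns a
         ON  a.owner       = 'STAGING'
         AND a.table_name  = e.table_name
         AND a.column_name = e.column_name
WHERE  a.column_name IS NULL              -- expected column is missing
   OR  a.data_type <> e.data_type;        -- column exists but type differs

Any rows returned by such a check would be recorded as process-level errors before the load is allowed to proceed.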
Low-Level Issues

Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further error management processes need to be applied to the individual source rows and fields.

Individual source fields can be compared to expected data-types using standard metadata within the repository, or additional information added by the development team. In some instances, this alone is enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more worryingly) will be processed unpredictably.
Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can be used to spot failed date conversions, string-to-number conversions, or missing required data. In rare cases, stored procedures can be called if a specific conversion fails; however, this cannot be generally recommended because of the potentially severe impact on performance if a particularly error-filled load occurs.

Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges, within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules; in this way, only the rules tables need to be amended when a new business rule must be applied. Informatica has suggested methods to implement such a process; a sketch of the rules-table approach follows this section.

Missing key/unknown key issues are defined, along with suggested techniques for identifying and handling them, in the best practice document Key Management in Data Warehousing Solutions. From an error handling perspective, however, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.

Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the events arrive from the source system in the wrong order, or if key events are missing, it may indicate a major problem with the source system, or with the way in which the source system is being used.

An important principle is to try to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first error. This seems to break the rule of not wasting resources trying to load a source data set already known to be in error; however, since the row needs to be corrected at source and then reprocessed, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first error, re-running, and then identifying a second error (which halts the load for a second time).
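A minimal sketch of the external rules-table approach mentioned above. The BUSINESS_RULES and STG_ORDERS tables, and their columns, are hypothetical names used only for illustration.

-- Hypothetical rules table: one row per field-level range rule.
CREATE TABLE business_rules (
    rule_id      NUMBER        PRIMARY KEY,
    table_name   VARCHAR2(30),
    column_name  VARCHAR2(30),
    min_value    NUMBER,
    max_value    NUMBER,
    rule_desc    VARCHAR2(200)
);

-- Flag staged rows that breach a range rule; adding a new rule only
-- requires a new row in BUSINESS_RULES, not a change to the mapping.
SELECT s.order_id,
       s.order_qty,
       r.rule_desc
FROM   stg_orders s
       JOIN business_rules r
         ON  r.table_name  = 'STG_ORDERS'
         AND r.column_name = 'ORDER_QTY'
WHERE  s.order_qty NOT BETWEEN r.min_value AND r.max_value;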
OS and RDBMS Issues

Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference information is missing). However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur: changes to schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table space extents available, missing partitions, and the like. Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax, and authentication may occur outside of the data warehouse. Often such changes are driven by systems administrators who, from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept up to speed.

In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may also be impossible to record the nature of the error at that point in time. For example, if RDBMS user IDs are revoked, it may be impossible to write a row to an error table if the error process depends on the revoked ID; if disk space runs out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host are amended, the bad files themselves (or even the log files) may not be accessible.

Most of these issues can nevertheless be managed by a proper load management process. Since setting the status of a load to 'complete' should be absolutely the last step in a given process, any failure before, or including, that point leaves the load in an 'incomplete' state. Subsequent runs should note this, and enforce correction of the last load before beginning the new one.

The best practice for managing such OS and RDBMS errors is, therefore, to ensure that the operational administrators and DBAs have proper and working communication with the data warehouse management to allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if they occur.
Auto-Correction vs. Manual Correction

Load management and key management best practices (Key Management in Data Warehousing Solutions) have already defined auto-correcting processes; the former to allow loads themselves to launch, rollback, and reload without manual intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as the source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded from scratch. In addition, such automatic correction of data might hide the fact that one of the source systems had a generic fault or, more importantly, had acquired a fault because of ongoing development of the transactional applications or a failure in user training.

The principle to apply here is to identify the errors in the load, and then alert the source system users that the data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.
Error Management Techniques

Simple Error Handling Structure

The following data structure is an example of the error metadata that should be captured, as a minimum, within the error handling strategy.
The example defines three main sets of information:

The ERROR_DEFINITION table stores descriptions for the various types of errors, including process-level errors (e.g., incorrect source file, load started out-of-sequence), row-level errors (e.g., missing foreign key, incorrect data-type, conversion errors), and reconciliation errors (e.g., incorrect row numbers, incorrect file totals).

The ERROR_HEADER table provides a high-level view on the process, allowing quick identification of the frequency of error for particular loads and of the distribution of error types. It is linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.

The ERROR_DETAIL table stores information about the actual rows with errors, including how to identify the specific row that was in error (using the source natural keys and row number), together with a string of field identifier/value pairs concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction load, but if necessary it can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent reporting.
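A minimal DDL sketch of this structure is shown below. Apart from SRC_INST_ID and PROC_INST_ID, which are named above, the column names and data types are assumptions chosen for illustration rather than a prescribed physical model.

-- Illustrative sketch only; adapt names, types, and keys to project standards.
CREATE TABLE error_definition (
    error_type_id   NUMBER         PRIMARY KEY,
    error_category  VARCHAR2(20),                -- e.g. PROCESS, ROW, RECONCILIATION
    error_desc      VARCHAR2(200)
);

CREATE TABLE error_header (
    error_hdr_id    NUMBER         PRIMARY KEY,
    src_inst_id     NUMBER,                      -- link to load management: source instance
    proc_inst_id    NUMBER,                      -- link to load management: process instance
    error_type_id   NUMBER         REFERENCES error_definition (error_type_id),
    error_count     NUMBER,
    load_date       DATE
);

CREATE TABLE error_detail (
    error_hdr_id    NUMBER         REFERENCES error_header (error_hdr_id),
    source_key      VARCHAR2(100),               -- natural key of the row in error
    source_row_no   NUMBER,
    field_values    VARCHAR2(4000)               -- concatenated field-identifier/value pairs
);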
Last updated: 01-Feb-07 18:53
Error Handling Strategies - General

Challenge

The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.
Description

Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:

The need for accurate information.
The ability to analyze or process the most complete information available with the understanding that errors can exist.
Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading process:

Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered, and reports indicate what the errors are and how they affect the completeness of the data.

Dimensional or master data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data is reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction, since users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect downstream transactions.

The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written since the data only enters the target if it is correct, and it is then loaded into the data mart using the normal process.

Reject None. This approach gives users a complete picture of the available data without having to consider data that was not available due to it being rejected during the load process. The problem is that the data may not be complete or accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions.

With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct but detail numbers that are incorrect. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort depending on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture, while rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data, unless the update is to a key element.

The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the target, and also that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.
Handling Errors in Dimension Profiles

Profiles are tables used to track historical changes to the source data. As the source systems change, profile records are created with date stamps that indicate when the change took place. This allows power users to review the target data using either current (As-Is) or past (As-Was) views of the data. A profile record should occur for each change in the source data.

Problems occur when two fields change in the source system and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

The following example represents three field values in a source system. The first row, on 1/1/2000, shows the original values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.
Date        Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    Closed Sunday    Black            Open 9 – 5
1/5/2000    Open Sunday      BRed             Open 9 – 5
1/10/2000   Open Sunday      BRed             Open 24hrs
1/15/2000   Open Sunday      Red              Open 24hrs
Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.
Date        Profile Date   Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000       Closed Sunday    Black            Open 9 – 5
1/5/2000    1/5/2000       Open Sunday      Black            Open 9 – 5
1/10/2000   1/10/2000      Open Sunday      Black            Open 24hrs
1/15/2000   1/15/2000      Open Sunday      Red              Open 24hrs
By applying all corrections as new profiles in this method, we simplify the process by applying all changes in the source system directly to the target. Each change -- regardless of whether it is a fix to a previous error -- is applied as a new change that creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first profile; the second profile should not have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3. If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field changes before the second field is fixed, we show the third field as having changed at the same time as the first. When the second field is fixed, it is also added to the existing profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then updates the profile records on 1/15/2000 to fix the Field 2 value in both.
Date        Profile Date          Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000              Closed Sunday    Black            Open 9 – 5
1/5/2000    1/5/2000              Open Sunday      Black            Open 9 – 5
1/10/2000   1/10/2000             Open Sunday      Black            Open 24hrs
1/15/2000   1/5/2000 (Update)     Open Sunday      Red              Open 9 – 5
1/15/2000   1/10/2000 (Update)    Open Sunday      Red              Open 24hrs
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms to handle the process correctly. This involves determining when an error occurred, examining all profiles generated since then, and updating them appropriately. Even if we create such algorithms, we still face the issue of determining whether a value is a correction or a new value. If an error is never fixed in the source system but a new value is entered, we would identify it as a fix to the previous error, causing an automated process to update old profile records when, in reality, a new profile record should have been created.
Recommended Method

The recommended method is to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value, it flags it as a potential fix that should be applied to old profile records. In this way, the corrected data enters the target as a new profile record, but the process of fixing old profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing profile records and corrects them as necessary. Because the current information is reflected in the new profile, this method only delays the As-Was analysis of the data until the correction method is determined.
Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators can be used to:

Show the record- and field-level quality associated with a given record at the time of extract.
Identify data sources and errors encountered in specific records.
Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can record several types of errors, e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields are appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated with the original file name and record number. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.
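As a sketch of the appended-indicator approach, a staging record with three source fields might carry one DQ code column per field. The table and column names here are hypothetical; the code values are those defined in the quality code list that follows.

-- Illustrative only: one DQ_* column per original field, holding the
-- quality indicator code ('0' = no error, '1' = fatal error, and so on).
CREATE TABLE customer_stg (
    customer_id       NUMBER,
    customer_name     VARCHAR2(30),
    customer_dob      DATE,
    dq_customer_id    CHAR(1),
    dq_customer_name  CHAR(1),
    dq_customer_dob   CHAR(1)
);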
The following types of errors cannot be processed:

A source record does not contain a valid key. The record would be sent to a reject queue. Metadata is saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.

The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

For the following error types, the records can be processed, but they contain errors:

A required (non-key) field is missing.
The value in a numeric or date field is non-numeric.
The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.
Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture, and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems, and information technology breakdowns.

The quality indicators ("0"-No Error, "1"-Fatal Error, "2"-Missing Data from a Required Field, "3"-Wrong Data Type/Format, "4"-Invalid Data Value, and "5"-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields for every data type. These indicators give operations staff, data quality analysts, and users the opportunity to readily identify issues potentially impacting the quality of the data. At the same time, they provide the level of detail necessary for acute quality problems to be remedied in a timely manner.
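These code values could be held in a small reference table of their own; the table and column names below are assumptions for illustration.

CREATE TABLE quality_code (
    quality_code  CHAR(1)       PRIMARY KEY,
    quality_desc  VARCHAR2(50)
);

INSERT INTO quality_code VALUES ('0', 'No Error');
INSERT INTO quality_code VALUES ('1', 'Fatal Error');
INSERT INTO quality_code VALUES ('2', 'Missing Data from a Required Field');
INSERT INTO quality_code VALUES ('3', 'Wrong Data Type/Format');
INSERT INTO quality_code VALUES ('4', 'Invalid Data Value');
INSERT INTO quality_code VALUES ('5', 'Outdated Reference Table in Use');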
Handling Data Errors

The need to periodically correct data in the target is inevitable. But how often should these corrections be performed? The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.
Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the target data.
Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research).

Attribute errors can be fixed by waiting for the source system to be corrected and the correction reapplied to the data in the target. When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:
Value Types         Description                                          Default
Reference Values    Attributes that are foreign keys to other tables     Unknown
Small Value Sets    Y/N indicator fields                                 No
Other               Any other type of attribute                          Null or business-provided value
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the 'Unknown' value (all reference tables contain a value of 'Unknown' for this purpose). The business should provide default values for each identified attribute.

Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these values, we use the value that represents Off or 'No' as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate 'Null' into these fields, which means "undefined" in the target. After a source system value is corrected and passes validation, it is corrected in the target.
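A minimal sketch of the default-value rule for reference attributes. The STG_LOCATIONS and LOCATION_TRANSLATION names are hypothetical; the point is simply that a source code with no translation falls back to the 'Unknown' member instead of rejecting the row.

-- Outer join to the translation table; unmatched source codes default to 'UNKNOWN'.
SELECT s.location_id,
       COALESCE(t.code_translation, 'UNKNOWN') AS location_type
FROM   stg_locations s
       LEFT OUTER JOIN location_translation t
         ON t.source_value = s.location_code;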
Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new key is actually an update to an old key in the source system. For example, a location number is assigned and the new location is transferred to the target using the normal process; then the location number is changed due to some source business rule such as: all warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new, so an analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the restore to correct the errors.
DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement. If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors

If the only business rules that reject fact records are relationship errors to dimensional data, then rows that would be rejected are saved to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.
Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.
Reference Tables

The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures.

The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain 'O', 'S' or 'W'. The data steward would be responsible for entering the following values in the translation table:
Source Value    Code Translation
O               OFFICE
S               STORE
W               WAREHSE
These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like ‘OF’, ‘ST’ and ‘WH’. The data steward would make the following entries into the translation table to maintain consistency across systems:
Source Value    Code Translation
OF              OFFICE
ST              STORE
WH              WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:
Code Translation    Code Description
OFFICE              Office
STORE               Retail Store
WAREHSE             Distribution Warehouse
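A sketch of how the steward-maintained entries above might be stored and used during the load. The LOCATION_TRANSLATION and LOCATION_REFERENCE table names and their columns are assumptions based on the headings shown; the values are those from the example.

-- Steward-maintained translation rows covering both source systems.
INSERT INTO location_translation (source_value, code_translation) VALUES ('O',  'OFFICE');
INSERT INTO location_translation (source_value, code_translation) VALUES ('S',  'STORE');
INSERT INTO location_translation (source_value, code_translation) VALUES ('W',  'WAREHSE');
INSERT INTO location_translation (source_value, code_translation) VALUES ('OF', 'OFFICE');
INSERT INTO location_translation (source_value, code_translation) VALUES ('ST', 'STORE');
INSERT INTO location_translation (source_value, code_translation) VALUES ('WH', 'WAREHSE');

-- Resolve a source code to its reporting description via both tables.
SELECT r.code_description
FROM   location_translation t
       JOIN location_reference r
         ON r.code_translation = t.code_translation
WHERE  t.source_value = 'ST';    -- returns 'Retail Store'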
Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.
Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the target may include locations and products, at a minimum. Dimensional data uses the same concept of translation as reference tables: translation tables map the source system value to the target value. For locations this is straightforward, but over time, products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities: either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities. The second lets the ETL process create the translation, but marks the record as 'Pending Verification' until the data steward reviews it and changes the status to 'Verified'; only then can facts that reference it be loaded. While the dimensional value is left as 'Pending Verification', however, facts may be rejected or allocated to dummy values. This requires the data stewards to review the status of new values on a daily basis. A potential solution is to generate an email each night if there are any translation table entries pending verification; the data steward then opens a report that lists them.

A problem specific to Product is that a value created as new may really be just a changed SKU number. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same target product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records should be deleted from the target and then reloaded from the restore to correctly split the data. Facts should be split to allocate the information correctly, and dimensions split to generate correct profile information.
Manual Updates

Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information. When this happens, both sources have the ability to update the same row in the target.

If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating the information changed. When the second system is loaded, it compares its old, unchanged value to the new profile, assumes a change occurred, and creates another new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this sounds simple, it requires complex logic when creating profiles, because multiple sources can provide information toward the one profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts from multiple sources. Another solution is to indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can use the field-level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary, and by the data integration team to customize the load process.
Last updated: 05-Jun-08 12:48
Error Handling Techniques - PowerCenter Mappings

Challenge

Identifying and capturing data errors using a mapping approach, and making such errors available for further processing or correction.
Description

Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected transformation or database constraint errors.
Data Validation Errors

The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements. Consider the following questions:

What types of data errors are likely to be encountered?
Of these errors, which ones should be captured?
What process can capture the possible errors?
Should errors be captured before they have a chance to be written to the target database?
Will any of these errors need to be reloaded or corrected?
How will the users know if errors are encountered?
How will the errors be stored?
Should descriptions be assigned for individual errors?
Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log and the reject/bad file, thus improving performance.

Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them. This approach can be effective for many types of data content error, including date conversion errors, null values intended for not-null target fields, and incorrect data formats or data types.
Sample Mapping Approach for Data Validation Errors

In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to not-null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from the data flow and logged in an error table. One solution is to implement a mapping similar to the one shown below:
An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors. A router transformation can then separate valid rows from those containing errors. It is good practice to append error rows with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to the mapping name and the ROW_ID would be created by a sequence generator. The composite key allows developers to trace rows written to the error tables, which store information useful for error reporting and investigation. In this example, two error tables are suggested, namely CUSTOMER_ERR and ERR_DESC_TBL.
The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of reference for reporting. The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns: ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them.

The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for a not-null target field could generate an error message such as 'NAME is NULL' or 'DOB is NULL'. This step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping). After the field descriptions are assigned, the error row can be split into several rows, one for each possible error, using a normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only the errors that are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL. The following table shows how the error data produced may look.
Table Name: CUSTOMER_ERR

NAME   DOB    ADDRESS   ROW_ID   MAPPING_ID
NULL   NULL   NULL      1        DIM_LOAD

Table Name: ERR_DESC_TBL

FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC        LOAD_DATE    SOURCE        TARGET
CUST          DIM_LOAD     1        Name is NULL      10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        DOB is NULL       10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        Address is NULL   10/11/2006   CUSTOMER_FF   CUSTOMER
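A minimal DDL sketch of the two error tables described above. The CUSTOMER columns, data types, and lengths are assumptions for illustration; CUSTOMER_ERR simply mirrors whatever the target CUSTOMER table contains, plus ROW_ID and MAPPING_ID.

-- CUSTOMER_ERR: a copy of the target CUSTOMER table plus the composite key columns.
CREATE TABLE customer_err (
    name        VARCHAR2(30),
    dob         DATE,
    address     VARCHAR2(100),
    row_id      NUMBER,
    mapping_id  VARCHAR2(50)
);

-- ERR_DESC_TBL: one row per error description, shared by all mappings.
CREATE TABLE err_desc_tbl (
    folder_name  VARCHAR2(50),
    mapping_id   VARCHAR2(50),
    row_id       NUMBER,
    error_desc   VARCHAR2(200),
    load_date    DATE,
    source       VARCHAR2(50),
    target       VARCHAR2(50)
);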
The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of data validation errors.

Data validation error handling can be extended by including mapping logic to grade error severity, for example by flagging data validation errors as 'soft' or 'hard'. A 'hard' error can be defined as one that would fail when being written to the database, such as a constraint error. A 'soft' error can be defined as a data content error. A record flagged as 'hard' can be filtered from the target and written to the error tables, while a record flagged as 'soft' can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting. Ultimately, business organizations need to decide whether the analysts should fix the data in the reject table or in the source systems.

The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach, however, is its flexibility: once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, the operations team can effectively communicate data quality issues to the business users.
Constraint and Transformation Errors

Perfect data can never be guaranteed. Having implemented the mapping approach described above to detect errors and log them to an error table, how can we handle unexpected errors that arise during the load? For example, PowerCenter may apply the validated data to the database, yet the relational database management system (RDBMS) may reject it for some unexpected reason, such as a constraint violation. Ideally, we would like to detect these database-level errors automatically and send them to the same error table used to store the soft errors caught by the mapping approach described above.

In some cases, the 'stop on errors' session property can be set to '1' so that no further source data is loaded once an unhandled error is encountered. In this case, the process stops with a failure, the data must be corrected, and the entire source may need to be reloaded or recovered. This is not always an acceptable approach. An alternative is to let the load continue when records are rejected, and then reprocess only the records that were found to be in error. This can be achieved by configuring the 'stop on errors' property to 0 and switching on relational error logging for the session.

By default, the error messages from the RDBMS and any uncaught transformation errors are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS. The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. The PMERR_MSG table stores the error messages that were encountered in a session, and the following four columns of this table allow us to retrieve any RDBMS errors:

· SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX) view REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.
· TRANS_NAME: Name of the transformation where an error occurred. When an RDBMS error occurs, this is the name of the target transformation.
· TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target when the error occurred.
· ERROR_MSG: Error message generated by the RDBMS.
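A post-load query might look something like the sketch below. The PMERR_MSG columns are the four listed above; the REP_LOAD_SESSIONS column names used in the join (SESSION_INSTANCE_ID, MAPPING_NAME) are assumptions and should be verified against the MX view definition for the PowerCenter version in use.

-- Sketch: retrieve database/transformation errors logged for a session together
-- with the mapping they belong to, ready for insertion into ERR_DESC_TBL.
SELECT ls.mapping_name,
       e.trans_name,          -- target transformation name for RDBMS errors
       e.trans_row_id,        -- row number at the target when the error occurred
       e.error_msg
FROM   pmerr_msg e
       JOIN rep_load_sessions ls
         ON ls.session_instance_id = e.sess_inst_id;   -- assumed join column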
With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX view REP_LOAD_SESSIONS in the repository, and insert the error details into ERR_DESC_TBL. When the post-load process ends, ERR_DESC_TBL contains both 'soft' errors and 'hard' errors.

One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can be difficult when the source and target rows are not directly related (i.e., one source row can actually result in zero or more rows at the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key and target row number). The translation table can then be used by the post-load session to identify the source key from the target row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat file, or a primary key in the case of a relational data source.
Reprocessing

After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations team can, therefore, fix the data in the source that resulted in 'soft' errors and may be able to explain and remediate the 'hard' errors. Once the errors have been fixed, the source data can be reloaded.

Ideally, only the rows that resulted in errors during the first run should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using a parameter to configure the mapping for an initial load or for a reprocess load. When reprocessing, the lookup searches for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. When initial loading, all rows are passed through the filter, validated, and loaded. With this approach, the same mapping can be used for initial and reprocess loads.

During a reprocess run, the records successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should be inserted as in an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures that reprocessing loads are repeatable and result in a reducing number of records in the error table over time.
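Where the source is relational, the reprocess-mode behaviour of the lookup/filter pair is roughly equivalent to the selection sketched below. The $$LOAD_MODE parameter name, the staging table, and the assumption that the error table records the source row number are all illustrative.

-- Initial mode: pass every staged row. Reprocess mode: pass only rows whose
-- source row number was previously written to the error table for this mapping.
SELECT src.*
FROM   stg_customer src
WHERE  '$$LOAD_MODE' = 'INITIAL'
   OR  EXISTS (SELECT 1
               FROM   err_desc_tbl err
               WHERE  err.mapping_id = 'DIM_LOAD'
               AND    err.row_id     = src.source_row_no);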
Last updated: 01-Feb-07 18:53
Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Challenge

Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and reprocess the corrected data.
Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.

The first step in implementing an error handling strategy is to understand and define the error handling requirements. Consider the following questions:

What tools and methods can help in detecting all the possible errors?
What tools and methods can help in correcting the errors?
What is the best way to reconcile data across multiple systems?
Where and how will the errors be stored (i.e., relational tables or flat files)?

A robust error handling strategy can be implemented using PowerCenter's built-in error handling capabilities along with Data Analyzer as follows:

Process errors: Configure an email task to notify the PowerCenter Administrator immediately of any process failures.

Data errors: Set up the ETL process to:

Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for analysis, correction, and reprocessing.
Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
Set up customized Data Analyzer reports and dashboards at the project level to provide information on failed sessions, sessions with failed rows, load time, etc.
Configuring an Email Task to Handle Process Failures

Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the event of a session failure. Create a reusable email task and use it in the 'On Failure Email' property settings in the Components tab of the session, as shown in the following figure.
When you configure the subject and body of a post-session email, use email variables to include information about the session run, such as session name, mapping name, status, total number of records loaded, and total number of records rejected. The following table lists the available email variables:

Email Variables for Post-Session Email
%s  Session name.
%e  Session status.
%b  Session start time.
%c  Session completion time.
%i  Session elapsed time (session completion time minus session start time).
%l  Total rows loaded.
%r  Total rows rejected.
%t  Source and target table details, including read throughput in bytes per second and write throughput in rows per second. The PowerCenter Server includes all information displayed in the session detail dialog box.
%m  Name of the mapping used in the session.
%n  Name of the folder containing the session.
%d  Name of the repository containing the session.
%g  Attach the session log to the message.
%a  Attach the named file. The file must be local to the PowerCenter Server.

Note: The file name cannot include the greater than character (>) or a line break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these variables in the email message only.
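As a hedged illustration (not taken verbatim from the product documentation), a failure notification might combine these variables along the following lines, keeping %g in the body only as the note above requires:

Subject: Session %s completed with status %e
Body: Mapping %m in folder %n finished at %c. Rows loaded: %l; rows rejected: %r. The session log is attached. %g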
Configuring Row Error Logging in PowerCenter
PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when compared with a custom error handling solution.
When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the PowerCenter Server logs error information that allows you to determine the cause and source of the error, such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information. This error metadata is logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR() function, such as business rule violations. Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily.

When you enable error logging and choose the ‘Relational Database’ Error Log Type, the PowerCenter Server offers you the following features:
Generates the following tables to help you track row errors:
PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
PMERR_MSG. Stores metadata about an error and the error message.
PMERR_SESS. Stores metadata about the session.
PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.
Appends error data to the same tables cumulatively, if they already exist, for subsequent runs of the session.
Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one set of error tables, you can specify the prefix ‘EDW_’.
Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do this, specify the same error log table name prefix for all sessions.
Example: In the following figure, the session ‘s_m_Load_Customer’ loads Customer data into the EDW Customer table. The Customer table in EDW has the following structure:

CUSTOMER_ID      NOT NULL  NUMBER (PRIMARY KEY)
CUSTOMER_NAME    NULL      VARCHAR2(30)
CUSTOMER_STATUS  NULL      VARCHAR2(10)

There is a primary key constraint on the column CUSTOMER_ID. To take advantage of PowerCenter’s built-in error handling features, you would set the session properties as shown below:
The session property ‘Error Log Type’ is set to ‘Relational Database’, and the ‘Error Log DB Connection’ and ‘Table Name Prefix’ values are given accordingly. When the PowerCenter Server detects rejected rows caused by the primary key constraint violation, it writes information into the error tables as shown below:

EDW_PMERR_DATA (one row per rejected record):
Row 1: WORKFLOW_RUN_ID=8, WORKLET_RUN_ID=0, SESS_INST_ID=3, TRANS_NAME=Customer_Table, TRANS_ROW_ID=1, TRANS_ROW_DATA=D:1001:000000000000|D:Elvis Pres|D:Valid, SOURCE_ROW_ID=-1, SOURCE_ROW_TYPE=-1, SOURCE_ROW_DATA=N/A, LINE_NO=1
Row 2: WORKFLOW_RUN_ID=8, WORKLET_RUN_ID=0, SESS_INST_ID=3, TRANS_NAME=Customer_Table, TRANS_ROW_ID=2, TRANS_ROW_DATA=D:1002:000000000000|D:James Bond|D:Valid, SOURCE_ROW_ID=-1, SOURCE_ROW_TYPE=-1, SOURCE_ROW_DATA=N/A, LINE_NO=1
Row 3: WORKFLOW_RUN_ID=8, WORKLET_RUN_ID=0, SESS_INST_ID=3, TRANS_NAME=Customer_Table, TRANS_ROW_ID=3, TRANS_ROW_DATA=D:1003:000000000000|D:Michael Ja|D:Valid, SOURCE_ROW_ID=-1, SOURCE_ROW_TYPE=-1, SOURCE_ROW_DATA=N/A, LINE_NO=1

EDW_PMERR_MSG (one row per session run):
Row 1: WORKFLOW_RUN_ID=6, SESS_INST_ID=3, SESS_START_TIME=9/15/2004 18:31, REPOSITORY_NAME=pc711, FOLDER_NAME=Folder1, WORKFLOW_NAME=wf_test1, TASK_INST_PATH=s_m_test1, MAPPING_NAME=m_test1, LINE_NO=1
Row 2: WORKFLOW_RUN_ID=7, SESS_INST_ID=3, SESS_START_TIME=9/15/2004 18:33, REPOSITORY_NAME=pc711, FOLDER_NAME=Folder1, WORKFLOW_NAME=wf_test1, TASK_INST_PATH=s_m_test1, MAPPING_NAME=m_test1, LINE_NO=1
Row 3: WORKFLOW_RUN_ID=8, SESS_INST_ID=3, SESS_START_TIME=9/15/2004 18:34, REPOSITORY_NAME=pc711, FOLDER_NAME=Folder1, WORKFLOW_NAME=wf_test1, TASK_INST_PATH=s_m_test1, MAPPING_NAME=m_test1, LINE_NO=1

EDW_PMERR_SESS (one row per session run; in this example it contains the same three runs and values shown for EDW_PMERR_MSG above).

EDW_PMERR_TRANS (one row describing the transformation ports):
Row 1: WORKFLOW_RUN_ID=8, SESS_INST_ID=3, TRANS_NAME=Customer_Table, TRANS_GROUP=Input, TRANS_ATTR=Customer_Id:3, Customer_Name:12, Customer_Status:12, LINE_NO=1
By looking at the workflow run id and other fields, you can analyze the errors and reprocess them after fixing the errors.
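As a sketch of such an analysis, the query below joins the session and data error tables to list the rejected rows for a given workflow run. The EDW_ prefix matches the example above; the connection object is assumed to be any Python DB-API connection to the error-log database, and the bind-variable syntax shown is Oracle-style, so it may need adjusting for other databases.

# Illustrative query against the error tables shown above (EDW_ prefix assumed).
# 'conn' is assumed to be any Python DB-API connection to the error-log database.

ERROR_SUMMARY_SQL = """
    SELECT s.WORKFLOW_NAME,
           s.MAPPING_NAME,
           d.TRANS_NAME,
           d.TRANS_ROW_DATA
      FROM EDW_PMERR_SESS s
      JOIN EDW_PMERR_DATA d
        ON d.WORKFLOW_RUN_ID = s.WORKFLOW_RUN_ID
       AND d.SESS_INST_ID    = s.SESS_INST_ID
     WHERE s.WORKFLOW_RUN_ID = :run_id
"""

def summarize_errors(conn, run_id):
    """Return (workflow, mapping, transformation, rejected row data) tuples
    for a single workflow run, for analysis or for driving a reprocess load."""
    cur = conn.cursor()
    cur.execute(ERROR_SUMMARY_SQL, {"run_id": run_id})
    return cur.fetchall()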
Error Detection and Notification using Data Analyzer
Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer is Informatica’s business intelligence tool for gaining insight into the PowerCenter repository metadata. You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into production-environment ETL activities. In addition, the following uses of Data Analyzer are recommended best practices:
Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever an entry is made into the error tables PMERR_DATA or PMERR_TRANS.
Configure reports and dashboards to provide detailed session run information grouped by project/PowerCenter folder for easy analysis.
Configure reports to provide detailed information on the row-level errors for each session. This can be accomplished by using the four error tables as sources of data for the reports.
Data Reconciliation Using Data Analyzer
Business users often want to see certain metrics match from one system to another (e.g., source system to ODS, ODS to targets) to confirm that the data has been processed accurately. This is frequently accomplished by writing tedious queries, comparing two separately produced reports, or using constructs such as DBLinks. Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provides a reliable and reusable way to accomplish data reconciliation.

Using Data Analyzer’s reporting capabilities, you can select data from various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers through aggregate reports. You can further schedule the reports to run automatically every time the relevant PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users of any discrepancies. For example, a report can be created to ensure that the same number of customers exists in the ODS as in the data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by comparing key metrics (e.g., customer counts, aggregated financial metrics) across data silos. Such reconciliation reports can be run automatically after PowerCenter loads the data, or they can be run on demand by technical or business users. This process allows users to verify the accuracy of the data and builds confidence in the data warehouse solution.
Last updated: 09-Feb-07 14:22
Application ILM log4j Settings

Challenge
The ILM Archive is an intensive process that can require a large amount of space for log files. The log4j settings provide a way to control what information is logged, how large each file can grow, how many backups are retained, and where the log files are created. Because the requirements for logging change as the Data Archive project moves through its various phases, the log file settings must also change.
Description
Log file settings can be modified in the log4j.properties file located in the webapp/WEB-INF directory off the root web server directory. The default settings allow the creation of almost 10GB of log files, and the log level is not set to DEBUG. In the initial development phase of a project, the logs should contain debug information to help diagnose any problems that come up during development. When the project moves to the testing phase the requirements may change, and once in production the requirements change again.
Default Settings
The default settings allow for the creation of 1,000 files of 10MB each in size and logging is set to INFO.

# Set root logger level to DEBUG and its only appender to R.
log4j.rootCategory=INFO, LOGFILE
log4j.logger.com.applimation.server.ManageApplimation=DEBUG
log4j.logger.com.applimation.aop.AMEngineLogAdvice=DEBUG
log4j.logger.com.applimation.aop.AMLogAdvice=DEBUG
log4j.logger.com.applimation.server.StdOut=DEBUG
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File= ./logs/applimation.log
# Control the maximum log file size
log4j.appender.LOGFILE.MaxFileSize=10240KB
# Archive log files (one backup file here)
log4j.appender.LOGFILE.MaxBackupIndex=1000
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p [%c] %m%n
Development Phase Settings
For the development phase, Informatica recommends changing the log level to DEBUG and also reducing the number of log files.

# Set root logger level to DEBUG and its only appender to R.
log4j.rootCategory=DEBUG, LOGFILE
log4j.logger.com.applimation.server.ManageApplimation=DEBUG
log4j.logger.com.applimation.aop.AMEngineLogAdvice=DEBUG
log4j.logger.com.applimation.aop.AMLogAdvice=DEBUG
log4j.logger.com.applimation.server.StdOut=DEBUG
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File= ./logs/applimation.log
# Control the maximum log file size
log4j.appender.LOGFILE.MaxFileSize=10240KB
# Archive log files (one backup file here)
log4j.appender.LOGFILE.MaxBackupIndex=20
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p [%c] %m%n
Testing Phase Settings
For the testing phase, the number of log files can be further reduced to save space, but the log level should remain at DEBUG.

# Set root logger level to DEBUG and its only appender to R.
log4j.rootCategory=DEBUG, LOGFILE
log4j.logger.com.applimation.server.ManageApplimation=DEBUG
log4j.logger.com.applimation.aop.AMEngineLogAdvice=DEBUG
log4j.logger.com.applimation.aop.AMLogAdvice=DEBUG
log4j.logger.com.applimation.server.StdOut=DEBUG
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File= ./logs/applimation.log
# Control the maximum log file size
log4j.appender.LOGFILE.MaxFileSize=10240KB
# Archive log files (one backup file here)
log4j.appender.LOGFILE.MaxBackupIndex=10
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p [%c] %m%n
Production Phase Settings
Once the project is in the production phase, the amount of information written to the log can be reduced by changing the log level to INFO, since most issues should have been resolved during the development and testing phases.

# Set root logger level to DEBUG and its only appender to R.
log4j.rootCategory=INFO, LOGFILE
log4j.logger.com.applimation.server.ManageApplimation=DEBUG
log4j.logger.com.applimation.aop.AMEngineLogAdvice=DEBUG
log4j.logger.com.applimation.aop.AMLogAdvice=DEBUG
log4j.logger.com.applimation.server.StdOut=DEBUG
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File= ./logs/applimation.log
# Control the maximum log file size
log4j.appender.LOGFILE.MaxFileSize=10240KB
# Archive log files (one backup file here)
log4j.appender.LOGFILE.MaxBackupIndex=10
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p [%c] %m%n
Last updated: 28-Oct-10 02:31
Using Parallelism - ILM Archive

Challenge
Production source applications gather and store data across years of operations, and relocating these often large volumes of data can be time consuming. Achieving maximum performance is important to quickly and efficiently remove and move data from the production system to the Informatica Lifecycle Management (ILM) Archive. ILM provides parallelism to increase performance; however, there are key considerations in order to fully utilize the ILM Archive performance capabilities.
Description
When utilizing ILM, the recommended approach is to use parallelism as much as possible. The Data Archive application uses parallelism on two levels: on the application server and on the database server itself. Parallelism is leveraged on the application server by spawning multiple threads to issue multiple SQL statements at once. It is leveraged at the Oracle database level by using parallel hints in the metadata to invoke Oracle parallel DML. It is not necessary to add parallel hints to all the metadata, and in some cases parallel hints will already be present where a need has previously been identified. Systems have their own nuances that must be considered: a statement may perform well without any hints on one system, while the same statement performs poorly on another.
Identifying Where to Add Parallel Hints
Parallel hints are used during the Candidate Generation step, the Copy to Staging step and, rarely, the Delete from Source step. To identify where hints may be needed, run an archive cycle for the amount of data that will make up a typical cycle for that particular entity. For example, if the plan is to archive data on a quarterly basis, run the cycle for one quarter of data. Once the cycle is complete, examine the log files for each of the steps to identify individual statements that take a long time relative to the other statements.

During Candidate Generation, the header information for all of the candidates meeting the cycle criteria is inserted into an interim table. If any of those insert statements take too much time, a parallel hint can be added to the select for the insert, and the insert will automatically get a parallel hint when the cycle is run. Keep in mind that a degree of parallelism of 4 for insert results in 8 parallel slaves being created: 4 workers for the insert and 4 workers for the select. However, only 4 of those will be active at one point in time. The parallel_max_servers init.ora parameter on the source instance needs to be set high enough to support the requested degree of parallelism.

Once the candidates are inserted into the interim table, business rules are applied to each of the candidates. The business rules are actually update statements on the interim table where a business condition exists. If a business rule is taking too long, a parallel hint can be added to the condition statement for the rule and the update statement will automatically also get a hint. Again, a degree of parallelism for update of 4 will result in 8 workers being used.

The Copy to Staging step for large tables may require a parallel hint and will also result in twice the number of workers specified for the insert degree of parallelism. The tables requiring a parallel hint can be identified by reviewing the log from the Copy to Staging step.

The only time parallel hints in the metadata are used for the Delete step is when there is no primary key defined in the metadata for that table, and this condition should be avoided. Simply setting a degree of parallelism for delete and choosing the option to use ROWID for delete causes tables to be deleted in parallel, which is the best practice. Using parallel workers instead of actual parallel DML divides the ROWIDs among the number of workers specified and does not lock the entire table, while use of parallel DML locks the entire table.
Processing During Steps
There are several steps involved in the archive and purge process, and the processing method differs between the steps. Setting the degree of parallelism to 1 for everything results in Java application server parallelism being used for the time-consuming steps, and setting the value to anything greater than 1 results in database-level parallelism. The steps that do not take much time always use Java application server level parallelism. For the time-consuming steps, database-level parallelism is recommended.
Generate Candidates
The Generate Candidates step executes in step order. The insert degree of parallelism is used for gathering the candidates (insert statements), and the update degree is used for business rules (update statements).
Build Staging
The Build Staging step processes multiple tables simultaneously up to the maximum number of threads allowed. It also creates the staging tables and does not take much time since it is creating empty tables.
Copy to Staging
The Copy to Staging step processes the tables sequentially in step order and uses Oracle parallel DML if the insert degree of parallelism is greater than 1. Setting the insert degree to 1 enables Java parallelism instead; in that case, Data Archive processes multiple tables simultaneously up to the maximum number of threads allowed.
Validate Destination
The Validate Destination step processes multiple tables simultaneously up to the maximum number of threads allowed. It also compares the structure of the tables in the stage area to the tables in the target and will update the target table automatically if a difference is found. This step does not usually take much time.
Copy to Destination
The Copy to Destination step processes the tables sequentially in step order and uses Oracle parallel DML if the insert degree of parallelism is greater than 1. Alternatively, Java application server parallelism can be enabled by setting the insert degree to 1; in that case, Data Archive processes multiple tables simultaneously up to the maximum number of threads allowed. Restore cycles process the Copy to Destination step sequentially in step order.
Delete From Source
The Delete From Source step executes in the reverse of the insert order. To control the insert order and delete order in the Enterprise Data Manager, select the interim table and then select the Tables tab in the right pane. The step processes multiple tables simultaneously up to the maximum number of threads allowed if a parallel degree of 1 for delete is used. Selecting the Use ROWID For Delete check box for the source causes the deletes to be done using ROWIDs, which is also recommended. If the Use Oracle Parallel DML check box is not checked, the number of rows to be processed is divided among ILM parallel workers. If the Use Oracle Parallel DML check box is checked, Oracle parallel DML is used. This is the most time-consuming step and has the greatest need for parallelism.
Purge Staging
The Purge Staging step processes multiple tables simultaneously up to the maximum number of threads allowed. It drops the staging tables and does not take much time.
Metadata Tuning
To enable parallelism for a specific table in the Copy to Staging step, complete the following tasks:
1. Set the relevant Degree of Parallelism attribute in the Data Archive project. Note: Descriptions of the attributes appear later in this document.
2. Add parallel hints to the Data Archive metadata using the Enterprise Data Manager (EDM).

For example, to archive the PO_DISTRIBUTIONS_ALL table with parallelism, add the following parallel hint to the insert statement metadata for the table:

A.po_header_id IN
  (SELECT /*+ FULL(X) CARDINALITY(X, 1) PARALLEL(X, #) */ X.po_header_id
     FROM XA_4149_PO_HEADERS_INTERIM X
    WHERE X.purgeable_flag = 'Y')

The pound sign (#) is a placeholder that is replaced at runtime with the corresponding Degree of Parallelism attribute. This placeholder enables the same metadata to work with different degrees of parallelism, such as in the Test and Production environments. A Data Archive insert statement executes in parallel if the Insert Degree of Parallelism is set and a parallel hint exists in the metadata WHERE clause. Using the WHERE clause above, the engine constructs an insert statement as follows:
INSERT /*+ APPEND PARALLEL (Z, 8) */ INTO AM_STAGE.AA_3666113 Z
SELECT /*+ PARALLEL (A, 8) */ A.*, A.ROWID
  FROM PO.PO_DISTRIBUTIONS_ALL A
 WHERE A.po_header_id IN
   (SELECT /*+ FULL(X) CARDINALITY(X, 1) PARALLEL(X, 8) */ X.po_header_id
      FROM XA_4149_PO_HEADERS_INTERIM X
     WHERE X.purgeable_flag = 'Y')

If the metadata does not include a PARALLEL(X, #) hint, the Data Archive engine does not process the table in parallel, and no parallel hint is added to the insert clause of the statement. This allows control over which tables the Data Archive engine archives in parallel.

The FULL and CARDINALITY hints ensure that the optimizer uses the proper execution plan when selecting the data to archive. The FULL hint suggests that Oracle do a full table scan on the interim table. The CARDINALITY hint tells the optimizer that the subselect returns only one row; this is not accurate, but it ensures that the ERP table is accessed using an index. It is important to tune the statement properly for the parallel processing to work efficiently. Parallel and other hints may be required in multiple places in the metadata statement text, often combined with other hints, to be effective and improve performance. If help is needed tuning one or more statements, contact Informatica Global Customer Support.

The Data Archive engine can also execute business rule update statements and candidate insert statements in parallel, and the implementation is the same. The Update Degree of Parallelism attribute controls the parallelism for these update statements, and the Insert Degree of Parallelism attribute controls the insert parallelism.

Data Archive does not require metadata changes to support delete parallelism. If the Delete Degree of Parallelism attribute is greater than 1, the Data Archive engine adds the necessary parallel hints automatically. However, it only processes tables without a primary key metadata constraint in parallel if the Use Oracle Parallel DML For Delete source repository attribute is enabled and the metadata delete statement contains a parallel hint.
Oracle Parallel DML for Delete
Delete parallelism is achieved by using Oracle's parallel query/DML features. These features are invoked by adding parallel hints to the delete statements that are executed. This method is the fastest way to delete data in Oracle. The following example shows a typical delete statement using Oracle parallelism:

DELETE /*+ PARALLEL(A,8) */ FROM PO.PO_DISTRIBUTIONS_ALL A
 WHERE (ROWID, PO_DISTRIBUTION_ID) IN
   (SELECT /*+ CARDINALITY(X,1) PARALLEL(X,8) */ APPLIMATION_ROW_ID, PO_DISTRIBUTION_ID
      FROM AM_STAGE.AA_3666113 X)
Parallel Delete Controlled by Data Archive
Data Archive spawns and controls multiple delete processes. Each delete process is a Java thread, and each worker processes a range of rows. This method is slower than Oracle's parallelism, but it does not lock the table. Therefore, use this option when archive cycles are scheduled during business hours. The following are sample delete statements executed by parallel worker threads:

DELETE FROM PO.PO_DISTRIBUTIONS_ALL A
 WHERE (ROWID, PO_DISTRIBUTION_ID) IN
   (SELECT /*+ CARDINALITY(X,1) */ APPLIMATION_ROW_ID, PO_DISTRIBUTION_ID
      FROM AM_STAGE.AA_3666113_T1 X
     WHERE APPLIMATION_ROW_NUM BETWEEN 1 AND 100000)

DELETE FROM PO.PO_DISTRIBUTIONS_ALL A
 WHERE (ROWID, PO_DISTRIBUTION_ID) IN
   (SELECT /*+ CARDINALITY(X,1) */ APPLIMATION_ROW_ID, PO_DISTRIBUTION_ID
      FROM AM_STAGE.AA_3666113_T1 X
     WHERE APPLIMATION_ROW_NUM BETWEEN 100001 AND 200000)
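The BETWEEN ranges in the two statements above are simply the staged row numbers divided among the worker threads. The sketch below illustrates that splitting only; it is not the engine's actual code.

# Illustrative sketch of how the APPLIMATION_ROW_NUM ranges in the worker
# delete statements above can be derived: the total staged row count is split
# into equal chunks, one per parallel worker thread.

def worker_ranges(total_rows, workers):
    """Return (low, high) APPLIMATION_ROW_NUM ranges, one per worker."""
    chunk = -(-total_rows // workers)  # ceiling division
    ranges = []
    low = 1
    while low <= total_rows:
        high = min(low + chunk - 1, total_rows)
        ranges.append((low, high))
        low = high + 1
    return ranges

# Example: 200,000 staged rows split across 2 workers
print(worker_ranges(200_000, 2))   # [(1, 100000), (100001, 200000)]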
Consider the following factors before using Oracle parallel DML:
Oracle parallelism during delete operations requires an exclusive lock on the table being deleted, preventing other users from updating that table. Read access to the table is still possible. Use Oracle parallelism if archive cycles are scheduled outside of business hours, when no users need write access to the ERP system.
The source or ERP tables need to support parallel DML. Tables created in an Oracle 8/8i database and then upgraded to 9i/10g do not support parallel DML. Use the query below to check the relevant table property:

select a.property,
       decode(bitand(a.property,536870912), 0, 'DISABLED', 'ENABLED') pdml_enabled
  from sys.tab$ a, dba_objects b
 where a.obj# = b.object_id
   and b.object_name = '&table_name'
   and b.owner = '&owner';

If a table does not support parallel DML, Oracle provides a means to enable it. For more information, contact Informatica Global Customer Support.
Performance-Related Source Connection Attributes
The connection used for a specific Data Archive project has attributes that control how any archive project using that connection is processed.
Use Staging
If Use Staging is enabled, the Data Archive engine temporarily copies the data to be archived to staging tables on the ERP instance. Enable this option to increase performance. Once set for a connection, this option cannot be changed.
Use ROWID for Delete
If Use ROWID for Delete is enabled, the Data Archive engine stores the ROWID of the source rows in the staging tables and the Delete from Source step deletes rows using the ROWID. This is the fastest way to delete rows in Oracle and is the recommended setting.
Use Oracle Parallel DML for Delete
The way Data Archive processes delete statements during the Delete from Source step is controlled by the Use Oracle Parallel DML for Delete source repository attribute. If this attribute is enabled, Data Archive uses Oracle's parallel DML. Otherwise, the Data Archive engine controls delete parallelism.
Archive Project Setup
Settings within a project control the execution of a specific run of that project and have an impact on performance.
Analyze Interim
The Analyze Interim attribute controls when optimizer statistics are gathered for interim tables during Generate Candidates. Analyze the interim table for optimal table and index statistics for the Oracle Cost Based Optimizer. Select one of the following options:
After Insert. The interim table is analyzed after data is inserted.
After Insert and Update. The interim table is analyzed after every insert and update statement.
After Update. The interim table is analyzed after business rule update statements.
None. No statistics are gathered.
Delete Commit Interval
The Delete Commit Interval attribute controls the commit frequency during delete operations in the Delete from Source step. If the value is 0, or if the table being processed does not have a primary key metadata constraint defined, the delete is executed as a single statement and no commit interval is used.
Insert Commit Interval
The Insert Commit Interval attribute controls the commit frequency during insert operations when an INSERT AS SELECT type statement cannot be used. Insert as Select cannot be used in the following cases:
The table has a LONG column.
The Database Link to Source destination repository attribute is not set during the Copy to Destination step.
The table processed has a user-defined type column, such as WF_EVENT_T, and the database version is Oracle 9i or lower during the Copy to Destination step.
The Generate Candidates and Copy to Staging steps use Insert as Select type statements except when the table contains a LONG column. Undo usage is minimal during these steps because the Data Archive engine adds an APPEND hint to the insert statement; in this case, Oracle does a direct path data load above the table high water mark.
Degree of Parallelism
The degree of parallelism within a project can be set separately for insert, update and delete. In general, the possible degree of parallelism depends on the following factors:
The number of CPUs available on the database server. Use the following query to find out the number of CPUs available to an Oracle instance:

select value from v$parameter where name = 'cpu_count';

The setting of the parallel_max_servers initialization parameter. Use the following query to check the current setting:

select value from v$parameter where name = 'parallel_max_servers';

Set this parameter to at least twice the value of the degree of parallelism chosen. For example, to execute a delete statement with a degree of 8, Oracle needs to spawn 16 parallel threads: 8 threads work on the select that retrieves the rows and 8 perform the delete.
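As a quick sanity check of the rule of thumb above (parallel_max_servers should be at least twice the chosen degree), the following illustrative snippet shows the arithmetic; the function name and the assumption about how many parallel statements overlap are not part of the product.

# Quick sanity check for the rule of thumb above: each parallel statement
# needs roughly twice its degree of parallelism in parallel slaves
# (one set for the select, one set for the insert/update/delete).

def required_parallel_max_servers(degree, concurrent_statements=1):
    """Minimum parallel_max_servers needed for the given degree, assuming
    'concurrent_statements' parallel statements run at the same time."""
    return 2 * degree * concurrent_statements

print(required_parallel_max_servers(8))      # 16 slaves for one statement of degree 8
print(required_parallel_max_servers(4, 2))   # 16 slaves if two degree-4 statements overlap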
Delete Degree of Parallelism
The degree of parallelism used for delete operations during the Delete from Source step.
Insert Degree of Parallelism
The degree of parallelism used for insert operations during the Generate Candidates, Copy to Staging and Copy to Destination steps.
Update Degree of Parallelism
The degree of parallelism used for update operations during the Generate Candidates step.
Restore Project Setup
Within a restore project there are attributes that can be set to change the performance of a restore.
Delete Commit Interval
Delete Commit Interval is unused. Restore cycles issue a single delete statement. Because single-commit delete statements can delete large volumes of data, size the UNDO tablespace on the history database accordingly.
Insert Commit Interval
The Insert Commit Interval attribute controls the commit frequency during insert operations when an Insert as Select statement cannot be used. Insert as Select cannot be used in the following cases:
The table has a LONG column.
The Database Link to Source destination repository attribute is not set during the Copy to Destination step.
The table processed has a user-defined type column, such as WF_EVENT_T, and the database version is Oracle 9i or lower during the Copy to Destination step.
The Generate Candidates and Copy to Staging steps always use Insert as Select type statements except if the table has a LONG column. Undo usage is minimal during these steps because the Data Archive engine adds an APPEND hint to the insert statement. In this case, Oracle does a direct path data load above the table high water mark.
Delete Degree of Parallelism
Delete Degree of Parallelism is currently unused.
Insert Degree of Parallelism
The degree of parallelism used for insert operations during the Generate Candidates, Copy to Staging and Copy to Destination steps.
Java Parallelism - Processing Multiple Tables Simultaneously
The Data Archive engine can process multiple tables simultaneously, depending on the step executed and the value of the corresponding Degree of Parallelism attribute. Setting a specific degree of parallelism to 1 causes Java parallelism to be used for the corresponding step. In most cases database parallelism is more effective, but in some cases application server parallelism may be desired. If help is needed to determine where Java parallelism should be used, contact Informatica Global Customer Support.

The maximum number of concurrent processes or threads depends on the value of the informia.maxActiveAMThread property. Set this property in the conf.properties file located in the web server root directory. The default is 10. This property is a system-wide setting; it does not apply to a single archive cycle. It limits the total number of Java threads that the Data Archive engine executes in parallel. Use the following URL to see all active and pending threads in the Data Archive thread queue manager: http://
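For reference, the default described above corresponds to a single line in conf.properties; the property name is the documented one and the value shown is simply the default:

informia.maxActiveAMThread=10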
Last updated: 28-Oct-10 03:11
Business Case Development

Challenge
Establishing an Integration Competency Center (ICC) or shared infrastructure takes money, resources and management attention. While most enterprises have requirements and defined standards for business case documents to justify projects that involve a financial expenditure or significant organizational change, integration initiatives present the unique challenges associated with multi-functional, cross-organizational initiatives. Most IT personnel are quite capable of articulating the technical case for the degree of centralization that an ICC, standard technology or shared integration system represents; proving the business case is likely to be more challenging, and the technical case alone is unlikely to secure the required funding. It is important to identify the departments and individuals that are likely to benefit directly and indirectly from the implementation of the ICC. This Best Practice provides a systematic approach to researching, documenting and presenting the business justification for these sorts of complex integration initiatives.
Description
The process to establish a business case for an ICC or shared infrastructure such as an Enterprise Data Warehouse, ETL hub, Enterprise Service Bus, or Data Governance program (to name just a few) is fundamentally an exercise in analysis and persuasion. This process is demonstrated graphically in the figure below:
The following sections describe each step in this process.
Step 1: Clarify Business Need
Data integration investments should be part of a business strategy that must, in turn, be part of the overall corporate strategy. Mismatched IT investments will only move the organization in the wrong direction more quickly. Consequently, an investment in data integration should (depending on the enterprise requirements) be based on a requirement to:
Improve data quality
Reduce future integration costs
Reduce system architecture complexity
Increase implementation speed for new systems
Reduce corporate costs
Support business priorities

The first step in the business case process, therefore, is to state the integration problem in such a way as to clearly define the circumstances leading to the consideration of the investment. This step is important because it identifies both the questions to be resolved by the analysis and the boundaries of the investigation. The problem statement identifies the need to be satisfied, the problem to be solved or the opportunity to be exploited. The problem statement should address:
the corporate and program goals and other objectives affected by the proposed investment
a description of the problem, need, or opportunity
a general indication of the range of possible actions

Although the immediate concern may be to fulfill the needs of a specific integration opportunity, you must, nevertheless, consider the overall corporate goals. A business solution that does not take into account corporate priorities and business strategies may never deliver its expected benefits due to unanticipated changes within the organization or its processes. There is a significant danger associated with unverified assumptions that can derail business case development at the outset of the process. It is imperative to be precise about the business need that the ICC is designed to address; abandon preconceptions and get to the core of the requirements. Do not assume that the perceived benefits of a centralized service such as an ICC are so obvious that they do not need to be specifically stated.

In summary, key activities in Step 1 include:
Define and agree on the problem, opportunity or goals that will guide the development of the business case.
Use brainstorming techniques to envision how you would describe the business need in a compelling way. Start with the end in mind based on what you know.
Prepare a plan to gather the data and facts needed to justify the vision.
Understand the organization’s financial accounting methods and business case development standards.
Understand the enterprise funding approval governance processes.

Note: It is important to review the core assumptions as the business case evolves and new information becomes available.
TIP
Management by Fact (MBF) is a tool used in a number of methodologies including Six-Sigma, CMM, and QPM. It is a concise summary of a quantified problem statement, performance history, prioritized root causes, and corresponding countermeasures for the purpose of data-driven problem analysis and management. MBF:
Uses the facts - eliminates bias
Tightly couples resources and effort to problem-solving
Clarifies the problem - use "4 Whats" to help quantify the problem statement and quantify the gap between actual and desired performance
Determines root cause - separate beliefs from facts; use "5 Whys" (by the time you have answered the 5th "why" you should understand the root cause)
The key output from Step 1 is a notional problem statement (with placeholders or “guesses” for key measures) and several graphical sketches representing a clearly defined problem/needs statement using business terms to describe the problem. The following figure shows the basic structure of a problem statement with supporting facts and proposed resolution that can be summarized on a one-page PowerPoint slide. While the basic structure should be defined in Step 1, the supporting details and action plans will emerge from the subsequent analysis steps.
Step 2: Identify Options and Define Approach
The way in which you describe solutions or opportunities is likely to shape the analysis that follows. Do not focus on specific technologies, products or methods, as this may exclude other options that might produce the same benefits at a lower cost, or increased benefits for the same cost. Instead, try to identify all of the possible ways in which the organization can meet the business objectives described in the problem statement. In this way, the options that are developed and analyzed will have a clear relationship to the organization’s needs. Unless this relationship is clear, you may be accused of investing in technology for technology’s sake.

Available options must include the base case, as well as a range of other potential solutions. The base case should show how the organization would perform if it did not pursue the data integration investment proposal or otherwise change its method of operation. It is important to highlight any alternative solutions to the integration investment. A description of what is meant by doing nothing is required here. It is not adequate to state the base case simply as the continuation of the current situation; it must account for future developments over a period long enough to serve as a basis of comparison for a new system. For example, an organization that keeps an aging integration technique may face increasing maintenance costs as the systems get older and the integrations more complex. There may be more frequent system failures and changes causing longer periods of down time. Maintenance costs may become prohibitive, service delays intolerable or workloads unmanageable. Alternatively, demand for a business unit’s services may ultimately decrease, permitting a reduction of costs without the need for an integration investment.

Be sure to examine all the options in both the short and long term.
Short Term: The document should highlight the immediate effect of doing nothing. For example, the competition may have already implemented systems such as a Customer Data Integration hub or a Data Quality program and is able to offer more competitive services. Thus, the enterprise may already be losing market share because of its inability to change and react to market conditions.
If there is significant market share loss, it should be presented so as to emphasize the need for something to be done.
Long Term: The base case should predict the long-term costs and benefits of maintaining the current method of operation, taking into account the known external pressures for change, such as predicted changes in demand for service, budgets, staffing or business direction.

Problems can be solved in different ways and to different extents. In some cases, options are available that concentrate on making optimum use of existing systems or on altering current procedures. These options may require little or no new investment and should be considered. A full-scale analysis of all options is neither achievable nor necessary. A screening process is the best way to ensure that the analysis proceeds with only the most promising options. Screening allows a wide range of initial options to be considered, while keeping the level of effort reasonable. Establishing a process for screening options has the added advantage of setting out, in an evaluation framework, the reasons for selecting as well as rejecting particular options.

Options should be ruled out as soon as it becomes clear that other choices are superior from a cost-benefit perspective. A comparative cost-benefit framework should quickly identify the key features likely to make a difference among options. Grouping options with similar key features can help identify differences associated with cost disadvantages or benefit advantages that would persist even if subjected to more rigorous analysis. Options may be ruled out on the basis that their success depends too heavily on unproven technology or that they just will not work. Take care not to confuse options that will not work with options that are merely less desirable. Options that are simply undesirable will drop out when you begin to measure the costs and benefits. The objective is to subject options to an increasingly rigorous analysis. A good rule of thumb is that, when in doubt about the economic merits of a particular option, the analyst should retain it for subsequent, more detailed rounds of estimation.

To secure funds in support of ICC infrastructure investments, a number of broad-based strategies and detailed methods can be used. Below are five primary strategies that address many of the funding challenges:
This strategy therefore requires that the ICC team perform some advance work and be prepared with a rough investment proposal for addressing structural issues and be ready to quickly present it when the opportunity presents itself. 3. Executive vision. This strategy relies on ownership being driven by a top level executive (e.g., CEO, CFO, CIO, etc.) who has control over a certain amount of discretionary funding. In this scenario, a business case may not be required because the investment is being driven by a belief in core principles and a top-down vision. This is often the path of least resistance if you have the fortune to have an executive with the appropriate vision that aligns with the ICC charter/mission. The downside is that if the executive leaves the organization or is promoted into another role, the ICC momentum and any associated investment may fade away if insufficient cross-functional support has been developed. 4. Ride on a wave. This strategy involves tying the infrastructure investment to a large project with definite ROI and implementing the foundational elements to serve future projects and the enterprise overall rather than just the large project’s needs. Examples include purchasing the hardware and software for an enterprise data integration hub in conjunction with a corporate merger/acquisition program or building an enterprise hub as part of a large ERP system implementation. This strategy may make it easier to secure the funds for an infrastructure that is hard to justify on its own merits, but has the risk of becoming too project-specific and not as reusable by the rest of the enterprise. 5. Create the wave. This strategy involves developing a clear business case with defined benefits and a revenue/cost sharing model that are agreed to in advance by all stakeholders who will use the shared infrastructure. This is one of the most difficult strategies to execute because it requires a substantial up-front investment in building the business case and gaining broad-based organizational support. But it can also be one of the most rewarding because all the hard work to build support and address the “political” issues is done early. In summary, the activities in Step 2 to identify the options and develop the approach are: Assemble the evaluation team Include a mix of resources if possible, including some change agents and change resistors Use internal resources that “know their way around” the organization Use external resources to ask the naive questions and side-step internal politics Prepare for evaluation including understanding the organizational financial standards and internal accounting methods Define the options and a framework for structuring the investment decision Identify the options (including the baseline “status quo” option) Screen the options based on short-term and long-term effects Determine the Business Case Style The key deliverable resulting from step 2 is a list of options and a rough idea of the answers to the following questions for each of them: How does the solution relate to the enterprise objectives and priorities? What overall program performance results will the option achieve? What may happen if the option is not selected?
What additional outcomes or benefits may occur if this option is selected?
Who are the stakeholders? What are their interests and responsibilities? What effect does the option have on them?
What will be the implications for the organization’s human resources?
What are the projected improvements in timeliness, productivity, cost savings, cost avoidance, quality and service?
How much will it cost to implement the integration solution?
Does the solution involve the innovative use of technology? If so, what risks does that involve?

TIP
Return on Investment (ROI) is often used as a generic term for any kind of measure that compares the financial costs and benefits of an action. A more narrow finance definition is “rate of return”; for example, if you put $100 into a savings account and have $105 a year later, the ROI is 5%. Some of the most common ROI methods are:
Payback period, in months or years, is equal to the investment amount divided by the incremental annual cash flow; for example, if I invest $1 million, how long will it take to earn the same amount in incremental revenue?
Net Present Value (NPV) is the present (discounted) value of future cash inflows minus the present value of the investment and any associated future cash outflows.
Internal Rate of Return (IRR) is the discount rate that results in a net present value of zero for a series of future cash flows.
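The three measures in this TIP can be made concrete with a short calculation. The sketch below uses a purely hypothetical cash flow (a $1M investment returning $400K per year for five years); the helper functions are illustrative, not a prescribed method.

# Illustrative calculations for the ROI measures described above. The cash
# flow figures used in the example call are hypothetical.

def payback_period_years(investment, annual_cash_flow):
    """Years needed for cumulative incremental cash flow to repay the investment."""
    return investment / annual_cash_flow

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the (negative) initial investment."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

def irr(cash_flows, low=-0.99, high=10.0, tol=1e-6):
    """Internal rate of return by bisection: the rate at which NPV is zero."""
    for _ in range(200):
        mid = (low + high) / 2
        if npv(mid, cash_flows) > 0:
            low = mid
        else:
            high = mid
        if high - low < tol:
            break
    return (low + high) / 2

# Hypothetical investment: $1M up front, $400K incremental benefit per year for 5 years
flows = [-1_000_000, 400_000, 400_000, 400_000, 400_000, 400_000]
print(payback_period_years(1_000_000, 400_000))  # 2.5 years
print(round(npv(0.10, flows)))                   # about 516,315 at a 10% discount rate
print(round(irr(flows), 3))                      # 0.286 (about 28.6%)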
Step 3: Determine Costs
It is necessary to define the costs associated with the options that will influence the investment decision. Be sure to consider all the cost dimensions, including:
Fixed vs. direct vs. variable costs
Internal vs. external costs
Capital vs. operating costs
One-time costs vs. ongoing cost model

Before defining costs, it is generally useful to create a classification for the various kinds of activities that will make up the project effort. This structure varies greatly from project to project. Following is an example for a data integration project that has a large number of interfaces with several levels of complexity. This classification helps categorize the different integration efforts within the data integration initiative. The categories are based on 1) the number of fields that must be transformed between source and target; 2) the number of targets for each transformation; and 3) the complexity of the data structure. The categories are:
Simple: Two applications, fewer than 50 fields and simple flat message layouts
Moderate: Two or more applications, fewer than 50 fields or hierarchical message layouts
Complex: Multiple applications, more than 50 fields or complex message layouts
Note: This is a very simple example; some projects may have 10 or more classification schemes for various aspects of the project. Once the classification for a given project is determined, it can be used to develop cost models for various scenarios.

The fixed and direct costs are the costs that do not vary with the number of integration interfaces to be built. They can be costs that are incurred over a period of time but can be envisaged as one end number at the end of that period. Direct up-front costs are the ‘out-of-pocket’ development and implementation expenses. These can be substantial and should be carefully assessed. Fortunately, these costs are generally well documented and easily determined, except for projects that involve new technology or software applications. The main categories of direct/fixed costs are:
hardware and peripherals
packaged and customized software
initial data collection or conversion of archival data
data quality analysis and profiling
facilities upgrades, including site preparation and renovation
design and implementation
testing and prototyping
documentation
additional staffing requirements
initial user training
transition, such as costs of running parallel systems
quality assurance and post-implementation reviews

Direct ongoing costs are the ‘out-of-pocket’ expenses that occur over the lifecycle of the investment. The costs to operate a facility, as well as to develop or implement an option, must be identified. The main categories of direct ongoing costs are:
salaries for staff
software license fees, maintenance and upgrades
computing equipment and maintenance
user support
ongoing training
reviews and audits

Note: Not all of these categories are included in every data integration implementation. It is important to pick the costs that reflect your implementation accurately.

The primary output from Step 3 is a financial model (typically a spreadsheet) around the options, with different views according to the interests of the main stakeholders.

TIP
Business Case Bundling
Some elements of a program may be necessary but hard to justify on their own; include these elements with other elements that are easier to justify.
Enterprise Level Licensing
In determining costs, there is often an opportunity to get help from external resources such as suppliers of actual or potential technology and services, since they build business cases for a living and can often suggest creative solutions. For example, there are many licensing options for all the components within the Informatica product suite; your Account Manager can provide information on the costs associated with the organizational options you have identified. In the context of an ICC, there may well be significant cost advantages in licensing at the level of the enterprise.

TIP
A Total Cost of Operation (or Ownership, TCO) assessment ideally offers a picture of not only the cost of purchase but of all aspects of the further use and maintenance of a solution. This includes items such as:
Development expenses, testing infrastructure and expenses, and deployment costs
Costs of training support personnel and the users of the system
Costs associated with failure or outage (planned and unplanned)
Diminished performance incidents (i.e., if users are kept waiting)
Costs of security breaches (in loss of reputation and recovery costs)
Costs of disaster preparedness and recovery, floor space and electricity
Marginal incremental growth, decommissioning, e-waste handling, and more
When incorporated in any financial benefit analysis (e.g., ROI, IRR, EVA), a TCO provides a cost basis for determining the economic value of that investment. TCO can, and often does, vary dramatically from TCA (total cost of acquisition). Although TCO is far more relevant in determining the viability of any capital investment, many organizations make ROI investment decisions by considering only the initial implementation costs.
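The sketch below illustrates the kind of simple cost model that the Step 3 spreadsheet typically contains, using the Simple/Moderate/Complex interface classification introduced earlier. All unit costs, interface counts, and fixed/ongoing figures are hypothetical placeholders, not benchmarks.

# Sketch of the kind of cost model Step 3 produces, using the Simple /
# Moderate / Complex interface classification described above. All unit
# costs and counts below are hypothetical placeholders.

UNIT_BUILD_COST = {"simple": 5_000, "moderate": 15_000, "complex": 40_000}

def project_cost(interface_counts, fixed_costs, annual_ongoing_costs, years):
    """Total cost over 'years': fixed costs + per-interface build costs + ongoing costs."""
    build = sum(UNIT_BUILD_COST[k] * n for k, n in interface_counts.items())
    return fixed_costs + build + annual_ongoing_costs * years

# Hypothetical option: 60 simple, 30 moderate, 10 complex interfaces,
# $500K of fixed costs (hardware, software, training) and $200K/year ongoing.
counts = {"simple": 60, "moderate": 30, "complex": 10}
print(project_cost(counts, fixed_costs=500_000, annual_ongoing_costs=200_000, years=3))
# 2,250,000 over a three-year horizon for this hypothetical option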
Step 4: Define Benefits
This step identifies and quantifies the potential benefits of a proposed integration investment. Both quantitative and qualitative benefits should be defined. Determine how improvements in productivity and service are defined, and also the methods for realizing the benefits:
Direct (hard) and indirect (soft) benefits
Financial model of the benefits
Collect industry studies to complement internal analysis
Identify anecdotal examples to reinforce facts
Define how benefits will be measured

To structure the evaluation, you will have to clearly identify and quantify the project’s advantages. A structure is required to set a range within which the benefits of an integration implementation can be realized. Conservative, moderate, and optimistic values are used in the attempt to produce a final range that realistically contains the benefits to the enterprise of an integration project, but also reflects the difficulty of assigning precise values to some of the elements.
Conservative values reflect the highest costs and lowest benefits possible
Moderate values reflect those you believe most accurately reflect the true value of the data integration implementation
Optimistic estimates reflect values that are highly favorable but also not improbable

Many of the greatest benefits of a data integration project are realized months or even years after the project has been completed. These benefits should be estimated for each of the three to five years following the project. In order to account for the changing value of money over time, all future pre-tax costs and benefits are discounted at the Internal Rate of Return (IRR) percent.

Direct (Hard) Benefits: The enterprise will immediately notice a few direct benefits from improving data integration in its projects. These are mainly cost savings over traditional point-to-point integration, in which interfaces and transformations between applications are hard-coded, and the cost savings the enterprise will incur due to the enhanced integration and automation made possible by data integration. Key considerations include:
Cost savings
Reduction in complexity
Reduction in staff training
Reduction in manual processes
Incremental revenue linked directly to the project
Governance and compliance controls that are directly linked

Indirect (Soft) Benefits: The greatest benefits from an integration project usually stem from the extended flexibility the system will have. For this reason, these benefits tend to be longer-term and indirect. Among them are:
Increase in market share
Decrease in cost of future application upgrades
Improved data quality and reporting accuracy
Decrease in effort required for integration projects
Improved quality of work for staff and reduced turnover
Better management decisions
Reduced wastage and re-work
Ability to adopt a managed service strategy
Increased scalability and performance
Improved services to suppliers and customers
Increase in transaction auditing capabilities
Decreased time to market for mission critical projects
Increased security features
Improved regulatory compliance

It is possible to turn indirect benefits into direct benefits by performing a detailed analysis and working with finance and management stakeholders to gain support. This may not always be necessary, but it is often essential (especially with a “Create the Wave” business case style). Since it can take a lot of time and effort to complete this analysis, the recommended best practice is to select only one indirect benefit as the target for a detailed analysis. Refer to Appendix C at the end of this document for an additional list of possible business value categories and analysis options.
TIP
Turn subjective terms into numbers
Fact-based, quantitative drivers and metrics are more compelling than subjective ones. Outcome-based objectives are more persuasive than activity-based measures. If detailed analysis is not feasible for the entire scope, it may be sufficient to use a combination of big-picture, top-down numbers for macro-level analysis plus a micro-level analysis on a representative piece of the whole.

The following figure illustrates an example of a compelling exposition of ICC benefits. Note that the initial cost of integrations developed by the ICC is greater than hand-coding, but after 100 integrations the ICC cost is less. In this example (which is based on a real-life case), the enterprise developed more than 300 integrations per year, which translates into a saving of $3 million per year.
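The break-even behaviour in that example can be reproduced with a few assumed figures. The per-integration costs and the one-time ICC setup cost below are illustrative assumptions chosen only so that the arithmetic matches the figures quoted above (break-even near 100 integrations, roughly $3 million saved per year at 300 integrations per year).

# Break-even sketch consistent with the example above. The unit costs and the
# one-time ICC setup cost are illustrative assumptions, not reported figures.

ICC_SETUP_COST = 1_000_000                 # one-time investment in the ICC (assumed)
ICC_COST_PER_INTEGRATION = 10_000          # assumed marginal cost with the ICC
HAND_CODED_COST_PER_INTEGRATION = 20_000   # assumed cost of a hand-coded integration

def cumulative_cost(n, setup, per_integration):
    """Cumulative cost of building n integrations under one approach."""
    return setup + n * per_integration

break_even = ICC_SETUP_COST // (HAND_CODED_COST_PER_INTEGRATION - ICC_COST_PER_INTEGRATION)
annual_saving = 300 * (HAND_CODED_COST_PER_INTEGRATION - ICC_COST_PER_INTEGRATION)

print(break_even)      # 100 integrations to break even
print(annual_saving)   # 3,000,000 saved per year at 300 integrations per year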
It is also useful to identify the project beneficiaries and to understand their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart – a useful reference document that ensures that all project team members understand the corporate/business organization.
TIP
Leverage Industry Studies
Industry studies from research firms such as Gartner, Forrester, and AMR Research can be used to highlight the value of an integration approach. For example, all the percentages below are available from various research reports. This example is for a US-based telecommunications company with annual revenue of $10B; the industry study suggests that an ICC would save $30 million per year for an organization of this size.

Company revenue (telecommunications industry):                               $10,000,000,000
% of revenue spent on IT [1]:                                         5.0%   $500,000,000
% of IT budget spent on investments [1]:                             40.0%   $200,000,000
% of investment projects spent on integration [2]:                   35.0%   $70,000,000
% of integration project savings resulting from an ICC [3]:          30.0%   $21,000,000
% of IT budget spent on MOOSE [1]:                                   60.0%   $300,000,000
% of MOOSE spent on maintenance (guesstimate - no study available):  15.0%   $45,000,000
% of integration savings on maintenance costs from an ICC [3]:       20.0%   $9,000,000
Total potential annual savings resulting from an ICC:                        $30,000,000

Notes:
1. Forrester, 11-13-2007, "US IT Spending Benchmarks For 2007"
2. Gartner, 11-6-2003, "Client Issues for Application Integration"
3. Gartner, 4-4-2008, "Cost Cutting Through the Use of an Integration Competency Center or SOA Center of Excellence"
MOOSE = Maintain and Operate the IT Organization, Systems, and Equipment
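The arithmetic behind the table above can be reproduced in a few lines; the revenue figure and percentages are the ones quoted in the example, and only the code itself is an illustrative sketch.

# Minimal sketch reproducing the industry-study arithmetic above.

revenue = 10_000_000_000

it_budget           = revenue * 0.05            # 5.0% of revenue spent on IT
investment_budget   = it_budget * 0.40          # 40% of IT budget on investments
integration_spend   = investment_budget * 0.35  # 35% of investment projects on integration
project_savings     = integration_spend * 0.30  # 30% ICC saving on integration projects

moose_budget        = it_budget * 0.60          # 60% of IT budget on MOOSE
maintenance_spend   = moose_budget * 0.15       # 15% of MOOSE on maintenance (guesstimate)
maintenance_savings = maintenance_spend * 0.20  # 20% ICC saving on integration maintenance

total = project_savings + maintenance_savings
print(f"Project savings:          ${project_savings:,.0f}")      # $21,000,000
print(f"Maintenance savings:      ${maintenance_savings:,.0f}")  # $9,000,000
print(f"Total annual ICC savings: ${total:,.0f}")                # $30,000,000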
Step 5: Analyze Options

After you have identified the options, the next step is to recommend one. Before selecting an option to recommend, you need a good understanding of the organization's goals, its business processes, and the business requirements that must be satisfied. To evaluate investment options, select criteria that allow measurement and comparison. The following list presents some possible analyses, starting with those that involve hard financial returns and progressing to those that are more strategic:
- Analysis of cost effectiveness: demonstrates, in financial terms, improvements in performance or in service delivery; and shows whether the benefits from the data integration investment outweigh its costs.
- Analysis of displaced or avoided costs: compares the proposed system's costs to those of the system it would displace or avoid; and may justify the proposal on a least-cost basis if it can be assumed that the new system will have as many benefits as the current system.
- Work value analysis: requires analysis of work patterns throughout the organization and of ways that would re-adjust the number and types of skills required; and assumes that additional work needs to be done, that management allocates resources efficiently, and that workers allocate time efficiently.
- Cost of quality analysis: estimates the savings to be gained by reducing the cost of quality assurance, such as the cost of preventing or repairing a product failure; and can consider savings that are internal and external to the organization, such as the enterprise's cost to return a product.
- Option value analysis: estimates the value of future opportunities that the organization may now pursue because of the project; uses decision trees and probability analysis; and includes savings on future projects, portions of the benefits of future projects, and reductions in the risks associated with future projects.
- Analysis of technical importance: justifies an infrastructure investment because a larger project that has already received approval could not proceed without it. This is likely when an enterprise initiates a data integration program as a consequence of a merger or acquisition and two large ERP systems need to communicate.
- Alignment with business objectives: includes the concept of strategic alignment modeling, which is one way to examine the interaction between IT strategy and business strategy; and allows managers to put a value on the direct contribution of an investment to the strategic objectives of the organization.
- Analysis of level-of-service improvements: estimates the benefits to enterprises of increases in the quantity, quality, or delivery of services; and must be done from the enterprise's viewpoint.
- Research and development (R&D): is a variant of option value analysis, except that the decision on whether to invest in a large data integration project depends on the outcome of a pilot project. It is most useful for high-risk projects, where R&D can assess the likelihood of failure and help managers decide whether to abort the project or better manage its risks; and it requires management to accept the consequences of failure and to accept that the pilot is a reasonable expense in determining the viability of a data integration project.

TIP
Use analytical techniques, such as discounted cash flow (DCF), internal rate of return (IRR), return on investment (ROI), net present value (NPV), or break-even/payback analysis to estimate the dollar value of options.
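As a minimal sketch of the techniques named in the TIP, the snippet below computes NPV, ROI, and payback period for two candidate options. The cash-flow figures and the 10% discount rate are hypothetical placeholders used only to show the mechanics.

# Minimal sketch of NPV, ROI and payback for comparing investment options.
# All cash flows and the 10% discount rate are illustrative assumptions.

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the year-0 (usually negative) outlay."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def roi(cash_flows):
    invested = -sum(cf for cf in cash_flows if cf < 0)
    returned = sum(cf for cf in cash_flows if cf > 0)
    return (returned - invested) / invested

def payback_years(cash_flows):
    """Years until cumulative cash flow turns non-negative (None if it never does)."""
    cumulative = 0
    for year, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return year
    return None

option_a = [-500_000, 200_000, 250_000, 300_000]  # hypothetical integration option
option_b = [-300_000, 100_000, 150_000, 200_000]  # hypothetical alternative

for name, flows in (("Option A", option_a), ("Option B", option_b)):
    print(name, f"NPV={npv(0.10, flows):,.0f}",
          f"ROI={roi(flows):.0%}", f"payback={payback_years(flows)} yrs")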
After you have quantified the costs and benefits, it is essential to conduct a cost-benefit analysis of the various options. Showing the incremental benefits of each option relative to the base case requires less analysis, since the analyst does not have to evaluate the benefits and costs of an entire program or service. Some benefits may not be quantifiable. Nevertheless, these benefits should be included in the analysis, along with the benefits to individuals within and external to the organization. You have to look at the project from two perspectives: the organization’s perspective as the supplier of products and services; and the enterprise’s or public’s perspective as the consumer of those services. Hard cost savings come from dedicated resources (people and equipment) while more uncertain savings come from allocated costs such as overheads
and workload. When estimating cost avoidance, keep these two types of savings separate. Assess how likely it is that the organization will realize savings from allocated resources, and estimate how long it will take to realize these savings.

TIP
Since the cost-benefit analysis is an important part of the decision-making process, verify calculations thoroughly. Check figures on spreadsheets both before and during the analysis. Include techniques and assumptions in the notes accompanying the analysis of each option.
Step 6: Evaluate Risks

Step 6 presents ways to help identify and evaluate the risks that an integration investment may face so that they can be included in the business case. It also discusses how to plan to control, or minimize, the risk associated with implementing a data integration investment. Key activities are:
- Identify the risks
- Characterize the risks in terms of impact, likelihood of occurrence, and interdependence
- Prioritize the risks to determine which need the most immediate attention
- Devise an approach to assume, avoid or control the risks

The purpose of risk assessment and management is to determine and resolve threats to the successful achievement of investment objectives, and especially to the benefits identified in the business case. The assessment and management of risk are ongoing processes that continue throughout the duration of an integration implementation and are used to make decisions about the project implementation. The first decision faced by an integration investment option is whether to proceed. The better the risks are understood and planned for when this decision is made, the more reliable the decision and the better the chances of success. The method underlying most risk assessment and management approaches can be summarized by the following five-step process:
1. Identify the risks facing the project
2. Characterize the risks in terms of impact, likelihood of occurrence, and interdependence
3. Prioritize the risks to determine which need the most immediate attention
4. Devise an approach to assume, avoid or control the risks
5. Monitor the risks

All but the last of these can and should be undertaken as part of the business-case analysis conducted prior to the decision to proceed.

TIP
A group can assess risk more thoroughly than an individual. Do not use the unreliable practice of discounting the expected net gains and then assuming that the remainder is safe.

Not all risks are created equal. For each risk identified, characterize the degree of risk in terms of:
- its impact on the project (e.g., slight delay or show-stopper)
- the probability of its occurrence (e.g., from very unlikely to very likely)
- its relationship to other risks (e.g., poor data quality can lead to problems with data mapping)

Once the risks have been identified and characterized, they can be ranked in order of priority to determine which should be tackled first. Priority should be based on a combination of an event's impact, likelihood, and interdependence. For example, risks that have a severe impact and are very likely to occur should be dealt with first, before they trigger additional risks. You can assign priorities to risk factors by assigning a weight to each risk for each of the three characteristics (i.e., impact, likelihood and interdependence) and multiplying the three values to create a composite score; the risk with the highest score gets the highest priority (see the sketch later in this step).

TIP
A general rule of thumb is to develop a risk mitigation plan only for the top five risks, based on the rationale that a) there is no point focusing on lower-priority risks if the major ones are not addressed, and b) due to limited management attention it is not feasible to tackle too many at once. After you have mitigated some of the highest-priority risks, re-evaluate the list on an ongoing basis and focus again on the top five.

Three main types of risks arise in IT projects:
- Lack of control. Risks of this type arise from a project team's lack of control over the probability of occurrence of an event and/or its consequences. For example, the risks related to senior managers' decisions are often a result of the lack of control a project team has over senior managers.
- Lack of information. Risks of this type arise from a project team's lack of information regarding the probability of occurrence of an event or its consequences. For example, risks related to the use of new technologies are often the result of a lack of information about the potential or performance of these technologies.
- Lack of time. Risks of this type arise from a project team's inability to find the time to identify the risks associated with the project or a given course of action, or to assess the probability of occurrence of an event or the impact of its consequences.

There are three main types of responses to risk in data integration projects, listed here in ascending order of their potential to reduce risk:
- Assume. In this type of response, a department accepts the risk and does not take action to prevent an event's occurrence or to mitigate its impact.
- Control. In this type of response, a department takes no action to reduce the probability of occurrence of an event but, upon occurrence, attempts to mitigate its impact.
- Avoid. In this type of response, a department takes action prior to the occurrence of an event in order either to reduce its probability of occurrence or to mitigate its impact.

Selection of a type of response depends on the priority assigned to a risk, its nature (i.e., whether it is amenable to control or avoidance), and the resources available to the project. In general, the higher the priority of a risk, the more vigorous the type of response applied.

TIP
Do not avoid or "hide" risks. The credibility of the business case will be enhanced by clearly identifying risks and developing a mitigation strategy to address them.
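The composite scoring approach described earlier in this step can be expressed as a short sketch. The risk names, the 1-to-5 scale, and all weights below are hypothetical examples, not a prescribed scale.

# Minimal sketch: weight each risk for impact, likelihood and interdependence
# (here on an assumed 1-5 scale), multiply the three values into a composite
# score, and focus mitigation on the top five.

risks = [
    # (name, impact, likelihood, interdependence) -- all values hypothetical
    ("Poor source data quality",      5, 4, 4),
    ("Loss of executive sponsorship", 5, 2, 3),
    ("New technology underperforms",  3, 3, 2),
    ("Key staff turnover",            4, 3, 3),
    ("Scope creep",                   3, 4, 2),
    ("Vendor delivery delays",        2, 3, 1),
]

scored = sorted(
    ((impact * likelihood * interdep, name) for name, impact, likelihood, interdep in risks),
    reverse=True,
)

print("Top five risks to mitigate first:")
for score, name in scored[:5]:
    print(f"  {score:3d}  {name}")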
TIP
Do not assume that maintaining the status quo constitutes minimum risk. What is changing in the external environment, and in the behavior of your customers, suppliers and competitors?
Step 7: Package the Case

Assemble the business case documentation and package it for consumption by the targeted stakeholders. Key activities in this step include:
- Identify the audience
- Prepare the contents of the report
- Package the case in different formats to make a compelling and engaging presentation:
  - Descriptive graphics
  - Animation or simulation of process or data quality issues
  - Case studies and anecdotes
  - Comparative financial data
  - Customer testimonials

The Proposal Writing best practice contains further details and advice for developing persuasive proposals.

TIP
The way the business case is presented can make a significant difference; sometimes format may be more important than content. Graphics go a long way, and it is worth experimenting with many different graphic techniques to find the one that works.
Step 8: Present the Case

The best analysis and documentation will be useless unless the decision-makers buy in and give the necessary approvals. Step 8 provides suggestions to help ensure that your recommendations get a fair hearing. Use both formal and informal methods to present the proposal successfully. Key activities in this step include:
- Find the right sponsor(s)
- Leverage individual and group presentations
- Promote the case and build cross-functional buy-in
- Model, pilot and test-market the proposed solution

Find a sponsor who can galvanize support for the business case, as well as for the subsequent implementation. The sponsor should be a person in a senior management position; avoid starting a project without a sponsor. The investment proposal will compete for the attention of decision-makers and the organization as a whole. This attention is crucial, and Informatica can be a vital element in helping enterprises lobby the project decision-makers throughout the lifecycle of the decision process. Consequently, the proposal itself must be promoted, marketed, and sold. Market your proposal with an eye toward the enterprise culture and the target audience. Word of mouth is an important, but often overlooked, way of delivering limited information to a finite group.

A business case must convince decision-makers that the analysis, conclusions, and recommendations are valid. To do this, use theoretical or practical models, pilot projects, and test marketing. Remember, seeing is believing, and a demonstration is worth a thousand pictures. Furthermore, the model or pilot allows the assessment of any ongoing changes to the environment or to the assumptions. One can then answer the "what if" scenarios that inevitably surface during the decision-making process. At the same time, there is a basis for re-assessment and revision of the investments. Leverage the 30-3-30-3 technique to prepare the appropriate message; see Appendix A at the end of this document for details.
TIP
Remember that presentations are never about simply communicating information – they are about initiating some action. The presentation is not about what you want to tell your audience, but what they need to know in order to take action. Prepare for your presentation with these questions: What action do I want the stakeholders to take? What questions will they have, and what information will they need, in order to take the desired action?
Step 9: Review Results

Step 9 outlines a process for conducting ongoing reviews during the lifecycle of the data integration project. Be realistic in your assessment of the feedback from the preceding stage. Whatever the difficulties encountered in the decision-making process, follow up after the decision to review the ongoing validity of the investment and to reinforce support. Key activities in this step include:
- Plan for scheduled and unscheduled reviews
- Develop a stakeholder communication plan
- Initiate key metrics tracking and reporting

Reviews help to verify that the IT investment decision remains valid, and that all costs and benefits resulting from that decision are understood, controlled, and realized. The investment analysis contained in the business case defines the goals of the implementation project and serves as a standard against which to measure the project's prospects for success at review time. The following types of reviews can be conducted:
- Independent reviews. These are conducted by an independent party at major checkpoints to identify environmental changes, overruns of time and cost targets, or other problems.
- Internal peer reviews. The object of the peer review is for the group to verify that the project is still on course and to provide expert advice, counsel and assistance to the project manager. In this way, the combined skills and experience of internal staff are applied to the project.
- External peer reviews. ICCs may also draw upon similar people in other departments or organizations to provide a different perspective and to bring a wide range of expertise to bear on project strategies, plans and issues.
- Project team sanity checks. Another source of early warning for project problems is the project team members. These people are the most intimately aware of difficulties or planned activities that may pose particular challenges.
- Oversight reviews. These reviews, under a senior steering committee, should be planned to take place at each checkpoint to reconfirm that the project is aligned with ICC priorities and directions and to advise senior management on project progress.
- Investment reviews. The enterprise auditor can also review the performance of projects and, upon completion, the performance of the investment. At an investment review, the auditor reviews and verifies the effect of the investment to ascertain that the investment was justified.

The reviews should start as soon as money is spent on the investment. Major project reviews should be scheduled to coincide with the release of funds allocated to the project. In this approach, the project sponsor releases only the funds needed to reach the next scheduled review. The performance of the project is reviewed at each scheduled checkpoint or when the released funds run out. After review, departmental management can decide to proceed with the project as planned, modify the project or its funding, or even terminate the project, limiting the loss to the amount previously released. Investment reviews can be scheduled to coincide with project reviews during investment implementation. The first investment review should be conducted no later than the midpoint of the project schedule, when the deliverables are under development. The second should be conducted after the end of the implementation project, when the deliverables have just started to be used in production. A final review should be conducted after the investment has been in production for between six months and a year.
The exact dates for these reviews should, ideally, be determined by the timing of the investment deliverables. This timing should be clearly indicated in the investment plan. The approved investment analysis should form the basis for the criteria used in all reviews. The project schedule of deliverables, based on the investment analysis, establishes the timing criteria for project reviews. After each review, the sponsor should state whether the investment will stop or continue. An investment may be stopped, even temporarily, for any of the following reasons:
- There is no agreement on how to conduct the review.
- The review showed that most of the expected results were not achieved.
- There were changes to the approved investment analysis, and it was not clear that the enterprise was made aware of the full implications of the changes.
- Changes to the approved investment analysis were accepted, but there was no additional funding for the changes or the enterprise had not accepted the new risks.

For the final investment review, the enterprise should demonstrate to the auditor that the investment achieved the expected results, and the auditor should report on the investment's level of success.

TIP
Leverage techniques to maintain effective communications and avoid diffusion of organizational support. Keeping the business case as a "live" document allows updates and course corrections to reflect changing priorities and market pressures. Follow through to build credibility for the next project.
Appendix A: 30-3-30-3 for Presenting the Business Case

30 Seconds (e.g., "elevator speech")
- Purpose of the session: Generate curiosity
- Focus of the session: Future oriented and focus on the positive
- You want the audience to think what? Your enthusiasm and passion for data integration
- Message: Simple and high level; establish connections or relationships
- Audience action desired: Request for additional information regarding integration and your initiative

3 Minutes (e.g., status report)
- Purpose of the session: Describe status
- Focus of the session: Issues, concerns, success stories
- You want the audience to think what? How much you have achieved with little or no funding
- Message: Segmented into the layers; simple and straightforward
- Audience action desired: Support for data integration and the ICC

30 Minutes (e.g., review session)
- Purpose of the session: Educate on value
- Focus of the session: Current state status and value provided to the business and technology users
- You want the audience to think what? Data integration is valuable but not easy
- Message: Points of integration, how data quality impacts the business and customers
- Audience action desired: Understand the value as well as the utility of data integration

3 Hours (e.g., conference)
- Purpose of the session: Collaboration
- Focus of the session: Whole picture, cover all aspects of integration; leave no stone unturned
- You want the audience to think what? ICC activities are integrated into all aspects of the project lifecycle
- Message: Detailed definitions, examples of value, stress the importance of growth
- Audience action desired: Agreement and consensus
Adapted from R. Todd Stephens, 2005. Used by permission.
Appendix B – Case Studies

This section provides two ICC case studies based on real-world examples. The case studies have been disguised to allow them to be as specific as possible about the details.
Case Study 1: Shared Services ICC—an Executive Vision Investment Strategy

In case study 1, several senior IT executives of GENCO had a strong belief that the organization would benefit from a shared services integration team. An ICC was established, including a team of software developers, with the expectation that the group would develop some highly reusable software and would recover most of the staff costs by charging their time out to projects. After almost one year, it became clear that the line-of-business (LOB) project teams were not accepting the ICC, so some changes were made to the team to turn it around.

The turnaround began with the introduction of a new ICC director and the development of a business case: specifically, a financial justification to create a framework of reusable components such that the traditional data integration effort could be transformed from a custom development effort into a more efficient assembly process. The business case took three months to develop and resulted in approval of an investment of $3 million.

The underlying premise of the business case was simple. GENCO was building more than 100 batch and real-time interfaces per year at an average cost of $30,000 per interface and an average development time of 30 days. And because there were no enterprise-wide standards, each interface was a "work of art" that presented challenges to maintain and support. The proposal was to invest $3 million to produce a standard framework to reduce the cost per interface to $10,000 and shorten the development lifecycle to 10 days; the hard savings would be $2 million in the first year, plus soft benefits of reducing the time-to-market window and standardizing the integration software to reduce software maintenance.

While the business case was compelling, it was not easy to come up with the numbers. For example, some project teams did not want to share any information about their cost or time to build integrations. On the surface, their excuse was that they didn't have metrics and their staff were all so busy that they didn't have time to do analysis on past projects. The underlying reason may have been that they didn't believe in the value of an ICC and were fearful of losing control over some aspect of their work to a centralized group. In another example, data from a large project was uncovered that showed that the average cost to build an interface was $50,000, but the IT executive in charge refused to acknowledge the numbers on the basis that it was an "apples to oranges" comparison and that the numbers therefore weren't relevant (the real reason may have been more political). In the end, it required a negotiation with the executive to agree on the $30,000 baseline metric. Although the actual baseline cost was higher, the negotiated baseline was still sufficient to make a strong business case.

The $3-million investment was approved even though only one-third of it was needed to fund the reusable software. The rest was used for implementing a metadata repository and a semi-automated process to effectively manage and control the development of interfaces by a distributed team (including team members in India), educating and training the LOB project teams on how to use the new capability, and creating a subsidy to allow the ICC to sell the initial integration project work at a lower rate than the actual cost in the first few months until the cost efficiencies took hold. Note that the funding request did not split the $3 million into the various components.
It used the quantifiable cost reduction opportunity that had significant hard benefits to justify a broader investment, which included elements that were more difficult to quantify and justify. What were the results? In the 18 months after the business case was approved, the ICC delivered 300 integrations in line with the projected cost reductions, which meant that the financial results significantly exceeded the approved plan. Furthermore, a typical integration was being built in five days or less, which also exceeded the time-to-market goal. The ICC made an effort to communicate progress on a quarterly basis to the CIO and the executive
team with particular emphasis on the measurable benefits. Finally, the metadata repository to track and report progress of all integration requests was in place with an easy-to-use interface for project managers to have visibility into the process. This turned out to be one of the major factors in breaking down the “not invented here” syndrome by providing transparency to the project teams and following through on delivery commitments. This was another key factor in sustaining cross-functional support after the initial funding approval.
Case Study 2: Integration Hub Consolidation—a "Creating the Wave" Investment Strategy

A CIO was once heard to exclaim, "I have a billion-dollar budget and no money to spend." This wasn't the CIO of BIGCO (the pseudonym for this case study), but it could have been. The problem at BIGCO was that an ever-increasing portion of the annual IT budget was being spent just to keep the lights on for items such as ongoing maintenance of applications, regulatory changes demanded by the federal government, disaster recovery capabilities mandated by the board, and ongoing operations. One of the biggest perceived drivers of this trend was unnecessary complexity in the IT environment. Clearly, some amount of complexity is necessary in a modern IT environment due to the inherent intricacy of a multinational business operating in many legal jurisdictions, with millions of customers, 100,000-plus employees, hundreds of products, and dozens of channels for customers and suppliers to interact. However, a tremendous amount of unnecessary complexity at BIGCO was self-imposed by past practices such as acquiring other companies without fully consolidating the systems, implementation of application systems in silos resulting in duplicate and overlapping data and functions across the enterprise, lack of governance resulting in incremental growth of systems to address only tactical needs, and integration as an afterthought without an enterprise standard framework.

No one at BIGCO disagreed with the problem, all the way from the CEO (who discussed it regularly in public forums) to the CIO to the software developers. Metaphorically, much of the low-hanging fruit had already been picked, but the really "juicy" fruit was still at the top of the tree. It was hard to pick because of the challenges mentioned at the introduction of this paper. This case explores how these challenges were addressed in a specific scenario: consolidating 30 legacy integration systems and transforming them into an efficient enterprise hub using the latest technologies. The 30 systems had been built up incrementally over 10 years through thousands of projects without a master architectural blueprint. Each change was rational on its own, but the result had multiple instances of middleware in a complex integration situation that clearly cost too much to maintain, was difficult to change, and was susceptible to chaotic behavior in day-to-day operations.

A lot of money was at stake in this case. The 30 systems had an annual run-rate operating cost of $50 million, and an initial back-of-the-envelope analysis showed that it could be cut in half. While there was some top-down executive support, much broader cross-organizational support was necessary, so the ICC team decided to use the "Creating the Wave" strategy. The first step was to build a business case. This turned out to be a 6-month exercise involving a core team of four staff members, who engaged more than 100 stakeholders from multiple functions across the enterprise. They started out by gathering 18 months of historical cost information about each of the 30 systems. Some stakeholders didn't think 18 months was sufficient, so the team went to three years of history and, for many of the systems, eventually tracked down five years of history. At the core of the business case, the ICC team wanted to show what would happen to the $50-million run-rate cost over the next three years under the status quo scenario and compare it to the run-rate cost in a simplified environment.
They used MS Excel to construct the financial business model. It started as a couple of worksheets that grew over time. The final version was 13MB and comprised 48 worksheets showing five years of history and three years of projections for various scenarios, plus month-by-month project costs for two years. All of it was sliced and diced to show various views for different organizational groups.

What were the results of this case study? The final business model showed that an investment of $20 million would result in a net ongoing operational saving of $25 million per year. The gross savings relative to the baseline cost of $50 million per year were actually projected to be $30 million, but because the project was also introducing new capabilities for building an enterprise hub, those new capabilities were projected to add $5 million per year to the run-rate operating cost. The net savings were $25 million annually. The lesson here, once again, is to include some hard-to-justify elements in a larger project that can be justified.
Appendix C – Business Value Analysis

1. ASK, using the table below for ideas/examples:
- What is the business goal of this project? Is this relevant? For example, is the business goal of this project to...?
- What are the business metrics or KPIs associated with this goal? How will the business measure the success of this project? Are any of these examples relevant?

2. PROBE: If the business sponsor needs more help understanding how data impacts business value, use these example projects and data capabilities to probe:
- These data integration projects are often associated with this business goal. Is this data integration project being driven by this business goal?
- How does data accessibility affect the business? Does having access to all your data improve the business? Do these examples resonate?
- How does data availability affect the business? Does having data available when it's needed improve the business? Do these examples resonate?
- How does data quality affect the business? Does having good data quality improve the business? Do these examples resonate?
- How does data consistency affect the business? Does having consistent data improve the business? Do these examples resonate?
- How does data auditability affect the business? Does having an audit trail on your data improve the business? Do these examples resonate?
- How does data security affect the business? Does ensuring secure data access improve the business? Do these examples resonate?

3. DOCUMENT the key metrics and estimated impact based on the sponsor's input:
- The key business metrics relevant to this project
- The sponsor's estimated impact on that metric (e.g., increase cross-sell rate from 3% to 5%)
- The estimated dollar value of that impact (the sponsor must provide this estimate based on their own calculations)

The following lists each business value category with an explanation, typical metrics, and data integration examples.
A. INCREASE REVENUE

New Customer Acquisition
- Explanation: Lower the costs of acquiring new customers
- Typical Metrics: cost per new customer acquisition; cost per lead; # new customers acquired/month per sales rep or per office/store
- Data Integration Examples: marketing analytics; customer data quality improvement; integration of 3rd-party data (from credit bureaus, directory services, salesforce.com, etc.)

Cross-Sell / Up-Sell
- Explanation: Increase penetration and sales within existing customers
- Typical Metrics: % cross-sell rate; # products/customer; % share of wallet; customer lifetime value
- Data Integration Examples: single view of customer across all products and channels; marketing analytics & customer segmentation; customer lifetime value analysis

Sales and Channel Management
- Explanation: Increase sales productivity and improve visibility into demand
- Typical Metrics: sales per rep or per employee; close rate; revenue per transaction
- Data Integration Examples: sales/agent productivity dashboard; sales & demand analytics; customer master data integration; demand chain synchronization

New Product / Service Delivery
- Explanation: Accelerate new product/service introductions and improve the "hit rate" of new offerings
- Typical Metrics: # new products launched/year; new product/service launch time; new product/service adoption rate
- Data Integration Examples: data sharing across design, development, production and marketing/sales teams; data sharing with 3rd parties, e.g., contract manufacturers, channels, marketing agencies

Pricing / Promotions
- Explanation: Set pricing and promotions to stimulate demand while improving margins
- Typical Metrics: margins; profitability per segment; cost-per-impression, cost-per-action
- Data Integration Examples: cross-geography/cross-channel pricing visibility; differential pricing analysis and tracking; promotions effectiveness analysis

B. LOWER COSTS

Supply Chain Management
- Explanation: Lower procurement costs, increase supply chain visibility, and improve inventory management
- Typical Metrics: purchasing discounts; inventory turns; quote-to-cash cycle time; demand forecast accuracy
- Data Integration Examples: product master data integration; demand analysis; cross-supplier purchasing history

Production & Service Delivery
- Explanation: Lower the costs to manufacture products and/or deliver services
- Typical Metrics: production cycle times; cost per unit (product); cost per transaction (service); straight-through-processing rate
- Data Integration Examples: cross-enterprise inventory rollup; scheduling and production synchronization

Logistics & Distribution
- Explanation: Lower distribution costs and improve visibility into the distribution chain
- Typical Metrics: distribution costs per unit; average delivery times; delivery date reliability
- Data Integration Examples: integration with 3rd-party logistics management and distribution partners

Invoicing, Collections and Fraud Prevention
- Explanation: Improve invoicing and collections efficiency, and detect/prevent fraud
- Typical Metrics: # invoicing errors; DSO (days sales outstanding); % uncollectible; % fraudulent transactions
- Data Integration Examples: invoicing/collections reconciliation; fraud detection

Financial Management
- Explanation: Streamline financial management and reporting
- Typical Metrics: end-of-quarter days to close; financial reporting efficiency; asset utilization rates
- Data Integration Examples: financial data warehouse & reporting; financial reconciliation; asset management & tracking

C. MANAGE RISK

Compliance (e.g., SEC/SOX/Basel II/PCI) Risk
- Explanation: Prevent compliance outages to avoid investigations, penalties, and negative impact on brand
- Typical Metrics: # negative audit/inspection findings; probability of compliance lapse; cost of compliance lapses (fines, recovery costs, lost business); audit/oversight costs
- Data Integration Examples: financial reporting; compliance monitoring & reporting

Financial/Asset Risk Management
- Explanation: Improve risk management of key assets, including financial, commodity, energy or capital assets
- Typical Metrics: errors & omissions; probability of loss; expected loss; safeguard and control costs
- Data Integration Examples: risk management data warehouse; reference data integration; scenario analysis; corporate performance management

Business Continuity / Disaster Recovery Risk
- Explanation: Reduce downtime and lost business, prevent loss of key data, and lower recovery costs
- Typical Metrics: mean time between failure (MTBF); mean time to recover (MTTR); recovery time objective (RTO); recovery point objective (RPO, i.e., data loss)
- Data Integration Examples: resiliency and automatic failover/recovery for all data integration processes
Examples of Key Capabilities by Data Attribute

A. INCREASE REVENUE

New Customer Acquisition
- Accessibility: Cross-firewall access to third-party customer data, e.g., credit bureaus, address directories, list brokers, etc.
- Availability: Accelerated delivery of sales lead data to appropriate channels
- Quality: Customer targeting and onboarding based on accurate customer/prospect/market data
- Consistency: Sharing of correlated customer data with sales and third-party channels to reduce channel conflict and duplication
- Auditability: Predict the impact of changes, e.g., switching credit bureaus or implementing a new marketing system
- Security: Secure access to valuable customer lead, financial and other information

Cross-Sell / Up-Sell
- Accessibility: Opportunity identification with integrated access to CRM, SFA, ERP and others
- Availability: Real-time customer analytics enabling tailored cross-selling at customer touch points
- Quality: Accurate, complete, de-duplicated customer data to create a single view
- Consistency: Single view of customer reconciling differences in business definitions & structures across groups
- Auditability: Improve governance of customer master data by maintaining visibility into definition of and changes to data
- Security: Customer data privacy and security assurance to protect customers and comply with regulations

Sales and Channel Management
- Accessibility: Incorporation of revenue data, internal or external SFA data, and data in forecast spreadsheets
- Availability: Continuous availability of lead, pipeline and revenue data to sales, partners and channels
- Quality: Completeness and validity of sales activity, pipeline and demand data
- Consistency: Alignment of channel/sales incentives based on consistent sales productivity data
- Auditability: Provide traceability for demand and revenue reports through data lineage
- Security: Secure access for partners/distributors to share sensitive demand and revenue information

New Product / Service Delivery
- Accessibility: Access to data in both applications/systems as well as design documents
- Availability: Distributed, round-the-clock environment for collaborative data sharing
- Quality: Accurate, de-duplicated product design and development data across functional and geographical boundaries
- Consistency: Consistent application of product and service definitions and descriptions across functions and with partners
- Auditability: Ensuring compliance with product regulations through version control and view of lineage
- Security: Improved collaboration on prototyping, testing and piloting through secure data sharing

Pricing / Promotions
- Accessibility: Holistic pricing management based on data from applications, including pricing spreadsheets
- Availability: Real-time pricing data to enable constant monitoring & on-the-fly pricing adjustments based on demand
- Quality: Complete, accurate product pricing and discount/profitability data
- Consistency: Global/cross-functional reconciliation of pricing and promotions data
- Auditability: Rationalization of pricing and improved record keeping for price changes
- Security: Segregation of differential pricing and promotions data for different customers, channels, etc.

B. LOWER COSTS

Supply Chain Management
- Accessibility: Access to EDI and unstructured data (typically in Excel/Word) from suppliers/distributors
- Availability: Real-time supply chain management, aligned with just-in-time production models
- Quality: De-duplicated, complete view of products and materials data to improve supply chain efficiency
- Consistency: Reconciled view of purchases across all suppliers to improve purchasing effectiveness & negotiation stance
- Auditability: Improve governance of product master data by maintaining visibility into definition of and changes to data
- Security: Encrypted data exchange with extended network of suppliers/distributors

Production & Service Delivery
- Accessibility: Integrated access to EDI, MRP, SCM and other data
- Availability: Near real-time production and transaction data to streamline operations
- Quality: Improved planning and product management based on accurate materials, inventory and order data
- Consistency: Reconciled product and materials master data to ensure accurate inventory and production planning
- Auditability: Ensure compliance with production regulations through version control and view of lineage
- Security: Role-based access to critical operational data, based on business need

Logistics & Distribution
- Accessibility: Bi-directional integration of data with 3rd-party logistics and distribution partners
- Availability: Availability of order status and delivery data on a real-time, as-needed basis
- Quality: Reduction in logistics and distribution errors with accurate, validated data
- Consistency: Consistent definition across the extended ecosystem of key data such as ship-to and delivery information
- Auditability: Predict the impact of changes, e.g., flagging dependencies on a 3rd-party provider's data
- Security: Encrypted data exchange with extended network of logistics partners, distributors and customers

Invoicing, Collections and Fraud Prevention
- Accessibility: Integration of historical customer data with third-party data to detect suspect transactions and prevent fraud
- Availability: Hourly or daily availability of reconciled invoicing and payments data
- Quality: Reduced errors in invoicing/billing to accelerate collections
- Consistency: Reconciliation of purchase orders to invoices to payments across geographies and organizational hierarchies
- Auditability: Detection/prevention of inefficiencies or fraud through dashboards and alerts
- Security: Secure customer access to billing and payment data

Financial Management
- Accessibility: Incorporation of spreadsheet data with data from financial management and reporting systems
- Availability: On-demand availability of financial management data to business users
- Quality: Improved fidelity in financial management with accurate, complete financial data
- Consistency: Consistent interpretation of chart of accounts across all functions and geographies
- Auditability: Built-in audit trail on financial reporting data to ensure transparency and regulatory compliance
- Security: Segregated, secure access to sensitive financial data

C. MANAGE RISK

Compliance (e.g., SEC/SOX/Basel II/PCI) Risk
- Accessibility: Leverage compliance metrics tracked in spreadsheets, along with system-based data
- Availability: On-demand, continuous availability of reporting and monitoring systems
- Quality: Proactive reduction of data conformity and accuracy issues through scoring and monitoring
- Consistency: Reconcile data being reported across groups and functions to ensure consistency
- Auditability: Ensuring compliance on data integrity through version control and data lineage
- Security: Secured, encrypted reporting data available to authorized, designated personnel

Financial & Asset Risk Management
- Accessibility: Integrate financial and risk management systems data with spreadsheet-based data
- Availability: Real-time availability of key financial and risk indicators for ongoing monitoring & prevention
- Quality: Continuous data quality monitoring to maintain fidelity of financial data
- Consistency: Validation of correlated data for financial reporting and risk management
- Auditability: Visualize data relationships and dependencies at both business and IT level
- Security: Ease management oversight through access records and granular privilege management

Business Continuity / Disaster Recovery Risk
- Accessibility: Access to off-premise data to support secondary/backup systems
- Availability: High availability and automatic failover/recovery to prevent or minimize downtime
- Quality: Updated, de-duplicated data reduces and simplifies data storage and management requirements
- Consistency: Synchronized data across primary and backup systems
- Auditability: Impact and dependency analysis across multiple applications and systems for continuity and recovery planning
- Security: Secure cross-firewall access for operations from secondary data centers
Last updated: 03-Jun-08 16:03
Canonical Data Modeling

Challenge

A challenge faced by most large corporations is achieving efficient information exchange in a heterogeneous environment. The typical large enterprise has hundreds of applications that serve as systems of record for information and that were developed independently based on incompatible data models -- yet they must share information efficiently and accurately in order to effectively support the business and create positive customer experiences. The key issue is one of scale and complexity that is not typically evident in small to medium-sized organizations. The problem arises when there are a large number of application interactions in a constantly changing application portfolio. If these interactions are not designed and managed effectively, they can result in production outages, poor performance, high maintenance costs and lack of business flexibility.

Business-to-Business (B2B) systems often grow organically over time to include systems that an organization builds as well as buys. The importance of canonical data models grows as a system grows. The challenge that canonical data models address is to reduce the number of transformations needed between systems and to reduce the number of interfaces that each system supports. The need for this is usually not obvious when there are only one or two formats in an end-to-end system, but once the system reaches a critical mass in the number of data formats supported (and in the work required to integrate a new system, customer, or document type), one or more canonical models become essential. For example, if a B2B system accepts 20 different inputs, passes that data to legacy systems and generates 40 different outputs, it is apparent that unless the legacy system uses some shared canonical model, introducing a new input type requires modifications to the legacy systems, flow processes, etc. Put simply, if you have 20 different inputs and 40 different outputs, and all outputs can be produced from any input, then you will need 800 different paths unless you transform all inputs to one or more canonical forms and transform all responses from one or more canonical forms to the 40 different outputs. This is a fundamental aspect of how Informatica B2B Data Transformation operates: all inputs are parsed from the original form to XML (not necessarily the same XML schema) and all outputs are serialized from XML to the target output form. The cost of creating canonical models is that they often require design and maintenance involvement from staff across multiple teams.

This best practice describes three canonical techniques which can help to address the issues of data heterogeneity in an environment where application components must share information in order to provide effective business solutions.
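The 20-input/40-output arithmetic above can be made concrete with a minimal sketch; the counts are the ones used in the example, and the helper functions are illustrative only.

# Minimal sketch of the path-count arithmetic: with M input formats and N
# output formats, point-to-point integration needs M x N transformations,
# while routing everything through one canonical form needs only M + N.

def point_to_point(inputs, outputs):
    return inputs * outputs

def via_canonical(inputs, outputs, canonical_forms=1):
    return inputs * canonical_forms + canonical_forms * outputs

M, N = 20, 40
print(f"Point-to-point: {point_to_point(M, N)} transformations")          # 800
print(f"Via one canonical model: {via_canonical(M, N)} transformations")  # 60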
Description

This section introduces three canonical best practices in support of the Velocity methodology and modeling competencies:
1. Canonical Data Modeling
2. Canonical Interchange Modeling
3. Canonical Physical Formats

Canonical techniques are valuable when used appropriately in the right circumstances. The key best practices, which are elaborated in the following sections, are:
- Use canonical data models in business domains where there is a strong emphasis on "building" rather than "buying" application systems.
- Use canonical interchange modeling at build time to analyze and define information exchanges in a heterogeneous application environment.
- Use canonical physical formats at run time in many-to-many or publish/subscribe integration patterns, in particular in the context of a business event architecture.
- Plan for appropriate tools to support analysts and developers.
- Develop a plan to maintain and evolve the canonical models as discrete enterprise components. The ongoing costs to maintain the canonical models can be significant and should be budgeted accordingly.
For a large-scale system-of-systems in a distributed computing environment, the most desirable scenario is to achieve loose coupling and high cohesion, resulting in a solution that is highly reliable, efficient, easy to maintain, and quick to adapt to changing business needs. Canonical techniques can play a significant role in achieving this ideal state. The graphic below outlines how the three canonical techniques generally align with, and enable, the qualities in each of the four Coupling/Cohesion quadrants. Note: There is some overlap between the techniques, since there is no hard black-and-white definition of these techniques and their impact on a specific application.
Each of the three techniques has a “sweet spot”; that is, they can be applied in a way that is extremely effective and provides significant benefits. The application of these methods to a given implementation imparts architectural qualities to the solution. This best practice does not attempt to prescribe which qualities are desirable or not since that is the responsibility of the solutions architect to determine. For example, tight coupling could be a good thing or a bad thing depending on the needs and expectations of the customer. Tight coupling generally results in better response time and network performance in comparison to loose coupling – but it also can have a negative impact on adaptability of components. Furthermore, the three canonical best practices are generally not used in isolation; they are typically used in conjunction with other methods as part of an overall solutions methodology. As a result, it is possible to expand, shrink, or move the “sweet spot” subject to how it is used with other methods. This best practice does not address the full spectrum of dependencies with other methods and their resultant implications, but it does attempt to identify some common pitfalls to be avoided.
Common Pitfalls
Peanut Butter: One pitfall that is pertinent to all three canonical practices is the "Peanut Butter" pattern, which involves spreading the methods uniformly across all situations. To cite a common metaphor, "to a hammer everything looks like a nail." It certainly is possible to drive a screw with a hammer, but it is not pretty and not ideal. When, and exactly how, to apply the canonical best practices should be a conscious, well-considered decision based on a keen understanding of the resulting implications.
Canonical Data Modeling Canonical Data Modeling is a technique for developing and maintaining a logical model of the data required to support the needs of the business for a subject area. Some models may be relevant to an industry supply chain, the enterprise as a whole, or a specific line of business or organizational unit. The intent of this technique is to direct development and maintenance efforts such that the internal data structures of application systems conform to the canonical model as closely as possible. This technique seeks to eliminate heterogeneity by aligning the internal data representation of applications with a common shared model. In an ideal scenario, there would be no need to perform any transformations at all when moving data from one component to another, but for practical reasons this is virtually impossible to achieve at an enterprise scale. Newly built
components are easier to align with the common models, but legacy applications may also be aligned with the common model over time as enhancements and maintenance activities are carried out.
Common Pitfalls

Data model bottleneck: A Canonical Data Model is a centralization strategy that requires an adequate level of ongoing support to maintain and evolve it. If the central support team is not staffed adequately, it will become a bottleneck for changes, which could severely impact agility.

Heavy-Weight Serialized Objects: There are two widely-used techniques for exchanging data in a distributed computing environment -- serialized objects and message transfer. The use of serialized objects can negate the positive benefits of high cohesion if they are used to pass around large, complex objects that are not stable and are subject to frequent changes. The negative impacts include excessive processing capacity consumption, increased network latency and higher project costs through extended integration test cycles.
Canonical Interchange Modeling Canonical Interchange Modeling is a technique for analyzing and designing information exchanges between services that have incompatible underlying data models. This technique is particularly useful for modeling interactions between heterogeneous applications in a many-to-many scenario. The intent of this technique is to make data mapping and transformations transparent at build time. This technique maps data from many components to a common Canonical Data Model which thereby facilitates rapid mapping of data between individual components, since they all have a common reference model.
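A minimal sketch of this build-time idea follows: each application's attributes are mapped once to the canonical model, and the application-to-application field mapping is then derived rather than hand-crafted per pair. All field and attribute names below are hypothetical; this is not the format of any Informatica tool.

# Minimal sketch: derive a source-to-target field mapping via a shared
# canonical reference model defined at build time.

CANONICAL_ATTRS = {"customer_id", "full_name", "postal_code"}

# Per-application mappings to the canonical model (application field -> canonical attribute)
crm_to_canonical     = {"CUST_NO": "customer_id", "NAME": "full_name", "ZIP": "postal_code"}
billing_to_canonical = {"acct_ref": "customer_id", "holder": "full_name", "post_cd": "postal_code"}

def derive_mapping(source_map: dict, target_map: dict) -> dict:
    """Derive a source-field -> target-field mapping via the shared canonical attributes."""
    canonical_to_target = {canon: field for field, canon in target_map.items()}
    return {src_field: canonical_to_target[canon]
            for src_field, canon in source_map.items()
            if canon in canonical_to_target}

print(derive_mapping(crm_to_canonical, billing_to_canonical))
# {'CUST_NO': 'acct_ref', 'NAME': 'holder', 'ZIP': 'post_cd'}

Because every application is mapped to the canonical model once, adding a new application requires one new mapping rather than one per existing partner.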
Common Pitfalls

Mapping with unstructured tools: Mapping data interchanges for many enterprise business processes can be extremely complex. For example, Excel is not sophisticated enough to handle the details in environments with a large number of entities (typically over 500) and with more than two source or target applications. Without adequate tools such as Informatica's Metadata Manager, the manual effort needed to maintain the canonical models and the mappings to dependent applications in a highly dynamic environment can become a major resource drain that is unsustainable and error-prone. Proper tools are needed for complex environments.

Indirection at Run Time: Interchange Modeling is a build-time technique. If the same concept of an intermediate canonical format is applied at run time, it results in extra overhead and a level of indirection that can significantly impact performance and reliability. The negative impacts can become even more severe when used in conjunction with a serialized-object information exchange pattern; that is, large complex objects that need to go through two (or more) conversions when being moved from application A to B (this can become a show-stopper for high-performance real-time applications when SOAP and XML are added to the equation).
Canonical Physical Format Canonical Physical Format prescribes a specific runtime data format and structure for exchanging information. The prescribed generic format may be derived from the Canonical Data Model or may simply be a standard message format that all applications are required to use for certain types of information. The intent of this technique is to eliminate heterogeneity for data in motion by using standard data structures at run-time for all information exchanges. The format is frequently independent of either the source or the target system and requires that all applications in a given interaction transform the data from their internal format to the generic format.
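The sketch below illustrates the Canonical Physical Format idea in generic Python (it is not Informatica-specific): every producer serializes into one shared message structure and every consumer parses from it, so no system needs to know another's native format. The CanonicalCustomer fields and both adapters are hypothetical examples.

# Minimal sketch: a shared run-time structure plus per-system adapters.

from dataclasses import dataclass

@dataclass
class CanonicalCustomer:
    customer_id: str
    full_name: str
    country_code: str  # ISO 3166-1 alpha-2

def from_crm_record(record: dict) -> CanonicalCustomer:
    """Adapter: hypothetical CRM's native row -> canonical format."""
    return CanonicalCustomer(
        customer_id=record["CUST_NO"],
        full_name=f'{record["FIRST"]} {record["LAST"]}',
        country_code=record["CTRY"],
    )

def to_billing_message(customer: CanonicalCustomer) -> dict:
    """Adapter: canonical format -> hypothetical billing system's message."""
    return {"id": customer.customer_id,
            "name": customer.full_name,
            "country": customer.country_code}

crm_row = {"CUST_NO": "C-1001", "FIRST": "Ada", "LAST": "Lovelace", "CTRY": "GB"}
print(to_billing_message(from_crm_record(crm_row)))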
Common Pitfalls

Complex Common Objects: Canonical Physical Formats are particularly useful when simple common objects are exchanged frequently between many service providers and many service consumers. Care should be taken not to use this technique for larger or more complex business objects, since it tends to tightly couple systems, which can lead to longer time to market and increased maintenance costs.

Non-transparent Transformations: Canonical Physical Formats are most effective when the transformations from a component's internal data format to the canonical format are simple and direct, with no semantic impedance mismatch.
Care should be taken to avoid semantic transformations or multiple transformations in an end-to-end service flow. While integration brokers (or ESBs) are a useful technique for loose coupling, they also add a level of indirection which can complicate debugging and run-time problem resolution. The level of complexity can become paralyzing over time if service interactions result in middleware calling middleware with multiple transformations in an end-to-end data flow.

Inadequate Exception Handling: The beauty of a loosely-coupled architecture is that components can change without impacting others. The danger is that in a large-scale distributed computing environment with many components changing dynamically, the overall system-of-systems can assume chaotic (unexpected) behavior. One effective counter-strategy is to ensure that every system that accepts Canonical Physical Formats also includes a manual work queue for any inputs that it cannot interpret. The recommended approach is to make exception handling an integral part of the normal day-to-day operating procedure by pushing each such message/object into a work queue for a human to review and disposition.
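A minimal sketch of that exception-handling recommendation follows: any inbound message that cannot be interpreted as the canonical format is pushed onto a manual work queue for a person to review, instead of being silently dropped. The required field names and the trivial parse rule are hypothetical.

# Minimal sketch: route unparseable canonical inputs to a manual work queue.

from collections import deque

manual_work_queue = deque()

def parse_canonical(message: dict) -> dict:
    required = {"customer_id", "event_type", "payload"}
    missing = required - message.keys()
    if missing:
        raise ValueError(f"missing canonical fields: {sorted(missing)}")
    return message

def receive(message: dict):
    try:
        canonical = parse_canonical(message)
        print("processed", canonical["customer_id"])
    except ValueError as reason:
        # Exception handling as a normal, day-to-day operating procedure:
        manual_work_queue.append({"message": message, "reason": str(reason)})

receive({"customer_id": "C-1001", "event_type": "address_change", "payload": {}})
receive({"event_type": "address_change"})  # lands on the manual work queue
print(f"{len(manual_work_queue)} message(s) awaiting human review")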
Canonical Modeling Methodology
Canonical models may be defined in any number of business functional or process domains at one of four levels:
1. Supply Chain – external inter-company process and data exchange definitions
2. Enterprise – enterprise-wide data definitions (i.e., master data management programs)
3. Organization – specific business area or functional group within the enterprise
4. System – a defined system or system-of-systems
For example, a supply chain canonical model in the mortgage industry is MISMO (Mortgage Industry Standards Maintenance Organization), which publishes an XML message architecture and a data dictionary for:
- Underwriting
- Mortgage insurance application
- Credit reporting
- Flood and title insurance
- Property appraisal
- Loan delivery
- Product and pricing
- Loan Servicing
- Secondary mortgage market investor reporting
The MISMO standards are defined at the Supply Chain level, and companies in this industry may choose to adopt these standards and participate in their evolution. Even if a company doesn’t want to take an active role, it will no doubt need to understand the standards since other companies in the supply chain will send data in these formats and may demand that they receive information according to these standards. A company may also choose to adopt the MISMO standard at the Enterprise level – possibly with some extensions or modifications to suit their internal master data management initiative. Or one business unit, such as the Mortgage business within a financial institution, may adopt the MISMO standards as their canonical information exchange model or data dictionary – again possibly with extensions or modifications. Finally, a specific application system, or collection of systems, may select the MISMO standards as their canonical model – also with some potential changes.
In one of the more complex scenarios, a given company may need to understand and manage an external Supply Chain canonical model, an Enterprise version of the canonical format, one or many Organization versions, and one or many System versions. Furthermore, all of the models are dynamic and change from time to time, which requires careful monitoring and version control. A change at one level may also have a ripple effect and drive changes in other levels (either up or down).
As shown in the figure below, steps 1 through 5 are a one-time effort for each domain while steps 6 through 11 are repeated for each project that intends to leverage the canonical models.
1. Define Scope: Determine the business functional or process domain and the level (Supply Chain, Enterprise, Organization or System) of the canonical modeling effort.
2. Select Tools and Repository: In small-scale or simple domains, tools such as Excel and a source code repository may be adequate. In complex environments with many groups and individuals involved, a more comprehensive structured metadata repository will be needed, with a mechanism for access by a broad range of users.
3. Identify Content Administrator: In small-scale or simple domains, the administration of the canonical models may be a part-time job for a data analyst, metadata administrator, process analyst or developer. In large and complex environments it is often necessary to have a separate administrator for each level and each domain.
4. Define Communities: Each level and each domain should have a defined community of stakeholders. At the core of each community are the canonical administrator, data analysts, process analysts and developers that are directly involved in developing and maintaining the canonical model. A second layer of stakeholders are the individuals that need to understand and apply the canonical models. A third and final layer of stakeholders are individuals such as managers, architects, program managers and business leaders that need to understand the benefits and constraints of canonical models.
5. Establish Governance Process: Define how the canonical models will be developed and changed over time as well as the roles and authorities of the individuals in the defined community. This step also defines the method of communication between individuals, frequency of meetings, versioning process, publishing methods and approval process.
6. Select Canonical Technique: Each project needs to decide which of the three techniques will be used: Canonical Data Modeling, Canonical Interchange Modeling or Canonical Physical Formats. This decision is generally made by the solution architect.
7. Document Sources and Targets: This step involves identifying existing documentation for the systems and information exchanges involved in the project. If the documentation doesn’t exist, in most cases it must be reverse-engineered (unless a given system is being retired).
8. Identify Related Canonicals: This step involves identifying relevant or related canonicals in other domains or at other levels that may already be defined in the enterprise. It is also often worth exploring the data models of some of the large ERP system vendors that are involved in the project, as well as researching which external industry standards may be applicable.
9. Develop Canonical Model: This step involves a) an analysis effort, b) an agreement process to gain consensus across the defined community, and c) a documentation effort to capture the results. The canonical model may be developed either a) top-down based on the expertise and understanding of domain experts, b) bottom-up by rationalizing and normalizing definitions from different systems, or c) by adopting and tailoring existing canonical models.
10. Build Target Scenario: This step is the project effort associated with leveraging the canonical model in the design, construction or operation of the system components. Note that the canonical models may be used only at design time (as in the case of canonical interchange modeling) or also at construction and run-time in the case of the other two canonical techniques.
11. Refresh Metadata & Models: This is a critical step to ensure that any extensions or modifications to the canonical models that were developed during the course of the specific project are documented and captured in the repository, and that other enterprise domains that may exist are aware of the changes in the event that other models may be impacted as well.
Summary
The key best practices are:
- Use canonical data models in business domains where there is a strong emphasis to “build” rather than “buy” application systems.
- Use canonical interchange modeling at build-time to analyze and define information exchanges in a heterogeneous application environment.
- Use canonical physical formats at run-time in many-to-many or publish/subscribe integration patterns, particularly in the context of a business event architecture.
- Plan for appropriate tools such as Informatica Metadata Manager to support analysts and developers.
- Develop a plan to maintain and evolve the canonical models as discrete enterprise components. The ongoing costs to maintain the canonical models can be significant and should be budgeted accordingly.
In summary, there are three defined Canonical Best Practices, each of which has a distinct objective. Each method imparts specific qualities on the resultant implementation which can be compared using a coupling/cohesion matrix. It is the job of the architect and systems integrator to make a conscious decision and select the methods that are most appropriate in a given situation. The methods can be very effective, but they also come with a cost, so care should be taken to acquire appropriate tools and to plan for the ongoing maintenance and support of the canonical artifacts.
Appendix A – Case Study in Applying Canonical Physical Formats in the Insurance Industry
This appendix contains selected elements of Informatica’s approach for implementing the ACORD Life data and messaging standards. ACORD XML for Life, Annuity & Health is a family of specifications for the insurance industry intended to enable real-time, cross-platform business partner message/information sharing. The primary Informatica product that supports this capability is B2B Data Transformation (or DT for short). The core challenge is transforming custom source ACORD Life messages into a common Enterprise canonical format (see the figure below) while achieving key quality metrics such as:
1. Scalability: Ability to support many formats and a high volume of messages
2. Maintainability: Ability to make changes to source, process and product characteristics of existing transformations
3. Extensibility: Ease of adding new source formats to the system
4. Cost effectiveness: Minimizing cost of implementation and ongoing maintenance
Informatica B2B Data Transformation Capability Overview
Core Mapping Capabilities
The B2B Data Transformation core is ideally suited to address the complex XML-to-XML transformations that are required to support most projects. Several key capabilities are described below.
Map Objects
The basic building block for a transformation – a Map object – takes a Source and a Target XPath as an input and moves data
accordingly. Any kind of transformation logic can be associated with the element-level data move through the use of Transformers. For instance, a sequence of Transformers may include data cleansing, a string manipulation, and a lookup.
Groups
Individual Map objects may be combined into Groups. Groups are collections of objects (like Maps and other Groups) that either succeed or fail together. This transactional in-memory behavior is essential for complex transformations like ACORD Life. For instance, if an XML aggregate with a particular id is not found, then all other mappings in the same Group will fail.
Sequences
B2B Data Transformation also provides the ability to handle the most complex XML sequences on both the source and target at once. Specifically, at any given time the transformation may access any source or target construct based on its order or its key. For instance, in the complex transformation example below, the transformation logic can be easily expressed with a few B2B Data Transformation constructs:
- what source Party the extension code “60152” is contained within
- what target Party corresponds to the source Party
- update the Relation object that describes the target Party
The ability to combine such processing logic with direct manipulation of source and target XSD structures is a unique characteristic of B2B Data Transformation. As a result, the logic that is captured in B2B Data Transformation is compact, maintainable, and extensible.
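As a rough illustration of the Map and Group concepts described above (this is not the B2B Data Transformation API; the class names, element names, and transformer functions are invented for the sketch), the following Python code moves a value from a source path to a target path through a chain of transformers, with a Group committing its maps only if all of them succeed.

```python
import copy
import xml.etree.ElementTree as ET

# Chainable "transformers" applied to a value as it moves (hypothetical).
def trim(value): return value.strip()
def upper(value): return value.upper()
def lookup(table): return lambda value: table.get(value, value)

class Map:
    """Moves one value from a source path to a target path, applying transformers."""
    def __init__(self, source_path, target_path, transformers=()):
        self.source_path, self.target_path, self.transformers = source_path, target_path, transformers

    def run(self, source_root, target_root):
        node = source_root.find(self.source_path)
        if node is None or node.text is None:
            raise LookupError(f"source not found: {self.source_path}")
        value = node.text
        for fn in self.transformers:
            value = fn(value)
        # Create the target element path if it does not exist yet.
        current = target_root
        for tag in self.target_path.split("/"):
            nxt = current.find(tag)
            current = nxt if nxt is not None else ET.SubElement(current, tag)
        current.text = value

class Group:
    """A set of Maps that succeed or fail together (in-memory transactional behavior)."""
    def __init__(self, maps):
        self.maps = maps

    def run(self, source_root, target_root):
        trial = copy.deepcopy(target_root)
        for m in self.maps:            # any LookupError aborts the whole group
            m.run(source_root, trial)
        target_root[:] = list(trial)   # commit only if every map succeeded

source = ET.fromstring("<Policy><Holder> jane doe </Holder><Gender>F</Gender></Policy>")
target = ET.Element("Canonical")
Group([
    Map("Holder", "Party/FullName", transformers=(trim, upper)),
    Map("Gender", "Party/GenderCode", transformers=(lookup({"F": "Female", "M": "Male"}),)),
]).run(source, target)
print(ET.tostring(target, encoding="unicode"))
# <Canonical><Party><FullName>JANE DOE</FullName><GenderCode>Female</GenderCode></Party></Canonical>
```

If either map fails, the staging copy is discarded and the target is left untouched, which mirrors the all-or-nothing Group behavior described above.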
B2B Data Transformation Development Environment
B2B Data Transformation provides a visual interactive Studio environment that simplifies and shortens the development process.
In the Studio, transformations are developed incrementally with continuous visual feedback simulating how the transformation logic would be applied to the actual data.
Specification Driven Transformations
The Informatica Analyst extends the Studio IDE to further accelerate the process of documenting and creating complex XML-to-XML transformations. The Analyst is an Excel spreadsheet-based tool that provides a drag-and-drop environment for defining a complex transformation between two XML schemas. Once the transformation “specification” is defined, the actual transformation run-time is automatically generated from the specification.
The Analyst capabilities may be a key accelerator to bootstrap the development of “Custom Source-to-GBO” transformations and to implement a number of required transformation components.
“AutoMapping” of like elements between schemas (look at the spreadsheet cell W3 in the above figure) is another key feature that may prove to be valuable in this project. AutoMapping creates maps between elements of the same name that are located in similar branches of XML. It also creates 3 important reports:
- non-mapped source elements
- non-mapped target elements
- stub implementation of non-matching types for the mapped elements
These reports form a foundation for documentation, specification creation and implementation of an XML-to-XML transformation. This technology asset is a key component of the overall solution and its use and specific application will be more clearly defined in the component design phase of this project.
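The sketch below only approximates the AutoMapping idea and is not the Analyst’s actual algorithm: it pairs same-named leaf elements that sit in similar branches of two schemas (represented here as flattened element paths, which are hypothetical) and reports the non-mapped source and target elements.

```python
# Rough sketch: match same-named leaf elements in similar branches, report the rest.
def auto_map(source_paths, target_paths):
    maps, unmapped_source = [], []
    remaining_targets = set(target_paths)
    for s in source_paths:
        s_name, s_branch = s.rsplit("/", 1)[-1], s.split("/")[:-1]
        # Candidates share the element name; prefer the most similar branch.
        candidates = [t for t in remaining_targets if t.rsplit("/", 1)[-1] == s_name]
        if candidates:
            best = max(candidates,
                       key=lambda t: len(set(t.split("/")[:-1]) & set(s_branch)))
            maps.append((s, best))
            remaining_targets.remove(best)
        else:
            unmapped_source.append(s)
    return maps, unmapped_source, sorted(remaining_targets)

source = ["OLifE/Party/FullName", "OLifE/Party/GovtID", "OLifE/Holding/Policy/FaceAmt"]
target = ["OLifE/Party/FullName", "OLifE/Party/Address/Line1", "OLifE/Holding/Policy/FaceAmt"]

maps, no_target, no_source = auto_map(source, target)
print(maps)       # name-and-branch matches, e.g. FullName -> FullName
print(no_target)  # non-mapped source elements (GovtID)
print(no_source)  # non-mapped target elements (Address/Line1)
```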
ACORD Versioning Methodology
ACORD Life specifications undergo continuous change. The most significant changes to this model were introduced before version 2.12, but the specification is still evolving. The only reliable mechanism for discovering deltas between two versions of ACORD Life is a side-by-side comparison of the respective XML schemas (XSDs). Informatica has developed and deployed a number of successful solutions in the area of ACORD XML versioning. Specifically, the Analyst tool was created to support ACORD XML versioning-class solutions.
High-Level Transformation Design
For the purpose of technical transformation analysis in this proposal, Informatica introduces a functional breakdown of the end-to-end transformation into finer-grained components. Note: The breakdown and composition order of components may change during the actual delivery of a project.
Descriptions of the individual tasks within the transformation are provided below.
ACORD Versioning
The versioning step provides a mechanism for moving data between two distinct versions of the ACORD Life standard. Due to the complexity of the Life schemas and the minimal revision documentation provided by ACORD, the task of implementing versioning is complex and time-consuming. Informatica tools and techniques will allow implementing generic versioning transformations rapidly and reliably (see prior sections).
Structural Changes
Life models are flexible enough to allow the same information to be described in multiple ways. For instance, the same data, such as a policy face value, may be placed in various valid locations between the source and target formats.
In addition, custom ACORD implementations may also alter the structure of data. For example, the format below describes a “primary insured” role of the person in the Person aggregate itself rather than in a respective Relation aggregate.
These changes need to be accommodated by the transformation. Another type of structural change results from a certain intended style or layout of the message that is maintained by a target format. For instance, the target format lays out Parties data in a different way than the source.
Structural changes lead to another significant processing step that is required for transformations – maintaining referential integrity. As important aggregates (like Parties and Holdings) are reformatted and re-labeled, the Relation objects need to reflect the new, modified aggregate IDs in a way that is consistent with the original Relations.
Vendor Extension Processing
Each source format contains custom extensions that need to be transformed into the target format. Sometimes the extension processing is as simple as a straight element-to-element mapping shown below.
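The original figures are not reproduced here. As a rough illustration only (hypothetical element names loosely modeled on the Party/Relation structure, not actual B2B Data Transformation code), the sketch below re-labels Party aggregates, keeps the Relation references consistent with the new IDs, and performs a straight element-to-element extension mapping.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring("""
<OLifE>
  <Party id="P1"><FullName>Jane Doe</FullName>
    <OLifeExtension><VendorRoleCode>60152</VendorRoleCode></OLifeExtension>
  </Party>
  <Party id="P2"><FullName>John Doe</FullName></Party>
  <Relation OriginatingObjectID="P1" RelatedObjectID="P2" RelationRoleCode="Spouse"/>
</OLifE>""")

target = ET.Element("OLifE")
id_map = {}  # old Party id -> new Party id issued for the target layout

# Re-label Party aggregates in the target and remember the id mapping.
for n, party in enumerate(source.findall("Party"), start=1):
    new_id = f"Party_{n}"
    id_map[party.get("id")] = new_id
    out = ET.SubElement(target, "Party", {"id": new_id})
    ET.SubElement(out, "FullName").text = party.findtext("FullName")
    # Straight element-to-element extension mapping (hypothetical target element).
    role = party.findtext("OLifeExtension/VendorRoleCode")
    if role is not None:
        ET.SubElement(out, "RoleCode").text = role

# Referential integrity: Relation objects must point at the new, re-labeled ids.
for rel in source.findall("Relation"):
    ET.SubElement(target, "Relation", {
        "OriginatingObjectID": id_map[rel.get("OriginatingObjectID")],
        "RelatedObjectID": id_map[rel.get("RelatedObjectID")],
        "RelationRoleCode": rel.get("RelationRoleCode"),
    })

print(ET.tostring(target, encoding="unicode"))
```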
In other scenarios it may be more complex and involve additional changes to multiple target elements and aggregates.
Code Lists
Another typical aspect of a transformation deals with content substitution, primarily with code lists. For instance, product codes may differ between the formats.
Enrichment
The final transformation step is data enrichment. Enrichment varies from inserting new system information into the XML message …
… to analyzing data patterns, content or business rules to derive new elements, like SSN type code below.
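A minimal sketch of these two steps follows (the code-list table, element names, and the SSN pattern rule are assumptions for illustration, not part of any ACORD specification): product codes are substituted via a lookup table, and new elements are derived by inserting system information and inspecting data patterns.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical code-list conversion table: source product codes -> target product codes.
PRODUCT_CODE_MAP = {"TRM10": "TermLife10", "UL": "UniversalLife"}

def substitute_codes(root):
    for elem in root.iter("ProductCode"):
        elem.text = PRODUCT_CODE_MAP.get(elem.text, elem.text)

def enrich(root):
    # Enrichment 1: insert new system information into the message (assumed element name).
    ET.SubElement(root, "SourceSystem").text = "LEGACY_ADMIN"
    # Enrichment 2: derive a new element from a data pattern, e.g. a GovtID that
    # looks like nnn-nn-nnnn is tagged with a type code (assumed value).
    for party in root.iter("Party"):
        govt_id = party.findtext("GovtID")
        if govt_id and re.fullmatch(r"\d{3}-\d{2}-\d{4}", govt_id):
            ET.SubElement(party, "GovtIDTC").text = "SSN"

msg = ET.fromstring(
    "<OLifE><Holding><ProductCode>TRM10</ProductCode></Holding>"
    "<Party><GovtID>123-45-6789</GovtID></Party></OLifE>")
substitute_codes(msg)
enrich(msg)
print(ET.tostring(msg, encoding="unicode"))
```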
Run-Time Configuration
Run-time configuration of the system is optimized in the following dimensions (see the figure below):
- Separation of implementation and configuration concerns for individual formats
- Reuse of common transformation logic and configuration
Transformations for each one of the source formats are created, maintained and extended in isolation from other sources. Transformation logic and configuration data may be versioned independently for each one of the sources.
Similarly, multiple versions may be created for each of the Common Components. Depending on a choice of a run-time container for the transformations (Java or WPS), the invoking logic would have to maintain
the knowledge of how multiple versions are compatible with each other. Such logic may be driven by a configuration file.
Project Methodology
The following activities are typical in projects associated with using the ACORD data and message standards:
1. Requirement, system, and data analysis
2. Creation of transformation specifications
3. Configuration system design
4. Common component implementation
   a. Schema minimization
   b. ACORD Versioning
   c. Enrichment components
5. Source-specific component implementation
   a. Source schema customization
   b. Transformation development
The rest of this section details each of the activities.
Analysis
In this initial project phase, we thoroughly assess and catalogue existing input and output data samples, schemas, and other available relevant documentation. Based on the assessment, we build a transformation profile outlining the transformation patterns required for the implementation. The patterns include type conversions, common templates, lookup tables, etc.
Specifications
In this phase we customize our generic transformation specification template in accordance with the transformation profile. We then jointly populate the specification with transformation rules.
Configuration Design
The next step is to understand the “AS-IS” configuration system. Then, based on additional requirements, we collaboratively design an approach for a new configuration system. The new system needs to support two flavors of configuration:
- Centralized configuration for system-wide rules and properties; for instance, the addition of a new Product and corresponding Business Rules
- Source-specific configuration for individual source formats; for instance, conversion tables for code-lists specific to the source
As a part of the design process we go through a number of life-cycle use-case scenarios to understand how the new system would be maintained and extended.
Common Components
This project phase includes a number of practical activities that need to be performed before source-specific transformation implementation may begin.
Schema Minimization
A very important step in the implementation process is to effectively minimize the full ACORD Life schema. The process is a combination of manual trimming and specialized tools that remove orphan elements.
ACORD Versioning
Versioning accommodates changes in schema that exist between two ACORD Life versions. For instance, versioning would move data from ACORD LIFE 2.5 to 2.13 in a single transformation. Versioning can be used as a “black-box” reusable transformation if it is applied to a consistent source format. For instance, a GBO Enterprise Life 2.5 to GBO Enterprise Life 2.13 versioning transformation may be used in conjunction with any
transformation that produces GBO Enterprise Life 2.5. However, if versioning is used in the context of a customized source format (like ABC Life 2.5 to GBO Enterprise Life 2.13), it is not reusable and becomes an “embedded” part of the custom source transformation. Techniques for creating a mapping for “black-box” or “embedded” versioning are the same. The exact determination of the approach needs to be made in the context of the project when more information is available.
Enrichment Components
Enrichment logic should be reused by all sources and should be performed once the data is moved into a common format. At this stage we should have determined what the common format is. It is likely a custom version of ACORD Life 2.13 extended with Enterprise-specific elements. However, detailed requirements analysis and design need to be performed to determine the exact format and the enrichment functionality.
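The reuse idea can be sketched as a simple composition of steps (the function names and the second source system below are hypothetical): each source-specific transformation produces the consistent intermediate format, after which the same black-box versioning and enrichment components apply unchanged to every source.

```python
# Sketch only: how reusable common components chain behind source-specific steps.
def abc_life_to_gbo_25(message):        # source-specific: custom ABC Life -> GBO Life 2.5
    return {"format": "GBO-2.5", "payload": message}

def xyz_life_to_gbo_25(message):        # another (hypothetical) source, same intermediate format
    return {"format": "GBO-2.5", "payload": message}

def version_25_to_213(message):         # reusable "black-box" versioning component
    assert message["format"] == "GBO-2.5"
    return {"format": "GBO-2.13", "payload": message["payload"]}

def enrich(message):                    # reusable enrichment component
    message["payload"] = message["payload"] + " +enriched"
    return message

def compose(*steps):
    def pipeline(message):
        for step in steps:
            message = step(message)
        return message
    return pipeline

abc_pipeline = compose(abc_life_to_gbo_25, version_25_to_213, enrich)
xyz_pipeline = compose(xyz_life_to_gbo_25, version_25_to_213, enrich)
print(abc_pipeline("<ABC Life 2.5 message/>"))
```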
Source-specific Implementation
Source Schema Customization
Once an underlying ACORD Life schema for a transformation source format is minimized, it then needs to be customized in order to align with the source data. The goal of this work is to derive a schema against which a source data sample would validate. This includes the addition of custom extensions (OLifeExtension) as well as schema changes to accommodate source deviations from the standard Life schema. One type of change deals with schema content restrictions and data-types. For instance, in the sample below, the source data-type needs to change from “double” to “string” or “integer”.
Another type of schema change is structural, where elements change order, names, place in the hierarchy, etc.
Transformation Development
In this phase we use the customized schema and the transformation specification to implement the transformation. In the process of development, we produce a number of transformation components that can be re-used across multiple source-specific implementations (such as lookup tables, target skeletons, etc.).
Last updated: 29-May-08 16:48
Chargeback Accounting
Challenge
The ICC operates within a larger organization that needs to make periodic decisions about its priorities for allocating limited resources. In this context, an ICC needs not only to secure initial funding but also to demonstrate that continued investment is justified. This Best Practice describes the options for estimating and allocating costs in an Informatica ICC. One of these options is to dispense with a chargeback mechanism and simply use a centrally funded cost center. However, there are additional benefits to a systematic approach to chargeback, not the least of which is the psychological impact on consumers and providers of the service. It also encourages users to understand the organization’s internal financial accounting so that customers and management alike can work within its constraints and understand how it relates to the organizational culture.
Description
The kind of chargeback model that is appropriate varies according to several factors. A simple alignment between the five ICC models and a similar classification of chargeback models would be convenient but, in practice, there are several combinations of chargeback models that may be used with the ICC models. This Best Practice focuses on specific recommendations related to the most common and recommended patterns. This document also introduces an economic framework for evaluating funding alternatives and the organizational behavior that results from them. It includes the following sections:
- Economic Framework
- Chargeback Models
- Alignment with ICC Models
Economic Framework
As the following figure illustrates, the horizontal dimension of the economic framework is the investment category, with strategic demands at one end of the spectrum and tactical demands at the other end. Strategic demands typically involve projects that drive business transformations or process changes and usually have a well-defined business case. Tactical demands are associated with day-to-day operations or keeping the lights on. In the middle of the spectrum, some organizations have an additional category for “infrastructure investments”, that is, project-based funding focused on technology refresh or mandatory compliance-related initiatives. These are projects that are generally considered nondiscretionary and hence appear to be maintenance.
The vertical dimension is the funding source and refers to who pays for the services: the consumer or the provider. In a free market economy, money is used in the exchange of products or services. For internal shared services organizations, rather than exchanging real money, accounting procedures are used to move costs between accounting units. When costs are transferred from an internal service provider to the consumer of the service, it is generally referred to as a chargeback. The following figure shows the economic framework for evaluating funding alternatives.
If we lay these two dimensions out along the X and Y axis with a dividing line in the middle, we end up with these four quadrants:
1. Demand-Based Sourcing: This operating model responds to enterprise needs by scaling its delivery resources in response to fluctuating project demands. It seeks to recover all costs through internal accounting allocations to the projects it supports. The general premise is that the ICC can drive down costs and increase value by modeling itself after external service providers and operating as a competitive entity.
2. Usage-Based Chargeback: This operating model is similar to the Demand-Based Sourcing model but generally focuses on providing services for ongoing IT operations rather than project work. The emphasis, once again, is that the ICC operates like a standalone business that is consumer-centric, market-driven, and constantly improving its processes to remain competitive. While the Demand-Based Sourcing model may have a project-based pricing approach, the Usage-Based model uses utility-based pricing schemes.
3. Enterprise Cost Center: Typically, this operating model is a centrally funded function. This model views the ICC as a relatively stable support function with predictable costs and limited opportunities for process improvements.
4. Capacity-Based Sourcing: This operating model strives to support investment projects using a centrally funded project support function. Centrally-funded ICCs that support projects are an excellent model for implementing practices or changes that project teams may resist. Not charging project teams for central services is one way to encourage their use. The challenge with this model is to staff the group with adequate resources to handle peak workloads and to have enough non-project work to keep the staff busy during non-peak periods.
In general, ICCs that are funded by consumers and are more strategic in nature rely on usage-based chargeback mechanisms. ICCs that are provider-funded and tactical rely on capacity-based sourcing or cost center models.
Chargeback Models
This Best Practice defines the following types of chargeback models:
- Service-Based Pricing
- Fixed Price
- Tiered Flat Rate
- Resource Usage
- Direct Cost
- Cost Allocation
Service-Based Pricing and Fixed Price
These are the most sophisticated of the chargeback models and require that the ICC clearly define its service offerings and structure a pricing model based on defined service levels. Service-based pricing is used for ongoing services while “fixed pricing” is used for incremental investment projects. In other words, both of them are a fixed price for a defined service. This model is most suitable for a mature ICC that has well-defined service offerings and a good cost-estimating model.
Advantages:
- Within reasonable limits, the client’s budget is based on its ability to make an informed decision on purchases from the supplier
- The client transfers the risk of cost over-runs to the ICC
- Incentive for the ICC to control and manage the project and deliver on-time and on-budget
Disadvantages:
- Internal accounting is more complex and there may not be a good mechanism to cover cost over-runs that are not funded by the client
- A given client may pay a higher price if the actual effort is less than expected (note: at an enterprise level this is not a disadvantage since cost over-runs on one project are funded by cost under-runs on other projects)
Tiered Flat Rate
The Tiered Flat Rate is sometimes called a “utility pricing” model. In this model, the consumer pays a flat rate per unit of service, but the flat rate may vary based on the total number of units or some other measure. This model is based on the assumption that there are economies of scale associated with volume of usage and therefore the price should vary based on it.
Advantages:
- Within reasonable limits, the client’s budget is based on its ability to make an informed decision on purchases from the supplier
- The client is encouraged to continue using ICC services as volume increases rather than looking for other sources
Disadvantages:
- May discourage the client from being efficient, thereby reducing usage, since a lower tier may cost more per unit of consumption
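A small worked example of tiered (utility) pricing follows; the tier boundaries and rates are purely illustrative.

```python
# Tiered ("utility") pricing: the per-unit rate drops as total volume crosses tier boundaries.
TIERS = [                  # (units included in this tier, rate per unit) - illustrative only
    (100, 10.00),          # first 100 units at $10.00 each
    (400, 7.50),           # next 400 units at $7.50 each
    (float("inf"), 5.00),  # everything beyond 500 units at $5.00 each
]

def tiered_charge(units):
    charge, remaining = 0.0, units
    for tier_size, rate in TIERS:
        in_tier = min(remaining, tier_size)
        charge += in_tier * rate
        remaining -= in_tier
        if remaining <= 0:
            break
    return charge

for usage in (80, 300, 1200):
    print(usage, tiered_charge(usage))
# 80 -> 800.0; 300 -> 100*10 + 200*7.5 = 2500.0; 1200 -> 1000 + 3000 + 700*5 = 7500.0
```

Note how a consumer near a tier boundary pays a lower average rate by consuming more, which is exactly the behavioral disadvantage called out above.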
Resource Usage
The ‘client’ pays according to resource usage; the following types of resources are available:
- Number of records
- Data volume
- CPU time
Informatica technology can support the collection of metrics for all three types of resource usage. PowerCenter Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter metadata repository. The Repository Manager generates the following views when you create or upgrade a repository:
- REP_WFLOW_RUN – displays the run statistics for all workflows by folder
- REP_SESS_LOG – provides log information about sessions
- REP_SESS_TBL_LOG – contains information about the status of an individual session run against a target
Additionally, Informatica Complex Data Exchange can be used to parse the output from a range of platform and database accounting reports and extract usage metrics. The advantages and disadvantages are shown below:
a) Number of Records
Advantages:
- The number of records processed per period of time can be easily measured in PowerCenter metadata
- Easier to compute the cost per record
- Not dependent upon server speed for measurement
Disadvantages:
- The number of rows may not equate to large data volumes
- It is not obvious how to fairly equate rows to a monetary amount
- Depending on the robustness of the implemented solution, it may take more hardware resources for Project A to process “n” rows than it takes Project B to process “n” rows
b) Data Volume
Advantages:
- More logical than counts of data records
- Easily measured in PowerCenter metadata
Disadvantages:
- Not simple to compute the total amount of machine resources used and hence the cost
- Depending on the robustness of the implemented solution, it may take more hardware resources for Project A to process “n” rows than it takes Project B to process “n” rows
c) CPU Time
Advantages:
- Easily measured by PowerCenter metadata
- Ability to measure server utilization versus the previous two methods
Disadvantages:
- Need to identify the processes by user to charge by this method
- In a shared hardware environment, if other processes are running on the server at the same time, the server run time may be longer, unfairly penalizing the customer processing this data
Direct Cost
The ‘client’ pays the direct costs associated with a request, which may include incrementally purchased hardware or software as well as a per-hour or per-day cost for developers or analysts. The development group provides an estimate (with assumptions) of the cost to deliver based on the understanding of the client’s requirements.
Advantages:
- Changes to requirements are easily absorbed into the project contract
- Project activities are independent of pricing pressures
- The client has clear visibility as to exactly what s(he) is paying for
Disadvantages:
- The client absorbs the risk associated with both the product definition and the estimation of delivery cost
- The client must be concerned with, and therefore pay attention to, the day-to-day details of the ICC
- There is no cost incentive for the development group to deliver in the most cost-effective way and hence the total cost of ownership might be high
Cost Allocation
Costs can also be allocated on a more-or-less arbitrary basis irrespective of any actual resource usage. This method is typically used for ongoing operating costs but may also be used for project work. The general presumption with this method is that most IT costs are either fixed or shared and therefore should simply be “allocated” or spread across the various groups that utilize the services.
Advantages:
- Ease of budgeting and accounting
- All centralized and fixed costs are accounted for regardless of the demand
Disadvantages:
- Needs sponsorship from the Executives for larger IT budgets rather than departmental funding
- High-level allocation may not be seen as “fair” if one business unit is larger than another
- There is little-to-no connection between the specific services that a consumer uses and the costs they pay (since the costs are based on other arbitrary measures)
Each model has a unique profile of simplicity, fairness, predictability and controllability from the consumer perspective, which is represented graphically in the following figure.
In general, Informatica recommends a hybrid approach of service-based, flat rate, and measured resource usage methods of charging for services provided to internal clients:
- Direct cost for hardware and software procurement
- Fixed price (service-based) for project implementation
- Measured resource usage for operations
- Tiered flat rate for support and maintenance
Alignment with ICC Models
There are five standard ICC models, as illustrated in the figure below:
How do the five ICC models align with the financial framework and is there an “ideal” approach for each of the ICC organizational models? The short answer is “it depends.” In other words, many organizational constraints that can be linked to accounting rules, corporate culture, or management principles may dictate one approach or another. The reality is that any combination of the four financial models can be used with any of the five ICC models. That said, there is a common pattern, or recommended “sweet spot” for how to best align the ICC model with the financial accounting models. The following figure summarizes that alignment.
The Best Practices ICC typically focuses on promoting integration standards and best practices for new projects or initiatives, which puts it on the strategic end of the budget spectrum. Furthermore, it is often a centrally funded group with little or no chargeback in support of a charter to act as an organizational change agent. Zero charge-back costs encourage project teams to use the ICC and therefore spread the adoption of best practices.
The Standard Services ICC is often a hybrid model encompassing both centrally-funded governance technology or governance
activities (which service consumers are not likely to pay for) as well as training services or shared software development (especially in an SOA COE), typically charged back to projects.
The Shared Services ICC is the most common approach and may involve both project activities and operational activities. Because most Shared Services groups are organized as a federation, it complicates the charge-back accounting to the point where it is too cumbersome or meaningless (e.g., people costs are already distributed because the resources reside in different cost centers). If a charge-back scheme is used for a Shared Services ICC, it is typically a hybrid approach based on a combination of project charges and operational allocations.
The Central Services ICC requires more mature charge-back techniques based on the service levels or usage. This is important because it requires strong consumer orientation and incentives to encourage responsiveness in order to be perceived positively and sustain operations. In most organizations with a Central Services group, if the service consumers do not feel their needs are being met, they find another alternative and, over time, the ICC is likely to disappear or morph into a shared services function. In other words, a centrally funded Central Services group is not a sustainable model. It puts too much emphasis on central planning, which results in dysfunctional behavior and therefore cannot be sustained indefinitely.
The Self-Service ICC is typically either 100 percent centrally funded with no chargeback or 100 percent fully cost recovered. This particular ICC can typically be outsourced, or operate internally on a fully-loaded cost basis, or be absorbed into the general network and IT infrastructure. A hybrid funding model for a Self-Service ICC is unusual.
Appendix A – Chargeback Case Studies
This section provides two ICC chargeback case studies based on real-world examples. The case studies have been disguised to allow us to be as specific as possible about the details.
Case Study #1: Charge-Back Model—ETL COE Production Support Chargeback
A large U.S.-based financial institution, BIGBANK, was looking for a way to reduce the cost of loading its Teradata-based data warehouse. The extract, transfer, load (ETL) process was mainframe based, with an annual internal cost of more than $10 million, which was charged back to end users through an allocation process based on the percentage of data stored on the warehouse by each line of business (LOB). Load volume into the warehouse was 20 terabytes per month and demand for new loads was growing steadily. BIGBANK decided to implement a mid-range solution for all new ETL processes and eventually retire the more expensive mainframe-based solution.
Initial implementation costs of a highly scalable mid-range solution, including licensing, hardware, storage, and labor, were approximately $2.2 million annually. This solution consisted of an 11-node, grid computing based Sun solution with a shared Oracle-RAC data repository. Three nodes were dedicated for production, with two nodes each for development, system integration test, user acceptance test, and contingency. Estimated ETL load capacity for this solution was greater than 40 TB/month.
Management wanted to implement the new solution using a self-funding mechanism, specifically a charge-back model whereby the projects and business units using the shared infrastructure would fund it. To achieve this goal, the cost recovery model had to be developed and it had to be compelling. Furthermore, given that the ETL capacity of the new mid-range environment exceeded the load volumes of the existing Teradata warehouse, there was significant opportunity for expanding how many applications could use the new infrastructure.
The initial thought was to use load volumes measured in GB/month to determine charge-back costs based on the total monthly cost of the environment, which included the non-production elements. There would be an allocation to each LOB based upon data moved in support of a named application using the environment. Load volumes were measured daily using internal mid-range measurement tools and costs were assigned based upon GB/month moved. The problem with this approach is that early adopters would be penalized, so instead, a fixed price cap was set on the cost/GB/month. Initially, the cost cap for the first four consumers was set at $800/GB to find the right balance between covering much of the cost but at a price-point that was still tolerable. The plan was to further reduce the cost/GB as time went on and more groups used the new system. After 18 months, with over 30 applications onboard and loading more than six TB/month, the GB/month cost was reduced to less than $50/GB. Load volumes and the associated costs were tracked monthly. Every six months, the costs were adjusted based upon the previous six months’ data and assigned to the appropriate named applications.
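The cost-recovery mechanism described above can be expressed as a simple calculation: spread the monthly environment cost over the measured GB, but cap the effective rate so early adopters are not penalized. In the sketch below, only the $800/GB cap and the approximate $2.2 million annual cost come from the case study; the application usage figures are invented.

```python
# Sketch of the capped cost-recovery calculation described above.
MONTHLY_ENVIRONMENT_COST = 2_200_000 / 12   # annual run cost spread per month
PRICE_CAP_PER_GB = 800.0                    # early-adopter protection from the case study

usage_gb = {"AppA": 40, "AppB": 25, "AppC": 10}  # GB moved per application this month (invented)

total_gb = sum(usage_gb.values())
uncapped_rate = MONTHLY_ENVIRONMENT_COST / total_gb
effective_rate = min(uncapped_rate, PRICE_CAP_PER_GB)

for app, gb in usage_gb.items():
    print(f"{app}: {gb} GB x ${effective_rate:,.2f}/GB = ${gb * effective_rate:,.2f}")

recovered = total_gb * effective_rate
print(f"Recovered ${recovered:,.2f} of ${MONTHLY_ENVIRONMENT_COST:,.2f}")
# While the cap is in force, the environment is only partially recovered; as more
# applications onboard, the uncapped rate falls below the cap and recovery improves.
```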
Over time, the charge-back methodology of GB/month proved to be incomplete. Required labor was driven more by the number of load jobs per supported application and less by total volumes. The charge-back model was adjusted to tie labor costs to the total number of jobs per application per month. Hardware and software costs remained tied to GB loaded per month. All in all, the charge-back approach was an effective way to use projects to fund a shared infrastructure. At the time of this writing, use of the new mid-range solution continues to grow. There is no set date when the legacy mainframe ETL jobs will be fully retired, but with all the new ETL work being deployed on the mid-range infrastructure, the legacy jobs will gradually shrink due to attrition; eventually, it will be an easy decision to invest in migrating the remaining ones to the new environment.
Case Study #2: Charge-Back Model—ETL COE Initiative-Driven Capacity Chargeback
Building further upon the case study at BIGBANK, funding for incremental production support personnel was identified as a serious risk early in the stages of deployment of the new shared infrastructure. The new Data Integration environment and its associated processes were internally marketed as highly available and immediately ready for any application that needed the service. Over time, however, it became increasingly obvious that as more applications moved from project status to production status, incremental production support staff would be required. Forecasting that incremental production support labor and having the funds available to source the labor became increasingly challenging in an organization that planned base support budgets annually. In short, demand for production support resources was driven by projects, which was out of sync with the annual operating budget planning cycle.
As stated in the previous case study, the environment was self-funded by the applications using a fairly simple charge-back methodology. The charge-back methodology assumed that sufficient production support staff would be available to support all consuming applications. However, the data used to calculate a monthly application chargeback was based upon actual throughput metrics after several months in production. In other words, the metrics that showed that additional staff would be required became apparent months after the workload had already increased. When an application came aboard that required extensive support but did not have incremental labor forecast, the production support staff in place was forced to resort to heroic efforts to maintain the application. The resultant staff angst and internal customer dissatisfaction were significant.
To solve this issue, the concept of an operational surcharge to support the project before moving the application into production was instituted, based upon estimated data and job volumes. Called Operational Capacity from New Initiatives (OCNI), this cost was added to the project estimate before initial funding was allocated to the project. Once the project was approved and funds transferred between cost centers, the OCNI funds were pooled in a separate cost center (i.e., held in “escrow”), often from multiple projects, until the work volume in the environment exceeded prescribed limits (usually average hours worked by the production support staff during four weeks). When work volume limits were exceeded, incremental staff was sourced and paid for with the escrowed OCNI dollars. At the end of the budget year, the incremental staff were moved to the base operating budget and the cycle started over again with the new budget year. This allowed the flexibility to rapidly add production support staff as well as effectively plan for base staff in the following budget year forecast.
The result was that operational support resources more closely aligned with the actual support workload. The operational staff were not as stressed, the internal consumers were happier because their costs were included upfront in the planning cycle, the finance staff were pleased to make the model work within the constraints of the company’s accounting rules, and IT management had increased confidence in making decisions related to the annual operating budget planning. In summary, it was a win-win for everyone.
Last updated: 29-May-08 16:48
Engagement Services Management
Challenge
Because Integration Competency Centers (ICCs) are, by definition, shared services functions that support many and varied customers, it is essential that they operate as internal businesses with a portfolio of services that potential customers can easily find, understand and leverage. This Best Practice focuses on defining and developing the various services to be provided by the ICC. The number of different services provided (e.g., Production Operations, Metadata Management and Integration Training) determines the initial size and scope of the ICC. Once the services have been defined and the appropriate capabilities established, the organization can consider sustaining the service(s) in an operational mode.
Description
Services to be offered by an ICC should include the following attributes:
- Name of service
- Description / narrative of service
- Who is the likely consumer of the service
- Value proposition
- Cost of service
- Ordering mechanism and delivery process
Oftentimes there is confusion between service definitions and delivery processes. There is a natural tendency for individuals to describe “what” they do from their own perspective rather than from the perspective of the customer who is the consumer of the service. Lack of clarity on this distinction is a primary cause of failed adoption of an ICC. It is imperative, therefore, to internalize this distinction as a first step. Any attempt to develop a service portfolio or value proposition in advance of obtaining this insight is pointless, because the result will be an organization that is perceived to be internally rather than externally focused, a situation that undermines the success of an ICC.
The sequence of steps needed to fully define the portfolio of services is as follows:
1. Define the services, which in turn
2. Defines engagement and delivery processes, which in turn
3. Specifies the capabilities and activities, which in turn
4. Drives requirements for tools and skills.
In summary, the first step is to define the service from the customer’s perspective. For example, consider a package shipping company. If it defines its service and value proposition as guaranteed rapid delivery of packages from anywhere to anywhere in the world, it is likely to maximize processes such as an extensive network of local delivery vehicles, a fleet of airplanes and sophisticated package sorting and tracking systems. If, on the other hand, it defines its service and value proposition as low-cost delivery of bulk goods to major U.S. cities, it is likely to maximize its network of truck and train delivery vehicles between major cities. Note that in this second scenario the customer is different (i.e., a smaller number of commercial customers instead of a large number of consumer customers) and the value proposition is also different (i.e., low cost versus speed and flexibility). Thus, it is essential to begin with a description of the service based on a clear understanding of who the customer is, and what the value proposition is from the customer perspective.
Once that has been established, you can begin to design the processes, including how the service will be discovered and ordered by the customers. After the process definitions are complete, the ICC can proceed to define the capabilities and activities necessary to deliver the service requests and also to determine the tools and staff skills required.
Note: Fulfillment elements such as capabilities, activities and tools (while essential to maintain a competitive service delivery) are
irrelevant to the customer. For example, customers do not care how the delivery company knows about the current status of each package, as long as the organization can report the status requested by the customer. Similarly for an ICC, the internal customers have little regard for how the developer optimizes the performance of a given ETL transformation – they only care that it satisfies their functional and quality requirements.
There are two key tests for determining if the services have been defined at the appropriate level of detail. The first is to list and count them. If you have identified more than ten ICC services, you have probably mistaken fulfillment elements for services. The second is to apply a market-based price to the service. If an external organization with a comparable service description and a specific pricing model cannot be located, the service is probably defined at the wrong level. Because defining services correctly is the foundation for a successful ICC operation, all automation, organization, and process engineering efforts should be postponed until misconceptions are resolved.
Service identification begins with the question, "What need or want are we fulfilling?" In other words, "What is our market, who are our customers and what do they require from us?" A service is defined in terms of explicit value to the customer and addresses items such as scope, depth, and breadth of services offered. For example, it should consider whether the service will be a one-size-fits-all offering, or whether gradations in service levels will be supported. Individual services can then be aggregated into a service portfolio, which is the external representation of the ICC’s mission, scope, and strategy. As such, it articulates the services that the ICC chooses to offer. Two points are implied:
- The ICC will consciously determine the services it offers; simply performing a service "because we always have" is not relevant.
- No ICC can be world-class in everything; just as an enterprise may develop strategic partnerships to outsource non-core competencies, so must the ICC. This means sourcing strategically for services that are non-core to ensure that the ICC is obtaining the best value possible across the entire service portfolio. Outsourcing for selected portions of the service delivery can have other benefits as well, such as the ability to scale up resources during periods of peak demand rather than hiring (and later laying off) employees.
A value proposition states the unique benefits of the service in terms the customer can relate to. It answers the questions, "Why should I (i.e., the customer) buy this service?" and "Why should I buy it from the ICC?" ICCs are well positioned (due to their cross-functional charter) to understand the nuances of internal customer needs, to predict future direction and to develop value propositions in a way that appeals to them.
The following figure is an example of an externally focused, value-based service portfolio for an ICC. It may be difficult to obtain relevant external benchmarks for comparison, but it should always be possible to find variants or lower-level services that can be aggregated, or higher-level services that can be decomposed. This sample list was generated by browsing the Internet and discovering five sites that offer similar services and then synthesizing the service descriptions to align with the ICC charter within the culture and terminology for a given enterprise.
See Selecting the Right ICC Model for a full list of services that can be provided.

Service Name: Product Evaluation & Selection
Value Proposition: The Product Evaluation & Selection service is a thorough, fact-based evaluation and selection process to provide a better understanding of differences among vendor offerings, seen in the light of the ICC landscape, and to identify vendors that best meet the requirements. This service considers all the enterprise requirements and resolves conflicts between competing priorities including security, performance, legal, purchasing, technology, standards, risk compliance and operational management (just to name a few). It is the most efficient way to involve cross-functional teams to ensure that once a product is selected, all the organizational processes are in place to ensure that it is implemented and supported effectively.
Consumer: Application Teams, Architecture, PMO
Process: RFI, RFP
Cost: No charge for quick assessment and 1-page vendor brief; Direct Cost Chargeback for in-depth evaluation.

Service Name: Application Portfolio Optimization
Value Proposition: The Application Portfolio Optimization service provides a thorough inventory, complete assessment, analysis and rationalization for a specific LOB and its IT applications with respect to business strategy, current and future business and technology requirements and industry standards. It gives LOBs the ability to assess application rationalization opportunities across a variety of business functions and technology platforms. It provides an holistic application portfolio view of planning and investment prioritization that includes application system capability sequencing and dependencies (roadmaps). It provides support to LOB teams to reduce ongoing IT costs, improve operational stability and accelerate implementation of new capabilities by systematically reducing the number of applications and data replications.
Consumer: CIO Teams, LOB Business Executives
Process: Information Architecture Roadmap
Cost: Negotiated Fixed Price

Service Name: Integration Training & Best Practices
Value Proposition: The Integration Training & Best Practice service facilitates capturing and disseminating IT intellectual capital associated with integration processes, techniques, principles and tools to create synergies across the company. The integration practice leverages model-based planning techniques to simplify and focus complex decision-making for strategic investments. It includes a formal peer-review process for promoting integration practices that work well in one LOB or technology domain to a standard Best Practice that is applicable across the enterprise.
Consumer: Integration Team members
Process: Integration Training, Internal Newsletter, BLOGS, Brown Bag Presentations, Integration Principles
Cost: No charge for ad-hoc support and brief (<1 hour) presentations; direct cost chargeback for formal training sessions.

Service Name: Integration Consulting
Value Proposition: The Integration Consulting service enables project teams to tap into a group of dedicated domain experts to adopt and successfully implement new technologies. This service translates business and technology strategies into technical design requirements and assists projects with integration activities and deliverables through any and all stages of the project life-cycle. Performance is measured by factors such as investment expense, operating cost, system availability and the degree to which the solutions can support both the existing business strategy and be adapted to sustain emerging trends.
Consumer: Application Teams
Process: Integration Project Request
Cost: Direct cost chargeback.
TIP
Each set of services based upon ICC models allows organizations to focus on the type of services that will provide the best ROI for those models. Aligning services and resources to cost savings helps organizations derive value from the ICC.
Other Factors Affecting Service Offerings: Strategic vs. Tactical Priorities
Another significant challenge of determining service offerings is the question of targeting offerings based on Strategic Initiatives or Tactical Projects. For example, if an ICC has a charter to directly support all strategic initiatives and provide support to tactical projects on an advisory basis, the service portfolio might include several comprehensive service offerings for strategic initiatives (e.g., end-to-end analysis, design, development, deployment and ongoing maintenance of integrations) and offer a “Best Practices Advisory” service for tactical projects. By reviewing a list of IT projects provided by the Project Management Office (PMO) for an organization, projects can be scored on a 1 to 5 numerical scale or simply as High, Medium or Low depending on the level of cross-functional integration that is required. The following figure illustrates the number of tactical versus strategic projects that an ICC might address.
Once categorized and scored with regard to integration needs, the ICC could provide central services such as development management of strategic projects that have a high index of integration versus projects with low levels of cross-functionality (which could be supported with a Best Practice ICC model). The goal is to focus on ICC service offerings that are geared toward strategic integration initiatives and to provide minimal ICC services for tactical projects.
Summary
The key to a successful ICC is to offer a set of services that add value to the project teams it serves. Services can be very helpful in reducing project overhead for common functions in each data integration project. When such functions are removed from the project, the core integration effort can be reduced substantially.
Last updated: 06-Sep-08 10:53
Information Architecture

Challenge
Implementing best practices to provide for a data governance program requires the following activities:
- Creating various views or models (i.e., levels of abstraction) for multiple stakeholders.
- Adopting a shared modeling tool and repository that supports easy access to information.
- Keeping the models current as the plans and environment change.
- Maintaining clear definitions of data, involved applications/systems and process flow/dependencies.
- Leveraging metadata for Data Governance processes (i.e., inquiry, impact analysis, change management, etc.).
- Clearly defining the integration and interfaces among the various Informatica tools and between Informatica tools and other repositories and vendor tools.
Description
Information architecture is the art and science of presenting and visually depicting concept models of complex information systems in a clear and simplified format for all of the various stakeholders and roles. There are three key elements of the Information Architecture best practice:
- Methodology for how to create and sustain the models
- Framework for organizing various model views
- Repository for storing models and their representations
Methodology
The information architecture methodology is described here in the context of a broader data governance methodology, as shown in the figure below. Many of the activities and techniques are applicable in other contexts such as data migration programs or data rationalization in support of mergers and acquisitions. It is the task of the architect and program team to tailor the methodology for a given program or enterprise strategy or purpose.
The following paragraphs provide a high-level description of the ten steps of the data governance methodology. The information architecture methodology described in this best practice is most closely aligned with step 3 and steps 5 through 10. For details on steps 1-4, refer to the Data Governance Enterprise Strategy document.
1. Organize Governance Committee: Identify the business and IT leaders that will serve as the decision group for the enterprise, define the committee charter and business motivation for its existence, and establish its operating model. Committee members need to understand why they are there, know the boundaries of the issues to be discussed, and have an idea of how they will go about the task at hand.
2. Define Governance Framework: Define the "what, who, how and when" of the governance process, and document data policies, integration principles and technology standards that all programs must comply with.
3. Develop Enterprise Reference Models: Establish top-down conceptual reference models including a) Target Operating Blueprint, b) Business Function/Information Matrix and c) Business Component Model.
4. Assign Organizational Roles: Identify data owners and stewards for information domains, responsible parties/owners of shared business functions in an SOA strategy, or compliance coordinators in a Data Governance program.
5. Scope Program: Leverage the enterprise models to clearly define the scope of a given program and develop a plan for the road-mapping effort. Identify the high-level milestones required to complete the program and provide a general description of what is to take place within each of the larger milestones identified.
6. Assess Baseline and Data Quality: Leverage the enterprise models and the scope definition to complete a current-state architectural assessment, profile data quality, and identify business and technical opportunities.
7. Develop Target Architecture: Develop a future-state data/systems/service architecture in an iterative fashion in conjunction with Step 6. As additional business and technical opportunities become candidates for inclusion, the projected target architecture will also change.
8. Plan Migration Roadmap: Develop the overall program implementation strategy and roadmap. From the efforts in Step 5, identify and sequence the activities and deliverables within each of the larger milestones. This is a key part of the implementation strategy, with the goal of developing a macro-managed roadmap which adheres to defined best practices. Identifying activities does not include technical tasks, which are covered in the next steps.
9. Develop Program Models: Create business data models and information exchange models for the defined program (i.e., logical and physical models are generally created by discrete projects within the program). The developed program models use functional specifications in conjunction with technical specifications.
10. Implement Projects: This is a standard project and program management discipline with the exception that some data governance programs have no defined end. It may be necessary to loop back to step 5 periodically and/or provide input to steps 2, 3 or 4 to keep them current and relevant as needs change. As the projects are implemented, observe which aspects could have been more clearly defined and at which step an improvement should take place.
Information Architecture Framework
The information architecture framework is illustrated in the following figure.
Key features of the framework include:
- A four-layer architecture, with each layer focusing on a level of abstraction that is relevant for a particular category of stakeholder and the information they need:
  Layer 4 – Enterprise View: Overarching context for information owners and stewards
  Layer 3 – Business View: Domain models for business owners and project sponsors
  Layer 2 – Solution View: Architecture models for specific systems and solutions
  Layer 1 – Technology View: Technical models for developers, engineers and operations staff
- Layer 3 is based on the reference models defined in Layer 4; these layers are developed from a "top down" perspective.
- Layers 1 and 2 are created top-down when doing custom development (i.e., able to control and influence the data models) and bottom-up when doing legacy or package integration (i.e., little ability to control the data model and generally a need to reverse engineer the models using analytical tools).
- Relevant information about the models is maintained in a metadata repository, which may be centralized (i.e., contains all metadata) or federated (i.e., contains some metadata as well as common keys that can be used to link with other repositories to develop a consolidated view, as required); the federated linking idea is sketched after this list.
- Separate models are used for representing data at rest (i.e., data persisted in a repository and maintained by an application component) and data in motion (i.e., data exchanged between application components).
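To make the federated-repository point concrete, the following minimal sketch joins records from two hypothetical metadata stores on a shared key to produce a consolidated view. It is purely illustrative and does not use the Metadata Manager API; all repository names, keys and values are invented for the example.

// Illustrative sketch: a "federated" consolidated view built by joining records
// from two metadata stores on a shared key. Names are hypothetical; this is not
// the Metadata Manager API, only the linking idea described above.
import java.util.HashMap;
import java.util.Map;

public class FederatedMetadataView {
    public static void main(String[] args) {
        // Repository A: business terms keyed by a common element ID.
        Map<String, String> businessGlossary = Map.of(
                "CUST-001", "Customer Lifetime Value",
                "CUST-002", "Customer Churn Flag");

        // Repository B: technical lineage keyed by the same common element ID.
        Map<String, String> technicalLineage = Map.of(
                "CUST-001", "DW.F_CUSTOMER.CLV <- staging.clv_calc",
                "CUST-002", "DW.F_CUSTOMER.CHURN_FLAG <- crm.churn_model");

        // Consolidated view: join on the shared key.
        Map<String, String> consolidated = new HashMap<>();
        businessGlossary.forEach((key, term) ->
                consolidated.put(key, term + " | lineage: "
                        + technicalLineage.getOrDefault(key, "unknown")));

        consolidated.forEach((key, value) -> System.out.println(key + ": " + value));
    }
}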
Reference Models
There are three models in the enterprise reference model layer of the information architecture framework, but only one instance of these models for any given enterprise.
- Target Operating Blueprint: Business context diagram for the enterprise showing key elements of organizational business units, brands, suppliers, customers, channels, regulatory agencies, and markets.
- Business Function/Information Matrix: Used to define essential operational capabilities and related service and information flows to generate a create/use matrix. Basic service functions are used as navigation points to process models (i.e., reflected in process and target systems models).
- Business Component Model: Used to define reference system families derived from the create/use matrix. Serves as a navigation point into other systems view models.
Reference models may be purchased, developed from scratch, or adapted from vendor/industry models. A number of IT vendors and analyst firms offer various industry or domain-specific reference models. The level of detail and usefulness of the models varies greatly. It is not in the scope of this best practice to evaluate such models, only to recognize that they exist and may be worthy of consideration. There are also a significant number of open industry standard reference models that should be considered. For example, the Supply-Chain Operations Reference (SCOR) is a process reference model that has been developed and endorsed by the Supply-Chain Council (SCC) as the cross-industry de facto standard diagnostic tool for supply chain management. Another example is the Mortgage Industry Standards Maintenance Organization, which maintains process and information exchange definitions in the mortgage industry. The reference models that are available from Proact, Inc. are particularly well suited to data integration and data governance programs.
Some key advantages of buying a framework rather than developing one from scratch include:
- Minimizing internal company politics: Since most internal groups within a company have their own terminology (i.e., domain-specific reference model), it is often a very contentious issue to rationalize differences between various internal models and decide which one to promote as the common enterprise model. A common technique that is often attempted, but frequently fails, is to identify the most commonly used internal model and make it the enterprise model. This can alienate other functions that don't agree with the model and can, in the long run, undermine the data governance program and cause it to fail. An effective external model, however, can serve as a "rallying point" and bring different groups from the organization together rather than pitting them against each other or forcing long, drawn-out debates.
- Avoiding "paving the cow path": The "cow path" is a metaphor for the legacy solutions that have evolved over time. An internally developed model often tends to reflect the current systems and processes (some of which may not be ideal), since there is a tendency to abstract away details from current processes. This in turn can entrench current practices which may in fact not be ideal. An external model, almost by definition, is generic and does not include organization-specific implementation details.
- Faster development: It is generally much quicker to purchase a model (and tailor it if necessary) than to develop a reference model from the ground up. The difference in time can be very significant.
A rough rule of thumb is that adopting an external model takes roughly one to three months while developing a model can take one to three years. While the reference model may involve some capital costs, the often hidden costs of developing a reference model from scratch are much greater. Regardless of whether you buy or build the reference models, in order for them to be effective and successful, they must have the following attributes:
- Holistic: The models must describe the entire enterprise and not just one part. Furthermore, the models must be hierarchical and support several levels of abstraction. The lowest level of the hierarchy must be mutually exclusive and comprehensive (ME&C), which means that each element in the model describes a unique and non-overlapping portion of the enterprise while the collection of elements describes the entire enterprise.
Note: It is critical to resist the urge to model only a portion of the enterprise. For example, if the data governance program focus is on customer data information, it may seem easier and more practical to model only customer-related functions and data. The issue is that without the context of a holistic model, the definition of functions and data will inherently be somewhat ambiguous and therefore be an endless source of debate and disagreement.
- Practical: It is critical to establish the right level of granularity of the enterprise models. If they are too high-level, they will be too conceptual; if they are too low-level, the task of creating the enterprise models can become a "boiling the ocean" problem and consume a huge amount of time and resources. Both extremes, too little detail and too much detail, are impractical and a root cause of failure for many data governance programs.
TIP There are two "secrets" to achieving the right level of granularity. First, create a hierarchy of functions and information subjects. At the highest level it is common to have in the range of 5-10 functions and information subjects that describe the entire enterprise. Second, at the lowest level in the hierarchy, stop modeling when you start getting into "how" rather than "what". A good way to recognize that you are in the realm of "how" is if you are getting into technology-specific or implementation details. A general rule of thumb is that an enterprise reference model at the greatest level of detail typically has between 100 and 200 functions and information subjects.
- Stable: Once developed, reference models should not change frequently unless the business itself changes. If the reference models did a good job separating the "what" from the "how", then a business process change should not impact the reference models; but if the organization expands its product or service offerings into new areas, either through a business transformation initiative or a merger/acquisition, then the reference model should change. Examples of scenarios that would cause the reference model to change include a retail organization transforming its business by manufacturing some of its own products, or a credit card company acquiring a business that originates and services securitized car and boat loans.
Reference models, once created, serve several critical roles:
1. They define the scope of selected programs and activities. The holistic and ME&C nature of the reference models allows a clear definition of what is in scope and out of scope.
2. They provide a common language and framework to describe and map the current-state enterprise architecture. The reference model is particularly useful for identifying overlapping or redundant applications and data.
3. They are particularly useful for identifying opportunities for different functional groups in the enterprise to work together on common solutions.
4. They provide tremendous insight for creating target architectures that reflect sound principles of well-defined but decoupled components.
Information Model
There are two information models in the Business View (Layer 3) of the information architecture framework. These are sometimes referred to as semantic models since there may be separate instances of the models for different business domains.
- Business Glossary: List of business data elements with a corresponding description, enterprise-level or domain-specific validation rules, and other relevant metadata. Used to identify source of record, quality metrics, ownership authority and stewardship responsibility.
- Information Object Model: Used to provide traceability from the enterprise function and information subject models to the business glossary (i.e., an information object includes a list of data elements from the business glossary). Possible uses include assessing current information management capabilities (reflected in process and target systems models) or serving as a conceptual model for custom-developed application components.
The Business Glossary is implemented as a set of objects in Metadata Manager to capture, navigate, and publish business terms. This model is typically implemented as a custom extension to Metadata Manager (refer to the Metadata Manager Best Practices for more details) rather than as Word or Excel documents (although these formats are acceptable for very simple glossaries in specific business domains). The Business Glossary allows business users, data stewards, business analysts, and data analysts to create, edit, and delete business terms that describe key concepts of the business. While business terms are the main part of the model, it can also be used to describe related concepts like data stewards, synonyms, categories/classifications, rules, valid values, quality metrics, and other items. Refer to Metadata Manager Business Glossary and the Data Quality and Profiling Best Practices for more information.
Creating a data model for the business glossary is a normal data modeling activity and may be customized for each enterprise based on needs. A basic version of the model should contain the following classes, properties and associations:
- Category (name, description, context)
- Business Term (name, description, context, rule, default value, quality score, importance level)
- Data Steward (name, description, email address, phone number)
- Domain (either a data type, a range, or a set of valid values)
- Valid Value (name of the value itself and a description)
Relationships:
- Category Contains BusinessTerm
- Category Contains Category
- DataSteward Owns BusinessTerm
- BusinessTerm Has ValidValue
- BusinessTerm HasSynonym BusinessTerm
A simple object-model sketch of these classes and relationships follows.
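The classes and relationships listed above translate naturally into a simple object model. The sketch below is one possible rendering in Java; the field names follow the text, while the class layout, list representations and demo values are illustrative choices rather than a prescribed implementation.

// One possible rendering of the glossary model described above. Field names
// follow the text; the classes themselves are illustrative, not a product API.
import java.util.ArrayList;
import java.util.List;

class DataSteward {
    String name, description, email, phone;
    List<BusinessTerm> owns = new ArrayList<>();          // DataSteward Owns BusinessTerm
}

class ValidValue {
    String name, description;
}

class BusinessTerm {
    String name, description, context, rule, defaultValue;
    double qualityScore;
    int importanceLevel;
    String domain;                                        // a data type, range, or value set
    List<ValidValue> validValues = new ArrayList<>();     // BusinessTerm Has ValidValue
    List<BusinessTerm> synonyms = new ArrayList<>();      // BusinessTerm HasSynonym BusinessTerm
}

class Category {
    String name, description, context;
    List<Category> subCategories = new ArrayList<>();     // Category Contains Category
    List<BusinessTerm> terms = new ArrayList<>();         // Category Contains BusinessTerm
}

public class GlossaryModelDemo {
    public static void main(String[] args) {
        Category customer = new Category();
        customer.name = "Customer";

        BusinessTerm churnFlag = new BusinessTerm();
        churnFlag.name = "Customer Churn Flag";
        churnFlag.domain = "Y/N";

        DataSteward steward = new DataSteward();
        steward.name = "Jane Doe";
        steward.owns.add(churnFlag);
        customer.terms.add(churnFlag);

        System.out.println(customer.name + " contains " + churnFlag.name
                + ", owned by " + steward.name);
    }
}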
The following diagram shows the discovery, definition and specification flow among various users and between the analyst tools and PowerCenter. The diagram illustrates pictorially the general processes involved in several different use cases.
Process Models
There are three process models in the Business View (Layer 3) of the information architecture framework.
- Information Exchange Model: Used to identify information exchanges across business components, including the use of integration systems (e.g., hubs, buses, warehouses, etc.) to enable the exchanges. All of the various information exchanges are represented within a single diagram or model at this level of abstraction.
- Operational Sequence Model: Alternate representation for operational scenarios using UML techniques. Individual operational processes may have their own model representation.
- Business Event Model: A common (canonical) description of business events including the business process trigger, target subscribers and payload description (from elements in the business glossary). This model also fits into the classification of semantic models, but in this case for data in motion rather than data at rest. A sketch of such a canonical event follows this list.
For further details on how to develop the information exchange model and operational sequence model, refer to the Proact enterprise architecture standards.
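As a concrete illustration of the Business Event Model description above (business process trigger, target subscribers, and a payload drawn from the business glossary), the sketch below shows one possible canonical event structure. It uses Java records (Java 16 or later) and invented event, process and element names; it is not a product schema.

// Illustrative canonical business-event structure per the description above:
// a business-process trigger, target subscribers, and a payload whose element
// names come from the business glossary. Not a product schema.
import java.util.List;
import java.util.Map;

public class BusinessEventSketch {

    record BusinessEvent(
            String eventName,               // e.g., "CustomerAddressChanged"
            String triggeringProcess,       // business process that raised it
            List<String> targetSubscribers, // components subscribed to the event
            Map<String, Object> payload) {  // element names taken from the glossary
    }

    public static void main(String[] args) {
        BusinessEvent event = new BusinessEvent(
                "CustomerAddressChanged",
                "Customer Maintenance",
                List.of("Billing", "CRM", "Data Warehouse"),
                Map.of("CustomerId", "C-1001",
                       "PostalCode", "94063"));

        System.out.println(event);
    }
}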
Repository
Earlier sections of this document addressed the information architecture methodology and framework. The final dimension of the Information Architecture best practice is a repository-based meta-model and modeling tool. Informatica's Metadata Manager provides an enterprise repository to store models as well as other customizable metadata. Metadata Manager also offers an integrated solution with PowerCenter metadata. Refer to the Metadata Management Enterprise Competency for more information.
Last updated: 03-Jun-08 15:13
People Resource Management

Challenge
Selecting the right team members to staff an Integration Competency Center (ICC) is a critical success factor, as is developing the team members' skills to correspond with the ICC's service offerings. Each member of the team has strengths and weaknesses in a variety of disciplines. The key to building a successful ICC team is to fit the discipline strengths to the needs of the ICC. This Best Practice focuses on the process of selecting individual team members and organizing the team with appropriate reporting structures. Overall, team members should have qualities that promote team unity and allow them to work effectively in a cross-functional, multi-disciplinary environment.
Description

Staff Competencies
In addition to technical competencies and skills in using the specific tools that have been adopted by the organization (sometimes referred to as "hard" skills), there are a number of qualities that promote team unity and allow staff to effectively work in a cross-functional, multi-disciplinary environment. These qualities are often referred to as "soft" skills, but they are nonetheless critical factors in establishing a high-performance team that is effective and well respected across the organization. The following list presents a number of the most important qualities and differentiates the level of competency that should be expected from junior, senior and master (or expert) staff.

Know the Business – Understands the complexities and key factors that impact the business
- Junior ICC Staff: Understands the company's vision, mission, strategy, goals, and culture. Has a general knowledge of business and integration program operations. Articulates the business value delivered through the ICC.
- Senior ICC Staff: Understands and articulates the integration and financial plan for the current year and how those plans support the strategy and goals of the enterprise. Continually examines competitor and best-in-class performers to identify ways to enhance the integration service and value to the enterprise. Integrates industry, market, and competitive data into the integration planning process. Proactively connects others to best-in-class performance to drive enhancement of the integration practice.
- Master ICC Staff: Demonstrates deep/broad integration and functional skills. Demonstrates a business perspective that is much broader than one function/group. Cuts to the heart of complex business/functional issues. Monitors and evaluates the execution and quality of the strategic planning process.

Collaborate – Integrates and collaborates well inside and outside of the organization, holding a customer-centric position
- Junior ICC Staff: Surfaces and resolves conflict with minimal noise. Builds partnerships across the integration community and positively represents the team in the services we provide. Builds trust by consistently delivering on promises.
- Senior ICC Staff: Analyzes and understands the needs of all key stakeholders and takes steps to ensure their continued satisfaction. Partners with LOB teams to define and develop shared goals and expectations. Finds and develops common ground among a wide range of stakeholders. Coaches staff across the organization on integration practices.
- Master ICC Staff: Builds broad-based business relationships across the organization (including business executives). Creates win/win scenarios with key vendors. Leverages external industry organizations to achieve enterprise goals.

Customer Focus – Knows and cares about customers; works well in a team to exceed expectations
- Junior ICC Staff: Defines and reviews the requirements, deliverables and costs of proposed solutions. Identifies variances from system performance and customer requirements and collaborates with LOB teams to improve the variances. Ensures each recommended solution has a scope, timelines, desired outcomes, and performance measures that are well defined and communicated.
- Senior ICC Staff: Contracts and sets clear expectations with internal customers about goals, roles, resources, costs, timing, etc. Positions and sells business and technology partners on innovative opportunities impacting people, process and/or technology. Forecasts how the business is changing and how IT will need to support it.
- Master ICC Staff: Advises senior executives on how solutions will support short- and long-term strategic direction. Drives multi-year strategy and funding/cost-saving opportunities across the enterprise.

Drive for Learning – Sizes up and acts on the learning implications of business strategy and its execution
- Junior ICC Staff: Asks open-ended probing questions to further define problems, uncover needs and clarify objectives. Maintains a current view of "best in class" practices through self-learning and benchmarking. Scans the environment to remain abreast of new developments in business and technology trends.
- Senior ICC Staff: Improves the quality of the integration program by developing and coaching staff across the enterprise to build their individual and collective performance and capability to the standards that will meet the current and future needs of the business. Is recognized as an expert in one or more broad integration domains.
- Master ICC Staff: Is recognized as an expert outside of the enterprise in one or more integration domains. Represents the enterprise on industry boards and committees. Participates in major industry and academic conferences. Serves as an active member of international standards committees.

Capitalize on Opportunities – Recognizes possibilities that increase the depth of integration solutions
- Junior ICC Staff: Identifies patterns and relationships between seemingly disparate data to generate new solutions or alternatives. Gathers necessary data to define the symptoms and root causes (who, what, why and costs) of a problem. Develops alternatives based on facts, available resources, and constraints.
- Senior ICC Staff: Initiates assessments to investigate business and technology threats and opportunities. Translates strategies into specific objectives and action plans. Collaborates across functional groups to determine impact before implementing new processes and procedures. Uses financial, competitive and statistical modeling to define and analyze opportunities. Integrates efforts across LOBs to support strategic priorities.
- Master ICC Staff: Uncovers hidden growth opportunities within market/industry segments to create competitive advantage. Formulates effective strategies consistent with the business and competitive strategy of the enterprise in a global economy. Identifies factors in the external and internal environment affecting the organization's strategic plans and objectives.

Change Leadership – Initiates and creates conditions for change
- Junior ICC Staff: Does not wait for orders to take action on new ideas. Expresses excitement freely concerning new ideas and change.
- Senior ICC Staff: Transcends silos to achieve enterprise results. Skillfully influences peers and colleagues to promote and sell ideas. Displays personal courage by taking a stand on controversial and challenging changes. Leads integrated change efforts across LOB organizations to achieve competitive advantage. Identifies opportunities, threats, strengths and weaknesses of the enterprise. Demonstrates a sense of urgency to capitalize on innovations and opportunities. Challenges the status quo.
- Master ICC Staff: Leverages industry, market and competitor trends to make a compelling case for change within the company. Mobilizes the organization to adapt to marketplace changes. Proactively plans responses to new and disruptive technologies.

Organizational Alignment – Creates process and infrastructure to carry out plans and strategies
- Junior ICC Staff: Advises application teams on technology direction. Develops and maintains business system and corporate integration solutions. Responsible for working on medium to complex integration projects, recommending exceptions to standards, reviewing and approving architectural impact designs and directing implementation of the integration for multiple applications. Conducts complex technology and system assessments for component integration. Acts as a lead in component integration and participates in enterprise integration activity.
- Senior ICC Staff: Acts in a strategic role in the development and maintenance of integrations for a line of business or infrastructure sub-domain that are in compliance with enterprise standards. Provides in-depth technical and systems consultation to internal clients and technical management to ensure alignment with standards. Guides the organization in proper application of integration practice. Leverages both deep and broad technical knowledge, strong influencing and facilitation skills, and knowledge of integration processes and techniques to influence organizational alignment around a common direction.
- Master ICC Staff: Performs as the integration subject matter expert in a specific domain. Organizes, leads, and facilitates cross-entity, enterprise-wide redesign initiatives that encompass an end-to-end analysis and future-state redesign requiring specialized knowledge or skill critical to the redesign effort.

Accountable – Can be counted on to strive for outstanding results
- Junior ICC Staff: Asks probing questions to uncover and manage the needs and interests of all parties involved. Explores alternative positions and identifies common interests to reach a "win/win" outcome.
- Senior ICC Staff: Skillfully influences others to acquire resources, overcome barriers, and gain support to ensure team success. Negotiates project timeline changes to meet unforeseen developments or additional unplanned requests. Escalates issues to appropriate parties when a decision cannot be reached. Assumes accountability for delivering results that require collaboration with individuals or groups in multiple functions. Collaborates with partners across functions to define and implement innovations that improve process execution and service delivery.
- Master ICC Staff: Skillfully influences peers and management to promote and sell ideas. Is accountable for planning, conducting, and directing the most complex, strategic, corporate-wide business problems to be solved with automated systems. Engages others in strategic discussions to leverage their insights and create shared ownership of the outcomes.
ICC Organization
The number of shared resources increases with the size of the organization and varies with the type of ICC model that is chosen. In the following table, the number of ICC staff is represented as a percentage of the total IT staff (i.e., total IT includes both internal employees and external contract staff). For example, if a Best Practices ICC is implemented for a company with 100 IT staff, the ICC would have one to two resources; for a Shared Services ICC, the number can be five to ten resources (a simple sizing sketch follows the table note below). Anything less than one dedicated resource means there is no actual ICC.
* - Number of ICC shared resources as a percentage increases dramatically as Integration Developers are added to perform the integration as part of the ICC
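The sizing examples above amount to a simple percentage calculation. The sketch below restates them (roughly 1-2% of total IT staff for a Best Practices ICC and 5-10% for a Shared Services ICC); the percentage ranges are taken from the examples in the text and should not be read as a formal Informatica sizing formula.

// Back-of-the-envelope ICC sizing based on the examples above: roughly 1-2% of
// total IT staff for a Best Practices ICC and 5-10% for a Shared Services ICC.
// The ranges are restated from the text; other models would need their own ratios.
public class IccSizingEstimate {

    static String estimate(int totalItStaff, double lowPct, double highPct) {
        long low = Math.max(1, Math.round(totalItStaff * lowPct / 100));
        long high = Math.max(low, Math.round(totalItStaff * highPct / 100));
        return low + "-" + high + " dedicated ICC resources";
    }

    public static void main(String[] args) {
        int totalItStaff = 100; // internal employees plus external contract staff

        System.out.println("Best Practices ICC:  " + estimate(totalItStaff, 1, 2));
        System.out.println("Shared Services ICC: " + estimate(totalItStaff, 5, 10));
    }
}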
Best Practice Model
The recommended minimum number of dedicated, shared resources is usually two. The maximum size of the group depends upon the amount of infrastructure currently in the organization. The roles for this model include the following:
Training and Knowledge Coordinator
Develops and maintains mapping patterns, reusable templates, and best-practices documents, usually stored in a central location and readily available to anyone in the enterprise. Coordinates vendor-offered or internally sponsored training in specific integration technology products. Also prepares and delivers events such as seminars or internal presentations to various teams.
Metadata Specialist / Data Steward
Creates standards for capturing and maintaining metadata. The metadata repositories themselves will likely be maintained by various groups around the enterprise in a federated model, so the focus in this model is to define and enforce federated metadata standards and processes to sustain them. Responsible for the data ownership and stewardship of business elements within a particular subject area. Handles data definition and attributing. The Data Steward works in conjunction with the Metadata Specialist: the Metadata Specialist role is responsible for capturing the sources of information governed by the ICC, while the Data Steward is usually a business user or knowledgeable IT resource who is very familiar with the data. This is a corporate investment that may begin by cataloguing information and ultimately becomes the repository for business information.
Technology Standards Model
Up to four shared resources are typically required to support the evolution process. Note that in smaller teams, one individual may play more than one role. The primary roles for this model are shown in the organization chart below:
Standards Coordinator
Actively monitors and promotes industry standards and adapts relevant ones to the needs of the enterprise. Defines, documents, and communicates internal enterprise standards. May also act as the company's official representative to external standards organizations, and may propose or influence the development of new industry standards. Works with the Knowledge Coordinator to publish standards within the organization.
Technical Architect
Develops and maintains the layout and details of the software configurations and physical hardware used to support the ICC, including tools for the ICC operations (e.g., monitoring, element management, metadata repository, and scanning tools) or for the middleware systems (e.g., message brokers, Web service application servers, and ETL hubs).
Vendor Manager
Leads the efforts to select integration products and participates in the selection of vendors for the servers, storage, and network facilities needed for integration efforts. Handles vendor relationships on an ongoing basis, including maintaining awareness of trends, supporting contract negotiations, and escalating service and support issues.
The organization may be formal if an ICC Director has been named, or informal if the individuals report to different managers and coordinate their efforts through some type of committee or consensus team. Many of the roles may be part of an overall steering committee, with each of these roles reporting to an unofficial management group.
Shared Services Model
The shared services model adds a management structure and organization, in contrast to the prior models that are typically based on a committee-type structure. The management structure of this model generally provides development support resources and also begins to perform consulting for each of the key projects. The number of services is defined by the ICC Director based on the scope and mission of the organization. This model is probably the most dynamic of any, since the scope of services dictates the initial number of roles. Refer to Engagement Services Management for options.
The above organization chart includes the key positions for this model. Any of the positions below may also be added:
ICC Director
Ensures that the ICC is in alignment with the IT strategy and anticipates business requirements. Manages the ICC staff, prepares business cases for integration investment initiatives, and is responsible for the annual budgets.
Data Architect
Provides project-level architecture review as part of the design process for data integration projects, develops and maintains the enterprise integration data model, supports complex data analysis and mapping activities, and enforces enterprise data standards.
Project Manager
Supplies full-time management resources experienced in data integration to ensure project success. Is adept at managing dependencies between teams, identifying cross-system risks, resolving issues that cut across application areas, planning complex release schedules, and enforcing conformance to an integration methodology.
Quality Assurance Specialist
Develops QA standards and practices for integration testing, furnishes data validation on integration load tasks, supports string testing (a "unit" test for integration elements) activities in support of integration development, and leads the execution of end-to-end integration testing.
Change Control Specialist
Manages the migration to production of shared objects that may impact multiple project teams, determines the impacts of changing one system in an end-to-end business process, and facilitates a common release schedule when multiple system changes need to be synchronized.
Business Analyst
Facilitates solutions to complex, cross-functional business challenges; evaluates the applicability of technologies, including commercial ERP software; models business processes; and documents business rules concerning process and data integrations.
Integration Developer
Reviews design specifications in detail to ensure conformance to standards and identify any issues upfront, performs detailed data analysis activities, develops data transformation mappings, and optimizes performance of integration system elements.
Centralized Services Model
The centralized services model formalizes both the development and production resources as part of the ICC and focuses primarily on actual project delivery and production support. Thus, the roles of client management and production control are introduced. Again, the organization of the ICC differs from organization to organization, since centralized resources such as Security and Production Control may already exist and can be leveraged.
The key positions added in this model include:
Engagement Manager
Maintains the relationship with the internal customers of the ICC. Acts as the primary point of contact for all new work requests, supplies estimates or quotes to customers, and creates statements of work (or engagement contracts).
Repository Administrator
Ensures leverage and reuse of development assets, monitors repository activities, resolves data quality issues, and requests routine or custom reports.
Security Administrator
Provides access to the tools and technology needed to complete data integration development and overall data security, maintains middleware access-control lists, supports the deployment of new security objects, and handles configuration of middleware security.
Sample Organization Charts
The following organization charts represent samples of a Shared Services model and a Central Services model, respectively. The solid-line relationships are responsible for setting priorities, conducting performance reviews and establishing development priorities. The dashed-line relationships indicate coordination, communication, training and development responsibilities. These relationships need a strong affiliation with the resource manager in setting priorities, but the ICC is not fully responsible for the resource management component.
The chart above shows an example of a Shared Services model organization structure. This model includes both solid- and dashed-line relationships, with the standards and development groups represented by dashed lines. The focus of this model is to build the Operations, Metadata, and Education functions, which coordinate with the Standards committees and Development groups. The Shared Services enable the development teams to succeed with the integration effort, but only influence (rather than control) the associated priorities and policies. As the Shared Services model matures and becomes more recognized in the IT organization, the ICC should become increasingly responsible for project development. The following chart shows an example of a developed Central Services model organization. This model shows only dashed-line relationships with the Standards Committee and the Project Management Office (PMO).
The focus of this model is to build the project development capability through the introduction of Data Movement Architecture, Engagement Management and Development capabilities. These disciplines enable the ICC to have more influence in establishing priorities so that integration issues become more important to the development process.
Last updated: 06-Sep-08 11:35
Planning the ICC Implementation

Challenge
After choosing a model for an Integration Competency Center (ICC), as described in Selecting the Right ICC Model, the challenge is to get the envisioned ICC off the ground. This Best Practice answers the Why, What, When and Who questions for the next 30, 60, 90 and 120+ day increments in the form of activities, deliverables and milestones. The most critical factor in the planning is choosing the right initial project with which to start the ICC.
Description
Different ICC Models Require Different Resources
Since neither the Best Practices nor the Technical Standards ICC has its own platform, these models require significantly fewer activities than the Shared Services or Central Services models. The diagram below shows how they can improve project delivery through the sharing of past experiences.
Planning the Best Practices ICC
The Best Practices ICC does not require the implementation of a platform. A Best Practices ICC can be developed and built from the ground up by documenting current practices that work well and taking the time to improve upon them. This is best done by a group that carries out many projects and that can make the time to review its processes. The processes can be improved upon and then published on the company intranet for other project teams to make use of. The Best Practices ICC model does not enforce standards, so a certain amount of evangelizing may be required for the practices to be adopted by other project teams. It is only with the agreement and adoption of the best practices by others that an ICC results. Therefore, navigating the political and personal goals of individuals leading other projects and getting their buy-in to the ICC best practices is important.
Planning the Technical Standards ICC
As discussed above, the Best Practices ICC model does not include an enforcement role. To achieve enforcement of standards across projects, there must be managerial agreement or centralized edicts that determine and enforce standards. In planning the development of a Technical Standards ICC, the key elements are authority and consensus. Either this will be a practice consolidation exercise in its own right, or the model is established by completing a project successfully according to best practices and obtaining agreement to turn those best practices into enforced standards and to approve exceptions. The Technical Standards ICC does not necessarily include the implementation of a common shared platform. Enforcement of standards, though, does mean that a common and shared platform is the first logical extension of this model.
Executive Sponsorship
The most critical success factor for an ICC is having unwavering executive sponsorship. Since an ICC brings together people, processes, and technology, executive (typically CIO-level) sponsorship is needed to institute the level of organizational change necessary to implement the ICC. Since an ICC is a paradigm shift for employees who are accustomed to a project-silo approach, there can be resistance to the new paradigm. Sometimes resistance is due to perceived job insecurity stemming from an ICC. This perception should be curtailed, as a functioning ICC actually opens the door for more data integration opportunities due to the lower cost of integrating the data. Executive-level sponsorship will greatly help the perception of the ICC and will facilitate the necessary level of organizational change.
Note: In practice, Informatica has found that executive sponsorship is most crucial in organizations that are most resistant to change. Examples include government or educational entities, financial institutions, or organizations that have a long, established history.
Next, it is important for the IT organization to recognize the value of data integration. If the organization is not familiar with data integration or Informatica, a successful project must occur first. A level of success must be established with data integration and Informatica before an ICC can be established. Establishing successful projects and proving value quickly and incrementally is essential. Once the value of data integration has been established within the organization, the business case can be made to implement an ICC to lower the incremental cost of data integration. Just as quick wins were important in establishing the business case for an ICC, they are also important as the ICC is implemented. As such, the 30/60/90/120+ day plan below outlines an example ICC rollout plan with incremental deliverables and opportunities for quick wins.
Initial Project Selection
Planning an ICC involves the considered selection of an initial project to showcase ICC function and success. An appropriate project is needed that will avoid risk to the ICC approach itself. A high-risk, complex project with high visibility that fails as the first project (even if not due to the ICC implementation) could damage the standing of the ICC within the organization. Therefore, high-risk and overly complex projects should be avoided. The initial project should also be representative of other projects planned for the next year. If this is the case, there is likely to be more scope for sharing and reuse between the projects, and therefore more benefit will be derived from the ICC. The initial project should be carefully chosen and ideally fit the following criteria:
- A pilot project
- Moderately challenging
- Representative of other projects to be undertaken in the next year
When choosing an initial project for the ICC, consider the following:
- Project Scope and Features: Favor projects that demonstrate the values of the ICC and offer reusable opportunities.
- Budget Issues: Central funding to encourage acceptance; initial setup costs (the project sponsor may not be able to provide all the necessary budget for the initial project).
- Obtaining Resources: Staffing resources from within the organization may need to be allocated by other authorities.
Note: The budget allocated will determine the scope and model of the ICC. Make sure that the scope and budget are sized appropriately. If they don't match, plan to utilize some of the Financial Management best practices in order to obtain adequate funding.
TIP A period of evangelizing within the organization may be required to garner support for adoption of the ICC. As projects continue development and implementation prior to the formal adoption of an ICC, there may be an opportunity to develop best practices in grass-roots fashion that become part of a formal ICC established at a later date.
Choose initial projects that have:
- Well-established requirements
- Well-defined business benefit
- Reasonable complexity and delivery expectations
- A lower level of risk
- Good visibility, but a low level of political intricacies
Resources
The implementation of the ICC requires resources. The ICC has to provide the starting point for cross-project reuse of sharable assets such as technology, practices, shared objects and processes, and it will need budget and manpower. These resources will need to be obtained in the form of central budget and allocation, with or without a chargeback model. Alternatively, they can be paid for from the budgets of the projects that will be undertaken with the ICC. In the event of budget and resource issues, problems can be circumvented with a grass-roots best practice configuration. Ultimately, the ICC will need resources and management support if it is going to provide more than best practices learned from previous projects. The resources required fall into two broad categories:
- Resources to implement the ICC infrastructure that will drive change and improvements
- Provision of development and production support
Planning: Shared Services and Central Services ICC

Establishing a 120-Day Implementation Plan
This Best Practice suggests planning the ICC implementation with milestones at 30, 60, 90 and 120 days. Certain implementations of an ICC where there is a central offering will have a plan at 120 days or longer for additional infrastructure and shared services. The purpose is to use the four iterations to show milestones at each phase or iteration. The plan is designed and treated as any other project in the IT department, with a defined start time and end time in which the ICC is designed, developed and launched. 120 days (or 4 months) was chosen to help scope the process for each organization's culture and to help the ICC Director properly set expectations with management in showing value after this period of time. Larger organizations might want to develop an implementation plan around a less intrusive model such as Best Practices or Technology Standards where there is still payback, but where less organizational change and alignment is required. For the purposes of this Best Practice, the 120-day plan is based upon the Shared or Centralized Services model and ensures that incremental deliverables are accomplished with respect to the implementation of the ICC. This timeline also provides opportunities to enjoy successes at each step of the way and to communicate those successes to the ICC executive sponsor as each milestone is achieved. It is also important to note that since the Central Services ICC model is lengthy (6+ months) to fully implement, it might be worthwhile to repeat another 4-month iteration by first implementing a Best Practices model, then implementing a Shared or Central Services model in another 120-day plan.
Some of the pre-project activities include:
- Business Case for ICC
- Hardware/Software ordered
The table below lists the milestones that can be expected to show successful progress of a Shared Services ICC project once the pre-project activities have been completed.

120-Day ICC Start-up Project Milestones

People
- Day 30: ICC Director named; Resource plan approved; Sponsors and stakeholders identified
- Day 60: Core team members on board; Key partnerships with internal governance groups formalized; Stakeholder communication plan documented
- Day 90: Subcontractor and 3rd party agreements signed off; Initial team training completed; Enterprise training plan documented
- Day 120: Staff competency evaluations and development plans documented

Process & Policy
- Day 30: ICC charter approved; Early adopters and project opportunities identified
- Day 60: ICC services defined; Core integration standards or principles documented
- Day 90: ICC service engagement and delivery process defined; Internal communications and marketing plan documented; Chargeback model approved
- Day 120: Services are discoverable and orderable by internal customers; Regular metrics reporting in place; Ongoing metadata management process in place

Technology
- Day 30: Integration platform configured
- Day 60: ICC tools selected; Service Level Agreement template established
- Day 90: Operating procedures documented (i.e., availability management, failover, disaster recovery, backup, configuration management, etc.)
- Day 120: Applications connected and using the integration platform; SLA agreements signed off
Planning: Activities, Deliverables and Milestones for the 120 Day Plan Below is the 120 day plan broken into 30/60/90/120 day increments that are further categorized by, Activities, Deliverables and Milestones. This makes it easier for those engaged in the initiation of an ICC to see what they should be focusing on and what the end results should be.
30 Day Scorecard The following plan outlines the people, process, and technology steps that should occur during the first 30 days of the ICC rollout:
Activities Name Director to ICC organization Solicit Agreement for ICC approach Identify Executive Sponsor and key stakeholders Identify Required resource roles and skills: Identify, assemble, and budget for the human resources necessary to support the ICC rollout; This maybe spread over several roles and individuals Define Project Charter for ICC ICC launch should be treated as a project Refine Business Case Identify, estimate, and budget for the necessary technical resources (e.g., hardware, software). Note: To encourage projects to utilize the ICC model, it can often be effective to provide hardware and software resources without any internal chargeback for the first year of the ICC conception. Alternatively, the hardware
INFORMATICA CONFIDENTIAL
BEST PRACTICES
495 of 818
and software costs can be funded by the projects that are likely to leverage the ICC. Identify Early Adopter Projects and Plans that can be supported by the ICC Install and implement infrastructure for the ICC (hardware and software). Implement a technical infrastructure for the ICC. This includes implementing the hardware and software required to support the initial five projects (or projects within the scope of the first year of the ICC) in both a development and production capacity. Typically, this technical infrastructure is not the end-goal configuration, but it should include a hardware and software configuration that can easily meld into the end-goal configuration. The hardware and software requirements of the short-term technical infrastructure are generally limited to the components required for the projects that will leverage the infrastructure during the first year. Future scalability is a consideration here, so consider that new servers could be added to a grid later.
Deliverables
- Resource Plan
- ICC Project Charter
- List of prospective Early Adopter Projects
- Ballpark ICC Budget Estimate
- Technical Infrastructure Install
Milestones
- ICC Executive Sponsor approval of initial Project Charter and refined Business Case
- ICC Director on project full time
- List of 1-3 Early Adopter Projects
- Technical infrastructure completed and installed
60 Day Scorecard
As the ICC successfully engages on its initial projects, the following activities should occur in the 30- to 60-day period.
Activities
- Establish ICC support processes (key partnerships with groups such as the PMO, Systems Management, Database, Enterprise Architecture, etc.).
- Develop a stakeholder communication plan.
- Allocate core team resources to support new and forthcoming projects on the ICC platform.
- Develop and adopt core development standards, using sources like Velocity to see best practices in use.
- Define ICC services.
- Evaluate and select ICC tools.
- Develop a Service Level Agreement template.
Deliverables
- Best practice and standards documents for:
  - Error handling processes
  - Naming standards
  - Slowly changing data management
  - Deployment
  - Performance tuning
  - Detailed design documents
  - And other Velocity Best Practice deliverables
- List of tools added to the ICC environment, such as:
  - PowerCenter Metadata Reporter
  - PowerCenter Team Based Development Model
  - Metadata Manager
  - Data Quality, Profiling and Cleansing
  - Various PowerExchange connectivity access products
  - Other tools appropriate for ICC management
- Agreements formalizing key partnerships with internal governance groups
- Stakeholder communication plan
- ICC Service Offerings
- Service Level Agreement template
Milestones
- Service offerings introduced to support best practices (see Engagement Services Management for a format and outline of marketable services)
- Best Practice and Standards documents available
- Core team members assigned roles and responsibilities
- Key partnerships with internal governance groups formally established
- Regular stakeholder and sponsor meetings (weekly)
90 Day Scorecard

Activities
- Select key contractors and 3rd parties that will assist in ICC operations.
- Deliver the initial training class to early adopter project teams.
- Develop the enterprise training plan.
- Finalize ICC delivery processes and communicate them.
- Define rules of engagement when using ICC delivery processes.
- Develop the chargeback model for services.
- Develop the internal communications and marketing plan.
- Implement Disaster Recovery and High Availability as features of the ICC. As projects with disaster recovery/failover needs join the ICC, the appropriate implementation of DR/failover should be completed for the ICC infrastructure.
- Approve operational service level agreements (SLAs) between the ICC and hosted projects.
Deliverables
- Operational procedure manuals
- Operational service level agreements (SLAs) with projects leveraging the ICC services
- Published rules of engagement for utilizing ICC services
- Published internal chargeback model for ICC client organizations (a simple allocation sketch follows this list)
- Signed subcontractor and 3rd party agreements
- Internal communications and marketing plan
- Disaster Recovery / High Availability features in place
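Chargeback models vary widely by organization; the sketch below shows one simple direct-cost allocation in which a monthly shared-platform cost is split across client projects in proportion to a usage metric. The cost figure, project names and usage numbers are invented, and this is only one example of the kind of model the deliverable above refers to.

// Purely illustrative chargeback calculation: allocate a monthly shared-platform
// cost to client projects in proportion to their usage (e.g., CPU-hours or rows
// processed). Real chargeback models are defined by the ICC and finance teams.
import java.util.Map;

public class ChargebackSketch {
    public static void main(String[] args) {
        double monthlyPlatformCost = 60_000.00;   // hypothetical shared ICC cost

        // Hypothetical usage metric per client project for the month.
        Map<String, Double> usageByProject = Map.of(
                "Customer Data Hub", 450.0,
                "Finance Reporting", 300.0,
                "Supply Chain Feed", 250.0);

        double totalUsage = usageByProject.values().stream()
                .mapToDouble(Double::doubleValue).sum();

        usageByProject.forEach((project, usage) -> {
            double charge = monthlyPlatformCost * (usage / totalUsage);
            System.out.printf("%-18s usage=%6.1f  charge=$%,10.2f%n", project, usage, charge);
        });
    }
}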
Milestones
- Disaster recovery and high availability services in place and referenced in SLAs
- Robust list of ICC services defined and available as consumable items for client organizations
- Chargeback models approved
- Service level agreements and rules of engagement in place for ICC services
- Initial early adopter training class completed
- Published training schedule for the enterprise available
120 Day Scorecard

Activities
- Make the operational environment ready for project on-boarding.
- Develop change control procedures.
- Report metrics to show the performance of ICC development.
- On-board the initial projects to the ICC.
- Develop the metadata strategy and put the process in place.
- Publish a list of additional software components that can be leveraged by ICC customers/projects. Examples include:
  - High Availability
  - PowerCenter Enterprise Grid Option
  - Unstructured Data Option
- Forecast a longer-term technical infrastructure, including both hardware and software. This technical infrastructure can generally provide cost-effective options for horizontal scaling, such as leveraging Informatica's Enterprise Grid capabilities with a relatively inexpensive hardware platform such as Linux or Windows.
- Further refine the chargeback models.
Deliverables Competency evaluations of ICC staff and key project team members ICC Helpdesk for production support on a 24/7 basis for urgent issues. Operational environment in place Change control process documented Metadata strategy published SLA agreements signed off
Milestones ICC established as the enterprise standard for all data integration project needs. ICC service catalogue available for services such as data architecture, data modeling, development, testing and business analysis. ICC operations support established Applications connected and using shared integration platform
Last updated: 06-Sep-08 11:42
Portfolio Rationalization Challenge Once the decision has been made to structure corporate integration activities into an Integration Competency Center (ICC), an assessment of all software tools and their usage needs to be conducted. An issue that may be uncovered during this review is that an organization is using multiple tools, processes, servers and human resources to perform similar tasks. An objective of the ICC is to consolidate, centralize and standardize (allowing for re-use of objects and the simplification of development and operations). To achieve this goal, a decision must be made as to which tool or set of tools and processes will be used and which others will be consolidated and/or retired. A further consolidation of servers and the reallocation of personnel may occur as part of the overall tool consolidation effort. This Best Practice covers the server and software tool consolidation effort and in some cases refers to all of these items as tools or tool sets. Some of the challenges faced in rationalizing the choice for a portfolio of integration tools, processes and servers include: Selecting which tools and processes are a best fit for the organization Retiring or managing the tools not selected Migrating processing to the new tools and servers Providing developer and end-user training
Description What is Portfolio Rationalization? Portfolio Rationalization is an ongoing process that allows the ICC to optimize the use of technologies and specific features. Portfolio Rationalization can be used to eliminate redundant technology or to identify new technology gaps. The process of evaluating these tools, making the decisions on tool status, acting to migrate processing to the selected/standard tool set, managing the remaining and to-be-retired tools and managing training on the selected tools is Portfolio Rationalization.
Business Drivers The business case for an ICC has already enumerated the value of centralizing, standardizing and consolidating integration activities. Similarly, consolidating the tool set adds to this value proposition by providing the following: A potential reduction in software license costs via bulk license agreements or a reduced number of total licenses Reduced overhead for product training on a condensed number of software tools Reduced complexity when using a limited number of software tools and by standardizing on a common version of the software Enablement of resources to more readily provide cross-over coverage since standards, methods, and tools are common across teams A potential reduction in hardware maintenance costs with a reduced number of server footprints
Portfolio Rationalization Methodology A tool would not normally be brought in-house unless there was a perception that it met a specific need. Business needs must first be clarified so that they can be addressed. Based on the drivers for portfolio rationalization the organization should initiate a process to evaluate the tools that are already part of the portfolio as well as those that are candidates for future use. The process of inventorying the systems and tools along with the selection process to define the target platform are part of the evaluation process. The same process that is followed to resolve technology gaps is also used to select the standard tool sets for existing usage in order to consolidate the portfolio. After the standard tool sets have been defined it is time to move towards using them. The organization should select the appropriate rationalization strategy and then follow it to migrate and consolidate tool set usage. Unused components can be retired or re-purposed as they are no longer needed for the current business requirement. As depicted in the Portfolio Rationalization Methodology diagram below, this process begins anew each time new requirements surface or as market advances drive a desire to utilize newer technology.
1. Clarify Business Needs: Document the business needs or problems that drive adding new technology to the portfolio.
2. Inventory Systems and Tools: Evaluate both in-house tools and tools new to the organization. Determine the best fit as far as features, compatibility and future potential to meet the business needs. Follow the same process of evaluation to resolve tool gaps.
3. Determine classifications of the tool sets under consideration for new and existing tools, and tools to be replaced. Select the tool sets to use as the standard.
4. Determine how the organization will begin moving towards the standard. Will the tools be used for new projects only or will the organization migrate and consolidate existing applications using other tools?
5. Begin a migration and consolidation project if this is the selected approach.
6. Retire or reallocate the architecture components no longer used, remove application processing from servers, terminate contracts for unused software.
7. Repeat the process as new business needs arise to look for best of breed components for the portfolio.
Evaluation Process In order to appropriately evaluate tools, the organization needs to understand the following: Which tools are currently used? How are they used? Who are the stakeholders? Were existing tools selected for a specific need? Will the need be met by the proposed replacement tools? What are the business drivers for using a particular class of tool? Which tools might be selected (new or existing)? The selection criteria include: Business Requirements, Features, Functionality, Ease of use, Industry Best Practices, Corporate direction and standards, Compatibility with other tools and Total Cost of Ownership. When choosing the tools and processes to implement for the ICC, evaluate current tools, processes and servers as well as best of breed new tools. The evaluation process will result in classifying these items as follows (a simple way to track these classifications is sketched after this list):
Evaluation – technology is under evaluation for use
Selected – technology has been selected for use, but not implemented
Standard – technology is standard in the organization for use
Deprecated – technology is no longer standard in the organization for use, but is supported
Unsupported – technology is neither standard nor supported in the organization for use
Retired – technology has a defined exit strategy
Projects using the retired tools may fall into one of two categories: systems that will be converted and replaced with the new toolset and systems that will be sunset and no longer used.
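The classification scheme above lends itself to a simple portfolio inventory so that the status of every tool is tracked consistently. The following Python sketch is illustrative only; the class names, fields and example tools are assumptions for discussion, not part of any Informatica product or the Velocity methodology.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ToolStatus(Enum):
    # The six classifications from the evaluation process above
    EVALUATION = "under evaluation for use"
    SELECTED = "selected for use, but not implemented"
    STANDARD = "standard in the organization for use"
    DEPRECATED = "no longer standard, but supported"
    UNSUPPORTED = "neither standard nor supported"
    RETIRED = "has a defined exit strategy"

@dataclass
class PortfolioTool:
    name: str
    capability: str                  # e.g., "ETL", "data quality", "scheduling"
    status: ToolStatus
    owner: str                       # stakeholder group accountable for the tool
    exit_date: Optional[str] = None  # only meaningful for RETIRED tools

# Hypothetical inventory entries
inventory = [
    PortfolioTool("LegacyLoaderX", "ETL", ToolStatus.RETIRED, "Finance IT", "2009-12-31"),
    PortfolioTool("PowerCenter", "ETL", ToolStatus.STANDARD, "ICC"),
]

# List every tool that still needs a conversion or sunset decision
for tool in inventory:
    if tool.status in (ToolStatus.DEPRECATED, ToolStatus.RETIRED):
        print(f"{tool.name}: plan conversion or sunset ({tool.status.value})")
```

Keeping the inventory in a structured form like this makes the later migration and retirement decisions easier to report on and audit.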
Gap Analysis When technology gaps are recognized within the organization, begin an evaluation process for tools to fill the gap. Technology that can fill the gaps might already exist within the organization, or the resulting evaluation might require new software tools be added to the portfolio. This situation is the simplest to manage with the ICC because it does not cause disruptions to existing systems in order to consolidate or migrate. If the ICC is mature and the tools already exist as standards in the environment, then use of these tools would merely involve leveraging new functionality or capabilities. Possibly, licenses could be added to an existing contract. Perhaps the organization could leverage existing hardware, upgrade existing hardware, or purchase new hardware. Capacity planning on existing systems should leave room to add new applications and teams to the supported environment without reaching a capacity threshold. If the technology being added is new to the organization, then the infrastructure to support it within the ICC needs to be developed. Servers, processes, training, licensing, and standards will all need to be developed and managed within the ICC framework. Starting from this fresh approach, new development can follow the standards and patterns put into place as the technology is introduced to developers and users.
Rationalization Strategies At this stage, tools have been evaluated, classified, and selected. Now it is time to decide how to move forward. The organization could take one of two approaches: Rationalize by attrition – Establish policies and procedures to contain any legacy components and simply build new (or modified) solutions on the target architecture. This could also be called the self-funding strategy. Conduct a Rationalization project – This is a pro-active plan to shut down and to migrate legacy components to the target architecture. This strategy generally requires a business case. The attrition approach is less disruptive to the organization because human resources remain focused on other priorities, new user training is handled at a slower pace and involves fewer people at each iteration, and outages on existing systems do not occur. However, this method moves more slowly and does not realize the benefit of reduced license costs, reduction of server overhead and other benefits of consolidating the portfolio. In fact, adding the new, shared components for the desired target systems may increase these costs in the mid-term until a significant portion of the legacy systems have been migrated and consolidated and can be retired.
Strategy | Description | Benefit/Risk
Rationalization By Attrition | Only new or modified tools move to the new architecture and tools | Slow; does not disrupt status quo processing; continued infrastructure costs for maintaining multiple systems
Rationalization Project | Develop a plan to move all similar processing to the new environment and tool set | Standardized tool sets across the enterprise to leverage team capabilities and reduce complexity; reduced infrastructure costs as hardware is consolidated and old software is not renewed; enhanced object sharing capability; project team members not available for other work; outages during the migration and consolidation efforts may disrupt service
Migration and Consolidation If the organization elects to conduct the Rationalization project to consolidate and migrate the portfolio of tools, then it will be determining how best to migrate to the new tools and processes. The preceding discussion encompassed the situation where multiple sets of tools might have been employed to provide the same functionality. It is also possible that there has been a standard tool set in use for some time; however, usage has evolved without the common ties that an ICC brings to the table regarding naming conventions, standards, re-usable and shareable objects and methods. Independent development teams may
have incorporated their own methodologies and practices without regard to the decisions made by other similar teams. Multiple versions of the software tools may be in active use. Bringing these disparate methodologies and systems together in a fashion that is logical and is least disruptive to the teams involved can be a challenge even when the same tool set has been utilized. End-user training, developer training, participation from the affected teams to test and other logistics must be kept in mind. There are various scenarios for migration and consolidation: Case 1: Hardware Consolidation utilizing a single version of the existing software tool set Case 2: Hardware Consolidation utilizing multiple versions of the existing software tool Case 3: Hardware Consolidation and Consolidation of multiple software tools to the Standard tool set
Common Considerations Even in the less complex case where consolidation is performed for the existing software tools (all on a common version) to a common hardware platform, there are considerations regarding moving multiple teams’ work to a shared infrastructure. Some of these include:
Security on the server-level components
Security of components within the software tool
Capacity planning
Disruptions caused by one team affecting another team
Managing SLAs on a shared environment
Minimizing downtime or outages during the migration and consolidation process
Depending on the application areas, security may be more or less important. Most organizations typically handle Financial and Human Resources data on a need-to-know basis. Development teams who have access to this data must be sure to keep the data isolated from other teams within the organization. Security procedures within the software tools and on the server must permit this separation by job function and allow the development objects and any data to be visible only to those authorized to view it. Regulatory compliance may influence the need for security even further. Since the level of security required may not always be apparent before implementing the shared infrastructure, it is best to put into place practices that enable the level of security for whoever requires it. When consolidating processing that resides on many servers onto a single server or set of servers, the organization must consider overall capacity for processing. There must be enough capability to comfortably run all processing allocated to the server with some room for growth within each application and growth caused by adding new teams to the shared infrastructure. The operations, support and development teams need to work together to determine a set of metrics that can be used to evaluate performance and remaining capacity so that adjustments can be made ahead of critical need. Monitoring of the metrics is performed by the organization on a regular basis to identify when capacity limits are being approached. Although sharing the infrastructure has a number of benefits for standardization, code sharing, reusability and more, a downside to sharing the infrastructure is that processing demands, schedules and out-of-control processes in one application area can impact unrelated applications. One way to reduce this effect is to limit resource allocations such as percentage of CPU, job duration and memory allocation to keep one application from locking up the entire server (a simple quota check along these lines is sketched below). Even with such measures, there may be times when an application either needs the resources as a priority for regular processing or takes them over or causes a crash based on circumstances not anticipated or fully tested. Even the best testing and development methods may still allow a window for unexpected results with negative effects. Ideally, developers catch the issues and resolve them in Development so that Production systems are not disrupted in this way, although even in a Development environment these disruptions are unwelcome. When routine production requirements demand a heavy use of resources by one application (at the expense of other processing in the shared environment) it must be accounted for. This processing demand conflict can be resolved in a number of ways:
Negotiate expectations and Service Level Agreements (SLAs) that represent the current situation
Reschedule jobs to avoid demand conflict
Move processing to another node
Add capacity
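As a rough illustration of the resource-limiting idea described above, the sketch below checks a running job against per-application quotas for CPU share, memory and duration. The application names, quota values and the way runtime statistics are obtained are assumptions; in practice these figures would come from the scheduler or operating system monitoring tools, and enforcement would be handled there as well.

```python
# Illustrative sketch: per-application resource quotas on a shared server.
QUOTAS = {
    # application: (max CPU share %, max memory GB, max job duration minutes)
    "finance_dw": (40, 16, 120),
    "hr_loads":   (20, 8, 60),
}

def check_job(app, cpu_pct, mem_gb, duration_min):
    """Return a list of quota breaches for one running job."""
    max_cpu, max_mem, max_dur = QUOTAS[app]
    breaches = []
    if cpu_pct > max_cpu:
        breaches.append(f"CPU {cpu_pct}% exceeds {max_cpu}%")
    if mem_gb > max_mem:
        breaches.append(f"memory {mem_gb} GB exceeds {max_mem} GB")
    if duration_min > max_dur:
        breaches.append(f"duration {duration_min} min exceeds {max_dur} min")
    return breaches

# Example: a runaway finance job that should be flagged before it monopolizes the server
print(check_job("finance_dw", cpu_pct=75, mem_gb=10, duration_min=45))
```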
Understanding the processing requirements and expected duration of each job (along with its potential growth as data volumes increase) is important for effectively scheduling the production jobs. Upstream data availability drives much of the job schedule, along with downstream demand from other systems and end-users. Anticipate when there will be processing demand collisions and work to avoid them or to set expectations for a reduced level of service. The organization may find that as it migrates to a new architecture with server and processing capabilities that are anticipated to be more powerful than the legacy architecture, that upstream processing is also faster and able to deliver sooner. This will allow the production schedule to be reshuffled to help avoid processing demand collisions. If despite this, there continues to be processing demand issues, resolve them by adding capacity or moving processing to another node in a cluster or grid. The organization will want to monitor processing of not only the applications in the shared portfolio, but also upstream and downstream processing. Some of the items to review are duration, CPU utilization, memory utilization and data volumes. Track these over time to see trends in job growth in order to stay ahead of the capacity issues and to continue to meet SLAs for data delivery as applications grow. Production cutover to the new environment typically involves at least a minimal amount of downtime. For some systems this must be kept to a bare minimum; for any system, plan the migration process to avoid downtime. One way of handling this is to do everything possible to set up the new environment ahead of time by setting up the server, creating directory structures and user IDs, assigning permissions, copying static files and installing software ahead of the cutover day. Validate basic connectivity and functionality before moving user processing to the new system. When satisfied that there is a stable environment, perform the steps to migrate or upgrade the tools, leaving the old system in place. Ideally, only a brief outage to perform a copy of the current application or data will be needed. As soon as new naming and coding standards for the new environment are determined, have developers begin to modify the current system to meet the standards, testing them and moving them through the development lifecycle to production. Although this does require an effort from the development team, it should be an unobtrusive set of changes not affecting end-users, and it will prepare the application for the new environment, leaving less to modify or test for the cutover. After replicating the old environment to the new one, the migration team is able to upgrade components to the new version of the software, make any required modifications that could not be made earlier and then run the system in parallel long enough for users to test and have the right level of comfort with the new system. Depending on the type of tool, the old system might then be shut off and the new one becomes the live system (or there may first be a need to refresh components that changed while conducting tests). If a refresh is required along with a second pass through the move to production, then there needs to be only minimal validation testing, since the team performed the system validation during the first cycle. 
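One way to track these metrics over time, as suggested above, is to fit a simple trend to recent job statistics and estimate how long before an agreed ceiling (for example, the batch window in an SLA) is reached. The sample durations and the 240-minute ceiling in the Python sketch below are made-up values for illustration; real figures would come from the operational metadata already being collected.

```python
# Illustrative sketch: fit a linear trend to weekly job duration samples and
# estimate when the run will hit an agreed ceiling (e.g., the SLA batch window).

weekly_durations = [55, 58, 61, 66, 70, 74, 79]   # minutes, one sample per week

def weeks_until_ceiling(samples, ceiling):
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    # Least-squares slope: growth in minutes per week
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None  # no growth trend, ceiling not threatened
    return (ceiling - samples[-1]) / slope

print(f"Weeks until ceiling: {weeks_until_ceiling(weekly_durations, 240):.0f}")
```

Reviewing a projection like this at regular operations meetings gives the teams time to reschedule, move processing to another node or add capacity before an SLA is actually at risk.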
Typical items to test or check are:
Naming standards have been met
Coding standards have been met
Interfaces and Connectivity work
No hard coded server or path names utilized (or all have been modified)
Upgraded Software is downward compatible, losing no functionality and producing the expected results
Execution results are accurate (consistent with the old system; a simple automated check is sketched after this list)
Elapsed execution time meets expectations
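The results-accuracy check above can be partially automated. The sketch below compares row counts for a handful of tables between the old and new environments using any DB-API 2.0 compliant driver; the driver, connection details and table names are placeholders, and a real validation would typically also compare checksums or sampled data, not just counts.

```python
# Illustrative sketch: compare row counts between the old and new environments
# for a list of tables, and flag mismatches for follow-up.

TABLES = ["customer", "orders", "order_lines"]   # placeholder table names

def row_count(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")   # table names come from a trusted list
    return cur.fetchone()[0]

def compare_environments(old_conn, new_conn, tables=TABLES):
    mismatches = {}
    for table in tables:
        old_n, new_n = row_count(old_conn, table), row_count(new_conn, table)
        if old_n != new_n:
            mismatches[table] = (old_n, new_n)
    return mismatches

# Usage (the connections themselves depend on the database in use):
#   import some_dbapi_driver
#   old = some_dbapi_driver.connect(...legacy server...)
#   new = some_dbapi_driver.connect(...consolidated server...)
#   print(compare_environments(old, new))
```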
Case 1 – Migration only The simplest migration case is one that involves moving applications to the same version of software on new hardware. In this case, operating system and tool-specific utilities are used to combine processing in a single environment. As described above, users will perform tests to make sure that all new interfaces work and that processing is occurring as usual on the new system. As long as items such as network connectivity, firewall access and node names are not hard coded (or have been modified as part of the migration and consolidation effort) then this scenario creates minimal impact to the organization and can be managed quickly with only brief outage windows for cutover to the new architecture. However, there may still be a need to resolve standards issues such as naming conventions for objects in the new architecture and some issues related to security that might change local path names used during processing in a shared environment. These changes can be made in the old environment and then moved to the new environment after testing or the non-standard methods might be allowed to exist in the shared environment for a short period. Be cautious if selecting the latter case. Be certain that the project plans allow time to bring the new application up to standard; otherwise there might be a single server with a complicated set of methods, none of which conforms to the new standards.
There must be commitment that the changes needed to meet the standards will be made; otherwise, do not allow the new applications to move to the new environment. If possible, attempt to have development teams make appropriate changes prior to the migration to the new environment. Having the teams make the changes and test them in the old environment will minimize the testing needed to validate the new environment. If performing the modifications and testing is required as an entry criterion for migration, then some of the business cases that make consolidation attractive will act as incentives to comply and get off the old systems.
Case 2 – Migration and Upgrade In addition to the common considerations and the efforts required in Case 1, if usage of the given standard tool in the organization is on multiple versions, then first upgrade to the desired version and then migrate to the new environment. In order to do this an intermediate stepping stone system might be needed to allow for going through the upgrade steps without affecting the final target. This is especially true if the upgrade path from an older version requires an upgrade to an intermediate version below the final level desired. Testing is important when upgrading to assure that the newer version of software did not cause a negative effect to applications. As a result, more effort is required from the application development teams and end users to assure that everything is still working and to make corrections if it is not. The organization will need to develop and execute training plans that allow users to gain familiarity with the new version of the software. Use of an intermediate architecture in the scenario above may be useful as a platform for making standards based changes to applications, and testing them. If desired, run in parallel with the existing systems for some time while comparing results from execution to assure that the new version is working as expected.
Case 3 – Multiple software tools When multiple software tools are performing similar functionality and the organization has decided to migrate the applications using the non-standard tools to the new environment, then the project to migrate and consolidate becomes more complex. Some of the questions to ask include:
How complex is the development effort to move processing to the new tool?
Are there tools or methods to assist in making the changes?
What is the best approach to move the application, particularly any customizations or business rules?
Are there tools or methods to assist in comparing final results during parallel testing of the new system against the old one?
There may be tools that can help to move the logic and business rules within one tool to another tool. For example, the ability to export or import XML code may assist in making the necessary changes, with customizations then made to finalize the code conversion (a simple example of scanning an exported definition for naming-standard compliance is sketched below). As another example, ideally the organization is using a data-modeling tool to maintain the corporate data model. If so, then there may be the opportunity to generate the physical model in another relational database structure with its data types and syntax. In this database example, the data will need to move as well as the new database structures. Informatica PowerCenter along with the Velocity Data Migration Methodology can assist with that effort. While tools may help, there is also going to be a development component to convert (or in some cases rebuild) the application into the selected tool. The level of effort required should weigh in the decision regarding whether the tool being replaced should be retired, deprecated or unsupported in the environment instead of being converted. If the convert decision is made, then it is an effort similar to any development project. Ideally there are already specifications for the old system and documentation that will ease the effort through the project development lifecycle. There are also Subject Matter Experts and committed end-users that know the application and can assist with requirements and testing. Take stock and decide how to handle enhancement requests during this process. One approach is to take the opportunity during the “re-write” to add these items. Realize that by doing so, comparing the end results may be more difficult, since changes have been intentionally made. For this reason, and to expedite the conversion, migration and consolidation, avoid making any changes other than those to fix errors that were noted during the conversion or for previously unnoticed production errors. For a time after the development effort is completed, run both systems in parallel to assure all parties that functionality has not been lost and processing is correct. When satisfied that this is the case, cut over from the old system to the new one in the new environment.
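As a small example of the kind of tool-assisted change mentioned above, the Python sketch below scans an exported XML definition for object names that do not follow a naming convention. The element tag, attribute name and the "m_" prefix rule are assumptions for illustration only; a real export (from PowerCenter or any other tool) has its own schema that would need to be inspected first.

```python
# Illustrative sketch: flag object names in an exported XML file that do not
# meet an assumed naming standard (here, an "m_" prefix for mappings).

import re
import xml.etree.ElementTree as ET

NAME_RULE = re.compile(r"^m_[a-z0-9_]+$")   # assumed standard: m_<lowercase_name>

def non_compliant_names(xml_path, element_tag="MAPPING", name_attr="NAME"):
    tree = ET.parse(xml_path)
    bad = []
    for element in tree.iter(element_tag):
        name = element.get(name_attr, "")
        if not NAME_RULE.match(name):
            bad.append(name)
    return bad

# Example usage against a hypothetical export file:
#   print(non_compliant_names("legacy_folder_export.xml"))
```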
Summary In all three cases above, the benefits of standardizing and consolidating to an ICC begin to be realized each time an application can be moved and old software and servers can be retired. Some of the project effort will require a business case to justify the time spent (especially in Cases 2 and 3) but the benefits will be seen along the way as a pay-off in reduced licensing and server footprint costs, increased system performance and improved service from the central ICC team.
As new products become available the organization may decide to move to new standards and to reclassify the existing elements in the portfolio. Portfolio Rationalization is an on-going process to evaluate and implement the best tools for the organization. As technology changes and as the organization’s needs change, continually review the approaches to software, hardware and processing in the portfolio and make the necessary adjustments.
Last updated: 06-Sep-08 01:16
Proposal Writing Challenge Writing an effective proposal is about persuading an executive audience. Your primary focus should not be to provide an exhaustive list of details or to try to answer all of the questions that may come up as it is read. This best practice provides guidelines on how to write a proposal that reaches your audience with: Basic concepts of structure and presentation Principles for graphics Principles for text This best practice, which picks up the thread from Business Case Development, is less concerned with researching options and content than with delivering a proposal that is well-organized and persuasive. Remember, “A good proposal is not one that is written, but one that is read.”
Description Basic Principles “If I had more time, this letter would have been shorter.”–Voltaire, Goethe, Twain. . . Stated by great minds, spoken across times and languages, the above quote remains true today. It takes more effort to be concise than it does to include every detail. It may seem counter-intuitive, but providing all of the possible detail is futile if your document isn’t read. The key is to determine what the most important messages are and to focus your efforts on those.
1. Visualize first: Before anything else, think about your end target. Build your graphics to communicate that message, and then add text to persuade. (More on the role of graphics versus text later.)
2. Sell benefits – not just features: Don’t spend all of your time explaining “what.” If you can intrigue your audience and convince them “why,” you can explain exactly “how.” First, you must persuade them that they want to know more.
3. Every page should answer “Why?” and “So What?”: A lot of documents do a great job with “what,” especially those written by substance-oriented people, but all that effort can be wasted if you don’t get your readers to care and understand how what you are suggesting is a benefit. Always answer “why?” and “so what?”
4. Entire core document (excluding attachments) should be viewable as an Executive Summary:
a. Consistent “look and feel”: Keep it crisp, clean and united as one document. Readers should be able to absorb the entire message in a scan that takes only a few minutes.
b. Divided into modules: No matter what the cost and scope, the core elements of any proposal are essentially the same. Divide your proposal into logical modules that make it easy to navigate and absorb.
c. Two pages for each module: Each section of the document should be no more than two pages, with half of the length used for graphics.
d. Persuasive text and descriptive graphics – not the other way around: All of your substance should be represented in the graphics. The text should be reserved primarily for the “why.”
5. More emphasis on format than content: Packaging has a huge impact on readability and in this context getting the document read is the primary objective. Take the time to focus on the document’s format without compromising your content. Format can have a huge impact on acceptance, readability and communicating points accurately.
Recommended Structure Making due allowance for variances in complexity and scale, it is generally recommended that each proposal have seven sections, each of which is two pages long; a document which is only fourteen pages long is more likely to be read in its entirety. The following sections should be included:
Business Opportunity
Alternatives Considered
Proposed Approach
Implementation Plan
Deliverables
Resource Plan
Financial Summary
Each section should include one page of text and a one-page graphic. If required, additional appendices can be included for detailed specifications.
Guiding Principles for Graphics Graphics are no less important than text and, in some cases, are more important. A quick scan of the document’s graphics should convey your end message and intrigue your reader to look more closely. Edward Tufte’s landmark book Visual Explanations provides some deep insights for advanced practices, but if you are just starting out, consider the following guidelines:
1. Descriptive in nature – should tell the story: Your reader should be able to look at the graphics and know what the document is about without reading any text. The graphics should illustrate the “what” of your proposal in a clear and concise manner.
2. Ideally, achieve a 10-second and 10-minute impact: Your graphic should be clear enough that the core message is apparent within ten seconds, yet complex enough that ten minutes later the reader will still be extracting new content.
3. Don’t turn sideways – fold-out for big graphs: Remember, take the time to focus on the format; put yourself in your reader’s shoes and make it easy for them.
4. Good graphics are hard to create: And they take time. Practice, practice, practice. The creation of excellent graphics often takes multiple iterations, white board brainstorming sessions, drawing ideas out and getting peer reviews. Keep your audience in mind. It is not always clear what is best for each audience.
5. Should be complete before text: Your text should support your graphics, persuading that the ideas illustrated are good ones. Making clear connections and eliminating redundancy is easier when the graphics come first. This also helps in evaluating if the graphics can stand alone in conveying your end message.
6. Will be “read”: “Reading” images is often faster than reading text, an important consideration when appealing to an executive audience.
7. Are worth 1,000 words: Don’t spend another thousand words rehashing what your graphics have already said. Make sure your graphics send a clear, eloquent message. If your graphic can stand alone it can be passed on. Make sure it stands up out of context. The following graphic from Edward Tufte’s work is a comprehensive example of a picture worth 1,000 words:
Guiding Principles for Text After taking the time to create graphics that communicate your end message, don’t waste your text on redundancy. Instead, use your text to persuade. A good exercise to help focus on communicating what is important: Try to cut out half the words without losing content. If you put it all in, nothing gets read. 1. Should communicate only two things: a. Benefits: Your graphics should convey what you are proposing, but they won’t explain what is in it for the executive. Use your text to do this. b. Advantage of your proposal/solution/recommendation: Focus on explaining why this approach is the optimal one. 2. Persuasive in nature: Again, leave the “what” to the graphics. Use specific statements and facts to support your point. Numbers can be very persuasive, but be prepared to substantiate them! 3. Grade 8 reading level: Put yourself in the reader’s shoes and make it easy for them to read. Avoid big words and convoluted sentence structure. A good check is to consider if an eighth grader could read it and understand your message. If it’s a technical subject, also consider if someone who doesn’t do more than surf the Internet could understand it. 4. Related to graphics: Since your graphics are descriptive and your text is persuasive, ensure that there are clear associations between the two. A logical “what” and a persuasive “why” are meaningless unless it’s easy to see how what you are proposing provides the outlined benefits. 5. Written from reader’s perspective (avoid we, us, you, etc.): Again, step into the reader’s shoes. Make it easy for them to understand the key points. You aren’t trying to impress them with your knowledge. You are trying to reach and persuade them. 6. Avoid absolutes (always, never, best, etc.): Absolutes can get you into trouble! Review your document for statements with absolutes, take those words out and reread your document. Often nothing meaningful is lost and statements are crisper. 7. Edited by a 3rd party: Additional perspectives engaged in critiquing and brainstorming are invaluable and an objective party may more easily identify unnecessary information. For example, compare the following proposal introductions: This project will define and document Product & Account Architecture, covering business strategy, business architecture, and technical architecture for product sales and fulfillment, service enrollment and fulfillment, account closing/service discontinuation, account/service change and selected account transactions. Or The cost for this project is $10.3M with annual savings of $12.5M resulting in a 10-month payback. In addition to the hard benefits, this project will increase customer satisfaction, improve data accuracy and enhance compliance enforcement. Clearly, the second paragraph has greater impact. The tone is persuasive rather than academic.
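The payback figure quoted in the second example can be verified with simple arithmetic, using only the numbers given above:

\[
\text{payback (months)} = \frac{\text{project cost}}{\text{annual savings}} \times 12 = \frac{10.3}{12.5} \times 12 \approx 9.9 \approx 10 \text{ months}
\]

Being prepared to show a calculation like this is part of substantiating the numbers you put in front of an executive audience.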
Summary The single objective of a proposal is to win support for the proposed solution; this provides a very simple measure for the effectiveness of any paragraph of text or any graphic. Essentially, the entire document should be regarded as an extended executive summary. Every page should answer “Why?” and “So What?” It is best to visualize first and then to write. Text should be persuasive and graphics should be descriptive – not the other way around. Sell benefits rather than features.
Last updated: 30-May-08 14:13
Selecting the Right ICC Model Challenge Choosing the right Integration Competency Center (ICC) implementation requires answering questions about the services the ICC should offer and having an understanding of the current organizational culture, financial accounting processes and human resource allocation. Consider the benefits of an ICC implementation versus the costs and risks of a project silo approach (which essentially is a decision to not implement an ICC). Choosing the right model is significant because a good ICC will grow in importance to the organization. This Best Practice provides an overview of items to consider that can help in choosing the appropriate structure for an ICC. The main challenge in ICC startup is to identify the appropriate organizational structure for a given enterprise. There are four main factors that help to determine the type of ICC model that an organization should implement. They are: IT Organization Size, Business Value/Opportunity Planning, IT Strategic Alignment, and Urgency/Bias for Action by Business Community.
Description What are the ICC Models? Integration Competency Centers fall into five main models: Best Practices, Technology Standards, Shared Services, Central Services and Self Service. The first model (Project Silos) in the figure below is not really an ICC. It is the situation that often exists before organizations begin an ICC infrastructure to improve data integration efficiency.
Model 1 - Best Practices A Best Practices ICC is the easiest to implement, which makes it a good first step for an organization that wants to begin leveraging integration expertise. The Best Practices ICC model focuses on establishing proven processes across business units, defining processes for data integration initiatives and recommending appropriate technology, but it does not share the development workload with individual project teams. The result is a higher overall ROI for each data integration initiative. To achieve this goal, a Best Practices ICC documents and distributes recommended operating procedures and standards for
development, management and mapping patterns. It also defines how to manage change within an integration project. The people who lead this effort are typically those in the organization who have the most integration expertise. They form a virtual team consisting of project managers and ETL lead developers from different projects. The most important roles in this type of ICC are the knowledge coordinator who collects and distributes best practices and the ICC manager who ensures that the ICC anticipates business requirements and that business managers and customers turn to the ICC for assistance with integration initiatives. The primary function of this ICC model is to document best practices. It does not include a central support or development team to implement those standards across projects. To implement a Best Practices ICC, companies need a flexible development environment that supports diverse teams and that enables the team to enhance and extend existing systems and processes.
Model 2 - Technology Standards The Technology Standards model standardizes development processes on a single, unified technology platform, enabling greater reuse of work from project to project. Although neither technology nor people are typically shared, standardization creates synergies among disparate project teams. A Technology Standards ICC provides the same knowledge leverage as a Best Practices ICC, but enforces technical consistency in software development and hardware choices. A Technology Standards ICC focuses on processes, including standardizing and enforcing naming conventions, establishing metadata standards, instituting change management procedures and providing standards training. This type of ICC also reviews emerging technologies, selects vendors, and manages hardware and software systems. The people within a Technology Standards ICC typically come from different development teams, and may move from one team to another. However, at its core is a group of best practices leaders. These most likely include the following roles: Technology Leader Metadata Administrator Knowledge Coordinator Training Coordinator Vendor Manager ICC Manager A Technology Standards ICC may standardize integration activities on a common platform and link repositories for optimized metadata sharing. To support these activities the ICC needs technologies that provide for metadata management; enable maximum reuse of systems, processes, resources and interfaces; and offer a robust repository, including embedded rules and relationships and a model for sharing data.
Model 3 - Shared Services The Shared Services model defines processes, standardizes technologies and maintains a centralized team for shared work, but most development work occurs in the distributed lines of business. This hybrid centralized/decentralized model optimizes resources. A Shared Services ICC optimizes the efficiency of integration project teams by providing a common, supported technical environment and services ranging from development support all the way through to a help desk for projects in production. This type of ICC is significantly more complex than a Best Practices or Technology Standards model. It establishes processes for knowledge management, including product training, standards enforcement, technology benchmarking, and metadata management and it facilitates impact analysis, software quality and effective use of developer resources across projects. The team takes responsibility for the technical environment, including hardware and software procurement, architecture, migration, installation, upgrades, and compliance. The Shared Services ICC is responsible for departmental cost allocation; for ensuring high levels of availability through careful capacity planning; and for security, including repository administration and disaster recovery planning. The ICC also takes on the task of selecting and managing professional services vendors. The Shared Services ICC may support development activities, including performance and tuning. It provides QA, change management, acceptance and documentation of shared objects. The Shared Services ICC supports projects through a development help desk, estimation, architecture review, detailed design review and system testing. It also supports cross project integration through schedule management and impact analysis. When a project goes into production, the ICC helps to resolve problems through an operations help desk and data validation. It monitors schedules and the delivery of operations metadata. It also manages change from migration to production, provides change
control review and supports process definition. The roles within a Shared Services ICC include technology leader, technical architect, and data integration architect—someone who understands ETL, EAI, EII, Web services, and other integration technologies. A repository administrator and a metadata administrator ensure leverage and reuse of development assets across projects, set up user groups and connections, administer user privileges and monitor repository activities. This type of ICC also requires a knowledge coordinator, a training coordinator, a vendor manager, an ICC manager, a product specialist, a production operator, a QA manager, and a change control coordinator. A Shared Services ICC requires a shared infrastructure environment for development, QA, and production.
Model 4 - Central Services Centralized integration initiatives can be the most efficient and have the most impact on the organization. A Central Services ICC controls integration across the enterprise. It carries out the same processes as the other models, but in addition usually has its own budget and a chargeback methodology and the staff report to the ICC Director rather than to other managers or executives. It also offers more support for development projects, providing management, development resources, data profiling, data quality and unit testing. Because a Central Services ICC is more involved in development activities than the other models, it requires a production operator and a data integration developer. In this ICC model, standards and processes are defined, technology is shared and a centralized team is responsible for all development work on integration initiatives. Like a Shared Services ICC, it also includes the roles of technology leader, technical architect, data integration architect, repository administrator, metadata administrator, knowledge coordinator, training coordinator, vendor manager, ICC manager, product specialist, production operator, QA manager and change control coordinator. To achieve its goals, a Central Services ICC needs a live and shared view of the entire production environment. Tools to maximize reuse of systems, processes, resources and interfaces are essential, as is visibility into dependencies and assets. A Central Services ICC depends on robust metadata management tools and tools that enable the team to enhance and extend existing systems and processes.
Model 5 - Self Service The self-service ICC model both achieves a highly efficient operation and furnishes an environment where innovation can flourish. Self-service ICCs require strict enforcement of a set of application integration standards through automated processes and have a number of tools and systems in place that support automated or semi-automated processes. An example of a Self Service ICC is the Informatica On-Demand service which allows business analysts to integrate salesforce.com data with other repositories through a well-defined and easy-to-use interface. The staff involved in this model often work “behind the scenes” to maintain the software and tools that the business analysts use.
Multiple Model Approach In addition to considering the above models, it is desirable to offer services and an organization structure based upon the level of integration involved. The type of projects being carried out has a heavy weighting on the ICC model chosen. The figure below illustrates the viewpoint that certain ICC models suit certain types of projects. Strategic projects initiated by an organization may usually be better served under a central services model. Tactical operational projects initiated and controlled by lower tiers of management may be better served with a Best Practices or Shared Services ICC implementation that reflects a certain amount of autonomy that is present within that level of an organization. It is also important to note that even if the organization adopts a central services ICC model, not all projects may fall into a Central Services ICC model. Some projects require very specific SLAs (Service Level Agreements) that are much more stringent than other projects, and as such they may require a less stringent ICC model.
Activities A high level plan to select the right ICC model is outlined below. The following sections elaborate on the key decision criteria and activities associated with these steps.
1. Evaluate Selection Criteria: Determine the recommended model based on best practice guidelines related to organizational size, strategic alignment, business value, and urgency for realizing significant benefits.
2. Document Objective and Potential Benefits: Define what the business intent or purpose of the ICC is and what the desired or expected benefits are (this is not a business case at this stage).
3. Define Service Scope: Define the scope of services that will be offered by the ICC (detailed service definitions occur in step 6 or 7).
4. Determine Organizational Constraints: Identify budget limitations or operational constraints imposed such as geographic distribution of operations and degree of independence by operational groups.
5. Select an ICC Model: Recommend a model and gain executive support or sponsorship to proceed. If there is an urgent need to implement the ICC, move to step 6 – otherwise proceed to step 7.
6. Develop 120 Day Implementation Plan: Leverage Planning the ICC Implementation to implement and launch an ICC.
7. Evolve to Target State: Develop a future-state vision and begin implementing it. Planning the ICC Implementation may still provide useful guidance, but the time-frame may be measured in years rather than months.
Which Model is Best for an Organization? There are four main factors that help to determine the type of ICC model that an organization should implement. They are: IT Organization Size Business Value/Opportunity Planning IT Strategic Alignment Urgency/Bias for Action by Business Community
Criteria 1 – IT Organization Size The size of the IT organization is one factor in selecting the right ICC model. While any ICC Model could be used in any size organization, each model has a “sweet spot” that optimizes the benefits to the enterprise and minimizes potential negative
factors. For example, the Best Practices model could be used in a very large multi-national organization, but it would not capture all of the benefits that could be realized. Conversely, a Central Services or Shared Services model could be implemented in an IT department with 50 staff, but the formality of these models would add extra overhead which is not necessary in smaller groups. Below are initial guidelines to help determine which model fits best.
The “sweet spot” for Central Services is between 500-2,000 IT staff. Organizations of that size can benefit most from a centralized approach while avoiding negative factors like diseconomies of scale that can occur in larger organizations. Organizations with greater than 2,000 staff are generally better suited to leverage a Shared Services model. In very large organizations with 5,000 or more staff and several CIO groups, it is common to see a Shared Services model where each organization under a CIO has a Central Services ICC with the efforts across the groups being coordinated by an enterprise Shared Services ICC. Note that the Best Practices and Technology Standards models are applicable in any size IT organization.
Criteria 2 – Business Value/Opportunity Planning Identifying the value and opportunity of data integration to the business is another factor in selecting the right model. For example, if there are defined business initiatives that require integration (e.g., a customer data warehouse) then there is more opportunity to leverage investment in an ICC. Conversely, if the business units that make up a large corporation operate as separate and autonomous groups with few end-to-end processes and little need to share information (like a holding company, for example) then the focus for achieving efficient IT operations should be more on optimizing IT practices and leveraging technology standards.
Business System Pattern | Suggested Model
Siloed business units, separate IT operations infrastructure | Best Practices
Siloed business units, shared IT operations infrastructure | Technology Standards
Master Data Management, Customer Data Integration type data initiatives | Shared Services
Data Integration Vision established or Enterprise Data | Central Services
The main opportunity here for a more centralized approach is that there is a business initiative to bring all enterprise information together to find the ‘single version of the truth’. An ICC can certainly gain sponsorship in building the sustaining integration organization that is required to meet this vision. Less visionary initiatives may require educating business users of the need for data integration and might require more of a ‘pay as you go’ approach to show successes incrementally. However, elaborative meetings to extract the business vision on data could be very helpful in defining the vision and identifying the need for data integration.
Criteria 3 – IT Strategic Alignment How the IT organization is aligned to data integration is another key criterion that can be used to measure readiness for each of the different ICC models. Evaluate whether the key IT projects are integration-driven vs. operations-driven or if they are focused
on key business application systems.
IT Project Pattern | Suggested Model
Key projects are separate, supporting single business unit and not integrated with other systems | Best Practices
Key projects are separate, supporting single business unit, but need to access data from 1-3 other systems | Technology Standards
Key projects are wider in scope and require cross functional teams to address integration with greater than 3 other systems | Shared Services
Key projects are focused on major integration efforts such as customer integration, key system data exchanges, etc. | Central Services
Consider looking at the IT Project Portfolio to gain an understanding of the criteria used to prioritize projects. Projects that are focused on reductions in cost or redundancy can be helpful in identifying key integration and cross functional needs across an organization. Other items to look for are key data integration initiatives such as consolidation/retirement of applications that occur through merger activities, reduction/simplification of processes, etc. with a focus to reduce IT support costs and invest in infrastructure to free up development resources.
Criteria 4 – Urgency/Bias for Action by Business Community A final criterion for selecting an ICC model is how quickly the organization needs to act. A sense of urgency (or bias for action) may provide the incentive to move more quickly to a Central Services model.
Urgency Level | Suggested Model
Perceived benefits for sharing practices, but no immediate or pressing reason to address integration needs from a business or IT perspective. | Best Practices
Desire to standardize technology and reduce variations in the IT infrastructure, but no compelling need at the moment to address business or data integration issues. | Technology Standards
There are a number of opportunities for collaboration, resource sharing, and reuse of development components, but these may be addressed incrementally over time. | Shared Services
There are one or more key strategic initiatives that are being driven top-down that require collaboration and coordination across multiple groups and must show progress and results quickly. | Central Services
A sense of urgency helps to identify the business case and provides the impetus necessary to execute it. Therefore, it is considered a factor in choosing the right model.
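To pull the four criteria together, the Python sketch below turns them into a rough suggestion for a starting model. The numeric thresholds, the 0-3 scale and the "highest level wins, capped by organization size" rule are illustrative assumptions for discussion, not a formal Informatica scoring method; the tables above remain the actual guidance.

```python
# Illustrative only: map the four selection criteria to a suggested starting model.
MODELS = ["Best Practices", "Technology Standards", "Shared Services", "Central Services"]

def suggest_model(it_staff, business_value, strategic_alignment, urgency):
    """Each criterion is scored 0-3, where the score is the index of the model
    its table above points to (0 = Best Practices ... 3 = Central Services)."""
    candidate = max(business_value, strategic_alignment, urgency)
    # Size guideline from Criteria 1: Central Services suits roughly 500-2,000
    # IT staff; much larger organizations tend toward Shared Services.
    if candidate == 3 and not (500 <= it_staff <= 2000):
        candidate = 2
    # Assumed cut-off: very small IT groups rarely need the formality of the
    # shared or central models.
    if it_staff < 200:
        candidate = min(candidate, 1)
    return MODELS[candidate]

# Example: 800 IT staff, MDM-style initiative (2), cross-functional projects (2),
# strong top-down urgency (3) -> suggests starting with Central Services.
print(suggest_model(800, business_value=2, strategic_alignment=2, urgency=3))
```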
Matching Budget to Suggested Model Selection To put all of this together, use the four criteria above to determine which model to implement. Match the budget to the suggested model. It may be necessary to build a business case (particularly when implementing a Shared or Central Services model). Examples of how to build a business case and develop a chargeback model can be found in the Financial Management competency. The remainder of this best practice is focused on the activities that make up an ICC. This information will help in estimating the cost and organizational impact of establishing the ICC function.
Objectives and Benefits After selecting a model for an ICC implementation there are important questions to explore: What are the objectives of implementing an ICC? What will the services and benefits offered by the ICC consist of?
Typical ICC objectives include: Promoting data integration as a formal discipline. Developing a set of experts with data integration skills and processes and leveraging their knowledge across the organization. Building and developing skills, capabilities and best practices for integration processes and operations. Monitoring, assessing and selecting integration technology and tools. Managing integration pilots. Leading and supporting integration projects with the cooperation of subject matter experts. Reusing development work such as source definitions, application interfaces and codified business rules. Although a successful project that shares its lessons with other teams can be a great way to begin developing organizational awareness for the value of an ICC, setting up a more formal ICC requires upper management buy-in and funding. Some of the typical benefits that can be realized from doing so include: Rapid development of in-house expertise through coordinated training and shared knowledge. Leveraging of shared resources and "best practice" methods and solutions. More rapid project deployments. Higher quality/reduced risk for data integration projects. Reduced costs of project development and maintenance. Shorter time to ROI. When examining the move towards an ICC model that optimizes and (in certain situations) centralizes integration functions, consider two things: The problems, costs and risks associated with a project silo-based approach The potential benefits of an ICC environment
ICC Activities
What activities does an ICC perform? The common activities provided by an ICC can be divided into four major categories:
- Knowledge Management
- Environment
- Development Support
- Production Support

The ICC Activities Summary below breaks down the activities that can be provided by an ICC based on the four categories above:

Knowledge Management
- Training: Standards Training, Product Training
- Standards: Standards Development, Standards Enforcement, Methodology, Mapping Patterns
- Technology: Emerging Technologies, Benchmarking
- Metadata: Metadata Standards, Metadata Enforcement, Data Integration Catalog

Development Support
- Performance: Performance and Tuning
- Shared Objects: Shared Object Quality Assurance, Shared Object Change Management, Shared Object Acceptance, Shared Object Documentation
- Project Support: Development Helpdesk, Software/Method Selection, Project Estimation
- Project Management: Project Architecture Review, Detailed Design Review, Development Resources, Data Profiling, Data Quality
- Testing: Unit Testing, System Testing
- Cross Project Integration: Schedule Management/Planning, Impact Analysis

Environment
- Hardware: Vendor Selection and Management, Hardware Procurement, Hardware Architecture, Hardware Installation, Hardware Upgrades
- Software: Vendor Selection and Management, Software Procurement, Software Architecture, Software Installation, Software Upgrades, Compliance (Licensing)
- Professional Services: Vendor Selection and Management, Vendor Qualification
- Security: Security Administration, Disaster Recovery
- Financial: Budget, Departmental Cost Allocation
- Scalability/Availability: High Availability, Capacity Planning

Production Support
- Issue Resolution: Operations Helpdesk, Data Validation
- Production Monitoring: Schedule Monitoring, Operations Metadata Delivery
- Change Management: Object Migration, Change Control Review, Process Definition

The activities that could potentially be provided by ICCs for each category are described in the tables below. The ICC models that they usually fall into are abbreviated as:
- Best Practices (BP)
- Technology Sharing (TS)
- Shared Services (SS)
- Central Services (CS)
ICC Knowledge Management Activities

- Standards Training (BP, TS, SS, CS): Training on best practices, including but not limited to naming conventions, unit test plans, configuration management strategy and project methodology.
- Product Training (SS, CS): Coordination of vendor-offered or internally sponsored training on specific technology products.
- Standards Development (BP, TS, SS, CS): Creating best practices, including but not limited to naming conventions, unit test plans and coding standards.
- Standards Enforcement (BP, TS, SS, CS): Ensuring that development teams use documented best practices through formal development reviews, metadata reports, project audits or other means.
- Methodology (SS, CS): Creating methodologies to support development initiatives. Examples include methodologies for rolling out data warehouses and data integration projects. Typical topics in a methodology include but are not limited to: project management, project estimation, development standards and operational support.
- Mapping Patterns (SS, CS): Developing and maintaining mapping patterns (templates) to speed up development time and promote mapping standards across projects.
- Emerging Technologies (TS, SS, CS): Responsible for the assessment of emerging technologies, determining if and where they fit in the organization, and setting policies around their adoption and use.
- Benchmarking (TS, SS, CS): Conducting and documenting tests on hardware and software in the organization to establish performance benchmarks.
- Metadata Standards (BP, TS, SS, CS): Defining and documenting the metadata standards that development teams are expected to follow.
- Metadata Enforcement (SS, CS): Ensuring that development teams conform to documented metadata standards.
- Data Integration Catalog (SS, CS): Tracking the list of systems involved in data integration efforts, the integration between systems, and the use/subscription of data integration feeds. This information is critical to managing the interconnections in the environment in order to avoid duplicating integration efforts and to know when particular integration feeds are no longer needed.
ICC Environment Activities

- Hardware Vendor Selection and Management (TS, SS, CS): Selection of vendors for the hardware needed for integration efforts, which may span servers, storage and network facilities.
- Hardware Procurement (SS, CS): Responsible for the purchasing process for hardware items, which may include receiving and cataloging the physical hardware items.
- Hardware Architecture (SS, CS): Developing and maintaining the physical layout and details of the hardware used to support the Integration Competency Center.
- Hardware Installation (SS, CS): Setting up and activating new hardware as it becomes part of the physical architecture supporting the Integration Competency Center.
- Hardware Upgrades (SS, CS): Managing the upgrade of hardware, including operating system patches, additional CPU/memory upgrades, replacement of old technology, etc.
- Software Vendor Selection and Management (TS, SS, CS): Selection of vendors for the software tools needed for integration efforts. Activities may include formal RFPs, vendor presentation reviews, software selection criteria, maintenance renewal negotiations and all activities related to managing the software vendor relationship.
- Software Procurement (SS, CS): Responsible for the purchasing process for software packages and licenses.
- Software Architecture (SS, CS): Developing and maintaining the architecture of the software package(s) used in the competency center. This may include flowcharts and decision trees of what software to select for specific tasks.
- Software Installation (SS, CS): Setting up and installing new software as it becomes part of the physical architecture supporting the Integration Competency Center.
- Software Upgrades (SS, CS): Managing the upgrade of software, including patches and new releases. Depending on the nature of the upgrade, significant planning and rollout efforts may be required (training, testing, physical installation on client machines, etc.).
- Compliance (Licensing) (SS, CS): Monitoring and ensuring proper licensing compliance across development teams. Formal audits or reviews may be scheduled. Physical documentation should be kept matching installed software with purchased licenses.
- Professional Services Vendor Selection and Management (SS, CS): Selection of vendors for professional services related to integration efforts. Activities may include managing vendor rates and bulk discount negotiations, payment of vendors, reviewing past vendor work efforts, managing the list of 'preferred' vendors, etc.
- Professional Services Vendor Qualification (SS, CS): Activities may include formal vendor interviews as consultants/contractors are proposed for projects, checking vendor references and certifications, and formally qualifying selected vendors for specific work tasks (e.g., Vendor A is qualified for Java development while Vendor B is qualified for ETL and EAI work).
- Security Administration (SS, CS): Providing access to the tools and technology needed to complete data integration development efforts, including software user IDs, source system user IDs and passwords, and overall data security of the integration efforts. Ensures enterprise security processes are followed.
- Disaster Recovery (SS, CS): Performing risk analysis in order to develop and execute a plan for disaster recovery, including repository backups, off-site backups, failover hardware, notification procedures and other tasks related to a catastrophic failure (e.g., a server room fire that destroys development and production servers).
- Budget (CS): Yearly budget management for the Integration Competency Center. Responsible for managing outlays for services, support, hardware, software and other costs.
- Departmental Cost Allocation (SS, CS): For clients where shared services costs are to be spread across departments/business units for cost purposes. Activities include defining the metrics used for cost allocation, reporting on the metrics, and applying cost factors for billing on a weekly, monthly or quarterly basis as dictated.
- Scalability/High Availability (SS, CS): Design and implementation of hardware, software and procedures to ensure high availability of the data integration environment.
- Capacity Planning (SS, CS): Design and planning for additional integration capacity to address the organization's future growth in the size and volume of data integration.
ICC Development Support Activities

- Performance and Tuning (SS, CS): Provide targeted performance and tuning assistance for integration efforts. Provide ongoing assessments of load windows and schedules to ensure service level agreements are being met.
- Shared Object Quality Assurance (SS, CS): Provide quality assurance services for shared objects so that objects conform to standards and do not adversely affect the various projects that may be using them.
- Shared Object Change Management (SS, CS): Manage the migration to production of shared objects which may impact multiple project teams. Activities include defining the schedule for production moves, notifying teams of changes, and coordinating the migration of the object to production.
- Shared Object Acceptance (SS, CS): Define and document the criteria for a shared object and officially certify an object as one that will be shared across project teams.
- Shared Object Documentation (SS, CS): Define the standards for documentation of shared objects and maintain a catalog of all shared objects and their functions.
- Development Helpdesk (SS): Provide a helpdesk of expert product personnel to support project teams. This gives project teams that are new to developing data integration routines a place to turn for experienced guidance.
- Software/Method Selection (SS, CS): Provide a workflow or decision tree to use when deciding which data integration technology to use for a given technology request.
- Requirements Definition (SS, CS): Develop the process to gather and document integration requirements. Depending on the level of service, this activity may include assisting with, or even fully gathering, the requirements for the project.
- Project Estimation (SS, CS): Develop project estimation models and provide estimation assistance for data integration efforts.
- Project Management (CS): Provide full-time management resources experienced in data integration to ensure successful projects.
- Project Architecture Review (SS, CS): Provide project-level architecture review as part of the design process for data integration projects. Help ensure standards are met and the project architecture fits with the enterprise architecture vision.
- Detailed Design Review (SS, CS): Services to review design specifications in detail to ensure conformance to standards and identify any issues upfront before development work begins.
- Development Resources (SS, CS): Provide skilled resources for completion of the development efforts.
- Data Profiling (CS): Provide data profiling services to identify data quality issues. Develop plans for addressing issues found in data profiling.
- Data Quality (CS): Define and meet data quality levels and thresholds for data integration efforts.
- Unit Testing (CS): Define and execute unit testing of data integration processes. Deliverables include documented test plans, test cases and verification against end-user acceptance criteria.
- System Testing (SS, CS): Define and perform system testing to ensure that data integration efforts work seamlessly across multiple projects and teams.
- Schedule Management/Planning (SS, CS): Provide a single point for managing load schedules across the physical architecture to make best use of available resources and appropriately handle integration dependencies.
- Impact Analysis (SS, CS): Provide impact analysis on proposed and scheduled changes that may impact the integration environment. Changes include but are not limited to system enhancements, new systems, retirement of old systems, data volume changes, shared object changes, hardware migration and system outages.
ICC Production Support Activities

- Operations Helpdesk (SS, CS): First line of support for operations issues, providing high-level issue resolution. The helpdesk fields support cases and issues related to scheduled jobs, system availability and other production support tasks.
- Data Validation (SS, CS): Provide data validation on integration load tasks. Data may be 'held' from end-user access until some level of data validation has been performed. This can range from a manual review of load statistics to an automated review of record counts, including grand total comparisons, expected size thresholds or any other metric an organization may define to catch potential data inconsistencies before they reach end users.
- Schedule Monitoring (SS, CS): Nightly/daily monitoring of the data integration load jobs: ensuring jobs are properly initiated, are not being delayed, and complete successfully. May provide first-level support for the load schedule while escalating issues to the appropriate support teams.
- Operations Metadata Delivery (SS, CS): Responsible for providing metadata to system owners and end users regarding the production load process, including load times, completion status, known issues and other pertinent information regarding the current state of the integration job stream.
- Object Migration (SS, CS): Coordinate movement of development objects and processes to production. May even physically control migration such that all migration is scheduled, managed and performed by the ICC.
- Change Control Review (SS, CS): Conduct formal and informal reviews of production changes before migration is approved. At this time, standards may be enforced, system tuning reviewed, production schedules updated, and formal sign-off to production changes issued.
- Process Definition (BP, TS, SS, CS): Develop and document the change management process such that development objects are efficiently and flawlessly migrated into the production environment. This may include notification rules, schedule migration plans, emergency fix procedures, etc.
Choosing the ICC Model
Other factors in choosing an ICC model can depend on the nature of the organization. For example, if the data integration functions are already centrally managed for established corporate reasons, it is unlikely that there will be a sudden move to shared services. In turn, shared services may be more realistic in a culture where independent departments take on and manage their own data integration projects. A central services model would only result if senior managers wanted to consolidate the management and development of projects for reasons of cost, knowledge or quality, and make it policy.

The higher the degree of centralization, the greater the potential cost savings. Some organizations have the flexibility to easily move toward central services, while others do not, due to organizational or regulatory constraints. There is no ideal model; just one that is appropriate to the environment in which it operates and that will deliver increased efficiency and quicker and higher ROI for data integration projects.

The adoption of the Central Services model does not necessarily mandate the inclusion of all applications within the orbit of the ICC. Some projects require very specific SLAs (Service Level Agreements) that are much more stringent than other projects, and as such they may require a less stringent ICC model.

The tables below show how to compare the services that will be provided by an ICC against the four models. Having considered the service categories, the appropriate ICC organizational model may be indicated. Working the exercise in reverse may reveal services that will need to be provided for a chosen ICC model that may not be possible initially or that would require extra resources and budget. The desired services can be marked down and compared against the standard models that are shown. Review Engagement Services Management to determine how to bundle these activities into client services for consumption.
Other, more general questions to consider when envisioning the services the ICC can provide are based on the level of responsibility the ICC and its management group may take on. Will the ICC be responsible for:
- A shared cross-functional integration system?
- Enforcing technology standards?
- Maintaining a metadata repository?
- Developing shared common objects incorporating business logic?
- End-to-end business process monitoring?
- Will production support be provided?

Consider whether the expertise and resources actually exist within the host organization to provide the services. Further, how would intra-organizational politics affect the reception of any of the models for an ICC? Would a new ICC group taking on those responsibilities be acceptable to other significant persons and departments who may be threatened by, or benefit from, the introduction of the ICC model? Conversely, which individuals and departments would support the creation of an ICC with one of the four main models described in this Best Practice?

More information is available in the following publication: Integration Competency Center: An Implementation Methodology by John Schmidt and David Lyle, Copyright 2005 Informatica Corporation.
Last updated: 16-Oct-08 17:13
Service Level Agreement Management

Challenge
Service Level Agreements (SLAs) and Operation Level Agreements (OLAs) for an Integration Competency Center (ICC), or for any other information system, may be either formal (with penalties and escalation paths for failure to meet the agreement) or an informal set of expectations. An OLA focuses on the operational availability of the systems on which the ICC depends for data. An SLA focuses on the delivery of processed data for reporting and downstream processing. Some of the challenges around SLA management include:
- Defining the SLA or OLA
- Harmonizing the needs of diverse teams in a shared infrastructure
- Determining the formality/informality of the SLA/OLA and expectations
- Discerning the impact to the organization of a missed SLA/OLA
- Designing and building the systems that allow the SLA/OLA to be achieved
- Setting the escalation path for resolving a missed or unachievable SLA/OLA
- Maintaining service levels and operation levels for growth
Description

Defining the SLA
Service Level Agreements may be formal, documented, signed agreements or informal understandings, hopes, and expectations regarding the level of service to expect. In the context of integration systems, SLAs are centered on the availability of data and processing for acquisition from legacy applications, conversion and load to target systems, and for end-user reporting and other dependent applications. Operation Level Agreements similarly may be formal or informal. They differ from SLAs in that they pertain to hardware architecture performance and availability. An uptime percentage would be a typical metric found in an OLA. For simplicity, this document refers to both SLAs and OLAs as SLAs unless clarification is required.
Considerations
In order to define an SLA for a new or existing application system, there needs to be an understanding not only of the application, but also of how it fits into the environment with other systems and applications. For an application for which an SLA is defined, the organization will want to understand:
- Data volumes
- Growth projections
- Execution window required
- System resources required
- Dependencies (predecessors and successors)
- Concurrent processing conflicts
- Production schedules
- User requirements

Be sure to understand these items before formalizing the SLA; otherwise expectations may be set that cannot be realized. For example, the delivery of updated data to a reporting system cannot be performed if the source data is not available for the planned processing. Similarly, if operations downtime is planned for certain hours to perform routine maintenance or backups, then that is a window when processing will not be able to occur. Downstream users will need to understand these requirements. Alternatively, if the user requirements show a critical business need, then the development and support teams might work to see if the scheduling of predecessor events can be adjusted to meet the requirement.

As a side note to adjusting production schedules and outage windows: when introducing new hardware or software with increased performance capabilities into the environment, review the dependency information to determine if there is an opportunity to make adjustments. If shorter execution times or earlier availability of predecessor items are expected, then the organization might improve service by guaranteeing sooner delivery.
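To make the dependency arithmetic concrete, the feasibility of a proposed delivery time can be checked from the predecessor's expected completion time, the maintenance window, and the load's typical run time. The sketch below is illustrative only; the function name and the times used are hypothetical and not part of any Informatica tool or API.

```python
from datetime import datetime, timedelta

def sla_is_feasible(source_ready, maintenance_end, run_time, sla_target):
    """Return True if the load can plausibly finish by the SLA target.

    source_ready    -- when predecessor data is expected to be available
    maintenance_end -- when the nightly backup/maintenance window closes
    run_time        -- typical elapsed time of the load
    sla_target      -- promised delivery time in the SLA
    """
    earliest_start = max(source_ready, maintenance_end)
    projected_finish = earliest_start + run_time
    return projected_finish <= sla_target

# Hypothetical example: the source feed lands at 02:00, backups end at 03:00,
# the load usually runs 2.5 hours, and the SLA promises data by 06:00.
day = datetime(2024, 1, 15)
print(sla_is_feasible(day.replace(hour=2), day.replace(hour=3),
                      timedelta(hours=2, minutes=30), day.replace(hour=6)))
```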
As applications mature, growth based on increased usage or higher data volumes may begin to erode performance and require longer execution times. Allow for that eventuality in the SLA expectations, or plan to continue to meet the delivery expectations with capacity increases that keep performance constant or open up a bigger execution window. Similarly, other applications running in a shared architecture can affect an application if they are utilizing the resources needed for efficient processing. This concurrent processing may cause longer elapsed times for the application. Growth in usage of the shared infrastructure (by adding more users, more applications or more processing) must be considered when developing the SLA. As these items begin to migrate to the shared environment, review their impact on the SLA. Possible actions include adjusting the SLA, adjusting the production schedule or adding capacity in order to accommodate the increased demand.

The SLA defines the expectations surrounding performance, availability and delivery based on an understanding of the above items and their limitations. Additionally, the SLA often includes measures to take when an event occurs that does not permit the SLA to be met. Response times for resolution of the incident may be specified, with several levels of response urgency indicated for more or less business-critical applications. Semi-routine service tasks such as code migrations, new user access requests, password resets or application enhancements may also have an SLA focusing on how soon the team addresses the request. SLAs for problem resolution generally indicate several levels of response, depending on whether there is a workaround for the problem, the number of users affected and the business criticality of the disrupted application.

SLA expectations need to apply not only to those providing the service, but also to those using it. A typical responsibility of an end user is to report problems or deficiencies promptly and to be available to test and approve the resolution. The end user also approves outages or reduced services for special maintenance windows for upgrades, and promptly notifies the service provider of special requests (including requirements and deadlines), allowing enough lead time to deliver.

The SLA can include the following items:
- Application or system performance expectations
- Availability expectations (percentage uptime, time and days, data refresh schedule, etc.)
- Routine request turn-around
- Special request turn-around
- Problem management response and resolution time
- Change request management timelines, approval processes and completion expectations
- Prioritization based on urgency
- Escalation path
- Off-hours support expectations
- Roles and responsibilities for all parties

Often accompanying the SLA is a support document that outlines the steps, typical resolutions, contact information and escalation paths to use when a problem incident is raised. This allows the team responsible for resolution to be well prepared to handle urgent issues with no wasted effort.
Relevance of Formal SLAs vs. Informal SLAs
The formality or informality of an SLA depends upon a number of factors:
- Are SLAs being managed and resolved in-house or by a third party organization?
- Are there contractual agreements in place regarding performance, delivery and service levels?
- Are there penalties for failure to deliver or to resolve issues promptly? Are there bonuses for better than promised service?
- Is the application business critical?
- Is the time of day or day of delivery business critical?

If the answers to the above considerations are yes, then create a formal written agreement and have it signed by all parties. If no, then the SLA does not need to be specifically documented, but rather might be an informal expectation based on experience or desires. Be aware that even informal agreements become expectations that can disappoint the user if not met consistently.
Measuring Success
In particular for formal SLAs, the organization must be able to measure and quantify what meets or does not meet the agreement, as well as improvements or declines in the delivery of the expectations that were established in the SLA. Monitor items such as:
Hardware Metrics
- Uptime
- Failover
- CPU utilization
- Disk utilization
- Memory utilization

Software and Application Metrics
- Delivery of new features to production processing
- Availability of data
- Query performance
- Number of rows processed
- Number of jobs processed
- Execution time
- CPU time utilized

Common Metrics
- Number of problem incidents
- Speed of resolution and initial response
- Change request management response and completion

Score current service levels and compare them to historical trends. In this way the organization will be able to determine problem areas and make corrections to processes, schedules and capacity if there are red flags in any of these areas. Maintaining metrics, publishing them and resolving deficiencies or declining trends (as already stated) are most important when formal agreements have been made, but these activities can be equally important in ensuring an organization is informed and can take action to meet informal expectations before items like capacity and scheduling issues impact an SLA.
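As a simple illustration of scoring service levels against a target, the sketch below computes the percentage of runs that finished after the SLA deadline, which is the kind of figure cited in Example 1 below. The record layout and timestamps are hypothetical; in practice they would come from the scheduler or from repository run statistics.

```python
from datetime import datetime, time

# Hypothetical run history: (run date, completion time) pairs.
runs = [
    (datetime(2006, 4, 3), time(7, 40)),
    (datetime(2006, 4, 4), time(8, 15)),   # missed the 8 AM target
    (datetime(2006, 4, 5), time(7, 55)),
    (datetime(2006, 4, 6), time(8, 5)),    # missed the 8 AM target
]

SLA_TARGET = time(8, 0)  # data must be available to end users by 8 AM

missed = sum(1 for _, finished in runs if finished > SLA_TARGET)
miss_rate = 100.0 * missed / len(runs)
print(f"{missed} of {len(runs)} runs ({miss_rate:.0f}%) exceeded the SLA target")
```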
Samples

Example 1
The figure above illustrates the dependencies among SLAs and OLAs. The ability to meet the SLA depends upon the availability of hardware, data and processing capabilities. It is possible for dependencies to span multiple items. For example, the end-user requesting a report might have an SLA that agrees to deliver the report by a certain time and day. Achieving the SLA depends not only on Your Process, Your Data, and Your Server, but also on the other systems that feed them.
The Sample Metrics Tracking figure above shows the capture and plotting of elapsed execution time for an application over time. Spikes indicate when data errors or delays in receiving source data occurred, causing an extended processing window. The drop beginning in late June 2006 resulted from replacing the database systems for this data warehouse application with a faster database and database server to increase capacity.

In reviewing an SLA like the one in Example 1 (where end users expect a report by 8 AM), the application processing shown above was beginning to push the threshold and occasionally ran past the SLA target time, and a little too close for comfort on the days leading up to the capacity improvements. Thirteen of the fifty-two executions shown for April and May (25%) exceeded the target SLA. Imagine the frustration and lack of confidence end users began to have with the system. In addition to needing to address this problem, there was a desire to move the SLA to 6 AM. The increased capacity moved up data availability significantly, leaving plenty of opportunity to meet the new SLA target (even if source data issues occurred). There are eight blips where source data delivery issues extended the execution window, but only one missed SLA.

Through extrapolation and an understanding of the system, the metrics diagram also shows that the SLAs focused on upstream processing to deliver the source data are being missed too frequently. The system that provides the source data then needs attention.
Example 2
The sample support SLA below shows the guidelines and expectations when raising a problem request. Other portions of the document explain how to raise a request, and what means of communication are acceptable for the first notification and then for follow-up.

Priority: Urgent
Guidelines for Use:
- A product or service is unavailable in Production and no workaround is available
- There is a severe impact to the Production environment
- Multiple people are affected
- Immediate action is needed
- This priority should not be used to report a problem affecting a test environment
- Should be reserved for those items that are truly urgent
Service Level:
- Support team is paged when the request is submitted
- Request will be addressed within 15 minutes of receipt
- Client will be notified by phone and/or by email when the request is placed in an "In Progress" status

Priority: High
Guidelines for Use:
- A product or service is unavailable in test or production and no workaround is available
- Multiple people are unable to complete critical tasks in test or in Production
- There is a severe impact on the completion of project milestones or to the production environment
- Short-term workarounds may be available while the problem is addressed
- Prompt action is needed
- This priority should not be used to report a problem affecting only one person in a test environment
Service Level:
- Support team is paged when the request is submitted
- Request will be addressed within 1 hour of receipt
- Client will be notified by phone and/or by email when the request is placed in "In Progress" status

Priority: Medium (Note: most requests should use this priority)
Guidelines for Use:
- One or more people are experiencing problems or need support work
- Project milestones are not likely to be affected and other tasks can be completed while this problem or task is being resolved
Service Level:
- Request will be addressed within 4 business hours of receipt
- Client will be notified by phone and/or by email when the request is placed in "In Progress" status

Priority: Low
Guidelines for Use:
- Problem resolution or support work is requested but is not needed immediately
- Project deadlines and/or an individual's performance will be enhanced
Service Level:
- Request will be addressed as agreed upon by both parties
- Client will be notified by phone and/or by email when the request is placed in "In Progress" status
Example 3
The following SLA sample outlines the responsibilities of the Development Team and the Support Team for different environments and outlines the notification plan when an incident or a change request occurs.

A. Development / UAT
   i. Support Team provides installation, upgrades, patches, etc.
   ii. Developers are granted access to perform migrations.
   iii. Developers are granted access to start/stop servers.
   iv. In the event that they are unable to successfully start/stop the instances, Developers must:
      1. Open a Problem Ticket.
      2. Provide details of exactly what they tried to do in order to bring the instance up or down.
      3. Support Team gets involved as second-level support and provides assistance.

B. Production
   i. This is a controlled environment to which the development teams do not have any access.
   ii. All issues must be raised directly to the Support Team via the following process:
      1. Open a Problem Ticket.
      2. Provide details of exactly what the issue is.
      3. Contact the help desk.
      4. If the help desk is unable to resolve the issue, the Support Team gets involved as second-level support and provides assistance.
   iii. Code migration requests are raised via the RFC change process using the Change Management System:
      1. Open an RFC in the Support Team queue.
      2. Provide details of the steps to take for the change:
         a. Associated repository and Informatica version
         b. Information about which Informatica folders are affected by the change
         c. Location of the XML file containing the Informatica objects to import
         d. Details of any other change steps (modified relational connections, etc.)
      3. Validate the success of the change during the change window.
      4. For non-emergency changes, if the change requires off-hours support, provide 5 business days advance notice.
Last updated: 06-Sep-08 16:21
Creating Inventories of Reusable Objects & Mappings

Challenge
Successfully identify the need and scope of reusability. Create inventories of reusable objects within a folder, of shortcuts across folders (local shortcuts), or of shortcuts across repositories (global shortcuts). Successfully identify and create inventories of mappings based on business rules.
Description

Reusable Objects
Prior to creating an inventory of reusable objects or shortcut objects, be sure to review the business requirements and look for any common routines and/or modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects or shortcut objects. In PowerCenter, these objects can be created as:
- single transformations (i.e., lookups, filters, etc.)
- a reusable mapping component (i.e., a group of transformations - a mapplet)
- single tasks in Workflow Manager (i.e., command, email, or session)
- a reusable workflow component (i.e., a group of tasks in Workflow Manager - a worklet)

Please note that shortcuts are not supported for workflow-level objects (tasks).

Identify the need for reusable objects based on the following criteria:
- Is there enough usage and complexity to warrant the development of a common object?
- Are the data types of the information passing through the reusable object the same from case to case, or is it simply the same high-level steps with different fields and data?

Identify the scope based on the following criteria:
- Do these objects need to be shared within the same folder? If so, create reusable objects within the folder.
- Do these objects need to be shared in several other PowerCenter repository folders? If so, create local shortcuts.
- Do these objects need to be shared across repositories? If so, create a global repository and maintain these reusable objects in the global repository. Create global shortcuts to these reusable objects from the local repositories.

Note: Shortcuts cannot be created for workflow objects.
PowerCenter Designer Objects
Creating and testing common objects does not always save development time or facilitate future maintenance. For example, if a simple calculation like subtracting a current rate from a budget rate is going to be used in two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix, then the calculation would be an ideal case for a reusable object. When you add instances of a reusable transformation to mappings, be careful that the changes do not invalidate the mapping or generate unexpected data. The Designer stores each reusable transformation as metadata, separate from any mapping that uses the transformation.

The second criterion for a reusable object concerns the data that will pass through the reusable object. Developers often encounter situations where they may perform a certain type of high-level process (i.e., a filter, expression, or update strategy) in two or more mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of Lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreating the same lookup logic in each mapping. This seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high-level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic so that it can be successfully applied to multiple cases.

Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or
reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass this test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of the reusable objects that will be built for the project. Keep in mind that it will be impossible to identify one hundred percent of the reusable objects at this point; the goal here is to create an inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes.
PowerCenter Workflow Manager Objects
In some cases, data must be read from different sources, pass through the same transformation logic, and be written to one destination database or to multiple destination databases. Sometimes, depending on the availability of the source, these loads also have to be scheduled at different times. This is an ideal case for creating a reusable session and performing session overrides at the session instance level for the database connections and pre- and post-session commands.

Logging load statistics, failure criteria and success criteria are usually common pieces of logic that are executed for multiple loads in most projects. Some of these common tasks include:
- Notification when the number of rows loaded is less than expected
- Notification when there are any reject rows, using email tasks and link conditions
- Successful completion notification based on success criteria such as the number of rows loaded, using email tasks and link conditions
- Failing the load based on failure criteria such as load statistics or the status of some critical session, using a Control task
- Stopping or aborting a workflow based on some failure criteria, using a Control task
- Calculating, based on previous session completion times, the amount of time a downstream session has to wait before it can start, using worklet variables, a Timer task and an Assignment task

Reusable worklets can be developed to encapsulate the above-mentioned tasks and can be used in multiple loads. By passing workflow variable values to the worklets and assigning them to worklet variables, one can easily encapsulate common workflow logic.
Mappings
A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table.

The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single 'object' in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system).

Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source. By simply focusing on the target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data. The table would look similar to the following:
Number | Target Table   | Source
1      | Customers      | Cust_File
2      | Products       | Items
3      | Customer_Type  | Cust_File
4      | Orders_Item    | Tickets
4      | Orders_Item    | Ticket_Items
When completed, the spreadsheet can be sorted either by target table or by source table. Sorting by source table can help determine potential mappings that create multiple targets. When using a source to populate multiple tables at once for efficiency, be sure to keep restartability and reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, the Customers table and the Customer_Type table can potentially be loaded in the same mapping. When merging targets into one mapping in this manner, give both targets the same number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. The resulting inventory would look similar to the following:

Number | Target Table              | Source
1      | Customers, Customer_Type  | Cust_File
2      | Products                  | Items
4      | Orders_Item               | Tickets, Ticket_Items
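The consolidation step just described (grouping rows that share a mapping number into a single inventory entry) is easy to automate once the spreadsheet is exported. A minimal sketch follows, using hypothetical rows that match the example above; the data structure and printout format are illustrative only.

```python
from collections import defaultdict

# (mapping number, target table, source) rows from the first spreadsheet,
# after the Customers/Customer_Type merge gave both targets number 1.
rows = [
    (1, "Customers", "Cust_File"),
    (1, "Customer_Type", "Cust_File"),
    (2, "Products", "Items"),
    (4, "Orders_Item", "Tickets"),
    (4, "Orders_Item", "Ticket_Items"),
]

inventory = defaultdict(lambda: {"targets": set(), "sources": set()})
for number, target, source in rows:
    inventory[number]["targets"].add(target)
    inventory[number]["sources"].add(source)

for number in sorted(inventory):
    entry = inventory[number]
    print(number,
          ", ".join(sorted(entry["targets"])),
          "<-",
          ", ".join(sorted(entry["sources"])))
```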
At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name. Apply the naming standards generated in 3.2 Design Development Architecture. These names can then be used to distinguish mappings from one another and can also be put on the project plan as individual tasks.

Next, determine for the project a threshold for a high, medium, or low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:
- Low: 1 to 10,000 rows
- Medium: 10,000 to 100,000 rows
- High: 100,000+ rows

Assign a likely row volume (high, medium or low) to each of the mappings based on the expected volume of data to pass through the mapping. These high-level estimates will help to determine how many mappings are of 'high' volume; these mappings will be the first candidates for performance tuning. Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.
Last updated: 05-Jun-08 13:10
Metadata Reporting and Sharing

Challenge
Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.
Description
The Informatica tool suite can capture extensive levels of metadata, but the amount of metadata that is entered depends on the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings, sources, targets, transformations, ports, etc.). In addition, all information about column size and scale, data types, and primary keys is stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it also requires extra time and effort to do so. Once that information is in the Informatica repository, however, it can be retrieved at any time using the Metadata Reporter. Several out-of-the-box reports are provided, and customized reports can also be created to view that information. Several options are available for exporting these reports (e.g., to an Excel spreadsheet or an Adobe .pdf file).

Informatica offers two ways to access the repository metadata:
- Metadata Reporter, a web-based application that allows you to run reports against the repository metadata. This is a very comprehensive tool that is powered by the functionality of Informatica's BI reporting tool, Data Analyzer. It is included on the PowerCenter CD.
- Because Informatica does not support or recommend direct reporting access to the repository, even for select-only queries, the second way of repository metadata reporting is through the use of views written using Metadata Exchange (MX).
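For the MX-view route, reporting tools typically issue plain SQL against the views. The sketch below shows the general pattern from Python over ODBC; the DSN, credentials, and the specific view and column names used here (REP_ALL_MAPPINGS with SUBJECT_AREA and MAPPING_NAME) are assumptions that should be verified against the MX views reference for your PowerCenter release.

```python
import pyodbc  # assumes the pyodbc package and an ODBC DSN for the repository database

# Hypothetical connection details; replace with your repository DSN and account.
conn = pyodbc.connect("DSN=pc_repo;UID=rep_reader;PWD=secret")
cursor = conn.cursor()

# REP_ALL_MAPPINGS is assumed here to be one of the MX views; confirm the view
# and column names in the Metadata Exchange (MX) views documentation.
cursor.execute(
    "SELECT SUBJECT_AREA, MAPPING_NAME FROM REP_ALL_MAPPINGS ORDER BY SUBJECT_AREA"
)
for folder, mapping in cursor.fetchall():
    print(folder, mapping)
conn.close()
```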
Metadata Reporter
The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository.

The architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup. Metadata Reporter setup requires the following XML files to be imported from the PowerCenter CD in the sequence listed below:
- Schemas.xml
- Schedule.xml
- GlobalVariables_Oracle.xml (this file is database specific; Informatica provides GlobalVariables files for DB2, SQL Server, Sybase and Teradata, so select the appropriate file based on your PowerCenter repository environment)
- Reports.xml
- Dashboards.xml

Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem importing these files. However, if you are using an existing instance of Data Analyzer that currently serves some other reporting purpose, be careful while importing these files. Some of the objects (e.g., global variables, schedules, etc.) may already exist with the same name. You can rename the conflicting objects.

The following folders are created in Data Analyzer when you import the files listed above:
- Data Analyzer Metadata Reporting: contains reports for the Data Analyzer repository itself (e.g., Todays Logins, Reports Accessed by Users Today).
- PowerCenter Metadata Reports: contains reports for the PowerCenter repository. To better organize reports based on their functionality, these reports are further grouped into the following subfolders:
  - Configuration Management: a set of reports that provide detailed information on configuration management, including deployment and label details. Subfolders: Deployment, Label, Object Version.
  - Operations: a set of reports that enable users to analyze operational statistics, including server load, connection usage, run times, load times, number of runtime errors, etc., for workflows, worklets and sessions. Subfolders: Session Execution, Workflow Execution.
  - PowerCenter Objects: a set of reports that enable users to identify all types of PowerCenter objects, their properties, and their interdependencies on other objects within the repository. Subfolders: Mappings, Mapplets, Metadata Extensions, Server Grids, Sessions, Sources, Targets, Transformations, Workflows, Worklets.
  - Security: a set of reports that provide detailed information on the users, groups and their association within the repository.

Informatica recommends retaining this folder organization, adding new folders if necessary. The Metadata Reporter provides 44 standard reports, which can be customized with the use of parameters and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica client tools being installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform. (Note: you can also use the JDBC-ODBC bridge to connect to the repository, using the syntax jdbc:odbc:<data source name>.)
Reports for the PowerCenter Repository

1. Deployment Group (Folder: Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group): Displays deployment groups by repository.
2. Deployment Group History (Folder: Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group History): Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.
3. Labels (Folder: Public Folders>PowerCenter Metadata Reports>Configuration Management>Labels>Labels): Displays labels created in the repository for any versioned object, by repository.
4. All Object Version History (Folder: Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History): Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.
5. Server Load by Day of Week (Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week): Displays the total number of sessions that ran, and the total session run duration, for any day of the week in any given month of the year, by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details (Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details): Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.
7. Target Table Load Analysis (Last Month) (Folder: Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)): Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.
8. Workflow Run Details (Folder: Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details): Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.
9. Worklet Run Details (Folder: Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details): Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.
10. Mapping List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List): Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.
11. Mapping Lookup Transformations (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations): Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.
12. Mapping Shortcuts (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts): Displays mappings defined as a shortcut by repository and folder.
13. Source to Target Dependency (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency): Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List): Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.
15. Mapplet Lookup Transformations (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations): Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.
16. Mapplet Shortcuts (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts): Displays mapplets defined as a shortcut by repository and folder.
17. Unused Mapplets in Mappings (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings): Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Metadata Extensions Usage (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage): Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19. Server Grid List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List): Displays all server grids and the servers associated with each grid. Information includes the host name, port number, and internet protocol address of the servers.
20. Session List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List): Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.
21. Source List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List): Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.
22. Source Shortcuts (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts): Displays sources that are defined as shortcuts by repository and folder.
23. Target List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List): Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.
24. Target Shortcuts (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts): Displays targets that are defined as shortcuts by repository and folder.
25. Transformation List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List): Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.
26. Transformation Shortcuts (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts): Displays transformations that are defined as shortcuts by repository and folder.
27. Scheduler (Reusable) List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List): Displays all the reusable schedulers defined in the repository and their descriptions and properties by repository by folder. This is a primary report in an analytic workflow.
28. Workflow List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List): Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.
29. Worklet List (Folder: Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List): Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.
30. Users By Group (Folder: Public Folders>PowerCenter Metadata Reports>Security>Users By Group): Displays users by repository and group.

Reports for the Data Analyzer Repository

1. Bottom 10 Least Accessed Reports this Year (Folder: Public Folders>Data Analyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year): Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
2. Report Activity Details (Folder: Public Folders>Data Analyzer Metadata Reporting>Report Activity Details): Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year" and "Usage by Login (Month To Date)".
3. Report Activity Details for Current Month (Folder: Public Folders>Data Analyzer Metadata Reporting>Report Activity Details for Current Month): Provides information about reports accessed in the current month, up to the current date.
4. Report Refresh Schedule (Folder: Public Folders>Data Analyzer Metadata Reporting>Report Refresh Schedule): Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.
5. Reports Accessed by Users Today (Folder: Public Folders>Data Analyzer Metadata Reporting>Reports Accessed by Users Today): Part of the analytic workflow for "Todays Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.
6. Todays Logins (Folder: Public Folders>Data Analyzer Metadata Reporting>Todays Logins): Provides the login count and average login duration for users who logged in today.
7. Todays Report Usage by Hour (Folder: Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour): Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and the users who accessed them during the selected hour.
8. Top 10 Most Accessed Reports this Year (Folder: Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year): Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9. Top 5 Logins (Month To Date) (Folder: Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
10. Top 5 Longest Running On-Demand Reports (Month To Date) (Folder: Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)): Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and average Data Analyzer response time (all in seconds) for each report shown.
11. Top 5 Longest Running Scheduled Reports (Month To Date) (Folder: Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)): Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.
12. Total Schedule Errors for Today (Folder: Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today): Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
13. User Logins (Month To Date) (Folder: Public Folders>Data Analyzer Metadata Reporting>User Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
14. Users Who Have Never Logged On (Folder: Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On): Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.
Customizing a Report or Creating New Reports
Once you select the report, you can customize it by setting the parameter values and/or creating new attributes or metrics. Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.
Wildcards
The Metadata Reporter supports two wildcard characters:
Percent symbol (%) - represents any number of characters and spaces.
Underscore (_) - represents one character or space.
You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is the same as using %. The following examples show how you can use the wildcards to set parameters.
Suppose you have the following values available to select: items, items_in_promotions, order_items, promotions

The following list shows the return values for some wildcard combinations you can use (the patterns shown are representative examples):
% returns all values: items, items_in_promotions, order_items, promotions
%items returns items, order_items
items returns items
items% returns items, items_in_promotions
%promotions returns items_in_promotions, promotions
Other combinations return other subsets, such as items, items_in_promotions, and promotions, depending on how the wildcards are placed.
A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document. For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.
Security Awareness for Metadata Reporter
Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production, the Administrator can create some reports and export them to files that can be distributed to the user community. If the number of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data Analyzer documentation.
Metadata Exchange: the Second Generation (MX2)
The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.
Informatica currently supports the second generation of Metadata Exchange, called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX. The primary requirements and features of MX2 are:
Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages.
Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independent of any of the Informatica software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.
Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.
Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.
Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.
Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.
Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.
Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of MX2.
Last updated: 27-May-08 12:03
Repository Tables & Metadata Management Challenge Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.
Description Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted information from the repository help maintain the repository for better performance.
Managing Repository The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.
Repository backup
Repository backups can be performed using the client tool Repository Server Admin Console or the command line program pmrep. Backups using pmrep can be automated and scheduled for regular execution.
A backup shell script that calls pmrep can be scheduled to run as a cron job for regular backups. Alternatively, such a script can be called from PowerCenter via a command task; the command task can be placed in a workflow and scheduled to run daily.
The following paragraphs describe some useful practices for maintaining backups:
Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is recommended once a month or prior to a major release. For development repositories, a backup is recommended once a week or once a day, depending upon the team size.
Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as WinZip or gzip.
Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.
Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible.
Restore repository
Although the repository restore function is used primarily as part of disaster recovery, it can also be useful for testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the client tool, Repository Server Administration Console, or the command line program pmrepagent.
Restore folders
There is no easy way to restore only one particular folder from a backup. First, the backup must be restored into a new repository; then you can use the client tool, Repository Manager, to copy the entire folder from the restored repository into the target repository.
Remove older versions Use the purge command to remove older versions of objects from repository. To purge a specific version of an object, view the history of the object, select the version, and purge it.
Finding deleted objects and removing them from the repository
If a PowerCenter repository is enabled for versioning through the Team Based Development option, objects that have been deleted from the repository are no longer visible in the client tools. To list or view deleted objects, use the find checkouts command in the client tools, a query generated in the Repository Manager, or a specific query.
After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions of a deleted object to completely remove it from the repository.
Truncating Logs You can truncate the log information (for sessions and workflows) stored in the repository either by using repository manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder. Options allow truncating all log entries or selected entries based on date and time.
Repository Performance Analyzing (or updating the statistics) of repository tables can help to improve the repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a command task to call the script.
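As an illustration, a minimal sketch of such a statistics step for an Oracle-hosted repository is shown below; the schema name INFA_REP is a placeholder for your repository database user, and the sampling options are assumptions that should be adjusted to your site standards.

-- Gather optimizer statistics for every table and index in the repository schema.
-- Run from SQL*Plus or another SQL client, as a DBA or as the repository schema owner.
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname          => 'INFA_REP',                  -- repository schema name (placeholder)
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, -- let Oracle choose the sample size
    cascade          => TRUE                         -- also gather index statistics
  );
END;
/

Saved as a .sql file, a block like this can be invoked by the external scheduler or by the PowerCenter command task described above.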
Repository Agent and Repository Server performance Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various causes should be analyzed and the repository server (or agent) configuration file modified to improve performance.
Managing Metadata The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.
Failed Sessions
The following query lists the failed sessions in the last day. To make it work for the last 'n' days, replace SYSDATE-1 with SYSDATE - n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Last_Error AS Error_Message,
       DECODE (Run_Status_Code, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Status,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code != 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
Long running Sessions
The following query lists long running sessions in the last day. To make it work for the last 'n' days, replace SYSDATE-1 with SYSDATE - n.

SELECT Subject_Area AS Folder,
       Session_Name,
       Successful_Source_Rows AS Source_Rows,
       Successful_Rows AS Target_Rows,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code = 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
AND    (Session_TimeStamp - Actual_Start) > (10/(24*60))
ORDER  BY Session_TimeStamp
Invalid Tasks
The following query lists folder names, task types, task names, version numbers, and last-saved dates for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,
       DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
       TASK_NAME AS OBJECT_NAME,
       VERSION_NUMBER, -- comment out for V6
       LAST_SAVED
FROM   REP_ALL_TASKS
WHERE  IS_VALID = 0
AND    IS_ENABLED = 1
--AND  CHECKOUT_USER_ID = 0 -- Comment out for V6
--AND  is_visible = 1 -- Comment out for V6
ORDER  BY SUBJECT_AREA, TASK_NAME
Load Counts
The following query lists the load counts (number of rows loaded) for sessions run in the last day, along with their status.

SELECT subject_area,
       workflow_name,
       session_name,
       DECODE (Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
       successful_rows,
       failed_rows,
       actual_start
FROM   REP_SESS_LOG
WHERE  TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER  BY subject_area, workflow_name, session_name, Session_Status
Last updated: 27-May-08 12:04
Using Metadata Extensions Challenge To provide for efficient documentation and achieve extended metadata reporting through the use of metadata extensions in repository objects.
Description
Metadata extensions, as the name implies, help you to extend the metadata stored in the repository by associating information with individual objects in the repository. Informatica client applications can contain two types of metadata extensions: vendor-defined and user-defined.
Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.
You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type, so when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable. Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit; it is not available for other targets. You can promote a non-reusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable.
Metadata extensions can be created for the following repository objects:
Source definitions
Target definitions
Transformations (Expressions, Filters, etc.)
Mappings
Mapplets
Sessions
Tasks
Workflows
Worklets
Metadata extensions offer a very easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owner's name and contact information with the mapping, or when you create a source definition, you can enter the name of the person who created or imported the source.
The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab. Anyone who creates or edits a source can enter the name of the person who created the source into this field.
You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager, and you can create reusable metadata extensions in the Workflow Manager or Designer.
You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions, and Description.
Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable, and reusable metadata extensions are repository-wide.
You can also migrate metadata extensions from one environment to another. When you perform a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension is copied as a non-reusable metadata extension in the target repository. A reusable metadata extension is copied as reusable in the target repository, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values.
Metadata extensions provide for extended metadata reporting capabilities. Using the Informatica MX2 API, you can create useful reports on metadata extensions. For example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++, and the Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications.
Additionally, metadata extensions can be populated via data modeling tools such as ERwin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange for Data Models. With Informatica Metadata Exchange for Data Models, the Informatica repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended properties are the descriptive, user-defined, and other properties derived from your data modeling tool, and you can map any of these properties to the metadata extensions that are already defined in the source or target object in the Informatica repository.
Last updated: 27-May-08 12:04
Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance
Challenge
The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and under-appreciated. This repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes. To address this challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports provide a wealth of useful information about PowerCenter object metadata and operational metadata that can be used for quality assurance.
Description
Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:
A shared template for each type of mapping.
Checklists to guide the developer through the process of adapting the template to the mapping requirements.
Macros/scripts to generate productivity aids such as SQL overrides.
It is easier to ensure quality from a standardized base than to rely on developers to accurately repeat the same basic keystrokes.
Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards that categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide logical access paths into the information in the repository; names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata Manager, this opens the door to an automated QA strategy.
For example, consider the following situation: the EXTRACT mapping/session should always truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a target. Possible code errors in this respect can be identified as follows:
Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
Develop a query on the repository to search for sessions named EXTRACT that do not have the truncate target option set.
Develop a query on the repository to search for sessions named TRANSFORM or LOAD that do have the truncate target option set.
Provide a facility to allow developers to run both queries before releasing code to the test environment.
Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as expressions) in a mapping. These can be very easily identified from the MX view REP_MAPPING_UNCONN_PORTS (see the example query after the overview steps below).
The following bullets represent a high-level overview of the steps involved in automating QA:
Review the transformations/mappings/sessions/workflows and allocate them to broadly representative categories.
Identify the key attributes of each category.
Define naming standards to identify the category for transformations/mappings/sessions/workflows.
Analyze the MX Views to source the key attributes.
Develop the query to compare actual and expected attributes for each category.
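As an illustration of the unconnected-ports standard above, a hedged sketch of such an exception query is shown below. The view name comes from the text; the column names subject_area and mapping_name are assumptions and should be verified against the MX Views reference for your PowerCenter version before use.

-- List mappings that contain unconnected transformation output ports,
-- grouped by folder and mapping, for review before release to test.
SELECT subject_area AS folder_name,
       mapping_name,
       COUNT(*)     AS unconnected_ports
FROM   rep_mapping_unconn_ports
GROUP  BY subject_area, mapping_name
ORDER  BY subject_area, mapping_name

A similar query against the session-related MX views, filtering on session names that start with EXTRACT, TRANSFORM, or LOAD, can implement the truncate-target checks described in the first example.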
After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes for developers to run before releasing code into any test environment. Such a utility may incorporate the following processing stages:
Execute a profile to assign environment variables (e.g., repository schema user, password, etc.).
Select the folder to be reviewed.
Execute the query to find exceptions.
Report the exceptions in an accessible format.
Exit with failure if exceptions are found.
TIP
Remember that any queries on the repository that bypass the MX views will require modification if subsequent upgrades to PowerCenter occur; as such, they are not recommended by Informatica.
The principal objective of any QA strategy is to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.
Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance
The need for the Informatica Metadata Reporter was identified from a number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations. In this section, we focus primarily on how these reports and custom reports can help ease the QA process.
The following reports can help identify regressions in load performance:
Session Run Details
Workflow Run Details
Worklet Run Details
Server Load by Day of the Week can help determine the load on the server before and after QA migrations and may help balance the loads through the week by modifying the schedules. The Target Table Load Analysis can help identify any data regressions in the number of records loaded in each target (if a baseline was established before the migration/upgrade). The Failed Session report lists failed sessions at a glance, which is very helpful after a major QA migration or during QA of an Informatica upgrade process.
During large deployments to QA, the code review team can look at the following reports to determine whether standards (i.e., naming standards, comments for repository objects, metadata extensions usage, etc.) were followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for review because the reviewer does not need to open each mapping and check for these details. All of the following are out-of-the-box reports provided by Informatica:
Label report
Mappings list
Mapping shortcuts
Mapping lookup transformation
Mapplet list
Mapplet shortcuts
Mapplet lookup transformation
Metadata extensions usage
Sessions list
Worklets list
Workflows list
Source list
Target list
Custom reports based on the review requirements
In addition, note that the following reports are also useful during migration and upgrade processes:
The invalid object reports and the deployment group report in the QA repository help to determine which deployments caused the invalidations.
The invalid object report against the development repository helps to identify the invalid objects that are part of a deployment before QA migration.
The invalid object report helps in QA of an Informatica upgrade process.
The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports installation:
1. Deployment Group: Displays deployment groups by repository.
2. Deployment Group History: Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates.
3. Labels: Displays labels created in the repository for any versioned object by repository.
4. All Object Version History: Displays all versions of an object by the date the object is saved in the repository.
5. Server Load by Day of Week: Displays the total number of sessions that ran, and the total session run duration, for any day of the week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details: Displays session run details for any start date by repository by folder.
7. Target Table Load Analysis (Last Month): Displays the load statistics for each table for the last month by repository by folder.
8. Workflow Run Details: Displays the run statistics of all workflows by repository by folder.
9. Worklet Run Details: Displays the run statistics of all worklets by repository by folder.
10. Mapping List: Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets.
11. Mapping Lookup Transformations: Displays Lookup transformations used in a mapping by repository and folder.
12. Mapping Shortcuts: Displays mappings defined as a shortcut by repository and folder.
13. Source to Target Dependency: Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List: Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets.
15. Mapplet Lookup Transformations: Displays all Lookup transformations used in a mapplet by folder and repository.
16. Mapplet Shortcuts: Displays mapplets defined as a shortcut by repository and folder.
17. Unused Mapplets in Mappings: Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Metadata Extensions Usage: Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19. Server Grid List: Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.
20. Session List: Displays all sessions and their properties by repository by folder. This is a primary report in a data integration workflow.
21. Source List: Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in a data integration workflow.
22. Source Shortcuts: Displays sources that are defined as shortcuts by repository and folder.
23. Target List: Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in a data integration workflow.
24. Target Shortcuts: Displays targets that are defined as shortcuts by repository and folder.
25. Transformation List: Displays transformations defined by repository and folder. This is a primary report in a data integration workflow.
26. Transformation Shortcuts: Displays transformations that are defined as shortcuts by repository and folder.
27. Scheduler (Reusable) List: Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder.
28. Workflow List: Displays workflows and workflow properties by repository by folder.
29. Worklet List: Displays worklets and worklet properties by repository by folder.
Last updated: 05-Jun-08 13:27
Configuring Standard Metadata Resources Challenge Metadata that is derived from a variety of sources and tools is often disparate and fragmented. To be of value, metadata needs to be consolidated into a central repository. Informatica's Metadata Manager provides a central repository for the capture and analysis of critical metadata. Before you can browse and search metadata in the Metadata Manager warehouse, you must configure Metadata Manager, create resources, and then load the resource metadata.
Description
Informatica Metadata Manager is a web-based metadata management tool that you can use to browse and analyze metadata from disparate metadata repositories. Metadata Manager helps you understand and manage how information and processes are derived. It also helps you understand the fundamental relationships between information and processes, and how they are used. Metadata Manager extracts metadata from application, business intelligence, data integration, data modeling, and relational metadata sources. Metadata Manager uses PowerCenter workflows to extract metadata from metadata sources and load it into a centralized metadata warehouse called the Metadata Manager warehouse.
Metadata Manager uses resources to represent the metadata it manages. Each resource represents metadata from a metadata source. Metadata Manager shows the metadata for each resource in the metadata catalog. The metadata catalog is a hierarchical representation of the metadata in the Metadata Manager warehouse.
There are several steps to configure a standard resource in Metadata Manager. It is very important to identify, set up, and test your resource connections before configuring a resource in Metadata Manager. Informatica recommends creating naming standards, usually prefixed by the metadata source type, for the Metadata Manager application (e.g., for a SQL Server relational database, use 'SS_databasename_schemaname'). The steps below describe how to load metadata from a metadata source into the Metadata Manager warehouse. Each detailed section shows the information needed for individual standard resource types.
Loading Metadata Resource into Metadata Manager Warehouse The Load page in the Metadata Manager Application is used to create and load resources into the Metadata Manager warehouse. Use the Load page to monitor and schedule resource loads, purge metadata from the Metadata Manager warehouse, and manage the search index. Complete the following steps to load metadata from a metadata source into the Metadata Manager warehouse: 1. Set up Metadata Manager and metadata sources. Create a Metadata Manager Service, install the Metadata Manager, and configure the metadata sources from which you want to extract metadata. 2. Create resources. Create resources that represent the metadata sources from which you want to extract metadata. 3. Configure resources. Configure the resources, including metadata source files and direct source connections, parameters, and connection assignments. You can also purge metadata for a previously loaded resource and update the index for resources. 4. Load and monitor resources. Load a resource to load the metadata for the resource into the Metadata Manager warehouse. When you load a resource, Metadata Manager extracts and loads the metadata for the resource. You can monitor the status of all resources and the status of individual resources. You can also schedule resource loads. 5. Manage resource and object permissions for Metadata Manager users. You can configure the resources and metadata objects in the warehouse for which Metadata Manager users have access. Use Metadata Manager command line programs to load resources, monitor the status of resource loads and PowerCenter workflows, and back up and restore the Metadata Manager repository.
Configure Metadata Resources Before you configure resources and load metadata into the Metadata Manager warehouse, you must configure the metadata sources. For metadata sources that use a source file, you select the source file when you configure the resource. If you do not correctly configure the metadata sources, the metadata load can fail or the metadata can be incorrectly loaded in the Metadata Manager warehouse. Table 2-1 describes the configuration tasks for the metadata sources:
Table 2-1. Metadata Source Configuration Tasks Metadata Source Type
Metadata Source
Tasks
Application
SAP
Install SAP transports and configure permissions. For more information, see SAP.
Business Intelligence
Business Objects
Export documents, universes, and Crystal Reports to a repository. For more information, see Business Objects.
Cognos ReportNet Content Manager
Verify that you have access to the ReportNet URL. Metadata Manager uses the ReportNet URL to access the source repository metadata.
Cognos Impromptu
Use the Cognos client tool to export the metadata to a .cat file.
Hyperion Essbase
Export metadata to an XML file. For more information, see Hyperion Essbase.
IBM DB2 CubeViews
Export metadata to an XML file. For more information, see IBM DB2 Cube Views.
Microstrategy
Configure the database user account and projects. For more information, see Microstrategy.
Data Integration
PowerCenter
Metadata Manager extracts the latest version of objects that are checked into the PowerCenter repository. Check in all metadata objects that you want to extract from the PowerCenter repository. For more information, see PowerCenter.
Database Management
IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase, Teradata
Configure the permissions for the database user account. For more information, see Relational Database Sources.
Data Modeling*
Embarcadero ERStudio
Use the ERStudio client tool to export the metadata to a .dm1 file.
ERwin
Export metadata. For more information, see Erwin.
Oracle Designer
Use the Oracle Designer client tool to export the metadata to a .dat file.
Rational Rose Use the Rational Rose client tool to export the metadata to an .mdl file. ER
Sybase Use the Sybase PowerDesigner client tool to save the model to a .pdm file in XML format. PowerDesigner
Visio
Use the Visio client tool to export the metadata to an .erx file.
Custom
Custom
Export metadata to a .csv or .txt file. For more information, see Custom Metadata Sources.
* You can load multiple models from the same data modeling tool source. For more information, see Data Modeling Tool Sources.
Standard Resource Types Business Objects The Business Objects Resource requires you to install Business Object Designer on the machine hosting the Metadata Manager console and to provide user name and password to access Business Objects repository. Export the Business Objects universes, documents, and Crystal Reports to the Business Objects source repository. You can extract documents, universes, and Crystal
Reports that have been exported to the source repository. You cannot extract metadata from documents or universes. Export from source repositories to make sure that the metadata in the Metadata Manager warehouse is consistent with the metadata that is distributed to Business Objects users. Use Business Objects Designer to export a universe to the Business Objects source repository. For example, to begin the export process in Business Objects Designer, click File > Export. You must secure a connection type to export a universe to a Business Objects source repository. Use Business Objects to export a document to the Business Objects repository. For example, to begin the export process in Business Objects, click File > Publish To > Corporate Documents. Use the Business Objects Central Manager Console to export Crystal Reports to the Business Objects repository. The screenshot below displays the information you will need to add the resource.
Custom Metadata Sources If you create a custom resource and use a metadata source file, you must export the metadata to a metadata file with a .csv or .txt file extension. When you configure the custom resource, you specify the metadata file.
Data Modeling Tool Sources You can load multiple models from a data modeling tool into the Metadata Manager warehouse. After you load the metadata, the Metadata Manager catalog shows models from the same modeling tool under the resource name. This requirement applies to all data modeling tool resource types.
Erwin / ER-Studio Metadata Manager extracts ERwin metadata from a metadata file. When you configure the connection to the ERwin source repository in Metadata Manager, you specify the metadata file. The required format for the metadata file depends on the version of the ERwin source repository. The following table specifies the required file type format for each supported version: Version
File Type
ERwin 3.0 to 3.5.2
.erx
ERwin 4.0 SP1 to 4.1
.er1 or .xml
Erwin 7.x
.erwin or .xml
The screenshot below displays the information you will need to add the resource.
Hyperion Essbase Use the Hyperion Essbase client tool to export the metadata to an .xml file. Metadata Manager extracts Hyperion Essbase metadata from a metadata file with an .xml file extension. When you set up the resource for Hyperion Essbase in the Metadata Manager, you specify the metadata file. Use the Hyperion Essbase Integration Server to export the source metadata to an XML file. Export one model to each metadata file. To export the Hyperion model to an XML file: 1. 2. 3. 4. 5. 6. 7.
Log in to Hyperion Essbase Integration Server. Create the Hyperion source or open an existing model. Click File > Save to save the model if you created or updated it. Click File > XML Import/Export. On the Export tab, select the model. Click Save As XML File. A pop-up window appears. Select the location where you want to store the XML file.
The screenshot below displays the information you will need to add the resource.
IBM DB2 Cube Views
Use the IBM DB2 Cube Views OLAP Center GUI to export cube models to .xml files. When you configure the resource for DB2 Cube Views in Metadata Manager, you specify the metadata files.
TIP
You can load multiple cube models into the Metadata Manager warehouse. Export each cube model into a separate .xml file and name the file with the same name as the cube model. If you export multiple cube models into an .xml file, export the same cube models into the same .xml file each time you export them.
The screenshot below displays the information you will need to add the resource.
Microstrategy To configure Microstrategy, complete the following tasks: Configure permissions. Configure multiple projects (optional). The screenshot below displays the information you will need to add the resource.
Configure Permissions
The Microstrategy project user account for which you provide the user name and password must have the Bypass All Object Security Access Checks administration privilege. You set this privilege in the Microstrategy Desktop client tool. Note: Although Microstrategy allows you to connect to a project source using database or network authentication, Metadata Manager uses project source authentication. Configure Multiple Projects in the Same Metadata File Microstrategy projects can be from different project sources. You can load multiple Microstrategy projects under the same Microstrategy resource. You must provide the user name and password for each project source. Project names must be unique. When you configure the Microstrategy resource, you specify the project source, project, user name, and password for each project.
PowerCenter The screenshot below displays the information you will need to add the resource.
Relational Database Sources
Configure the permissions for the IBM DB2 UDB, Informix, Microsoft SQL Server, Oracle, Sybase ASE, and Teradata database user account. The database user account you use to connect to the metadata source must have SELECT permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms.
Note: For Oracle resources, the user account must also have the SELECT_CATALOG_ROLE permission.
DB2 Resource jdbc:informatica:db2://host_name:port;DatabaseName=database_name
Informix Resource
jdbc:informatica:informix://host_name:port;InformixServer=server_name;DatabaseName=database_name

SQL Server Resource
jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name
Connection String:
For default instance: SQL Server Name@Database Name
For named instance: Server Name\Instance Name@Database Name
Oracle Resource jdbc:informatica:oracle://host_name:port;SID=sid Connect String: Oracle instance name If the metadata in the Oracle source database contains unicode characters, set the NLS_LENGTH_SEMANTICS parameter to CHAR from BYTE. Specify a user name and password to access the Oracle database metadata. Be sure that the user has the Select Any Table privilege and Select Permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Also ensure the user has Select Permissions on the SYS.v_$instance. One Resource is needed for each Oracle instance.
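A minimal sketch of the corresponding Oracle grants might look like the following; MM_USER is a placeholder for the account used by the Metadata Manager resource, and your DBA may prefer narrower object-level grants over the broad SELECT ANY TABLE privilege.

-- Privileges for the Metadata Manager Oracle resource user (MM_USER is a placeholder)
GRANT SELECT ANY TABLE TO MM_USER;
GRANT SELECT_CATALOG_ROLE TO MM_USER;
GRANT SELECT ON SYS.V_$INSTANCE TO MM_USER;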
Teradata Resource: jdbc:teradata://database_server_name/Database=database_name Connect String: Be sure that the user has access to all the system “DBC” tables.
SAP To configure SAP, complete the following tasks: Install PowerCenter transports Configure user authorization profile Installing Transports To extract metadata from SAP, you must install PowerCenter transports. The transports are located in the following folder in the location where you downloaded PowerCenter:
Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and higher:
Transport Requests: XCONNECT_DESIGN_R900116.R46, XCONNECT_DESIGN_K900116.R46, R46K900084
Functionality: For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: Supports Table Metadata Extraction for SAP in Metadata Manager.

Transports for Unicode versions 4.7 and higher:
Transport Requests: XCONNECT_DESIGN_R900109.U47, XCONNECT_DESIGN_K900109.U47, U47K900109
Functionality: For mySAP R/3 ECC (R/3) and mySAP add-on components, including CRM, BW, and APO: Supports Table Metadata Extraction for SAP in Metadata Manager.
You must install the other mySAP transports before you install the transports for Metadata Manager. Configure User Authorization Profile The SAP administrator needs to create the product and development user authorization profile.
Table 2-3 describes the user authorization profile:
Table 2-3. SAP User Authorization Profile
Authorization Object: S_RFC
Description: Authorization check for RFC access.
Class: Cross Application Authorization objects
Field Values: Activity: 16 (Execute); Name of RFC to be protected: *; Type of RFC object to be protected: FUGR
Last updated: 02-Jun-08 22:53
Custom Metadata with Metadata Manager Challenge This Best Practice describes some of the customizations that can be done to a standard out-of-the-box Metadata Manager Instance.
Description
There are a couple of areas in which the Metadata Manager Application can be extended to meet specific needs. Customizations include extending out-of-the-box meta-models to include new attributes and implementing Custom Resources (i.e., XConnects) to import metadata from sources that are not supported by Informatica.
Metadata Manager Custom Resources
Metadata Manager provides an interface to import metadata from sources that are not supported by the tool by default. Users can create their own custom metadata imports (i.e., Custom Resources/XConnects) to import metadata from a variety of non-standard sources. Custom Resources can be created for a variety of metadata sources, including modeling tools, databases, and ETL code written outside of PowerCenter. Custom Resources can also be created to populate Business Glossaries. At a high level, the task of creating a Custom Resource can be broken down into the following steps.
1. Design the meta-model. The meta-model defines the kind of metadata that a resource may hold. Users do this by defining classes that represent different metadata elements. Users can set up class hierarchies and properties for each class.
2. Create a Custom Resource. The resource is the placeholder for metadata from a particular source. Resources are based on a particular meta-model.
3. Design the metadata file templates. Custom Resources read the metadata from flat files and load them into the Metadata Manager repository. Before creating a Custom Resource, design the input file layout for this Custom Resource (an illustrative layout follows this list). At this point the files do not need to have the actual metadata; the Custom Metadata Configurator needs the file layout to generate the metadata load workflows.
4. Define the custom metadata rules (in the Custom Metadata Configurator) and generate the metadata load workflows. Metadata rules are set up in the Custom Metadata Configurator. The Metadata Manager Custom Metadata Integration Guide provides more information on the methodology and procedures for integrating custom metadata into the Metadata Manager warehouse. This guide assumes that users have knowledge of relational database concepts, models, and PowerCenter.
5. Define a process to extract metadata from metadata sources and create metadata files. In many situations, the metadata files are quite large and complex. For example, the metadata input file for a database Custom Resource has to include, at a minimum, the list of all tables and columns. Users are better off creating a process that connects to the metadata source and extracts metadata out of it into the format specified earlier. Quite often this process is a set of PowerCenter mappings and/or SQL/command-line scripts.
6. Load the Custom Resource. From the Load page, users can upload the metadata files onto the Metadata Manager server and load the resource. After this step users can browse and analyze the Custom Resource.
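The layout below is a purely illustrative sketch of what a metadata file for a hypothetical custom resource (with classes Database, Table, and Column) might look like. Every column name here is an assumption; the actual required layout is dictated by the classes and properties defined in your metamodel and by the rules you set up in the Custom Metadata Configurator, so follow the Metadata Manager Custom Metadata Integration Guide for the authoritative format.

class,object_name,parent_path,description
Database,SALES_DW,,Hypothetical sales data warehouse
Table,CUSTOMER,SALES_DW,Customer master table
Column,CUSTOMER_ID,SALES_DW/CUSTOMER,Surrogate key for the customer
Column,CUSTOMER_NAME,SALES_DW/CUSTOMER,Customer full name

Keeping one row per metadata object, with one column identifying the object's class and another identifying its parent, makes the extraction process in step 5 straightforward to implement as PowerCenter mappings or SQL scripts.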
Custom Metadata Attributes
Metadata Manager allows the user to extend out-of-the-box metadata by adding custom attributes to the existing meta-models. For example, an Oracle table can be extended to hold the business owner's name for that table. This is not an attribute that Oracle holds by default. By extending the model, users can manually add this value using Metadata Manager's interface. The Model page in Metadata Manager allows users to create and modify meta-models. After creating or editing the model for the custom metadata, add the metadata to the Metadata Manager warehouse. Users can also export and import the models, or export and import the metadata that they added to the metadata catalog. Custom attributes cannot be populated automatically; they have to be added and maintained manually.
Metadata Manager Reporting To access the Metadata Manager Reporting Service (aka Data Analyzer) from the Metadata Manager to run reports, complete the following steps: 1. Create a Reporting Service. Create a Reporting Service in the PowerCenter Administration Console and use the Metadata Manager repository as the data source. 2. Launch Metadata Manager Reporting. On the Metadata Manager Browse page, click Reports in the toolbar. If the user has the required privileges on the Reporting Service, Metadata Manager logs the user into the Data Analyzer instance being used for Metadata Manager. Then run the Metadata Manager reports. Metadata Manager includes the following types of reports: Primary reports. This is the top-level report in an analytic workflow. To access all lower-level reports in the analytic workflow, first run this report on the Analyze tab. Stand-alone reports. Unlike analytic workflow reports, these reports can be run independently of other reports. Workflow reports. These are the lower-level reports in an analytic workflow. To access a workflow report, first run the associated primary report and all workflow reports that precede the given workflow report. Use these reports to perform several types of analysis on metadata stored in the Metadata Manager warehouse. Metadata Manager prepackages reports for business intelligence, data modeling, data integration, database management and metamodel.
Customizing Metadata Manager Reporting Create new reporting elements and attributes under ‘Schema Design’. These elements can be used in new reports or existing report extensions. Out-of-the-box reports, indicators or dashboards can also be extended or customized. Informatica recommends using the ‘Save As’ new report option for such changes in order to avoid any conflicts during upgrades. The Metadata Manager Reports Reference gives a guideline of the reports and attributes being used. Further, use Data Analyzer's 12-3-4 report creation wizard to create new reports. Informatica recommends saving such reports in a new report folder to avoid conflict during upgrades.
Customizing Metadata Manager ODS Reports Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box reports can be used as a guideline for creating reports for other types of source repositories, such as a repository for which Metadata Manager does not prepackage a standard resource.
Business Glossary The Business Glossary is a categorized set of business terms. A business term defines, in business language, a concept that is relevant for the business users in the organization. A business term can include the business definition and usage of the concept, its rule and valid values for data. Business terms can be organized by categories. Using a Business Glossary one can easily relate business terms and concepts to the technical artifacts.
Business Glossary Model
The pre-packaged business glossary model consists of two classes: Category and Term. Additional classes (e.g., project, application, etc) can be added. The same term can appear in multiple categories in different glossaries.
Create a Glossary A glossary can be created through the UI or by importing from Excel or XML. A typical approach would be to create a sample glossary using the UI and export to Excel. The rest of the glossary items can then be filled in using Excel and imported back.
Glossary Approval Workflow Business Glossary terms can be taken through draft, review, and publish phases. Permissions can be used to control access to the terms during each phase. Email notifications can be sent to the appropriate users during each phase. An audit trail can also be viewed for each Business Glossary term as it is modified.
Valid Values Business Glossary terms can be linked to a reference table created using the reference table manager (RTM). This is useful in linking a list of valid values to a term. A list of product codes, gender codes, etc., are examples of reference tables. For more information refer to Metadata Manager Business Glossary.
Last updated: 17-Nov-10 15:55
Custom XConnect Implementation Challenge Metadata Manager uses XConnects to extract source repository metadata and load it into the Metadata Manager Warehouse. The Metadata Manager Configuration Console is used to run each XConnect. A Custom XConnect is needed to load metadata from a source repository for which Metadata Manager does not prepackage an out-of-the-box XConnect.
Description This document organizes all steps into phases, where each phase and step must be performed in the order presented. To integrate custom metadata, complete the tasks for the following phases: Design the Metamodel; Implement the Metamodel Design; Set up and run the custom XConnect; and Configure the reports and schema.
Prerequisites for Integrating Custom Metadata To integrate custom metadata, install Metadata Manager and the other required applications. The custom metadata integration process assumes knowledge of the following topics:
Common Warehouse Metamodel (CWM) and Informatica-Defined Metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica metamodel components supplement the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.
PowerCenter Functionality. During the metadata integration process, XConnects are configured and run. The XConnects run PowerCenter workflows that extract custom metadata and load it into the Metadata Manager Warehouse.
Data Analyzer Functionality. Metadata Manager embeds Data Analyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in Data Analyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with new or changed objects.
Design the Metamodel During this planning phase, the metamodel is designed; it will be implemented in the next phase. A metamodel is the logical structure that classifies the metadata from a particular repository type. Metamodels consist of classes, class associations, and packages, which group related classes and class associations. An XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations. This task consists of the following steps:
1. Identify Custom Classes. To identify custom classes, determine the various types of metadata in the source repository that need to be loaded into the Metadata Manager Warehouse. Each type of metadata corresponds to one class.
2. Identify Custom Class Properties. After identifying the custom classes, each custom class must be populated with properties (i.e., attributes) in order for Metadata Manager to track and report values belonging to class instances.
3. Map Custom Classes to CWM Classes. Metadata Manager prepackages all CWM classes, class properties, and class associations. To quickly develop a custom metamodel and reduce redundancy, reuse the predefined class properties and associations instead of recreating them. To determine which custom classes can inherit properties from CWM classes, map custom classes to the packaged CWM classes. Define any properties that cannot be inherited in Metadata Manager.
4. Determine the Metadata Tree Structure. Configure the way the metadata tree displays objects. Determine the groups of metadata objects in the metadata tree, then determine the hierarchy of the objects in the tree. Assign the TreeElement class as a base class to each custom class.
5. Identify Custom Class Associations. The metadata browser uses class associations to display metadata. For each identified class association, determine if you can reuse a predefined association from a CWM base class or if you need to manually define an association in Metadata Manager.
6. Identify Custom Packages. A package contains related classes and class associations. Multiple packages can be assigned to a repository type to define the structure of the metadata contained in the source repositories of that repository type. Create packages to group related classes and class associations.
To see an example of sample metamodel design specifications, see Appendix A in the Metadata Manager Custom Metadata Integration Guide.
Implement the Metamodel Design Using the metamodel design specifications from the previous task, implement the metamodel in Metadata Manager. This task includes the following steps:
1. Create the originator (i.e., owner) of the metamodel. When creating a new metamodel, specify the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in Metadata Manager, select ‘Customer’ as the originator type. Go to the Administration tab. Click Originators under Metamodel Management. Click Add to add a new originator. Fill out the requested information (Note: Domain Name, Name, and Type are mandatory fields). Click OK when you are finished.
2. Create the packages that contain the classes and associations of the metamodel. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package; parent packages are generally used to group child packages together. Go to the Administration tab. Click Packages under Metamodel Management. Click Add to add a new package. Fill out the requested information (Note: Name and Originator are mandatory fields). Choose the originator created above. Click OK when you are finished.
3. Create Custom Classes. In this step, create the custom classes identified in the metamodel design task. Go to the Administration tab. Click Classes under Metamodel Management. From the drop-down menu, select the package that you created in the step above. Click Add to create a new class. Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields). Base Classes: in order to see the metadata in the Metadata Manager metadata browser, you need to add at least the base class TreeElement. To do this: a. Click Add under Base Classes. b. Select the package. c. Under Classes, select TreeElement. d. Click OK (you should now see the class properties in the properties section). To add custom properties to your class, click Add, fill out the property information (Name, Data Type, and Display Label are mandatory fields), and click OK when you are done. Click OK at the top of the page to create the class. Repeat the above steps for additional classes.
4. Create Custom Class Associations. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes were added as base classes, and any of the class associations from the CWM base classes can be reused. Define those custom class associations that cannot be reused. If you only need the ElementOwnership association, skip this step. Go to the Administration tab. Click Associations under Metamodel Management. Click Add to add a new association. Fill out the requested information (all bold fields are required). Click OK when you are finished.
5. Create the Repository Type. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a Data Analyzer business intelligence repository type does not. Repository types maintain the uniqueness of each repository. Go to the Administration tab. Click Repository Types under Metamodel Management. Click Add to add a new repository type. Fill out the requested information (Note: Name and Product Type are mandatory fields). Click OK when you are finished.
6. Configure a Repository Type Root Class. Root classes display under the source repository in the metadata tree; all other objects appear under the root class. To configure a repository root class: Go to the Administration tab. Click Custom Repository Type Root Classes under Metamodel Management. Select the custom repository type. Optionally, select a package to limit the number of classes that display. Select the Root Class option for all applicable classes. Click Apply to apply the changes.
Set Up and Run the XConnect The objective of this task is to set up and run the custom XConnect. Custom XConnects involve a set of mappings that transform source metadata into the required format specified in the Informatica Metadata Extraction (IME) files. The custom XConnect extracts the metadata from the IME files and loads it into the Metadata Manager Warehouse. This task includes the following steps:
1. Determine which Metadata Manager Warehouse tables to load. Although you do not have to load all Metadata Manager Warehouse tables, you must load the following tables:
IMW_ELEMENT: The IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that element is used generically to mean packages, classes, or properties.
IMW_ELMNT_ATTR: The IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.
IMW_ELMNT_ASSOC: The IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.
To stop the metadata load into particular Metadata Manager Warehouse tables, disable the worklets that load those tables.
2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file. (The IME files are packaged with the Metadata Manager documentation.) Present the reformatted metadata in a valid source type format. To extract the reformatted metadata, the integration workflows require that the reformatted metadata be in one or more of the following source type formats: database table, database view, or flat file. Note that you can load metadata into a Metadata Manager Warehouse table using more than one of the accepted source type formats.
3. Register the Source Repository Instance in Metadata Manager. Before the custom XConnect can extract metadata, the source repository must be registered in Metadata Manager. When registering the source repository, the Metadata Manager application assigns a unique repository ID that identifies the source repository. Once registered, Metadata Manager adds an XConnect in the Configuration Console for the source repository. To register the source repository, go to the Metadata Manager web interface and register the repository under the custom repository type created above. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to the repository type. When defining the repository, provide descriptive information about the repository instance. To create the repository that will hold the metadata extracted from the source system: Go to the Administration tab. Click Repositories under Repository Management. Click Add to add a new repository. Fill out the requested information (Note: Name and Repository Type are mandatory fields). Choose the repository type created above. Click OK when finished.
4. Configure the Custom Parameter Files. Custom XConnects require that the parameter file be updated with the following information (an illustrative parameter file sketch follows this list): the source type (database table, database view, or flat file); the name of the database views or tables used to load the Metadata Manager Warehouse, if applicable; the list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable; and the worklets you want to enable and disable.
Understanding Metadata Manager Workflows for Custom Metadata:
wf_Load_IME: Custom workflow to extract and transform metadata from the source repository into IME format. This workflow is created by a developer.
Metadata Manager prepackages the following integration workflows for custom metadata. These workflows read the IME files mentioned above and load them into the Metadata Manager Warehouse.
WF_STATUS: Extracts and transforms statuses from any source repository and loads them into the Metadata Manager Warehouse. To resolve status IDs correctly, this workflow is configured to run before the WF_CUSTOM workflow.
WF_CUSTOM: Extracts and transforms custom metadata from IME files and loads that metadata into the Metadata Manager Warehouse.
5. Configure the Custom XConnect. The XConnect loads metadata into the Metadata Manager Warehouse based on the classes and class associations specified in the custom metamodel. When the custom repository type is defined, Metadata Manager registers the corresponding XConnect in the Configuration Console. Configure the XConnect in the Configuration Console as follows:
Under the Administration tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs. Then choose the workflows to load the metadata: the custom wf_Load_IME workflow, the Metadata Manager WF_CUSTOM workflow (prepackages all worklets and sessions required to populate all Metadata Manager Warehouse tables except the IMW_STATUS table), and the Metadata Manager WF_STATUS workflow (populates the IMW_STATUS table). Note: the Metadata Manager Server does not load Metadata Manager Warehouse tables that have disabled worklets.
Under the Administration tab, select Custom Workflow Configuration and choose the parameter file used by the workflows to load the metadata (the parameter file name is assigned at first data load). This parameter file name has the form nnnnn.par, where nnnnn is a five-digit integer assigned at the time of the first load of this source repository. The script promoting Metadata Manager from the development environment to test and from the test environment to production preserves this file name.
6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata at shorter intervals, such as every few days. The value depends on how often the Metadata Manager Warehouse needs to be updated. If the source does not provide the date when records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.
7. Run the Custom XConnect. Using the Configuration Console, Metadata Manager Administrators can run the custom XConnect and ensure that the metadata loads correctly. Note: When loading metadata with Effective From and Effective To Dates, Metadata Manager does not validate whether the Effective From Date is less than the Effective To Date; ensure that each Effective To Date is greater than the Effective From Date. If you do not supply Effective From and Effective To Dates, Metadata Manager sets the Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.
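As a minimal sketch only, a custom XConnect parameter file might resemble the following, using standard PowerCenter parameter file conventions. The [folder.WF:workflow] header and the $$SOURCE_TYPE and $$IME_ELEMENT_FILE_LIST names are hypothetical placeholders; only $$SRC_INCR_DATE appears in the steps above, and the real section headers, parameter names, and date format must be taken from the nnnnn.par file that Metadata Manager generates at the first load.

[MM_Custom.WF:WF_CUSTOM]
$$SOURCE_TYPE=FlatFile
$$IME_ELEMENT_FILE_LIST=ime_element_01.csv,ime_element_02.csv
$$SRC_INCR_DATE=01/01/2010

Edit only the generated file in place so that the promotion script described in step 5 continues to find it under the same nnnnn.par name.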
To Run a Custom XConnect Log in to the Configuration Console. Click Source Repository Management. Click Load next to the custom XConnect you want to run.
Configure the Reports and Schema The objective of this task is to set up the reporting environment, which is needed to run reports on the metadata stored in the Metadata Manager Warehouse. The setup of the reporting environment depends on the reporting requirements. The following options are available for creating reports:
Use the existing schema and reports. Metadata Manager contains prepackaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. Metadata Manager also provides impact analysis and lineage reports that provide information on any type of metadata.
Create new reports using the existing schema. Build new reports using the existing Metadata Manager metrics and attributes.
Create new Metadata Manager Warehouse tables and views to support the schema and reports. If the prepackaged Metadata Manager schema does not meet the reporting requirements, create new Metadata Manager Warehouse tables and views. Prefix the name of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_. If you build new Metadata Manager Warehouse tables or views, register the tables in the Metadata Manager schema and create new metrics and attributes in the Metadata Manager schema. Note that the Metadata Manager schema is built on the Metadata Manager views.
After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.
Last updated: 05-Jun-08 14:15
Estimating Metadata Manager Volume Requirements Challenge Understanding the relationship between various inputs for the Metadata Manager solution in order to estimate data volumes for the Metadata Manager Warehouse.
Description The size of the Metadata Manager warehouse is directly proportional to the size of the metadata being loaded into it. The size depends on the number of element attributes captured in the source metadata and the associations defined in the metamodel. When estimating volume requirements for a Metadata Manager implementation, consider the following Metadata Manager components:
Metadata Manager Service - Manages the source repository metadata stored in the Metadata Manager Warehouse. You can use Metadata Manager, which uses the Metadata Manager Service, to search, view, and configure source repository metadata and run reports.
Metadata Manager Integration Repository - This PowerCenter repository stores the workflows, which are resource components that extract source metadata and load it into the Metadata Manager Warehouse.
Metadata Manager Warehouse - The Metadata Manager Warehouse stores the Metadata Manager metadata. It also stores source repository metadata and metamodels.
Considerations Volume estimation for Metadata Manager is an iterative process. Use the Metadata Manager in the development environment to get accurate size estimates for the Metadata Manager in the production environment. The required steps are as follows:
1. Identify the source metadata that needs to be loaded in the Metadata Manager production warehouse.
2. Size the Metadata Manager Development warehouse based on the initial sizing estimates (as explained under the Sizing Estimate Example section of this document).
3. Run the resource loads and monitor the disk usage. The development metadata loaded during the initial run of the resources should be used as a baseline for further sizing estimates.
Sizing Estimate Example The following table is an initial estimation matrix that should be helpful in deriving a reasonable initial estimation. For increased input sizes, consider the expected Metadata Manager Warehouse target size to increase in direct proportion.

Resource                     Input Size   Expected Metadata Manager Warehouse Target Size
Metamodel and other tables   -            50MB
PowerCenter                  1MB          10MB
Data Analyzer                1MB          4MB
Database                     1MB          5MB
Other Resource               1MB          4.5MB
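As a rough worked example that simply applies the ratios in the table above to made-up input sizes (the baseline measured in step 3 should always override these numbers):

   50MB   metamodel and other tables
2,000MB   for 200MB of PowerCenter metadata (x10)
   80MB   for 20MB of Data Analyzer metadata (x4)
  250MB   for 50MB of database catalog metadata (x5)
-------
approximately 2,380MB (about 2.4GB) of estimated Metadata Manager Warehouse space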
Last updated: 02-Jun-08 23:31
Metadata Manager Load Validation Challenge Just as it is essential to know that all data for the current load cycle has loaded correctly, it is important to ensure that all metadata extractions (Metadata Resources) loaded correctly into the Metadata Manager warehouse. If metadata extractions do not execute successfully, the Metadata Manager warehouse will not be current with the most up-to-date metadata.
Description Metadata Manager’s resource loads consist of four phases: loading, indexing, profiling (optional), and linking (optional) of a resource. Metadata Manager provides a comprehensive interface to monitor and validate all four load phases. The Metadata Manager Application interface provides the run history for each of the resources. For load validation, use the Load page in the Metadata Manager Application interface, the PowerCenter Workflow Monitor, and the PowerCenter Administration Console logs. The Workflow Monitor in PowerCenter will also have a workflow and session log for the resource load. Resources can fail for a variety of reasons common in IT, such as unavailability of the database, network failure, improper configuration, etc. Commonly, a resource loads without any errors despite a configuration issue, and the problem only becomes apparent when users perform lineage operations on the resource. More detailed error messages can be found in the activity log or in the workflow log files. The following installation directories will also have additional log files that are used for the resource load process:
…\server\tomcat\mm_files\MM_PC901\mm_load
…\server\tomcat\mm_files\MM_PC901\mm_index
…\server\tomcat\logs
Loading and Monitoring Resources Overview The Load page for Metadata Manager allows users to create, load and monitor resources. Use the Load page to perform the following resource tasks: Load a resource. Load the source metadata for a resource into the Metadata Manager warehouse. Metadata Manager extracts metadata and profiling information, indexes the resource and optionally links the resource. Monitor a resource. Use the Metadata Manager activity log, resource output log and PowerCenter Workflow Monitor to monitor and troubleshoot resource loads. The load monitor also provides a “link summary” report that details the number of possible links for a resource and the number of actual links for a resource. Schedule a resource. Create a schedule to select the time and frequency that Metadata Manager loads a resource. The schedule can be attached to a resource. The figure below shows the Load page for Metadata Manager:
Loading Resources
Once a resource has been configured, it can be loaded. Metadata Manager also provides a command line interface to create and load metadata resources. Metadata Manager loads the resource and displays the results of the resource load in the Resource List. When Metadata Manager loads a resource, it completes the following tasks:
Metadata Load: Loads the metadata for the resource into the Metadata Manager warehouse.
Profiling: Extracts profiling information from the source database. If a relational database resource is loaded, profiling information can be extracted from tables and columns in the database. This step is optional and is relevant only for database resources.
Linking: Creates or updates the links for a resource.
Indexing: Creates or updates the index files for the resource.
Start the load process from the Resource List section of the Load page. To load a resource:
1. On the Load page, select the resource to be loaded in the Resource List.
2. Click Load. Metadata Manager adds the resource to the load queue and starts the load process. If Metadata Manager finds an unassigned connection to another metadata source, it pauses the load; configure the connection assignments for the resource in the Resource Properties section and click Resume to proceed.
3. To cancel the load, click Cancel.
When the resource load completes, Metadata Manager updates the Last Status Date and Last Status for the resource. Use the activity log and the output log to view more information about the resource load.
Resuming a Failed Resource Load If a resource load fails when PowerCenter runs the workflows that load the metadata into the warehouse, the resource load can be resumed. Use the output log in Metadata Manager and the workflow and session logs in the PowerCenter Workflow Monitor to troubleshoot the error and resume the resource load. To resume a failed load:
1. On the Load page, select the resource whose load should be resumed from the Resource List.
2. Click Resume. Metadata Manager continues loading the resource from the previous point of failure and completes any profiling or indexing operations.
Load Queue When a resource is loaded, Metadata Manager places the resource in a load queue. The load queue controls the order in which Metadata Manager loads resources. Metadata Manager places resources in the load queue when the resource load starts from the Load page or when a scheduled resource load begins. If a resource load fails, Metadata Manager keeps the resource in the load queue until the timeout interval for the resource load is exceeded. When the timeout interval is exceeded, Metadata Manager removes the resource from the load queue and begins loading the next resource in the queue. Configure the number of resources that Metadata Manager loads simultaneously and the timeout interval for resource loads when configuring the Metadata Manager Service in the PowerCenter Administration Console. Metadata Manager can load up to five resources simultaneously. The default timeout for the load queue is 30 minutes.
Link Details To view link details for a resource: 1. On the Load tab, select the resource in the Resources panel. 2. Click Actions => View Load Details. The Load Details tab for the resource appears. 3. Click the Links view. The view lists the possible, actual and missing links for all connections and for specific connections. To export the missing link details to a Microsoft Excel file, select All Connections or select a specific connection and then click Actions => Export Details to Excel.
Links View The Links view contains information about links created between objects in resources that share a connection assignment. Metadata Manager creates links during a resource load or when Metadata Manager is directed to create the links. Metadata Manager uses the links between objects in different resources to display data lineage across sources. The screenshot below displays the link summary for a PowerCenter resource.
Note: The Links view is not applicable for a custom resource. Metadata Manager updates the Links view during each resource load or link process. The following table describes the columns in the Links view:

Connection - Name of the connection in the data integration, business intelligence, or data modeling repository to the source database.
Assigned Schema - Schema name for the source database resource that the connection is assigned to.
Referred Schema - Name of the overridden schema in the source database resource that Metadata Manager uses to link the objects. When the object in the source database overrides the connection, Metadata Manager uses the overridden schema instead of the assigned schema to link the objects.
Resource - Name of the connected resource. For a data integration, business intelligence, or data modeling resource, this column lists the resource that owns the assigned schema. For a relational database resource, this column lists the resource that owns the connection.
Linkable Objects - Number of objects associated with the connection and schema that can have links across resources. Metadata Manager lists the number of linkable objects in the following columns: Before (number existing before the last resource load), Added (number added during the last resource load), Removed (number removed during the last resource load), and Current (total number).
Actual Links - Number of created links associated with the connection and schema. Metadata Manager lists the number of links in the following columns: Before (number existing before the last resource load or linking), Added (number added during the last resource load or linking), Removed (number removed during the last resource load or linking), and Current (total number).
Missing Links - Number of links associated with the connection and schema that Metadata Manager could not create. Missing links can occur because of incorrect connection assignments, outdated metadata for the resources, or a counterpart object that does not exist in the source database. Metadata Manager lists the number of missing links in the following columns: Before (number existing before the last resource load or linking), Added (number added during the last resource load or linking), Removed (number removed during the last resource load or linking), and Current (total number).
Loading Metadata Sources in Order To view data lineage or where-used analysis across source repositories and databases, load the metadata for the database or other source repository, configure the connection assignments for the resource, and then load the resource that contains the connections. For example, if the user wants to run data lineage analysis between a PowerCenter repository and an Oracle database, the user must load the Oracle database, configure the connection assignments for the PowerCenter resource, and then load the PowerCenter resource.
Resource Link Administration If the database resource has not been loaded, or the database resource load failed but the user has gone ahead and loaded the ETL or BI resource, Metadata Manager allows the user to link that ETL or BI resource (for example, a PowerCenter resource) with the database resource later. This feature is also useful when a configuration mistake has been made with the connection assignments. The screen shot below shows the Resource Link Administration page.
Note: Linking through this wizard does not run the entire metadata load; it runs just the linking portion of the load.
Monitoring Resources Monitor resource load runs to determine if they are successful. If a resource load fails, troubleshoot the failure and load the resource again. Use the following logs in Metadata Manager to view information about resource loads and troubleshoot errors: Activity log. Contains the status of resource load operations for all resources. Load Details log. Contains detailed information about each resource load operation. The PowerCenter Workflow Monitor can be used to view the workflows as they load the metadata. During the metadata extraction process, PowerCenter workflows extract the data from the Metadata Source System and load it to the Metadata Warehouse. Monitoring of these workflows can be done from the Metadata Manager Resource Activity log as well as from PowerCenter’s Workflow monitor. In some cases it might be necessary to review these workflow/session logs to troubleshoot an issue.
Use session and workflow logs to troubleshoot errors. If loading multiple resources of the same resource type concurrently, the Integration Service runs multiple instances of the workflow that corresponds to the resource type. Each workflow instance includes separate workflow and session logs. The mmcmd command can also be used to get more information about the status of a resource load from the command line. Note: Profiling may show as successful although some of the PowerCenter sessions that load profiling information fail. Sessions can fail because of run-time resource constraints. If one or more sessions fail but the other profiling sessions complete successfully, profiling displays as successful on the Load page.
Activity Log The activity log contains details on each resource load. Use the activity log to get more details on a specific resource load. The following shows a sample activity log:
The following table describes the contents of the activity log:

Resource - Name of the resource.
Task Type - Type of task performed by Metadata Manager. Metadata Manager performs the following tasks: Metadata Load (loads metadata into the Metadata Manager warehouse), Profiling (extracts profiling information from the source database), and Indexing (creates or updates index files for the resource).
User - Metadata Manager user that started the resource load.
Start Date - The date and time the resource load started.
End Date - The date and time the resource load completed.
Status - The status of the metadata load, profiling, and indexing operations.
Load Details The load details log displays the results of the most recent resource load for a resource. Use this log for detailed information about the operations performed by Metadata Manager when it loads the resource. The following example shows an excerpt from an output log:
Last updated: 30-Oct-10 22:03
Metadata Manager Business Glossary Challenge A group of people working towards a common goal needs shared definitions for the information they are dealing with. Implementing a Business Glossary with Metadata Manager provides a vocabulary that facilitates better understanding between business and IT. Whether a company is in Life Sciences, Banking, Retail, or Insurance, it needs to establish a common understanding of the enterprise terms and definitions used in day-to-day activities: in meetings, emails, and conversations between enterprise users. It is imperative for the different divisions in a company (e.g., R&D, Marketing, and Sales) to have a clear understanding of business terms and their descriptions or formulas (if any).
Description Business Glossary is a key feature of Informatica Metadata Manager. This tool enables data analysts, business analysts and data stewards to work together to create, manage and share a common vocabulary of data integration business terms. This feature fosters cross-functional alignment and helps all parts of the business better understand the context and usage of data. IT organizations can answer the most common questions that business consumers have about data, such as “What does this data mean?”, “Where does the data come from?” and “Who has responsibility for this data?” Using Business Glossary will not only provide a means of sharing the business meanings associated with a term (like derivation, ref. values, etc.) but also will enable business and IT users to see what technical objects are related with a term through controlled security mechanisms. This makes Metadata Manager Business Glossary a key differentiator from other products that just act as a glossary without relating to technical metadata.
The above diagram shows a sample lineage diagram for a business term. Metadata Manager allows users to associate technical objects to business entities. By providing business context to technical artifacts related to data integration, the Business Glossary makes it possible to catalog, govern and use data consistently and efficiently. This feature helps establish data ownership and manage accountability through auditable data trails that are critical for successful data governance programs and governance, risk and compliance initiatives. Data values and data object names such as names of entities and attributes can be interpreted differently by various groups. In many organizations, Business Analysts create spreadsheets or Word documents to manage business terms. Without a common repository to store these business terms, it is often a challenge to communicate them to other groups. By creating a Business Glossary, Metadata Manager can be leveraged to associate business terms to IT assets. Metadata Manager provides an interface to create and maintain enterprise business glossaries. It also allows users to define relationships between business terms and technical terms. Metadata Manager allows creating multiple glossaries in a single environment. Metadata Manager can integrate with an LDAP server and allow users to use standard domain authentication, making it easier to provide access to users across the enterprise.
Creating a Business Glossary In Metadata Manager, a Business Glossary can be created using either the “Add Business Glossary” option from the Load tab or
the Actions > New > Glossary option from the Browse tab. Any of the following four methods can be used to populate the Business Glossary:
1. Manually create Categories and Terms and use the GUI to enter the data.
2. Use Excel files as the source and import the data.
3. Use the Custom Metadata Configurator to generate a custom XConnect. This option is useful when trying to import terms from an existing source (e.g., DB tables, SharePoint, etc.).
4. Import from an XML file. This option is useful when copying terms between different glossaries.
Pull-down lists are provided for Owner, Data Steward, and Status. The values in the pull-down lists for Owner and Data Steward come from the Informatica Administrator Console; any user in the Informatica domain with appropriate permissions will show up in these lists. The Status pull-down shows the values Proposed, Approved, Standard, and Deprecated. Follow the steps below to populate the Business Glossary using the Import from Excel method:
1. Create at least one category and one term in the UI.
2. Export the Glossary to an Excel file.
3. Insert additional records for other categories and terms.
4. Populate the appropriate columns in the Excel file to create relationships between the terms.
5. Create relationships between terms and catalog objects.
6. Add terms to multiple Categories.
At least one term and one category should be created before exporting to Excel. Objects can then be added to the Excel sheet and the import from Excel performed.
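The exact column layout of the exported Excel file depends on the Metadata Manager version, so the fragment below is only a hypothetical illustration of the kind of columns to expect; Owner, Data Steward, and Status correspond to the pull-down lists described above, while the remaining column names and the sample values are placeholders. Always start from a file exported from the glossary itself rather than building one from scratch.

Category        Term          Description                        Owner   Data Steward   Status     Related Term
Customer Data   Customer ID   Unique identifier for a customer   jdoe    asmith         Approved   Account Number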
Linking Business Terms with Technical Objects A Business Glossary implementation with Metadata Manager is intended not only to capture the business terms, but also to provide a way to relate technical and business concepts. This allows Business Users, Data Stewards, Business Analysts, and Data Analysts to describe key concepts of their business and communicate them to other groups. Business terms can be related to technical metadata using any of the following options:
1. Manually create relationships by drag and drop in the GUI.
2. Import from an Excel file.
3. Use the Custom Metadata Configurator to generate a custom XConnect.
4. Use the Object Relationships Wizard.
Object Relationships Wizard The Object Relationships Wizard (ORW) can be used to create relationships for multiple metadata objects in a Business Glossary or a custom resource. The wizard can be used to find metadata objects that users can create relationships between; if a relationship already exists, the wizard can be used to delete it. To use the wizard, select the resources that contain the metadata objects that the user wants to create relationships between. Select one term or an entire glossary, then configure the metadata object property that should match. The wizard searches for metadata objects in the resources that contain matching properties and then displays the relationships that can be created or deleted. The ORW searches for technical objects that contain matching properties (e.g., the Business Glossary description field matching an Oracle column name). To launch the ORW, select the term or the Business Glossary (by going to the categories screen) and, under Actions, click “Object Relationships Wizard”.
In the next screen, choose the technical resources to which the business terms will be linked. After choosing the resources, there will be a prompt to match a Business Glossary attribute to a technical object attribute. Only one attribute can be chosen (e.g., the Business Glossary term name can be matched to the Oracle column name).
The next screen of the wizard shows the list of technical objects that were found. The user can choose to select only a subset of this list.
The ORW provides an easy way for a user to link a group of terms (or custom elements) to technical elements. However, one drawback of the ORW is that its runs cannot be automated or scheduled. If the situation demands complete automation of the Business Glossary load and linking, a custom XConnect should be developed for this purpose. A combination of PowerCenter code and a custom XConnect can be used to import business terms from a third-party source and load them into Metadata Manager.
Last updated: 30-Oct-10 21:10
Metadata Manager Migration Procedures Challenge This Best Practice describes the processes that should be followed (as part of a Metadata Manager deployment across multiple environments) whenever out-of-the-box Metadata Manager components are customized or configured, or when new components are added to Metadata Manager. Because the Metadata Manager product consists of multiple components, the steps apply to individual product components. The deployment processes are divided into the following four categories:
Out of the Box Resources
Custom Resources
Business Glossaries
Custom Metadata Reports
Description Out of the Box Resources Out-of-the-box resources may be migrated in a few different ways. The Metadata Administrator may choose to recreate the entire resource in the target environment and reload the resources manually. Metadata Manager also provides a way to “copy” resources from one environment to another. Using the mmcmd command, a resource’s configuration can be exported into a “Resource Configuration File” (RCF). For example:
mmcmd.bat getResource -url http://192.168.1.118:10250 -u Administrator -pw Administrator -r "Ora - Local" -rcf ora_local1.xml
The generated RCF is an XML file that captures the resource’s configuration, including the directory of its PowerCenter parameter files.
Copy the resource configuration files to the production environment. The PowerCenter parameter files must be in the directory specified in the resource configuration file; if the resource configuration file does not include a path, mmcmd looks for the parameter files in the mmcmd directory. In the production environment, run mmcmd createResource to create each resource with the appropriate resource configuration file. For example:
mmcmd> createResource -url http://192.168.1.118:10250 -u Administrator -pw Administrator -r "Ora - Local" -rcf ora_local.xml
Then run mmcmd load to load each resource into the production Metadata Manager warehouse.
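As a sketch, the final load step might look like the line below, assuming the load command accepts the same connection and resource flags as createResource; confirm the exact syntax in the mmcmd reference for the installed release.

mmcmd> load -url http://192.168.1.118:10250 -u Administrator -pw Administrator -r "Ora - Local"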
Custom Resources Migrating a custom resource involves four steps:
1. Export the metamodel.
2. Create a custom resource.
3. Migrate the custom resource templates.
4. Import the custom resource template.
Export the Metamodel Migrating the metamodel is fairly straightforward. The metamodel can be exported and imported via the command line (using mmcmd) or via the Metadata Manager Console.
Creating a Custom Resource
To create the Custom Resource via the command line, first export the resource from DEV into a resource configuration file (by running mmcmd getResource) and then import the RCF into PROD (by running mmcmd createResource). Alternatively, the Custom Resource can be created via the Metadata Manager GUI.
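Applied to a custom resource, the two commands might look like the following sketch; the host names, resource name, and file name are placeholders, and the getResource call runs against the development instance while createResource runs against the production instance.

mmcmd.bat getResource -url http://<dev_mm_host>:10250 -u Administrator -pw <password> -r "Custom - Glossary Feed" -rcf custom_glossary_feed.xml
mmcmd.bat createResource -url http://<prod_mm_host>:10250 -u Administrator -pw <password> -r "Custom - Glossary Feed" -rcf custom_glossary_feed.xml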
Migrate the Custom Resource Templates
The Custom Metadata Configurator allows one to export and import the template from one environment to another. To export a custom resource template:
1. In the Custom Metadata Configurator, log in to the Metadata Manager repository that contains the custom resource template the user wants to export.
2. Click Export. The Export Custom Template dialog box appears.
3. Select the Custom Resource for the template the user wants to export.
4. Enter the path and name of the custom template file that the user wants to export to, or click Browse to select the file.
5. Click Export. The Custom Metadata Configurator exports the custom resource template to the file.
Import the custom resource template into the target repository from the file.
Importing the Custom Resource Template When importing a custom resource template, Metadata Manager creates a template in the Metadata Manager repository for a Custom Resource. Before importing the custom resource template into the repository, the user must import the model for the Custom Resource into the Metadata Manager instance. After importing the template, the user must generate the PowerCenter objects that Metadata Manager requires to load the custom metadata. If the template contains relationships to other resources, the user must also configure the resources for the relationships before importing the template; if those resources do not exist in Metadata Manager, they must be loaded before importing the template. To import a custom resource template:
1. Log in to the Metadata Manager instance where the custom resource template will be imported.
2. Create a Custom Resource on the Load tab.
3. In the Custom Metadata Configurator, log in to the Metadata Manager repository where the Custom Resource was created.
4. Click Import. The Import Custom Template dialog box appears.
5. Select the name of the Custom Resource to which the template is being imported.
6. Enter the path and name of the custom template file to be imported, or click the Browse button to select the custom template file.
7. If the custom resource template contains relationships to other resources, configure a target resource for the relationship for each source resource in the custom template file.
8. Click Import. The Custom Metadata Configurator imports the template from the custom template file.
Note: After importing the template, the Custom XConnect workflows must be generated.
Business Glossary
Business Glossary term definitions can be copied between environments by exporting and importing via XML. The glossary can be exported using the Metadata Manager GUI or using the “mmcmd export” command.
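The line below is only a sketch of the command line route: the connection flags are assumed to match the getResource and createResource examples shown earlier, and the glossary resource name and output-file option are placeholders that must be confirmed against the mmcmd reference for the installed release.

mmcmd export -url http://192.168.1.118:10250 -u Administrator -pw Administrator -r "Business Glossary" -file glossary_terms.xml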
Report Changes The following scenarios describe the various report changes in the reporting area and the actions to take when deploying the changed components. It is always advisable to create new schema elements (metrics, attributes, etc.) or new reports in a new Data Analyzer folder to facilitate exporting and importing the Data Analyzer objects across development, test, and production.
Nature of Report Change: Modify Schema Component (metric, attribute, etc.)
Development: Perform the change in development, test the same, and certify it for deployment. Do an XML export of the changed components.
Test: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components.
Production: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components.

Nature of Report Change: Modify an Existing Report (add or delete metrics, attributes, filters, change formatting, etc.)
Development: Perform the change in development, test the same, and certify it for deployment. Do an XML export of the changed report.
Test: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report.
Production: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report.

Nature of Report Change: Add New Schema Component (metric, attribute, etc.)
Development: Perform the change in development, test the same, and certify it for deployment. Do an XML export of the new schema components.
Test: Import the XML exported in the development environment.
Production: Import the XML exported in the development environment.

Nature of Report Change: Add a New Report
Development: Perform the change in development, test the same, and certify it for deployment. Do an XML export of the new report.
Test: Import the XML exported in the development environment.
Production: Import the XML exported in the development environment.
Last updated: 31-Oct-10 01:44
Metadata Manager Repository Administration Challenge The task of administering the Metadata Manager Repository involves taking care of both the integration repository and the Metadata Manager warehouse. This requires knowledge of both PowerCenter administrative features (i.e., the integration repository used in Metadata Manager) and Metadata Manager administration features.
Description A Metadata Manager administrator needs to be involved in the following areas to ensure that the Metadata Manager warehouse is fulfilling the end-user needs:
Migration of Metadata Manager objects created in the Development environment to QA or the Production environment
Creation and maintenance of access and privileges for Metadata Manager objects
Repository backups
Job monitoring
Metamodel creation
Migration from Development to QA or Production In cases where a client has modified out-of-the-box objects provided in Metadata Manager or created a custom metamodel for custom metadata, the objects must be tested in the Development environment prior to being migrated to the QA or Production environments. The Metadata Manager Administrator needs to do the following to ensure that the objects are in sync between the two environments:
Install a new Metadata Manager instance for the QA/Production environment. This involves creating a new integration repository and Metadata Manager warehouse.
Export the metamodel from the Development environment and import it to QA or Production via the XML Import/Export functionality (in the Metadata Manager Administration tab) or via the Metadata Manager command line utility.
Export the custom or modified reports created or configured in the Development environment and import them to QA or Production via the XML Import/Export functionality. This functionality is identical to the corresponding function in Data Analyzer.
Providing Access and Privileges Users can perform a variety of Metadata Manager tasks based on their privileges. The Metadata Manager Administrator can assign privileges to users by assigning them roles; each role has a set of privileges that allow the associated users to perform specific tasks. The Administrator can also create groups of users so that all users in a particular group have the same privileges. When an Administrator assigns a role to a group, all users of that group receive the privileges assigned to the role. The Metadata Manager Administrator can assign privileges to users to enable them to perform any of the following tasks in Metadata Manager:
Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.
Configure the Metadata Manager Warehouse. Users can add, edit, and delete repository objects using Metadata Manager.
Configure metamodels. Users can add, edit, and delete metamodels.
Metadata Manager also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted in their ability to read, write, or delete source repository objects that appear in Metadata Manager. Similarly, the Administrator can establish access permissions for source repository objects in the Metadata Manager warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in Metadata Manager. The Administrator can assign the following types of access permissions to objects:
Read - Grants permission to view the details of an object and the names of any objects it contains.
Write - Grants permission to edit an object and create new repository objects in the Metadata Manager warehouse.
Delete - Grants permission to delete an object from a repository.
Change permission - Grants permission to change the access permissions for an object. When a repository is first loaded into the Metadata Manager warehouse, Metadata Manager provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.
Metamodel Creation In cases where a client needs to create custom metamodels for sourcing custom metadata, the Metadata Manager Administrator needs to create new packages, originators, repository types and class associations.
Job Monitoring When Metadata Manager Resources are running in the Production environment, Informatica recommends monitoring loads through the Metadata Manager console. The Load page in the Metadata Manager Application interface has an Activity Log that can identify the total time it takes for a Resource to complete. The console maintains a history of all runs of a Resource, enabling a Metadata Manager Administrator to ensure that load times are meeting the SLA agreed upon with end users and that load times are not increasing inordinately as data increases in the Metadata Manager warehouse. The Activity Log provides the following details about each repository load:
Repository Name - name of the source repository defined in Metadata Manager
Run Start Date - day of week and date the Resource run began
Start Time - time the Resource run started
End Time - time the Resource run completed
Duration - number of seconds the Resource run took to complete
Ran From - machine hosting the source repository
Last Refresh Status - status of the Resource run, and whether it completed successfully or failed
Repository Backups When Metadata Manager is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:
Database backups of the Metadata Manager warehouse.
Backups of the integration repository. Informatica recommends either of two methods for this backup: the PowerCenter Repository Server Administration Console (or the pmrep command line utility), or the traditional, native database backup method. The native PowerCenter backup is required, but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
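A minimal sketch of the pmrep route, assuming pmrep is run from a machine with the PowerCenter binaries installed; the repository, domain, user, and output file names are placeholders, and the connect options should be confirmed against the pmrep command reference for the installed release.

pmrep connect -r MM_INTEG_REPO -d Domain_MM -n Administrator -x <password>
pmrep backup -o mm_integ_repo_backup.rep

Scheduling these two lines in a script alongside the warehouse database backup keeps both recommended backup methods on the same cadence.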
Last updated: 03-Jun-08 00:20
Metadata Repository Architecture Challenge It is crucial for a governance or support group to manage the design of the metadata repository architecture. The infrastructure of a metadata repository needs to be set up not only to leverage key attributes but also to maintain interoperability between development groups. The challenge is to identify which key features of the metadata architecture are needed and to make them available to the various development environments.
Description Architecture Overview A centralized repository should store all of the heterogeneous metadata and convert it into homogeneous, meaningful, accessible business and technical metadata. The biggest advantage of this approach is that consistent management of all relevant metadata can be achieved; it also provides a single view of metadata across the enterprise. However, it requires that all tools replicate their metadata to the Metadata Warehouse frequently to keep it in sync with the source repositories. The following diagram depicts a centralized metadata architecture.
Sources of Metadata (XConnect) Business and technical metadata is stored in disparate systems throughout an organization and exists in a wide variety of formats, including diverse software applications and tools. The table below lists some of the typical systems for the most common types of metadata. These sources of metadata should flow directly into the metadata repository and be integrated through the metadata repository extraction process.
Source Type: Type of Metadata

BI Tools: Data Source, Data Connection, Universe, BI Schema [Table, Table Element, Table Joins], Dimensions, Facts, Attributes, Measures, Hierarchies, Levels, Filters, Report, Report Layout, Report Content [Data Columns, Filters], Report Data Source, Query, Composite Reports (Documents)

Databases: Schema, Table, Table-Column, Datatype, Synonym, View, Materialized View, Index, Index-Column, Keys-Constraints [Primary Key, Foreign Key, Unique Key, Check Constraint], Sequence, Function, Package, Stored Procedure, Triggers [Database, Schema, Table, View]

Data Integration: Folders, Sources, Targets, Data Connections, Delimited Flat Files, Delimited FlatFile-Fields, VSAM files, XML files (no field level information), Source Qualifiers (DSQs), Mappings, Mapping Shortcuts, Transforms (Lookup, Union, Aggregator, Joiner, Sequence, Sorter, Rank, Custom, Expression, Filter, Normalizer, Router, Mapplet, Update Strategy, Transaction Control, Java), Procedures (Stored Procedures, EP, AEP), Sessions, Tasks, Workflows, Integration Service

Business Glossaries: Business metadata from structured, semi-structured and unstructured formats

Data Modeling Tool: Physical and Logical Models, Model, Schema, Subject Area, Table, Table-Column, View, View-Column, Domain, List of Values, Index, Index-Columns, Keys-Constraints [Primary Key, Foreign Key, Unique Key, Rule Constraint, List Constraint, Min-Max Constraint], Relationships, Procedures, Triggers

Custom Metadata: Design, build and run metadata created with the Custom Metadata Configurator

Application: SAP R/3 tables

Mainframe: DB2 structures from the mainframe, similar to the Databases XConnect
Loading Metadata (XConnect) XConnect is a set of definitions and rules to extract metadata from various applications/systems. The process for loading and maintaining the metadata repository should be as automated as possible. The task of manually keying in metadata becomes
much too time consuming for the metadata repository team to perform, and over time, usually undermines the repository initiative and causes it to be discontinued. With careful analysis and some development effort, the vast majority of these manual processes can be automated. The following table summarizes the most common metadata updating frequencies.

XConnect Type: Update Frequency
BI Tools: As changes occur
Databases: As changes occur
Data Integration: As changes occur
Business Glossaries: As changes occur
Data Modeling Tool: As changes occur
Custom Metadata: As changes occur
Lineage and Impact Analysis A key feature of a metadata repository is the ability to track metadata dependencies and lineage across metadata sources. It is also important to be able to capture and store the mappings between related metadata as it flows among the various modeling, ETL and analysis tools used in the lifecycle of a project. Metadata content should also be able to incorporate critical information about itself, such as when and from what source it was created, how and when it was updated, what business rules were used in creating it, and what dependencies exist. Lineage awareness also permits metadata architects to perform impact analysis, evaluating the enterprise-wide impact that may result from a change such as a modification to a business rule. Cross-tool metadata dependency and lineage awareness are important requirements for any metadata repository architecture. The following diagram displays a lineage report that traces "Department" as a business term. The lineage report shows how many elements tracked by Metadata Manager use the term "Department" (e.g., reports, datamarts that store Department dimensions, and other dependencies such as data modeling and ETL mappings).
Browse and Search Catalog Metadata Repository users will benefit from having effective ways to search against multiple source repositories and to conveniently access available content from them. While the metadata within different repositories may vary in terms of data formats and operational purposes, standardized methods for searching are crucial in enabling interoperability among multiple silos
of metadata objects.
Last updated: 06-Sep-08 13:40
Daily Operations - PowerCenter Challenge Once the data warehouse has been moved to production, the most important task is to keep the system running and available for the end users.
Description In most organizations the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer of the Production Support team. To that end, the Production Support team needs two documents to aid in the support of the production data warehouse (a Service Level Agreement and an Operations Manual).
Monitoring the System Monitoring of the system is useful in identifying any problems or outages before the users are impacted. The Production Support team must know what failed, where it failed, when it failed and who needs to be working on the solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold violations, service level agreements and other organizational requirements helps to determine the effectiveness of the data warehouse and any need for changes.
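As a minimal sketch of such monitoring, the script below assumes the pmcmd pingservice command is available on the monitoring host; the service, domain and mail address are placeholders.

#!/bin/sh
# Hypothetical availability check for a PowerCenter Integration Service, run from cron.
# INT_SVC, DOMAIN and ONCALL are placeholders for site-specific values.
INT_SVC=INT_SVC_PROD
DOMAIN=Domain_PROD
ONCALL=dw-support@example.com

if ! pmcmd pingservice -sv $INT_SVC -d $DOMAIN > /dev/null 2>&1; then
    echo "`date`: $INT_SVC in $DOMAIN did not respond to pingservice" | \
        mailx -s "PowerCenter service check failed" $ONCALL
fi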
Service Level Agreement The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a high-level document that discusses system maintenance and the components of the system and identifies the groups responsible for monitoring the various components. The SLA should be measurable against key performance indicators. At a minimum, it should contain the following information:
Times when the system should be available to users.
Scheduled maintenance window.
Who is expected to monitor the operating system.
Who is expected to monitor the database.
Who is expected to monitor the PowerCenter sessions.
How quickly the support team is expected to respond to notifications of system failures.
Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure.
Operations Manual The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that can arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:
Information on how to stop and re-start the various components of the system.
IDs and passwords (or how to obtain passwords) for the system components.
Information on how to re-start failed PowerCenter sessions and recovery procedures.
A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.) and their average run times.
Error handling strategies.
Who to call in the event of a component failure that cannot be resolved by the Production Support team.
The PowerExchange Operations Manual.
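For the re-start procedures, the manual can include concrete commands such as the sketch below; the service, folder and workflow names are placeholders, and the operator's password is assumed to be supplied through an environment variable.

#!/bin/sh
# Hypothetical re-start of a failed nightly workflow from the command line.
# All names are placeholders; $PM_PASSWORD is assumed to be set in the operator's environment.
pmcmd startworkflow -sv INT_SVC_PROD -d Domain_PROD \
    -u Administrator -p "$PM_PASSWORD" \
    -f FINANCE_DW -wait wf_nightly_load
echo "pmcmd returned $?"    # 0 indicates the workflow completed successfully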
Last updated: 05-Sep-08 13:54
Daily Operations - PowerExchange Challenge Once the PowerExchange listeners have been moved to production, the most important task is to keep the system running and available for the end users.
Description In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer of the Production Support team. To that end, the Production Support team needs two documents to aid in the support of the production data warehouse (a Service Level Agreement and an Operations Manual).
PowerExchange Operations Manual The need to maintain archive logs and listener logs, use started tasks, perform recovery and carry out other operational functions on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a regular basis, operations is likely to face space issues. In order to set up archive logs on MVS, datasets need to be allocated and sized. Recovery after failure requires operational intervention to restart workflows and to set the restart tokens. For Change Data Capture, operations is required to start the started tasks in a scheduler and/or after an IPL. There are specific commands that need to be executed by operations. The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information on the operations of PowerExchange Change Data Capture.
Archive/Listener Log Maintenance The archive log is controlled using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS in the parameter ARCHIVE_RTPD=. The default supplied during the install (in RUNLIB member SETUPCC2) is 9999. This is generally longer than most organizations need it to be. To change it, just rerun the first step (and only the first step) in SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention period. This does not, however, fix the old archive datasets; to do that, use SMS to override the specification, which removes the need to change the EDMUPARM. The default listener log is part of the joblog of the running listener. If the listener job runs continuously, there is a potential risk for the spool file to reach its maximum size and cause issues with the listener. As a better alternative, schedule the listener started task to restart every weekend so that the log is refreshed and a new spool file is created. If necessary, change the started task listener jobs from //DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG; this will write the log to the member LOG in the HLQ..RUNLIB.
Recovery After Failure The last resort recovery procedure is to re-execute the initial extraction and load and restart the CDC process from the new initial load start point. Fortunately there are other solutions. In any case, if every change is needed, re-initializing may not be an option.
Application ID PowerExchange documentation refers to “consuming” applications – the processes that extract changes, whether they are real time or change (periodic batch extraction). Each “consuming” application must identify itself to PowerExchange. Realistically, this means that each session must have an application id parameter containing a unique “label”.
Restart Tokens PowerExchange remembers each time that a consuming application successfully extracts changes. The end-point of the extraction (the address in the database log - RBA or SCN) is stored in a file on the server hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file, to force the next extraction to restart from any of these points. If the
ODBC interface for PowerExchange is being used, this is the best solution to implement. When running periodic extractions of changes and everything finishes cleanly, the restart token history is a good way to recover back to a previous extraction: simply choose the recovery point from the list and re-use it. There are likelier scenarios, though. If running real-time extractions (potentially never-ending, or running until there is a failure), there are no end-points to memorize for restarts. If a batch extraction fails, many changes may have already been processed and committed. You can't afford to "miss" any changes and you don't want to reapply the same changes you've just processed, but the previous restart token does not correspond to the reality of what has been processed. When using the PowerExchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies within PowerCenter, which has historically been able to deal with restarting this type of process through Guaranteed Message Delivery (GMD). This functionality is applicable to both the real time and change CDC options. The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each Application ID in files on the PowerCenter Server. The directory and file name are required parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures compared to using the ODBC interface to PowerExchange. To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties. During normal session execution, the PowerCenter Server stores recovery information in cache files in the directory specified for $PMCacheDir.
Normal CDC Execution If the session ends "cleanly" (i.e., with a zero return code), PowerCenter writes tokens to the restart file and the GMD cache is purged. If the session fails, unprocessed changes are left in the GMD cache, along with a Restart Token corresponding to the point in time of the last of the unprocessed changes. This information is useful for recovery.
Recovery If a CDC session fails and it was executed with recovery enabled, it can be restarted in recovery mode - either from the PowerCenter Client interfaces or by using the pmcmd command line instruction. Obviously, this assumes that the session that failed previously can be identified. Recovery proceeds as follows:
1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends "cleanly".
The CDC session is now ready to be executed in normal mode again.
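As a hedged illustration of the pmcmd route, the following sketch restarts a failed CDC workflow in recovery mode; the service, domain, folder and workflow names are placeholders.

#!/bin/sh
# Hypothetical recovery of a failed CDC workflow; all names are placeholders.
pmcmd recoverworkflow -sv INT_SVC_PROD -d Domain_PROD \
    -u Administrator -p "$PM_PASSWORD" \
    -f CDC_FOLDER wf_cdc_orders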
Recovery Using PWX ODBC Interface Of course, successful recovery can also be accomplished when using the PowerExchange ODBC interface, but some manual intervention is needed to cope with processing all of the changes from the last restart token, even if some of them have already been processed. When a failed CDC session is re-executed, all of the changed data since the last PowerExchange restart token is received. The session has to cope with processing some of the same changes that were already processed at the start of the failed execution - either by using lookups/joins to the target to see if the change being processed has already been applied, or by simply ignoring database error messages such as those produced by trying to delete a record that was already deleted. If DTLUAPPL is run to generate a restart token periodically during the execution of a CDC extraction, saving the results, the generated restart token can be used to force a recovery at a more recent point in time than the last session-end restart token. This is especially useful when running real time extractions using ODBC. Otherwise, several days of changes that were already processed may be re-processed. Finally, the target and the CDC processing can always be re-initialized:
Take an image copy of the tablespace containing the table to be captured, with the QUIESCE option.
Monitor the EDMMSG output from the PowerExchange Logger job. Look for message DTLEDM172774I, which identifies the PowerExchange Logger sequence number corresponding to the QUIESCE event. The Logger output shows detail in the following format:
DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185
EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000
Sequence number . . . . . . . . . : 000000084E0000000000
Edition number . . . . . . . . . : B93C4F9C2A79B000
Source EDMNAME(s) . . . . . . . . : DB2DSN1CAPTNAME1
Take note of the log sequence number.
Repeat for all tables that form part of the same PowerExchange Application.
Run the DTLUAPPL utility specifying the application name and the registration name for each table in the application. Alter the SYSIN as follows:
MOD APPL REGDEMO DSN1 (where REGDEMO is the Registration name from Navigator)
add RSTTKN CAPDEMO (where CAPDEMO is the Capture name from Navigator)
SEQUENCE 000000084E0000000000000000084E0000000000
RESTART D5D3D3D34040000000084E0000000000
END APPL REGDEMO (where REGDEMO is the Registration name from Navigator)
Note how the SEQUENCE value is a repeated string of the sequence number found in the Logger messages after the Copy/Quiesce. Note that the RESTART parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the QUIESCE was done above. The image copy obtained above can be used for the initial materialization of the target tables.
PowerExchange Tasks: MVS Start and Stop Command Summary

Listener
Start command: /S DTLLST
Stop commands: /F DTLLST,CLOSE (preferred method); /F DTLLST,CLOSE,FORCE if CLOSE does not work; /P DTLLST if FORCE does not work; /C DTLLST if STOP does not work
Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture. (1)

Agent
Start command: /S DTLA
Stop commands: /DTLA DRAIN and /DTLA SHUTDOWN COMPLETELY can be used only at the request of Informatica Support
Description: The PowerExchange Agent, used to manage connections to the PowerExchange Logger and handle repository and other tasks. This must be started before the Logger.

Logger
Start command: /S DTLL (if you are installing, you need to run setup2 here prior to starting the Logger)
Stop commands: /P DTLL; /F DTLL,STOP; /F DTLL,display
Description: The PowerExchange Logger, used to manage the linear datasets and hiperspace that hold change capture data.

ECCR (DB2)
Start command: /S DTLDB2EC
Stop commands: /F DTLDB2EC,QUIESCE waits for open UOWs to complete; STOP or /F DTLDB2EC,STOP just cancels; /P DTLDB2EC; /F DTLDB2EC,display publishes stats into the ECCR sysout
Description: There must be registrations present prior to bringing up most adaptor ECCRs.

Condense
Start command: /S DTLC
Stop command: /F DTLC,SHUTDOWN
Description: The PowerExchange Condenser, used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Apply
(1) To identify all tasks running through a certain listener, issue the following: F
Notes:
1. /P is an MVS STOP command; /F is an MVS MODIFY command.
2. Remove the / if the command is issued from the console rather than from SDSF.
If an attempt is made to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and that the Logger will come down AFTER the ECCRs go away. Instead, do the following to shut the Listener and the ECCR(s) down at the same time:
The Listener: 1. F
The DB2 ECCR: 1. F
The Logger: P
The Agent: CMDPREFIX SHUTDOWN
If an IPL is imminent, all of these commands can be issued at the same time. The Listener and ECCR(s) should start down; if speed is necessary, issue F
Note: Bringing the Agent down before the ECCR(S) are down can result in a loss of captured data. If a new file/DB2 table/IMS database is being updated during this shutdown process and the Agent is not available, the call to see if the source is registered returns a “Not being captured” answer. The update, therefore, occurs without being captured, leaving the target in a broken state (which may not be uncovered until it is too late!)
Sizing the Logger When PWX-CHANGE is installed, up to two active log data sets are allocated with minimum size requirements. The information in this section can help to determine if it is necessary to increase the size of the data sets, and if additional log data sets should be allocated. When defining active log data sets, consider the system's capacity and requirements for changed data, including archiving and performance issues. After the PWX Logger is active, the log data set configuration can be changed as needed. In general, remember to balance the following variables:
Data set size
Number of data sets
Amount of archiving
The choices made can depend on the following factors:
Resource availability requirements
Performance requirements
Whether running near-realtime or batch replication
Data recovery requirements
An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger data sets need to be archived less often than smaller data sets. Note: Although smaller data sets require more frequent archiving, the archiving process requires less time. Use the following formulas to estimate the total space needed for each active log data set. For an example of the calculated data set size, refer to the PowerExchange Reference Guide.
active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)
active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder
When determining the average size of captured change records, note the following information:
PWX Change Capture captures the full object that is changed. For example, if one field in an IMS segment has changed, the product captures the entire segment.
The PWX header adds overhead to the size of the change record. Per record, the overhead is approximately 300 bytes plus the key length.
The type of change transaction affects whether PWX Change Capture includes a before-image, after-image, or both: DELETE includes a before-image; INSERT includes an after-image; UPDATE includes both.
Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:
Overhead for control information.
Overhead for writing recovery-related information, such as system checkpoints. Some control over the frequency of system checkpoints can be exerted when the PWX Logger parameters are defined. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.
DASD Capacity Conversion Table

Space Information          Model 3390    Model 3380
usable bytes per track     49,152        40,960
tracks per cylinder        15            15
This example is based on the following assumptions:
estimated average size of a changed record = 600 bytes
estimated rate of captured changes = 40,000 changes per hour
desired number of hours between archives = 12
overhead rate = 5 percent
DASD model = 3390
The estimated size of each active log data set in bytes is calculated as follows:
600 * 40,000 * 12 * 1.05 = 302,400,000
The number of cylinders to allocate is calculated as follows:
302,400,000 / 49,152 = approximately 6152 tracks
6152 / 15 = approximately 410 cylinders
The following example shows an IDCAMS DEFINE statement that uses the above calculations:
DEFINE CLUSTER (NAME (HLQ.EDML.PRILOG.DS01) LINEAR VOLUMES(volser) SHAREOPTIONS(2,3) CYL(410) ) DATA (NAME(HLQ.EDML.PRILOG.DS01.DATA) )
The variable HLQ represents the high-level qualifier that was defined for the log data sets during installation.
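The same arithmetic can be scripted; the sketch below simply reproduces the example figures above and is not a sizing tool.

#!/bin/sh
# Hypothetical sizing helper that scripts the formulas above, using the example assumptions.
AVG_REC_BYTES=600          # average size of a captured change record
CHANGES_PER_HOUR=40000
HOURS_BETWEEN_ARCHIVES=12
OVERHEAD=1.05              # 5 percent overhead
BYTES_PER_TRACK=49152      # 3390 DASD
TRACKS_PER_CYL=15

SIZE_BYTES=`echo "$AVG_REC_BYTES * $CHANGES_PER_HOUR * $HOURS_BETWEEN_ARCHIVES * $OVERHEAD" | bc`
TRACKS=`echo "$SIZE_BYTES / $BYTES_PER_TRACK" | bc`
CYLS=`echo "$TRACKS / $TRACKS_PER_CYL" | bc`
echo "Estimated size: $SIZE_BYTES bytes (~$TRACKS tracks, ~$CYLS cylinders per active log data set)"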
Additional Logger Tips The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through IDCAMS with:
No secondary allocation.
A single VOLSER in the VOLUME parameter.
An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.
PowerExchange Agent Commands Commands from the MVS system can be used to control certain aspects of PowerExchange Agent processing. To issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix (as specified by CmdPrefix in the configuration parameters) followed by the command. For example, if CmdPrefix=AG01, issue the following command to close the Agent's message log:
AG01 LOGCLOSE
The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange Agent commands that are issued. If the PowerExchange Agent has not been started during the current IPL, or if the command is issued with the wrong prefix, MVS generates the following message:
IEE305I command COMMAND INVALID
See the PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.
PowerExchange Logger Commands The PowerExchange Logger uses two types of commands, interactive and batch. Interactive commands can be issued from the MVS console when the PowerExchange Logger is running. Use PowerExchange Logger interactive commands to:
Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.
Resolve in-doubt UOWs.
Stop a PowerExchange Logger.
Print the contents of the PowerExchange active log file (in hexadecimal format).
Use batch commands primarily in batch change utility jobs to make changes to parameters and configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:
Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger names, archive log options, buffer options and mode (single or dual).
Add log definitions to the restart data set.
Delete data set records from the restart data set.
Display log data sets, UOWs, and reader/writer connections.
See the PowerExchange Reference Guide (8.1.1) for detailed information on Logger commands (Chapter 4, page 59).
Last updated: 05-Sep-08 14:47
Data Integration Load Traceability Challenge Load management is one of the major difficulties facing a data integration or data warehouse operations team. This Best Practice tries to answer the following questions:
How can the team keep track of what has been loaded?
What order should the data be loaded in?
What happens when there is a load failure?
How can bad data be removed and replaced?
How can the source of data be identified?
When was it loaded?
Description Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.
Benefits of a Load Management Architecture Data Lineage The term Data Lineage describes the ability to track data from its final resting place in the target back to its original source. This requires tagging every row of data in the target with an ID from the load management metadata model. This ID serves as a direct link between the actual data in the target and the original source data. To give an example of the usefulness of this ID, a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, the number of other rows loaded at the same time, and so forth. It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows. More than this, the ID makes it easy to identify the source data for a specific row in the target, enabling the operations team to quickly determine where a data issue may lie. It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back to the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes or whether those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration -- particularly in the initial launch of any new subject areas.
Process Lineage Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.
Process Dependency Management Having a metadata structure in place provides an environment to facilitate the application and maintenance of business dependency rules. Once a structure is in place that identifies every process, it becomes very simple to add the necessary metadata and validation processes required to ensure enforcement of the dependencies among processes. Such enforcement resolves many of the scheduling issues that operations teams typically face. Process dependency metadata needs to exist because it is often not possible to rely on the source systems to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple systems and must be loaded into the target schema in a specific order. This is usually difficult to manage because the various source systems have no way of coordinating the release of data to the target schema.
Robustness Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.
Load Ordering Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata. There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively, the correct order can be pre-defined in the load management metadata using load calendars. Either way, load ordering should be employed in any data integration or data warehousing implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure. The essential part of the load management process is that it operates without human intervention, helping to make the system self-healing.
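A minimal sketch of such an automated load-list process follows; the directory layout, file pattern and workflow name are assumptions, and a real implementation would add locking and error handling.

#!/bin/sh
# Hypothetical load-ordering sketch: build a load list from inbound flat files
# (oldest first), run the load workflow, then archive the processed files.
# Directories, the file pattern and the workflow name are placeholders.
INBOUND=/data/inbound
ARCHIVE=/data/archive
LIST=/data/work/load_list.txt

ls -1tr $INBOUND/ff_customer_*.dat > $LIST 2>/dev/null
[ -s $LIST ] || exit 0      # nothing to load this cycle

pmcmd startworkflow -sv INT_SVC_PROD -d Domain_PROD \
    -u Administrator -p "$PM_PASSWORD" \
    -f LOAD_MGMT -wait wf_load_customer
if [ $? -eq 0 ]; then
    # Archive only after a clean run so a failed load can be re-processed in order.
    while read f; do mv "$f" $ARCHIVE/; done < $LIST
fi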
Rollback If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done through manual intervention or with an automated feature developed for the purpose.
Simple Load Management Metadata Model
As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:
Source tracking
Process tracking
Source Tracking Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.
Source Definitions Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML, etc.), relational databases, ERP systems, and legacy mainframe systems. The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters. These definitions should be held in a Source Master table like the one shown in the data model above. These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds. In the case of flat files, it is usual to hold details like:
Header information (if any)
How many columns
Data types for each column
Expected number of rows
For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses). These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It's better to catch a bad data structure than to start loading bad data.
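A hedged example of this kind of structural check for a delimited flat file is shown below; the delimiter and expected column count stand in for values that would come from the Source Master definition.

#!/bin/sh
# Hypothetical pre-load structure check for a delimited flat file.
# DELIM and EXPECTED_COLS would be read from the Source Master definition in practice.
FILE=$1
DELIM='|'
EXPECTED_COLS=12

BAD_ROWS=`awk -F"$DELIM" -v n=$EXPECTED_COLS 'NF != n {bad++} END {print bad+0}' "$FILE"`
if [ "$BAD_ROWS" -gt 0 ]; then
    echo "`date`: $FILE rejected - $BAD_ROWS rows do not have $EXPECTED_COLS columns"
    exit 1                  # better to catch a bad structure than to load bad data
fi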
Source Instances A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type. The various source types may need slightly different source instance metadata to enable optimal control over each individual load. Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date / time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed. This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.
Process Tracking Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data. While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.
Process Definition Process definition metadata is held in the Process Master table (as shown in the load management metadata model). This, in its
basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.
Process Instances A process instance is represented by an individual row in the load management metadata Process Instance table. This represents each instance of a load process that is actually run. This holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance. The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then stored with each row of data in the target table.
Integrating Source and Process Tracking Integrating source and process tracking can produce an extremely powerful investigative and control tool for the administrators of data warehouses and integrated schemas. This is achieved by simply linking every process ID with the source instance ID of the source it is processing. This requires that a write-back facility be built into every process to update its process instance record with the ID of the source instance being processed. The effect is that there is a one-to-many relationship between the Source Instance table and the Process Instance table, with the Process Instance table containing several rows for each set of source data loaded into the target schema. For example, in a data warehousing project, there would be a row for loading the extract into a staging area, a row for the move from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.
Integrated Load Management Flow Diagram
Tracking Transactions This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.
Tracking Reference Data This task is complicated by the fact that reference data, by its nature, is not static. This means that if you simply update the data in a row any time there is a change, there is no way that the change can be backed out using the load management practice described earlier. Instead, Informatica recommends always using slowly changing dimension processing on every reference data
and dimension table to accomplish source and process tracking. Updating the reference data as a ‘slowly changing table’ retains the previous versions of updated records, thus allowing any changes to be backed out.
Tracking Aggregations Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions. This problem is managed by treating the source of the aggregate as if it was an original source. This means that rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So, the mechanism is the same as used for transactions but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.
Last updated: 20-Dec-07 15:44
Disaster Recovery Planning with PowerCenter HA Option Challenge Develop a Disaster Recovery (DR) Plan for PowerCenter running on Unix/Linux platforms. Design a PowerCenter data integration platform for high availability (HA) and disaster recovery that can support a variety of mission-critical and time-sensitive operational applications across multiple business and functional areas.
Description To enable maximum resilience, the data integration platform design should provide redundancy and remoteness. The target architecture proposed in this document is based upon the following assumptions:
A PowerCenter HA option license is present.
A Cluster File System will be used to provide concurrent file access from multiple servers in order to provide a flexible, high-performance, and highly available platform for shared data in a SAN environment.
Four servers will be available for installing PowerCenter components.
PowerCenter binaries, the repository/domain database, and the shared file system for PowerCenter working files are considered in a failover scenario. The DR plan does not take into consideration source and target databases, FTP servers or scheduling tools.
A standby database server (which requires replicated logs for recovery) will be used as the disaster recovery solution for the database tier. It will provide disaster tolerance for both the PowerCenter repository and the domain database. As this server will be used to achieve high availability, it should have performance characteristics in parity with the primary repository database server.
Recovery time for storage can be reduced using near real-time replication of data-over-distance from the primary SAN to a mirror SAN. Storage vendors should be consulted for optimal SAN and mirror SAN configuration.
Primary Data Center During Normal Operation
PowerCenter Domain During Normal Operation
The Informatica Service Manager on Node 1 and Node 2 is running. The Informatica Service Manager on Node 3 and Node 4 is shut down.
A node is a logical representation of a physical machine. Each node runs a Service Manager (SM) process to control the services running on that node. A node is considered unavailable if the SM process is not up and running. For example, the SM process may not be running if the administrator has shut down the machine or the SM process. SM processes periodically exchange a heartbeat signal among themselves to detect any node or network failure. Upon detecting a primary (or backup) node failure, the remaining nodes determine the new primary (or backup) node via a distributed voting algorithm. Typically, the administrator will configure the OS to automatically start the SM whenever the OS boots up or in the event the SM fails unexpectedly. For unexpected failures of the SM, monitoring scripts should be used because the SM is the primary point of control for PowerCenter services on a node; a sketch of such a script appears below. When PowerCenter is installed on a Unix/Linux platform, the same user id (uid) and group id (gid) should be created for all Unix/Linux users on Node1, Node2, Node3 and Node4. When the infa_shared directory is placed on a shared file system like CFS, all Unix/Linux users should be granted read/write access to the same files. For example, if a workflow running on Node1 creates a log file in the log directory, Node2, Node3 and Node4 should be able to read and update this file.
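The following is a minimal watchdog sketch for the SM, assuming a PowerCenter 8.x layout started through infaservice.sh; the install path, the process pattern and the mail address are assumptions to be adjusted per site.

#!/bin/sh
# Hypothetical Service Manager watchdog, run from cron on each node.
# INFA_HOME, the ps pattern and ONCALL are placeholders for site-specific values.
INFA_HOME=/u01/app/informatica/pc8.0.0/server
ONCALL=dw-support@example.com

# Adjust the grep pattern to match how the SM java process appears in ps output on your platform.
if ! ps -ef | grep "[i]nfaservice" > /dev/null 2>&1; then
    echo "`date`: Service Manager not running on `hostname`, attempting restart" | \
        mailx -s "PowerCenter SM restart" $ONCALL
    su - pmuser -c "$INFA_HOME/tomcat/bin/infaservice.sh startup"
fi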
To install and configure PowerCenter services on four nodes:
1. For the Node1 installation, choose the option to "create domain".
2. For the Node2, Node3 and Node4 installations, choose the option to "join the domain".
3. Node1 will be the master gateway. For Node2, Node3 and Node4 choose "Serves as Gateway: Yes".
4. For Node1, use the following URL to confirm that it is the Master Gateway: http://node1_hostname:6001/coreservices/DomainService
The result should look like this:
/coreservices/AlertService : enabled
/coreservices/AuthenticationService : initialized
/coreservices/AuthorizationService : enabled
/coreservices/DomainConfigurationService : enabled
/coreservices/DomainService : [DOM_10004] Domain service is currently master gateway node and enabled.
/coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
/coreservices/LicensingService : enabled
/coreservices/LogService : enabled
/coreservices/LogServiceAgent : initialized
/coreservices/NodeConfigurationService : enabled
5. For Node2, Node3 and Node4 respectively, use the following URL to confirm that they are not Master Gateways: http://node2_hostname:6001/coreservices/DomainService
The result should look like this:
/coreservices/AlertService : uninitialized
/coreservices/AuthenticationService : initialized
/coreservices/AuthorizationService : initialized
/coreservices/DomainConfigurationService : initialized
/coreservices/DomainService : [DOM_10005] Domain service is currently non-master gateway node and listening.
/coreservices/DomainService/InitTime : Fri Aug 03 09:59:03 EDT 2007
/coreservices/LicensingService : initialized
/coreservices/LogService : initialized
/coreservices/LogServiceAgent : initialized
/coreservices/NodeConfigurationService : enabled
6. Confirm the following settings:
a. For Node1, the Repository Service should be created as primary.
b. For Node1, "Acts as backup Integration Service" should be checked.
c. For Node2, the Integration Service should be created as primary.
d. For Node2, "Acts as backup Repository Service" should be checked.
e. Node3 and Node4 should be assigned as backup nodes for the Repository Service and Integration Service.
Note: During the failover, in order for Node3 and Node4 to act as primary repository services, they will need to have access to the standby repository database.
After the installation, persistent cache files, parameter files, logs, and other run-time files should be configured to use the directory created on the shared file system by pointing the $PMRootDir variable to this directory. Otherwise a symbolic link can be created from the default infa_shared location to the infa_shared directory created on the shared file system. After the initial set up, Node3 and Node4 should be shut down from the Administration Console. During normal operations Node3 and Node4 will be unavailable. In the event of a failover to the secondary data center, it is assumed that the servers for Node1 and Node2 will become unavailable. Rebooting the hosts for Node3 and Node4 will start the Service Manager process through a script placed in init.d, such as the following sketch:
TOMCAT_HOME=/u01/app/informatica/pc8.0.0/server/tomcat/bin
case "$1" in
'start')
    # Start the PowerCenter daemons:
    su - pmuser -c "$TOMCAT_HOME/infaservice.sh startup"
    ;;
'stop')
    # Stop the PowerCenter daemons (infaservice.sh shutdown assumed as the matching stop command):
    su - pmuser -c "$TOMCAT_HOME/infaservice.sh shutdown"
    ;;
esac
Every node in the domain sends a heartbeat to the primary gateway at a periodic interval. The default value for this interval is 15 seconds (this may change in a future release). The heartbeat is a message sent over an HTTP connection. As part of the heartbeat, each node also updates the gateway with the service processes currently running on the node. If a node fails to send a heartbeat during the default timeout value, which is a multiple of the heartbeat interval (the default value is 90 seconds), then the primary gateway node marks the node unavailable and will fail over any of the services running on that node. Six chances are given for the node to update the master before it is marked as down. This avoids any false alarms for a single packet loss or in cases of heavy network load where the packet delivery could take longer.
When Node3 and Node4 are started in the backup data center, they will try to establish a connection to the Master Gateway Node1. After failing to reach Node1, one of them will establish itself as the new Master Gateway. When normal operations resume, Node1 and Node2 will be rebooted and the Informatica Service Manager process will start on these nodes. Since the Informatica Service Manager process on Node3 and Node4 will be shut down, Node1 will try to become the Master Gateway.
The change in configuration required for the DR servers (there will be two servers as in production) can be set up as a script to automate the switchover to DR. For example, the database connectivity should be configured such that failover to the standby database is transparent to the PowerCenter repository and the Domain database. All database connectivity information should be identical in both data centers to make sure that the same source and target databases are used. For scheduling tools, FTP servers and message queues, additional steps are required to switch to the ETL platform in the backup data center.
As a result of using the PowerCenter HA option, redundancy in the primary data center is achieved. By using SAN mirroring, a standby repository database, and PowerCenter installations at the backup data center, remoteness is achieved. A further scale-out approach is recommended using the PowerCenter grid option to leverage resources on all of the servers.
A single cluster file system across all nodes is essential to coordinate read/write access to the storage pool, ensure data integrity, and attain performance.
Backup Data Center After Failover From Primary Data Center
PowerCenter Domain During DR Operation The Informatica Service Manager on Node 3 and Node 4 is running. The Informatica Service Manager on Node 1 and Node 2 is shut down.
Last updated: 04-Dec-07 18:00
High Availability Challenge Increasingly, a number of customers find that their Data Integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (e.g., systems, hardware, firmware) and procedural (e.g., application design, code implementation, session/workflow features) recovery with HA.
Description One of the common requirements of high volume data environments with non-stop operations is to minimize the risk exposure from system failures. PowerCenter’s High Availability Option provides failover, recovery and resilience for business critical, always-on data integration processes. When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:
External Resilience External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's data integration setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:
Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember that, as a prerequisite for the PowerCenter architecture, the external systems must be HA.
What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running)?
Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to recover Informatica tasks, and expect their O/S level HA capabilities to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends as far as the PowerCenter components.
Internal Resilience In an HA PowerCenter environment, key elements to keep in mind are:
Rapid and constant connectivity to the repository metadata.
Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.
A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console -> Domain -> Properties -> Log and Gateway Configuration).
Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:
Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.
Service. It is possible to configure service connection resilience in the advanced properties for an application service. When configuring connection resilience for an application service, this overrides the resilience values from the domain settings.
Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:
Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user
or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.
Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, then the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to connect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.
Common Elements of Concern in an HA Configuration Restart and Failover Restart and failover concern the Domain Services (Integration and Repository). If these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP, etc.) and artifacts of the ETL process cannot be highly available. If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption. Backup nodes can be configured for services with the high availability option. If an application service is configured to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:
If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable.
If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.
If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. The service process can be disabled on the backup node to cause it to fail back to the primary node.
Recovery Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption. The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:
Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.
Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress and connected clients.
Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains the session and workflow state of operations based on the recovery strategy configured for the session and workflow.
When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery. Architectural recovery for a PowerCenter domain involves the Service Manager, Repository Service and Integration Service restarting in a complete, sustainable and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment. Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:
A PowerCenter domain cannot be established without at least one gateway node running. Even if a domain consists of ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.
An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.
A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire, which puts the repository offline.
If the installed domain configuration is running with the Authentication Module Configuration and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.
Procedural recovery is supported by many features of PowerCenter. Consider the following very simple mapping that might run in production for many ETL applications:
Suppose there is a situation where the ftp server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on it must always run. The process is always insert only. You do not want the succession of ETL that follows this small process to fail; it can run to customer_stg with current records only. The following setting in the Workflow Manager (Session, Properties) fits this need:
Since it is not critical that the ff_customer records load each time, record the failure but continue the process. Now suppose the situation has changed: sessions are failing on a PowerCenter server due to target database timeouts, and a requirement is given that the session must recover from this:
Resuming from the last checkpoint restarts the process from its prior commit, so no ETL work is lost. To finish this second case, consider three basic items on the workflow side when the HA option is implemented:
An Integration Service in an HA environment can only recover those workflows marked with “Enable HA recovery”. Consider enabling this option for all critical workflows. For a mature set of ETL code running in QA or Production, consider the following workflow property:
This automatically recovers tasks from the point where they failed in a workflow after an application or system-wide failure. Consider carefully the use of this feature, however. Automated restart of critical ETL processes without human interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than originally intended. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature. In an HA environment, certain components of the domain can go offline while the domain stays up to execute ETL jobs. This is the time to use the “Suspend On Error” feature on the General tab of the workflow settings. The backup Integration Service then picks up this workflow and resumes processing based on the resume settings of the workflow:
Features
A variety of HA features exist in PowerCenter. Specifically, they include:
Integration Service HA option
PowerCenter Enterprise Grid option
Repository Service HA option
First, proceed from the assumption that nodes have been provided such that a basic HA configuration of PowerCenter can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step should always be implementing and thoroughly exercising a clustered file system:
Now, let’s address the options in order:
Integration Service HA Option
You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter install is configured, all nodes are available from the Admin Console->Domain->Integration Services->Grid/Node Assignments. From the above example, you would see Node 1, Node 2, Node 3 as dropdown options on that browse page. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both “P” and “B” in a configuration, with the current operating node highlighted:
If a failure were to occur in this HA configuration, the Integration Service INT_SVCS_DEV would poll the domain (Domain_Corp_RD) for another gateway node, then be assigned to that node, in this case Node_Corp_RD02. The “B” button would then highlight, showing this node as providing INT_SVCS_DEV. A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent location. The paths for Integration Service files must be specified for each Integration Service process. Examples
of Integration Service files include run-time files, state of operation files, and session log files. Each Integration Service process uses run-time files to process workflows and sessions. If an Integration Service is configured to run on a grid or on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files. State of operation files must be accessible by all Integration Service processes. When an Integration Service is enabled, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location. By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. The shared location for these directories can be set by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern is that $PMRootDir should be on the highly available clustered file system mentioned above.
Integration Service Grid Option
The Grid option provides implicit HA since the Integration Service can be configured as active/active to provide redundancy. The Server Grid option must be included on the license key for this to be available upon install. In configuring the $PMRootDir files for the Integration Service, retain the methodology described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a PowerCenter domain. Be sure to remember these key points:
PowerCenter supports nodes with heterogeneous operating systems, bit modes, and other characteristics within the same domain. However, if the nodes in a grid are heterogeneous, you can only run Workflow on Grid. For the Session on Grid option, a homogeneous grid is required. A homogeneous grid is necessary for Session on Grid because a session may share cache files and other objects that may not be compatible across operating systems.
If you have a large volume of disparate hardware, it is certainly possible to create, for example, two grids centered on two different operating systems.
In either case, the performance of your clustered file system affects the performance of your server grid and should be considered as part of your performance/maintenance strategy.
Repository Service HA Option
You must have the HA option on the license key for this to be available on install. There are two ways to include the Repository Service HA capability when configuring PowerCenter.
The first is during install. When the install program prompts for your nodes to do a repository install (after answering “Yes” to Create Repository), you can enter a second node where the install program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a “P”/“B” link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.
A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a second node and make it the repository backup. Install the PowerCenter service on this second server following the PowerCenter Installation and Configuration Guide; in particular, skip creating repository content or an Integration Service on the node. Following this, go to Admin Console->Domain and select “Create->Node”. The server to contain this node should have the exact same configuration/clustered file system/OS as the primary Repository Service.
The following dialog should appear:
Assign a logical name to the node to describe its role, and select “Create”. The node should now be running as part of your domain; if it is not, refer to the PowerCenter Command Line Reference for the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:
Click “OK” and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration:
1. Be certain the same version of the DBMS client is installed on the server and can access the metadata.
2. Both nodes must be on the same clustered file system.
3. Log onto the OS for the backup Repository Service and ping the domain master gateway node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds).
4. Take the primary Repository Service node offline and validate that the polling, failover, restart process takes place in a methodical, traceable manner for the Repository Service on the domain. This should be clearly visible from the node logs on the primary and secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs.
Note: Remember that when a node is taken offline, you cannot access the Admin Console from that node.
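If you need to confirm or restore a node's status from the command line during these tests, the following sketch shows the general approach. The domain and node names are placeholders, and the exact infacmd options vary by PowerCenter version, so verify them against the PowerCenter Command Line Reference before use.

# start the Informatica service (and therefore the node) on the backup repository server
$INFA_HOME/server/tomcat/bin/infaservice.sh startup

# confirm the node is reachable within the domain (placeholder domain and node names)
infacmd ping -dn Domain_Corp_RD -nn Node_Corp_RD02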
Using a Script to Monitor Informatica Services
A script can be used with the High Availability option to check all the Informatica services in the domain, as well as the domain itself. If any of the services are down, the script can bring them back up. To implement this, the domain, repository, and Integration Service details need to be provided as input to the script, and the script needs to be scheduled to
run at regular intervals. The script can be developed with eight functions (and one main function to check and bring up the services). The script can be adapted to any environment by changing only its input section. Comments should be provided for each function to make them easy to understand. Below is a brief description of the eight functions:
print_msg: Called to print output to the I/O and also writes to the log file.
domain_service_lst: Accepts the list of services to be checked for in the domain.
check_service: Calls the service manager, repository, and integration functions internally to check if they are up and running.
check_repo_service: Checks if the repository is up or down. If it is down, it calls another function to bring it up.
enable_repo_service: Called to enable the Repository Service.
check_int_service: Checks if the Integration Service is up or down. If it is down, it calls another function to bring it up.
enable_int_service: Called to enable the Integration Service.
disable_int_service: Called to disable the Integration Service.
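A minimal sketch of such a monitoring script is shown below. It assumes a UNIX environment and uses infacmd to check and enable services; the domain, service, and account names are placeholders, and the exact infacmd subcommands and options vary by PowerCenter version, so verify them against the Command Line Reference before use.

#!/bin/ksh
# Hypothetical monitoring-script sketch - adapt the input section to your environment.

# ----- input section -----
DOMAIN=Domain_Corp_RD            # domain name (placeholder)
REPO_SVC=REP_SVCS_DEV            # Repository Service name (placeholder)
INT_SVC=INT_SVCS_DEV             # Integration Service name (placeholder)
ADMIN_USER=Administrator         # domain administrator account (placeholder)
ADMIN_PWD=secret                 # consider an encrypted password or environment variable instead
LOG=/tmp/infa_monitor_$(date +%Y%m%d).log
# -------------------------

print_msg() {
    # write a message to standard output and append it to the log file
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" | tee -a "$LOG"
}

check_repo_service() {
    # ping the Repository Service; if the ping fails, attempt to enable it
    if ! infacmd ping -dn "$DOMAIN" -sn "$REPO_SVC" >> "$LOG" 2>&1; then
        print_msg "$REPO_SVC appears to be down - attempting to enable it"
        infacmd enableservice -dn "$DOMAIN" -un "$ADMIN_USER" -pd "$ADMIN_PWD" -sn "$REPO_SVC" >> "$LOG" 2>&1
    fi
}

check_int_service() {
    # the same pattern applied to the Integration Service
    if ! infacmd ping -dn "$DOMAIN" -sn "$INT_SVC" >> "$LOG" 2>&1; then
        print_msg "$INT_SVC appears to be down - attempting to enable it"
        infacmd enableservice -dn "$DOMAIN" -un "$ADMIN_USER" -pd "$ADMIN_PWD" -sn "$INT_SVC" >> "$LOG" 2>&1
    fi
}

# main: check the Repository Service first, then the Integration Service
check_repo_service
check_int_service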
Last updated: 25-May-08 19:00
Load Validation

Challenge
Knowing that all data for the current load cycle has loaded correctly is essential for effective data warehouse management. However, the need for load validation varies depending on the extent of error checking, data validation, and data cleansing functionalities inherent in your mappings. For large data integration projects with thousands of mappings, the task of reporting load statuses becomes overwhelming without a well-planned load validation process.
Description
Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:
1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows and failed rows).
2. Determine the source of the information. All of this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting it.
3. Determine how you want the information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table so that history is easily preserved? Do you want it stored as a flat file? Weigh all of these factors to find the correct solution for your project.
Below are descriptions of five possible load validation solutions, ranging from fairly simple to increasingly complex:
1. Post-session Emails on Success or Failure
One practical application of the post-session email functionality is the situation in which a key business user waits for completion of a session to run a report. Email is configured to notify the user when the session was successful so that the report can be run. Another practical application is the situation in which a production support analyst needs to be notified immediately of any failures. Configure the session to send an email to the analyst upon failure. For round-the-clock support, a pager number that has the ability to receive email can be used in place of an email address. Post-session email is configured in the session, under the General tab and ‘Session Commands’. A number of variables are available to simplify the text of the email:
%s  Session name
%e  Session status
%b  Session start time
%c  Session completion time
%i  Session elapsed time
%l  Total records loaded
%r  Total records rejected
%t  Target table details
%m  Name of the mapping used in the session
%n  Name of the folder containing the session
%d  Name of the repository containing the session
%g  Attach the session log to the message
%a<filename>  Attach the named file to the message
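As an illustration, a post-session success email body might be composed from these variables as follows (a sketch only; adjust the wording to your own notification standards):

Session %s completed with status %e.
Start time: %b   Completion time: %c   Elapsed time: %i
Rows loaded: %l   Rows rejected: %r
Mapping: %m   Folder: %n   Repository: %d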
2. Other Workflow Manager Features
In addition to post-session email messages, there are other features available in the Workflow Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the features that can be used to place multiple controls on the behavior of loads. Another solution is to place conditions within links. Links are used to connect tasks within a workflow or worklet. Use the pre-defined or user-defined variables in the link conditions. In the example below, upon the ‘Successful’
completion of both sessions A and B, the PowerCenter Server executes session C.
3. PowerCenter Reports (PCR)
The PowerCenter Reports (PCR) is a web-based business intelligence (BI) tool that is included with every PowerCenter license to provide visibility into metadata stored in the PowerCenter repository in a manner that is easy to comprehend and distribute. The PCR includes more than 130 pre-packaged metadata reports and dashboards delivered through Data Analyzer, Informatica’s BI offering. These pre-packaged reports enable PowerCenter customers to extract extensive business and technical metadata through easy-to-read reports including:
Load statistics and operational metadata that enable load validation.
Table dependencies and impact analysis that enable change management.
PowerCenter object statistics to aid in development assistance.
Historical load statistics that enable planning for growth.
In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you can develop additional custom reports and dashboards under the PCR limited-use license, which allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include:
Repository-wide reports and/or dashboards with indicators of daily load success/failure.
Customized project-based dashboards with visual indicators of daily load success/failure.
Detailed daily load statistics reports for each project that can be exported to Microsoft Excel or PDF.
Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.
Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.
4. Query Informatica Metadata Exchange (MX) Views
Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter repository. The Repository Manager generates these views when you create or upgrade a repository. Almost any query can be put together to retrieve metadata related to the load execution from the repository. The MX view REP_SESS_LOG is a great place to start; this view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows, and session duration:
select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from rep_sess_log a
where session_timestamp = (select max(session_timestamp)
                           from rep_sess_log
                           where session_name = a.session_name)
order by subject_area, session_name
The sample output would look like this:
TIP: Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository tables.
5. Mapping Approach
A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or a flat file with the desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information. The following graphic illustrates a sample mapping:
This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations. Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCR report. However, you can use a business intelligence tool of your choice instead.
Last updated: 06-Dec-07 15:10
Repository Administration

Challenge
Defining the role of the PowerCenter Administrator and understanding the tasks required to properly manage the domain and repository.
Description
The PowerCenter Administrator has many responsibilities. In addition to regularly backing up the domain and repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:
Determines metadata strategy
Installs/configures client/server software
Migrates development to test and production
Maintains PowerCenter servers
Upgrades software
Administers security and folder organization
Monitors and tunes environment
Note: The Administrator is also typically responsible for maintaining domain and repository passwords; changing them on a regular basis and keeping a record of them in a secure place.
Determine Metadata Strategy
The PowerCenter Administrator is responsible for developing the structure and standard for metadata in the PowerCenter Repository. This includes developing naming conventions for all objects in the repository, creating a folder organization, and maintaining the repository. The Administrator is also responsible for modifying the metadata strategies to suit changing business needs or to fit the needs of a particular project. Such changes may include new folder names and/or a different security setup.
Install/Configure Client/Server Software
This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as a Windows 2000/2003 or UNIX Admin and a DBA. The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.
Migrate Development to Production
When the time comes for content in the development environment to be moved to the test and production environments, it is the responsibility of the Administrator to schedule, track, and copy folder changes. Also, it is crucial to keep track of the changes that have taken place. It is the role of the Administrator to track these changes through a change control process. The Administrator should be the only individual able to physically move folders from one environment to another. If a versioned repository is used, the Administrator should set up labels and instruct the developers on the labels that they must apply to their repository objects (i.e., reusable transformations, mappings, workflows and sessions). This task also requires close communication with project staff to review the status of items of work to ensure, for example, that only tested or approved work is migrated.
Maintain PowerCenter Servers
The Administrator must also be able to understand and troubleshoot the server environment. He or she should have a good understanding of PowerCenter’s Service Oriented Architecture and how the domain and application services interact with each other. The Administrator should also understand what the Integration Service does when a session is running and be able to identify those processes. Additionally, certain mappings may produce files in addition to the standard session and workflow logs.
The Administrator should be familiar with these files and know how and where to maintain them.
Upgrade Software
If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.
Security and Folder Administration
Security administration covers both the PowerCenter domain and the repository. For the domain, it involves creating, maintaining, and updating all domain users and their associated rights and privileges to services and alerts. For the repository, it involves creating, maintaining, and updating all users within the repository, including creating and assigning groups based on new and changing projects and defining which folders are to be shared, and at what level. Folder administration involves creating and maintaining the security of all folders. The Administrator should be the only user with privileges to edit folder properties.
Monitor and Tune Environment
Proactively monitoring the domain and user activity helps ensure a healthy, functioning PowerCenter environment. The Administrator should review user activity for the domain to verify that the appropriate rights and privileges have been applied. Monitoring domain activity also ensures correct CPU and license usage. The Administrator should have sole responsibility for implementing performance changes to the server environment. He or she should observe server performance throughout development so as to identify any bottlenecks in the system. In the production environment, the Repository Administrator should monitor the jobs and any growth (e.g., increases in data or throughput time) and communicate such changes to the appropriate staff to address bottlenecks, accommodate growth, and ensure that the required data is loaded within the prescribed load window.
Last updated: 06-Dec-07 15:10
Third Party Scheduler

Challenge
Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes the various levels at which a third-party scheduler can be integrated.
Description
Tasks such as getting server and session properties, getting session status, or starting or stopping a workflow or a task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels. The level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel. Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process. A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using pmcmd commands. pmcmd is a program used to communicate with the PowerCenter server.
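For example, a scheduler job typically wraps pmcmd calls similar to the sketch below. The service, domain, folder, account, and workflow names are placeholders, and the exact options vary by PowerCenter version, so confirm them in the Command Line Reference:

# start a workflow and wait for completion so the scheduler receives the return code
pmcmd startworkflow -sv INT_SVCS_DEV -d Domain_Corp_RD -u Administrator -p secret \
    -f CUSTOMER_DM -wait wf_load_customer_stg
echo "pmcmd returned $?"

# query details for the Integration Service (useful as a pre-run health check)
pmcmd getservicedetails -sv INT_SVCS_DEV -d Domain_Corp_RD -u Administrator -p secret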
Third Party Scheduler Integration Levels
In general, there are three levels of integration between a third-party scheduler and PowerCenter: Low, Medium, and High.
Low Level
Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler. This type of integration is very simple to implement because the third-party scheduler kicks off only one process; it is often used simply to satisfy a corporate mandate for a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor. Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because Production Support personnel in many companies are only knowledgeable about the company’s standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.
Medium Level
With medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies. PowerCenter controls the dependencies within the tasks. With this level of integration, control is shared between PowerCenter and the third-party scheduler, which requires more integration between the third-party scheduler and PowerCenter. Medium-level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not have in-depth knowledge about the tool, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.
High Level
With high-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter. Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company’s standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support
personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.
Sample Scheduler Script
There are many independent scheduling tools on the market. The following is an example of an AutoSys script that can be used to start tasks; it is included here simply as an illustration of how a scheduler can be implemented in the PowerCenter environment. This script can also capture the return codes and abort on error, returning a success or failure (with associated return codes) to the command line or the AutoSys GUI monitor.
# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When    Who    Why
#
#-----------------------------------------------------------------
. jobstart $0 $*
# set variables
ERR_DIR=/tmp
# A temporary file will be created to store all the error information
# The file format is TDDHHMISS
Last updated: 06-Dec-07 15:10
Updating Repository Statistics

Challenge
The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up queries. Most databases use column distribution statistics to determine which index to use to optimize performance. It can be important, especially in large or high-use repositories, to update these statistics regularly to avoid performance degradation.
Description
For PowerCenter, statistics are updated during copy, backup or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly-run script. For PowerCenter 6 and earlier there are specific strategies for Oracle, Sybase, SQL Server, DB2 and Informix, discussed below. Each example shows how to extract the information out of the PowerCenter repository and incorporate it into a custom stored procedure.
Features in PowerCenter version 7 and later

Copy, Backup and Restore Repositories
PowerCenter automatically identifies and updates all statistics of all repository tables and indexes when a repository is copied, backed-up, or restored. If you follow a strategy of regular repository back-ups, the statistics will also be updated.
PMREP Command
PowerCenter also has a command line option to update statistics in the database, which allows the command to be included in a Windows batch file or UNIX shell script. The format of the command is:
pmrep updatestatistics {-s filelistfile}
The -s option allows you to skip tables for which you do not want to update statistics.
Example of Automating the Process
One approach to automating this is to use a UNIX shell script that includes the pmrep command “updatestatistics”, incorporated into a special workflow in PowerCenter and run on a scheduled basis. Note: the Workflow Manager supports command line tasks as well as scheduling. Below is an example of the command line object.
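A minimal sketch of such a script is shown below. It assumes a UNIX environment with placeholder repository, domain, and account names; pmrep requires a connect call before updatestatistics, and the exact connect options vary by PowerCenter version, so verify them in the Command Line Reference:

#!/bin/sh
# connect to the repository, then refresh optimizer statistics on the repository tables
pmrep connect -r REP_SVCS_DEV -d Domain_Corp_RD -n Administrator -x secret
pmrep updatestatistics

# pass pmrep's exit code back to the calling Command task so a failure is visible in the workflow
exit $?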
In addition, this workflow can be scheduled to run on a daily, weekly or monthly basis. This allows the statistics to be updated regularly so that performance does not degrade.
Tuning Strategies for PowerCenter version 6 and earlier
The following are strategies for generating scripts to update distribution statistics. Note that all PowerCenter repository table and index names begin with "OPB_" or "REP_".
Oracle
Run the following queries:
select 'analyze table ', table_name, ' compute statistics;'
from user_tables
where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;'
from user_indexes
where INDEX_NAME like 'OPB_%'

This will produce output like:
'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;
'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;
Save the output to a file. Then edit the file and remove all the headers (i.e., the lines that look like: 'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'). Run this as a SQL script. This updates statistics for the repository tables.
MS SQL Server
Run the following query:
select 'update statistics ', name
from sysobjects
where name like 'OPB_%'

This will produce output like:
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
Save the output to a file, then edit the file, remove the header information (i.e., the top two lines) and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Sybase
Run the following query:
select 'update statistics ', name
from sysobjects
where name like 'OPB_%'

This will produce output like:
name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
Save the output to a file, then remove the header information (i.e., the top two lines) and add a 'go' at the end of the file. Run this as a SQL script. This updates statistics for the repository tables.
Informix
Run the following query:
select 'update statistics low for table ', tabname, ' ;'
from systables
where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:
(constant) tabname (constant)
update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;
Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like: (constant) tabname (constant)). Run this as a SQL script. This updates statistics for the repository tables.
DB2
Run the following query:
select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;'
from sysstat.tables
where tabname like 'OPB_%'

This will produce output like:
runstats on table PARTH.OPB_ANALYZE_DEP and indexes all;
runstats on table PARTH.OPB_ATTR and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT and indexes all;
Save the output to a file. Run this as a SQL script to update statistics for the repository tables.
Last updated: 06-Dec-07 15:10
Determining Bottlenecks

Challenge
Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.
Description
The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist, using a process of elimination and investigating each area in the order indicated:
1. Target
2. Source
3. Mapping
4. Session
5. System
Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks
Use thread statistics to identify source, target or mapping (transformation) bottlenecks. By default, an Integration Service uses one reader, one transformation, and one target thread to process a session. Within each session log, the following thread statistics are available:
Run time. Amount of time the thread was running.
Idle time. Amount of time the thread was idle due to other threads within the application or Integration Service. This value does not include time the thread is blocked due to the operating system.
Busy. Percentage of the overall run time the thread is not idle. This percentage is calculated using the following formula: (run time - idle time) / run time x 100
By analyzing the thread statistics found in an Integration Service session log, it is possible to determine which thread is being used the most. If a transformation thread is 100 percent busy and there are additional resources (e.g., CPU cycles and memory) available on the Integration Service server, add a partition point in the segment. If the reader or writer thread is 100 percent busy, consider using string data types in source or target ports since non-string ports require more processing.
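As an illustration of the busy calculation: a transformation thread with a run time of 400 seconds and an idle time of 100 seconds is (400 - 100) / 400 x 100 = 75 percent busy, making the transformation thread the first place to look for a bottleneck.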
Use the Swap Method to Test Changes in Isolation
Attempt to isolate performance problems by running test sessions. You should be able to compare the session’s original performance with that of the tuned session. The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:
1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.
3. Document the change made to the mapping, session and/or workflow and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.
Evaluating the Five Areas of Consideration
Target Bottlenecks

Relational Targets
The most common performance bottleneck occurs when the Integration Service writes to a target database. This type of bottleneck can easily be identified with the following procedure:
1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.
If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:
Drop indexes and key constraints
Increase checkpoint intervals
Use bulk loading
Use external loading
Minimize deadlocks
Increase database network packet size
Optimize target databases
Flat file targets
If the session targets a flat file, you probably do not have a write bottleneck. If the session is writing to a SAN or a non-local file system, performance may be slower than writing to a local file system. If possible, a session can be optimized by writing to a flat file target local to the Integration Service. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives. If the SAN or non-local file system is significantly slower than the local file system, work with the appropriate network/storage group to determine if there are configuration issues within the SAN.
Source Bottlenecks

Relational sources
If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks.
Using a Filter Transformation. Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.
Using a Read Test Session. You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing any transformation logic from the mapping. Use the following steps to create a read test mapping:
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.
Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.
Using a Database Query. You can also identify source bottlenecks by executing a read query directly against the source database. To do so, perform the following steps:
Copy the read query directly from the session log. Run the query against the source database with a query tool such as SQL*Plus (see the sketch after this list). Measure the query execution time and the time it takes for the query to return the first row. If there is a long delay between the two time measurements, you have a source bottleneck.
If your session reads from a relational source and is constrained by a source bottleneck, review the following suggestions for improving performance:
Optimize the query.
Create tempdb as an in-memory database.
Use conditional filters.
Increase database network packet size.
Connect to Oracle databases using the IPC protocol.
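A quick way to take these measurements is to run the copied query in SQL*Plus with timing enabled, as in the sketch below; the connect string and query file name are placeholders:

# run the copied source qualifier query with elapsed-time reporting turned on
sqlplus pc_user/password@sourcedb <<EOF
set timing on
@/tmp/source_qualifier_query.sql
exit
EOF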
Flat file sources
If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the line sequential buffer length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may improve performance when reading flat file sources. Also, ensure the flat file source is local to the Integration Service.
Mapping Bottlenecks
If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping. Begin by adding a Filter transformation in the mapping immediately before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck. You can also use the performance details to identify mapping bottlenecks: high Rowsinlookupcache and high Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:
Create a test mapping without transformations:
1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.
Check for high Rowsinlookupcache counters. Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.
Check for high Errorrows counters. Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.
For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.
Session Bottlenecks
Session performance details can be used to flag other problem areas. Create performance details by selecting “Collect Performance Data” in the session properties before running the session. View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation within the mapping to help you understand session and mapping efficiency.
To view the performance details during the session run:
Right-click the session in the Workflow Monitor.
Choose Properties.
Click the Properties tab in the details dialog box.
To view the resulting performance data file, look for the file session_name.perf in the same directory as the session log and open the file in any text editor. All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.
Low buffer input and buffer output counters. If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.
Aggregator, Rank, and Joiner readfromdisk and writetodisk counters. If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes. If the session performs incremental aggregation, the Aggregator_readtodisk and writetodisk counters display a number other than zero because the Integration Service reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate the incremental Aggregator_readtodisk and writetodisk counters during the session. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.
Note: PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, aggregator, rank, and joiner caches were assigned at a global/session level.
For further details on eliminating session bottlenecks, refer to the Best Practices: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.
System Bottlenecks
After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the Integration Service. The Integration Service uses system resources to process transformations, session execution, and the reading and writing of data. The Integration Service also uses system memory for other data tasks such as creating aggregator, joiner, rank, and lookup table caches. You can use system performance monitoring tools to monitor the amount of system resources the server uses and identify system bottlenecks.
Windows NT/2000. Use system tools such as the Performance and Processes tabs in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.
UNIX. Use the following system tools to monitor system performance and identify system bottlenecks:
lsattr -E -l sys0 - to view current system settings
iostat - to monitor loading operations for every disk attached to the database server
vmstat or sar -w - to monitor disk swapping actions
sar -u - to monitor CPU loading
For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows 2000/2003 Systems.
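As a hedged example, the UNIX statistics above can be collected in the background while a session runs; the interval and count values below are arbitrary, and command options vary slightly between UNIX flavors, so check the man pages on your platform:

# sample CPU utilization every 5 seconds, 60 times, while the session runs
sar -u 5 60 > /tmp/cpu_during_load.txt &

# watch memory and swapping activity at the same interval
vmstat 5 60 > /tmp/vmstat_during_load.txt &

# capture per-disk I/O load
iostat 5 60 > /tmp/iostat_during_load.txt &

# wait for all three collectors to finish before reviewing the output
wait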
Last updated: 01-Feb-07 18:54
Maximizing Performance in CEP Systems

Challenge
RulePoint provides a number of system configuration parameters that can have a dramatic impact on the performance of a deployed CEP system.
Description
While the default RulePoint configuration settings are sufficient for initial installation and functional testing, they are not recommended for use in production. Significant performance gains can be achieved by tuning the system configuration settings for the customer’s specific needs. Performance tuning is a balancing act where the resources of the system are allocated among competing customer priorities and constraints. The optimal balance is determined by the specifics of the CEP system environment.
Review Available Configuration Parameters
With each version of RulePoint the set of available configuration parameters is modified and expanded in response to customer requirements. The first step of performance tuning is to review the Admin Guide for the specific version of RulePoint used in the customer deployment. The Admin Guide contains a chapter on optimization, which lists the system configuration parameters available for performance tuning in that version. If a newer version of RulePoint provides a new configuration parameter that could provide a performance benefit in the customer’s CEP environment, this can provide an argument for recommending an upgrade.
Record All Configuration Changes
Performance optimization is an experimental endeavor. For any change to a system configuration parameter for performance optimization, record the following data:
Performance metrics before the change
Suspected performance bottleneck the change should address
Old parameter value
New parameter value
Performance metrics after the change
Accept/revert change
Analyze Production Event Load
RulePoint provides system configuration parameters to tune how memory and database resources are allocated to event storage. Gather as much information as possible about the expected event load in the CEP system environment. Relevant data includes:
Expected event data size
Expected event rate
Expected event traffic pattern
Expected event time-relevance
Expected event data size should be considered at the order-of-magnitude level. Event data size acts as a multiplier for the effect of all event-related configuration parameters. In most CEP environments, event data size is on the order of a kilobyte or less. If larger event data sizes are expected in the CEP environment, then event-related configuration parameters in general should be targeted for optimization. Expected event rate varies widely across CEP environments. In general, the higher the expected event rate, the more resources must be allocated for events, and the more performance gains can be achieved by reducing the resources required for events. For example, in an environment that receives a few thousand events a day, less memory needs to be devoted to events, freeing up memory for rule processing, auditing and so on. But it is also the case in such an environment that there will be little performance to be gained in optimizing event storage in the database. In an environment that receives a few thousand events a minute, on the other hand, the opposite holds true. Such a system will require more memory dedicated to events, and
optimization of database event storage will be critical. In some environments it may even be necessary to turn off database storage of events entirely. The expected event traffic pattern affects how precisely system configuration parameters can be tuned. If event traffic is steady, then system parameters can be tuned quite closely. On the other hand, if event traffic tends to come in bursts, then sufficient excess resources must be allocated to maintain acceptable performance during a burst. The expected time-relevance of events affects how quickly events may safely be expired from the system. To maximize performance, the sooner events expire, the better. For example, if events in the CEP environment will only be relevant to users for a matter of minutes, any resources spent to keep those events for longer will be wasted. In general, the time-relevance of events tends to be inversely related to the event rate. In most CEP environments the time-relevance of events is on the order of minutes, hours or days. For longer time periods, consider incorporating archiving and data-mining techniques into the architecture, which would involve the development of custom RulePoint services. See the Best Practice Custom RulePoint Service Development.
Optimize Rule Performance
The vast majority of performance optimization for rules is done at the rule level, but there are some system-level considerations. The most important is the effect of the statementExecutorConfig.maxPoolSize property. A common mistake is to assume that raising this value will always improve system performance. Each additional executor requires additional resources and overhead, which should only be incurred if a resulting performance benefit can be demonstrated. In some cases increasing the statementExecutorConfig.maxPoolSize property above the default value will actually decrease performance, while lowering it below the default value will give an improvement.
Optimize Responder Performance
Because they sit at the end of the event processing pipeline, bottlenecks in Responders can have a severe impact on system performance. Almost all Responders send alerts out from RulePoint and are therefore I/O bound. As a result, raising the value of the responseManagerConfig.maxPoolSize property is more likely to give a performance improvement than raising the statementExecutorConfig.maxPoolSize property.
Disable Non-critical Features
Every feature of RulePoint has an associated performance cost. Some features are more expensive than others, and that cost may be dependent on the particulars of the CEP environment. For all features that can be disabled, work with the customer to determine which features are must-haves, which are nice-to-have and which are unused. Disable all the unused features. For the nice-to-have features, disable them in order of lowest priority until performance requirements are met.
Consider Architecture and Design Changes
If the performance requirements for the CEP system have not been met after RulePoint system configuration optimization is complete, changes to the architecture and design of the implementation may be necessary. The data collected during optimization will be a valuable input to this process.
Last updated: 30-Oct-10 20:10
Performance Tuning Databases (Oracle)

Challenge
Database tuning can result in a tremendous improvement in loading performance. This Best Practice covers tips on tuning Oracle.
Description

Performance Tuning Tools
Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here.
V$ Views
V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them unless access is granted to other users. Keep in mind that querying these views impacts database performance, with each query having an immediate cost. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the ‘SELECT’ privilege, which grants access to individual V$ views, or the ‘SELECT ANY TABLE’ privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter to be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS-owned objects.
Explain Plan
Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them. Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated, copied to SQL*Plus or another SQL tool, and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.
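A sketch of checking a long-running source qualifier query with Explain Plan from the shell is shown below; the credentials, connect string, and query are placeholders, and DBMS_XPLAN requires Oracle 9i or later:

sqlplus pc_user/password@targetdb <<EOF
EXPLAIN PLAN FOR
  SELECT customer_id, customer_name FROM customer_stg WHERE load_date = TRUNC(SYSDATE);
-- display the execution path chosen by the optimizer
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
exit
EOF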
SQL Trace
SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.
TKPROF
The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.
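A typical invocation from the operating system prompt looks like the following; the trace and report file names are hypothetical, and sys=no suppresses recursive SYS statements:
tkprof ora_sess_1234.trc tkprof_report.txt sys=no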
UTLBSTAT & UTLESTAT
Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). ‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.
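A typical run from SQL*Plus, connected as a DBA user, might look like this (the ? substitution resolves to ORACLE_HOME in a standard installation):
-- begin statistics collection
@?/rdbms/admin/utlbstat.sql
-- let the database run its normal workload for hours or days, then end collection;
-- utlestat.sql writes report.txt to the current working directory
@?/rdbms/admin/utlestat.sql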
Disk I/O Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Colocate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes so that when queries run against indexes and tables, they are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology, can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.
Dynamic Sampling Dynamic sampling enables the server to improve performance by:
Estimating single-table predicate statistics where available statistics are missing or may lead to bad estimations.
Estimating statistics for tables and indexes with missing statistics.
Estimating statistics for tables and indexes with out-of-date statistics.
Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter, which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of "2". At compile-time, Oracle determines if dynamic sampling can improve query performance. If so, it issues recursive statements to estimate the necessary statistics. Dynamic sampling can be beneficial when:
The sample time is small compared to the overall query execution time.
Dynamic sampling results in a better performing query.
The query can be executed multiple times.
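For illustration only (the sampling level, table, and column are arbitrary), the level can be raised for a session, or for a single statement with a hint:
ALTER SESSION SET optimizer_dynamic_sampling = 4;
SELECT /*+ DYNAMIC_SAMPLING(s 4) */ COUNT(*) FROM customer_sales_fact s WHERE sales_amount > 1000;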
Automatic SQL Tuning in Oracle Database 10g In its normal mode, the query optimizer needs to make decisions about execution plans in a very short time. As a result, it may not always be able to obtain enough information to make the best decision. Oracle 10g allows the optimizer to run in tuning mode, where it can gather additional information and make recommendations about how specific statements can be tuned further. This process may take several minutes for a single statement, so it is intended to be used on high-load, resource-intensive statements. In tuning mode, the optimizer performs the following analysis:
Statistics Analysis. The optimizer recommends the gathering of statistics on objects with missing or stale statistics. Additional statistics for these objects are stored in an SQL profile.
SQL Profiling. The optimizer may be able to improve performance by gathering additional statistics and altering session-specific parameters such as the OPTIMIZER_MODE. If such improvements are possible, the information is stored in an SQL profile. If accepted, this information can then be used by the optimizer when running in normal mode. Unlike a stored outline, which fixes the execution plan, an SQL profile may still be of benefit when the contents of the table alter drastically. Even so, it is sensible to update profiles periodically. SQL profiling is not performed when the tuning optimizer is run in limited mode.
Access Path Analysis. The optimizer investigates the effect of new or modified indexes on the access path. Because its index recommendations relate to a specific statement, where practical, it also suggests the use of the SQL Access Advisor to check the impact of these indexes on a representative SQL workload.
SQL Structure Analysis. The optimizer suggests alternatives for SQL statements that contain structures that may affect performance. Be aware that implementing these suggestions requires human intervention to check their validity.
TIP The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page.
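The same analysis can also be driven from SQL*Plus with the DBMS_SQLTUNE package (the ADVISOR privilege is required). The sketch below is illustrative only; the task name and SQL text are hypothetical:
DECLARE
  l_task_name VARCHAR2(30);
BEGIN
  -- create and run a tuning task for a single resource-intensive statement
  l_task_name := DBMS_SQLTUNE.CREATE_TUNING_TASK(
                   sql_text   => 'SELECT * FROM customer_sales_fact WHERE sales_amount > 1000',
                   task_name  => 'tune_sales_fact',
                   scope      => 'COMPREHENSIVE',
                   time_limit => 60);
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'tune_sales_fact');
END;
/
-- review the findings and recommendations
SET LONG 100000
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_sales_fact') FROM dual;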
Useful Views Useful views related to automatic SQL tuning include:
DBA_ADVISOR_TASKS DBA_ADVISOR_FINDINGS DBA_ADVISOR_RECOMMENDATIONS DBA_ADVISOR_RATIONALE DBA_SQLTUNE_STATISTICS DBA_SQLTUNE_BINDS DBA_SQLTUNE_PLANS DBA_SQLSET DBA_SQLSET_BINDS DBA_SQLSET_STATEMENTS DBA_SQLSET_REFERENCES DBA_SQL_PROFILES
V$SQL V$SQLAREA V$ACTIVE_SESSION_HISTORY
Memory and Processing Memory and processing configuration is performed in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and is not likely to ever exist.
TIP Changes made in the init.ora file take effect only after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (or “shutdown immediate” if necessary) to the instance. Note that svrmgr is no longer available as of Oracle 9i because Oracle is moving to a web-based server manager in Oracle 10g. If you are using Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, such as DBArtisan, also expose the initialization parameters. The settings presented here are those used on a four-CPU AIX server running Oracle 7.3.4, set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We’ve also included the descriptions and documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle) systems determine what the commands do in the Oracle environment and set their native database commands and settings in a similar fashion.
HASH_AREA_SIZE = 16777216
Default value: 2 times the value of SORT_AREA_SIZE. Range of values: any integer. This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)
HASH_JOIN_ENABLED
In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true. In Oracle 8i and above, hash_join_enabled=true is the default value.
HASH_MULTIBLOCK_IO_COUNT
Allows multiblock reads against the TEMP tablespace. It is advisable to set the NEXT extent size to greater than the value for hash_multiblock_io_count to reduce disk I/O. This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except this one applies only to multiblock access of segments of the TEMP tablespace.
STAR_TRANSFORMATION_ENABLED
Determines whether a cost-based query transformation will be applied to star queries. When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table.
OPTIMIZER_INDEX_COST_ADJ
Numeric parameter with a range of 1 to 10000 (default 100). This parameter lets you tune the optimizer behavior for access path selection to be more or less index-friendly.
Optimizer_percent_parallel=33 This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full-table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans.
Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.
parallel_max_servers=40
Used to enable parallel query. Initially not set on install. Maximum number of query servers or parallel recovery processes for an instance.
parallel_min_servers=8
Used to enable parallel query. Initially not set on install. Minimum number of query server processes for an instance. Also the number of query-server processes Oracle creates when the instance is started.
SORT_AREA_SIZE=8388608
Default value: operating system-dependent. Minimum value: the value equivalent to two database blocks. This parameter specifies the maximum amount, in bytes, of program global area (PGA) memory to use for a sort. After the sort is complete, and all that remains is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system. Increasing SORT_AREA_SIZE improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time. The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.
Automatic Shared Memory Management in Oracle 10g Automatic Shared Memory Management puts Oracle in control of allocating memory within the SGA. The SGA_TARGET parameter sets the amount of memory available to the SGA. This parameter can be altered dynamically up to a maximum of the SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will control the memory pools that would otherwise be controlled by the following parameters:
DB_CACHE_SIZE (default block size)
SHARED_POOL_SIZE
LARGE_POOL_SIZE
JAVA_POOL_SIZE
If these parameters are set to a non-zero value, they represent the minimum size for the pool. These minimum values may be necessary if you experience application errors when certain pool sizes drop below a specific threshold. The following parameters must be set manually and take memory from the quota allocated by the SGA_TARGET parameter:
DB_KEEP_CACHE_SIZE
DB_RECYCLE_CACHE_SIZE
DB_nK_CACHE_SIZE (non-default block size)
STREAMS_POOL_SIZE
LOG_BUFFER
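As a brief illustration (the size is arbitrary and assumes the instance was started with an spfile and that SGA_MAX_SIZE is at least as large), Automatic Shared Memory Management can be enabled as follows:
ALTER SYSTEM SET statistics_level = TYPICAL SCOPE = BOTH;
ALTER SYSTEM SET sga_target = 1G SCOPE = BOTH;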
IPC as an Alternative to TCP/IP on UNIX On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to
IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000 row write (array inserts), and primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec). A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:
DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )
Make a new entry in the tnsnames like this, and use it for the connection to the local Oracle instance:
DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = IPC)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )
Improving Data Load Performance Alternative to Dropping and Reloading Indexes Experts often recommend dropping and reloading indexes during very large loads to a data warehouse, but there is no easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it, can be a very tedious process. Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generates SQL statements as output to disable and enable these indexes. Run the following to generate output to disable the foreign keys in the data warehouse:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;
Dropping or disabling primary keys also speeds loads. Run the results of this SQL statement after disabling the foreign key constraints:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'
This produces output that looks like:
ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;
Finally, disable any unique constraints with the following:
SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'
This produces output that looks like:
ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;
ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;
Save the results in a single file and name it something like ‘DISABLE.SQL’. To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command. Re-enable constraints in the reverse order that you disabled them: re-enable the unique constraints first, then primary keys, and finally foreign keys.
TIP Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.
Optimizing Query Performance
Oracle Bitmap Indexing With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/highly-duplicated data and may even increase query time. A typical example of a low cardinality field is gender – it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance. Keep in mind however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle defaults to b-tree. Also note that for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree index on the same column. Bitmap indexes are suited to data warehousing because of their performance, size, and ability to create and drop very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bit-map indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-create or re-enable them after the load. The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single column bitmapped indexes on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical.
Bitmap Indexes
drop index emp_active_bit;
drop index emp_gender_bit;
create bitmap index emp_active_bit on emp (active_flag);
create bitmap index emp_gender_bit on emp (gender);
B-tree Indexes
drop index emp_active;
drop index emp_gender;
create index emp_active on emp (active_flag);
create index emp_gender on emp (gender);
Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique. To enable bitmap indexes, you must set the following items in the instance initialization file:
compatible = 7.3.2.0.0 # or higher
event = "10111 trace name context forever"
event = "10112 trace name context forever"
event = "10114 trace name context forever"
Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error appears in the SQL statement; the keyword ‘bitmap’ won't be recognized.
TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed,
the word ‘parallel’ appears in the banner text.
Index Statistics Table method Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following should improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse. The following SQL statement can be used to analyze the tables in the database:
SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'
FROM USER_TABLES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following result:
ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;
ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;
ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;
The following SQL statement can be used to analyze the indexes in the database:
SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'
FROM USER_INDEXES
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
This generates the following results:
ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;
ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;
Save these results as a SQL script to be executed before or after a load.
Schema method Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, you can use the following command to update the statistics: EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute'); In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command. TIP
These SQL statements can be very resource intensive, especially for very large tables. For this reason, Informatica recommends running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.
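For example (the sampling percentage is arbitrary), an estimated analysis of the objects shown above might look like this:
ANALYZE TABLE CUSTOMER_DIM ESTIMATE STATISTICS SAMPLE 10 PERCENT;
ANALYZE INDEX SYS_C0011125 ESTIMATE STATISTICS;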
Parallelism Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of
processors being the minimum degree.
SQL Level Parallelism Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors: SELECT /*+ PARALLEL(order_fact,4) */ …; SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;
TIP When using a table alias in the SQL statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message. Example of improper use of an alias:
SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A
Here, the parallel hint will not be used because the alias “A” is used for table EMP. The correct way is:
SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A
Table Level Parallelism Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table: ALTER TABLE order_fact PARALLEL 4; Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.
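Parallelism can be set for an index in the same way; the index name below is illustrative:
ALTER INDEX order_fact_ix1 PARALLEL 4;
ALTER INDEX order_fact_ix1 NOPARALLEL;
The second statement reverts the index to serial execution once the parallel operation is complete.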
Additional Tips Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:
sqlplus -s user_id/password@database @script_name.sql
For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named ‘infadb’), you would execute the following as a post-session command:
sqlplus -s user_id/password@infadb @enable.sql
In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system’s authentication to log onto the database instance. In the following example, the Informatica id “pmuser” is used to log onto the Oracle database. Create the Oracle user “pmuser” with the following SQL statement:
CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .
In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:
sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL
You may want to use the init.ora parameter “os_authent_prefix” to distinguish between “normal” Oracle users and “externally-identified” ones. DRIVING_SITE ‘Hint’ If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance. For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint, which names the alias of a table at the instance that should perform the join, in the SQL statement as:
SELECT /*+ DRIVING_SITE(a) */ …;
Last updated: 01-Feb-07 18:54
Performance Tuning Databases (SQL Server) Challenge Database tuning can result in tremendous improvement in loading performance. This Best Practice offers tips on tuning SQL Server.
Description Proper tuning of the source and target database is a very important consideration in the scalability and usability of a business data integration environment. Managing performance on a SQL Server involves the following points:
Manage system memory usage (RAM caching).
Create and maintain good indexes.
Partition large data sets and indexes.
Monitor disk I/O subsystem performance.
Tune applications and queries.
Optimize active data.
Taking advantage of grid computing is another option for improving overall SQL Server performance. To set up a SQL Server cluster environment, create a cluster in which the databases are split among the nodes. This provides the ability to distribute the load across multiple nodes. To achieve high performance, Informatica recommends using a fibre-attached SAN device for shared storage.
Manage RAM Caching Managing RAM buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O can be reduced to the minimal required set of data and index pages, the pages stay in RAM longer. Too much unnecessary data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is used effectively. Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:
Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
SQL Server allows several selectable models for database recovery. These include:
Full Recovery
Bulk-Logged Recovery
Simple Recovery
Create and Maintain Good Indexes Creating and maintaining good indexes is key to maintaining minimal I/O for all database queries.
Partition Large Data Sets and Indexes To reduce overall I/O contention and improve parallel operations, consider partitioning table data and indexes. Multiple techniques for achieving and managing partitions using SQL Server 2000 are addressed in this document.
Tune Applications and Queries Tuning applications and queries is especially important when a database server is likely to be servicing requests from hundreds or thousands of connections through a given application. Because applications typically determine the SQL queries that are executed on a database server, it is very important for application developers to understand SQL Server architectural basics and know how to take full advantage of SQL Server indexes to minimize I/O.
Partitioning for Performance The simplest technique for creating disk I/O parallelism is to use hardware partitioning and create a single "pool of drives" that
serves all SQL Server database files except transaction log files, which should always be stored on physically-separate disk drives dedicated to log files. (See Microsoft documentation for installation procedures.)
Objects For Partitioning Consideration The following areas of SQL Server activity can be separated across different hard drives, RAID controllers, and PCI channels (or combinations of the three):
Transaction logs
Tempdb
Database Tables
Nonclustered Indexes
Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.
Segregating the Transaction Log Transaction log files should be maintained on a storage device that is physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are set up to share the same device, the operations to be performed compete for the same limited resources. Most installations benefit from separating these competing I/O activities.
Segregating tempdb SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins. To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:
alter database tempdb modify file (name = 'tempdev', filename = 'e:\mssql7\tempnew_location.mdf')
alter database tempdb modify file (name = 'templog', filename = 'c:\temp\tempnew_loglocation.mdf')
The master database, msdb, and model databases are not used much during production (as compared to user databases), so it is generally not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.
Database Partitioning Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are:
Primary filegroup. Contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup.
User-defined filegroup. Any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager.
Default filegroup. Contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.
Files and filegroups are useful for controlling the placement of data and indexes and eliminating device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more
control over their database backup/recovery strategy.
Horizontal Partitioning (Table) Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition tables horizontally depends on how data is analyzed. A general rule of thumb is to partition tables so queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance. When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly. By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
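As a minimal sketch of such a partitioned view (the table, column, and year values are hypothetical), CHECK constraints on the partitioning column let the optimizer skip member tables that cannot satisfy a query:
CREATE TABLE sales_2006 (
    sale_id   int NOT NULL,
    sale_year int NOT NULL CHECK (sale_year = 2006),
    amount    money NOT NULL,
    PRIMARY KEY (sale_id, sale_year)
)
CREATE TABLE sales_2007 (
    sale_id   int NOT NULL,
    sale_year int NOT NULL CHECK (sale_year = 2007),
    amount    money NOT NULL,
    PRIMARY KEY (sale_id, sale_year)
)
GO
-- the view presents the member tables as one logical table
CREATE VIEW sales AS
SELECT sale_id, sale_year, amount FROM sales_2006
UNION ALL
SELECT sale_id, sale_year, amount FROM sales_2007
GO
-- a query that filters on sale_year reads only the relevant member table
SELECT SUM(amount) FROM sales WHERE sale_year = 2007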
Cost Threshold for Parallelism Option Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).
Max Degree of Parallelism Option Use this option to limit the number of processors (from a maximum of 32) to use in parallel plan execution. The default value is zero, which uses the actual number of available CPUs. Set this option to one to suppress parallel plan generation. Set the value to a number greater than one to restrict the maximum number of processors used by a single query execution.
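Both of these options are server-wide settings that can be changed with sp_configure; the values below are examples only:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 15;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;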
Priority Boost Option Use this option to specify whether SQL Server should run at a higher scheduling priority than other processors on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is zero, which is a priority base of seven.
Set Working Set Size Option Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It can vary dynamically between minimum server memory and maximum server memory. Setting ‘set working set size’ means the operating system does not attempt to swap out SQL Server pages, even if they can be used more readily by another process when SQL Server is idle.
Optimizing Disk I/O Performance When configuring a SQL Server that contains only a few gigabytes of data and does not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for optimal performance. To build larger SQL Server databases however, which can contain hundreds of gigabytes or even terabytes of data and/or that sustain heavy read/write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.
Partitioning for Performance For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism. Partitioning can be performed using a variety of techniques. Methods for creating and managing partitions include configuring the storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, filegroups, tables and views. Some possible candidates for partitioning include:
Transaction log
Tempdb
Database Tables
Non-clustered indexes
Using bcp and BULK INSERT Two mechanisms exist inside SQL Server to address the need for bulk movement of data: the bcp utility and the BULK INSERT statement. Bcp is a command prompt utility that copies data into or out of SQL Server. BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt. TIP
Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000 you could have saved significant recovery time, because SQL Server would have only had to roll back 9,999 rows instead of 999,999.
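A sketch of a BULK INSERT with an explicit batch size follows; the table name, file path, and delimiters are hypothetical:
BULK INSERT dbo.customer_sales_fact
FROM 'e:\loads\customer_sales.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000,   -- commit every 10,000 rows so a failure rolls back only the current batch
    TABLOCK              -- a table lock enables minimally logged bulk loading
);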
General Guidelines for Initial Data Loads While loading data:
Remove indexes.
Use BULK INSERT or bcp.
Parallel load using partitioned data files into partitioned tables.
Run one load stream for each available CPU.
Set the Bulk-Logged or Simple Recovery model.
Use the TABLOCK option.
After loading data:
Create indexes.
Switch to the appropriate recovery model.
Perform backups.
General Guidelines for Incremental Data Loads
Load data with indexes in place.
Use performance and concurrency requirements to determine locking granularity (sp_indexoption).
Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.
Last updated: 01-Feb-07 18:54
Performance Tuning Databases (Teradata) Challenge Database tuning can result in tremendous improvement in loading performance. This Best Practice provides tips on tuning Teradata.
Description Teradata offers several bulk load utilities including:
MultiLoad, which supports inserts, updates, deletes, and “upserts” to any table.
FastExport, which is a high-performance bulk export utility.
BTEQ, which allows you to export data to a flat file but is suitable for smaller volumes than FastExport.
FastLoad, which is used for loading inserts into an empty table.
TPump, which is a light-weight utility that does not lock the table that is being loaded.
Tuning MultiLoad There are many aspects to tuning a Teradata database. Several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.
MultiLoad parameters Below are the MultiLoad-specific parameters that are available in PowerCenter:
TDPID. A client-based operand that is part of the logon string.
Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.
Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.
Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.
Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.
Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.
Max Sessions. This parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working AMP (Access Module Processor).
Sleep. This parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.
Estimating Space Requirements for MultiLoad Jobs Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:
Work tables
Error tables
Restart Log table
Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table.
Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes: PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML) Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.
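As a point of reference, with hypothetical values of a 100-byte using data size, 1,000,000 rows processed, one apply condition satisfied, and one Teradata SQL statement in the applied DML, the preliminary estimate works out to:
PERM = (100 + 38) x 1,000,000 x 1 x 1 = 138,000,000 bytes (roughly 132 MB)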
Monitoring MultiLoad Performance Below are tips for analyzing MultiLoad performance:
1. Determine which phase of the MultiLoad job is causing poor performance. If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system. The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.
2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.ResUsage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.
8. Poor performance can happen when the input data is skewed with respect to the Primary Index of the database. Teradata depends upon random and well-distributed data for data input and retrieval. For example, a file containing a million rows with a single value 'AAAAAA' for the Primary Index will take an extremely long time to load.
9. One common tool used for determining load issues, skewed data, and locks is Performance Monitor (PMON). PMON requires MONITOR access on the Teradata system. If you do not have MONITOR access, the DBA can help you look at the system.
10. SQL against the system catalog can also be used to determine any performance bottlenecks. The following query is used to see if the load is inserting data into the system. Spool space (a type of work space) is built up as data is transferred to the database, so if the load is going well, spool will build rapidly in the database. Use the following query to check:
SELECT SUM(currentspool) FROM dbc.diskspace WHERE databasename = '<userid performing the load>';
After the spool has reached its peak, it will fall rapidly as data is inserted from spool into the table. If the spool grows slowly, then the input data is probably skewed.
FastExport FastExport is a bulk export Teradata utility. One way to pull data for lookups and sources is by using ODBC, since there is no native connectivity to Teradata. However, ODBC is slow. For higher performance, use FastExport if the number of rows to be pulled is on the order of a million rows. FastExport writes to a file; the lookup or source qualifier then reads this file. FastExport is integrated with PowerCenter.
BTEQ BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you to export data to a flat file, but it is suitable for smaller volumes of data. This provides faster performance than ODBC but doesn't tax Teradata system resources the way FastExport can. A possible use for BTEQ with PowerCenter is to export smaller volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-session script.
TPump TPump is a load utility primarily intended for streaming data (think of loading bundles of messages arriving from MQ using PowerCenter Real Time). TPump can also load from a file or a named pipe. While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility. Another important difference between MultiLoad and TPump is that TPump locks at the row-hash level instead of the table level, thus providing users read access to fresher data. Teradata says that it has improved the speed of TPump for loading files so that it is comparable to MultiLoad, so try a test load using TPump first. Also, be cautious with the use of TPump to load streaming data if the data throughput is large.
Push Down Optimization PowerCenter embeds a powerful engine that has a memory management system and smart algorithms built in to perform various transformation operations such as aggregation, sorting, joining, and lookup. This is typically referred to as an ETL architecture, where Extracts, Transformations, and Loads are performed. Data is extracted from the data source to the PowerCenter engine (which can be on the same machine as the source or a separate machine), where all the transformations are applied, and is then pushed to the target. Some of the performance considerations for this type of architecture are:
Is the network fast enough and tuned effectively to support the necessary data transfer?
Is the hardware on which PowerCenter is running sufficiently robust, with high processing capability and high memory capacity?
ELT (Extract, Load, Transform) is a relatively new design or runtime paradigm that became popular with the advent of high-performance RDBMS systems supporting DSS and OLTP workloads. Because Teradata typically runs on well-tuned operating systems and well-tuned hardware, the ELT paradigm tries to push as much of the transformation logic as possible onto the Teradata system. The ELT design paradigm can be achieved through the Pushdown Optimization option offered with PowerCenter.
ETL or ELT Because many database vendors and consultants advocate using ELT (Extract, Load and Transform) over ETL (Extract, Transform and Load), the use of Pushdown Optimization can be somewhat controversial. Informatica advocates using Pushdown Optimization as an option to solve specific performance situations rather than as the default design of a mapping. The following scenarios can help in deciding when to use ETL with PowerCenter and when to use ELT (i.e., Pushdown Optimization):
1. When the load needs to look up only dimension tables, there may be no need to use Pushdown Optimization. In this context, PowerCenter's ability to build dynamic, persistent caching is significant. If a daily load involves tens or hundreds of fact files to be loaded throughout the day, then dimension surrogate keys can be easily obtained from PowerCenter's cache in memory. Compare this with the cost of running the same dimension lookup queries on the database.
2. In many cases large Teradata systems contain only a small amount of data. In such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied to the data, there may be no need to push down. A special case is that of applying filters or expression logic to non-unique columns in incoming data in PowerCenter. Compare this to loading the same data into the database and then applying a WHERE clause on a non-unique column, which is highly inefficient for a large table. The principle here is: filter and resolve the data as it gets loaded instead of loading it into a database, querying the RDBMS to filter/resolve, and re-loading it into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data needs to be merged or queried to get to your final load set.
Maximizing Performance using Pushdown Optimization You can push transformation logic to either the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration.
When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and processes any transformation logic that it cannot push to the database. Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.
Known Issues with Teradata You may encounter the following problems using ODBC drivers with a Teradata database:
Teradata sessions fail if the session requires a conversion to a numeric data type and the precision is greater than 18.
Teradata sessions fail when you use full pushdown optimization for a session containing a Sorter transformation.
A sort on a distinct key may give inconsistent results if the sort is not case sensitive and one port is a character port.
A session containing an Aggregator transformation may produce different results from PowerCenter if the group by port is a string data type and it is not case-sensitive.
A session containing a Lookup transformation fails if it is configured for target-side pushdown optimization.
A session that requires type casting fails if the casting is from x to date/time.
A session that contains a date to string conversion fails.
Working with SQL Overrides You can configure the Integration Service to perform an SQL override with Pushdown Optimization. To perform an SQL override, you configure the session to create a view. When you use a SQL override for a Source Qualifier transformation in a session configured for source or full Pushdown Optimization with a view, the Integration Service creates a view in the source database based on the override. After it creates the view in the database, the Integration Service generates a SQL query that it can push to the database. The Integration Service runs the SQL query against the view to perform Pushdown Optimization. Note: To use an SQL override with pushdown optimization, you must configure the session for pushdown optimization with a view.
Running a Query If the Integration Service did not successfully drop the view, you can run a query against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. You can search for views with this prefix to locate the views created during pushdown optimization. Teradata-specific SQL:
SELECT TableName FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\'
Rules and Guidelines for SQL Override Use the following rules and guidelines when you configure pushdown optimization for a session containing an SQL override:
Last updated: 01-Feb-07 18:54
Performance Tuning in a Real-Time Environment Challenge As Data Integration becomes a broader and more service-oriented Information Technology initiative, real-time and right-time solutions will become critical to the success of the overall architecture. Tuning real-time processes is often different from tuning batch processes.
Description To remain agile and flexible in increasingly competitive environments, today’s companies are dealing with sophisticated operational scenarios such as consolidation of customer data in real time to support a call center or the delivery of precise forecasts for supply chain operation optimization. To support such highly demanding operational environments, data integration platforms must do more than serve analytical data needs. They must also support real-time, 24x7, mission-critical operations that involve live or current information available across the enterprise and beyond. They must access, cleanse, integrate and deliver data in real time to ensure up-to-the-second information availability. Also, data integration platforms must intelligently scale to meet both increasing data volumes and also increasing numbers of concurrent requests that are typical of shared services Integration Competency Center (ICC) environments. The data integration platforms must also be extremely reliable, providing high availability to minimize outages and ensure seamless failover and recovery as every minute of downtime can lead to huge impacts on business operations. PowerCenter can be used to process data in real time. Real-time processing is on-demand processing of data from real-time sources. A real-time session reads, processes and writes data to targets continuously. By default, a session reads and writes bulk data at scheduled intervals unless it is configured for real-time processing. To process data in real time, the data must originate from a real-time source. Real-time sources include JMS, WebSphere MQ, TIBCO, webMethods, MSMQ, SAP, and web services. Real-time processing can also be used for processes that require immediate access to dynamic data (i.e., financial data).
Latency Impact on performance Use the Real-time Flush Latency session condition to control the target commit latency when running in real-time mode. PWXPC commits source data to the target at the end of the specified maximum latency period. This parameter requires a valid value and has a valid default value. When the session runs, PWXPC begins to read data from the source. After data is provided to the source qualifier, the Real-Time Flush Latency interval begins. At the end of each Real-Time Flush Latency interval, when an end-UOW boundary is reached, PWXPC issues a commit to the target. The following message appears in the session log to indicate that this has occurred:
[PWXPC_10082] [INFO] [CDCDispatcher] raising real-time flush with restart tokens [restart1_token], [restart2_token] because Real-time Flush Latency [RTF_millisecs] occurred
Only complete UOWs are committed during real-time flush processing. The commit to the target when reading CDC data is not strictly controlled by the Real-Time Flush Latency specification. The UOW Count and the Commit Threshold values also determine the commit frequency. The value specified for Real-Time Flush Latency also controls the PowerExchange Consumer API (CAPI) interface timeout value (PowerExchange latency) on the source platform. The CAPI interface timeout value is displayed in the following PowerExchange message on the source platform (and in the session log if “Retrieve PWX Log Entries” is specified in the Connection Attributes):
PWX-09957 CAPI i/f: Read times out after
TIP Use the PowerExchange STOPTASK command to shut down more quickly when using a high RTF Latency value. For example, if the value for Real-Time Flush Latency is 10 seconds, PWXPC issues a commit for all data read after 10 seconds have elapsed and the next end-UOW boundary is received. The lower the value, the faster the data is committed to the target. If the lowest possible latency is required for applying changes to the target, specify a low Real-Time Flush Latency value. Warning: When you specify a low Real-Time Flush Latency interval, the session might consume more system resources on the source and target platforms. This is because: The session commits to the target more frequently, consuming more target resources. PowerExchange returns to the PWXPC reader more frequently, passing fewer rows on each iteration and consuming more resources on the source PowerExchange platform. Balance performance and resource consumption with latency requirements when choosing the UOW Count and Real-Time Flush Latency values.
Commit Interval Impact on performance Commit Threshold is only applicable to Real-Time CDC sessions. Use the Commit Threshold session condition to cause commits before reaching the end of the UOW when processing large UOWs. This parameter requires a valid value and has a valid default value.
Commit Threshold can be used to cause a commit before the end of a UOW is received, a process also referred to as subpacket commit. The value specified in the Commit Threshold is the number of records within a source UOW to process before inserting a commit into the change stream. This attribute is different from the UOW Count attribute in that it counts records within a UOW rather than complete UOWs. The Commit Threshold counter is reset when either the number of records specified or the end of the UOW is reached. This attribute is useful when there are extremely large UOWs in the change stream that might cause locking issues on the target database or resource issues on the PowerCenter Integration Server.
The Commit Threshold count is cumulative across all sources in the group. This means that sub-packet commits are inserted into the change stream when the specified count is reached, regardless of the number of sources to which the changes actually apply. For example, a UOW contains 900 changes for one source followed by 100 changes for a second source and then 500 changes for the first source. If the Commit Threshold is set to 1000, the commit record is inserted after the 1000th change record, which is after the 100 changes for the second source.
Warning: A UOW may contain changes for multiple source tables. Using Commit Threshold can cause commits to be generated at points in the change stream where the relationship between these tables is inconsistent. This may then result in target commit failures.
If 0 or no value is specified, commits occur on UOW boundaries only. Otherwise, the value specified is used to insert commit records into the change stream between UOW boundaries, where applicable. The value of this attribute overrides the value specified in the PowerExchange DBMOVER configuration file parameter SUBCOMMIT_THRESHOLD. For more information on this PowerExchange parameter, refer to the PowerExchange Reference Manual.
The commit to the target when reading CDC data is not strictly controlled by the Commit Threshold specification. The commit records inserted into the change stream as a result of the Commit Threshold value affect the UOW Count counter, and the UOW Count and Real-Time Flush Latency values determine the target commit frequency. For example, a UOW contains 1,000 change records (any combination of inserts, updates, and deletes). If 100 is specified for the Commit Threshold and 5 for the UOW Count, then a commit record is inserted after each 100 records and a target commit is issued after every 500 records.
Last updated: 29-May-08 18:40
Performance Tuning UNIX Systems Challenge Identify opportunities for performance improvement within the complexities of the UNIX operating environment.
Description This section provides an overview of the subject area, followed by discussion of the use of specific tools.
Overview All system performance issues are fundamentally resource contention issues. In any computer system, there are three essential resources: CPU, memory, and I/O (namely disk and network I/O). From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and its sub-processes have adequate resources to execute in a timely and efficient manner. Each resource has its own particular set of problems. Resource problems are complicated because all resources interact with each other. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to take a baseline measurement first and gain a good understanding of how the system behaves, then evaluate the bottlenecks revealed on each system resource during your load window and remove whichever resource contention offers the greatest opportunity for performance enhancement. Here is a summary of each system resource area and the problems it can have.
CPU On any multiprocessing and multi-user system, many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocating a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, all processing is likely to suffer; the system scheduler puts each process in a queue to wait for CPU availability. The average count of active processes in the system over the last 1, 5, and 15 minutes is reported as the load average when you execute the command uptime. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs along with the number of processes contending for CPU (the value under the r column). On SMP (symmetric multiprocessing) servers, watch for even utilization of all the CPUs; how well all the CPUs are utilized depends on how well an application can be parallelized. If a process incurs a high degree of involuntary context switches by the kernel, binding the process to a specific CPU may improve performance.
Memory Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running. Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.
Disk I/O The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that are currently running. The system's I/O buses can transfer only so many megabytes per second; individual devices are even more limited. Each type of device has its own peculiarities and, therefore, its own problems. Tools are available to evaluate specific parts of the I/O subsystem: iostat can give you information about the transfer rates for each disk drive, and ps and vmstat can give some information about how many processes are blocked waiting for I/O.
sar can provide voluminous information about I/O efficiency. sadp can give detailed information about disk access patterns.
Network I/O The source data, the target data, or both the source and target data are likely to be connected through an Ethernet channel to the system where PowerCenter resides. Be sure to consider the number of Ethernet channels and the bandwidth available to avoid congestion. netstat shows packet activity on a network; watch for a high collision rate of output packets on each interface. nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server) and watch for a high timeout rate relative to total calls and for “not responding” messages. Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance: Adjusting execution schedules to leverage low-usage times may improve the availability of memory, disk, network bandwidth, CPU cycles, etc. Migrating other applications to other hardware is likely to reduce demand on the hardware hosting PowerCenter. For CPU-intensive sessions, raising CPU priority (or lowering the priority of competing processes) provides more CPU time to the PowerCenter sessions. Adding hardware resources, such as memory, can make more resource available to all processes. Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes.
Detailed Usage The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips are likely to be more helpful than others in a particular environment, all are worthy of consideration. The availability, syntax, and format of each vary across UNIX versions.
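Before applying individual tips, it can help to capture a coarse baseline during the load window. The snapshot script below is a sketch that assumes a POSIX shell and the standard monitoring utilities discussed in this section; the exact options and the /tmp/perf_baseline output directory are illustrative and vary by UNIX flavor.
#!/bin/sh
# Capture a coarse system baseline during the PowerCenter load window (illustrative only)
OUTDIR=/tmp/perf_baseline
mkdir -p $OUTDIR
uptime        > $OUTDIR/uptime.out     # load average: contenders for CPU time
vmstat 5 10   > $OUTDIR/vmstat.out     # CPU, run queue, and paging activity
iostat 5 10   > $OUTDIR/iostat.out     # per-disk transfer rates
sar -u 5 10   > $OUTDIR/sar_cpu.out    # %usr, %sys, %wio, %idle (where sar is installed)
netstat -i    > $OUTDIR/netstat.out    # interface errors and collisions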
Running ps -axu
Run ps -axu to check for the following items: Are there any processes waiting for disk access or for paging? If so, check the I/O and memory subsystems. What processes are using most of the CPU? This may help to distribute the workload better. What processes are using most of the memory? This may help to distribute the workload better. Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.
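For example, the following commands (a sketch that assumes BSD-style ps output, where %CPU, %MEM, and RSS are columns 3, 4, and 6) list the heaviest consumers:
ps aux | sort -rn -k 3 | head -10    # processes using the most CPU
ps aux | sort -rn -k 4 | head -10    # processes using the most memory (%MEM)
ps aux | sort -rn -k 6 | head -10    # processes with the largest resident set size (RSS)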
Identifying and Resolving Memory Issues
Use vmstat or sar to check for paging/swapping activity. Check the system to ensure that excessive paging/swapping does not occur at any time during session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping does occur at any time, increase memory to prevent it. Paging/swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server. Some swapping may occur normally regardless of the tuning settings because some processes use the swap space by design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased. Run vmstat -S 5 (or sar -wpgr on SunOS) to detect and confirm memory problems, and check for the following: Are page-outs occurring consistently? If so, you are short of memory. Are there a high number of address translation faults? (System V only.) This suggests a memory shortage. Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate an extreme memory shortage. If you don't have vmstat -S, look at the w and de fields of vmstat. These should always be zero.
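The paging and swap checks above can be run along these lines (a sketch; option names and report columns differ between Solaris, AIX, HP-UX, Linux, and BSD):
vmstat 5 10               # watch the page-in/page-out (or scan rate) columns for sustained activity
sar -r 5 10               # free memory and free swap over time, where sar is installed
sar -w 5 10               # swapping activity on System V-style sar implementations
swap -l                   # Solaris: configured swap devices and remaining free blocks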
If memory seems to be the bottleneck, try the following remedial steps: Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES. If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need. Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of files, number of processes, etc.). Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much. Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily. Try to limit the time spent running sendmail, which is a memory hog. If you don't see any significant improvement, add more memory.
Identifying and Resolving Disk I/O Issues
Use iostat to check I/O load and utilization as well as CPU load. iostat can be used to monitor the I/O load on the disks on the UNIX server; using it permits monitoring the load on specific disks. Take notice of how evenly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks? Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)? Reorganize your file systems and disks to distribute I/O activity as evenly as possible. Using symbolic links helps to keep the directory structure the same while still moving the data files that are causing I/O contention. Use your fastest disk drive and controller for your root file system; this almost certainly has the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system. Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD). Increase the size of the buffer cache by increasing BUFPAGES (BSD); this may hurt your system's memory performance. Rebuild your file systems periodically to eliminate fragmentation (i.e., backup, build a new file system, and restore). If you are using NFS and accessing remote files, look at your network situation; you do not have local disk I/O problems. Check memory statistics again by running vmstat 5 (or sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problems first, because swapping makes performance worse. If your system has disk capacity problems and is constantly running out of disk space, try the following actions: Write a find script that detects old core dumps, editor backup and auto-save files, and other trash, and deletes them automatically; run the script through cron. Use the disk quota system (if your system has one) to prevent individual users from gathering too much storage. Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).
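A few of the disk checks and housekeeping steps above, sketched for illustration (options vary by platform, and the /data path, file pattern, and retention period are examples only):
iostat -x 5 5             # extended per-device statistics where supported: busy %, service times, queue length
sar -d 5 5                # per-device activity where sar supports the -d option
df -k                     # file system capacity; watch for file systems that are nearly full
find /data -name core -mtime +7 -exec rm {} \;   # sample cron-able cleanup of old core dumps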
Identifying and Resolving CPU Overload Issues
Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy load of %sys and a high %idle relative to %usr, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the server. When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere; this points to CPU overload. Eliminate unnecessary daemon processes; rwhod and routed are particularly likely to be performance problems, but any savings will help. Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.
Using nice to lower the priority of CPU-bound jobs improves interactive performance. Also, using nice to raise the priority of CPU-bound jobs expedites them but may hurt interactive performance. In general, though, using nice is really only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.
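The CPU checks and priority adjustments described above might look like the following sketch (the job name and PID are placeholders, and raising a process's priority normally requires root):
sar -u 5 10                         # %usr, %sys, %wio, and %idle over ten 5-second samples
uptime                              # 1-, 5-, and 15-minute load averages
nice -n 10 ./heavy_batch_job.sh &   # start a CPU-bound batch job at lower priority
renice -n -5 -p 12345               # raise the priority of an existing process (PID 12345 is an example)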
Identifying and Resolving Network I/O Issues
Suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS. Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network; a large number of output errors suggests problems with your system and its interface to the network. If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory, or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads; try to reorganize the network so that this system isn't a file server. A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors. Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets. Try to prevent users from running I/O-intensive programs across the network; the grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work. Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system. Use systems with good network performance as file servers. lsattr -E -l sys0 is used to determine some current settings on some UNIX environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the setting that determines the maximum number of processes allowed per user. On most UNIX environments, this defaults to 40, but it should be increased to 250 on most systems. Choose a file system. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices that, in reality, are not a file system at all. Additionally, for the PowerCenter Enterprise Grid Option, cluster file system (CFS) products such as GFS for Red Hat Linux, Veritas CFS, and GPFS for IBM AIX are some of the available choices.
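The network and kernel checks described in this section can be run along the following lines (a sketch; spray and lsattr are not available on every UNIX flavor, and target_host is a placeholder):
netstat -i                # per-interface packets, errors, and collisions
netstat -s                # protocol statistics, including UDP socket overflows
nfsstat -c                # client RPC statistics: watch retrans, timeout, and badxid
spray target_host         # flood a suspect host with packets and count the drops
lsattr -E -l sys0         # AIX: display kernel settings such as maxuproc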
Cluster File System Tuning In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file system (CFS) is recommended. The PowerCenter Grid option requires that the directories for each Integration Service be shared with other servers. This allows Integration Services to share files, such as cache files, between different session runs. CFS performance is a result of tuning parameters and tuning the infrastructure; therefore, using the parameters recommended by each CFS vendor is the best approach to CFS tuning.
PowerCenter Options The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about the running tasks, including CPU%, memory, and swap usage. The PowerCenter 64-bit option can allocate more memory to sessions and achieve higher throughput compared to the 32-bit version of PowerCenter.
Last updated: 06-Dec-07 15:16
Performance Tuning Windows 2000/2003 Systems Challenge Windows Server is designed as a self-tuning operating system. A standard installation of Windows Server provides good performance out of the box, but optimal performance can be achieved by tuning. Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.
Description The following tips have proven useful in performance tuning Windows servers. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration. The two places to begin tuning a Windows server are: the Performance Monitor, and the Performance tab of Task Manager (press Ctrl+Alt+Del, choose Task Manager, and click the Performance tab). Although the Performance Monitor can be tracked in real time, creating a result set representative of a full day is more likely to render an accurate view of system performance.
Resolving Typical Windows Server Problems The following paragraphs describe some common performance problems in a Windows Server environment and suggest tuning solutions.
Server Load: Assume that some software will not be well coded and that some background processes (e.g., a mail server or web server) running on the same machine can starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.
Device Drivers: The device drivers for some types of hardware are notorious for wasting CPU clock cycles. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem.
Memory and services: Although adding memory to Windows Server is always a good solution, it is also expensive and usually must be planned in advance. Before adding memory, check the Services in Control Panel because many background applications do not uninstall the old service when installing a new version. Thus, both the unused old service and the new service may be consuming valuable CPU and memory resources.
I/O Optimization: This is, by far, the best tuning option for database applications in the Windows Server environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too. Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows Server disk defragmentation product. Finally, on Windows Servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the Windows Server; Windows Server, by default, sets the disk device priority low.
Monitoring System Performance in Windows Server In Windows Server, PowerCenter uses system resources to process transformations, session execution, and the reading and writing of data. The PowerCenter Integration Service also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows Server, you can use the System Monitor in the Performance Console of the administrative tools, or the system tools in Task Manager, to monitor the amount of system resources used by PowerCenter and to identify system bottlenecks. Windows Server provides the following tools (accessible under Control Panel > Administrative Tools > Performance) for monitoring resource usage on your computer: System Monitor, and Performance Logs and Alerts.
These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.
System Monitor The System Monitor displays a graph which is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from Web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful in monitoring other systems that require administration.
Performance Logs and Alerts The Performance Logs and Alerts tool provides two types of performance-related logs—counter logs and trace logs—and an alerting function. Counter logs record sampled data about hardware resources and system services based on performance objects and counters in the same manner as System Monitor; they can, therefore, be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel. Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity. The alerting function allows you to define a counter value that will trigger actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system. Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.) The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it reaches a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder in the root directory and includes the counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time. If you want to create your own log setting, right-click one of the log types.
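As an illustration (not part of the original text), a similar counter log can also be captured from the command line with the typeperf utility that ships with Windows Server 2003; the output path below is a placeholder.
REM Sample the three System Overview counters every 15 seconds to a CSV file (run from a command prompt)
typeperf "\Memory\Pages/sec" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Processor(_Total)\% Processor Time" -si 15 -o C:\PerfLogs\overview.csv -f CSV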
PowerCenter Options The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about running tasks, including CPU%, memory, and swap usage. PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-bit Windows Server 2003 can allocate more memory to sessions and achieve higher throughput than the 32-bit version of PowerCenter on Windows Server. Using the PowerCenter Grid option on Windows Server enables distribution of a session or sessions in a workflow to multiple servers and reduces the processing load window. The PowerCenter Grid option requires that the directories for each Integration Service be shared with other servers. This allows Integration Services to share files such as cache files among various session runs. With a Cluster File System (CFS), Integration Services running on various servers can perform concurrent reads and writes to the same block of data.
Last updated: 01-Feb-07 18:54
Recommended Performance Tuning Procedures Challenge To optimize PowerCenter load times by employing a series of performance tuning procedures.
Description When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can help to diagnose problems that may be adversely affecting various components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, you must consider the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases to achieve optimal performance. More often than not, an issue external to PowerCenter is the cause of the performance problem. In order to correctly and scientifically determine the most logical cause of the performance problem, you need to execute the performance tuning steps in a specific order. This enables you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.
1. Perform Benchmarking You should always have a baseline of current load times for a given workflow or session with a similar row count. Perhaps you are not achieving your required load window, or you simply think your processes could run more efficiently based on a comparison with similar tasks that run faster. Use the benchmark to estimate what your desired performance goal should be and tune to that goal. Begin with the problem mapping that you created, along with a session and workflow that use all default settings. This helps to identify which changes have a positive impact on performance.
2. Identify the Performance Bottleneck Area This step helps to narrow down the areas on which to focus further. Follow the areas in the sequence below when attempting to identify the bottleneck: Target, Source, Mapping, Session/Workflow, System. The methodology steps you through a series of tests using PowerCenter to identify trends that point to where to focus next. Remember to go through these tests in a scientific manner: run them multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck may create a different bottleneck. For more information, see Determining Bottlenecks.
3. "Inside" or "Outside" PowerCenter Depending on the results of the bottleneck tests, optimize “inside” or “outside” PowerCenter. Be sure to perform the bottleneck test in the order prescribed in Determining Bottlenecks, since this is also the order in which you should make any performance changes. Problems “outside” PowerCenter refers to anything that indicates the source of the performance problem is external to PowerCenter. The most common performance problems “outside” PowerCenter are source/target database problem, network bottleneck, server, or operating system problem. For source database related bottlenecks, refer to Tuning SQL Overrides and Environment for Better Performance For target database related problems, refer to Performance Tuning Databases - Oracle, SQL Server, or Teradata For operating system problems, refer to Performance Tuning UNIX Systems or Performance Tuning Windows 2000/2003 Systems for more information. Problems “inside” PowerCenter refers to anything that PowerCenter controls, such as actual transformation logic, and PowerCenter Workflow/Session settings. The session settings contain quite a few memory settings and partitioning options that
Refer to Tuning Sessions for Better Performance for more information. Although there are certain procedures to follow to optimize mappings, keep in mind that, in most cases, the mapping design is dictated by business logic; there may be a more efficient way to perform the business logic within the mapping, but you cannot ignore the necessary business logic to improve performance. Refer to Tuning Mappings for Better Performance for more information.
4. Re-Execute the Problem Workflow or Session After you have completed the recommended steps for each relevant performance bottleneck, re-run the problem workflow or session and compare load performance against the benchmark baseline. This step is iterative and should be performed after any performance-based setting is changed. You are trying to answer the question, “Did the performance change have a positive impact?” If so, move on to the next bottleneck. Be sure to prepare detailed documentation at every step along the way so you have a clear record of what was and wasn't tried. While it may seem like there are an enormous number of areas where a performance problem can arise, if you follow the steps for finding the bottleneck(s) and apply the tuning techniques specific to them, you are likely to improve performance and achieve your desired goals.
Last updated: 01-Feb-07 18:54
Tuning and Configuring Data Analyzer and Data Analyzer Reports Challenge A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning Data Analyzer and Data Analyzer reports.
Description Performance tuning of reports occurs both at the environment level and the reporting level. Often report performance can be enhanced by looking closely at the objective of the report rather than the suggested appearance. The following guidelines should help with tuning the environment and the report itself.
1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic, database server load, and application server load. This provides a baseline to measure changes against.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider whether the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill-down to the detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require user-specific filters can often be copied and the filters pre-created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA monitors the database environment. This gives the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves Data Analyzer performance significantly.
5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL" button on the report. Run the query from the report against the database using a client tool on the server where the database resides. One caveat is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath/IPC, Oracle's local database communication protocol) and monitor the database throughout this process. This test may pinpoint whether the bottleneck is occurring on the network or in the database. If, for instance, the query performs well regardless of where it is executed, but the report continues to be slow, this indicates an application server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica does recommend installing Data Analyzer on a dedicated application server.
6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the underperforming reports. Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an aggregate table.
By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database. Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data Analyzer. Each time a calculation must be done in Data Analyzer, it is performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing them from the report and comparing report performance. Consider whether these elements appear in a multitude of reports or simply a few.
7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute it. (DBA assistance may be beneficial here.) If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation; reports are then built based on the view. Note: Editing the report query requires query editing for each report change and may require editing during migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning. The Data Analyzer repository database should be tuned for an OLTP workload.
Tuning Java Virtual Machine (JVM)
JVM Layout The Java Virtual Machine (JVM) heap is the repository for all live objects, dead objects, and free memory. The JVM has the following primary jobs: execute code, manage memory, and remove garbage objects. The size of the JVM heap determines how often and how long garbage collection runs. The JVM parameters can be set in startWebLogic.cmd or startWebLogic.sh if you are using the WebLogic application server.
Parameters of the JVM
1. The -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like Data Analyzer, the values should be set equal to each other.
2. Start with -Xms512m and -Xmx512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage collection (a sketch of these options in the WebLogic start script follows this list).
3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.
4. The NewSize and MaxNewSize parameters control the new generation's minimum and maximum size.
5. -XX:NewRatio=5 divides the old and new generations in the ratio 5:1 (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6 of the heap). When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation. When the old generation fills up, it triggers a major collection, which involves the entire object heap and is more expensive in terms of resources than a minor collection.
6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect large numbers of users; Informatica typically recommends two to three CPUs per JVM.
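As an illustration only, the heap options above might be set in startWebLogic.sh along the following lines. The values are examples to be tuned per installation, the -XX options assume a Sun HotSpot JVM, and the exact variable name depends on the WebLogic release.
# Fragment of startWebLogic.sh (illustrative values only)
JAVA_OPTIONS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5 -verbose:gc"
export JAVA_OPTIONS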
Other Areas to Tune Execute Threads Threads available to process simultaneous operations in WebLogic.
Too few threads means CPUs are under-utilized and jobs wait for threads to become available. Too many threads means the system wastes resources managing threads and the OS performs unnecessary context switching. The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.
Connection Pooling The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it. Initial capacity = 15. Maximum capacity = 15. The sum of connections across all pools should equal the number of execute threads. Setting the initial and maximum pool size to the same value avoids the overhead of growing and shrinking the pool dynamically. Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP-UX, and Linux. Check Enable Native I/O on the server attribute tab.
Application Server-Specific Tuning Details JBoss Application Server
Web Container. Tune the web container configuration file so that it accepts a reasonable number of HTTP requests, as required by the Data Analyzer installation. Ensure that the web container is given an optimal number of threads so that it can accept and process more HTTP requests.
The maxProcessors value should always be greater than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values. If the number of threads is too low, the following message may appear in the log files: ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads
JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships Data Analyzer with pre-compiled JSPs.
EJB Container Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities.
Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the JBoss EJB container configuration.
strictTimeout. If you set a strict maximum pool size, the strictTimeout parameter controls how long a request waits for a pooled instance to become available before timing out.
RMI Pool The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other custom applications, you can optimize the RMI thread pool parameters in the JBoss service configuration.
IBM WebSphere Application Server
Web Container Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following parameters. Minimum Size: Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate. Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50-60 has been determined to be optimal. Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500 ms is considered optimal. Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked; the maximum threads should be hard-limited to the value given in Maximum Size. Note: In a load-balanced environment, there is likely to be more than one server instance, possibly spread across multiple machines. In such a scenario, be sure that the changes are properly propagated to all of the server instances.
Transaction Services Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value of 120 seconds may not be sufficient and should be increased. This parameter can be modified during runtime also.
Diagnostic Trace Services Disable the trace in a production environment. Navigate to “Application Servers > [your_server_instance] > Administration Services > Diagnostic Trace Service” and make sure “Enable Tracing” is not checked.
Debugging Services Ensure that tracing is disabled in a production environment. Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service” and make sure “Startup” is not checked.
Performance Monitoring Services This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the application server after a certain interval; if the server is found to be dead, it then tries to restart the server.
Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy “ and tune the parameters according to a policy determined for each Data Analyzer installation. Note: The parameter “Ping Timeout” determines the time after which a no-response from the server implies that it is faulty. The monitoring service then attempts to kill the server and restart it if “Automatic restart” is checked. Take care that “Ping Timeout” is not set to too small a value.
Process Definitions (JVM Parameters) For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum and maximum heap size be set to the same value. This avoids heap allocation and reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the minimum and maximum heap size to at least 1000MB. Further tuning of the heap size is recommended after carefully studying the garbage collection behavior by turning on the verbosegc option.
The following Java parameters (for IBM JVM 1.4.1) should not be modified from their default values for a Data Analyzer installation:
-Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap compaction results in heap fragmentation. Since Data Analyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.
-Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out compaction, regardless of whether it is useful.
-Xgcthreads. This controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.
-Xclassnogc. This disables garbage collection of class objects.
-Xinitsh. This sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.
You may want to alter the following parameters after carefully examining the application server processes. Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine”:
Verbose garbage collection. Check this option to turn on verbose garbage collection. This can help in understanding the behavior of garbage collection for the application. It has a very low overhead on performance and can be turned on even in the production environment.
Initial heap size. This is the -ms value. Only the numeric value (without MB) needs to be specified. For concurrent usage, the initial heap size should start at 1000 and, depending on the garbage collection behavior, can potentially be increased up to 2000. A value beyond 2000 may actually reduce throughput because the garbage collection cycles take more time to go through the large heap, even though the cycles may occur less frequently.
Maximum heap size. This is the -mx value. It should be equal to the Initial heap size value.
RunHProf. This should remain unchecked in production mode because it slows down the VM considerably.
Debug Mode. This should remain unchecked in production mode because it slows down the VM considerably.
Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).
Performance Monitoring Services Be sure that performance monitoring services are not enabled in a production environment. Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup” is not checked.
Database Connection Pool The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC Provider > Data Sources > IASDataSource > Connection Pools” The various parameters that may need tuning are:
Connection Timeout. The default value of 180 seconds should be good. This implies that after 180 seconds, the request to grab a connection from the pool will time out. After it times out, Data Analyzer throws an exception; in that case, the pool size may need to be increased.
Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50.
Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10.
Reap Time. This specifies the interval at which the pool maintenance thread runs. The maintenance thread should not run too frequently because, while it is running, it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
Unused Timeout. This specifies the time in seconds after which an unused connection is discarded, until the pool size reaches the minimum size. In a highly concurrent usage scenario, this should be a high value; the default of 1800 seconds should be fine.
Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should not be a reason to age out connections. The default is 0 (i.e., connections do not age). If the database or the network connection to the repository database frequently goes down (compared to the life of the application server), this can be used to age out stale connections.
Much like the repository database connection pool, the data source or data warehouse databases also have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a request. The tuning parameters for these dynamic pools are present in the Data Analyzer properties file:
#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20
The various parameters that may need tuning are:
dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.
dynapool.maxCapacity - the maximum number of connections that the data-source pool may grow to.
dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification purposes.
dynapool.waitSec - the maximum amount of time (in seconds) that a client waits to grab a connection from the pool if none is readily available.
dynapool.refreshTestMinutes - determines the frequency at which a health check on the idle connections in the pool is performed. Such checks should not be performed too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.
dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is allowed to remain in the pool.
After this period, the number of connections in the pool decreases (to its initialCapacity). This is done only if allowShrinking is set to true.
Message Listener Services To process scheduled reports, Data Analyzer uses message-driven beans. It is possible to run multiple reports within one schedule in parallel by increasing the number of instances of the MDB catering to the Scheduler (InfScheduleMDB). Take care, however, not to increase this to an arbitrarily high value, since each report consumes considerable resources (e.g., database connections and CPU processing at both the application server and database server levels) and setting it too high may actually be detrimental to the whole system. Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort”. The parameters that can be tuned are: Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not recommend going beyond five. Maximum messages. This should remain one. This implies that each report in a schedule is executed in a separate transaction instead of a batch. Setting it to more than one may have unwanted effects, such as transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.
Plug-in Retry Intervals and Connect Timeouts When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load balancing between the servers in the cluster. The proxy HTTP server sends the request to the plug-in, and the plug-in then routes the request to the proper application server. The plug-in file can be generated automatically by navigating to “Environment > Update web server plugin configuration”. The default plug-in file contains ConnectTimeout=0, which means that it relies on the TCP timeout setting of the server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies that if the server does not respond within the given number of seconds, it is marked as down and the request is sent to the next available member of the cluster. The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down. The default value is 10 seconds; this means that if a cluster member is marked as down, the plug-in does not try to send a request to that member for 10 seconds.
Last updated: 13-Feb-07 17:59
Tuning Mappings for Better Performance Challenge In general, mapping-level optimization takes time to implement, but can significantly boost performance. Sometimes the mapping is the biggest bottleneck in the load process because business rules determine the number and complexity of transformations in a mapping. Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic issues. Mapping tuning techniques fall into two groups. The first group helps almost universally, bringing about a performance increase in nearly all scenarios. The second group may yield only a small performance increase, or may be of significant value, depending on the situation. Some factors to consider when choosing tuning processes at the mapping level include the specific environment, software/hardware limitations, and the number of rows going through a mapping. This Best Practice offers some guidelines for tuning mappings.
Description Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations. For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option. Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.
Consider Single-Pass Reading
If several mappings use the same data source, consider a single-pass reading. If you have several sessions that use the same sources, consolidate the separate mappings with either a single Source Qualifier transformation or one set of Source Qualifier transformations as the data source for the separate data flows. Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that function is called in the session. For example, if you need to subtract a percentage from the PRICE port for both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage before splitting the pipeline.
Optimize SQL Overrides
When SQL overrides are required in a Source Qualifier, Lookup transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which the SQL can be tuned, and how, depends on the underlying source or target database system. See Tuning SQL Overrides and Environment for Better Performance for more information.
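As a simple illustration (the table and column names here are hypothetical), a tuned Source Qualifier override selects only the ports the mapping actually uses and pushes row filtering into the database, rather than reading every column and row and discarding data later:

SELECT cust.customer_id,
       cust.customer_name,
       cust.region_code
FROM customers cust
WHERE cust.status = 'ACTIVE'
  AND cust.last_update_date >= TO_DATE('01-JAN-2007', 'DD-MON-YYYY')

Reviewing the database execution plan for such an override (for example, with EXPLAIN PLAN on Oracle) confirms whether supporting indexes are actually being used.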
Scrutinize Datatype Conversions PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary. In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.
Eliminate Transformation Errors Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log. Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. If you need to run a session that generates a large number of transformation errors, you might improve performance by setting a lower tracing level. However, this is not a long-term response to transformation errors. Any source of errors should be traced and eliminated.
Optimize Lookup Transformations
There are several ways to optimize Lookup transformations that are set up in a mapping.
When to Cache Lookups
Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.
Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches do not page to disk. Memory and cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better Performance.
A better rule of thumb than memory size is to determine the size of the potential lookup cache relative to the number of rows expected to be processed. Consider the following example. In Mapping X, the source and lookup tables contain the following numbers of records:
ITEMS (source): 5000 records
MANUFACTURER: 200 records
DIM_ITEMS: 100000 records

Number of Disk Reads

                          Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                      200                   0
  Read Source Records             5000                5000
  Execute Lookup                     0                5000
  Total # of Disk Reads           5200               10000
LKP_DIM_ITEMS
  Build Cache                   100000                   0
  Read Source Records             5000                5000
  Execute Lookup                     0                5000
  Total # of Disk Reads         105000               10000
Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 total disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario. Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus, the lookup should not be cached.
Use the following eight-step method to determine if a lookup should be cached:
1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a "where" clause on a relational source to load a sample 10,000 rows (see the sample query after this list).
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:
(LS*NRS*CRS)/(CRS-NRS) = X
where X is the breakeven point. If the expected number of source records is less than X, it is better not to cache the lookup. If it is more than X, it is better to cache the lookup.
For example, assume the lookup takes 166 seconds to cache (LS=166), the load runs at 232 rows per second with a cached lookup (CRS=232), and at 147 rows per second with a non-cached lookup (NRS=147). The formula gives: (166*147*232)/(232-147) = 66,603. Thus, if the source has fewer than 66,603 records, the lookup should not be cached; if it has more than 66,603 records, the lookup should be cached.
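For step 2, one way to pull a representative sample is to limit the rows at the source. The statement below uses Oracle ROWNUM against the ITEMS source from the example above; the syntax for other databases differs (for example, FETCH FIRST 10000 ROWS ONLY on DB2):

SELECT *
FROM items
WHERE ROWNUM <= 10000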
Sharing Lookup Caches
There are a number of methods for sharing lookup caches:
- Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to set up the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.
- Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.
- Across different mappings and sessions, the use of a named persistent cache allows sharing an existing cache file.
Reducing the Number of Cached Rows
There is an option to use a SQL override in the creation of a lookup cache. Conditions can be added to the WHERE clause to reduce the set of records included in the resulting cache. Note: If you use a SQL override in a lookup, the lookup must be cached.
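For example (hypothetical table and column names), if the mapping only ever looks up current customer rows, the lookup SQL override can restrict the cache to those rows:

SELECT customer_key,
       customer_id,
       customer_name
FROM dim_customer
WHERE current_flag = 'Y'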
Optimizing the Lookup Condition In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.
Indexing the Lookup Table The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and uncached lookups. In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement. In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.
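As an illustration (hypothetical table and column names), if a cached lookup matches on ITEM_ID and WAREHOUSE_ID, an index covering both columns supports the lookup query as well as the ORDER BY issued when the cache is built:

CREATE INDEX idx_dim_items_lkp
    ON dim_items (item_id, warehouse_id);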
Use a Persistent Lookup Cache for Static Lookups If the lookup source does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup source.
Optimize Filter and Router Transformations Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance. Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition. Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved. Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.
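As a sketch of the idea (hypothetical column name): if a mapping only needs open orders, a Source Qualifier source filter such as the one below keeps unwanted rows from ever entering the pipeline, whereas the same condition placed in a Filter transformation late in the mapping is evaluated only after every row has already been read and partially processed. The source filter is simply a condition that PowerCenter appends to the generated WHERE clause:

ORDER_STATUS = 'OPEN'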
Optimize Aggregator Transformations
Aggregator transformations often slow performance because they must group data before processing it.
- Use simple columns in the group by condition to make the Aggregator transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.
- Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an Aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Input option is usually accompanied by a Source Qualifier that uses the Number of Sorted Ports option (see the sketch after this list).
- Use an Expression and Update Strategy instead of an Aggregator transformation. This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with the Sorted Input option. In the Expression transformation, variable ports are required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is part of the current group or is the beginning of a new group. If the row is part of the current group, its data is used to continue calculating the current group function. An Update Strategy transformation then follows the Expression transformation and sets the first row of a new group to insert and the following rows to update.
- Use incremental aggregation if the changes you capture from the source affect less than half the target. With incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The PowerCenter Server updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.
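For example (hypothetical names), when the pipeline feeds a Sorted Input Aggregator that groups by CUSTOMER_ID and ORDER_DATE, the Source Qualifier override can let the database do the sorting:

SELECT customer_id,
       order_date,
       order_amount
FROM orders
ORDER BY customer_id, order_date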
Joiner Transformation
Joining Data from the Same Source
You can join data from the same source in the following ways:
- Join two branches of the same pipeline.
- Create two instances of the same source and join pipelines from these source instances.
You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping.
When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input. If you want to join unsorted data, you must create two instances of the same source and join the pipelines.
For example, you may have a source with the following ports: Employee, Department, Total Sales. In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:
- Sorter transformation. Sorts the data.
- Sorted Aggregator transformation. Averages the sales data and groups by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass one branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.
- Sorted Joiner transformation. Joins the sorted aggregated data with the original data.
- Filter transformation. Compares the average sales data against the sales data for each employee and filters out employees whose sales are not above their department average.
Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformation. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.
Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.
You can also join same-source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances. Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.
Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:
- Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.
- Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
- Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.
- Join two instances of a source if one pipeline may process much more slowly than the other pipeline.
Performance Tips
- Use the database to do the join when sourcing data from the same database schema. Database systems can usually perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema (see the sketch after this list).
- Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.
- Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.
- For an unsorted Joiner transformation, designate the source with fewer rows as the master source. For optimal performance and disk storage, designate the master source as the source with the fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
- For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source. For optimal performance and disk storage, designate the master source as the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.
- Optimize sorted Joiner transformations with partitions. When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.
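For the first tip above, a Source Qualifier SQL override along these lines (hypothetical schema) lets the database perform the join rather than a Joiner transformation:

SELECT ord.order_id,
       ord.order_date,
       cust.customer_name
FROM orders ord,
     customers cust
WHERE ord.customer_id = cust.customer_id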
Add a hash auto-keys partition upstream of the sort origin To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you must group and sort data. To group data, ensure that rows with the same key value are routed to the same partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort the data ensures that you maintain grouping and sort the data within each group.
Use n:n partitions You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not need to cache all of the master data. This reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline, it must then read the data from disk to compare the master and detail pipelines.
Optimize Sequence Generator Transformations
Sequence Generator transformations need to determine the next available sequence number; increasing the Number of Cached Values property can therefore increase performance. This property determines the number of values the PowerCenter Server caches at one time. If it is set to cache no values, the PowerCenter Server must query the repository each time to determine the next number to be used. Consider configuring the Number of Cached Values to a value greater than 1000. Note that any cached values not used in the course of a session are lost, since the Sequence Generator value stored in the repository is already advanced to the start of the next set of cached values.
Avoid External Procedure Transformations For the most part, making calls to external procedures slows a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures, and Advanced External Procedures.
Field-Level Transformation Optimization
As a final step in the tuning process, you can tune expressions used in transformations. When examining expressions, focus on
complex expressions and try to simplify them when possible. To help isolate slow expressions, do the following:
1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.
Processing field level transformations takes time. If the transformation expressions are complex, then processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.
Factoring Out Common Logic Factoring out common logic can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be performed just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.
Minimize Function Calls
Anytime a function is called, it takes resources to process. There are several common cases where function calls can be reduced or eliminated.
Aggregate function calls can sometimes be reduced. For each aggregate function call, the PowerCenter Server must search and group the data. Thus, the following expression:
SUM(Column A) + SUM(Column B)
can be optimized to:
SUM(Column A + Column B)
In general, operators are faster than functions, so operators should be used whenever possible. For example, an expression that involves a CONCAT function such as:
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)
can be optimized to:
FIRST_NAME || ' ' || LAST_NAME
Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))
can be optimized to:
IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)
The original expression had 8 IIFs, 16 ANDs, and 24 comparisons. The optimized expression results in three IIFs, three comparisons, and two additions.
Be creative in making expressions more efficient. The following is an example of reworking an expression to reduce three comparisons to one:
IIF(X=1 OR X=5 OR X=9, 'yes', 'no')
can be optimized to:
IIF(MOD(X, 4) = 1, 'yes', 'no')
Calculate Once, Use Many Times Avoid calculating or testing the same value multiple times. If the same sub-expression is used several times in a transformation, consider making the sub-expression a local variable. The local variable can be used only within the transformation in which it was created. Calculating the variable only once and then referencing the variable in following sub-expressions improves performance.
Choose Numeric vs. String Operations The PowerCenter Server processes numeric operations faster than string operations. For example, if a lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Optimizing Char-Char and Char-Varchar Comparisons
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To resolve this, set the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end of CHAR source fields.
Use DECODE Instead of LOOKUP
When a LOOKUP function is used, the PowerCenter Server must look up a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself, so the server does not need to query a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.
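As a small sketch (the port name and code values are hypothetical), a DECODE in an Expression transformation can replace a lookup against a tiny, static reference table:

DECODE(STATUS_CODE,
       'A', 'Active',
       'I', 'Inactive',
       'P', 'Pending',
       'Unknown')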
Reduce the Number of Transformations in a Mapping Because there is always overhead involved in moving data among transformations, try, whenever possible, to reduce the number of transformations. Also, resolve unnecessary links between transformations to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.
Use Pre- and Post-Session SQL Commands
You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier transformation and in the Properties tab of the target instance in a mapping. To increase the load speed, use these commands to drop indexes on the target before the session runs, then recreate them when the session completes (see the example after these guidelines). Apply the following guidelines when using SQL statements:
- You can use any command that is valid for the database type. However, the PowerCenter Server does not allow nested comments, even though the database may.
- You can use mapping parameters and variables in SQL executed against the source, but not against the target.
- Use a semi-colon (;) to separate multiple statements.
- The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/. If you need to use a semi-colon outside of quotes or comments, you can escape it with a backslash (\).
- The Workflow Manager does not validate the SQL.
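A minimal illustration, assuming a hypothetical target table SALES_FACT with a non-unique index on SALE_DATE:

Pre-session SQL (target):
DROP INDEX idx_sales_fact_date;

Post-session SQL (target):
CREATE INDEX idx_sales_fact_date ON sales_fact (sale_date);

Whether dropping indexes is a net win depends on the load volume relative to the table size; for small incremental loads, rebuilding the index can cost more than it saves.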
Use Environmental SQL
For relational databases, you can execute SQL commands in the database environment when connecting to the database. You
can use this for source, target, lookup, and stored procedure connections. For instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the guidelines listed above for using the SQL statements.
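For example, environment SQL for an Oracle connection might look like the following (purely illustrative; use whatever settings your environment actually requires):

ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS';
ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED;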
Use Local Variables You can use local variables in Aggregator, Expression, and Rank transformations.
Temporarily Store Data and Simplify Complex Expressions
Rather than parsing and validating the same expression each time, you can define these components as variables. This also allows you to simplify complex expressions. For example, the following expressions:
AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )
can use variables to simplify the expressions and temporarily store data:
Port            Value
V_CONDITION1    JOB_STATUS = 'Full-time'
V_CONDITION2    OFFICE_ID = 1000
AVG_SALARY      AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM_SALARY      SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )
Store Values Across Rows
You can use variables to store data from prior rows. This can help you perform procedural calculations. To compare the previous state to the state just read:
IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )
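A minimal sketch of how the ports might be laid out in an Expression transformation (port names other than STATE are hypothetical). Variable ports are evaluated in the order they appear, so the comparison must be done before the previous value is overwritten:

V_STATE_COUNTER  (variable) = IIF( V_PREVIOUS_STATE = STATE, V_STATE_COUNTER + 1, 1 )
V_PREVIOUS_STATE (variable) = STATE
O_STATE_COUNTER  (output)   = V_STATE_COUNTER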
Capture Values from Stored Procedures Variables also provide a way to capture multiple columns of return values from stored procedures.
Last updated: 13-Feb-07 17:43
Tuning Sessions for Better Performance

Challenge
Running sessions is where the pedal hits the metal. A common misconception is that this is the area where most tuning should occur. While it is true that various session options can be modified to improve performance, PowerCenter 8 also offers the Enterprise Grid Option and pushdown optimization, which can improve performance tremendously.
Description
Once you have optimized the source database, target database, and mapping, you can focus on optimizing the session. The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter, and Lookup transformations (with caching enabled) use caches. The PowerCenter Server uses index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The transformation_readfromdisk or transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation. Index and data caches should both be sized according to the requirements of the individual lookup. The sizing can be done using the estimation tools provided in the Transformation Guide, or through observation of actual cache sizes in the session caching directory.
The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of transformation] [generated session instance id number] _ [transformation instance id number] _ [partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0 arrangement, which gives maximum performance with no redundancy, is recommended for volatile cache file directories (i.e., no persistent caches).
If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, the RAM allocated needs to be available on the server. If the server does not have available RAM and uses paged memory, your session is again accessing the hard disk. In this case, it is more efficient to allow PowerCenter to page the data rather than the operating system. Adding additional memory to the server is, of course, the best solution. Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes.
The PowerCenter Server writes to the index and data cache files during a session in the following cases:
- The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.
- The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.
- The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.
- The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.
When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.
Configuring Automatic Memory Settings
PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you can configure the Integration Service to automatically calculate cache memory settings at run time. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. The values stored in the data and index caches depend upon the requirements of the transformation. For example, the Aggregator index cache stores group values as configured in the group by ports, and the data cache stores calculations based on the group by ports. When the Integration Service processes a Sorter transformation or writes data to an XML target, it also creates a cache.
Configuring Session Cache Memory
The Integration Service can determine cache memory requirements for the Lookup, Aggregator, Rank, Joiner, Sorter, and XML target caches. You can configure auto for the index and data cache sizes in the transformation properties or on the Mapping tab of the session properties.
Max Memory Limits
Configuring maximum memory limits allows you to ensure that you reserve a designated amount or percentage of memory for other processes. You can configure the memory limit as a numeric value or as a percent of total memory. Because available memory varies, the Integration Service bases the percentage value on the total memory on the Integration Service process machine.
For example, you configure automatic caching for three Lookup transformations in a session, and you configure a maximum memory limit of 500MB for the session. When you run the session, the Integration Service divides the 500MB of allocated memory among the index and data caches for the Lookup transformations. When you configure a maximum memory value, the Integration Service divides memory among transformation caches based on the transformation type. When you configure both a numeric value and a percentage, the Integration Service compares the values and uses the lower value as the maximum memory limit.
When you configure automatic memory settings, the Integration Service specifies a minimum memory allocation for the index and data caches. The Integration Service allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for each transformation instance. If you configure a maximum memory limit that is less than the minimum value for an index or data cache, the Integration Service overrides this value. For example, if you configure a maximum memory value of 500 bytes for a session containing a Lookup transformation, the Integration Service overrides the automatic memory settings and uses the default values.
When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory Attributes, the Integration Service divides the allocated memory among all the nodes in the grid. When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes, the Integration Service allocates the specified percentage of memory on each node in the grid.
Aggregator Caches
Keep the following items in mind when configuring the aggregate memory cache sizes:
- Allocate at least enough space to hold at least one row in each aggregate group.
- Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses Session Process memory to process an Aggregator transformation with sorted ports, not cache memory.
- Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.
- When configuring the Aggregator data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.
Joiner Caches When a session is run with a Joiner transformation, the PowerCenter Server reads from master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data. The number of rows the PowerCenter Server stores in the cache depends on the partitioning scheme, the data in the master source, and whether or not you use sorted input. After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.
Lookup Caches
Several options can be explored when dealing with Lookup transformation caches.
- Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session with a persistent cache lookup is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt. You can also delete the cache files before the session run to force the session to rebuild the caches.
- Lookup caching should be enabled for relatively small tables. Refer to the Best Practice Tuning Mappings for Better Performance to determine when lookups should be cached.
- When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured not to cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can usually increase session performance.
- Just as with a Joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.
Allocating Buffer Memory
The Integration Service can determine the memory requirements for the following buffer memory settings:
- DTM Buffer Size
- Default Buffer Block Size
You can also configure DTM buffer size and the default buffer block size in the session properties.
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks. If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.
Increasing the DTM Buffer Pool Size The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns. If a session's performance details show low numbers for your source and target BufferInput_efficiency and
BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance. Using DTM buffer memory allocation generally causes performance to improve initially and then level off. (Conversely, it may have no impact on source or target-bottlenecked sessions at all and may not have an impact on DTM bottlenecked sessions). When the DTM buffer memory allocation is increased, you need to evaluate the total memory available on the PowerCenter Server. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. You can increase the DTM buffer size in the Performance settings of the Properties tab.
Running Workflows and Sessions Concurrently The PowerCenter Server can process multiple sessions in parallel and can also process multiple partitions of a pipeline within a session. If you have a symmetric multi-processing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data. This provides improved performance since true parallelism is achieved. On a single processor platform, these tasks share the CPU, so there is no parallelism. To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available.
Partitioning Sessions
Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently.
When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:
- Location of partition points. The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you have PowerCenter partitioning available, you can define other partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase the performance considerably.
- Number of partitions. By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system. You can also overload source and target systems, so that is another consideration.
- Partition types. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:
1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each partition receives approximately the same number of rows.
2. Hash keys. The PowerCenter Server uses a hash function to group rows of data among partitions. The Server groups the data based on a partition key. There are two types of hash partitioning:
   - Hash auto-keys. The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.
   - Hash user keys. The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.
3. Key range. The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.
4. Pass-through. The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.
5. Database partitioning. You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.
If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure your session to have two or more partitions to make your session utilize more of the hardware. Use the following tips when you add partitions to a session:
- Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
- Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.
- Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the Number of Cached Values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.
- Partition the source data evenly. Configure each partition to extract the same number of rows, or redistribute the data among partitions early using a partition point with round-robin partitioning. This is actually a good way to prevent hammering of the source system: you could have a session with multiple partitions where one partition returns all the data and the override SQL in the other partitions is set to return zero rows (where 1 = 2 in the where clause prevents any rows being returned). Some source systems react better to multiple concurrent SQL queries; others prefer smaller numbers of queries.
- Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then performance may improve for this session by adding a partition.
- Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, then check the system for hardware bottlenecks. Otherwise, check the database configuration.
- Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.
Increasing the Target Commit Interval One method of resolving target database bottlenecks is to increase the commit interval. Each time the target database commits, performance slows. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve. When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate the larger number of rows. One of the major reasons that Informatica set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (i.e., you may get a database error like "unable to extend rollback segments" in Oracle).
Disabling High Precision
If a session runs with high precision enabled, disabling high precision may improve session performance. The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure it so that the PowerCenter Server recognizes this datatype by selecting Enable High Precision in the session property sheet. However, since reading and manipulating a high-precision datatype (i.e., one with a precision of greater than 28) can slow the PowerCenter Server down, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to using a datatype of Double.
Reducing Error Tracking If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount
of data the PowerCenter Server writes to the session log. To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if Terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focusing your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility.
As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to verbose initialization or verbose data. Verbose initialization logs initialization details in addition to normal tracing, the names of index and data files used, and detailed transformation statistics. Verbose data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation. However, the verbose initialization and verbose data logging options significantly affect session performance. Do not use verbose tracing options except when testing sessions, and always remember to switch tracing back to Normal after the testing is complete. The session tracing level overrides any transformation-specific tracing levels within the mapping; PowerCenter uses the mapping tracing level only when the session tracing level is set to none.
Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors.
Pushdown Optimization You can push transformation logic to the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration. When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database. Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.
Source-Side Pushdown Optimization Sessions
In source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target until it reaches a downstream transformation that cannot be pushed to the database. The Integration Service generates a SELECT statement based on the transformation logic up to the transformation it can push to the database. The Integration Service pushes all transformation logic that is valid to push to the database by executing the generated SQL statement at run time. Then, it reads the results of this SQL statement and continues to run the session. Similarly, if the Source Qualifier contains a SQL override, the Integration Service creates a view for the override, generates a SELECT statement against that view, and runs the SELECT statement. When the session completes, the Integration Service drops the view from the database.
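Purely as an illustration of the idea (the actual SQL depends entirely on the mapping, the database type, and what the Integration Service determines it can push; the table and column names here are hypothetical), a source-side pushdown of a Filter and an Aggregator might effectively collapse that logic into a single generated statement such as:

SELECT customer_id,
       SUM(order_amount)
FROM orders
WHERE order_status = 'SHIPPED'
GROUP BY customer_id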
Target-Side Pushdown Optimization Sessions When you run a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.
Full Pushdown Optimization Sessions
To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping from source to target and analyzes each transformation in the pipeline until it analyzes the target. It generates and executes the SQL on sources and targets. When you run a session with full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when you generate a long transaction:
- A long transaction uses more database resources.
- A long transaction locks the database for longer periods of time, and thereby reduces database concurrency and increases the likelihood of deadlock.
- A long transaction increases the likelihood that an unexpected event may occur.
For example, the Rank transformation cannot be pushed to the database. If you configure the session for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source, pushes the Expression transformation and target to the target database, and processes the Rank transformation itself. The Integration Service does not fail the session if it can push only part of the transformation logic to the database and the session is configured for full optimization.
Using a Grid
You can use a grid to increase session and workflow performance. A grid is an alias assigned to a group of nodes that allows you to automate the distribution of workflows and sessions across nodes. When you use a grid, the Integration Service distributes workflow tasks and session threads across multiple nodes. Running workflows and sessions on the nodes of a grid provides the following performance gains:
- Balances the Integration Service workload.
- Processes concurrent sessions faster.
- Processes partitions faster.
When you run a session on a grid, you improve scalability and performance by distributing session threads to multiple DTM processes running on nodes in the grid. To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.
Running a Session on Grid
When you run a session on a grid, the master service process runs the workflow and workflow tasks, including the Scheduler. Because it runs on the master service process node, the Scheduler uses the date and time of the master service process node to start scheduled workflows. The Load Balancer distributes Command tasks as it does when you run a workflow on a grid. In addition, when the Load Balancer dispatches a Session task, it distributes the session threads to separate DTM processes. The master service process starts a temporary preparer DTM process that fetches the session and prepares it to run. After the preparer DTM process prepares the session, it acts as the master DTM process, which monitors the DTM processes running on other nodes. The worker service processes start the worker DTM processes on other nodes. The worker DTM runs the session. Multiple worker DTM processes running on a node might be running multiple sessions or multiple partition groups from a single session, depending on the session configuration. For example, suppose you run a workflow on a grid that contains one Session task and one Command task, and you also configure the session to run on the grid. When the Integration Service process runs the session on a grid, it performs the following tasks:
- On Node 1, the master service process runs workflow tasks. It also starts a temporary preparer DTM process, which becomes the master DTM process. The Load Balancer dispatches the Command task and session threads to nodes in the grid.
- On Node 2, the worker service process runs the Command task and starts the worker DTM processes that run the session threads.
- On Node 3, the worker service process starts the worker DTM processes that run the session threads.
For information about configuring and managing a grid, refer to the PowerCenter Administrator Guide and to the best practice PowerCenter Enterprise Grid Option. For information about how the DTM distributes session threads into partition groups, see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.
Last updated: 06-Dec-07 15:20
Tuning SQL Overrides and Environment for Better Performance
Challenge
Tuning SQL Overrides and SQL queries within the source qualifier objects can improve performance in selecting data from source database tables, which positively impacts the overall session performance. This Best Practice explores ways to optimize a SQL query within the source qualifier object. The tips here can be applied to any PowerCenter mapping. While the SQL discussed here is executed in Oracle 8 and above, the techniques are generally applicable; specifics for other RDBMS products (e.g., SQL Server, Sybase, etc.) are not included.
Description
SQL Queries Performing Data Extractions
Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, the indexes on the tables in the query, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.
DB2 Coalesce and Oracle NVL
When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used. Here is an example of the Oracle NVL function:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM exp.exp_bio_result bio, sar.sar_data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log WHERE load_status = 'P')
Here is the same query in DB2:
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log WHERE load_status = 'P')
Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views
In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation. You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain: the logic is now in two places, in an Informatica mapping and in a database view.
You can also use in-line views, which are SELECT statements in the FROM or WHERE clause. This can help focus the query on a subset of data in the table and work more efficiently than using a traditional join. Here is an example of an in-line view in the FROM clause:
SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,
N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,
N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,
N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
FROM DOSE_REGIMEN N,
(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID
 FROM EXPERIMENT_PARAMETER R, NEW_GROUP_TMP TMP
 WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID
 AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID) X
WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID
ORDER BY N.DOSE_REGIMEN_ID
Surmounting the Single SQL Statement Limitation in DB2: Using Common Table Expression Temp Tables and the WITH Clause
The Common Table Expression (CTE) stores data in temp tables during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query by specifying the query name. For example:
WITH maxseq AS
(SELECT MAX(seq_no) as seq_no
 FROM data_load_log
 WHERE load_status = 'P')
SELECT DISTINCT bio.experiment_group_id, bio.database_site_code
FROM bio_result bio, data_load_log log, maxseq
WHERE bio.update_date BETWEEN log.start_time AND log.end_time
AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')
AND log.seq_no = maxseq.seq_no
Here is another example using a WITH clause that uses recursive SQL; a level counter column is carried through the recursion so that it can be limited:
WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS
(SELECT PERSON_ID, NAME, PARENT_ID, 1
 FROM PARENT_CHILD
 WHERE NAME IN ('FRED', 'SALLY', 'JIM')
 UNION ALL
 SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1
 FROM PARENT_CHILD C, PERSON_TEMP RECURS
 WHERE C.PARENT_ID = RECURS.PERSON_ID
 AND RECURS.LVL < 5)
SELECT * FROM PERSON_TEMP
The PARENT_ID in any particular row refers to the PERSON_ID of the parent (the example models only a single parent per person, but it illustrates the idea). The level counter (LVL) prevents infinite recursion.
CASE (DB2) vs. DECODE (Oracle)
The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single test, since it was the only legal way to test a condition in earlier versions. DECODE is not allowed in DB2. Because DECODE tests only for equality, range tests are typically written with the SIGN function. In Oracle:
SELECT EMPLOYEE, FNAME, LNAME,
DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',
  DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID', 'THE REST OF US')) AS SALARY_COMMENT
FROM EMPLOYEE
In DB2:
SELECT EMPLOYEE, FNAME, LNAME,
CASE
  WHEN SALARY < 10000 THEN 'NEED RAISE'
  WHEN SALARY > 1000000 THEN 'OVERPAID'
  ELSE 'THE REST OF US'
END AS SALARY_COMMENT
FROM EMPLOYEE
Debugging Tip: Obtaining a Sample Subset
It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The logic can be commented out or removed after the query is put into general use. DB2 uses the FETCH FIRST n ROWS ONLY clause to do this, as follows:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
FETCH FIRST 12 ROWS ONLY
Oracle does it this way, using the ROWNUM pseudocolumn:
SELECT EMPLOYEE, FNAME, LNAME
FROM EMPLOYEE
WHERE JOB_TITLE = 'WORKERBEE'
AND ROWNUM <= 12
INTERSECT, INTERSECT ALL, UNION, UNION ALL Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL return all rows.
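For a quick, hypothetical illustration of the difference (EMPLOYEES and CONTRACTORS are placeholder tables), the two queries below can return different row counts whenever the same LNAME appears in both tables or more than once in either table:
-- Returns each distinct LNAME exactly once
SELECT LNAME FROM EMPLOYEES
UNION
SELECT LNAME FROM CONTRACTORS

-- Returns every row from both tables, duplicates included;
-- usually cheaper because no sort/duplicate elimination is required
SELECT LNAME FROM EMPLOYEES
UNION ALL
SELECT LNAME FROM CONTRACTORS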
System Dates in Oracle and DB2
Oracle uses the system variable SYSDATE for the current time and date, and allows you to display the time and/or the date however you want with date functions. Here is an example that returns yesterday's date in Oracle (default format mm/dd/yyyy):
SELECT TRUNC(SYSDATE) - 1 FROM DUAL
DB2 uses the system variables, here called special registers, CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP. Here is an example for DB2:
SELECT FNAME, LNAME, CURRENT DATE AS TODAY
FROM EMPLOYEE
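To control the display format explicitly in Oracle, SYSDATE can be wrapped in TO_CHAR; the format mask below is just one possibility:
SELECT TO_CHAR(SYSDATE, 'MM/DD/YYYY HH24:MI:SS') AS RUN_TIME
FROM DUAL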
Oracle: Using Hints
Hints affect the way a query or sub-query is executed and can therefore provide a significant performance increase in queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer control over the execution. Hints are honored unless execution is not possible. Because the database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and access method hints are the most common. In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer possible. It was in rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle, however, the use of the proper INDEX hints should help performance. The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The lists below provide a partial set of optimizer hints and descriptions.
Optimizer hints: Choosing the best join method
Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts while the nested loop involves no sorts. The hash join also requires memory to build the hash table. Hash joins are most effective when the amount of data is large and one table is much larger than the other. Here is an example of a select that performs best as a hash join:
SELECT COUNT(*)
FROM CUSTOMERS C, MANAGERS M
WHERE C.CUST_ID = M.MANAGER_ID
Use the following considerations to choose a join type:

Consideration                                        Join Type
Better throughput                                    Sort/Merge
Better response time                                 Nested loop
Large subsets of data                                Sort/Merge
Index available to support join                      Nested loop
Limited memory and CPU available for sorting         Nested loop
Parallel execution                                   Sort/Merge or Hash
Joining all or most of the rows of large tables      Sort/Merge or Hash
Joining small sub-sets of data and index available   Nested loop

A partial list of optimizer hints:
ALL_ROWS: The database engine creates an execution plan that optimizes for throughput. Favors full table scans; the optimizer favors Sort/Merge joins.
FIRST_ROWS: The database engine creates an execution plan that optimizes for response time, returning the first row of data as quickly as possible. Favors index lookups; the optimizer favors Nested-loop joins.
CHOOSE: The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.
RULE: The database engine creates an execution plan based on a fixed set of rules.
USE_NL: Use nested loop joins.
USE_MERGE: Use sort merge joins.
HASH: The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.
Access method hints
Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or ROWID scans. A partial list of access method hints:
ROWID: The database engine performs a scan of the table based on ROWIDs.
INDEX: The database engine performs an index scan of a specific table. Do not use in Oracle 9.2 and above; in 9.2 and above, the optimizer does not use any indexes other than those mentioned in the hint.
USE_CONCAT: The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.
The syntax for using a hint in a SQL statement is as follows:
SELECT /*+ FIRST_ROWS */ empno, ename FROM emp;
SELECT /*+ USE_CONCAT */ empno, ename FROM emp;
SQL Execution and Explain Plan
The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution, it relies upon optimization of the Oracle parameters and updated database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time. The developer can determine which type of execution is being used by running an explain plan on the SQL query in question. Note that the step in the explain plan that is indented the most is the statement that is executed first; the results of that statement are then used as input by the next-level statement. Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible, since full table scans cause degradation in performance.
Information provided by the Explain Plan can be enhanced using the SQL Trace Utility. This utility provides the following additional information:
- The number of executions
- The elapsed time of the statement execution
- The CPU time used to execute the statement
The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and it can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
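As a minimal sketch of generating an explain plan in Oracle (this assumes the standard PLAN_TABLE exists and that the DBMS_XPLAN package is available, as it is in Oracle 9i and later; the query itself is a placeholder):
EXPLAIN PLAN FOR
SELECT e.empno, e.ename
FROM emp e
WHERE e.deptno = 10;

-- Display the plan that was just generated
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- SQL Trace can be enabled for the current session before running the statement
ALTER SESSION SET SQL_TRACE = TRUE;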
Using Indexes The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it by using an access method hint, as described earlier.
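For illustration, an index can be suggested to the optimizer with an access method hint such as the one below; EMP and EMP_DEPTNO_IDX are placeholder table and index names, and, as noted earlier, INDEX hints should be used with care on Oracle 9.2 and later:
SELECT /*+ INDEX(e EMP_DEPTNO_IDX) */ e.empno, e.ename
FROM emp e
WHERE e.deptno = 10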
Reviewing SQL Logic The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme cases, the entire SQL statement may need to be re-written to become more efficient.
Reviewing SQL Syntax
SQL syntax can also have a great impact on query performance. Certain operators can slow performance. For example, EXISTS clauses are almost always used in correlated sub-queries: they are executed for each row of the parent query and cannot take advantage of indexes, while the IN clause is executed once, does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:
-- Faster: IN version
SELECT * FROM DEPARTMENTS
WHERE DEPT_ID IN (SELECT DISTINCT DEPT_ID FROM MANAGERS)
-- Slower: EXISTS version
SELECT * FROM DEPARTMENTS D
WHERE EXISTS (SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)
The following list compares EXISTS and IN in various situations:
- Index supports sub-query: EXISTS Yes; IN Yes.
- No index to support sub-query: EXISTS Yes; IN No.
- Sub-query returns many rows: EXISTS performs a table scan per parent row; IN performs a table scan once.
- Sub-query returns one or a few rows: EXISTS probably not; IN Yes.
- Most of the sub-query rows are eliminated by the parent query: EXISTS Yes; IN Yes.
- Index in parent that matches sub-query columns: EXISTS possibly not, since EXISTS cannot use the index; IN Yes, since IN uses the index.
Where possible, use the EXISTS clause instead of the INTERSECT clause; simply modifying the query in this way can improve performance by more than 100 percent (see the example below). Where possible, limit the use of outer joins on tables: remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.
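As a sketch of the INTERSECT-to-EXISTS rewrite (CUSTOMERS and EMPLOYEES are placeholder tables; verify that the rewrite returns the same result set for your data, since INTERSECT also removes duplicates and matches NULLs while a correlated EXISTS does not):
-- INTERSECT version
SELECT NAME_ID FROM CUSTOMERS
INTERSECT
SELECT NAME_ID FROM EMPLOYEES

-- EXISTS version, often faster
SELECT DISTINCT C.NAME_ID
FROM CUSTOMERS C
WHERE EXISTS (SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)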
Choosing the Best Join Order
- Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the incremental ETL load.
- Always put the small table column on the right side of the join.
- Use the driving table first in the WHERE clause, and work from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
- Outer joins limit the join order that the optimizer can use; do not use them needlessly.
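A small, hypothetical sketch of these guidelines (STG_ORDER_IDS stands in for a small staging table of incremental IDs; ORDERS and CUSTOMERS are placeholder tables):
SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
FROM STG_ORDER_IDS S,   -- smallest (driving) table listed first
     ORDERS O,
     CUSTOMERS C
WHERE O.ORDER_ID = S.ORDER_ID   -- drive from the staging table outward;
  AND C.CUST_ID = O.CUST_ID     -- small table column on the right side of the join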
Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN
Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.
SELECT NAME_ID
FROM CUSTOMERS
WHERE NAME_ID NOT IN (SELECT NAME_ID FROM EMPLOYEES)
Avoid use of the NOT EXISTS clause. This clause is better than NOT IN, but may still cause a full table scan.
SELECT C.NAME_ID
FROM CUSTOMERS C
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES E WHERE C.NAME_ID = E.NAME_ID)
In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.
SELECT C.NAME_ID FROM CUSTOMERS C
MINUS
SELECT E.NAME_ID FROM EMPLOYEES E
Also consider using outer joins with IS NULL conditions for anti-joins.
SELECT C.NAME_ID
FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID (+)
AND E.NAME_ID IS NULL
Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses, as they may change based on the database engine.
In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders.
Avoid range lookups. This is a SELECT that uses a BETWEEN in the WHERE clause with limit values retrieved from a table. Here is an example:
SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.GCW_LOAD_DATE
FROM CDS_SUPPLIER R,
(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,
        L.LOAD_DATE AS LOAD_DATE
 FROM ETL_AUDIT_LOG L
 WHERE L.LOAD_DATE_PREV IN
   (SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
    FROM ETL_AUDIT_LOG Y)) Z
WHERE R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE
The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds. Here is the improved SQL:
SELECT R.BATCH_TRACKING_NO,
R.SUPPLIER_DESC,
R.SUPPLIER_REG_NO,
R.SUPPLIER_REF_CODE,
R.LOAD_DATE
FROM
/* In-line view for lower limit */
(SELECT R1.BATCH_TRACKING_NO,
R1.SUPPLIER_DESC,
R1.SUPPLIER_REG_NO,
R1.SUPPLIER_REF_CODE,
R1.LOAD_DATE
FROM CDS_SUPPLIER R1,
(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV
 FROM ETL_AUDIT_LOG Y) Z
WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV
ORDER BY R1.LOAD_DATE) R,
/* end in-line view for lower limit */
(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE
 FROM ETL_AUDIT_LOG D) A /* upper limit */
WHERE R.LOAD_DATE <= A.LOAD_DATE
Tuning System Architecture
Use the following steps to improve the performance of any system:
1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider reducing the number of measurements because performance monitoring itself uses system resources. Otherwise, continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a time. If there are no options left at any level, this indicates that the system has reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.
System Resources
The PowerCenter Server uses the following system resources:
- CPU
- Load Manager shared memory
- DTM buffer memory
- Cache memory
When tuning the system, evaluate the following considerations during the implementation process.
- Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.
- Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
- When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can potentially slow session performance.
- Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, as well as the PowerCenter Server and repository machines can slow session performance.
- When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or sessions have many partitions.
- In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server.
- In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see the project system administrator and Sun Solaris documentation.
- In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see the project system administrator and HP-UX documentation.
- In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see the project system administrator and AIX documentation.
Database Performance Features
Nearly everything is a trade-off in the physical database implementation. Work with the DBA to determine which of the many available alternatives is the best implementation choice for the particular database. The project team must have a thorough understanding of the data, the database, and the desired use of the database by the end-user community prior to beginning the physical implementation process. Evaluate the following considerations during the implementation process.
- Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and also eliminating join tables.
- Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, rebuilding them after the load using post-session scripts (see the sketch following this list).
- Constraints. Avoid constraints if possible and try to enforce integrity by incorporating that additional logic in the mappings.
- Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATEs (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.
- OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to determine after the fact. DBAs must work with the System Administrator to ensure all the database processes have the same priority.
- Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.
- Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk controllers.
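As a minimal sketch of the pre- and post-session index handling mentioned in the Indexes item above (the table and index names are placeholders, and the exact DDL, including any partitioning or parallel options, depends on your database and tables):
-- Pre-session SQL: drop the index before the bulk load
DROP INDEX IDX_SALES_FACT_DATE;

-- Post-session SQL: rebuild the index after the load completes
CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (SALE_DATE);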
Last updated: 13-Feb-07 17:47
Using Metadata Manager Console to Tune the XConnects
Challenge
Improving the efficiency and reducing the run-time of your XConnects through the parameter settings of the Metadata Manager console.
Description
Remember that the minimum system requirements for a machine hosting the Metadata Manager console are:
- Windows operating system (2000, NT 4.0 SP 6a)
- 400MB disk space
- 128MB RAM (256MB recommended)
- 133 MHz processor
If the system meets or exceeds the minimum requirements, but an XConnect is still taking an inordinately long time to run, use the following steps to try to improve its performance.
To improve performance of your XConnect loads from database catalogs:
- Modify the inclusion/exclusion schema list (if the number of schemas to be loaded is larger than the number to be excluded, use the exclusion list).
- Carefully examine how many old objects the project needs by default. Modify the "sysdate -5000" setting to a smaller value to reduce the result set.
To improve performance of your XConnect loads from the PowerCenter repository:
- Load only the production folders that are needed for a particular project.
- Run the XConnects with just one folder at a time, or select the list of folders for a particular run.
Last updated: 01-Feb-07 18:54
Advanced Client Configuration Options Challenge Setting the Registry to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.
Description Ensuring Consistent Data Source Names To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine. Solution: From Repository Manager, choose Export Registry from the Tools drop-down menu. For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.
Resolving Missing or Invalid License Keys The “missing or invalid license key” error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than Administrator. This problem also occurs when the client software tools are installed under the Administrator account, and a user with a nonadministrator ID subsequently attempts to run the tools. The user who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the PowerCenter Client tools. Instead, the software displays the message indicating that the license key is missing or invalid. Solution: While logged in as the installation user with administrator authority, use regedt32 to edit the registry. Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/. From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and PowerMart Client tools.)
Changing the Session Log Editor In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the workflow monitor. Then browse for the editor that you want on the General tab. For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it. Solution: While logged in as the installation user with administrator authority, use regedt32 to go into the registry. Move to registry path location: HKEY_CURRENT_USER Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar, select View Tree and Data. Select the Log File Editor entry by double clicking on it. Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.exe). Select Registry --> Exit from the menu bar to save the entry. For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor.
The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for workflow and session logs.
Adding a New Command Under Tools Menu Other tools, in addition to the PowerCenter client tools, are often needed during development and testing. For example, you may need a tool such as Enterprise manager (SQL Server) or Toad (Oracle) to query the database. You can add shortcuts to executable programs from any client tool’s ‘Tools’ drop-down menu to provide quick access to these programs. Solution: Choose ‘Customize’ under the Tools menu and add a new item. Once it is added, browse to find the executable it is going to call (as shown below).
When this is done once, you can easily call another program from your PowerCenter client tools. In the following example, TOAD can be called quickly from the Repository Manager tool.
Changing Target Load Type In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type ‘bulk’, although this was not necessarily what was desired and could cause the session to fail under certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow Manager to choose the default load type to be either 'bulk' or 'normal'. Solution: In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab. Click the button for either 'normal' or 'bulk', as desired. Click OK, then close and open the Workflow Manager tool. After this, every time a session is created, the target load type for all relational targets will default to your choice.
Resolving Undocked Explorer Windows The Repository Navigator window sometimes becomes undocked. Docking it again can be frustrating because double clicking on the window header does not put it back in place.
Solution: To get the Window correctly docked, right-click in the white space of the Navigator window. Make sure that ‘Allow Docking’ option is checked. If it is checked, double-click on the title bar of the Navigator Window.
Resolving Client Tool Window Display Issues
If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears, try the following solutions to recover it:
- Clicking View > Navigator
- Toggling the menu bar
- Uninstalling and reinstalling Client tools
Note: If none of the above solutions resolve the problem, you may want to try the following solution using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause serious problems that may require reinstalling the operating system. Informatica does not guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the Registry Editor at your own risk.
Solution: Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can often be resolved as follows:
1. Close the client tool.
2. Go to Start > Run and type "regedit".
3. Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z, where x.y.z is the version and maintenance release level of the PowerCenter client as follows:
PowerCenter Version    Folder Name
7.1                    7.1
7.1.1                  7.1.1
7.1.2                  7.1.1
7.1.3                  7.1.1
7.1.4                  7.1.1
8.1                    8.1
4. Open the key of the affected tool (for the Repository Manager, open Repository Manager Options).
5. Export all of the Toolbars sub-folders and rename them.
6. Re-open the client tool.
Enhancing the Look of the Client Tools The PowerCenter client tools allow you to customize the look and feel of the display. Here are a few examples of what you can do.
Designer From the Menu bar, select Tools > Options. In the dialog box, choose the Format tab. Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts). Changing the background workspace colors can help identify which workspace is currently open. For example, changing the Source Analyzer workspace color to green or the Target Designer workspace to purple to match their respective metadata definitions helps to identify the workspace. Alternatively, click the Select Theme button to choose a color theme, which displays background colors based on predefined themes.
Workflow Manager You can modify the Workflow Manager using the same approach as the Designer tool. From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or customize each element individually.
Workflow Monitor You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors for one task to give it a dimensional appearance; this can be helpful in distinguishing between running tasks, succeeded tasks, etc. To modify the Gantt chart appearance, go to the Menu bar and select Tools > Options and Gantt Chart.
Using Macros in Data Stencil
Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you cannot run the Data Stencil macros. To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium. When you start Data Stencil, Visio displays a security warning about viruses in macros. Click Enable Macros to enable the macros for Data Stencil.
Last updated: 19-Mar-08 19:00
Advanced Server Configuration Options Challenge Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings; using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.
Description Configuring Advanced Integration Service Properties Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties > Edit. The following Advanced properties are included:
Limit on Resilience Timeouts (Optional): Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.
Resilience Timeout (Optional): Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.
Configuring Integration Service Process Variables
One configuration best practice is to properly configure and leverage the Integration Service (IS) process variables. The benefits include:
- Ease of deployment across environments (DEV > TEST > PRD).
- Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.
All the variables are related to directory paths used by a given Integration Service. You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.
Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.
State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption. All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.
By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for
each Integration Service process. You must specify the directory path for each type of file. You specify the following directories using service process variables: Each registered server has its own set of variables. The list is fixed, not user-extensible.
Service Process Variable      Value
$PMRootDir                    (no default – user must insert a path)
$PMSessionLogDir              $PMRootDir/SessLogs
$PMBadFileDir                 $PMRootDir/BadFiles
$PMCacheDir                   $PMRootDir/Cache
$PMTargetFileDir              $PMRootDir/TargetFiles
$PMSourceFileDir              $PMRootDir/SourceFiles
$PMExtProcDir                 $PMRootDir/ExtProc
$PMTempDir                    $PMRootDir/Temp
$PMSuccessEmailUser           (no default – user must insert a value)
$PMFailureEmailUser           (no default – user must insert a value)
$PMSessionLogCount            0
$PMSessionErrorThreshold      0
$PMWorkflowLogCount           0
$PMWorkflowLogDir             $PMRootDir/WorkflowLogs
$PMLookupFileDir              $PMRootDir/LkpFiles
$PMStorageDir                 $PMRootDir/Storage
Writing PowerCenter 8 Service Logs to Files
Starting with PowerCenter 8, all the logging for the services and sessions uses the Log service and can only be viewed through the PowerCenter Administration Console. However, it is still possible to get this information logged to a file, similar to previous versions. To write all Integration Service logs (session, workflow, server, etc.) to files:
1. Log in to the Admin Console.
2. Select the Integration Service.
3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.
Integration Service Custom Properties (undocumented server parameters) can be entered here as well:
1. At the bottom of the list, enter the Name and Value of the custom property.
2. Click OK.
Adjusting Semaphore Settings on UNIX Platforms When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server. Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers. The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:
HP/UX: Use sam (1M) to change the parameters. Solaris: Use admintool or edit /etc/system to change the parameters. AIX: Use smit to change the parameters.
Setting Shared Memory and Semaphore Parameters on UNIX Platforms Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after configuring the UNIX kernel.
HP-UX
For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT. Use the HP System V IPC Shared-Memory Subsystem to update parameters. To change a value, perform the following steps:
1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.
The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.
IBM AIX None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.
SUN Solaris
Keep the following points in mind when configuring and tuning the SUN Solaris platform. These settings are valid for the Solaris 8 and 9 releases and have become obsolete in the Solaris 10 release, where most of the shared memory and semaphore parameters have been removed (with the exception of shmmax and shmmni, which Solaris 10 does use). Other parameters are set dynamically within the Solaris 10 IPC model. The Solaris 10 IPC resource management framework was designed to overcome the shortcomings of older versions; several parameters were converted to be dynamically resized and the defaults were increased.
1. Edit the /etc/system file and add the following variables to increase shared memory segments:
set shmsys:shminfo_shmmax=value (Maximum size of a shared memory segment. Solaris 10 sets this to 0.25 of physical memory by default versus 512k in previous versions)
set shmsys:shminfo_shmmin=value; default is 1 byte (Smallest possible shared memory segment size)
set shmsys:shminfo_shmmni=value (Maximum number of shared memory identifiers at any given time. For Solaris 10, the default is 128 and the maximum is MAXINT)
set shmsys:shminfo_shmseg=value (Maximum number of segments per process)
set semsys:seminfo_semmni=1 to 65535; default 10 (Specifies the maximum number of semaphore identifiers)
set semsys:seminfo_semmns=1 to MAXINT; default 60 (Specifies the maximum number of System V semaphores on the system)
set semsys:seminfo_semmsl=1 to MAXINT; default 25 (Specifies the maximum number of System V semaphores per semaphore identifier)
set semsys:seminfo_semmnu=1 to MAXINT; default 30 (Total number of undo structures supported by the System V semaphore system)
set semsys:seminfo_semume=1 to MAXINT; default 10 (Maximum number of System V semaphore undo structures that can be used by any one process)
2. Verify the shared memory value changes:
# grep shmsys /etc/system
3. Restart the system:
# init 6
Red Hat Linux
The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a restart. For example, to allow 128MB, type the following command:
$ echo 134217728 >/proc/sys/kernel/shmmax
You can put this command into a script run at startup. Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar to the following:
kernel.shmmax = 134217728
This file is usually processed at startup, but sysctl can also be called explicitly later. To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.
SuSE Linux
The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a restart. For example, to allow 512MB, type the following commands:
#sets shmall and shmmax shared memory
echo 536870912 >/proc/sys/kernel/shmall #Sets shmall to 512 MB
echo 536870912 >/proc/sys/kernel/shmmax #Sets shmmax to 512 MB
You can also put these commands into a script run at startup. Also change the settings for the system memory user limits by modifying the file /etc/profile. Add lines similar to the following:
#sets user limits (ulimit) for system memory resources
ulimit -v 512000 #set virtual (swap) memory to 512 MB
ulimit -m 512000 #set physical memory to 512 MB
Configuring Automatic Memory Settings
With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer memory and cache memory settings, consider the overall memory usage for best performance. Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the Integration Service disables automatic memory settings and uses default values.
Last updated: 22-Feb-10 14:47
Causes and Analysis of UNIX Core Files Challenge This Best Practice explains what UNIX core files are and why they are created, and offers some tips on analyzing them.
Description Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a "core dump" file is also created, which can be used to analyze the reason for the abnormal termination.
What is a Core File and What Causes it to be Created? UNIX operating systems may terminate a process before its normal, expected exit for several reasons. These reasons are typically for bad behavior by the program, and include attempts to execute illegal or incorrect machine instructions, attempts to allocate memory outside the memory space allocated to the program, attempts to write to memory marked read-only by the operating system, and other similar incorrect low-level operations. Most of these bad behaviors are caused by errors in programming logic in the program. UNIX may also terminate a process for some reasons that are not caused by programming errors. The main examples of this type of termination are when a process exceeds its CPU time limit, and when a process exceeds its memory limit. When UNIX terminates a process in this way, it normally writes an image of the processes memory to disk in a single file. These files are called "core files", and are intended to be used by a programmer to help determine the cause of the failure. Depending on the UNIX version, the name of the file may be "core", or in more recent UNIX versions, "core.nnnn" where nnnn is the UNIX process ID of the process that was terminated. Core files are not created for "normal" runtime errors such as incorrect file permissions, lack of disk space, inability to open a file or network connection, and other errors that a program is expected to detect and handle. However, under certain error conditions a program may not handle the error conditions correctly and may follow a path of execution that causes the OS to terminate it and cause a core dump. Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an odbc driver library from one vendor and an odbc driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process is using libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.
Core File Locations and Size Limits A core file is written to the current working directory of the process that was terminated. For PowerCenter, this is always the directory the services were started from. For other applications, this may not be true. UNIX also implements a per user resource limit on the maximum size of core files. This is controlled by the ulimit command. If the limit is 0, then core files will not be created. If the limit is less than the total memory size of the process, a partial core file will be written. Refer to the Best Practice Understanding and Setting UNIX Resources for PowerCenter Installations .
Analyzing Core Files Core files provide valuable insight into the state and condition the process was in just before it was terminated. It also contains the history or log of routines that the process went through before that fateful function call; this log is known as the stack trace. There is little information in a core file that is relevant to an end user; most of the contents of a core file are only relevant to a developer, or someone who understands the internals of the program that generated the core file. However, there are a few things that an end user can do with a core file in the way of initial analysis. The most important aspect of analyzing a core file is the task of extracting this stack trace out of the core dump. Debuggers are the tools that help retrieve this stack trace and other vital information out of the core. Informatica recommends using the pmstack utility.
The first step is to save the core file under a new name so that it is not overwritten by a later crash of the same application. One option is to append a timestamp to the core, but it can be renamed to anything:
mv core core.ddmmyyhhmi
The second step is to log in with the same UNIX user id that started up the process that crashed. This sets the debugger's environment to be the same as that of the process at startup time.
The third step is to go to the directory where the program is installed and run the "file" command on the core file, which returns the name of the process that created the core file:
file core.ddmmyyhhmi
Using the pmstack Utility
Informatica provides a 'pmstack' utility that can automatically analyze a core file. If the core file is from PowerCenter, it generates a complete stack trace from the core file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves are normally not useful on a system other than the one where they were generated.
The pmstack utility can be downloaded from the Informatica Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com. Once downloaded, run pmstack with the -c option, followed by the name of the core file:
$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info : -rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support
You can then look at the generated trace file or send it to support. Pmstack also supports a -p option, which can be used to extract a stack trace from a running process. This is sometimes useful, if the process appears to be hung, to determine what the process is doing.
Last updated: 19-Mar-08 19:01
Domain Configuration Challenge The domain architecture in PowerCenter simplifies the administration of disparate PowerCenter services across the enterprise as well as the maintenance of security throughout PowerCenter. It allows for the grouping of previously separately administered application services and nodes into logically-grouped folders within the domain, based on administrative ownership. It is vital when installing or upgrading PowerCenter, that the Application Administrator understand the terminology and architecture surrounding the Domain Configuration in order to effectively administer, upgrade, deploy, and maintain PowerCenter Services throughout the enterprise.
Description The domain architecture allows PowerCenter to provide a service-oriented architecture where you can specify which services are running on which node or physical machine from one central location. The components in the domain are ‘aware’ of each other’s presence and continually monitor one another via ‘heartbeats’. The various services within the domain can move from one physical machine to another without any interruption to the PowerCenter environment. As long as clients can connect to the domain, the domain can route their needs to the appropriate physical machine. From a monitoring perspective, the domain provides the ability to monitor all services in the domain as well as control security from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment – instead a single screen displays the current availability state of all services. For more details on the individual components and detailed configuration of a domain, refer to the PowerCenter Administrator Guide.
Key Domain Components
There are several key domain components to consider during installation and setup:
- Master Gateway – The node designated as the master gateway or domain controller is the main 'entry point' to the domain. This server, or set of servers, should be your most reliable and available machine in the architecture. It is the first point of entry for all clients wishing to connect to one of the PowerCenter services. If the master gateway is unavailable, the entire domain is unavailable. You may designate more than one node to run the gateway service. One gateway is always the master or primary, but by having the gateway service running on more than one node in a multi-node configuration, your domain can continue to function if the master gateway is no longer available. In a high-availability environment, it is critical to have one or more nodes running the gateway service as a backup to the master gateway.
- Shared File System – The PowerCenter domain architecture provides centralized logging capability and, when high availability is enabled, a highly available environment with automatic fail-over of workflows and sessions. In order to achieve this, the base PowerCenter server file directories must reside on a file system that is accessible by all nodes in the domain. When PowerCenter is initially installed, this directory is called infa_shared and is located under the server directory of the PowerCenter installation. It includes logs and checkpoint information that is shared among nodes of the domain. Ideally, this file system is both high-performance and highly available.
- Domain Metadata – As of PowerCenter 8, a store of metadata exists to hold all of the configuration settings for the domain. This domain repository is separate from the one or more PowerCenter repositories in a domain. Instead, it is a handful of tables that replace the older version 7.x pmserver.cfg, pmrep.cfg, and other PowerCenter configuration information. As of PowerCenter 8.5, all PowerCenter security is also maintained here. Upon installation you will be prompted for the RDBMS location for the domain repository. This information should be treated like a PowerCenter repository, with regularly-scheduled backups and a disaster recovery plan. Without this metadata, a domain is unable to function. The RDBMS user provided to PowerCenter requires permissions to create and drop tables, as well as insert, update, and delete records (see the sketch following this list). Ideally, if you are going to be grouping multiple independent nodes within this domain, the domain configuration database should reside on a separate and independent server so as to eliminate a single point of failure if the node hosting the domain configuration database fails.
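A minimal, Oracle-style sketch of provisioning such a database account is shown below; the user name, password, and tablespace are placeholders, other databases use different syntax, and the exact privileges required should be confirmed against the installation documentation for your PowerCenter version:
-- Create a dedicated account for the domain configuration repository
CREATE USER INFA_DOMAIN IDENTIFIED BY change_me
  DEFAULT TABLESPACE USERS
  QUOTA UNLIMITED ON USERS;

-- Allow the account to connect and to create/drop its own tables;
-- insert, update, and delete on its own tables are implicit for the owner
GRANT CREATE SESSION, CREATE TABLE TO INFA_DOMAIN;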
Domain Architecture
Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility and scalability across the environment. There is no single best way to deploy the architecture. Rather, each environment should be assessed for external factors and then PowerCenter should be configured to function best in that particular environment. The advantage of the service-oriented architecture is that components in the architecture (i.e., repository services, integration services, and others) can be moved among nodes without needing to make changes to the mappings or workflows. Starting in PowerCenter 8.5, all reporting components of PowerCenter (Data Analyzer and Metadata Manager) are also configured and administered from the domain. Because of this architecture, it is very simple to alter architecture components if you find a suboptimal configuration and want to change it in your environment. The key here is that you are not tied to any choices you make at installation time and have the flexibility to make changes to your architecture as your business needs change.
Tip: While the architecture is very flexible and provides easy movement of services throughout the environment, an item to consider carefully at installation time is the name of the domain and its nodes. These are somewhat troublesome to change later because of their criticality to the domain. It is not recommended that you embed server IP addresses or names in the domain name or the node names. You never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain ‘PowerCenter_11.5.8.20’, consider naming it ‘Enterprise_Dev_Test’. This makes it more intuitive to understand which domain you are attaching to, and if you ever decide to move the main gateway to another server, you do not need to change the domain or node name. While these names can be changed, the change is not easy and requires using command line programs to alter the domain metadata.
In the next sections, we look at a couple of sample domain configurations.
Single Node Domain
Even in a single server/single node installation, you must still create a domain. In this case, all domain components reside on a single physical machine (i.e., node). You can have any number of PowerCenter services running in this domain. It is important to note that with PowerCenter 8 and beyond, you can run multiple integration services at the same time on the same machine, even in an NT/Windows environment. Naturally, this configuration exposes a single point of failure for every component in the domain, and high availability is not possible in this situation.
Multiple Node Domains Domains can continue to expand to meet the demands of true enterprise-wide data integration.
INFORMATICA CONFIDENTIAL
BEST PRACTICES
725 of 818
Domain Architecture for Production/Development/Quality Assurance Environments
The architecture picture becomes more complex when you consider a typical development environment, which usually includes some level of a Development, Quality Assurance, and Production environment. In most implementations, these are separate PowerCenter repositories and associated servers. It is possible to define a single domain to include one or more of these environments. However, there are a few points to consider:
If the domain gateway is unavailable for any reason, the entire domain is inaccessible. Keep in mind that if you place your development, quality assurance, and production services in a single domain, you have the possibility of affecting your production environment with development and quality assurance work. If you decide to restart the domain in Development for some reason, you are effectively restarting development, quality assurance, and production at the same time. Also, if you experience some sort of failure that affects the domain in production, you have also brought down your development environment and have no place to test a fix for the problem, since your entire environment is compromised.
For the domain you should have a common, shared, high-performance file system to hold the centralized logging and checkpoint files. If you have all three environments together in one domain, you are mixing production logs with development logs and other files on the same physical disk. Your production backups and disaster recovery files will contain more than just production information.
For a future upgrade, it is very likely that you will need to upgrade all components of the domain at once to the new version of PowerCenter. If you have placed development, quality assurance, and production in the same domain, you may need to upgrade all of it at once. This is an undesirable situation in most data integration environments.
For these reasons, Informatica generally recommends having at least two separate domains in any environment:
Production Domain
Development/Quality Assurance Domain
Some architects choose to deploy a separate domain for each environment to further isolate them and to ensure no disruptions occur in the Quality Assurance environment due to any changes in the development environment. The tradeoff is an additional administration console to log into and maintain. One thing to keep in mind is that while you may have separate domains with separate domain metadata repositories, there is no need to migrate any of the metadata from the separate domain repositories between development, Quality Assurance, and production. The domain metadata repositories collect information based on the physical location and connectivity of the components, and thus it makes no sense to migrate between environments. You do need to provide separate database locations for each, but there are no migration needs for the data within; each one is specific to the environment it services.
Administration
The domain administrator has access to start and shut down all services within the domain, as well as the ability to create other users and delegate roles and responsibilities to them. Keep in mind that if the domain is shut down, it has to be restarted via the command line or the host operating system GUI. PowerCenter's High Availability option provides the ability to create multiple gateway nodes in a domain, such that if the Master Gateway Node fails, another can assume its responsibilities, including authentication, logging, and service management.
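For example, a minimal sketch of restarting the Informatica services from the command line on a UNIX node; the script location under the install directory and the infaservice.sh name are assumptions for a typical 8.x install and should be verified for your version (on Windows the equivalent is the Informatica Services entry in the Services control panel):
#!/bin/ksh
# Illustrative only: stop and start the Service Manager on this node.
# INFA_HOME and the tomcat/bin path are assumptions; verify for your install.
cd $INFA_HOME/tomcat/bin
./infaservice.sh shutdown
./infaservice.sh startup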
Security and Folders
Much like traditional repository security, security in the domain interface is set up on a “per-folder” basis, with owners designated per logical grouping of objects/services in the domain. One of the major differences is that domain security allows the creation of subfolders to segment nodes and services as desired. There are many considerations when deciding on a folder structure, keeping in mind that this logical administrative interface should be accessible to Informatica Administrators only and not to users and groups associated with a developer role (which are designated at the repository level).
New legislation in the United States and Europe, such as Basel II and the Public Company Accounting Reform and Investor Protection Act of 2002 (also known as SOX, SarbOx and Sarbanes-Oxley), has been widely interpreted to place many restrictions on the ability of persons in development roles to have direct write access to production systems; consequently, administration roles should be planned accordingly. An organization may simply need to use different folders to group objects into Development, Quality Assurance, and Production roles, each with separate administrators. In some instances, systems may need to be entirely separate, with different domains for the Development, Quality Assurance, and Production systems. Sharing of metadata remains simple between separate domains, with PowerCenter’s ability to “link” domains and copy data between linked domains.
For Data Migration projects, it is recommended to establish a standardized architecture that includes a set of folders, connections, and developer access in accordance with the needs of the project. Typically this includes folders for:
Acquiring data
Converting data to match the target system
The final load to the target application
Establishing reference data structures
When configuring security in PowerCenter 8.5, there are two interrelated security aspects that should be addressed when planning a PowerCenter security policy:
Role Differentiation – Groups should be created separately to define the roles and privileges typically needed for an Informatica Administrator and for an Informatica Developer. Using this separation at the group level allows for more efficient administration of PowerCenter user privileges and provides for a more secure PowerCenter environment.
Maintenance of Privileges – As privileges typically are the same for several users within a PowerCenter environment, care should be taken to define these distinct separations ahead of time, so that privileges can be defined at a group level rather than at an individual user level. As a best practice, users should not be granted user-specific privileges unless it is temporary.
Maintenance As part of a regular backup of metadata, a recurring backup should be scheduled for the PowerCenter domain configuration database metadata. This can be accomplished through PowerCenter by using the infasetup command, further explained in the Command Line Reference. The schema should also be added to the normal RDBMS backup schedule, thus providing two reliable backup methods for disaster recovery purposes.
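A minimal sketch of such a scheduled backup on a UNIX node follows; the infasetup BackupDomain options, paths, and connection values shown are illustrative and should be verified against the Command Line Reference for your version:
#!/bin/ksh
# Illustrative nightly backup of the domain configuration metadata.
# All paths, option names, and connection values are assumptions; verify
# against the Command Line Reference before use.
BACKUP_DIR=/app/infa/domain_backups
STAMP=`date +%Y%m%d`
cd $INFA_HOME/server/bin
./infasetup.sh BackupDomain -da dbhost:1521 -du infadom -dp password \
    -dt Oracle -ds domaindb -bf $BACKUP_DIR/domain_backup_$STAMP -f
if [ $? -ne 0 ]
then
    echo "Domain configuration backup failed on `uname -n`"
    exit 1
fi
exit 0
A cron entry can then run this script outside the batch window and ahead of the regular RDBMS backup of the same schema.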
Licensing
As part of PowerCenter 8.5’s new Service-Oriented Architecture (SOA), licensing for PowerCenter services is centralized within the domain. License key file(s) are received from Informatica at the same time the download location for the software is provided. Adding license object(s) and assigning individual PowerCenter services to the license(s) is the method used to enable a PowerCenter service. This can be done during install, or initial/incremental license keys can be added after install via the Administration Console web-based utility (or the infacmd command line utility).
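As an illustrative sketch only, adding a license from a key file and assigning a service to it might look like the following; the command names, option flags, and all domain, user, license, and service names are assumptions to be verified against the Command Line Reference:
# Illustrative only: add a license object from a key file, then assign a service to it.
infacmd AddLicense -dn Enterprise_Dev_Test -un Administrator -pd password \
    -ln PC_Standard_License -lf /app/infa/license/PowerCenter861.key
infacmd AssignLicense -dn Enterprise_Dev_Test -un Administrator -pd password \
    -ln PC_Standard_License -sn INT_SVC_DEV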
Last updated: 26-May-08 17:36
Managing Repository Size
Challenge
The PowerCenter repository grows over time as new development and production runs occur, and it can eventually reach a size that slows repository performance or makes backups increasingly difficult. This Best Practice discusses methods to manage the size of the repository. The release of PowerCenter version 8.x added several features that aid in managing the repository size. Although the repository is slightly larger with version 8.x than it was with previous versions, the client tools have increased functionality to limit the dependency on the size of the repository. PowerCenter versions earlier than 8.x require more administration to keep repository sizes manageable.
Description
Why should we manage the size of the repository? Repository size affects the following:
DB backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.
Overall query time of the repository, which slows performance of the repository over time. Analyzing tables on a regular basis can aid in repository table performance.
Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a long period of time. Some options are available to avoid transferring all run statistics when migrating.
A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and grows to upwards of 1GB for a large repository. The type of information stored in the repository includes:
Versions
Objects
Run statistics
Scheduling information
Variables
Tips for Managing Repository Size
Versions and Objects
Delete old versions or purged objects from the repository. Use repository queries in the client tools to generate reusable queries that can identify out-of-date versions and objects for removal. Use the Query Browser to run object queries on both versioned and non-versioned repositories. Old versions and objects not only increase the size of the repository, but also make it more difficult to manage further into the development cycle. Cleaning up the folders makes it easier to determine what is valid and what is not. One way to keep repository size small is to use shortcuts, by creating shared folders if you are using the same source/target definitions or reusable transformations in multiple folders.
Folders Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of the repository backups. These folders should not be a part of production but they may exist in development or test repositories.
Run Statistics Remove old run statistics from the repository if you no longer need them. History is important to determine trending, scaling, and performance tuning needs but you can always generate reports based on the PowerCenter Metadata Reporter and save reports of the data you need. To remove the run statistics, go to Repository Manager and truncate the logs based on the dates.
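Where a scripted approach is preferred over the Repository Manager client, the pmrep truncatelog command can be used. A hedged sketch follows; the option syntax is from memory and should be verified against the Command Line Reference, and the repository, domain, user, and date values are placeholders:
# Illustrative only: connect to the repository and truncate workflow and
# session logs older than the given end time.
pmrep connect -r PROD_REP -d Enterprise_Prod -n Administrator -x password
pmrep truncatelog -t "01/01/2007 00:00:00"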
Recommendations
Informatica strongly recommends upgrading to the latest version of PowerCenter, since the most recent release includes backup options such as skip workflow and session logs, skip deployment group history, skip MX data, and so forth. The repository size in version 8.x and above is larger than in previous versions of PowerCenter, but the added size does not significantly affect the performance of the repository. It is still advisable to analyze the tables and keep database statistics current to optimize the tables. Informatica does not recommend directly querying the repository tables or performing deletes on them. Use the client tools unless otherwise advised by Informatica technical support personnel.
Last updated: 01-Feb-07 18:54
Organizing and Maintaining Parameter Files & Variables
Challenge
Organizing variables and parameters in parameter files and maintaining parameter files for ease of use.
Description
Parameter files are a means of providing run time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple workflows, sessions, and mappings, and can be created with a text editor such as Notepad or vi, or generated by a shell script or an Informatica mapping. Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using the Workflow Manager.
Parameter File Contents
A parameter file contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), it is advisable to build a parameter file to contain values for a single workflow or a logical group of workflows for ease of administration. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.
Types of Parameters and Variables
A parameter file contains the following types of parameters and variables:
Service Variable. Defines a service variable for an Integration Service.
Service Process Variable. Defines a service process variable for an Integration Service that runs on a specific node.
Workflow Variable. References values and records information in a workflow. For example, use a workflow variable in a decision task to determine whether the previous task ran properly.
Worklet Variable. References values and records information in a worklet. You can use predefined worklet variables in a parent workflow, but cannot use workflow variables from the parent workflow in a worklet.
Session Parameter. Defines a value that can change from session to session, such as a database connection or file name.
Mapping Parameter. Defines a value that remains constant throughout a session, such as a state sales tax rate.
Mapping Variable. Defines a value that can change during the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time the session runs.
Configuring Resources with Parameter File If a session uses a parameter file, it must run on a node that has access to the file. You create a resource for the parameter file and make it available to one or more nodes. When you configure the session, you assign the parameter file resource as a required resource. The Load Balancer dispatches the Session task to a node that has the parameter file resource. If no node has the parameter file resource available, the session fails.
Configuring Pushdown Optimization with Parameter File
Depending on the database workload, you may want to use source-side, target-side, or full pushdown optimization at different times. For example, you may want to use partial pushdown optimization during the database's peak hours and full pushdown optimization when activity is low. Use the $$PushdownConfig mapping parameter to use different pushdown optimization configurations at different times. The parameter lets you run the same session using the different types of pushdown optimization. When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute and define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in the parameter file:
None. The Integration Service processes all transformation logic for the session.
Source. The Integration Service pushes part of the transformation logic to the source database.
Source with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database.
Target. The Integration Service pushes part of the transformation logic to the target database.
Full. The Integration Service pushes all transformation logic to the database.
Full with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database. The Integration Service pushes any remaining transformation logic to the target database.
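For example, a parameter file entry for a peak-hours run might look like the following (the folder, workflow, and session names are illustrative):
[MY_FOLDER.WF:wf_daily_load.ST:s_m_load_orders]
$$PushdownConfig=Source with View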
Parameter File Name Informatica recommends giving the Parameter File the same name as the workflow with a suffix of “.par”. This helps in identifying and linking the parameter file to a workflow.
Parameter File: Order of Precedence While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a file specified at the workflow level always supersedes files specified at session levels.
Parameter File Location
Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files. Place the parameter files in a directory that can be accessed using a server variable. This helps to move the sessions and workflows to a different server without modifying workflow or session properties. You can override the location and name of the parameter file specified in the session or workflow when executing workflows via the pmcmd command. The following points apply to both parameter and variable files; however, they are more relevant to parameters and parameter files, and are therefore detailed accordingly.
Multiple Parameter Files for a Workflow
To run a workflow with different sets of parameter values during every run:
1. Create multiple parameter files with unique names.
2. Change the parameter file name (to match the parameter file name defined in the session or workflow properties). You can do this manually or by using a pre-session shell (or batch) script.
3. Run the workflow.
Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3, as sketched below.
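A hedged sketch of the command line alternative follows; the domain, service, folder, workflow, and file names are placeholders, and the option syntax should be verified against the Command Line Reference:
pmcmd startworkflow -sv INT_SVC_DEV -d Enterprise_Dev_Test -u Administrator -p password \
    -f PROJ_DP -paramfile /app/data/param/wf_Client_Data_run2.par Client_Data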
Generating Parameter Files Based on requirements, you can obtain the values for certain parameters from relational tables or generate them programmatically. In such cases, the parameter files can be generated dynamically using shell (or batch scripts) or using Informatica mappings and sessions. Consider a case where a session has to be executed only on specific dates (e.g., the last working day of every month), which are listed in a table. You can create the parameter file containing the next run date (extracted from the table) in more than one way.
Method 1:
1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file.
3. Use a shell (or batch) script to create the parameter file. Use an SQL query to extract a single date greater than the system date (today) from the table and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.
The following figure shows the use of a shell script to generate a parameter file.
The following figure shows a generated parameter file.
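A minimal sketch of such a script follows, assuming an Oracle source reachable through sqlplus and a workflow named wf_monthly_load in folder FIN; all object names, credentials, parameter names, and paths are illustrative:
#!/bin/ksh
# Illustrative only: build a parameter file containing the next run date,
# then start the workflow with pmcmd.
PARAMFILE=/app/data/param/wf_monthly_load.par
NEXT_DATE=$(sqlplus -s etl_user/password@proddb <<EOF
set heading off feedback off pagesize 0
SELECT TO_CHAR(MIN(run_date),'MM/DD/YYYY') FROM run_calendar WHERE run_date > SYSDATE;
EOF
)
echo "[FIN.WF:wf_monthly_load]"        >  $PARAMFILE
echo "\$\$NEXT_RUN_DATE=$NEXT_DATE"    >> $PARAMFILE
pmcmd startworkflow -sv INT_SVC_PROD -d Enterprise_Prod -u Administrator -p password \
    -f FIN -paramfile $PARAMFILE wf_monthly_load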
Method 2:
1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow, using either a command task calling a shell script or a session task that uses a mapping. This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Scheduler to run daily (as shown in the following figure).
Parameter File Templates
In some cases the parameter values change between runs, but the change can be incorporated into the parameter files programmatically, so there is no need to maintain separate parameter files for each run. Consider, for example, a service provider who gets the source data for each client from flat files located in client-specific directories and writes processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory, and the directory names have the client id as part of the directory structure (e.g., /app/data/Client_ID/).
You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one parameter file per client. However, the number of parameter files may become cumbersome to manage as the number of clients increases. In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file (for a specific client), replacing the placeholders with actual values, and then execute the workflow using pmcmd.
[PROJ_DP.WF:Client_Data]
$InputFile_1=/app/data/Client_ID/input/client_info.dat
$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log
Using a script, replace "Client_ID" and "curdate" with actual values before executing the workflow.
The following text is an excerpt from a parameter file that contains service variables for one Integration Service and parameters for four workflows:
[Service:IntSvs_01]
[email protected]
[email protected]
[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]
$$platform=unix
[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]
$$platform=unix
$DBConnection_ora=qasrvrk2_hp817
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]
$$DT_WL_lvl_1=02/01/2005 01:05:11
$$Double_WL_lvl_1=2.2
[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]
$$DT_WL_lvl_2=03/01/2005 01:01:01
$$Int_WL_lvl_2=3
$$String_WL_lvl_2=ccccc
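Returning to the client-specific template above, a minimal sketch of the run-time substitution follows, assuming ksh, sed, and pmcmd are available on the node; the template path, connection details, and the way the client id is passed in are illustrative:
#!/bin/ksh
# Illustrative only: expand a parameter file template for one client and
# run the workflow.
CLIENT_ID=$1
CURDATE=$(date +%Y%m%d)
TEMPLATE=/app/data/templates/wf_Client_Data.par.template
PARAMFILE=/app/data/$CLIENT_ID/param/wf_Client_Data.par
sed -e "s/Client_ID/$CLIENT_ID/g" -e "s/curdate/$CURDATE/g" $TEMPLATE > $PARAMFILE
pmcmd startworkflow -sv INT_SVC_PROD -d Enterprise_Prod -u Administrator -p password \
    -f PROJ_DP -paramfile $PARAMFILE Client_Data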
Use Case 1: Fiscal Calendar-Based Processing
Some financial and retail industries use a fiscal calendar for accounting purposes. Use mapping parameters to process the correct fiscal period. For example, create a calendar table in the database with the mapping between the Gregorian calendar and the fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates. Create another mapping with the logic to create a parameter file, and run the parameter file creation session before running the main session. The calendar table can be joined directly with the main table, but the performance may not be good in some databases, depending upon how the indexes are defined. Using a parameter file avoids this index issue and can result in better performance.
Use Case 2: Incremental Data Extraction
Mapping parameters and variables can be used to extract inserted/updated data since the previous extract. Use the mapping parameters or variables in the source qualifier to determine the beginning timestamp and the end timestamp for extraction. For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the timestamp of the last row the Integration Service read in the previous session. Use this variable for the beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source filter. Use the following filter to incrementally extract data from the database:
LOAN.record_update_timestamp > TO_DATE(‘$$PREVIOUS_RUN_DATE_TIME’) and LOAN.record_update_timestamp <= TO_DATE(‘$$$SessStartTime’)
Use Case 3: Multi-Purpose Mapping
Mapping parameters can be used to extract data from different tables using a single mapping. In some cases the table name is the only difference between extracts. For example, there are two similar extracts from the tables FUTURE_ISSUER and EQUITY_ISSUER; the column names and data types within the tables are the same. Use a mapping parameter $$TABLE_NAME in the source qualifier SQL override and create two parameter files, one for each table name. Run the workflow using the pmcmd command with the corresponding parameter file, or create two sessions, each with its corresponding parameter file.
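For example, the two parameter files might differ only in the table name; the folder, workflow, and session names below are illustrative:
future_issuer.par:
[FIN.WF:wf_extract_issuer.ST:s_m_extract_issuer]
$$TABLE_NAME=FUTURE_ISSUER
equity_issuer.par:
[FIN.WF:wf_extract_issuer.ST:s_m_extract_issuer]
$$TABLE_NAME=EQUITY_ISSUER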
Use Case 4: Using Workflow Variables
You can create variables within a workflow. When you create a variable in a workflow, it is valid only in that workflow; use the variable in tasks within that workflow. You can edit and delete user-defined workflow variables.
Use user-defined variables when you need to make a workflow decision based on criteria you specify. For example, you create a workflow to load data to an orders database nightly. You also need to load a subset of this data to headquarters periodically, every tenth time you update the local orders database. Create separate sessions to update the local database and the one at headquarters, and use a user-defined variable to determine when to run the session that updates the orders database at headquarters.
To configure user-defined workflow variables, set up the workflow as follows:
Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow has run.
Add a Start task and both sessions to the workflow.
Place a Decision task after the session that updates the local orders database. Set up the decision condition to check whether the number of workflow runs is evenly divisible by 10, using the modulus (MOD) function.
Create an Assignment task to increment the $$WorkflowCount variable by one.
Link the Decision task to the session that updates the database at headquarters when the decision condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.
When you configure workflow variables using these conditions, the session that updates the local database runs every time the workflow runs, and the session that updates the database at headquarters runs every tenth time the workflow runs.
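A brief sketch of the two expressions involved, using the variable name from the example above (the exact syntax should be validated in the Workflow Manager expression editor):
Decision task condition:     MOD($$WorkflowCount, 10) = 0
Assignment task expression:  $$WorkflowCount = $$WorkflowCount + 1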
Last updated: 09-Feb-07 16:20
Platform Sizing
Challenge
Determining the appropriate platform size to support the PowerCenter environment based on customer infrastructure and requirements.
Description
The main factors that affect the sizing estimate are the input parameters that are based on the requirements and the constraints imposed by the existing infrastructure and budget. Other important factors include the choice of the Grid/High Availability option, future growth estimates, and real-time versus batch load requirements. The required platform size to support PowerCenter depends upon each customer’s unique infrastructure and processing requirements: the Integration Service allocates resources for individual extraction, transformation and load (ETL) jobs or sessions, and each session has its own resource requirement. The resources required for the Integration Service depend upon the number of sessions, the complexity of each session (i.e., what it does while moving data), and how many sessions run concurrently. This Best Practice discusses the relevant questions pertinent to estimating the platform requirements.
TIP An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle. A common mistake is to size the servers before any ETL is designed or developed, and in many cases these platforms are too small for the resulting system. Thus, it is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.
Environment Questions
To determine platform size, consider the following questions regarding your environment:
What sources do you plan to access?
How do you currently access those sources?
Have you decided on the target environment (e.g., database, hardware, operating system)? If so, what is it?
Have you decided on the PowerCenter environment (e.g., hardware, operating system, 32/64-bit processing)?
Is it possible for the PowerCenter services to be on the same server as the target?
How do you plan to access your information (e.g., cube, ad-hoc query tool) and what tools will you use to do this?
What other applications or services, if any, run on the PowerCenter server?
What are the latency requirements for the PowerCenter loads?
PowerCenter Sizing Questions
To determine server size, consider the following questions:
Is the overall ETL task currently being performed? If so, how is it being done, and how long does it take?
What is the total volume of data to move?
What is the largest table (i.e., bytes and rows)? Is there any key on this table that can be used to partition load sessions, if needed?
How often does the refresh occur? Will refresh be scheduled at a certain time, or driven by external events? Is there a "modified" timestamp on the source table rows?
What is the batch window available for the load?
Are you doing a load of detail data, aggregations, or both? If you are doing aggregations, what is the ratio of source/target rows for the largest result set? How large is the result set (bytes and rows)?
The answers to these questions provide an approximate guide to the factors that affect PowerCenter's resource requirements. To simplify the analysis, focus on the large jobs that drive the resource requirement.
PowerCenter Resource Consumption The following sections summarize some recommendations for PowerCenter resource consumption.
Processor
1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.
Note: a virtual CPU is typically counted as 0.75 of a physical CPU. For example, 4 CPUs with 4 cores each (16 cores) could be counted as 12 virtual CPUs.
Memory
20 to 30MB of memory for the Integration Service for session coordination.
20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 2GB per session.
Caches for aggregation, lookups, or joins use additional memory:
Lookup tables are cached in full; the memory consumed depends on the size of the tables and the selected data ports.
Aggregate caches store the individual groups; more memory is used if there are more groups. Sorting the input to aggregations greatly reduces the need for memory.
Joins cache the master table in a join; memory consumed depends on the size of the master.
Full pushdown optimization uses far fewer resources on the PowerCenter server than partial (source/target) pushdown optimization.
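As an illustrative calculation only (the session count and cache sizes are assumptions), an environment running 10 concurrent sessions, two of which cache a 150MB lookup table, might be estimated as follows:
Session coordination:                  ~30 MB
10 sessions x ~30 MB:                  ~300 MB
2 lookup caches x ~150 MB:             ~300 MB
Estimated Integration Service total:   ~630 MB, plus operating system and other application overhead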
System Recommendations PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple servers using the Grid Option. The Grid Option allows for adding capacity at a low cost while providing implicit High Availability with the active/active Integration Service configuration. Below are the recommendations for a single node PowerCenter server.
Minimum Server 1 Node, 4 CPUs and 16GB of memory (instead of the minimal requirement of 4GB RAM) and 6 GB storage for PowerCenter binaries. A separate file system is recommended for the infa_shared working file directory and it can be sized depending on the work load profile.
Disk Space
Disk space is not a factor if the machine is used only for PowerCenter services, unless the following conditions exist:
Data is staged to flat files on the PowerCenter machine.
Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
Temporary space is needed for paging for transformations that require large caches that cannot be entirely held in system memory.
Session logs are saved by timestamp.
If any of these factors is true, additional storage should be allocated for the file system used by the infa_shared directory. Typically Informatica customers allocate a minimum of 100 to 200 GB for this file system. Informatica recommends monitoring disk space on a regular basis or maintaining some type of script to purge unused files.
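A minimal sketch of such a purge script follows, assuming the infa_shared layout described earlier and a 14-day retention policy; the paths, sub-directory names, and retention period are illustrative and should be adjusted to your environment:
#!/bin/ksh
# Illustrative only: remove run-time files older than 14 days from infa_shared.
INFA_SHARED=/app/infa/server/infa_shared
find $INFA_SHARED/SessLogs     -type f -mtime +14 -exec rm -f {} \;
find $INFA_SHARED/WorkflowLogs -type f -mtime +14 -exec rm -f {} \;
find $INFA_SHARED/BadFiles     -type f -mtime +14 -exec rm -f {} \;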
Sizing Analysis
The basic goal is to size the server so that all jobs can complete within the specified load window. You should consider the answers to the questions in the "Environment Questions" and "PowerCenter Sizing Questions" sections to estimate the required number of sessions, the volume of data that each session moves, and its lookup table, aggregation, and heterogeneous join caching requirements. Use these estimates with the recommendations in the "PowerCenter Resource Consumption" section to determine the required number of processors, memory, and disk space to achieve the performance needed to meet the load window. PowerCenter provides an advanced level of automatic memory configuration, with the option of manual configuration. The minimum required cache memory for each active transformation in a mapping can be calculated and accumulated for concurrent jobs. You can use the Cache Calculator feature for Aggregator, Joiner, Rank, and Lookup transformations.
Note that the deployment environment often creates performance constraints that hardware capacity cannot overcome. The Integration Service throughput is usually constrained by one or more of the environmental factors addressed by the questions in the "Environment" section. For example, if the data sources and target are both remote from the PowerCenter server, the network is often the constraining factor. At some point, additional sessions, processors, and memory may not yield faster execution because the network (not the PowerCenter services) imposes the performance limit. The hardware sizing analysis is highly dependent on the environment in which the server is deployed. You need to understand the performance characteristics of the environment before making any sizing conclusions. It is also vitally important to remember that other applications (in addition to PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a database engine and query/analysis tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools all run on the same server, the query/analysis tool often drives the hardware requirements. However, if the loading is performed after business hours, the query/analysis tools requirements may not be a sizing limitation.
Last updated: 27-May-08 14:44
PowerCenter Enterprise Grid Option
Challenge
Build a cost-effective and scalable data integration architecture that is reliable and able to respond to changing business requirements.
Description
The PowerCenter Grid Option enables enterprises to build dynamic and scalable data integration infrastructures that have the flexibility to meet diverse business needs. The Grid Option can exploit underutilized computing resources to handle peak load periods, and its dynamic partitioning and load balancing capabilities can improve the overall reliability of a data integration platform. If a server fails in a grid-only configuration without the HA option/capability, the tasks assigned to it are not automatically recovered, but any subsequent tasks are assigned to other servers.
The foundation for a successful PowerCenter Grid Option implementation is the storage sub-system. In order to provide a high-performance single file system view to PowerCenter, it is necessary to set up either a Clustered File System (CFS) or Network Attached Storage (NAS). While NAS is directly accessible from multiple nodes and can use the existing network fabric, CFS allows nodes to share the same directories by managing concurrent read/write access. CFS should be configured for simultaneous reads/writes, and the CFS block size should be set to optimize the PowerCenter disk I/O. A separate mount point for the infa_shared directory should be created using the shared file system. The infa_shared directory contains the working file sub-directories such as Cache, SrcFiles, and TgtFiles. The PowerCenter binaries should be installed on local storage. Some CFS alternatives include:
Red Hat Global File System (GFS)
Sun Cluster (CFS, QFS)
Veritas Storage Foundation Cluster File System
HP Serviceguard Cluster File System
IBM AIX Cluster File System (GPFS)
NAS provides access to storage over the network. NAS devices contain a server that provides file services to other hosts on the LAN using network file access methods such as CIFS or NFS. Most NAS devices offer file services over the Windows-centric SMB (Server Message Block) and CIFS (Common Internet File System) protocols, the Unix favorite NFS (Network File System), or the near-universal HTTP. With new emerging protocols such as DAFS (Direct Access File System), traditionally I/O-intensive applications are moving to NAS. NAS is directly connected to a LAN and hence consumes large amounts of LAN bandwidth. In addition, special backup methods need to be used for backup and disaster recovery.
The PowerCenter Integration Service reads from and writes to the shared file system in a grid configuration. When persistent lookups are used, there may be simultaneous reads from multiple nodes. The Integration Service performs random reads for lookup caches. If cache performance degradation is experienced as a result of using a certain type of CFS or NAS product, the cache directory can be placed on local storage. In the case of persistent cache files that need to be accessed from multiple nodes, the persistent cache file can be built on one node first and then copied to the other nodes. This reduces the random read performance impact of the CFS or NAS product.
When installing the PowerCenter Grid Option on Unix, use the same user id (uid) and group id (gid) for each Unix account. If the infa_shared directory is placed on a shared file system like CFS or NAS, the Unix accounts should have read/write access to the same files. For example, if a workflow running on node1 creates a persistent cache file in the Cache directory, node2 should be able to read and update this file.
When installing the PowerCenter Grid Option on Windows, the user assigned to the Informatica Services joining the grid should have permissions to access the shared directory. This can be accomplished by granting Full Control, Change, and Read access to the shared directory for the machine account. As a post-installation step, the persistent cache files, parameter files, logs, and other run-time files should be configured to use the shared file system by pointing the $PMRootDir variable to this directory.
PowerCenter resources can be configured to assign specific tasks to specific nodes. The objective in this type of configuration is to create a dynamic grid to meet changing business needs. For example, a dummy custom resource can be defined and assigned to tasks. This custom resource can be made permanently available to the production nodes. If during peak month-end processing the need arises to use an additional node from the test environment, simply make this custom resource available to the additional node to allow production tasks to run on the new server.
In metric-based dispatch mode and adaptive dispatch mode, the Load Balancer collects and stores statistics from the last three runs of each task and compares them with node load metrics. This metadata is available in the OPB_TASK_STATS repository table. The CPU and memory metrics available in this table can be used for capacity planning and departmental charge-backs. Since this table contains statistics from only the last three runs of a task, it is necessary to build a process that extracts data from this table into a custom history table. The history table can then be used to calculate averages and perform trend analysis.
Proactive monitoring for Service Manager failures is essential. The Service Manager manages both the Integration Service and the Repository Service. In a two-node grid configuration, two Service Manager processes are running. Use custom scripts or third-party tools such as Tivoli Monitoring or HP OpenView to check the health and availability of the PowerCenter Service Manager process. Below is a sample script that can be called from a monitoring tool:
#!/bin/ksh
# Initializing runtime variables
typeset -i no_srv=0
srv_env=`uname -n`
# Check whether the PowerCenter Service Manager process (tomcat) is currently running
no_srv=`ps -ef | grep tomcat | grep -v grep | wc -l`
# If it is not running, exit with a message
if [ $no_srv -eq 0 ]
then
    echo "PowerCenter service process on $srv_env is not running"
    exit 1
fi
exit 0
To upgrade a two-node grid without incurring downtime, follow the steps below:
1. Set up a separate schema/database to hold a copy of the production repository.
2. Take node1 out of the existing grid.
3. Upgrade the binaries and the repository while node2 is handling the production loads.
4. Switch the production loads to node1.
5. While node1 is handling the production loads, upgrade the node2 binaries.
6. After the node2 upgrade is complete, node2 can be put back on the grid.
With Session on Grid, PowerCenter automatically distributes the partitions of the transformations across the grid. You do not need to specify the distribution of nodes for each transformation. By using Dynamic Partitioning that bases the number of partitions on the number of nodes in the grid, a session can scale up automatically when the number of nodes in the grid is expanded.
Last updated: 24-Jun-10 16:11
Pushdown Optimization
Challenge
Informatica PowerCenter embeds a powerful engine with its own memory management system and the algorithms needed to perform transformation operations such as aggregation, sorting, joining, and lookups. This is typically referred to as an ETL architecture, where Extract, Transform, and Load are performed by the engine: data is extracted from the data source to the PowerCenter engine (either on the same machine as the source or on a separate machine), where all the transformations are applied, and is then pushed to the target. In such a scenario, where data is transferred, items to consider for optimal performance include:
A network that is fast and tuned effectively
A powerful server with high processing power and memory to run PowerCenter
ELT is a design or runtime paradigm that is becoming popular with the advent of higher-performing RDBMS systems (whether DSS or OLTP). Teradata in particular runs on a well-tuned operating system and well-tuned hardware that lend themselves to ELT. The ELT paradigm tries to maximize the benefits of this by pushing much of the transformation logic onto the database servers. The ELT design paradigm can be achieved through the Pushdown Optimization option provided with PowerCenter.
Description Maximizing Performance Using Pushdown Optimization Transformation logic can be pushed to the source or target database using pushdown optimization. The amount of work that can be pushed to the database depends upon the pushdown optimization configuration, the transformation logic and the mapping and session configuration. When running a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables and it processes any transformation logic that it cannot push to the database. Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. The Pushdown Optimization Viewer can also be used to view messages related to Pushdown Optimization.
The above mapping contains a filter transformation that filters out all items except for those with an ID greater than 1005. The Integration Service can push the transformation logic to the database, and it generates the following SQL statement to process the transformation logic:
INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE)
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS INTEGER)
FROM ITEMS
WHERE (ITEMS.ITEM_ID > 1005)
The Integration Service generates an INSERT SELECT statement to obtain and insert the ID, NAME, and DESCRIPTION columns from the source table, and it filters the data using a WHERE clause. The Integration Service does not extract any data from the database during this process.
Running Pushdown Optimization Sessions When running a session configured for Pushdown Optimization, the Integration Service analyzes the mapping and transformations to determine the transformation logic it can push to the database. If the mapping contains a mapplet, the
Integration Service expands the mapplet and treats the transformations in the mapplet as part of the parent mapping. Pushdown Optimization can be configured in the following ways:
Using source-side pushdown optimization: The Integration Service pushes as much transformation logic as possible to the source database.
Using target-side pushdown optimization: The Integration Service pushes as much transformation logic as possible to the target database.
Using full pushdown optimization: The Integration Service pushes as much transformation logic as possible to both the source and target databases.
If a session is configured for full pushdown optimization and the Integration Service cannot push all the transformation logic to the database, it performs partial pushdown optimization instead.
Running Source-Side Pushdown Optimization Sessions When running a session configured for source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the database. The Integration Service generates a SELECT statement based on the transformation logic for each transformation it can push to the database. When running the session, the Integration Service pushes all of the transformation logic that is valid to the database by executing the generated SQL statement. Then it reads the results of this SQL statement and continues to run the session. If running a session that contains an SQL override the Integration Service generates a view based on that SQL override. It then generates a SELECT statement and runs the SELECT statement against this view. When the session completes, the Integration Service drops the view from the database.
Running Target-Side Pushdown Optimization Sessions When running a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline that it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database; then, it executes the generated SQL.
Running Full Pushdown Optimization Sessions
To use full pushdown optimization, the source and target must be on the same database. When running a session configured for full pushdown optimization, the Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. It generates SQL statements that are executed against the source and target database based on the transformation logic it can push to the database. If the session contains a SQL override, the Integration Service generates a view and runs a SELECT statement against that view.
When running a session for full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when generating a long transaction:
A long transaction uses more database resources.
A long transaction locks the database for longer periods of time, and thereby reduces the database concurrency and increases the likelihood of deadlock.
A long transaction can increase the likelihood that an unexpected event may occur.
Integration Service Behavior with Full Optimization When configuring a session for full optimization, the Integration Service might determine that it can push all of the transformation logic to the database. When it can push all of the transformation logic to the database, it generates an INSERT SELECT statement that is run on the database. The statement incorporates transformation logic from all the transformations in the mapping. When configuring a session for full optimization, the Integration Service might determine that it can push only part of the transformation logic to the database. When it can push part of the transformation logic to the database, the Integration Service pushes as much transformation logic to the source and target databases as possible. It then processes the remaining transformation logic. For example, a mapping contains the following transformations:
The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source. It pushes the Expression transformation and target to the target database and it processes the Rank transformation. The Integration Service does not fail the session if it can push only part of the transformation logic to the database.
Sample Mapping with Two Partitions
The first key range is 1313 - 3340 and the second key range is 3340 - 9354. The SQL statement merges all of the data into the first partition:
INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC)
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC
FROM ITEMS
WHERE (ITEMS.ITEM_ID >= 1313) AND (ITEMS.ITEM_ID < 9354)
ORDER BY ITEMS.ITEM_ID
The SQL statement selects items 1313 through 9354 (which includes all values in the key range) and merges the data from both partitions into the first partition. The SQL statement for the second partition passes empty data:
INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC)
ORDER BY ITEMS.ITEM_ID
Working With SQL Overrides The Integration Service can be configured to perform an SQL override with pushdown optimization. To perform an SQL override configure the session to create a view. When an SQL override is used for a Source Qualifier transformation in a session configured for source or full pushdown optimization with a view, the Integration Service creates a view in the source database based on the override. After it creates the view in the database, the Integration Service generates an SQL query that it can push to the database. The Integration Service runs the SQL query against the view to perform pushdown optimization. Note: To use an SQL override with pushdown optimization, the session must be configured for pushdown optimization with a view.
Running a Query
If the Integration Service did not successfully drop the view, a query can be executed against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. Search for views with this prefix to locate the views created during pushdown optimization.
Teradata-specific SQL:
SELECT TableName FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\'
Rules and Guidelines for SQL Override
Use the following rules and guidelines when pushdown optimization is configured for a session containing an SQL override:
1. Do not use an ORDER BY clause in the SQL override.
2. Use ANSI outer join syntax in the SQL override.
3. Do not use a Sequence Generator transformation.
4. If a Source Qualifier transformation is configured for a distinct sort and contains an SQL override, the Integration Service ignores the distinct sort configuration.
5. If the Source Qualifier contains multiple partitions, specify the SQL override for all partitions.
6. If a Source Qualifier transformation contains Informatica outer join syntax in the SQL override, the Integration Service processes the Source Qualifier transformation logic.
7. PowerCenter does not validate the override SQL syntax, so test the SQL override query before pushing it to the database.
8. When an SQL override is created, ensure that the SQL syntax is compatible with the source database.
Configuring Sessions for Pushdown Optimization
A session can be configured for pushdown optimization in the session properties. However, the transformation, mapping, or session configuration may need further editing to push more transformation logic to the database. Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database.
To configure a session for pushdown optimization:
1. In the Workflow Manager, open the session properties for the session containing the transformation logic to be pushed to the database.
2. From the Properties tab, select one of the following Pushdown Optimization options:
   None
   To Source
   To Source with View
   To Target
   Full
   Full with View
3. Click on the Mapping tab in the session properties.
4. Click on View Pushdown Optimization.
5. The Pushdown Optimization Viewer displays the pushdown groups and the SQL that is generated to perform the transformation logic. It displays messages related to each pushdown group, along with numbered flags to indicate the transformations in each pushdown group.
6. View the information in the Pushdown Optimization Viewer to determine whether the mapping, transformation, or session configuration needs editing to push more transformation logic to the database.
Effectively Designing Mappings for Pushdown Optimization

Below is an example of a mapping that needs to be redesigned in order to use Pushdown Optimization:
In the above mapping, there are two lookups and one filter. Because the staging area is the same as the target area, Pushdown Optimization can be used to achieve high performance. However, parallel lookups are not yet supported within PowerCenter, so the mapping needs to be redesigned. See the redesigned mapping below:
In order to use Pushdown Optimization, the lookups have been serialized, which turns each lookup into a sub-query when the SQL is generated. The figure below shows the complete SQL and pushdown configuration using the Full Pushdown option:
The sample SQL generated is shown below:
Group 1

INSERT INTO Target_Table (ID, ID2, SOME_CAST)
SELECT Source_Table.ID, Source_Table.SOME_CONDITION, CAST(Source_Table.SOME_CAST), Lookup_1.ID, Source_Table.ID
FROM ((Source_Table
  LEFT OUTER JOIN Lookup_1 ON (Lookup_1.ID = Source_Table.ID)
    AND (Source_Table.ID2 = (SELECT Lookup_2.ID2 FROM Lookup_2 Lookup_1 WHERE (Lookup_1.ID = Source_Table.ID2))))
  LEFT OUTER JOIN Lookup_1 Lookup_2 ON (Lookup_1.ID = Source_Table.ID)
    AND (Source_Table.ID = (SELECT Lookup_2.ID2 FROM Lookup_2 WHERE (Lookup_2.ID2 = Source_Table.ID2))))
WHERE (NOT (Lookup_1.ID1 IS NULL) AND NOT (Lookup_2.ID2 IS NULL))

As demonstrated in the above example, very complicated SQL can be generated using Pushdown Optimization. While configuring sessions, make sure that the correct joins are being generated.
Best Practices for Teradata Pushdown Optimization

Use Full Pushdown Optimization: because of large data volumes, the best performance is usually obtained by doing all processing inside the database.
Use pushdown overrides with a view; the override should contain tuned SQL.
Filter data using a WHERE clause before doing outer joins.
Avoid full table scans for large tables.
Use staging processing if necessary.
Use temp tables if necessary (create pre-session, drop post-session).
Validate the use of primary and secondary indexes.
Minimize the use of transformations, since the resulting SQL may not be tuned.

For pushdown optimization on Teradata, consider the following Teradata functions if an override is needed, so that all processing occurs inside the database. Detailed documentation on each function can be found at http://teradata.com.

AVG
COUNT
MAX
MIN
SUM
RANK
PERCENT_RANK
CSUM
MAVG
MDIFF
MLINREG
MSUM
QUANTILE
AVG
CORR
COUNT
COVAR_POP
COVAR_SAMP
GROUPING
KURTOSIS
MAX
MIN
REGR_AVGX
REGR_AVGY
REGR_COUNT
REGR_INTERCEPT
REGR_R2
REGR_SLOPE
REGR_SXX
REGR_SXY
REGR_SYY
SKEW
STDDEV_POP
STDDEV_SAMP
SUM
VAR_POP
VAR_SAMP

For pushdown optimization on Teradata, understand string-to-datetime conversions in Teradata using the CAST function (useful in override SQL); see the sketch after this list.
Fully pushed down mappings do not necessarily result in the fastest execution. Some scenarios are best with ELT and some are best with ETL.
Understanding the semantics of the data and the transformation logic is important; mappings may be tuned accordingly to get better results.
Understanding the reason why something cannot be translated to SQL is important; mappings may be tuned accordingly to get better results.
An Update Strategy is a row-by-row operation and generates SQL that may result in slow performance.
To convert an integer into a string and pad the string with leading zeros, a function such as LPAD is needed; if the LPAD function is not supported in the database, full PDO is not possible. Consider using PowerCenter functions that have an equivalent function in the database for full PDO.
Error handling: because the database executes the SQL and handles the errors, it is not possible to make use of PowerCenter error handling features like reject files.
Recovery: because the database processes the transformations, it is not possible to make use of PowerCenter features like incremental recovery.
Logging: because the transformations are processed in the database, PowerCenter does not get the same level of transformation statistics, and hence these are not logged.
If staging and target tables are in different Oracle database servers, consider creating a synonym (or other equivalent object) in one database pointing to the tables of the other database. Use synonyms in the mapping and use full PDO. Note that depending on the network topology, full PDO may or may not be beneficial.
If staging and target tables belong to different Oracle users but reside in the same database, note that from PowerCenter 8.6.1 on, PDO can automatically qualify tables if the connections are "compatible". Use the "Allow Pushdown for User Incompatible Connections" option.
Scenario: OLTP data has to be transformed and loaded to a database. A mapping with heterogeneous source and target
cannot be fully pushed down. Consider a two-pass approach: first, OLTP to staging table using loader utilities or the PowerCenter engine; then, staging table -> transformations -> target with full pushdown.
Scenario: A PowerCenter mapping has a Sorter before an Aggregator and uses the "Sorted Input" option in the Aggregator. Consider removing the unnecessary Sorter; doing so results in better SQL.
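As referenced in the list above, the following Teradata expressions illustrate a string-to-date conversion and one way of left-padding an integer with zeros when LPAD is unavailable. These are sketches only; the literals are hypothetical and the FORMAT-based padding idiom should be verified against the Teradata version in use.

-- String-to-date and string-to-timestamp conversions
SELECT CAST('2010-06-24' AS DATE FORMAT 'YYYY-MM-DD');
SELECT CAST('2010-06-24 16:11:00' AS TIMESTAMP(0));

-- Padding an integer with leading zeros via FORMAT (illustrative alternative to LPAD)
SELECT CAST(CAST(42 AS FORMAT '9(6)') AS CHAR(6));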
Last updated: 24-Jun-10 16:11
Understanding and Setting UNIX Resources for PowerCenter Installations

Challenge

This Best Practice explains what UNIX resource limits are, and how to control and manage them.
Description

UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding and setting these resources correctly is essential for PowerCenter installations.
Understanding UNIX Resource Limits

UNIX systems impose limits on several different resources. The resources that can be limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the operating system. In general, all UNIX systems implement per-process limits on the following resources; there may be additional resource limits, depending on the operating system.
Processor time: The maximum amount of processor time that can be used by a process, usually in seconds.
Maximum file size: The size of the largest single file a process can create. Usually specified in blocks of 512 bytes.
Process data: The maximum amount of data memory a process can allocate. Usually specified in KB.
Process stack: The maximum amount of stack memory a process can allocate. Usually specified in KB.
Number of open files: The maximum number of files that can be open simultaneously.
Total virtual memory: The maximum amount of memory a process can use, including stack, instructions, and data. Usually specified in KB.
Core file size: The maximum size of a core dump file. Usually specified in blocks of 512 bytes.
These limits are implemented on an individual process basis and are inherited by child processes when they are created. In practice, this means that resource limits are typically set at log-on time and apply to all processes started from the log-in shell. In the case of PowerCenter, any limits in effect before the Integration Service is started also apply to all sessions (pmdtm) started from that node. Any limits in effect when the Repository Service is started also apply to all pmrepagents started from that Repository Service (a repository service process is an instance of the Repository Service running on a particular machine or node).

When a process exceeds a resource limit, UNIX fails the operation that caused the limit to be exceeded. Depending on the limit that is reached, memory allocations fail, files cannot be opened, or the process is terminated when it exceeds its processor time. Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of memory, it is important to set resource limits high enough that the operating system does not restrict access to the resources sessions need, while still keeping them low enough to prevent runaway processes from causing problems.
Hard and Soft Limits

Each resource that can be limited actually allows two limits to be specified: a 'soft' limit and a 'hard' limit. Hard and soft limits can be confusing. From a practical point of view, the difference between hard and soft limits does not matter to PowerCenter or any other process; the lower value is enforced when it is reached, whether it is a hard or soft limit.
The difference between hard and soft limits really only matters when changing resource limits. The hard limits are the absolute maximums set by the System Administrator that can only be changed by the System Administrator. The soft limits are ‘recommended’ values set by the System Administrator, and can be increased by the user, up to the maximum limits.
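For example (the values shown are illustrative), a non-root user can move the soft limit up or down but cannot exceed the hard limit:

$ ulimit -H -n          # hard limit on open files (example value)
4096
$ ulimit -S -n 2048     # raising the soft limit up to the hard limit succeeds
$ ulimit -S -n 8192     # fails: the soft limit cannot be raised above the hard limit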
UNIX Resource Limit Commands

The standard interface to UNIX resource limits is the 'ulimit' shell command, which displays and sets resource limits. The C shell implements a variation of this command called 'limit', which has different syntax but the same functions.

ulimit -a      Displays all soft limits
ulimit -a -H   Displays all hard limits in effect

Recommended ulimit settings for a PowerCenter server:
Processor time: Unlimited. This is needed for the pmserver and pmrepserver processes, which run continuously.
Maximum file size: Based on what is needed for the specific application. This is an important parameter to keep a session from filling a whole filesystem, but it needs to be large enough not to affect normal production operations.
Process data: 1 GB to 2 GB.
Process stack: 32 MB.
Number of open files: At least 256. Each network connection counts as a 'file', so source, target, and repository connections, as well as cache files, all use file handles.
Total virtual memory: The largest expected size of a session. 1 GB should be adequate, unless sessions are expected to create large in-memory aggregate and lookup caches that require more memory. If sessions are likely to require more than 1 GB, set the total virtual memory appropriately. Remember that on a 32-bit OS, the maximum virtual memory for a session is 2 GB.
Core file size: Unlimited, unless disk space is very tight. The largest core files can be roughly 2 to 3 GB, but after analysis they should be deleted, and there should not be multiple core files lying around.
Setting Resource Limits

Resource limits are normally set in the log-in script, either .profile for the Korn shell or .bash_profile for the bash shell. One ulimit command is required for each resource being set, and usually the soft limit is set. A typical sequence is:

ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited

After running this, the limits are changed:

% ulimit -S -a
core file size   (blocks, -c)  unlimited
data seg size    (kbytes, -d)  1232896
file size        (blocks, -f)  2097152
max memory size  (kbytes, -m)  unlimited
open files       (-n)          1024
stack size       (kbytes, -s)  32768
cpu time         (seconds, -t) unlimited
virtual memory   (kbytes, -v)  unlimited
Setting or Changing Hard Resource Limits

Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set the initial hard limits in the file /etc/profile, which must be changed by a System Administrator. In some cases, it is necessary to run a system utility such as smit on AIX to change the global system limits.
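As an example of where these limits live on one common platform, the following is a sketch of /etc/security/limits.conf entries on a Linux system that uses pam_limits. The user name and values are illustrative, and other UNIX variants use different files or utilities as noted above.

# /etc/security/limits.conf  (Linux pam_limits syntax: <domain> <type> <item> <value>)
pwxuser   soft   nofile   1024
pwxuser   hard   nofile   4096
pwxuser   soft   data     1232896
pwxuser   hard   data     unlimited
pwxuser   hard   core     unlimited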
Last updated: 29-Sep-10 20:22
PowerExchange for Oracle CDC

Challenge

Configure the Oracle environment for optimal performance when using PowerExchange Change Data Capture (CDC) in a production environment.
Description

There are two performance considerations when dealing with Oracle CDC: latency of the data and restartability of the environment. Some of the factors that impact these areas are configurable within PowerExchange, while others are not. These two considerations are addressed separately in this Best Practice.
Minimize Data Latency

The objective is to minimize the amount of time that it takes for a change made to the source database to appear in the target database. Some of the factors that can affect latency are discussed below.
Location of PowerExchange CDC

The optimal location for installing PowerExchange CDC is the server that contains the Oracle source database. This eliminates the need to use the network to pass data between Oracle's LogMiner and PowerExchange, eliminates the need to use SQL*Net for this process, and minimizes the amount of data being moved across the network. For best results, install the PowerExchange Listener on the same server as the source database.
Volume of Data

The volume of data that Oracle LogMiner has to process in order to provide changed data to PowerExchange can have a significant impact on performance. Bear in mind that, in addition to the changed data rows, other processes may be writing large volumes of data to the Oracle redo logs. These include, but are not limited to:

Oracle catalog dumps
Oracle workload monitor customizations
Other (non-Oracle) tools that use the redo logs to provide proprietary information

To optimize PowerExchange CDC performance, the amount of data these processes write to the Oracle redo logs needs to be minimized, both in terms of volume and frequency. This includes minimizing the invocations of LogMiner to a single occurrence. Review the processes that are actively writing data to the Oracle redo logs and tune them within the context of a production environment.

Monitoring the redo log switches and the creation of archived log files is one way to determine how busy the source database is. The size of the archived log files and how often they are created over a day will give a good idea of the performance implications; see the query sketch below.
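One way to see how many archived logs are produced per day, and their total size, is a query such as the following. This is a sketch against the standard V$ARCHIVED_LOG view; adjust it for how long archived-log rows are retained on the system.

SELECT TRUNC(completion_time)                        AS archive_day,
       COUNT(*)                                      AS logs_created,
       ROUND(SUM(blocks * block_size) / 1024 / 1024) AS total_mb
FROM   v$archived_log
GROUP  BY TRUNC(completion_time)
ORDER  BY archive_day;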
Server Workload

Optimize the performance of the Oracle database server by reducing the number of unnecessary tasks it performs concurrently with the PowerExchange CDC components. This may include a full review of the backup and restore schedules, Oracle import and export processing, and other application software running within the production server environment.

PowerCenter also contributes to the workload on the server where PowerExchange CDC is running, so it is important to optimize these workload tasks. This can be accomplished through mapping design. If possible, include all of the processing of PowerExchange CDC sources within the same mapping. This minimizes the number of tasks generated and ensures that all of the required data from either the Oracle archive log (i.e., near real time) or the CDC files (i.e., CAPXRT, condense) is processed within a single pass of the logs or CDC files.
Continuous Extraction Mode Considerations

Continuous Extraction Mode is slightly different from Batch Extraction Mode and allows PowerCenter to use the Oracle Real-time CDC connection with an override for the CAPI-CONNECTION. In this mode, PowerExchange initiates the extraction of changed data from the Oracle archive logs through PowerExchange's condense process. PowerExchange first checks the change registrations for all of the CDC data sources associated with the collection id. Once this process has completed successfully, PowerExchange initiates an Oracle LogMiner session. The type of LogMiner session is controlled by a configuration
file parameter, which should be set to continuous mode. This establishes an Oracle LogMiner continuous mining session to extract all of the changes from the Oracle archive logs and the online redo logs. Oracle LogMiner first processes all of the archive logs that have not yet been processed and then processes the Oracle online redo logs. This process runs continuously until it is manually stopped. PowerExchange periodically flushes the write buffers for the open condense data files for consumption by PowerCenter. When PowerExchange receives a request for change data source records from PowerCenter, it retrieves the unprocessed records from the closed condense data files and passes them to PowerCenter. Once all of the closed condense files have been processed, PowerExchange starts processing the condense files that are still open and being written to periodically. Below is a high-level depiction of the processes associated with continuous extraction.

Change Data Capture Component
1. At startup time, the PowerExchange CDC process (i.e., condense) reads the dbmover.cfg and dtlca.cfg files for configuration information.
2. The PowerExchange CDC process retrieves all of the change registrations for the collection id specified in the dtlca.cfg file.
3. The PowerExchange CDC process retrieves restart tokens from the CDC control files.
4. PowerExchange CDC initiates an Oracle LogMiner session to retrieve changed data records.
5. LogMiner passes all the changed data records to the PowerExchange CDC process.
6. PowerExchange CDC periodically flushes the write buffers to the condense data file based on the FILE_FLUSH_VAL parameter. PowerExchange CDC closes the condense data file and opens a new file based on the FILE_SWITCH_CRIT and FILE_SWITCH_VAL parameter settings.
7. The PowerExchange CDC process updates the restart tokens in the CDC control files.

Data Consumption Component
1. The PowerCenter session issues a request for CDC data to the PowerExchange client for a certain location.
2. The PowerExchange client checks the dbmover.cfg file for the IP address associated with the location.
3. The PowerExchange client sends the request to the PowerExchange Listener.
4. The PowerExchange Listener sends the request to the PowerExchange CDC process.
5. The PowerExchange CDC process retrieves the requested change data records from the condense data files.
6. The PowerExchange CDC process passes only the change records for the requested sources to the PowerExchange Listener.
7. The PowerExchange Listener passes the requested change records to the PowerExchange client.
8. The PowerExchange client passes the change records to the PowerCenter session.

The best PowerExchange CDC extraction method in most cases involving Oracle is Continuous Extraction Mode, which has the following benefits:

Not dependent on PowerCenter to initiate the change data capture process.
Minimal impact on the Oracle database server resources.
Captures changes for all change registrations for the Oracle database SID with a single pass of the available Oracle archive logs. Once all outstanding Oracle archive logs have been processed, Oracle LogMiner starts mining the online redo logs.
Lower risk of unprocessed Oracle archive logs being swept from disk prior to being processed by Oracle LogMiner.
The latency of the data is relatively small as long as the requesting PowerCenter workflow is running continuously. All of the change records are stored within a condense file and ready for consumption by PowerCenter when the request is received.
PowerCenter has access to the change data records even though the PowerExchange condense file has not been closed.

The following is a sample of a pwxccl.cfg parameters file that controls the Oracle CDC condense process:

/* ----------------------------------------------------------------*/
/* PowerExchange Condense Configuration File
/*
/* See PowerExchange CDC Guide for Linux, UNIX and Windows
/* Chapter 3 - PowerExchange Logger for Linux, UNIX and Windows
/* ----------------------------------------------------------------*/
/* The value for the DBID parameter must match the Collection-ID
/* contained in the ORACLE-ID statement in the dbmover.cfg file.
/* ----------------------------------------------------------------*/
/*
DBID=ORACDC
DB_TYPE=ORA
EXT_CAPT_MASK=/home/pwx/v861/condense/condense
CHKPT_BASENAME=/home/pwx/v861/condense/condense.CHKPT
CHKPT_NUM=3
COND_CDCT_RET_P=5
NO_DATA_WAIT=1
NO_DATA_WAIT2=60
/* CONDENSE_SHUTDOWN_TIMEOUT=60
/*
/* ----------------------------------------------------------------*/
/* COLL_END_LOG equal to 1 means BATCH MODE
/* COLL_END_LOG equal to 0 means CONTINUOUS MODE
/* ----------------------------------------------------------------*/
/*
COLL_END_LOG=0
/*
/* ----------------------------------------------------------------*/
/* FILE_SWITCH_CRIT of M means minutes
/* FILE_SWITCH_CRIT of R means records
/* ----------------------------------------------------------------*/
/*
FILE_SWITCH_CRIT=M
FILE_SWITCH_VAL=20
FILE_SWITCH_MIN=(1,2)
FILE_FLUSH_VAL=10
/*
/* ----------------------------------------------------------------*/
/* CAPT_IMAGE of AI means AFTER IMAGE
/* CAPT_IMAGE of BA means BEFORE and AFTER IMAGE
/* ----------------------------------------------------------------*/
/*
CAPT_IMAGE=BA
SIGNALLING=Y
/*
/* ----------------------------------------------------------------*/
/* Oracle User id and Password for PowerExchange CDC
/* ----------------------------------------------------------------*/
/*
UID=database userid
PWD=database password
/*
/********************************************************************/
/* The following parameters are only used during a cold start and force
/* the cold start to use the most recent catalog copy. Without these
/* parameters, if the v_$transaction and v_$archive_log views are out of
/* sync, there is a very good chance that the most recent
/* catalog copy will not be used for the cold start.
/********************************************************************/
/*
SEQUENCE_TOKEN=0
RESTART_TOKEN=0

These parameters have the following descriptions and syntax requirements.

DBID=collection id
Specifies the PowerExchange change registration collection identifier, also called the instance name. The value used must be identical to the first parameter of the ORACLEID statement in the dbmover.cfg file.

DB_TYPE=ORA
Mandatory parameter; must contain ORA for Oracle.

EXT_CAPT_MASK=/directory/mask name
Unique mask for the data files created by Condense. A suffix containing a date/time stamp of when the file was created is added to this mask name. The directory must already exist.

CHKPT_BASENAME=/directory/mask name
Unique mask for the checkpoint files created by Condense. A suffix containing Vnn is added to this mask name. The directory must already exist.

CHKPT_NUM=number of checkpoint files
Specifies the number of checkpoint files. The default is 3.

COND_CDCT_RET_P=number of days
CDCT and condensed files retention period in days. Files older than this period and their corresponding CDCT records are deleted during start-up, file switch, or shutdown processing.

NO_DATA_WAIT=number of minutes
When running in continuous mode, defines the number of minutes to wait on commands manually entered through the Command Handler before starting the next Condense. The default is 60.

NO_DATA_WAIT2=number of seconds
Defines the number of seconds before the Condenser stops. The default is 600 seconds.

CONDENSE_SHUTDOWN_TIMEOUT=number of seconds
Specifies the maximum time period for the PowerExchange Condenser, DTLCACON, to shut down normally after a shutdown command. The default is 600.

COLL_END_LOG=0 or 1
Specifies whether to use batch or continuous mode.
0 - Continuous mode. After each condense run, the system waits for the number of minutes defined in the NO_DATA_WAIT parameter, then performs another Condense.
1 - Batch mode. The system shuts down after a single condense run. For example, a single condense run might be scheduled following a particular batch update job.

FILE_SWITCH_CRIT=Records or Minutes
Defines the criteria to use when deciding when to do an automatic file switch. The default is M.
R - Records
M - Minutes

FILE_SWITCH_VAL=file switch units
Defines the number of FILE_SWITCH_CRIT units at which to perform a file switch automatically. The default is 30.

FILE_SWITCH_MIN=(number of units, number of units ignored)
Specifies file switch criteria for Condense when changes for new sources are encountered. Use this to reduce latency for continuous extraction mode. Number of Units specifies the minimum number of FILE_SWITCH_CRIT units that must pass before a file switch is done when encountering a change for a source with no entry in the CDCT. Number of Units Ignored specifies the number of FILE_SWITCH_CRIT units that must occur during cold start processing before Condense uses the Number of Units value. The default is (-1,0).

FILE_FLUSH_VAL=number of seconds
Specifies the file flush interval in seconds. The file flush interval is the number of seconds that elapse before a flush is performed on the current partial condense file. When the Condense task flushes, the data is written to the disk condense files, allowing it to be read by continuous extraction mode extractions.

CAPT_IMAGE=before or after image
Specifies whether before images or after images should be captured.
BA - Before and after images
AI - After images only

SIGNALLING=Y or N
Specifies how the system should come down when an abnormal condition occurs. The default is N.
Y - The system tries to shut down normally
N - The system abends with a dump

UID=user id
Oracle database user id.

PWD=password
Oracle database password.

SEQUENCE_TOKEN=0
Sequence portion of the restart token to be used when doing a cold start.

RESTART_TOKEN=0
Restart token to be used when doing a cold start.
Condense Option Considerations

The condense option for Oracle CDC provides only the required data by reducing the collected data based on the Unit of Work information. This can prevent the transfer of unnecessary data and save CPU and memory resources.

In order to properly allocate space for the files created by the condense process, it is necessary to perform capacity planning. In determining the space required for the CDC data files, it is important to know whether before and after images (or just after images) are required. The retention period for these files must also be considered; it is defined, in days, by the COND_CDCT_RET_P parameter in the dtlca.cfg file. The general algorithms for calculating this space are outlined below.

After Image Only:

Estimated condense file disk space for Table A = ((width of Table A in bytes * estimated number of data changes for Table A per 24-hour period) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter

Before/After Image:
Estimated condense file disk space for Table A = (((width of Table A in bytes * estimated number of data changes for Table A per 24-hour period) * 2) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter

Accurate capacity planning can be accomplished by running sample condense jobs for a given number of source changes to determine the storage required. The size of the files created by the condense process can then be used to project the actual storage required in a production environment.
PowerExchange CDC Restart Performance

The amount of time required to restart the PowerExchange CDC process should also be considered when determining performance. The PowerExchange CDC process needs to be restarted whenever any of the following events occur:

A schema change is made to a table.
An existing change registration is amended.
A PowerExchange service pack is applied or a configuration file is changed.
An Oracle patch or bug fix is applied.
An operating system patch or upgrade is applied.

A copy of the Oracle catalog must be placed on the archive log in order for LogMiner to function correctly. The frequency of these copies is very site specific and can impact the amount of time that it takes the CDC process to restart. As a best practice, Informatica recommends avoiding major changes to the Oracle production environment; significant changes can negatively impact the performance and restart time of the Oracle CDC environment.

Several parameters in the dbmover.cfg configuration file can assist in optimizing restart performance:

RSTRADV: Specifies the number of seconds to wait after receiving a Unit of Work (UOW) for a source table before advancing the restart tokens by returning an "empty" UOW. This parameter is very beneficial in cases where the frequency of updates on some tables is low in comparison to other tables.

CATINT: Specifies the frequency with which the Oracle catalog is copied to the archive logs. Since LogMiner needs a copy of the catalog on the archive log to become operational, this is an important parameter, as it affects which archive log is used to restart the CDC process. When Oracle places a catalog copy on the archive log, it first flushes all of the online redo logs to the archive logs prior to writing out the catalog.

CATBEGIN: Specifies the time of day that the Oracle catalog copy process should begin, based on a 24-hour clock.

CATEND: Specifies the time of day that the Oracle catalog copy process should end, based on a 24-hour clock.

Refer to the dbmover.cfg file below for a description of these parameters. It is important to code these parameters carefully, as they can impact the amount of time it takes to restart the PowerExchange CDC process.

/* ----------------------------------------------------------------*/
/* Trace Parameter
/* ----------------------------------------------------------------*/
TRACE=(ORAC,2,99)
/*
/* ----------------------------------------------------------------*/
/* Capture Registrations and Extraction Maps
/* ----------------------------------------------------------------*/
/*
CAPT_XTRA=/home/pwx/v861/capture
CAPT_PATH=/home/pwx/v861/capture/camaps
/*
/* ----------------------------------------------------------------*/
/* Oracle Change Data Capture Parms
/* see PowerExchange CDC Guide for Linux, UNIX, and Windows
/* Chapter 6 - Oracle Change Data Capture
/* see PowerExchange Reference Guide
/* Chapter 2 - DBMOVER Configuration File Parameters
/* see Readme_ORACAPT.txt
/* ----------------------------------------------------------------*/
/*
/*ORACLEID=(collection_id,oracle_sid,connect_string,capture_connect_string)
/*
/*
/* ----------------------------------------------------------------*/
/* Change Data Capture Parameters
/* ----------------------------------------------------------------*/
/*
ORACLEID=(ORACDC,orcl,orcl,orcl)
/*
CAPI_CONNECTION=(NAME=CAPCAPX,TYPE=(CAPX,DFLTINST=ORACDC))
CAPI_CONNECTION=(NAME=CAPIUOW,TYPE=(UOWC,CAPINAME=CAPIORA,
   RSTRADV=60,MEMCACHE=4096))
CAPI_CONNECTION=(NAME=CAPIORA,DLLTRACE=ORAC,TYPE=(ORCL,CATINT=30,
   CATBEGIN=00:01,CATEND=23:59,ARRAYSIZE=1000,COMMITINT=5,
   REPNODE=local,BYPASSUF=Y,ORACOLL=ORACDC))

These parameters have the following descriptions and syntax requirements.

TRACE=(ORAC,2,99)
Trace severely impacts performance and should only be used under the direction of Informatica Global Customer Support. In an Oracle CDC environment, however, ALWAYS turn on this trace ahead of time to see what is going on, even if there are no problems.

CAPT_XTRA=qualified directory pathname
Path to the local directory where the change registration extraction maps are to be stored. This is a required parameter for CDC, and the directory path specified must exist or an error message is generated.

CAPT_PATH=qualified directory pathname
Path to the local directory where the following CDC files are to be stored:
- CCT file, which contains the change registrations
- CDEP file, which contains the application names for PowerCenter extractions that use ODBC connections
- CDCT file, which contains information about the PowerExchange condense
This is a required parameter for CDC, and the directory path specified must exist or an error message is generated.

ORACLEID=(collection id, Oracle SID, connect string, capture connect string)
Specifies the Oracle instance name and connection information. PowerExchange requires an ORACLEID statement for each Oracle instance that is being used with CDC. The maximum number of ORACLEID statements contained in the dbmover.cfg is 20.
- collection id: An Oracle instance identifier that matches the collection id specified in the change registration for the Oracle source table. The collection id is required and there is no default value.
- Oracle SID: Name of the Oracle database that contains the tables that have been registered for change data capture. The Oracle SID is required and there is no default value.
- connect string: The database service name (could be the same as the Oracle SID). The connect string is not a required field; however, the best practice is to enter it since it can be required under certain conditions.
- capture connect string: The database service name (could be the same as the Oracle SID). The connect string is not a required field; however, the best practice is to enter it since it can be required under certain conditions.

CAPI_CONNECTION=(NAME=name, TYPE=(CAPX, DFLTINST=collection id))
The CAPX CAPI_CONNECTION statement specifies the parameters used for continuous extraction from condense files.
- NAME: A unique name for the CAPI_CONNECTION statement. The maximum length is eight characters, and the name must be unique among all of the CAPI_CONNECTION statements.
- CAPX: Mandatory parameter.
- DFLTINST: Specifies the PowerExchange instance to process and must be identical to the first parameter in the ORACLEID statement.

CAPI_CONNECTION=(NAME=name, TYPE=(UOWC, CAPINAME=capi name, RSTRADV=no of seconds, MEMCACHE=cache size))
The UOWC CAPI_CONNECTION statement is used to specify parameters for the Unit of Work (UOW) Cleanser. In the change stream, the changes from multiple units of work are intermingled. The UOW Cleanser reconstructs each interleaved unit of work (UOW) from the change stream into complete units of work in chronological order based on change end time.
- NAME: A unique name for the CAPI_CONNECTION statement. The maximum length is eight characters, and the name must be unique among all of the CAPI_CONNECTION statements.
- UOWC: Mandatory parameter.
- CAPINAME: A unique name with a maximum length of eight characters.
- RSTRADV: Specifies the number of seconds PowerExchange waits before advancing the restart tokens by returning an empty Unit of Work (UOW). Empty UOWs contain no data, only restart tokens. This parameter is very beneficial in cases where the frequency of updates on some tables is low in comparison to other tables.
- MEMCACHE: Specifies the memory cache in kilobytes allocated to the UOW Cleanser to reconstruct complete UOWs. The UOW Cleanser keeps all changes in each UOW in cache until the end-UOW (commit record) is read. The default value is 1024.

CAPI_CONNECTION=(NAME=name, DLLTRACE=trace name, TYPE=(ORCL, CATINT=interval, CATBEGIN=beginning time, CATEND=ending time, ARRAYSIZE=array size, COMMITINT=commit interval, REPNODE=node name, BYPASSUF=yes or no, ORACOLL=collection id))
The ORCL CAPI_CONNECTION statement is used to specify parameters for Oracle CDC real-time extraction mode and for Oracle Condense.
- NAME: The name specified MUST be identical to the CAPINAME used in the UOWC CAPI_CONNECTION statement.
- DLLTRACE: The trace name MUST be identical to the name specified in the TRACE statement.
- ORCL: Mandatory parameter.
- CATINT: The number of minutes between attempts to write the Oracle catalog to the Oracle redo log. The default value is 1440 (once a day).
- CATBEGIN: The earliest time of day at which Oracle can attempt to write the Oracle catalog to the Oracle redo log. The default value is 00:00.
- CATEND: The latest time of day at which Oracle can attempt to write the Oracle catalog to the Oracle redo log. The default value is 24:00.
- ARRAYSIZE: Controls the size of the prefetch array Oracle capture uses to read the Oracle archive logs. The default value is 100.
- COMMITINT: Specifies the number of minutes between Oracle Capture commit points. The default value is 5.
- REPNODE: Specifies the name of the NODE statement in dbmover.cfg that points to the Capture repository. The NODE statement used should be either local or node1, where the IP address is set to 127.0.0.1 (a loopback address).
- BYPASSUF: If the Oracle instance has tables that contain LOB columns that can be included in the table row (this is the default), specify BYPASSUF=Y. The default is N.
- ORACOLL: Specifies the PowerExchange instance to process and must be identical to the first parameter in the ORACLEID statement.
Last updated: 22-Feb-10 18:36
PowerExchange for SQL Server CDC

Challenge

Install, configure, and performance tune PowerExchange for MS SQL Server Change Data Capture (CDC).
Description

PowerExchange Real-Time for MS SQL Server uses SQL Server publication technology to capture changed data. To use this feature, Distribution must be enabled. The publisher database handles replication, while the distributor database transfers the replicated data to PowerExchange, which is installed on the distribution database server. The following figure depicts a typical high-level architecture:
In the architecture for SQL Server capture, PowerExchange treats the SQL Server publication process as a "virtual" change stream. When the standard SQL Server publication process is turned on, SQL Server publishes changes to the SQL Server Distribution database, and PowerExchange then reads the changes from the Distribution database. When Publication is used and the Distribution function is enabled, support for capturing changes for tables of interest is dynamically activated through the registration of a source in the PowerExchange Navigator GUI (i.e., PowerExchange makes the appropriate calls to SQL Server automatically, via SQL DMO objects).
Key Setup Steps

The key steps involved in setting up the change capture process are:

1. Modify the PowerExchange dbmover.cfg file on the server. Example statements that must be added:

CAPI_CONN_NAME=CAPIMSSC
CAPI_CONNECTION=(NAME=CAPIMSSC,
   TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052))

2. Configure MS SQL Server replication.
Microsoft SQL Server Replication must be enabled using the Microsoft SQL Server Publication technology. Informatica recommends enabling distribution through the SQL Server Management Console. Multiple SQL Servers can use a single Distribution database; however, Informatica recommends using a single Distribution database for Production and a separate one for Development/Test. In addition, for a busy environment, placing the Distribution database on a separate server is advisable. Also, configure the Distribution database for a retention period of 10 to 14 days.

3. Ensure that the MS SQL Server Agent Service is running.

4. Register sources using the PowerExchange Navigator. Source tables must have a primary key. Note that system admin authority is required to register source tables.
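Before registering sources, it can be useful to confirm that the Distributor is actually in place. The following one-line check can be run from a SQL Server query window; sp_get_distributor is a standard SQL Server replication procedure that reports whether distribution is installed and the name of the distribution database (this is a verification sketch, not a PowerExchange-specific step):

EXEC sp_get_distributor;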
Performance Tuning Tips

If you plan to capture large numbers of transaction updates, consider using a dedicated distribution server as the host of the distribution database. This avoids contention for CPU and disk storage with a production instance.

Sometimes SQL Server CDC performance is slow: it can take approximately ten seconds for changes made at the source to take effect at the target, particularly when data arrives in low volumes. You can alter the following parameters to improve this:

POLWAIT: Specifies the number of seconds to wait before polling for new data after the end of the current data has been reached. Specify this parameter in the dbmover.cfg file of the Microsoft SQL Distribution database machine. The default is ten seconds; reducing this value to one or two seconds can improve performance.

PollingInterval: You can also decrease the polling interval parameter of the Log Reader Agent in Microsoft SQL Server. Reducing this to a lower value reduces the delay in polling for new records. Modify this parameter using the SQL Server Enterprise Manager. The default value for this parameter is 10 seconds.

Be aware, however, that the trade-off with the above options is, to some extent, increased overhead and frequency of access to the source distribution database. To minimize overhead and frequency of access to the database, increase the delay between the time an update is performed and the time it is extracted. Increasing the value of POLWAIT in the dbmover.cfg file reduces the frequency with which the source distribution database is accessed. In addition, increasing the value of Real-Time Flush Latency in the PowerCenter application connection can also reduce the frequency of access to the source.
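A sketch of the earlier CAPI_CONNECTION statement with a reduced polling wait is shown below. It assumes POLWAIT is coded on the MSQL CAPI_CONNECTION statement; confirm the exact placement and syntax against the PowerExchange reference for the release in use.

CAPI_CONNECTION=(NAME=CAPIMSSC,
   TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052,POLWAIT=2))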
Last updated: 19-May-09 16:50
PowerExchange Installation (for AS/400)

Challenge

Installing and configuring PowerExchange for AS/400 includes setting up the Listener, modifying configuration files, and creating application connections for use in the sessions.
Description

Installing PowerExchange on AS/400 is a relatively straightforward task that can be accomplished with the assistance of resources such as:

AS/400 system programmer
DB2 DBA

Be sure to adhere to the sequence of the following steps to successfully install PowerExchange:

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the AS/400.
3. Start the PowerExchange Listener on the AS/400.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the AS/400 from the workstation.
6. Install the PowerExchange client (Navigator) on the UNIX/NT server running the PowerCenter Integration Service.
7. Test connectivity to the AS/400 from the server.
Install PowerExchange on the AS/400

Informatica recommends using the following naming conventions for PowerExchange:

datalib - for the user-specified database library
condlib - for the user-specified condensed files library
dtllib - for the software library name
dtluser - as the userid

The following example demonstrates use of the recommended naming conventions:

PWX851P01D - PWX-PowerExchange, 851-Version, P01-Patch level, D-Datalib
PWX851P01C - PWX-PowerExchange, 851-Version, P01-Patch level, C-Condlib
PWX851P01S - PWX-PowerExchange, 851-Version, P01-Patch level, S-Sourcelib
PWX851P01M - PWX-PowerExchange, 851-Version, P01-Patch level, M-Maplib
PWX851P01X - PWX-PowerExchange, 851-Version, P01-Patch level, X-Extracts
PWX851P01T - PWX-PowerExchange, 851-Version, P01-Patch level, T-Templib
PWX851P01I - PWX-PowerExchange, 851-Version, P01-Patch level, I-ICUlib

Informatica recommends using PWXADMIN as the user id for the PowerExchange Administrator.

Below are the installation steps:
Step 1: Create the PowerExchange Libraries

Create the software library using the following command:

CRTLIB LIB(PWX851P01S) CRTAUT(*CHANGE)

If flat/sequential files on the AS/400 will be used as sources or targets, you will also need to create a data maps library. Use the following command to create the data maps library:

CRTLIB LIB(PWX851P01M) CRTAUT(*CHANGE)
Because, later in the installation process, you must choose a different library name to store the data maps, you will need to change the DMX_DIR= parameter from stdatamaps to PWX851P01M in the configuration file (datalib/CFG member DBMOVER).

You may choose to run PowerExchange within an Independent Auxiliary Storage Pool (IASP). If you intend to use an IASP, use the following command instead:

CRTLIB LIB(PWX851P01S) CRTAUT(*CHANGE) ASP(*ASPDEV) ASPDEV(YOURASPDEV)
Step 2: Create Library SAVE File for Restore

CRTSAVF FILE(QGPL/PWX851P01T)

If you intend to run PowerExchange with multibyte support, you need to create a second save file using the following command:

CRTSAVF FILE(QGPL/PWX851P01I)
Step 3: FTP the Binary Files to the AS/400

You should have a file (pwxas4.vnnn.exe, where nnn is the version/release/modification level) containing the appropriate PowerExchange AS/400 software. This file is a self-extracting executable; for PowerExchange 8.5.1 the file is pwxas4_v851_01.exe. Select this file from the CD or the directory that the software was copied into and double-click it.

Copy the PWXAS4.V851 file to your temp library on the AS/400 (PWX851P01T) by entering the following command:

PUT PWXAS4.V851 QGPL/PWX851P01T

Copy the PWXAS4.V851.ICU file to your temp ICU library on the AS/400 (PWX851P01I) by entering the command:

PUT PWXAS4.V851.ICU QGPL/PWX851P01I
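The transfer itself can be done from a command prompt on the workstation. The following is a sketch of the FTP session; the host name is illustrative, and binary transfer mode is required so that the save files are not corrupted:

C:\PWX> ftp as400host
ftp> binary
ftp> put PWXAS4.V851 QGPL/PWX851P01T
ftp> put PWXAS4.V851.ICU QGPL/PWX851P01I
ftp> quit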
Step 4: Restore the Install Library

You must now restore the library. After it is decompressed, the library is shipped as dtllib. Use the following command:

RSTLIB SAVLIB(DTLV851) DEV(*SAVF) SAVF(QGPL/PWX851P01T) RSTLIB(PWX851P01S) MBROPT(*ALL) ALWOBJDIF(*ALL)

If you intend to run PowerExchange with multibyte support, you must restore the additional objects using the following command:

RSTOBJ OBJ(*ALL) SAVLIB(DTLV851) DEV(*SAVF) OBJTYPE(*ALL) SAVF(QGPL/PWX851P01I) MBROPT(*ALL) ALWOBJDIF(*ALL) RSTLIB(PWX851P01S)

If you intend to run PowerExchange within an Independent Auxiliary Storage Pool (IASP), you need to specify the IASP device so that the restore knows where the objects should be placed. Use the RSTASPDEV(YOURASPDEV) parameter. The following example also restores the additional objects for multibyte support:

RSTLIB SAVLIB(DTLVXYZ) DEV(*SAVF) SAVF(QGPL/LIBREST) MBROPT(*ALL) ALWOBJDIF(*NONE) RSTLIB(DTLLIB) RSTASPDEV(YOURASPDEV)

RSTOBJ OBJ(*ALL) SAVLIB(DTLVXYZ) DEV(*SAVF) OBJTYPE(*ALL) SAVF(QGPL/LIBRESTICU) MBROPT(*ALL) ALWOBJDIF(*ALL) RSTLIB(DTLLIB) RSTASPDEV(YOURASPDEV)
Step 5: Update the License Key File

PowerExchange requires a license key to run successfully. It is held in the file dtllib/LICENSE(KEY), which must be in the same library as the DTLLST program (the PowerExchange Listener). The license key is normally IP-address specific. Update the single-record member with the 44-byte key, which uses hyphens every 4 bytes.
Step 6: Create PowerExchange Environment
After you have installed the software, you will need to create a PowerExchange environment for the software to run in. This environment consists of dtllib, datalib, and (optionally) two additional libraries for data capture processing, as follows:

datalib - Base database files library
condlib - Condensed files library

Additionally, the cpxlib (Capture Extract) library is required if the environment is to support change data capture processing. Files in these libraries are deleted by PowerExchange during normal operation; you should not, therefore, place your own files in these libraries without first contacting Informatica Support.

Use the following command to add the software library to the library list:

ADDLIBLE PWX851P01S POSITION(*FIRST)

Use the following commands to create a new subsystem to run PowerExchange in:

CRTPWXENV DESC(PWX_V851P01_Install) DATALIB(PWX851P01D) CONDLIB(PWX851P01C) ASPDEV(*NONE) CRTSYSOBJ(*YES) CPXLIB(PWX851P01X)

CRTPWXENV DESC('User Description') DATALIB(datalib) CONDLIB(*NONE) ASPDEV(*NONE) CRTSYSOBJ(*YES)

Note: If you restored dtllib into an IASP, you must specify that device name in the CRTPWXENV command. For example:

CRTPWXENV DESC('User Description') DATALIB(DATALIB) CONDLIB(*NONE) CRTSYSOBJ(*YES) ASPDEV(YOURASPDEV)
Step 7: Update the Configuration File

One of the PowerExchange configuration files is datalib/CFG(DBMOVER); it holds many defaults and the information that PowerExchange uses to communicate with other platforms. You may not need to customize this file at this stage of the installation process; for additional information on the contents of this file, refer to the Configuration File Parameters section of the PowerExchange Reference Manual. An example of the DBMOVER file is shown below:

*************** Beginning of data ******************************************
/********************************************************************/
/* PowerExchange Configuration File
/********************************************************************/
LISTENER=(node1,TCPIP,2480)
NODE=(local,TCPIP,127.0.0.1,2480)
NODE=(node1,TCPIP,127.0.0.1,2480)
NODE=(default,TCPIP,x,2480)
APPBUFSIZE=256000
COLON=:
COMPRESS=Y
CONSOLE_TRACE=Y
DECPOINT=.
DEFAULTCHAR=*
DEFAULTDATE=19800101
DMX_DIR=PWX851P01M
MAXTASKS=25
MSGPREFIX=PWX
NEGSIGN=-
NOGETHOSTBYNAME=N
PIPE=|
POLLTIME=1000
SECURITY=(0,N)
TIMEOUTS=(600,600,600)
/* sample trace TRACE=(TCPIP,1,99)
/* Enable to extract BIT data as CHAR: DB2_BIN_AS_CHAR=Y
/* uncomment and modify the CAPI_CONNECTION lines to activate changed data
/* propagation
CAPI_CONNECTION=(NAME=DTECAPU,
   TYPE=(UOWC,CAPINAME=DTLJPAS4))
CAPI_CONNECTION=(NAME=DTLJPAS4,
   TYPE=(AS4J,JOURNAL=REPORTSDB2/QSQJRN,INST=FOCUST1,EOF=N,
   STOPIT=(CONT=5),LIBASUSER=N,AS4JRNEXIT=N))
CPX_DIR=PWX851P01X
Step 8: Change Object Ownership

As shipped, all of the components in the restored library are owned by the userid dtluser (from Informatica's internal systems). Change the ownership with the following commands:

CALL PGM(PWX851P01S/CHGALLOBJ) PARM('PWX851P01S' 'PWXADMIN')
CALL PGM(PWX851P01S/CHGALLOBJ) PARM('PWX851P01D' 'PWXADMIN')

If Change Capture is installed:

CALL PGM(PWX851P01S/CHGALLOBJ) PARM('PWX851P01C' 'PWXADMIN')
CALL PGM(PWX851P01S/CHGALLOBJ) PARM('PWX851P01X' 'PWXADMIN')
Step 9: Authorize the PowerExchange Userid

Prior to running jobs, you will need to grant the PowerExchange administrator userid (PWXADMIN in this example) *EXECUTE authority to the following objects:

QSYGETPH
QSYRLSPH
QWTSETP
QCLRPGMI

Use the following commands to grant the authority:

GRTOBJAUT OBJ(QSYGETPH) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QSYRLSPH) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QWTSETP) OBJTYPE(*PGM) AUT(*EXECUTE) USER(PWXADMIN)
GRTOBJAUT OBJ(QCLRPGMI) OBJTYPE(*PGM) AUT(*EXECUTE *READ) USER(PWXADMIN)
Step 10: Start the PowerExchange Listener

The standard command form to start the Listener is as follows:

SBMJOB CMD(CALL PGM(dtllib/DTLLST) PARM(NODE1)) JOB(MYJOB) JOBD(datalib/DTLLIST) JOBQ(*JOBD) PRTDEV(*JOBD) OUTQ(*JOBD) CURLIB(*CRTDFT) INLLIBL(*JOBD) INLASPGRP(*JOBD)
Step 11: Stop the PowerExchange Listener

The standard command form to stop the Listener is as follows:

SNDLSTCMD LSTMSGQLIB(PWX851P01D) LSTCMD(CLOSE)

Once the Listener start/stop test is complete, installation on the AS/400 is finished. The Listener can be started for normal operation, and PowerCenter application connections can be configured.
PowerCenter Real-Time Application Connections

Click the 'PWX DB2400 CDC Real Time' connection as shown below and fill in the parameter details. The user name and password can be anything if security in DBMOVER is set to 0; otherwise, they must be populated
with a proper AS/400 user id and password.
Specify the restart token folder name and file name as shown below:
Set the 'Number of Runs to Keep' parameter to the number of versions that you want to keep in your restart token file. If the workflow needs to run continuously in real-time mode, set the idle time to -1. If the real-time session needs to run from the time it is triggered until the end of the available changes, set the idle time to 0.
Leave the Journal name blank if your tables reside on the default journal specified in the DBMOVER file. Alternatively, the journal can be overridden by specifying the journal library and file. The first figure below shows an instance where the connection uses the default journal; the second figure shows the journal override.
The session settings for the real-time session look like the following:
TIP: If you plan to use PowerExchange with Informatica PowerCenter, ensure that you install the same version of both products.
Last updated: 19-May-09 17:09
PowerExchange Installation (for Mainframe)

Challenge

Installing and configuring a PowerExchange Listener on a mainframe, ensuring that the process is both efficient and effective.
Description

PowerExchange installation is very straightforward and can generally be accomplished in a timely fashion. When considering a PowerExchange installation, be sure that the appropriate resources are available. These include, but are not limited to:

MVS systems operator
Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
MVS security resources

Be sure to adhere to the sequence of the following steps to successfully install PowerExchange. Note that in this very typical scenario, the mainframe source data is going to be "pulled" across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.
Complete the PowerExchange Pre-install Checklist and Obtain Valid License Keys

Reviewing the environment and recording the information in a detailed checklist facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed in the Documentation folder when the PowerExchange software is installed. It is also available within the client from the PowerExchange Program Group. Be sure to complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange components. This is a 44- or 64-byte key that uses hyphens every 4 bytes. For example:

1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F). Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address. They also control access to certain databases. You cannot successfully install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter license key problems since the license key is IP specific. Be prepared to deal with this eventuality, especially if you are going to a backup site for disaster recovery testing. In the case of such an event, Informatica Product Shipping or Support can generate a temporary key very quickly.
Install PowerExchange on the Mainframe

Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD, or from the extract of the downloaded zip file, to this directory. Double-click the file to unzip its contents into this directory.

Step 2: Create the PDSs "HLQ.PWXVxxx.RUNLIB" and "HLQ.PWXVxxx.BINLIB" on the mainframe, with fixed-block record format and a record length of 80, in order to pre-allocate the needed libraries. Ensure sufficient space for the required jobs/tasks by allocating 150 cylinders with 50 directory blocks.

Step 3: Run the "MVS_Install" file. This displays the MVS Install Assistant. Configure the IP Address, Logon ID, Password, HLQ, and Default volume settings on the display screen. Also, enter the license key.
Click the Custom buttons to configure the desired data sources. Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from Step 2). Save these settings and click Process. This creates the JCL libraries and opens a screen from which you can FTP these libraries to MVS. Click XMIT to complete the FTP process. Note: A new installer GUI was added as of PowerExchange 8.5; simply follow the installation screens in the GUI for this step.
Step 4: Edit the JOBCARD member in RUNLIB and configure it for the environment (e.g., execution class, message class, etc.).
Step 5: Edit the SETUPBLK member in RUNLIB. Copy in the JOBCARD and SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success) or 1; a list of the needed installation jobs can be found in the XJOBS member.
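The exact allocation JCL is site-specific, but as a minimal sketch of the Step 2 pre-allocation (the job card, secondary extent, unit, and block size below are illustrative assumptions, not values taken from the install documentation), an IEFBR14 job similar to the following could be used:

//ALLOCPWX JOB (ACCT),'ALLOC PWX LIBS',CLASS=A,MSGCLASS=X
//* Pre-allocate the PowerExchange RUNLIB and BINLIB as FB/LRECL=80 PDSs
//ALLOC    EXEC PGM=IEFBR14
//RUNLIB   DD DSN=HLQ.PWXVXXX.RUNLIB,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(150,15,50)),
//            DCB=(RECFM=FB,LRECL=80,BLKSIZE=27920)
//BINLIB   DD DSN=HLQ.PWXVXXX.BINLIB,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(150,15,50)),
//            DCB=(RECFM=FB,LRECL=80,BLKSIZE=27920)

Adjust the high-level qualifier, space, and DCB values to match the completed pre-install checklist and your site standards.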
Start the PowerExchange Jobs/Tasks on the Mainframe
The installed PowerExchange Listener can be run as a normal batch job or as a started task. Informatica recommends that it initially be submitted as a batch job: RUNLIB(STARTLST). If it will be run as a started task, copy the PSTRTLST member in RUNLIB to the started-task PROCLIB. It should return: DTL-00607 Listener VRM x.x.x Build Vxxx_P0x started.
If implementing change capture, start the PowerExchange Agent (as a started task): /S DTLA
It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.
Note: The load libraries must be APF-authorized prior to starting the Agent.
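As a hedged illustration only (the started-task name below is an assumption based on copying the PSTRTLST member under the same name, not a documented default), the Listener started task could then be started and checked with standard MVS operator commands:

/S PSTRTLST        start the PowerExchange Listener started task
/D A,PSTRTLST      display the active task to verify that it is running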
Install the PowerExchange Client (Navigator) on a Workstation
Step 1: Run the Windows or UNIX installation file in the software folder on the installation CD and follow the prompts.
Step 2: Enter the license key.
Step 3: Follow the wizard to complete the install and reboot the machine.
Step 4: Add a node entry to the configuration file "\Program Files\Informatica\Informatica Power Exchange\dbmover.cfg" to point to the Listener on the mainframe: node = (mainframe location name, TCPIP, mainframe IP address, 2480)
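As a usage illustration, a hypothetical entry for a Listener node named MVS1 at IP address 10.1.2.3 (both values assumed purely for the example) would look like this:

node=(MVS1,TCPIP,10.1.2.3,2480)

The same style of entry is added later to the dbmover.cfg on the UNIX/NT server (Step 7 of the server installation).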
Test Connectivity to the Mainframe from the Workstation
Ensure communication to the PowerExchange Listener on the mainframe by entering the following in a DOS window on the workstation: DTLREXE PROG=PING LOC=mainframe location (the node name in dbmover.cfg)
It should return: DTL-00755 DTLREXE Command OK!
Install PowerExchange on the UNIX Server
Step 1: Create a user for the PowerExchange installation on the UNIX box.
Step 2: Create a UNIX directory "/opt/inform/pwxvxxxp0x".
Step 3: FTP the file "\software\Unix\dtlxxx_vxxx.tar" on the installation CD to the pwx installation directory on UNIX.
Step 4: Use the UNIX tar command to extract the files: "tar -xvf pwxxxx_vxxx.tar".
Step 5: Update the logon profile with the correct path, library path, and home environment variables.
Step 6: Update the license key file on the server.
Step 7: Update the configuration file on the server (dbmover.cfg) by adding a node entry to point to the Listener on the mainframe.
Step 8: If using an ETL tool in conjunction with PowerExchange via ODBC, update the odbc.ini file on the server by adding data source entries that point to PowerExchange-accessed data: [pwx_mvs_db2] DRIVER=
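The exact variable names depend on the UNIX platform and PowerExchange release, but a minimal sketch of the Step 5 profile updates might look like the following (the installation path and the PWX_HOME variable name are assumptions for illustration; use LIBPATH on AIX or SHLIB_PATH on HP-UX instead of LD_LIBRARY_PATH where applicable):

PWX_HOME=/opt/inform/pwxvxxxp0x
PATH=$PWX_HOME:$PATH
LD_LIBRARY_PATH=$PWX_HOME:$LD_LIBRARY_PATH
export PWX_HOME PATH LD_LIBRARY_PATH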
Test Connectivity to the Mainframe from the Server
Ensure communication to the PowerExchange Listener on the mainframe by entering the following on the UNIX server: DTLREXE PROG=PING LOC=mainframe location
It should return: DTL-00755 DTLREXE Command OK!
Changed Data Capture
There is a separate manual for each type of change data capture option. Each manual contains the specifics behind the following general steps; you will need to understand the appropriate options guide to ensure success.
Step 1: APF-authorize the .LOAD and the .LOADLIB libraries. This is required for external security.
Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site PROCLIB.
Step 3: After the Agent has been started, run job SETUP2.
Step 4: Create an active registration in Navigator for a table/segment/record that is set up for changes.
Step 5: Start the ECCR.
Step 6: Issue a change to the table/segment/record that you registered in Navigator.
Step 7: Perform an extraction map row test in Navigator.
TIP When using PowerExchange in conjunction with PowerCenter, ensure that the versions of PowerExchange and PowerCenter are the same. Using different versions is not supported.
Last updated: 19-May-09 17:14
Assessing the Business Case Challenge Assessing the business case for a project must consider both the tangible and intangible potential benefits. The assessment should also validate the benefits and ensure that they are realistic in the eyes of the Project Sponsor and Key Stakeholders in order to secure project funding.
Description A Business Case should include both qualitative and quantitative measures of potential benefits.
The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the project beneficiaries regarding the expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities. Many qualitative items are intangible, but you may be able to cite examples of the potential costs or risks if the system is not implemented. An example may be the cost of bad data quality resulting in the loss of a key customer, or an invalid analysis resulting in bad business decisions. Risk factors may be classified as business, technical, or execution in nature. Examples of these risks are uncertainty of value or the unreliability of collected information, new technology employed, or a major change in business thinking for personnel executing change. It is important to identify an estimated value added or cost eliminated to strengthen the business case. The better the definition of the factors, the better the value to the business case.
The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI. This may involve the following calculations:
Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.
Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (i.e., a net present value) of $2,311.44.
Return on investment - Calculates the net present value of total incremental cost savings and revenue divided by the net present value of total costs, multiplied by 100. This type of ROI calculation is frequently referred to as return-on-equity or return-on-capital.
Payback period - Determines how much time must pass before an initial capital investment is recovered.
The following are the steps to calculate the quantitative business case or ROI:
Step 1 – Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible the participants, requirements, and systems involved. A data integration or migration initiative or amendment may require estimating customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers and complexity of target data systems (data marts or operational databases, for example) and data sources, types of sources, and size of data set. A data migration project may require customer participation, legacy system migrations, and retirement procedures. The types of estimations vary by project types and goals. It is important to note that the more details you have for estimations, the more precise your phased solutions are likely to be. The scope of the project should also be made known in the deployment map.
Step 2 – Analyze Potential Benefits.
Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations. Include in this step costs that can be avoided by the deployment of this project.
Step 3 – Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits are going to be allocated throughout the organization over time, using
the enterprise deployment map as a guide.
Step 4 – Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:
Hardware
Networks
RDBMS software
Back-end tools
Query/reporting tools
Internal labor
External labor
Ongoing support
Training
Step 5 – Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.
Step 6 – Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:
Scope creep, which can be mitigated by thorough planning and tight project scope.
Integration complexity, which may be reduced by standardizing on vendors with integrated product sets or open architectures.
An architectural strategy that is inappropriate.
A current support infrastructure that may not meet the needs of the project.
Conflicting priorities, which may impact resource availability.
Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised.
Unexpected data quality, complexity, or definition issues, which often are discovered late in the course of the project and can adversely affect effort, cost, and schedule. This can be somewhat mitigated by early source analysis.
Step 7 – Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting the net present value of total costs from the net present value of (total incremental revenue plus cost savings).
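As a worked illustration of the figures quoted above (and assuming, as the $2,311.44 example implies, that the eight percent rate is applied per payment period), the net present value of the six $500 payments and the ROI ratio described in the text can be written as:

\text{NPV} = \sum_{t=1}^{6} \frac{500}{(1.08)^{t}} = 500 \times \frac{1 - (1.08)^{-6}}{0.08} \approx \$2{,}311.44

\text{ROI}(\%) = \frac{\text{NPV(incremental cost savings + revenue)}}{\text{NPV(total costs)}} \times 100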
Final Deliverable The final deliverable of this phase of development is a complete business case that documents both the tangible (quantified) benefits and the intangible (non-quantified, but estimated) benefits and risks, to be presented to the Project Sponsor and Key Stakeholders. This allows them to review the Business Case in order to justify the development effort. If your organization has a Project Office that provides governance for projects and priorities, much of this is often part of the original Project Charter, which states items like scope, initial high-level requirements, and key project stakeholders. However, developing a full Business Case can validate any initial analysis and provide additional justification. Additionally, the Project Office should provide guidance in building and communicating the Business Case. Once completed, the Project Manager is responsible for scheduling the review and socialization of the Business Case.
Last updated: 01-Feb-07 18:54
Defining and Prioritizing Requirements Challenge Defining and prioritizing business and functional requirements is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst. Requirements need to be gathered from business users who currently use and/or have the potential to use the information being assessed. All input is important since the assessment should encompass an enterprise view of the data rather than a limited functional, departmental, or line-of-business view. Types of specific detailed data requirements gathered include:
Data names to be assessed
Data definitions
Data formats and physical attributes
Required business rules, including allowed values
Data usage
Expected quality levels
By gathering and documenting some of the key detailed data requirements, a solid understanding of the business rules involved is reached. Certainly, not all elements can be analyzed in detail, but doing so helps in getting to the heart of the business system so you are better prepared when speaking with business and technical users.
Description The following steps are key for successfully defining and prioritizing requirements:
Step 1: Discovery Gathering business requirements is one of the most important stages of any data integration project. Business requirements affect virtually every aspect of the data integration project, from Project Planning and Management through End-User Application Specification. They are like a hub that sits in the middle and touches the various stages (spokes) of the data integration project. There are two basic techniques for gathering requirements and investigating the underlying operational data: interviews and facilitated sessions.
Data Profiling Informatica Data Explorer (IDE) is an automated data profiling and analysis software product that can be extremely beneficial in defining and prioritizing requirements. It provides a detailed description of data content, structure, rules, and quality by profiling the actual data that is loaded into the product. Some industry examples of why data profiling is crucial prior to beginning the development process are:
The cost of poor data quality is 15 to 25 percent of operating profit.
Poor data management is costing global business $1.4 billion a year.
37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.
Using a data profiling tool can lower the risk and cost of the project and increase the chances of success. Data profiling reports can be posted to a central location where all team members can review results and track accuracy. IDE provides the ability to promote collaboration through tags, notes, action items, transformations, and rules. By profiling the information, the framework is set for an effective interview process with business and technical users.
Interviews Conducting interview research before starting the requirements-gathering process allows interviewees to be categorized into functional business management and Information Technology (IT) management. This, in conjunction with effective data
profiling, helps to establish a comprehensive set of business requirements.
Business interviewees. Depending on the needs of the project, even though you may be focused on a single primary business area, it is always beneficial to interview horizontally to achieve a good cross-functional perspective of the enterprise. This also provides insight into how extensible your project is across the enterprise. Before you interview, be sure to develop an interview questionnaire based upon profiling results as well as business questions; schedule the interview time and place; and prepare the interviewees by sending a sample agenda. When interviewing business people, it is always important to start with the upper echelons of management so as to understand the overall vision, assuming you have the business background, confidence, and credibility to converse at those levels. If not adequately prepared, the safer approach is to interview middle management. If you are interviewing across multiple teams, you may want to scramble interviews among teams. This way, if you hear different perspectives from finance and marketing, you can resolve the discrepancies with a scrambled interview schedule. Keep in mind that the business is sponsoring the data integration project and is going to be the end user of the application. The business will decide the success criteria of your data integration project and determine future sponsorship. Questioning during these sessions should include the following:
Who are the stakeholders for this milestone delivery (IT, field business analysts, executive management)?
What are the target business functions, roles, and responsibilities?
What are the key relevant business strategies, decisions, and processes (in brief)?
What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?
What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used? How can it be improved?
IT interviewees. The IT interviews have a different flavor than those with the business user community. Interviewing the IT team is generally very beneficial because it is composed of data gurus who deal with the data on a daily basis. They can provide great insight into data quality issues, help in the systematic exploration of legacy source systems, and help in understanding business user needs around critical reports. If you are developing a prototype, they can help get things done quickly and address important business reports. Questioning during these sessions should include the following:
Request an overview of existing legacy source systems. How does data currently flow from these systems to the users?
What day-to-day maintenance issues does the operations team encounter with these systems?
Ask for their insight into data quality issues.
What business users do they support? What reports are generated on a daily, weekly, or monthly basis? What are the current service level agreements for these reports?
How can the DI project support the IS department's needs?
Review data profiling reports and analyze the anomalies in the data. Note and record each of the comments from the more detailed analysis. What are the key business rules involved in each item?
Facilitated Sessions Facilitated sessions - sometimes known as JAD (Joint Application Development) or RAD (Rapid Application Development) sessions - are ways for a group of business and technical users to work together to capture the requirements. This can be very valuable in gathering comprehensive requirements and building the project team. The difficulty is the amount of preparation and planning required to make the session a pleasant and worthwhile experience. Facilitated sessions provide quick feedback by gathering all the people from the various teams into a meeting and initiating the requirements process. You need a facilitator who is experienced in these meetings to ensure that all the participants get a chance to speak and provide feedback. During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings. Thus, it is extremely important to encourage all attendees to participate and to prevent a small number of participants from dominating the requirements process. A challenge of facilitated sessions is matching everyone's busy schedules and actually getting them into a meeting room. However, this part of the process must be focused and brief or it can become unwieldy, with too much time expended just trying to coordinate calendars among worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questions asked during facilitated sessions are similar to the questions asked of business and IS interviewees.
Step 2: Validation and Prioritization
The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process after interviewing the business and IT management. The next step is to define the business requirements specification. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements. Defining the business requirements is a time-consuming process and should be facilitated by forming a working group team. A working group team usually consists of business users, business analysts, the project manager, and other individuals who can help to define the business requirements. The working group should meet weekly to define and finalize business requirements. The working group helps to:
Design the current state and future state
Identify supply format and transport mechanism
Identify required message types
Develop Service Level Agreement(s), including timings
Identify supply management and control requirements
Identify common verifications, validations, business validations and transformation rules
Identify common reference data requirements
Identify common exceptions
Produce the physical message specification
At this time, the Architect also develops the Information Requirements Specification to clearly represent the structure of the information requirements. This document, based on the business requirements findings, can facilitate discussion of informational details and provide the starting point for the target model definition. The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.
Step 3: The Incremental Roadmap Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification providing details on the technical requirements for the project. As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap).
Final Deliverable The final deliverable of this phase of development is a complete list of business requirements, a diagram of the current and future state, and a list of high-level business rules affected by the requirements that will effect the change from current to future. This provides the development team with much of the information needed to begin the design effort for the system modifications. Once completed, the Project Manager is responsible for scheduling the review and socialization of the requirements and plan to achieve sign-off on the deliverable. This is presented to the Project Sponsor for approval and becomes the first "increment" or starting point for the Project Plan.
Last updated: 01-Feb-07 18:54
Developing a Work Breakdown Structure (WBS) Challenge Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required for a data integration project. Many times, underestimating or omitting items such as full analysis, testing, or even specification development can create a sense of false optimism for the project. The WBS clearly depicts all of the various tasks and subtasks required to complete a project. Most project time and resource estimates are supported by the WBS. A thorough, accurate WBS is critical for effective monitoring and also facilitates communication with project sponsors and key stakeholders.
Description The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves as a starting point as well as a monitoring tool for the project. One challenge in developing a thorough WBS is obtaining the correct balance between sufficient detail and too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day. It is also important to maintain a consistent level of detail across the project. A well-designed WBS can be extracted at a higher level to communicate overall project progress, as shown in the following sample. The actual WBS for the project manager may, for example, be a level of detail deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll up a level or two to make things more clear.
Plan                                               % Complete   Budget Hours   Actual Hours
Architecture - Set up of Informatica Environment        82%           167            137
  Develop analytic solution architecture                 46%            28             13
  Design development architecture                        59%            32             19
  Customize and implement Iterative Framework
    Data Profiling                                      100%            32             32
    Legacy Stage                                        150%            10             15
    Pre-Load Stage                                      150%            10             15
    Reference Data                                      128%            18             23
    Reusable Objects                                     56%            27             15
  Review and signoff of Architecture                     50%            10              5
Analysis - Target-to-Source Data Mapping                 48%          1000            479
  Customer (9 tables)                                    87%           135            117
  Product (7 tables)                                     98%           215            210
  Inventory (3 tables)                                    0%            60              0
  Shipping (3 tables)                                     0%            60              0
  Invoicing (7 tables)                                    0%           140              0
  Orders (13 tables)                                     37%           380            140
  Review and signoff of Functional Specification          0%            10              0
Total Architecture and Analysis                          52%          1167            602
A fundamental question is whether to include "activities" as part of a WBS. The following statements are generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving this question.
The project manager should have the right to decompose the WBS to whatever level of detail he or she requires to effectively plan and manage the project. The WBS is a project management tool that can be used in different ways,
depending upon the needs of the project manager.
The lowest level of the WBS can be activities. The hierarchical structure should be organized by deliverables and milestones, with process steps detailed within it. The WBS can also be structured on a process or life-cycle basis (i.e., the accepted concept of Phases), with non-deliverables detailed within it. At the lowest level in the WBS, an individual should be identified and held accountable for the result. This person should be an individual contributor, creating the deliverable personally, or a manager who will in turn create a set of tasks to plan and manage the results.
The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks. Consider, for example, multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements. It is important to remember that a task is not complete until all of its corresponding subtasks are completed, whether sequentially or in parallel. For example, the Build Phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy Phase long before the Build Phase is complete.
The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the development effort. If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort of developing the WBS.
Sometimes it is best to build an initial task list and timeline with the project team using a facilitator. The project manager can act as facilitator or can appoint one, freeing up the project manager and enabling team members to focus on determining the actual tasks and effort needed. Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams creating their own project plans. The overall project manager then brings the plans together into a master project plan. This group of projects can be defined as a program, and the project manager and project architect manage the interaction among the various development teams.
Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses: new information becomes available; scope, resources and priorities change; deliverables are (or are not) completed on time, etc. The process of estimating and modifying the plan should be repeated many times throughout the project. Even initial planning is likely to take several iterations to gather enough information.
Significant changes to the project plan become the basis to communicate with the project sponsor(s) and/or key stakeholders with regard to decisions to be made and priorities rearranged. The goal of the project manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape direction.
Approaches to Building WBS Structures: Waterfall vs. Iterative
Data integration projects differ somewhat from other types of development projects, although they also share some key attributes. The following list summarizes some unique aspects of data integration projects:
Business requirements are less tangible and predictable than in OLTP (online transactional processing) projects.
Database queries are very data intensive, involving few or many tables, but with many, many rows. In OLTP, transactions are data selective, involving few or many tables and comparatively few rows.
Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a data integration project (e.g., warehouse or common data management, etc.), metadata and traceability are much more critical.
Data integration projects, like all development projects, must be managed. To manage them, they must follow a clear plan. Data integration project managers often have a more difficult job than those managing OLTP projects because
there are so many pieces and sources to manage.
Two purposes of the WBS are to manage work and ensure success. Although this is the same as for any project, data integration projects are unlike typical waterfall projects in that they are based on an iterative approach. Three of the main principles of iteration are as follows:
Iteration. Division of work into small "chunks" of effort, using lessons learned from earlier iterations.
Time boxing. Delivery of capability in short intervals, with the first release typically requiring from three to nine months (depending on complexity) and quarterly releases thereafter.
Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third of the way through.
Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The danger is that projects can iterate or spiral out of control.
The three principles listed above are very important because even the best data integration plans are likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a fully detailed plan, is a large common data management project that gathers all requirements upfront and delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all requirements upfront" and the "all-at-once in three years." Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The feedback you can gather from increment to increment is critical to the success of the future increments. The benefit is that such incremental deliveries establish patterns for development that can be used and leveraged for future deliveries.
What is the Correct Development Approach? The correct development approach is usually dictated by corporate standards and by departments such as the Project Management Office (PMO). Regardless of the development approach chosen, high-level phases typically include planning the project; gathering data requirements; developing data models; designing and developing the physical database(s); developing the source, profile, and map data; and extracting, transforming, and loading the data. Lower-level planning details are typically carried out by the project manager and project team leads.
Preparing the WBS The WBS can be prepared using manual or automated techniques, or a combination of the two. In many cases, a manual technique is used to identify and record the high-level phases and tasks, and then the information is transferred to project tracking software such as Microsoft Project. Project team members typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky notes or index cards, and then mounting the notes or cards on a wall or white board. Use one sticky note or card per phase or task so that you can easily rearrange them as the project order evolves. As the project plan progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time estimates, and dependencies. This information can then be fed into the project tracking software. Once you have a fairly detailed methodology, you can enter the phase and task information into your project tracking software. When the project team is assembled, you can enter additional tasks and details directly into the software. Be aware, however, that the project team can better understand a project and its various components if they actually participate in the high-level development activities, as they do in the manual approach. Using software alone, without input from relevant project team members, to designate phases, tasks, dependencies, and timelines can be difficult and prone to errors and omissions. Benefits of developing the project timeline manually, with input from team members, include:
Tasks, effort and dependencies are visible to all team members.
The team has a greater understanding of and commitment to the project.
Team members have an opportunity to work with each other and set the foundation. This is particularly important if the team is geographically dispersed and cannot work face-to-face throughout much of the project.
How Much Descriptive Information is Needed? The project plan should incorporate a thorough description of the project and its goals. Be sure to review the business objectives, constraints, and high-level phases but keep the description as short and simple as possible. In many cases, a verb-noun form
works well (e.g., interview users, document requirements, etc.). After you have described the project at a high level, identify the tasks needed to complete each phase. It is often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide narrative for each task or subtask. In general, decompose the tasks until they have a rough duration of two to 20 days. Remember to break down the tasks only to the level of detail that you are willing to track. Include key checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones (e.g., requirements completed, data model completed, etc.).
Assigning and Delegating Responsibility Identify a single owner for each task in the project plan. Although other resources may help to complete the task, the individual who is designated as the owner is ultimately responsible for ensuring that the task, and any associated deliverables, is completed on time. After the WBS is loaded into the selected project tracking software and refined for the specific project requirements, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, the project manager can assign individual resources and prepare a project schedule. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan. Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan updated throughout the project.
Last updated: 09-Feb-07 16:29
Developing and Maintaining the Project Plan Challenge The challenge of developing and maintaining a project plan is to incorporate all of the necessary components while retaining the flexibility necessary to accommodate change. A two-fold approach is required to meet the challenge:
1. A project that is clear in scope contains the following elements:
A designated begin and end date
Well-defined business and technical requirements
Adequate assigned resources
Without these components, the project is subject to slippage and to incorrect expectations being set with the Project Sponsor.
2. Project Plans are subject to revision and change throughout the project. It is imperative to establish a communication plan with the Project Sponsor; such communication may involve a weekly status report of accomplishments, and/or a report on issues and plans for the following week. This type of forum is very helpful in involving the Project Sponsor to actively make decisions with regard to changes in scope or timeframes.
If your organization has the concept of a Project Office that provides governance for the project and priorities, look for a Project Charter that contains items like scope, initial high-level requirements, and key project stakeholders. Additionally, the Project Office should provide guidance in funding and resource allocation for key projects. Projects built with Informatica's PowerCenter and Data Quality are not exempt from this project planning process. However, the purpose here is to provide some key elements that can be used to develop and maintain a data integration, data migration, or data quality project plan.
Description Use the following steps as a guide for developing the initial project plan:
1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design, development, and testing.)
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.
3. Continue the detail breakdown, if possible, to a level at which logical "chunks" of work can be completed and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as requiring multiple resources, estimates are much less likely to be accurate and resource accountability becomes difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).
6. Define the resources based on the role definitions and estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization's system development methodology.
Note: Informatica Professional Services has found success in projects that blend the "waterfall" method with the "iterative" method. The "waterfall" method works well in the early stages of a project, such as analysis and initial design. The "iterative" method works well in accelerating development and testing, where feedback from extensive testing validates the design of the system.
At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities. Set the constraint type to "As Soon As Possible" and avoid setting a constraint date. Use the effort-driven approach so that the Project Plan can be easily modified as adjustments are made.
By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting scheduling is likely to miss Project Sponsor expectations. This helps to establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good faith. This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule. When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.
Reviewing and Revising the Project Plan Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions. One of the key communication methods is building the concept of a weekly or bi-weekly Project Sponsor meeting. Attendance at this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the Project Manager. Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones, events at a high-level), b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues and e) Plans for Next Period.
Key Accomplishments Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is an opportunity to bring in the lead developers and have them report to management on what they have accomplished; it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they own the work and are accountable to management. Keep accomplishments at a high level and coach the team members to be brief, keeping their presentations to a five-to-ten-minute maximum during this portion of the meeting.
Progress against Initial Plan The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so that the report is readable to the Project Sponsor (see sample below).
Plan                                                        Percent Complete   Budget Hours
Architecture - Set up of Informatica Migration Environment                          167
  Develop data integration solution architecture                   10%               28
  Design development architecture                                   28%               32
  Customize and implement Iterative Migration Framework
    Data Profiling                                                  80%               32
    Legacy Stage                                                   100%               10
    Pre-Load Stage                                                 100%               10
    Reference Data                                                  83%               18
    Reusable Objects                                                19%               27
  Review and signoff of Architecture                                 0%               10
Analysis - Target-to-Source Data Mapping                                            1000
  Customer (9 tables)                                               90%              135
  Product (6 tables)                                                90%              215
  Inventory (3 tables)                                               0%               60
  Shipping (3 tables)                                                0%               60
  Invoicing (7 tables)                                              57%              140
  Orders (19 tables)                                                40%              380
  Review and signoff of Functional Specification                     0%               10
Budget versus Actual A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when additional funding or a change in scope is required. Many projects are cancelled because of cost overruns, so it is the Project Manager’s job to keep expenditures under control. The following example shows how a budgeted vs. actual report may look.
(Sample budgeted vs. actual report: rows for Resource A, Resource B, Resource C, Resource D, and the Project Manager; columns for the weeks of 10-Apr through 29-May, showing weekly hours per resource with weekly, cumulative, and per-resource totals; the plan in this example totals 1,167 hours.)
Key Issues This is the most important part of the meeting. Presenting key issues such as resource commitment, user roadblocks, key design concerns, etc., to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate decisions and minimizes the risk of impact to the project.
Plans for Next Period This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change, this is an opportunity to redirect the resources and use them correctly. Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample Deliverable), or changes in priority or approach, as they arise to determine whether they affect the plan. It may be necessary to revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.
Tracking Changes One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is completed.
Summary Managing a data integration, data migration, or data quality project requires good project planning and communications. Many data integration projects fail because of issues such as poor data quality or the complexity of integration. However, good communication and expectation setting with the Project Sponsor can prevent such issues from causing a project to fail.
Last updated: 01-Feb-07 18:54
Developing the Business Case Challenge Identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals, and their business information requirements, is key to defining and scoping the project.
Description The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project.
1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.
Activity - Interview the Project Sponsor to identify beneficiaries and define their business roles and project participation.
Deliverable - Organization chart of corporate beneficiaries and participants.
2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.
Activity - Interview (individually or in a forum) the Project Sponsor and/or beneficiaries regarding problems and needs related to the project.
Deliverable - Problem/Need Statement
3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits the business expects to gain from the project) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.
Activity - Interview (individually or in a forum) the Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.
Deliverable - Statement of Project Goals and Objectives
4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not. The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.
Activity - The Business Analyst develops the Project Scope and Assumptions statement for presentation to the Project Sponsor.
Deliverable - Project Scope and Assumptions statement
Last updated: 01-Feb-07 18:54
Managing the Project Lifecycle Challenge To establish an effective communications plan that provides ongoing management throughout the project lifecycle and keeps the Project Sponsor informed of the status of the project.
Description The quality of a project can be directly correlated to the amount of review that occurs during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.
Project Status Reports In addition to the initial project plan review with the Project Sponsor, it is critical to schedule regular status meetings with the sponsor and project team to review status, issues, scope changes, and schedule updates. This is known as the project sponsor meeting. Gather status, issues, and schedule update information from the team one day before the status meeting in order to compile and distribute the Project Status Report. In addition, make sure lead developers of major assignments are present to report on status and issues, if applicable.
Project Management Review The Project Manager should coordinate, if not facilitate, reviews of requirements, plans, and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel. Set a process in place beforehand to ensure that appropriate personnel are invited, any relevant documents are distributed at least 24 hours in advance, and reviews focus on questions and issues (rather than a laborious "reading of the code"). Reviews may include:
Project scope and business case review
Business requirements review
Source analysis and business rules reviews
Data architecture review
Technical infrastructure review (hardware and software capacity and configuration planning)
Data integration logic review (source to target mappings, cleansing and transformation logic, etc.)
Source extraction process review
Operations review (operations and maintenance of load sessions, etc.)
Reviews of the operations plan, QA plan, and deployment and support plan
Project Sponsor Meetings A project sponsor meeting should be held weekly or bi-weekly to communicate progress to the Project Sponsor and Key Stakeholders. The purpose is to keep key user management involved and engaged in the process, to communicate any changes to the initial plan, and to have them weigh in on the decision process. Elements of the meeting include:
Key Accomplishments
Activities Next Week
Tracking of Progress to-Date (Budget vs. Actual)
Key Issues / Roadblocks
It is the Project Manager's role to stay neutral on any issue, to state the facts effectively, and to allow the Project Sponsor or other key executives to make decisions. Many times this process builds the partnership necessary for success.
Change in Scope Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan. The Project Manager should institute a change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan. Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change. Note that such a change-in-scope document helps capture key documentation that is particularly useful if the project overruns or fails to deliver upon Project Sponsor expectations. Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth. Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project. Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.
Management of Issues Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them and that their impact remains visible. Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry and resolution, as well as the details of the issue and of its solution. Significant or "showstopper" issues should also be mentioned on the status report and communicated through the weekly project sponsor meeting. This way, the Project Sponsor has the opportunity to help resolve a potential issue before it affects the project.
Project Acceptance and Close A formal project acceptance and close helps document the final status of the project. Rather than simply walking away from a project when it seems complete, use this explicit close procedure to document and help finalize the project with the Project Sponsor. For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion. Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:
What was accomplished
Any justification for tasks expected but not completed
Recommendations
Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans. Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and, if applicable, briefly describe a recommended approach.
Last updated: 01-Feb-07 18:54
Using Interviews to Determine Corporate Data Integration Requirements
Challenge
Data warehousing projects are usually initiated out of a business need for a certain type of report (e.g., "we need consistent reporting of revenue, bookings and backlog"). Except in the case of narrowly-focused, departmental data marts, however, this is not enough guidance to drive a full data integration solution. Further, a successful, single-purpose data mart can build a reputation such that, after a relatively brief period of proving its value to users, business management floods the technical group with requests for more data marts in other areas. The only way to avoid silos of data marts is to think bigger at the beginning and canvas the enterprise (or at least the department, if that's your limit of scope) for a broad analysis of data integration requirements.
Description
Determining the data integration requirements in satisfactory detail and clarity is a difficult task, however, especially while ensuring that the requirements are representative of all the potential stakeholders. This Best Practice summarizes the recommended interview and prioritization process for this requirements analysis.
Process Steps
The first step in the process is to identify and interview "all" major sponsors and stakeholders. This typically includes the executive staff and CFO, since they are likely to be the key decision makers who will depend on the data integration. At a minimum, figure on 10 to 20 interview sessions. The next step in the process is to interview representative information providers. These individuals include the decision makers who provide the strategic perspective on what information to pursue, as well as details on that information and how it is currently used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors and stakeholders regarding the findings of the interviews and the recommended subject areas and information profiles. It is often helpful to facilitate a Prioritization Workshop with the major stakeholders, sponsors, and information providers in order to set priorities on the subject areas.
Conduct Interviews
The following paragraphs offer some tips on the actual interviewing process. Two sections at the end of this document provide sample interview outlines for the executive staff and information providers. Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A focused, consistent interview format is desirable. Don't feel bound to the script, however, since interviewees are likely to raise some interesting points that may not be included in the original interview format. Pursue these subjects as they come up, asking detailed questions. This approach often leads to "discoveries" of strategic uses for information that may be exciting to the client and provide sparkle and focus to the project. Questions to the "executives" or decision-makers should focus on what business strategies and decisions need information to support or monitor them (refer to the Outline for Executive Interviews at the end of this document). Coverage here is critical: if key managers are left out, you may miss a critical viewpoint and an important buy-in. Interviews of information providers are secondary but can be very useful. These are the business analyst types who report to decision-makers and currently provide reports and analyses, using Excel or Lotus or a database program to consolidate data from more than one source and provide regular and ad hoc reports or conduct sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals, learn what information they access, and how they process it. At this stage however, you should focus on the basics, building a foundation for the project and discovering what tools are currently in use and where gaps may exist in the analysis and reporting functions. Be sure to take detailed notes throughout the interview process. If there are a lot of interviews, you may want the interviewer to partner with someone who can take good notes, perhaps on a laptop to save note transcription time later. It is important to take down the details of what each person says because, at this stage, it is difficult to know what is likely to be important. While some interviewees may want to see detailed notes from their interviews, this is not very efficient since it takes time to clean up the
notes for review. The most efficient approach is to simply consolidate the interview notes into a summary format following the interviews. Be sure to review previous interviews as you go through the interviewing process; you can often use information from earlier interviews to pursue topics in later interviews in more detail and with varying perspectives. The executive interviews must be carried out in "business terms." There can be no mention of the data warehouse or systems of record or particular source data entities or issues related to sourcing, cleansing, or transformation. It is strictly forbidden to use any technical language. It can be valuable to have an industry expert prepare and even accompany the interviewer to provide business terminology and focus. If the interview falls into "technical details," for example, into a discussion of whether certain information is currently available or could be integrated into the data warehouse, it is up to the interviewer to re-focus immediately on business needs. If this focus is not maintained, the opportunity for brainstorming is likely to be lost, which will reduce the quality and breadth of the business drivers. Because of the above caution, it is rarely acceptable to have IS resources present at the executive interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current reporting problems or technical issues and thereby destroy the interview opportunity. Keep the interview groups small. One or two Professional Services personnel should suffice, with at most one client project person. Especially for executive interviews, there should be only one interviewee. There is sometimes a need to interview a group of middle managers together, but if there are more than two or three, you are likely to get much less input from the participants.
Distribute Interview Findings and Recommended Subject Areas
At the completion of the interviews, compile the interview notes and consolidate the content into a summary. This summary should help to break out the input into departments or other groupings significant to the client. Use this content and your interview experience, along with "best practices" or industry experience, to recommend specific, well-defined subject areas. Remember that this is a critical opportunity to position the project to the decision-makers by accurately representing their interests while adding enough creativity to capture their imagination. Provide them with models or profiles of the sort of information that could be included in a subject area so they can visualize its utility. This sort of "visionary concept" of their strategic information needs is crucial to drive their awareness and is often suggested during interviews of the more strategic thinkers. Tie descriptions of the information directly to stated business drivers (e.g., key processes and decisions) to further accentuate the "business solution." A typical table of contents in the initial Findings and Recommendations document might look like this:
I. Introduction
II. Executive Summary
A. Objectives for the Data Warehouse
B. Summary of Requirements
C. High Priority Information Categories
D. Issues
III. Recommendations
A. Strategic Information Requirements
B. Issues Related to Availability of Data
C. Suggested Initial Increments
D. Data Warehouse Model
IV. Summary of Findings
A. Description of Process Used
B. Key Business Strategies (includes descriptions of processes, decisions, and other drivers)
C. Key Departmental Strategies and Measurements
D. Existing Sources of Information
E. How Information is Used
F. Issues Related to Information Access
V. Appendices
A. Organizational structure, departmental roles
B. Departmental responsibilities and relationships
Conduct Prioritization Workshop
This is a critical workshop for consensus on the business drivers. Key executives and decision-makers should attend, along with some key information providers. It is advisable to schedule this workshop offsite to ensure attendance and attention, but the workshop must be efficient, typically confined to a half-day. Be sure to announce the workshop well enough in advance to ensure that key attendees can put it on their schedules. Sending the announcement of the workshop may coincide with the initial distribution of the interview findings. The workshop agenda should include the following items:
- Agenda and Introductions
- Project Background and Objectives
- Validate Interview Findings: Key Issues
- Validate Information Needs
- Reality Check: Feasibility
- Prioritize Information Needs
- Data Integration Plan
- Wrap-up and Next Steps
Keep the presentation as simple and concise as possible, and avoid technical discussions or detailed sidetracks.
Validate information needs Key business drivers should be determined well in advance of the workshop, using information gathered during the interviewing process. Prior to the workshop, these business drivers should be written out, preferably in display format on flipcharts or similar presentation media, along with relevant comments or additions from the interviewees and/or workshop attendees. During the validation segment of the workshop, attendees need to review and discuss the specific types of information that have been identified as important for triggering or monitoring the business drivers. At this point, it is advisable to compile as complete a list as possible; it can be refined and prioritized in subsequent phases of the project. As much as possible, categorize the information needs by function, maybe even by specific driver (i.e., a strategic process or decision). Considering the information needs on a function by function basis fosters discussion of how the information is used and by whom.
Reality check: feasibility With the results of brainstorming over business drivers and information needs listed (all over the walls, presumably), take a brief detour into reality before prioritizing and planning. You need to consider overall feasibility before establishing the first priority information area(s) and setting a plan to implement the data warehousing solution with initial increments to address those first priorities. Briefly describe the current state of the likely information sources (SORs). What information is currently accessible with a reasonable likelihood of the quality and content necessary for the high priority information areas? If there is likely to be a high degree of complexity or technical difficulty in obtaining the source information, you may need to reduce the priority of that information area (i.e., tackle it after some successes in other areas). Avoid getting into too much detail or technical issues. Describe the general types of information that will be needed (e.g., sales revenue, service costs, customer descriptive information, etc.), focusing on what you expect will be needed for the highest priority information needs.
Data Integration Plan
The project sponsors, stakeholders, and users should all understand that the process of implementing the data warehousing solution is incremental. Develop a high-level plan for implementing the project, focusing on increments that are both high-value and high-feasibility. Implementing these increments first provides an opportunity to build credibility for the project. The objective during this step is to obtain buy-in for your implementation plan and to begin to set expectations in terms of timing. Be practical though; don't establish too rigorous a timeline!
Wrap-up and next steps At the close of the workshop, review the group's decisions (in 30 seconds or less), schedule the delivery of notes and findings to the attendees, and discuss the next steps of the data warehousing project.
Document the Roadmap
As soon as possible after the workshop, provide the attendees and other project stakeholders with the results:
- Definitions of each subject area, categorized by functional area
- Within each subject area, descriptions of the business drivers and information metrics
- Lists of the feasibility issues
- The subject area priorities and the implementation timeline.
Outline for Executive Interviews
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
- Interviews to understand business information strategies and expectations
- Document strategy findings
- Consensus-building meeting to prioritize information requirements and identify "quick hits"
- Model strategic subject areas
- Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Description of business vision, strategies
B. Perspective on strategic business issues and how they drive information needs
- Information needed to support or achieve business goals
- How success is measured
IV. Briefly describe your roles and responsibilities. The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.
A. What are your key business strategies and objectives? How do corporate strategic initiatives impact your group? These may include "MBOs" (personal performance objectives) and workgroup objectives or strategies.
B. What do you see as the Critical Success Factors for an Enterprise Information Strategy? What are its potential obstacles or pitfalls?
C. What information do you need to achieve or support key decisions related to your business objectives?
D. How will your organization's progress and final success be measured (e.g., metrics, critical success factors)?
E. What information or decisions from other groups affect your success?
F. What are other valuable information sources (i.e., computer reports, industry reports, email, key people, meetings, phone)?
G. Do you have regular strategy meetings? What information is shared as you develop your strategy?
H. If it is difficult for the interviewee to brainstorm about information needs, try asking the question this way: "When you return from a two-week vacation, what information do you want to know first?"
I. Of all the information you now receive, what is the most valuable?
J. What information do you need that is not now readily available?
K. How accurate is the information you are now getting?
L. To whom do you provide information?
M. Who provides information to you?
N. Who would you recommend be involved in the cross-functional Consensus Workshop?
Outline for Information Provider Interviews
I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
- Interviews to understand business information strategies and expectations
- Document strategy findings and model the strategic subject areas
- Consensus-building meeting to prioritize information requirements and identify "quick hits"
- Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
1. Understanding of how business issues drive information needs
2. High-level understanding of what information is currently provided to whom
- Where does it come from
- How is it processed
- What are its quality or access issues
IV. Briefly describe your roles and responsibilities. The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.
A. Who do you provide information to? What information do you provide to help support or measure the progress/success of their key business decisions?
B. Of all the information you now provide, what is the most requested or most widely used?
C. What are your sources for the information (both in terms of systems and personnel)?
D. What types of analysis do you regularly perform (i.e., trends, investigating problems)?
E. How do you provide these analyses (e.g., charts, graphs, spreadsheets)?
F. How do you change/add value to the information?
G. Are there quality or usability problems with the information you work with? How accurate is it?
Last updated: 05-Jun-08 15:16
Upgrade Testing Strategies
Challenge
During the upgrade of any software offering, the testing phase is just as important as the upgrade of the software code. The new environment must be certified for operation, for data accuracy, and for performance compared to the previous environment. Often more than 60 percent of the total upgrade time is devoted to testing the data integration environment with the new software release. During this time many questions arise about what type of testing is required, what specific tests should be conducted, and what is the right level of testing to conduct.
Description
Typically there are six different types of testing that may occur with an upgrade of data integration software:
- Operability
- Same Data
- Application Security
- Application Performance
- Third-party Integration
- HA / DR Failover Testing
Depending on the sophistication of the enterprise, an upgrade may incorporate testing from each category to fully certify an environment. During the planning phase these categories should be kept in mind and specific metrics or requirements recorded. This aids in the development of the unit test plans prior to testing. Below is further information on each of the categories of testing and specific examples commonly conducted for each.
Operability
Operability is defined as the ability to be put into use, operation, or practice. This class of tests is conducted to determine whether the Informatica applications are running and executing without any abnormal abends or failures at startup or, in some cases, in day-to-day operations. These are most commonly referred to as smoke tests. Usually these tests are conducted during or right after the upgrade process is complete.
Examples of Operability Tests:
- Installation of software completes successfully
- Domain services start and continue to operate
- Repository upgrade wizard completes successfully
- Services are able to start after the upgrade completes successfully
- Connectivity to required systems is established and verified
- Workflows / sessions are able to execute
- Ability to log into client tools
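For the connectivity-style smoke tests above, a lightweight script can confirm that the upgraded domain is at least reachable before deeper testing begins. The sketch below is illustrative only: the host name and port are hypothetical placeholders (the gateway port is assumed to be the installation default), and it only checks TCP reachability and an HTTP response from the Administration Console URL, not actual service health.

```python
import socket
import urllib.request

# Hypothetical values -- replace with the gateway host/port of the upgraded domain.
GATEWAY_HOST = "infa-node01"
GATEWAY_PORT = 6001  # assumed default domain gateway port for this example
ADMIN_URL = f"http://{GATEWAY_HOST}:{GATEWAY_PORT}/"

def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def admin_console_responds(url: str, timeout: float = 10.0) -> bool:
    """Return True if the Administration Console URL returns any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except Exception:
        return False

if __name__ == "__main__":
    print(f"Gateway port reachable: {port_is_open(GATEWAY_HOST, GATEWAY_PORT)}")
    print(f"Admin console responds: {admin_console_responds(ADMIN_URL)}")
```

A check like this can run immediately after the upgrade completes, before the more time-consuming same-data tests start.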
Same Data
Same data defines a state where the new environment processes the data and provides the data in the same state / format as the legacy environment. This can be the most time-consuming process in the project plan due to the time involved for setup, execution, and review. There are various methods for same data testing (some of which can reduce the amount of time needed), each having its own set of pros and cons; a minimal comparison sketch follows the method summaries below.
Full Regression
Pros
- Provides the lowest amount of risk of data differences
- Allows verification that every part of the existing ETL processes that were upgraded continues to operate
Cons
- Can be a time-consuming task
- Large level of effort for setup and coordination
- Could require a large number of testing resources
Partial Regression
Pros
- Provides less overall level of effort, depending on the percentage of regression testing that will be done
- Potentially fewer resources needed
- Lower amount of setup
Cons
- Assumes a higher level of risk that issues go unidentified
- Requires identification of the regression tests that must be run
Automated Testing
Pros
- Provides rapid execution of tests
- Instant analysis of results
- Repeatable process
Cons
- Requires additional software
- Additional setup and delays if the process is not currently in place
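For the same-data methods above, a small comparison script is one way to automate the baseline-versus-upgraded check. The sketch below is a minimal example under several assumptions: it uses the third-party pyodbc package, hypothetical ODBC DSNs, and hypothetical target table names, and it compares only row counts; a real comparison would typically also checksum key columns.

```python
import pyodbc  # assumes an ODBC driver/DSN for the target database is configured

# Hypothetical DSNs and table list -- substitute the real targets used for the baseline run.
LEGACY_DSN = "DSN=dw_legacy"
UPGRADED_DSN = "DSN=dw_upgraded"
TARGET_TABLES = ["SALES_FACT", "CUSTOMER_DIM"]

def row_count(conn, table):
    """Return the row count for a single target table."""
    return conn.cursor().execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def compare_counts():
    with pyodbc.connect(LEGACY_DSN) as legacy, pyodbc.connect(UPGRADED_DSN) as upgraded:
        for table in TARGET_TABLES:
            old, new = row_count(legacy, table), row_count(upgraded, table)
            status = "OK" if old == new else "MISMATCH"
            print(f"{table}: legacy={old} upgraded={new} [{status}]")

if __name__ == "__main__":
    compare_counts()
```

Running the same script against every target after each regression cycle gives the repeatable, instantly analyzed results described under Automated Testing, at the cost of building and maintaining the comparison list up front.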
Application Security
Application security relates to security that is both internal and external. External security is tested to confirm whether outside threats or exposures exist. Internal security audits the internal application to confirm that users have access only to those areas they need to conduct business.
External Application Security Examples
- SQL injection testing
- SSL encryption levels
- Database permissions testing
- Open ports testing
Internal Application Security Examples
- Administrator and admin user password strength and default tests
- Folder access and permissions for users
- Connection access and permissions for users
- Tool access and permissions for users
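As one illustration of the "open ports testing" item above, the following sketch probes a range of ports on the upgraded server and reports which ones accept connections. The host name and port range are hypothetical placeholders; compare the output against the ports the domain is expected to expose and investigate or firewall anything unexpected.

```python
import socket

# Hypothetical host and port range for the upgraded Informatica server -- adjust as needed.
HOST = "infa-node01"
PORT_RANGE = range(6000, 6020)

def scan_open_ports(host, ports, timeout=1.0):
    """Return the subset of ports that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

if __name__ == "__main__":
    found = scan_open_ports(HOST, PORT_RANGE)
    print(f"Open ports on {HOST}: {found or 'none'}")
```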
Application Performance
Application performance testing measures metrics in multiple areas. This can be conducted through both scientific and nonscientific approaches. It is usually expected that the results are equal to or better than those of the legacy environment.
Application Performance Testing Categories
- Client Performance - tests designed to display metrics around the responsiveness of the client tools
- Integration Service Performance - tests designed to display metrics of the ETL workflows
- Server Performance - tests designed to display metrics around server behavior
- Network Performance - tests designed to display metrics around the network connectivity
Client Performance Metrics
- Amount of time to open mappings of various sizes
- Amount of time to save mappings of various sizes
- Amount of time to import / export mappings of various object counts
- Amount of time for workflow initialization
- Amount of time to log into the environment
- Amount of time to open log files
- Responsiveness of the Admin Console
- Responsiveness of Data Analyzer
Integration Performance Metrics
- Total write throughput per mapping
- Total execution time of a series of mappings or project mappings
- Total number of concurrent workflows / sessions able to execute
- Total amount of time to execute command tasks
- Total amount of time to start up
- Total amount of time to shut down
Server Performance Metrics
- Total CPU % under no load
- Total CPU % under max load
- Total amount of memory utilized under no load
- Total amount of memory utilized under max load
- Disk I/O utilization
Network Performance Metrics
- Client performance via VPN
- Client network packet load under various conditions
- Server network packet load under various conditions
- Packet latency
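For the server performance metrics listed above, utilization can be sampled while the benchmark workflows run and then compared between the legacy and upgraded servers. The sketch below assumes the third-party psutil package is available on the server being measured; it simply averages CPU and memory percentages over a test window.

```python
import time
import psutil  # third-party package; assumed available on the server being measured

def sample_server_metrics(duration_s=60, interval_s=5):
    """Sample CPU and memory utilization while a test workload runs."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        cpu = psutil.cpu_percent(interval=interval_s)  # % CPU over the interval
        mem = psutil.virtual_memory().percent          # % physical memory in use
        samples.append((cpu, mem))
        print(f"cpu={cpu:.1f}% mem={mem:.1f}%")
    avg_cpu = sum(c for c, _ in samples) / len(samples)
    avg_mem = sum(m for _, m in samples) / len(samples)
    print(f"average over run: cpu={avg_cpu:.1f}% mem={avg_mem:.1f}%")

if __name__ == "__main__":
    # Start the benchmark workflows first, then run this collector for the same window
    # on both the legacy and upgraded servers to produce comparable numbers.
    sample_server_metrics()
```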
Third-Party Integration
Even though Informatica supplies a comprehensive platform, there are times when integration with third-party tools is required. These third-party tools can cover scheduling, development, deployment, and monitoring. Upgrades often adjust the underlying framework or syntax required for integration with these tools. Though often overlooked or placed last on the list of items checked, these tests warrant just as much attention as any of the other tests.
Third-Party Integration Examples
- External scheduling tools
- External code management applications such as Subversion (SVN)
- External deployment framework tools
- Encryption / decryption tools
- File transfer management applications
- Proprietary or customer-developed monitoring applications
HA/DR Failover Testing
What are the differences between high availability (HA) environments and disaster recovery (DR) environments?
- High-availability solutions primarily address single instance failures.
- High-availability solutions are measured in uptime.
- Disaster recovery solutions address catastrophic data center failures.
- Disaster recovery solutions are measured in recovery time and data stagnation periods.
High Availability Failover Examples
- Domain failover between nodes
- Database failover between nodes
- Service failover (Note: not all services are HA ready)
Disaster Recovery Failover Examples
- File transfer between primary and DR environments
- DR startup and execution ability
- Stress test of the DR environment
Last updated: 31-Oct-10 19:39
Upgrading Data Analyzer Challenge Seamlessly upgrade Data Analyzer from one release to another while safeguarding the repository.
Description
Preparing to Upgrade Data Analyzer
Before starting the upgrade process, be sure to check the Informatica support information for the Data Analyzer upgrade path and any associated caveats. For instance, Data Analyzer must first be upgraded to version 8.6.1 prior to upgrading to Informatica 9.0.1. This may affect PowerCenter upgrade plans, since the PowerCenter upgrade supports direct upgrades from version 8.1.1.
1. Conduct a cleanup exercise inside the legacy Data Analyzer environment. This includes the following tasks:
a. Removing unused or unneeded reports
b. Removing unused or unneeded dashboards
c. Reviewing the security permissions in place
d. Removing outdated users and groups
2. Install or upgrade the Informatica environment to the target version. Data Analyzer is a service of the Informatica domain.
3. For upgrades that adjust the security model (for example, Native to LDAP), it is a good idea to generate a list of reports and the associated user / group permissions prior to upgrading, to help map those permissions post upgrade.
4. Make sure that the user IDs from the old environment have been migrated to the new environment.
5. Allocate a new schema/database for the Data Analyzer repository. This is not required if there are plans to use the current schema, but a new schema/database is recommended; it makes it easier to revert if any problems are encountered.
Upgrading Data Analyzer (Reporting) Service
1. If required, set up the LDAP connectivity inside the Administrator / Admin console and sync the users into the Informatica environment.
2. Using the backup functionality inside the Administrator / Admin console, back up the old DA repository.
3. Create a new Data Analyzer Service in the new Informatica environment, specifying the desired new Reporting Service.
4. Using the restore functionality inside the Administrator / Admin console, restore the repository backup to the new schema.
5. As the service starts, it will recommend that the repository contents be upgraded. Upgrade the Data Analyzer repository contents through the application services upgrade wizard.
6. If the Reconcile Users and Groups dialog box appears, specify whether or not to use the existing Administrators or Public groups. Then specify a resolution for each conflict and click Next. This dialog box appears when an upgrade to 8.1.1 PowerCenter Repository Service users and groups is performed and the selection is made not to automatically reconcile
user and group conflicts. This step ensures that the existing users from the old environment have the same set of privileges in the new environment.
7. If switching security domains, make any needed modifications to grant access to the service, reports, and dashboards.
Verifying the Data Analyzer Service Upgrade
The following checks are recommended to ensure that the upgrade process goes smoothly.
- Check that all reports and dashboards from the prior version show up in the new environment.
- Check the user permissions in the new environment. If the users were correctly migrated prior to upgrading the Data Analyzer Service, the permissions should be the same in the old and new environments.
- Check that the reports and dashboards are connecting to the proper data sources.
Post Upgrade Tasks for Data Analyzer Service Service Clean Up Remove the legacy instance of the Data Analyzer reporting service from the Administrator / Admin console. This will prevent confusion and an improper upgrade of the legacy environment.
Performance Tuning of the Environment
Data Analyzer requires the interaction of several components and services, including those that may already exist in the enterprise infrastructure, such as the enterprise data warehouse and authentication server. The following components can be tuned to optimize the performance of Data Analyzer:
- Database
- Operating system
- Application server
- Data Analyzer
The Data Analyzer Administrator Guide is an excellent source of tuning guidance for each of the aforementioned areas.
Installation of Data Analyzer Administrative Reports Data Analyzer provides a set of administrative reports that enable system administrators to track user activities and monitor processes. They include details on Data Analyzer usage and report schedules and errors. After setting up the Data Analyzer administrative reports, the reports can be viewed and used just like any other set of reports in Data Analyzer. If additional information in a report is needed, modify it to add metrics or attributes. For any report, charts or indicators can be added or the format can be changed. Reports can be enhanced to suit specific needs and help manage the users and processes in Data Analyzer more efficiently. View the administrative reports in two areas: 1. Administrator’s Dashboard. On the Administrator’s Dashboard, one can quickly see how well Data Analyzer is working and how often users log in. 2. Data Analyzer Administrative Reports folder. Access all administrative reports in the Data Analyzer Administrative Reports public folder under the Find tab. Before importing the Data Analyzer administrative reports, ensure that the Reporting Service is enabled and the Data Analyzer instance is running properly. Import the XML files under the
Last updated: 31-Oct-10 20:05
Upgrading Metadata Manager Challenge This Best Practice summarizes one recommended upgrade path for Metadata Manager.
Description
Preparing to Upgrade Metadata Manager
Before starting the upgrade process, be sure to check the Informatica support information for the Metadata Manager upgrade path. For instance, Superglue 2.1 (as Metadata Manager was previously called) should first be upgraded to Metadata Manager 8.1 and then to Metadata Manager 8.6/9.x.
1. Install or upgrade the PowerCenter environment to the target version. Metadata Manager is a service of the PowerCenter domain. In order to upgrade the Metadata Manager Repository, first upgrade PowerCenter.
2. Make sure that the user IDs from the old environment have been migrated to the new environment.
3. Allocate a new schema/database for the Metadata Manager Repository. This is not required if the plan is to use the current schema, but a new schema/database is recommended; it makes it easier to revert if any problems are encountered.
4. ODBC Sources: Make sure that the ODBC DSNs used in the old environment are also valid in the new environment.
Upgrading Metadata Manager Service
1. Using the mmBackupUtil command, back up the old MM repository and restore it to the new schema. Check the return status of mmBackupUtil to ensure that the backup completed successfully. If there is a problem backing up the MM repository with this utility, make a backup using the database's backup utility and restore it to the new schema.
2. When the new service is created, the new Metadata Manager mappings are created under the "Metadata_Load" folder in the associated Repository Service. If the associated Repository Service was also upgraded from a prior version, it most probably already has a "Metadata_Load" folder. Make sure to delete the old Metadata_Load folder prior to creating the MM service.
3. Create a new Metadata Manager Service in the new PowerCenter environment. Use the above-mentioned schema to host the repository.
4. As the service starts, it will recommend that the repository contents be upgraded. The Metadata Repository contents can be upgraded at this point. If version 9.0.1 is running, the upgrade wizard tool can be used to upgrade the contents.
This wizard provides more in-depth logging information for the upgrade process.
5. If the Reconcile Users and Groups dialog box appears, specify whether or not to use the existing Administrators or Public groups. Then specify a resolution for each conflict and click Next. This dialog box appears when 8.1.1 PowerCenter Repository Service users and groups are upgraded and the selection is made not to automatically reconcile user and group conflicts. This step ensures that the existing users from the old environment have the same set of privileges in the new environment.
6. Custom XConnects: If there are any custom XConnects in the old environment, the workflows for these XConnects will have to be regenerated. The upgrade process updates any existing custom metadata to the new release; however, if there is a need to re-run a custom XConnect, generate the workflow first. To generate the custom XConnect workflows, follow the steps below:
a. Open Custom Metadata Configurator and log on to the new Metadata Repository.
b. Open the custom XConnect template. Note: The metadata files are needed in order to open the template.
c. The metadata rules should be viewable at this point.
d. Click on Generate Workflow. At this point the Custom Metadata Configurator generates the custom XConnect workflows.
e. Repeat the above steps for all custom XConnect templates.
7. Search Indexing: The search indexes have to be rebuilt for all resources. If MM does not find its search indexes, it automatically rebuilds them.
8. MM Files: Metadata Manager uses an "MM Files" directory to store all files related to a service. This path is configurable from the Admin console. There is an excellent chance that the new MM_Files path and the old MM_Files path are not the same.
Metadata Manager uses this location to store the following:
- PowerCenter Parameter Files: Parameter files that are uploaded via the MM Console are stored in the MM Files location. These files are later used during the PowerCenter loads.
- Custom Metadata Files: Files that are used by custom XConnects are also stored in the MM Files location.
- Data Model Files: If the modeling metadata is being loaded in from files (e.g., ERStudio or Erwin files), these files are also stored in the MM Files location.
If the user simply re-runs these loads without considering the MM Files path, it is possible that the metadata loads will fail or load incorrectly. There are two options:
1. Reload all files manually in the new environment. This could be a painful option, especially if a large number of files are involved.
2. Copy the files from the old location to the new location. This ensures that the content in MM_Files is the same in both the old and new environments.
For example, the new MM Files location could be: C:\Informatica\9.0.1\services\MetadataManagerService\mm_files\MM_IPS_Kickoff
The specific files mentioned above are stored in a sub-folder named "mm_load". Copy the mm_load sub-directory from the old environment to the new environment. DO NOT COPY the mm_index and mm_etl sub-folders.
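If option 2 is chosen, the copy can be scripted so that only the mm_load sub-directory is carried over. The paths below are hypothetical placeholders for the old and new MM Files locations; the sketch assumes Python 3.8 or later for the dirs_exist_ok flag.

```python
import shutil
from pathlib import Path

# Hypothetical paths -- substitute the actual MM Files locations of the old and new services.
OLD_MM_FILES = Path(r"D:\Informatica_old\mm_files\MM_SERVICE_OLD")
NEW_MM_FILES = Path(r"C:\Informatica\9.0.1\services\MetadataManagerService\mm_files\MM_SERVICE_NEW")

# Copy only the mm_load sub-directory; mm_index and mm_etl must not be carried over.
src = OLD_MM_FILES / "mm_load"
dst = NEW_MM_FILES / "mm_load"
shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
print(f"Copied {src} -> {dst}")
```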
Verifying the Metadata Manager Service Upgrade The following checks are recommended to ensure that the upgrade process goes smoothly.
- Check that all metadata resources from the prior version show up in the new environment.
- Check the user permissions in the new environment. If the users were correctly migrated prior to upgrading the MM service, the permissions should be the same in the old and new environments.
- Check that data lineage and search operations work correctly.
Last updated: 31-Oct-10 21:32
Upgrading PowerCenter
Challenge
Upgrading an existing installation of PowerCenter to a newer version encompasses upgrading the repositories, implementing any necessary modifications, testing, and configuring new features. With PowerCenter 8.1, the expansion of the Service-Oriented Architecture with its domain and node concept brings additional challenges to the upgrade process. The challenge is for data integration administrators to approach the upgrade process in a structured fashion and minimize risk to the environment and on-going project work. Some of the challenges typically encountered during an upgrade include:
- Limiting development downtime.
- Ensuring that development work performed during the upgrade is accurately migrated to the upgraded environment.
- Testing the upgraded environment to ensure that data integration results are identical to the previous version.
- Ensuring that all elements of the various environments (e.g., Development, Test, and Production) are upgraded successfully.
Description
Typical reasons for initiating a PowerCenter upgrade include:
- To take advantage of additional features and capabilities in the new version of PowerCenter that enhance development productivity and administration.
- To keep pace with higher demands for data integration.
- To achieve process performance gains.
- To maintain an environment of fully supported software as older PowerCenter versions reach end-of-support status.
Upgrade Planning
The following items should be considered when planning for an upgrade:
- An upgrade requires a detailed project plan, as it should be treated as a full-blown project.
- Training on new features for developers and/or administrators.
- Testing that includes a baseline run and regression testing to fully certify the upgrade.
- Communicating with teams that support external touch points (e.g., the scheduling group, Salesforce, FTP groups).
- Reviewing the Product Availability Matrix (PAM) to ensure compatibility of all PowerCenter tools.
- Scanning the release notes and New Features list.
Upgrade Team
Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade is key to completing the process within schedule and budgetary guidelines. Typically, the upgrade team needs the following key players:
- PowerCenter Administrator
- Database Administrator
- System Administrator
- Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These resources are required for knowledge transfer and testing during the upgrade process and after the upgrade is complete.
Upgrade Paths
The upgrade process details depend upon which existing PowerCenter version is being upgraded and to which version. The following items summarize the upgrade paths for the various PowerCenter versions:
PowerCenter 8.6 (available since July 2008)
- Direct upgrade for PowerCenter 7.x to 8.6
- Direct upgrade for PowerCenter 8.x to 8.6
- Direct upgrade for PowerCenter 8.5.1 to 8.6 (no repository upgrade required)
Other versions:
- For version 4.6 or earlier - upgrade to 5.x, then to 7.x, and then to 8.6
- For versions 4.7 through 5.x - upgrade to 6.x, then to 8.1.1, and then to 8.6
- For version 6.x - upgrade to 8.1.1 and then to 8.6
Configuration Support Manager
The Configuration Support Manager (CSM) is a proactive support tool created by GCS. This tool provides the following benefits:
- Simple configuration management.
- Automated environment information collection via the CSM Client.
- Advanced diagnostics, including health checks and environment "compare" capability.
- Proactive alerts (EBFs/KBs) by email or RSS subscription.
Informatica suggests that the CSM tool be installed in all environments prior to the upgrade process to ensure that all environments are similar. If the tool identifies any issues, they can be resolved prior to the actual upgrade to mitigate any issues with the new PowerCenter version.
Upgrade Tips
Some of the following items may seem obvious, but adhering to these tips should help to ensure that the upgrade process goes smoothly.
- Be sure to have sufficient memory and disk space (database) for the installed software. As new features are added to PowerCenter, the repository grows in size anywhere from 5 to 25 percent per release to accommodate the metadata for the new features. Plan for this increase in all PowerCenter repositories.
- Always read and save the upgrade log file.
- Back up the Repository Server and PowerCenter Server configuration files prior to beginning the upgrade process.
- Test the AEP/EP (Advanced External Procedure/External Procedure) prior to beginning the upgrade. Recompiling may be necessary.
- PowerCenter 8.x and beyond require Domain Metadata in addition to the standard PowerCenter repositories. Work with the DBA to create a location for the Domain Metadata Repository that is created at install time.
- Ensure that all repositories to be upgraded are backed up and that they can be restored successfully (a minimal pmrep backup sketch follows this list). Repositories can be restored to the same database in a different schema to allow an upgrade to be carried out in parallel. Note that the restoration of the repository must be done in the same version as the backup. This is especially useful if the PowerCenter test and development environments reside in a single repository.
- When naming nodes and domains in PowerCenter 8, think carefully about the naming convention before the upgrade. While changing the name of a node or the domain later is possible, it is not an easy task since the name is embedded in much of the general operation of the product. Avoid using IP addresses and machine names for the domain and node names, since machine IP addresses and server names may change over time.
- With PowerCenter 8, a central location exists for shared files (e.g., log files, error files, checkpoint files, etc.) across the domain. If using the Grid option or High Availability option, it is important that this file structure is on a high-performance file system and viewable by all nodes in the domain. If High Availability is configured, the file system should also be highly available.
- With PowerCenter 8.x, the recovery options have changed. Review the session settings for recovery to ensure the defaults meet the recovery needs of the business.
- To prevent folder-level permission issues when upgrading to a sandbox environment, export the users and groups and import them into the new domain before upgrading the repository. This ensures that all users and groups are present when the folders are upgraded.
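As a convenience for the repository backup tip above, the backup can be scripted around pmrep. The sketch below is illustrative: the repository, domain, and credential values are placeholders, and the pmrep option flags shown should be verified against the Command Reference for the release being backed up.

```python
import subprocess

# Hypothetical connection details -- replace with real values and verify the
# pmrep flags against the Command Reference for the source release.
REPO, DOMAIN, USER, PASSWD = "REP_DEV", "Domain_Dev", "Administrator", "secret"
BACKUP_FILE = "REP_DEV_preupgrade.rep"

def run(args):
    """Echo and execute a pmrep command, raising on a non-zero return code."""
    print(" ".join(args))
    subprocess.run(args, check=True)

# Connect to the repository, then write a backup file that can be restored
# into a separate schema to rehearse the upgrade in parallel.
run(["pmrep", "connect", "-r", REPO, "-d", DOMAIN, "-n", USER, "-x", PASSWD])
run(["pmrep", "backup", "-o", BACKUP_FILE])
```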
Upgrading Multiple Projects Be sure to consider the following items if the upgrade involves multiple projects:
- All projects sharing a repository must upgrade at the same time (test concurrently).
- Projects using multiple repositories must all upgrade at the same time.
- After the upgrade, each project should undergo full regression testing.
Upgrade Project Plan The full upgrade process from version to version can be time-consuming, particularly around the testing and verification stages. Informatica strongly recommends developing a project plan to track progress and inform managers and team members of the tasks that need to be completed, uncertainties or missed steps. In addition, issue tracking and resolution is key to the success of the actual upgrade process. Use the Issues Tracking Sample Deliverable to capture and communicate any issues encountered during the upgrade.
Scheduling the Upgrade When an upgrade is scheduled in conjunction with other development work, it is prudent to have it occur within a separate test environment that mimics (or at least closely resembles) production. This reduces the risk of unexpected errors and can decrease the effort spent on the upgrade. It may also allow the development work to continue in parallel with the upgrade effort, depending on the specific site setup.
Environmental Impact
With each new PowerCenter release, there is the potential for the upgrade to affect the data integration environment based on new components and features. Because the PowerCenter 8 upgrade changes the architecture from PowerCenter version 7, time should be spent planning the upgrade strategy concerning domains, nodes, domain metadata, and the other architectural components in PowerCenter 8. Depending on the complexity of the data integration environment, this may be a minor or major impact. Single integration server/single repository installations are not likely to notice much of a difference in the architecture, but customers striving for highly-available systems with enterprise scalability may need to spend time understanding how to alter the physical architecture to take advantage of these new features in PowerCenter 8. For more information on these architecture changes, reference the PowerCenter documentation and the Best Practice on Domain Configuration.
Upgrade Process Informatica recommends using the following approach to handle the challenges inherent in an upgrade effort.
Choosing an Appropriate Environment
It is always advisable to have at least three separate environments (one each) for Development, Test, and Production. The QA environment is generally the best place to start the upgrade process since it is likely to be the most similar to Production. If possible, select a test sandbox that parallels production as closely as possible. This enables performing data comparisons between PowerCenter versions. An added benefit of starting the upgrade process in a test environment is that development can continue without interruption. The corporate policies on development, test, and sandbox environments and the work that can or cannot be done in them will determine the precise order for the upgrade and any associated development changes. Note that if changes are required as a result of the upgrade, they need to be migrated to Production. Use the existing version to back up the PowerCenter repository, then ensure that the backup works by restoring it to a new schema in the repository database. Alternatively, begin the upgrade process in the Development environment or create a parallel environment in which to start the effort. The decision to use or copy an existing platform depends on the state of project work across all environments. If it is not possible to set up a parallel environment, the upgrade may start in Development, then progress to the Test and Production systems. However, using a parallel environment is likely to minimize development downtime. All changes made in development will need to be incorporated into the parallel environment if the development environment is not upgraded. The important thing is to understand the upgrade process and the business and technical requirements, then adapt the approaches described in this document to one that suits the particular situation.
Organizing the Upgrade Effort Begin by evaluating the entire upgrade effort in terms of resources, time and environments. This includes training, availability of database, operating system and PowerCenter administrator resources as well as time to perform the upgrade and carry out the necessary testing in all environments. Refer to the release notes to help identify mappings and other repository objects that may need changes as a result of the upgrade. Provide detailed training for the Upgrade team to ensure that everyone directly involved in the upgrade process understands the
new version and is capable of using it for their own development work and to assist others with the upgrade process. Run regression tests for all components on the old version. If possible, store the results so that they can be used for comparison purposes after the upgrade is complete. Before beginning the upgrade, be sure to back up the repository and server caches, scripts, logs, bad files, parameter files, source and target files, and external procedures. Also be sure to copy the backed-up server files to the new directories as the upgrade progresses. For UNIX environments that need to use the same machine for the existing and upgraded versions, be sure to use separate users and directories. Be careful to ensure that profile path statements do not overlap between the new and old versions of PowerCenter. For additional information, refer to the installation guide for path statements and environment variables for specific platforms and operating systems.
Installing and Configuring the Software
- Install the new version of the PowerCenter components on the server. Ensure that the PowerCenter client is installed on at least one workstation to be used for upgrade testing and that connections to repositories are updated if parallel repositories are being used.
- Re-compile any Advanced External Procedures/External Procedures if necessary, and test them.
- The PowerCenter license key is now in the form of a file. During the installation of PowerCenter, you will be prompted for the location of this key file. The key should be saved on the server prior to beginning the installation process.
- When installing PowerCenter 8.x, configure the domain, node, Repository Service, and Integration Service at the same time. Ensure that all necessary database connections are ready before beginning the installation process.
- If upgrading to PowerCenter 8.x from PowerCenter 7.x (or earlier), gather all of the configuration files that are going to be used in the automated process to upgrade the Integration Services and repositories. See the PowerCenter Upgrade Manual for more information on how to gather them and where to locate them for the upgrade process.
- Once the installation has been completed, use the Administration Console to perform the upgrade. Unlike previous versions of PowerCenter, in version 8 the Administration Console is a web application. The Administration Console URL is http://hostname:portnumber, where hostname is the name of the server where the PowerCenter services are installed and portnumber is the port identified during the installation process. The default port number is 6001.
- Re-register any plug-ins (such as PowerExchange) to the newly upgraded environment.
- Both the Repository Service and Integration Service can be started from the Admin Console.
- Analyze the upgrade activity logs to identify areas where changes may be required, and rerun full regression tests on the upgraded repository.
- Execute test plans. Ensure that there are no failures and that all the loads run successfully in the upgraded environment. Verify the data to ensure that there are no changes and no additional or missing records.
Testing After the Upgrade
Testing is a key milestone during an upgrade project because it verifies that the upgrade has not changed the behavior of the application. Prior to the upgrade, it is suggested that a baseline set of data be saved to be used during the testing phase to compare the before and after results. The following approaches can be used for testing after the upgrade is completed.
- Brute Force - execution of every mapping in the environment to compare the baseline with the post-upgrade results. This type of testing is very time-consuming because every mapping is touched; however, it provides the best coverage.
- Operation Target Livelihood - execution of only the mission-critical mappings. This type of testing reduces the testing time and ensures that the most important mappings have not been affected by the upgrade; however, not all the mappings will be touched.
- Automation - execution of a third-party testing tool or custom code. This type of testing provides a repeatable process and allows all mappings to be tested; however, it does require that the tools be developed or set up prior to the upgrade, which can extend the timeline.
In addition, the Comparison Utility is available to aid in upgrade testing. The Comparison Utility compares target results along with metadata between two live repositories or a live repository and the baseline. The tool is available free from my.informatica.com, or it will install automatically with versions 8.5.x or higher. Documentation for the Comparison Utility can be found at
http://www1.informatica.com/downloads/informaticaComparisonUtility.pdf
Implementing Changes If changes are needed, decide where those changes are going to be made. It is generally advisable to migrate work back from test to an upgraded development environment. Complete the necessary changes and then migrate forward through test to production. Assess the changes when the results from the test runs are available. If changes are made in test and migrated forward to production (which is a deviation from Best Practices) remember that these changes still need to be implemented in development. Otherwise, these changes will be lost the next time work is migrated from development to the test environment.
After the Upgrade If multiple nodes were configured and your business owns the PowerCenter Grid option, a server grid can be created to test performance gains. If your business owns the high-availability option, your environment should be configured for high availability including setting up failover gateway node(s) and designating primary and backup nodes for the various PowerCenter services. In addition, the shared file location for the domain should be located on a highly available, high-performance file server. Lastly, ensure that rep caching is turned on in the repository advanced settings to prevent session initialization issues. Start measuring data quality by creating a sample data profile. If LDAP is in use, associate LDAP users with PowerCenter users. Install PowerCenter Reports and configure the built-in reports for the PowerCenter repository.
Repository Versioning After upgrading to version 8.x, the repository can be set to versioned if your business purchased the Team-Based Management option and enabled it via the license key. Keep in mind that once the repository is set to versioned, it cannot be set back to non-versioned. The team-based development option can be invoked in the Administration Console.
Upgrading Folder Versions After upgrading to version 8.x, remember the following: There are no more folder versions in version 8. The folder with the highest version number becomes the current folder. Other versions of the folders are folder_
Upgrading Pmrep and Pmcmd Scripts
- There are no more folder versions for pmrep and pmrepagent scripts. Ensure that the workflow/session folder names match the upgraded names.
- Note that the pmcmd command structure changes significantly after version 5. Version 5 pmcmd commands can still run in version 8, but may not be backwards-compatible in future versions.
- Users that administer the domain must have "Manage Service" permissions, because security has been moved to the domain level.
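To illustrate the post-version-5 pmcmd structure referenced above, the following sketch wraps a startworkflow call. The service, domain, folder, workflow, and credential names are placeholders, and the flag spellings shown are an assumption from common usage; confirm them against the pmcmd Command Reference for the upgraded release before wiring anything like this into a scheduler.

```python
import subprocess

# Hypothetical names -- replace with real values for the upgraded environment.
SERVICE, DOMAIN = "INT_SVC_DEV", "Domain_Dev"
USER, PASSWD = "Administrator", "secret"
FOLDER, WORKFLOW = "DW_LOADS", "wf_load_sales"

# Post-7.x pmcmd calls name the Integration Service and domain explicitly,
# unlike the older positional version 5 syntax.
subprocess.run(
    ["pmcmd", "startworkflow",
     "-sv", SERVICE, "-d", DOMAIN,
     "-u", USER, "-p", PASSWD,
     "-f", FOLDER, "-wait", WORKFLOW],
    check=True,
)
```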
Advanced External Procedure Transformations AEPs are upgraded to Custom Transformation, a non-blocking transformation. To use this feature, the procedure must be recompiled, but the old DLL/library can be used if recompilation is not required.
Upgrading XML Definitions Version 8 supports XML schema. The upgrade removes namespaces and prefixes for multiple namespaces. Circular reference definitions are read-only after the upgrade. Some datatypes are changed in XML definitions by the upgrade.
Sequence Generators
The data type of NEXTVAL and CURRVAL changes to BIGINT. The upgrade process updates the data type in all mappings.
Downstream transformations must be manually updated with the new data type. This must be done when mappings are imported into the 8.6 environment or when they are upgraded in place. The end value must be updated to 9223372036854775807.
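As a quick sanity check on where that end value comes from, it is simply the largest signed 64-bit (BIGINT) integer:

```python
# 9223372036854775807 is the maximum signed 64-bit integer, i.e. 2**63 - 1.
assert 2**63 - 1 == 9223372036854775807
print(2**63 - 1)  # -> 9223372036854775807
```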
Run-time Parameters
With the inclusion of run-time parameters in version 8.6, parameter files are no longer required to capture mapping/session information. The mapping, session, workflow, and folder names are now available in the Designer for assignment at run time. For more information on the specific changes to the PowerCenter software for a particular upgraded version, reference the release notes as well as the PowerCenter documentation.
Last updated: 29-Oct-10 17:39
Upgrading PowerExchange Challenge Upgrading and configuring PowerExchange on a mainframe to a new release and ensuring that there is minimum impact to the current PowerExchange schedule.
Description
The PowerExchange upgrade is essentially an installation with a few additional steps and some changes to the steps of a new installation. Planning for a PowerExchange upgrade requires the same resources as the initial implementation. These include, but are not limited to:
- MVS systems operator
- Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
- MVS security resources
Since an upgrade is so similar to an initial implementation of PowerExchange, this document does not address the details of the installation. It addresses the steps that are not documented in the Best Practices installation document, as well as changes to existing steps in that document. For details on installing a new PowerExchange release, see the Best Practice PowerExchange Installation (for Mainframe).
Upgrading PowerExchange on the Mainframe
The following steps are modifications to the installation steps, or additional steps, required to upgrade PowerExchange on the mainframe. More detailed information on upgrades can also be found in the PWX Migration Guide that accompanies each release.
1. Choose a new high-level qualifier when allocating the libraries, RUNLIB and BINLIB, on the mainframe. Consider using the version of PowerExchange as part of the dataset name; an example would be SYSB.PWX811.RUNLIB. These two libraries need to be APF authorized.
2. Back up the mainframe datasets and libraries. Also back up the PowerExchange paths on the client workstations and the PowerCenter server.
3. When executing the MVS Install Assistant and providing values on each screen, make sure the following parameters differ from those used in the existing version of PowerExchange:
Specify new high-level qualifiers for the PowerExchange datasets, libraries, and VSAM files. The value needs to match the qualifier used for the RUNLIB and BINLIB datasets allocated earlier. Consider including the version of PowerExchange in the high-level nodes of the datasets; an example could be SYSB.PWX811.
The PowerExchange Agent/Logger three-character prefix needs to be unique and differ from that used in the existing version of PowerExchange. Make sure the values on the Logger/Agent/Condenser Parameters screen reflect the new prefix.
For DB2, the plan name specified should differ from that used in the existing release.
4. Run the jobs listed in the XJOBS member in the RUNLIB.
5. Before starting the Listener, rename the DBMOVER member in the new RUNLIB dataset.
6. Copy the DBMOVER member from the current PowerExchange RUNLIB to the corresponding library for the new release of PowerExchange. Update the port numbers to reflect the new ports, and update any dataset names specified in the NETPORT statements to reflect the new high-level qualifier (see the example entries after the change data capture steps below).
7. Start the Listener and make sure the PING works. See the installation Best Practice or the Implementation Guide for more details.
8. Migrate the existing datamaps to the new release using the DTLURDMO utility. Details and examples can be found in the PWX Utilities Guide and the PWX Migration Guide.
At this point, the mainframe upgrade is complete for bulk processing. For PowerExchange Change Data Capture or Change Data Capture Real-time, complete the additional steps in the installation manual and also perform the following steps:
1. Use the DTLURDMO utility to migrate existing Capture Registrations and Capture Extractions to the new release.
2. Create a Registration Group for each source.
3. Open and save each Extraction Map in the new Extraction Groups.
4. Ensure the values for the CHKPT_BASENAME and EXT_CAPT_MASK parameters are correct before running a Condense.
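As a hedged illustration of step 6 in the mainframe section above (the node name and port are placeholders, not values from this document), the DBMOVER member for the new release would carry its own Listener port, for example:

    LISTENER=(node1,TCPIP,2482)

Any NETPORT statements should also be reviewed so that the dataset names they reference use the new high-level qualifier (for example, SYSB.PWX811 rather than the old qualifier). Confirm the exact statement syntax against the PowerExchange documentation for your release.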
Upgrade PowerExchange on a Client Workstation and the Server
The installation procedures on the client workstations and the server are the same as for an initial implementation, with a few exceptions. The differences are as follows:
1. Specify new paths during the installation of the new release.
2. After the installation, copy the old DBMOVER.CFG configuration file to the new path and modify the ports to reflect those of the new release.
3. Make sure the PATHS reflects the path specified earlier for the new release.
Testing can begin now (see the connectivity check below). When testing is complete, the new version can go live.
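A hedged example of the client-side change and a connectivity check (node name, host, and port are placeholders; confirm DTLREXE usage against the PowerExchange documentation for your release). In the client DBMOVER.CFG, a NODE entry pointing at the new Listener port might look like:

    NODE=(node1,TCPIP,mvs01.example.com,2482)

Connectivity to the upgraded Listener can then be checked from the command line before workflow testing begins:

    dtlrexe prog=ping loc=node1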
Go Live With New Release
1. Stop all workflows (see the pmcmd example below).
2. Stop all production updates to the existing sources.
3. Ensure all captured data has been processed.
4. Stop all tasks on the mainframe (Agent, Listener, etc.).
5. Start the new tasks on the mainframe.
6. Resume production updates to the sources and resume the workflow schedule.
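As a hedged sketch of steps 1 and 6 (service, domain, user, folder, and workflow names are placeholders), workflows can be stopped before the cutover and their schedules resumed afterwards from the command line:

    pmcmd stopworkflow -sv INT_SVC -d Domain_Prod -u admin -p admin_pwd -f SALES_FOLDER wf_load_sales
    pmcmd scheduleworkflow -sv INT_SVC -d Domain_Prod -u admin -p admin_pwd -f SALES_FOLDER wf_load_sales

Stopping and starting the mainframe started tasks (Agent, Listener, Logger) is normally handled by MVS operations using the site's standard operator procedures.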
After the Migration
Consider removing or uninstalling the old release's software from the workstations and the server to avoid any conflicts.
Last updated: 29-Oct-10 17:39