Q. How can we count the number of records in a flat file using Ab Initio?
A. Use the aggregation function count, or use a Rollup component with {} as the key specifier. With an empty key all records fall into a single group, so the rollup produces one output record containing the total record count (see the transform sketch at the end of this block).

Q. How do we use SCD Types in Ab Initio graphs?
A.

Q. What is the order of execution of a graph when it runs?
A. Order of graph execution:
1. Initialisation of parameters
2. Start script execution
3. Graph execution
4. End script execution

Q. How do we calculate the total number of records in a file using REFORMAT instead of ROLLUP?
A. Via its log port. Connect a Reformat to the log port, specify event_type == "finish" in its select parameter, and use code along these lines:

    type reformat_final_msg =
    record
      decimal("records") read_count;
      string("read\n") filler_read;
      decimal("records") written_count;
      string("written\n") filler_written;
      decimal("records") rejected_count;
      string("rejected") filler_rejected;
    end;

    out :: reformat(in) =
    begin
      out.rec_count :1: string_lrtrim(reinterpret_as(reformat_final_msg, in.event_text).read_count);
    end;

Q. How do we append records to an already existing file using an Ab Initio graph?
A. Create a graph that uses the existing file as the output file and set the output file's mode to Append. Pass the new records from the input file to this output file through a Reformat; the new records will be appended to the existing file.
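For the Rollup counting approach mentioned in the first question, a minimal transform sketch, assuming an output record format with a single rec_count field (the field name is illustrative):

    out :: rollup(in) =
    begin
      out.rec_count :: count(1);
    end;

With the key specifier set to {}, the component emits exactly one record holding the total count.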
Q. What is output_index? How does it work in Reformat? Do the following rules show output_index in use: out :1: if (in.emp.sal < 500) in.emp.sal; out :2: force_error("Employee salary is less than 500");?
A. The output_index transform is used in a Reformat that has multiple output ports (count > 1) to direct each record to a particular out port. For example, for a Reformat with count 3 (ports out0, out1, out2) the transform could be:

    out :: output_index(in) =
    begin
      out :: if (in.value == "A") 0
             else if (in.value == "B") 1
             else 2;
    end;

which means that any record whose value field evaluates to "A" comes out of port out0 only, and not out1 or out2. (The rules quoted in the question are prioritized assignment rules, not an output_index transform.)

Q. How does component folding work?
A.

Q. What is the advantage of the SORT WITHIN GROUPS component?
A. Sort within Groups refines the sorting of records already sorted on one key: it sorts the records within the groups formed by the first sort according to a second, minor key.

Q. What are environment variables? Why are they required?
A. These are the Ab Initio environment variables. They are set in the stdenv project, under which the private and public projects sit. Parameters such as $AB_HOME and $AB_AIR_ROOT live there and resolve to the corresponding paths.

Q. What is the need for configuration variables in Ab Initio (AB_JOB, AB_MAX_CORE), and where are they defined?
A.

Q. How do you avoid duplicates without using the Dedup component?
A. Use a Rollup component; it removes the duplicates and produces the expected results. Or use the key_change transform of the Rollup component. The code looks like this:

    out :: key_change(prev, curr) =
    begin
      out :: curr != prev;
    end;

    out :: rollup(in) =
    begin
      out :: in;
    end;

Q. What happens when we pass a dot or invalid parameters in the input component's layout URL?
A.
Q. What is the use of the AB_JOB parameter in Ab Initio?
A. AB_JOB is set when we want to run the same graph several times concurrently under different job names. It should be defined as a sandbox parameter; if you do not give it a value, the default job name is used.

Q. How do you use a normal batch graph as a subgraph in a continuous graph?
A.

Q. How do you open Ab Initio in UNIX?
A. We cannot open the GDE in UNIX; we can only run graphs in UNIX using the deployed .ksh script.

Q. How do you do production support for a graph?
A. If a graph fails in production, we usually get emergency access to inspect the failure and analyse it. If it is a code bug, we go back to the development environment, fix and test it, then deploy back to production and rerun.

Q. How do you check whether a graph completed successfully or not (is it $? of UNIX)?
A. In the deployed script, $mpjret holds the result: 0 means success, 1 means failure.

Q. What are the different return values?
A. 0 and 1: 0 is success, 1 is failure. $? returns the status of the last executed command.

Q. Why and when do we get the "pipeline broken" error in Ab Initio?
A. A pipeline broken error indicates the failure of a downstream component. It commonly occurs when the database runs out of memory, which makes the database components in the graph unavailable.
Q. What are the two types of .dbc files?
A. .dbc files are generally classified into two types according to the value of the fixed_size_dml parameter, which can be true or false. If the value is false, the database generates delimited DML types wherever possible (it recognizes NULL as a zero-length string); if true, it generates fixed-length DML. Other parameters in a .dbc file include: dbms, db_version, db_home, db_name, db_nodes, user, password, case, generate_dml_with_nulls, fixed_size_dml, treat_blanks_as_null, oldstyle_emptystring_as_null, fully_qualify_dml, delimited_dml_with_maximum_size, interface, environment, and direct_parallel.

Q. What is the usage of the .mfctl and .mdir files in the mfs directory of Ab Initio?
A. Both relate to the multifile system. A file with the .mfctl extension is the control file created when using MFS; it contains the URLs of all the data partitions. A file with the .mdir extension contains the URL of the control file used by the MFS.

Q. How do you separate duplicate records from a grouped input file without using Dedup Sorted?
A. Use a Rollup component with aggregation functions such as first or last. Rollup removes key-based duplicates: with first() it keeps the first record for each key and rejects the rest; with last() it keeps the last.
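A minimal sketch of the first() approach, assuming illustrative fields id and amount, with the component keyed on {id}:

    out :: rollup(in) =
    begin
      out.id     :: first(in.id);
      out.amount :: first(in.amount);
    end;

Swapping first for last keeps the final record of each group instead.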
Q. How do you rerun a graph in UNIX?
A. You can rerun a graph by using the AB_JOB variable. Or resume it by giving the command dtm run -continue in UNIX. Or: whenever a graph fails it creates a .rec recovery file in the working directory (usually where the deployed script of the graph is stored); roll the job back cleanly with m_rollback -d and then run the deployed script of the graph again from UNIX.

Q. What do you mean by rerun?
A. Your graph failed and you want to run it again, or you want to run multiple instances of the graph.

Q. How do you pass parameters to a graph in Ab Initio?
A. Using input parameters / graph parameters. Declare a formal parameter in the Edit > Parameters dialog of the GDE; when running the .ksh you can then pass the value on the command line.

Q. Which component does not work in pipeline parallelism?
A. The Sort component: it blocks pipeline parallelism because all the data must be read before any record can be written. More generally, Sort, Sort within Groups, and Rollup (the sort-based components) break pipeline parallelism.

Q. How does one make use of the "Call Web Service" component in the $AB_HOME/connectors/Internet directory of the component selector window of the Ab Initio Console? Explain with sample code.
A.
Q. What is a patch database (IPD etc.)?
A.

Q. How do you check whether the root disk failed?
A.

Q. How do you restore a whole OS backup and a selected single file?
A.

Q. How do you create SCDs (slowly changing dimensions) in Ab Initio?
A. To implement SCDs in Ab Initio, you do delta processing.

Q. How do you join two files with different layouts?
A. If the two files have totally different layouts, use the Fuse component (read about it in the Ab Initio help). To join a serial file with a multifile, place a Broadcast component between the serial file and the Join.

Q. What is a vector field? Explain.
A. A vector is a DML field that holds a repeating sequence of elements of the same type; the element type and the length (or count) of the vector are specified in the DML. Vectors are what Normalize and Denormalize Sorted work on: Denormalize Sorted packs a group of input records into a vector field of a single output record, while Normalize does the reverse, generating one output record per vector element (see the DML sketch below).

Q. Which file should we keep as a lookup file, a large file or a file with fewer records, and why?
A. Always use the small file (the one with fewer records) as the lookup. That file is held in main memory (RAM) from the start to the end of the graph run, so the smaller the file, the better the server performance. If the data grows every day, it is unwise to use the bigger file as a lookup: performance becomes poor and it spoils the lookup concept.
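A small DML sketch of a record containing a vector field (field names are illustrative):

    record
      decimal(",") customer_id;
      decimal(",")[12] monthly_totals; /* fixed-length vector of 12 elements */
      string("\n") newline;
    end;

Denormalize Sorted would fill monthly_totals from up to 12 input records per customer_id; Normalize would emit one output record per element.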
Q. How does metadata management take place in Ab Initio?
A. Through the EME, which follows a UNIX-like file structure.

Q. Is there a way of implementing a file listener in Ab Initio? It should continuously scan a given directory and, as soon as a file is placed in that directory, copy it to a working directory and trigger a corresponding Ab Initio graph.
A. You can build this with the continuous components. It requires an environment setup, though; read the Ab Initio help by searching on "Continuous graphs".

Q. How many sandboxes can there be for a project?
A. A project can have many sandboxes; many developers typically work in different sandboxes attached to a single project. A sandbox is simply a user's work area, where each user gets a copy of the project and makes modifications. There can be numerous sandboxes for a project, but only one sandbox should be associated with the EME for a project.

Q. How will you connect two servers?
A. Connecting two different servers in Ab Initio is done through the .abinitiorc configuration file, which is used for remote connectivity. It contains information such as the server name (or IP), the user name, and the password required to connect.

Q. How can you extract and load without transforming?
A. Provided the DML is the same, you can directly connect the input and output datasets and perform an extract-and-load. For example, if the input dataset is a table and the output is a file, connect them directly, making sure the file's DML is propagated from the table.

Q. If I want to run the graph in UNIX, what command do I need to use?
A.
1. Design the graph.
2. Save it.
3. Run it.
4. Go to the Run menu and choose Deploy; Ab Initio generates a .ksh for the graph in the run directory of your sandbox.
5. In the run directory of the sandbox you will find your graph.ksh, which you can execute.
Q. How can I implement insert, update, and delete in Ab Initio?
A. To find the records that should be inserted, updated, or deleted, use a flow like this:
a. Unload the master table.
b. Read the delta file.
c. Inner-join a and b. The unused records from a are your deletes (if required), the unused records from b are your inserts, and the joined records are your updates.

Q. How will you view an MFS in UNIX?
A. Run the m_expand command.

Q. What is the difference between conditional DML and a conditional component?
A. Conditional DML lets the record format be chosen at run time via a program variable; a conditional component is executed only when the condition passed to the graph evaluates to true.

Q. What is the difference between "In memory" sort and "Inputs must be sorted"?
A. These options exist on the Join, Rollup, and Dedup components. The main difference: if you select "Inputs must be sorted", the downstream components receive the records in sorted order; if you select the in-memory option, the downstream components do not receive sorted records.

Q. A graph failed: how is recovery achieved?
A. There are several reasons a graph may fail. When a graph fails, Ab Initio creates a .rec recovery file in the run directory of your sandbox. To roll the job back, use the m_rollback command from UNIX, or use the m_cleanup utility.
Q. What is meant by header and trailer records? Suppose the header and trailer contain junk data: how will you delete the junk, and which components are used?
A.
1. If you know the signature of the header and trailer records, use a Filter by Expression component to filter them out.
2. Use a Reformat component and, inside the transform, use the next_in_sequence() function to assign a unique number to each record; then use a Filter by Expression to filter records based on the sequence numbers (see the sketch below).
3. As in step 2, but use a Leading Records component instead of the Filter by Expression to filter out the header and trailer records.

Q. There are 10,000 records and I loaded 4,000 today; I need to load records 4,001-10,000 the next day. How is this done in Type 1 and in Type 2?
A. Simply take a Reformat component and put next_in_sequence() > 4000 in its select parameter.
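A minimal sketch of the record-numbering approach used in both answers above (field names illustrative): the Reformat stamps each record with a sequence number, and a downstream Filter by Expression keeps only the wanted records.

    out :: reformat(in) =
    begin
      out.* :: in.*;
      out.seq_no :: next_in_sequence();
    end;

The Filter by Expression's select parameter can then be, for example, seq_no > 1 to drop a one-line header.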
Q. What are the steps in actual Ab Initio graph processing, including general, pre-, and post-process settings?
A. 1. Start script. 2. Graph components. 3. End script.

Q. What are .air-project-parameters and .air-sandbox-overrides? What is the relation between them?
A. .air-project-parameters contains the parameter definitions of all the parameters within a sandbox; it is maintained by the GDE and the Ab Initio environment scripts. .air-sandbox-overrides exists only if you are using version 1.11 or later of the GDE; it contains the user's private values for any parameters in .air-project-parameters that have the Private Value flag set, and it has the same format as .air-project-parameters. When you edit a value (in the GDE) for a parameter that has the Private Value flag checked, the value is stored in .air-sandbox-overrides rather than in .air-project-parameters.

Q. In the Join component, which records go to the unused port and which go to the reject port?
A. In an inner join, all records not matching on the specified key go to the respective unused ports; in a full outer join, no records go to the unused ports. Records that do not match the DML, or whose join transform evaluates to NULL, go to the reject port; the component aborts once the number of rejected records exceeds limit + (ramp * number of input records so far).

Q. What is meant by repartitioning, and in how many ways can it be done?
A. Repartitioning means changing one or both of the following:
1) the degree of parallelism of partitioned data;
2) the grouping of records within the partitions of partitioned data.

Q. How do you create a surrogate key using Ab Initio?
A. There are many ways to create a surrogate key, depending on your business logic.
You can try these approaches:
1. Use the next_in_sequence() function in your transform (as sketched below).
2. Use the Assign Keys component (if your GDE is higher than 1.10).
3. Write a stored procedure and call it wherever you need the key.
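A sketch of option 1 that stays unique when the graph runs in parallel (field names illustrative; this_partition() and number_of_partitions() are assumed to be available, as in standard DML):

    out :: reformat(in) =
    begin
      out.* :: in.*;
      /* interleave the per-partition sequences so keys never collide */
      out.surrogate_key :: (next_in_sequence() - 1) * number_of_partitions() + this_partition() + 1;
    end;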
Q. What is a semi-join?
A. In Ab Initio there are three types of join:
1. inner join;
2. outer join;
3. semi-join.
For an inner join, the record_requiredn parameter (where n is the input port number) is true for all in ports. For an outer join, it is false for all in ports. For a semi-join, set record_requiredn to true for the required port and false for the other ports.
Q. How will you ensure that components created in one version do not malfunction or cease functioning in another version?
A. The runtime behaviour of components remains the same across versions, unless a version requires an additional parameter to be defined. Evolution of new versions of the ETL tool comes with some changes to component-level parameters. Components should be compatible with previous versions of the GDE, and deprecated components still run in newer versions.

Q. What data modelling do you follow while loading data into tables? Does the DB you are inserting into have a star schema or a snowflake schema?
A.
Q. How does the force_error function work? If we set "never abort" in a Reformat, will force_error stop the graph, or will it continue to process the next set of records?
A. You can set two behaviours on the Reformat component: (1) if you want the graph to fail, set the reject threshold to "Abort on first reject"; (2) if you do not want it to fail, set it to "Never abort". The force_error() function does not stop the graph by itself: it writes the error message and the offending record to the error port and the component moves on to the next record. Whether the graph ultimately aborts depends on the reject threshold.
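An illustrative rule (the field name is assumed) showing force_error inside a Reformat transform:

    out :: reformat(in) =
    begin
      out.sal :: if (in.sal >= 500) in.sal
                 else force_error("Employee salary is less than 500");
    end;

With the reject threshold set to "Never abort", records failing the check go to the error/reject ports and processing continues.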
Q. Phase versus checkpoint?
A. A phase breaks the graph into blocks; it creates temporary files while running and deletes them once the phase completes. A checkpoint is used for recovery: when the graph is interrupted, instead of rerunning the graph from the start, execution resumes from the point where it stopped.

Q. What is the function of XFR in Ab Initio? What does it do, where is it stored, and how does it affect a graph?
A. When you create a new sandbox in the Ab Initio environment, directories such as mp, dml, xfr, and db are created. xfr is the directory where we can write our own functions and use them during transformations (Rollup, Reformat, etc.). For example, you can write a function to convert a string into a decimal, or to get the maximum length of two strings, in a file such as user_defined_functions.xfr inside the xfr directory. In any transform component you can include the file, e.g. include "$AI_XFR/user_defined_functions.xfr", and then call the functions like any other DML function.
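A minimal sketch of such a file (the function and file names are illustrative):

    /* user_defined_functions.xfr */
    out :: get_string_max_length(s1, s2) =
    begin
      out :: if (length_of(s1) > length_of(s2)) length_of(s1) else length_of(s2);
    end;

After including the file, a rule such as out.len :: get_string_max_length(in.first_name, in.last_name); works like any built-in function.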
Q. What is the difference between the flows of the three parallelisms?
A. Parallelism is of three types:
1. Component parallelism: different program components run simultaneously on different data sets.
2. Pipeline parallelism: components connected by straight flows run simultaneously, each working on different records of the same data stream. Sort-based components break pipeline parallelism, e.g. Sort, Sort within Groups, aggregates, Rollup, Join.
3. Data parallelism: data records are distributed across multiple partitions using partition components, and the same component works on each partition simultaneously.

Q. How can I calculate the total memory requirement of a graph?
A. You can roughly calculate the memory requirement as follows:
1. Each partition of a component uses about 7 MB plus its max-core (if any).
2. Add the size of the lookup files used in the phase (if multiple components use the same lookup, count it only once).
3. Multiply by the degree of parallelism, and add up all components in a phase; that is how much memory is used in that phase.
4. The total memory requirement of the graph is at least that of the largest-memory phase in the graph.
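For instance (numbers purely illustrative): a phase containing a 4-way parallel Join with a max-core of 100 MB plus a 200 MB lookup file needs roughly (7 MB + 100 MB) * 4 + 200 MB = 628 MB.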
Q. How can I achieve a cumulative summary in Ab Initio other than by using the SCAN component? Is there any built-in function available for that?
A. Scan is really the simplest way to achieve this. Another way is to use a ROLLUP, since it is a multistage component: put the ROLLUP into multistage form and write the intermediate results to a temporary array (vectors, in Ab Initio terms). The ROLLUP loops through each record in your defined group. Say you want intermediate results by date: sort your data by {ID; DATE} first, then ROLLUP by {ID}. The ROLLUP executes its rollup transform once for each record per ID, so store your results in a temporary vector initialized to the size of your largest group; each time the rollup transform runs, write to position [i] in the array and increment i. As long as this is all done in the "rollup" transform and not the "finalize" transform, the "initialize" portion runs again before moving to the next ID. This technique works (Ab Initio documentation does not explain it in detail), but Scan is easier.
Or, there are three alternatives:
1) use the built-in Scan with Rollup component;
2) use the plain Rollup component as described above;
3) use Scan followed by Dedup Sorted, keeping the last record.
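For reference, a minimal Scan transform sketch in aggregation mode (field names illustrative; the component is keyed on {id}):

    out :: scan(in) =
    begin
      out.id            :: in.id;
      out.txn_date      :: in.txn_date;
      out.running_total :: sum(in.amount); /* cumulative sum within each id group */
    end;

Scan emits one output record per input record, so running_total carries the cumulative sum up to and including the current record.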
Q. I have a file containing 5 unique rows and I pass them through a SORT component using a null key, then pass the output of SORT to Dedup Sorted. What will happen; what will the output be?
A. If no key is used in the Sort, then with Dedup Sorted the output depends on the keep parameter: if it is set to first, the output has only the first record; if set to last, only the last record; if set to unique_only, there are no records in the output.

Q. Can we process 1 GB of data (1 million records) using a lookup? How?
A. It is not advisable to use a 1 GB lookup file; it will affect the parallel processing of other applications and hurt performance. In such a case, prefer an MFS (partitioned) lookup file rather than a serial lookup file.

Q. If I have two files with fields file1(A,B,C) and file2(A,B,D), and we partition both files on key A using Partition by Key and pass the outputs to a Join component with join key (A,B), will it join or not, and why?
A.
Q. In my sandbox I have 10 graphs, and I checked those graphs into the EME. I then checked out a graph and made modifications, and found the modifications were wrong. What do I have to do to get the original graph back?
A.
Q. How do I create subgraphs in Ab Initio?
A.

Q. What is a sandbox?
A. A sandbox is a directory structure in which each directory level is assigned a variable name; it is used to manage check-in and check-out of repository-based objects such as graphs.

    fin ------> top-level directory ($AI_PROJECT)
     |
     |---- dml ------> second-level directory ($AI_DML)
     |
     |---- xfr ------> second-level directory ($AI_XFR)
     |
     |---- run ------> second-level directory ($AI_RUN)

You require a sandbox when you use the EME (the repository software) to maintain release control. Within the EME, an identical structure exists for the same project.
The above-mentioned structure exists under the OS (e.g. UNIX) for the project called fin, which is usually the name of the top-level directory. In the EME, a similar structure exists for the project fin. When you check out or check in a whole project, or an object belonging to a project, the information is exchanged between these two structures. For instance, if you check out a DML called fin.dml for the project fin, you need a sandbox with the same structure as the EME project called fin. Once you have created it, fin.dml (or a copy of it) comes out of the EME and is placed in the dml directory of your sandbox.

Q. I have a job that does the following: FTPs files from a remote server; reformats the data in those files and updates the database; deletes the temporary files. How do we trap errors generated by Ab Initio when the FTP fails? If I have to rerun or restart a graph, what points should be considered? Does the *.rec file have anything to do with it?
A. Ab Initio has very good restartability and recovery features built in. In your situation you can do the tasks you mentioned in one graph with phase breaks: FTP in phase 0, the transformation in the next phase, and the DB update in another phase. (This is just an example; the best design depends on various other factors.) If the graph fails during the FTP, it fails in phase 0 and you can simply restart it. If it fails in phase 1, an AB_JOB.rec file exists, and when you restart the graph you will see a message saying a recovery file exists and asking whether you want to start from the last successful checkpoint or restart from the beginning; the same applies if it fails in phase 2. Phases are expensive from a disk I/O perspective, so be careful not to over-phase.
Coming back to error trapping: each component has reject, error, and log ports. The reject port captures rejected records, the error port captures the corresponding errors, and the log port captures the execution statistics of the component. You can control the reject behaviour of each component by setting the reject threshold to "Never abort" or "Abort on first reject", or by setting a ramp/limit.
Recovery files keep track of crucial information for recovering the graph from a failed state, such as which node each component is executing on. It is a bad idea simply to remove *.rec files; always roll back the recovery files cleanly so that temporary files created during graph execution do not hang around, occupy disk space, and create issues. Always use m_rollback -d.

Q. What is an ad hoc multifile? How is it used?
A. Ad hoc multifiles treat several serial files having the same record format as a single graph component. Frequently, the input of a graph consists of a set of serial files, all of which have to be processed as a unit. An ad hoc multifile is a multifile created on the fly out of a set of serial files, without needing to define a multifile system to contain it. This enables you to represent the needed set of serial files with a single input file component in the graph. Moreover, the set of files used by the component can be determined at runtime, which lets the user customize which set of files the graph uses as input without having to change the graph itself, even after it goes into production. Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an ad hoc multifile is to list the files explicitly, as follows:
1. Insert an Input File component in your graph.
2. Open the Properties dialog and select the Description tab.
3. Select Partitions in the Data Location section of the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name; click New again and enter the second file name, and so on.
6. Click OK.
If you have added n files, the input file now acts something like a file in an n-way multifile system whose data partitions are the n files you listed. Components can run in the layout of the input file component; however, there is no way to run commands such as m_ls or m_dump on the files, because they do not comprise a real multifile system.
There are other ways than listing the input files explicitly:
1. Listing files using wildcards. If the input file names share a common pattern, use a wildcard, e.g. $AI_SERIAL/ad_hoc_input_*.dat. All files matching the pattern at runtime are taken into the ad hoc multifile.
2. Listing files in a variable. Create a runtime parameter for the graph and list all the files in it, separated by spaces.
3. Listing files using a command, e.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used. This method gives maximum flexibility in choosing the input files, since you can use complex commands involving the owner of the file or a date-time stamp.

Q. What is the difference between Replicate and Broadcast?
A. Broadcast and Replicate are similar components, but generally Replicate is used to increase component parallelism, emitting multiple straight flows to separate pipelines, while Broadcast is used to increase data parallelism by feeding records to fan-out or all-to-all flows. Broadcast can be used to feed a Join (for example, broadcasting a small serial file to all partitions), whereas Replicate cannot. By default, Replicate emits straight flows and Broadcast emits fan-out or all-to-all flows.

Replicate supports component parallelism:

    Input File ----> Replicate ----> Reformat ----> Output File
                         |
                         +---------> Rollup ------> Output File

Broadcast supports data parallelism:

    Input File 1 (multifile) -----------------> JOIN ----> Output File
                                                  ^
                                                  |
    Input File 2 (serial) ----> Broadcast --------+

Here Input File 2 is a serial file being joined with a multifile without being partitioned: the Broadcast component writes its data to all partitions of Input File 1's layout, creating an implicit fan-out flow.

The short answer is that Replicate copies a flow while Broadcast multiplies it: Broadcast is a partitioner, whereas Replicate is a simple flow-copy mechanism. Replicate appears in over 90% of all Ab Initio graphs (across implementations worldwide), while Broadcast appears in less than 1%. You won't see any difference between the two until you go data-parallel. Here is an experiment: take a simple serial input file, followed by a Broadcast, then a 4-way multifile output file component. If you run the graph with, say, 100 records in the input file, it creates 400 records in the output file: 100 records for each flow partition encountered. If you had used a Replicate, it would have read and written 100 records.
I just went through eight Ab Initio interviews, and some of the tough questions were as follows:
1. What function would you use to transform a string into a decimal?
2. How many parallelisms are there in Ab Initio? Define the three.
3. What is the difference between a db config (.dbc) file and a .cfg file?
4. Have you ever encountered an error called "depth not equal"? (This apparently occurs when you extensively create graphs; kind of a trick question.)
5. How do you truncate a table? (Each candidate would name only one of the several ways to do this.)
6. How do you improve the performance of a graph?
7. What is the difference between partitioning with key and round robin?
8. Have you worked with packages?
9. How do you add default rules in the transformer?
10. What is a ramp limit?
11. Have you used the Rollup component? Describe it.
12. How many components were in your most complicated graph?
13. Do you know what a local lookup is?

Latest Features in Ab Initio 2.14

Dynamic script generation is the latest buzz in the Ab Initio world and one of its finest features. It comes with many advantages that were not there in earlier versions of the Ab Initio Co>Operating System, and it is available in Co>Operating System version 2.14.46 and above. This feature enables the use of Ab Initio PDL (Parameter Definition Language) and component folding. If we enable it by changing the script generation method to Dynamic in the Run Settings, we can run a graph without deploying it through the GDE: from then on we execute the .mp file only, and there is no need for the .ksh. On the production server, once we run the .mp file using the air sandbox run command, a reduced script is generated on the fly containing the commands to set up the host environment; it does not include the component details of the graph at all. You can inspect the .mp file of a graph with dynamic script generation enabled: it is an editable text file.

Component folding: a feature by which the Co>Operating System combines a group of components and runs them as a single process. Does it improve performance? Yes, in most cases it brings a significant performance boost over the traditional approach of execution.
Prerequisites of component folding:
• The components must be foldable.
• They must be in the same phase and layout.
• The components must be connected via straight flows.
How it works (advantages):
1. When folding is enabled by checking the folding option in the Run Settings, the Co>Operating System runtime folds all the foldable components into a single process, so the number of processes is reduced when a graph executes. Every process carries overheads: creation of a new process, scheduling, memory consumption, and so on. These overheads vary from OS to OS; in some, such as MVS, creating and maintaining processes is very costly compared with the various flavors of UNIX.
2. Another major benefit of component folding is the reduction of DML interpretation time between processes, because the graph ends up with multitool folded processes communicating with other multitool or unitool processes.
3. Apart from that, an increase in the number of processes results in more interprocess communication. Data movement between two or more processes consumes not only time but memory too. In a CFG (continuous flow graph) interprocess communication is always very high, so it is especially worth enabling component folding in a CFG.

Disadvantages of component folding:
1. Pipeline parallelism: since component folding folds different components into a single process, it hurts Ab Initio's pipeline parallelism. If the flow of a graph is Input File -> Filter by Expression -> Reformat -> Output File, then in the traditional method pipeline parallelism lets the FBE and the Reformat execute concurrently; once the two components are folded together, there is no chance of parallel execution.
2. Address space: in a 32-bit OS the maximum address space for a process is 4 GB. If we combine four different components into a single process by component folding, the OS allows only 4 GB of address space for all four, instead of 4 x 4 = 16 GB. So we should avoid folding components whose memory use is very high, such as in-memory Rollup, Join, and Reformat with lookups. Some components, like Sort and in-memory Join, buffer data internally; combining them into a single process results in writing to disk (higher I/O).

Set the AB_MULTITOOL_MAXCORE variable to limit the maximum allowable memory for a folded component group.

Excluding a component from component folding: sometimes you will want to prevent components from being folded, to allow pipeline parallelism or to give them more address space. Set the AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES configuration variable to the space-separated mpnames of the components, in your $HOME/.abinitiorc or in the system-wide $AB_HOME/config/abinitiorc file, e.g.:
export AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES="hash-rollup reformat-transform"
Alternatively, to prevent two particular components from being folded together, right-click the flow between them and uncheck the Allow Component Folding option. Everything has its cost, so it is always worth benchmarking before deciding: prevent and allow component folding for the components of your graph, and tune it for the highest performance.

CPU tracking report of folded components in a graph: to report the execution detail of a folded graph on the console, override the AB_REPORT variable with the show-folding option, e.g.:
AB_REPORT="show-folding flows times interval=180 scroll=true spillage totals file-percentages"
The folded components are displayed as a multitool process in the CPU tracking information. The CPU time for a folded component is shown twice: once for the component itself and once as part of the multitool process.

Parameter Definition Language (PDL): PDL is used to put logic for inline computation into a parameter value. It provides high flexibility in terms of interpretation, and it supports both $ and ${} substitution. Set the parameter's interpretation to PDL and write the DML expression within $[ ]. This approach is much faster than traditional shell scripting and is the way forward to a much more flexible and robust design technique; with it we can abolish the old shell scripting, as script-start and script-end have already been beaten to death over the last few years. You can also use PDL interpretation for the condition of a conditional component.
NOTE: the documentation of PDL within the GDE lacks consistency. Basically, we can use the majority of the Ab Initio DML functions. I would recommend looking at the metaprogramming section for starters, then playing with the parameters editor.

For example, suppose a graph has a conditional component that runs based on the existence of a file called emp.dat. With a FILE_NAME parameter defined as /home/xyz/emp.dat, a condition parameter called EXIST can be defined as:
$[if (file_information($'FILE_NAME').found) 1 else 0]

We can define parameters with types and transform functions with the help of the AB_DML_DEFS parameter. For example, suppose AB_DML_DEFS is defined as:

    out :: sqrt(in) =
    begin
      out :: math_sqrt(in);
    end;

Now a parameter called SQRT can be defined as $[sqrt(16)], and the resolved value of this parameter will be 4. Ensure your host Run Settings are checked for dynamic script generation, and read the 2.14 patchset notes for a full description.
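One more illustrative PDL pattern (the parameter names here are assumptions, following the $'NAME' substitution style of the example above): a date-suffixed file-name parameter, defined with PDL interpretation as

    $[string_concat("emp_", $'DATE', ".dat")]

which, given a DATE parameter of 20240101, would resolve to emp_20240101.dat at run time.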