tutorialspoint.com
http://www.tutorialspoint.com/cgi-bin/printpage.cgi
Data Warehousing - Quick Guide

Data Warehousing - Overview

The term "Data Warehouse" was first coined by Bill Inmon in 1990. He defined a data warehouse as a subject-oriented, integrated, time-variant and nonvolatile collection of data. This data helps analysts in an organization support the decision-making process.

An operational database undergoes daily transactions, which cause frequent changes to the data. If a business executive later wants to analyse past feedback on any data, such as product, supplier or consumer data, the analyst has nothing to analyse, because the previous data has been overwritten by those transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this view, data warehouses also provide Online Analytical Processing (OLAP) tools. These tools help in interactive and effective analysis of data in multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.
Understanding Data Warehouse

- A data warehouse is a database that is kept separate from the organization's operational database.
- A data warehouse is not updated frequently.
- A data warehouse holds consolidated historical data, which helps the organization analyse its business.
- A data warehouse helps executives organize, understand and use their data to take strategic decisions.
- Data warehouse systems help in the integration of a diversity of application systems.
- A data warehouse system allows analysis of consolidated historical data.

Definition

A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process.
Why Data Warehouse Separated from Operational Databases

Data warehouses are kept separate from operational databases for the following reasons:

- An operational database is constructed for well-known tasks and workloads, such as searching for particular records and indexing. Data warehouse queries, in contrast, are often complex and present a general form of the data.
- Operational databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure the robustness and consistency of the database.
- Operational database queries read and modify data, while OLAP queries need only read-only access to stored data.
- An operational database maintains current data, whereas a data warehouse maintains historical data.
Data Warehouse Features

The key features of a data warehouse, namely Subject Oriented, Integrated, Nonvolatile and Time-Variant, are discussed below:

Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - The data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Non volatile means that the previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Note: - A data warehouse does not require transaction processing, recovery and concurrency control because it is physically stored separate from the operational database.
Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyse and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

- Financial services
- Banking services
- Consumer goods
- Retail sectors
- Controlled manufacturing
Data Warehouse Types

Information processing, analytical processing and data mining are the three types of data warehouse applications. They are discussed below:

Information Processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts or graphs.
Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill down, drill up and pivoting.

Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | Used by knowledge workers such as executives, managers and analysts. | Used by clerks, DBAs or database professionals.
3 | Used to analyse the business. | Used to run the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on the Entity Relationship Model.
6 | Subject oriented. | Application oriented.
7 | Contains historical data. | Contains current data.
8 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
9 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
10 | The number of users is in the hundreds. | The number of users is in the thousands.
11 | The number of records accessed is in the millions. | The number of records accessed is in the tens.
12 | The database size is from 100 GB to TB. | The database size is from 100 MB to GB.
13 | Highly flexible. | Provides high performance.
Data Warehousing - Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. The data warehouse is constructed by integrating data from multiple heterogeneous sources. The data warehouse supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration and data consolidation.
Using Data Warehouse Information

Decision support technologies are available that help to utilize the data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather the data, analyse it and take decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

Tuning production strategies - Product strategies can be well tuned by repositioning the products and managing product portfolios by comparing sales quarterly or yearly.

Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles etc.

Operations Analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.
Integrating Heterogeneous Databases

There are two approaches to integrating heterogeneous databases:

- Query Driven Approach
- Update Driven Approach

Query Driven Approach

This is the traditional approach to integrating heterogeneous databases. It builds wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query Driven Approach:

- When a query is issued on the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.
- These queries are then mapped and sent to the local query processors.
- The results from the heterogeneous sites are integrated into a global answer set.
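The three steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: the two in-memory "sites", the field names, the METADATA dictionary and the functions translate, run_local and mediator_query are all hypothetical and do not come from any real mediator product.

```python
# Hypothetical sketch of a query-driven mediator; all names are illustrative.

# Two "heterogeneous" sources that use different field names for the same facts.
SITE_A = [{"cust": "Asha", "amount": 120}, {"cust": "Ravi", "amount": 80}]
SITE_B = [{"customer_name": "Meera", "total": 200}]
SITES = {"site_a": SITE_A, "site_b": SITE_B}

# Metadata dictionary: maps global field names to each site's local names.
METADATA = {
    "site_a": {"customer": "cust", "sales": "amount"},
    "site_b": {"customer": "customer_name", "sales": "total"},
}


def translate(global_fields, site):
    """Translate global field names into the local names used by one site."""
    return [METADATA[site][field] for field in global_fields]


def run_local(site, local_fields):
    """Stand-in for a local query processor: project the requested fields."""
    return [tuple(row[f] for f in local_fields) for row in SITES[site]]


def mediator_query(global_fields):
    """Translate the query, send it to every site, merge the results."""
    answer = []
    for site in SITES:
        local_fields = translate(global_fields, site)
        answer.extend(run_local(site, local_fields))
    return answer


result = mediator_query(["customer", "sales"])
print(result)
```

Note that every call to mediator_query repeats the translation and the round trip to each site, which is the cost the disadvantages of this approach refer to.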
Disadvantages

- The Query Driven Approach needs complex integration and filtering processes.
- This approach is very inefficient.
- This approach is very expensive for frequent queries.
- This approach is also very expensive for queries that require aggregations.
Update Driven Approach

This is the alternative to the traditional approach. Today's data warehouse systems follow the update driven approach rather than the traditional approach discussed earlier. In the update driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages:

- This approach provides high performance.
- The data is copied, processed, integrated, annotated, summarized and restructured in a semantic data store in advance.
- Query processing does not require an interface with the processing at local sources.
Data Warehouse Tools and Utilities Functions

The following are the functions of data warehouse tools and utilities:

Data Extraction - Data extraction involves gathering data from multiple heterogeneous sources.

Data Cleaning - Data cleaning involves finding and correcting errors in the data.

Data Transformation - Data transformation involves converting data from legacy format to warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

Refreshing - Refreshing involves propagating updates from the data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and of data mining results.
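The functions listed above are often chained into a single pipeline. The following is a minimal sketch of such a chain in Python, not a real tool: the legacy record layout (ITEM, QTY, PRICE as strings), the cleaning rule and the function names are assumptions made for illustration.

```python
# Illustrative extract-clean-transform-load sketch; layouts and names are assumptions.

legacy_rows = [
    {"ITEM": " Pen ", "QTY": "3", "PRICE": "10.0"},
    {"ITEM": "Book", "QTY": "", "PRICE": "250.0"},  # missing quantity
    {"ITEM": "pen", "QTY": "2", "PRICE": "10.0"},
]


def extract(source):
    """Data extraction: gather rows from a source (here, an in-memory list)."""
    return list(source)


def clean(rows):
    """Data cleaning: drop rows with a missing quantity and trim item names."""
    cleaned = []
    for row in rows:
        if row["QTY"].strip() == "":
            continue
        cleaned.append({**row, "ITEM": row["ITEM"].strip()})
    return cleaned


def transform(rows):
    """Data transformation: convert legacy strings to typed warehouse fields."""
    return [
        {"item": r["ITEM"].lower(), "qty": int(r["QTY"]), "price": float(r["PRICE"])}
        for r in rows
    ]


def load(rows, warehouse):
    """Data loading: sort the rows and summarize them per item.

    The warehouse is a dict keyed by item name, which stands in for an index.
    """
    for r in sorted(rows, key=lambda r: r["item"]):
        summary = warehouse.setdefault(r["item"], {"qty": 0, "revenue": 0.0})
        summary["qty"] += r["qty"]
        summary["revenue"] += r["qty"] * r["price"]
    return warehouse


warehouse = load(transform(clean(extract(legacy_rows))), {})
print(warehouse)
```

In this sketch a refresh amounts to calling load again with newly extracted rows and the existing warehouse dict, so that the summaries are updated rather than rebuilt.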
Data Warehousing - Terminologies

In this article, we will discuss some of the commonly used terms in data warehousing.

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process. Let's explore this definition of a data warehouse.

Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on modelling and analysis of data for decision making.

Integrated - The data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non Volatile - Non volatile means that the previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Metadata - Metadata is simply defined as data about data. Data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:

- Metadata is a road map to the data warehouse.
- Metadata in a data warehouse defines the warehouse objects.
Metadata Repository

The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

Business Metadata - This metadata holds the data ownership information, business definitions and changing policies.

Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data migrated and the transformations applied to it.

Data for mapping from operational environment to data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning and transformation rules, and data refresh and purging rules.

The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing etc.
Data cube

A data cube helps us represent data in multiple dimensions. A data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

Illustration of Data cube

Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch and location. These dimensions make it possible to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension, known as a dimension table. The dimension table further describes the dimension. For example, the "item" dimension table may have attributes such as item_name, item_type and item_brand.

The following table represents a 2-D view of sales data for a company with respect to the time, item and location dimensions.

In this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of item sold. Suppose we want to view the sales data with one more dimension, say the location dimension. The 3-D view of the sales data with respect to time, item and location is shown in the table below:

The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:
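As a rough illustration of the idea, a data cube can be modelled in Python as a mapping from dimension values to an aggregated measure. The sales figures, item names and quarter labels below are made up for the example, and the helper names build_cube, slice_cube and roll_up are hypothetical; only the time, item and location dimensions and the New Delhi location come from the text above.

```python
from collections import defaultdict

# Made-up fact records: (time, item, location, units_sold).
facts = [
    ("Q1", "Mobile", "New Delhi", 10),
    ("Q1", "Modem", "New Delhi", 5),
    ("Q2", "Mobile", "New Delhi", 7),
    ("Q1", "Mobile", "Mumbai", 4),
]
DIMENSIONS = ("time", "item", "location")


def build_cube(records):
    """Aggregate the measure into cells keyed by (time, item, location)."""
    cube = defaultdict(int)
    for time, item, location, units in records:
        cube[(time, item, location)] += units
    return dict(cube)


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a value, leaving a lower-dimensional view."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cube.items()
        if key[axis] == value
    }


def roll_up(cube, dimension):
    """Roll up by dimension reduction: sum the measure over one dimension."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[:axis] + key[axis + 1:]] += units
    return dict(totals)


cube = build_cube(facts)
delhi_2d = slice_cube(cube, "location", "New Delhi")  # 2-D view: time x item
by_time_item = roll_up(cube, "location")              # totals across locations
print(delhi_2d)
print(by_time_item)
```

Here delhi_2d plays the role of the 2-D table for a single location, while cube itself corresponds to the 3-D view over time, item and location.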
Data mart

A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group in an organisation. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.

Points to remember about data marts:

- Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
- The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
- The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.
- Data marts are small in size.
- Data marts are customized by department.
- The source of a data mart is a departmentally structured data warehouse.
- Data marts are flexible.

Graphical representation of data mart.
Virtual Warehouse

A view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on the operational database servers.
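One way to picture a virtual warehouse is as a set of SQL views defined directly over operational tables. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the orders table, its columns and the sales_by_item view are hypothetical names chosen for the example.

```python
import sqlite3

# In-memory database standing in for an operational database server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        item TEXT,
        branch TEXT,
        amount REAL
    );
    INSERT INTO orders (item, branch, amount) VALUES
        ('pen', 'north', 10.0),
        ('pen', 'south', 15.0),
        ('book', 'north', 250.0);

    -- The "virtual warehouse": a view, so no data is copied or stored twice.
    CREATE VIEW sales_by_item AS
        SELECT item, SUM(amount) AS total_amount
        FROM orders
        GROUP BY item;
    """
)

rows = conn.execute(
    "SELECT item, total_amount FROM sales_by_item ORDER BY item"
).fetchall()
conn.close()
print(rows)
```

Because the view is evaluated against the orders table each time it is queried, the analytical work runs on the operational server itself, which is why excess capacity is needed there.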
Data Warehousing - Delivery Process

Introduction

A data warehouse is never static. It evolves as the business grows. Today's needs may be different from future needs, so we must design the data warehouse to change constantly. The real problem is that the business itself is not aware of its future requirements for information. As the business evolves, its needs also change, so the data warehouse must be designed to ride with these changes. Hence data warehouse systems need to be flexible.

There should be a delivery process to deliver the data warehouse. However, data warehouse projects have many issues that make it very difficult to complete the tasks and deliverables in the strict, ordered fashion demanded by the waterfall method, because the requirements are hardly ever fully understood. Only when the requirements are complete can the architectures, designs and build components be completed.
Delivery Method

The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. The data warehouse delivery process is staged to minimize risk. The approach discussed here does not reduce the overall delivery time-scales, but it ensures that business benefits are delivered incrementally through the development process.

Note: The delivery process is broken into phases to reduce the project and delivery risk.

The following diagram explains the stages in the delivery process:
IT Strategy

Data warehouses are strategic investments that require business process to generate the project benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The objective of the business case is to know the projected business benefits that should be derived from using the data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If the data warehouse does not have a clear business case, then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in a data warehouse project we need to understand the business case for investment.
Education and Prototyping

The organization will experiment with the concept of data analysis and educate itself on the value of a data warehouse before determining that a data warehouse is the right solution. This is addressed by prototyping. The prototyping activity helps in understanding the feasibility and benefits of a data warehouse. A prototyping activity on a small scale can further the educational process as long as:

- The prototype addresses a defined technical objective.
- The prototype can be thrown away after the feasibility concept has been shown.
- The activity addresses a small subset of the eventual data content of the data warehouse.
- The activity timescale is non-critical.

Points to remember to produce an early release of a part of a data warehouse that delivers business benefits:

- Identify an architecture that is capable of evolving.
- Focus on the business requirements and technical blueprint phases.
- Limit the scope of the first build phase to the minimum that delivers business benefits.
- Understand the short-term and medium-term requirements of the data warehouse.
Business Requirements

To provide quality deliverables we should make sure that the overall requirements are understood. The business requirements and technical blueprint stages are required for the following reason: if we understand the business requirements for both the short and medium term, then we can design a solution that satisfies the short-term need and is capable of growing into the full solution.

Things to determine in this stage are the following:

- The business rules to be applied to the data.
- The logical model for information within the data warehouse.
- The query profiles for the immediate requirement.
- The source systems that provide this data.
Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following:

- The overall system architecture.
- The data retention policy.
- The backup and recovery strategy.
- The server and data mart architecture.
- The capacity plan for hardware and infrastructure.
- The components of the database design.
Building the version

In this stage the first production deliverable is produced. This production deliverable is the smallest component of the data warehouse, and this smallest component adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase we do not add new entities, but additional physical tables would probably be created to store the increased data volumes.

As an example, suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information allows the user to analyse only the recent trends and address the short-term issues. The user cannot identify annual and seasonal trends. So 2 years' worth of sales history could be loaded from the archive to let the user analyse yearly and seasonal sales trends. The 40 GB of data is now extended to 400 GB.

Note: The backup and recovery procedures may become complex, so it is recommended to perform this activity within a separate phase.
Ad hoc Query

In this phase we configure an ad hoc query tool. This ad hoc query tool is used to operate the data warehouse. These tools can generate the database query.

Note: It is recommended not to use these access tools when the database is being substantially modified.
Automation

In this phase the operational management processes are fully automated. These would include:

- Transforming the data into a form suitable for analysis.
- Monitoring query profiles and determining the appropriate aggregations to maintain system performance.
- Extracting and loading the data from different source systems.
- Generating aggregations from predefined definitions within the data warehouse.
- Backing up, restoring and archiving the data.
Extending Scope

In this phase the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways:

- By loading additional data into the data warehouse.
- By introducing new data marts using the existing information.

Note: This phase should be performed separately, since it involves substantial effort and complexity.
Requirements Evolution

From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.

This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a pseudo application development process, where new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to users and then reworked, ensuring that the overall system is continually updated to meet the business needs.
Data Warehousing - System Processes

We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data and keeping tables small. These techniques are suitable for delivering a solution. But in the case of a decision support system, we do not know which queries and operations will need to be executed in future. Therefore the techniques applied on operational databases are not suitable for data warehouses.

In this chapter we'll focus on designing a data warehousing solution built on top of open-system technologies like Unix and relational databases.
Process Flow in Data Warehouse T here are f our major process es that build a data warehous e. Here is the list of f our process es: Extract and load data. Cleaning and transf orming the data. Backup and Archive t he dat a. Managing queries & directing them to the appropriate data so urces.
Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from external sources must be reconstructed.

Points to remember about the extract and load process:

- Controlling the process
- When to initiate extract
- Loading the data

Controlling the Process

Controlling the process involves determining when to start data extraction and when to run consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should present a single, consistent version of the information to the user. For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we find customers for whom there are no associated subscriptions.
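The timing rule in the telecom example can be sketched as a simple pre-merge check. This is only an illustration: the source names and cut-off timestamps are hypothetical, and a real controlling process would read them from its own extract log.

```python
from datetime import datetime

def snapshots_consistent(cutoffs):
    """Return True when every source snapshot shares the same cut-off time."""
    return len(set(cutoffs.values())) <= 1

# Hypothetical cut-off times for the two sources in the example above.
cutoffs = {
    "customer_db": datetime(2013, 9, 4, 20, 0),          # Wednesday, 8 pm
    "subscription_events": datetime(2013, 9, 3, 20, 0),  # Tuesday, 8 pm
}

if not snapshots_consistent(cutoffs):
    print("Snapshots are a day apart: do not merge them")
```

A merge step would run only when the check returns True.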
Loading the Data

After extracting the data, it is loaded into a temporary data store. Here in the temporary data store it is cleaned up and made consistent.

Note: Consistency checks are executed only when all data sources have been loaded into the temporary data store.
Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved:

- Clean and transform the loaded data into a structure.
- Partition the data.
- Aggregation.

Clean and Transform the Loaded Data into a Structure

This will speed up the queries. Cleaning can be done in the following ways:

- Make sure the data is consistent within itself.
- Make sure the data is consistent with other data within the same data source.
- Make sure the data is consistent with data in other source systems.
- Make sure the data is consistent with data already in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data results in increased query performance and decreased operational cost. Information in the data warehouse must be transformed to support the performance requirements of the business and also to contain the ongoing operational cost.
Partition the Data

Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyse a subset or an aggregation of the detailed data.
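As a minimal sketch of why aggregation helps, the snippet below pre-computes a monthly summary from detailed rows so that a common query reads one summary cell instead of scanning the detail. The row layout and values are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical detailed sales rows: (month, item, units_sold).
detail_rows = [
    ("2013-08", "Mobile", 5),
    ("2013-08", "Modem", 4),
    ("2013-09", "Mobile", 7),
    ("2013-09", "Mobile", 2),
]

def build_monthly_summary(rows):
    """Pre-compute total units per (month, item) from the detailed rows."""
    summary = defaultdict(int)
    for month, item, units in rows:
        summary[(month, item)] += units
    return dict(summary)

monthly_summary = build_monthly_summary(detail_rows)

# A common query now reads one pre-computed cell.
print(monthly_summary[("2013-09", "Mobile")])
```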
Backup and Archive the Data

In order to recover the data in the event of data loss, software failure or hardware failure, it is necessary to back it up on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data being kept online. In this kind of scenario there is often a requirement to do month-on-month comparisons for this year and last year. In this case we require some data to be restored from the archive.
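The retention rule in the retail example (3 years kept, latest 6 months online) can be sketched as a small classifier. The function names and the tier labels are hypothetical; the 6-month and 36-month limits come from the example above.

```python
def months_between(older, newer):
    """Whole months from `older` to `newer`, each given as (year, month)."""
    return (newer[0] - older[0]) * 12 + (newer[1] - older[1])

def storage_tier(data_month, current_month, online_months=6, retention_months=36):
    """Classify a month of data as online, archived, or past retention."""
    age = months_between(data_month, current_month)
    if age < online_months:
        return "online"
    if age < retention_months:
        return "archive"
    return "expired"

current = (2013, 9)
# Comparing September 2013 with September 2012 needs a restore from the archive.
print(storage_tier((2013, 9), current), storage_tier((2012, 9), current))
```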
Query Management Process

This process performs the following functions:

- It manages the queries.
- It speeds up query execution.
- It directs the queries to the most effective data sources.
- It should also ensure that all system sources are used in the most effective way.
- It is also required to monitor actual query profiles.

Information gathered by this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.
Data Warehousing - Architecture

In this chapter we will discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.

Business Analysis Framework

Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse has the following advantages for the business:

- Since the data warehouse can gather information quickly and efficiently, it can enhance business productivity.
- The data warehouse provides a consistent view of customers and items, and hence helps us to manage the customer relationship.
- The data warehouse also helps to reduce cost by tracking trends and patterns over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse we are required to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows:

- The top-down view - This view allows the selection of the relevant information needed for the data warehouse.
- The data source view - This view presents the information being captured, stored, and managed by the operational system.
- The data warehouse view - This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
- The business query view - It is the view of the data from the viewpoint of the end user.
Three-Tier Data Warehouse Architecture

Generally data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.

- Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.
- Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:
  - By relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
  - By the multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
- Top Tier - This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

The following diagram explains the three-tier architecture of a data warehouse:
Data Warehouse Models

From the perspective of data warehouse architecture we have the following data warehouse models:

- Virtual Warehouse
- Data Mart
- Enterprise Warehouse

Virtual Warehouse

A view over operational databases is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on the operational database servers.

Data Mart

A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group within an organisation.

Note: In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.

Points to remember about data marts:

- Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
- The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
- The life cycle of a data mart may become complex in the long run if its planning and design are not organisation-wide.
- Data marts are small in size.
- Data marts are customized by department.
- The source of a data mart is a departmentally structured data warehouse.
- Data marts are flexible.

Enterprise Warehouse

- The enterprise warehouse collects all the information about all the subjects spanning the entire organization.
- It provides enterprise-wide data integration.
- The data is integrated from operational systems and external information providers.
- This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager

This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from data warehouse to data warehouse.

Load Manager Architecture

The load manager performs the following functions:

- Extract the data from the source system.
- Fast-load the extracted data into a temporary data store.
- Perform simple transformations into a structure similar to the one in the data warehouse.

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. Transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology proves to be unsuitable, since gateways tend not to perform well when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After this has been completed we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions. We then need to perform the following checks:

- Strip out all the columns that are not required within the warehouse.
- Convert all the values to the required data types.
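The two simple transformations above can be sketched as follows. The column names, the extra till columns and the ISO date format are assumptions made for illustration, not the layout of any real EPOS feed.

```python
from datetime import date
from decimal import Decimal

# Hypothetical warehouse layout: column name -> converter to the required type.
WAREHOUSE_COLUMNS = {
    "product_id": int,
    "quantity": int,
    "value": Decimal,
    "sales_date": date.fromisoformat,
}

def transform_row(raw):
    """Keep only the warehouse columns and convert each value to its required type."""
    return {col: convert(raw[col]) for col, convert in WAREHOUSE_COLUMNS.items()}

# A raw EPOS record arrives as strings and carries columns the warehouse does not need.
raw = {
    "product_id": "30",
    "quantity": "5",
    "value": "3.67",
    "sales_date": "2013-08-03",
    "till_id": "T4",
    "cashier": "anon",
}
clean = transform_row(raw)
```

Driving both steps from a single column map keeps the stripped column set and the type conversions in one place.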
Warehouse Manager

The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs and shell scripts.

Warehouse Manager Architecture

The warehouse manager includes the following:

- The controlling process
- Stored procedures or C with SQL
- Backup/recovery tool
- SQL scripts

Operations Performed by Warehouse Manager

- Analyses the data to perform consistency and referential integrity checks.
- Creates the indexes, business views and partition views against the base data.
- Generates new aggregations and updates the existing aggregations.
- Generates the normalizations.
- Transforms and merges the source data in the temporary store into the published data warehouse.
- Backs up the data in the data warehouse.
- Archives the data that has reached the end of its captured life.

Note: The warehouse manager also analyses query profiles to determine whether the indexes and aggregations are appropriate.
Query Manager

- The query manager is responsible for directing the queries to the suitable tables.
- By directing the queries to the appropriate tables, the query request and response process is sped up.
- The query manager is responsible for scheduling the execution of the queries posed by the user.
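One way to picture query redirection is a router that sends each query to the smallest table that still contains every column the query needs. This is a sketch only: the table catalogue, column names and row counts are hypothetical, and real query managers use far richer statistics.

```python
# Hypothetical catalogue: table name -> (columns available, approximate row count).
TABLES = {
    "sales_detail": ({"day", "month", "item", "branch"}, 50_000_000),
    "sales_by_month_item": ({"month", "item"}, 12_000),
    "sales_by_month": ({"month"}, 36),
}

def route_query(needed_columns):
    """Pick the smallest table whose columns cover the query."""
    candidates = [
        (rows, name)
        for name, (columns, rows) in TABLES.items()
        if needed_columns <= columns
    ]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed_columns)}")
    return min(candidates)[1]

print(route_query({"month", "item"}))  # served from a summary table
print(route_query({"day", "item"}))    # falls back to the detailed table
```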
Query Manager Architecture

The query manager includes the following:

- Query redirection via C tool or RDBMS.
- Stored procedures.
- Query management tool.
- Query scheduling via C tool or RDBMS.
- Query scheduling via third-party software.

Detailed Information

The following diagram shows the detailed information.

The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.

Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

Summary Information

- In this area of the data warehouse the predefined aggregations are kept.
- These aggregations are generated by the warehouse manager.
- This area changes on an ongoing basis in order to respond to changing query profiles.
- This area of the data warehouse must be treated as transient.

Points to remember about summary information:

- The summary data speeds up the performance of common queries.
- It increases the operational cost.
- It needs to be updated whenever new data is loaded into the data warehouse.
- It may not have been backed up, since it can be generated afresh from the detailed information.
Data Warehousing - OLAP

Introduction

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to gain insight into the information through fast, consistent, interactive access. In this chapter we will discuss the types of OLAP servers, the operations on OLAP, and the differences between OLAP, statistical databases and OLTP.

Types of OLAP Servers

We have four types of OLAP servers, listed below:

- Relational OLAP (ROLAP)
- Multidimensional OLAP (MOLAP)
- Hybrid OLAP (HOLAP)
- Specialized SQL Servers
Relational OLAP (ROLAP)

Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, relational OLAP uses a relational or extended-relational DBMS. ROLAP includes the following:

- Implementation of aggregation navigation logic.
- Optimization for each DBMS back end.
- Additional tools and services.

Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows large volumes of detailed data to be stored, while the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations

Since the OLAP server is based on the multidimensional view of data, we will discuss the OLAP operations on multidimensional data. Here is the list of OLAP operations:

- Roll-up
- Drill-down
- Slice and dice
- Pivot (rotate)

Roll-up

This operation performs aggregation on a data cube in either of the following ways:

- By climbing up a concept hierarchy for a dimension.
- By dimension reduction.

Consider the following diagram showing the roll-up operation.

- The roll-up operation is performed by climbing up a concept hierarchy for the dimension location.
- Initially the concept hierarchy was "street < city < province < country".
- On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
- The data is grouped into countries rather than cities.
- When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
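Both forms of roll-up can be sketched on a toy cube held as a dictionary. The cell values and the city-to-country map are made up for illustration; a real OLAP server would not store a cube this way.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension (city -> country).
CITY_TO_COUNTRY = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

# Toy cube: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Vancouver", "Q1", "Mobile"): 825,
    ("Chicago", "Q1", "Mobile"): 440,
    ("New York", "Q1", "Modem"): 1560,
    ("Toronto", "Q2", "Modem"): 14,
}

def roll_up_location(cells):
    """Climb the location hierarchy from city to country, summing the measure."""
    rolled = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        rolled[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(rolled)

def drop_dimension(cells, axis):
    """Roll up by dimension reduction: remove one axis and sum over it."""
    rolled = defaultdict(int)
    for key, units in cells.items():
        rolled[key[:axis] + key[axis + 1:]] += units
    return dict(rolled)

by_country = roll_up_location(cube)
without_item = drop_dimension(cube, 2)
```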
Drill-down

The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:

- By stepping down a concept hierarchy for a dimension.
- By introducing a new dimension.

Consider the following diagram showing the drill-down operation:

- The drill-down operation is performed by stepping down a concept hierarchy for the dimension time.
- Initially the concept hierarchy was "day < month < quarter < year".
- On drilling down, the time dimension is descended from the level of quarter to the level of month.
- When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
- It navigates from less detailed data to highly detailed data.
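Drilling down along the time hierarchy can be sketched as moving from a quarterly view back to the monthly cells behind one quarter. The monthly figures are hypothetical, and the sketch assumes the month-level detail is still available.

```python
from collections import defaultdict

MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Toy month-level cells: (month, item) -> units sold.
monthly = {
    ("Jan", "Mobile"): 100,
    ("Feb", "Mobile"): 150,
    ("Mar", "Mobile"): 200,
    ("Apr", "Mobile"): 120,
}

def quarterly_view(cells):
    """The less detailed view: months aggregated into quarters."""
    view = defaultdict(int)
    for (month, item), units in cells.items():
        view[(MONTH_TO_QUARTER[month], item)] += units
    return dict(view)

def drill_down(cells, quarter):
    """Step down from one quarter to the monthly cells that make it up."""
    return {
        key: units
        for key, units in cells.items()
        if MONTH_TO_QUARTER[key[0]] == quarter
    }

quarters = quarterly_view(monthly)
q1_months = drill_down(monthly, "Q1")
```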
Slice

The slice operation performs a selection on one dimension of a given cube and gives us a new subcube. Consider the following diagram showing the slice operation.

- The slice operation is performed for the dimension time using the criterion time = "Q1".
- It forms a new subcube by fixing that single dimension to the selected value.

Dice

The dice operation performs a selection on two or more dimensions of a given cube and gives us a new subcube. Consider the following diagram showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

- (location = "Toronto" or "Vancouver")
- (time = "Q1" or "Q2")
- (item = "Mobile" or "Modem")
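The slice and dice criteria above can be sketched on a toy cube keyed by (location, time, item). The cell values are invented, and the axis numbering (0 for location, 1 for time, 2 for item) is an assumption of this sketch.

```python
# Toy cube: (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Modem"): 14,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Chicago", "Q1", "Mobile"): 440,
    ("Vancouver", "Q3", "Phone"): 31,
}

def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cells.items()
        if key[axis] == value
    }

def dice_cube(cells, allowed):
    """Keep cells whose value on every listed axis is in the allowed set."""
    return {
        key: units
        for key, units in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }

q1_slice = slice_cube(cube, 1, "Q1")
diced = dice_cube(cube, {
    0: {"Toronto", "Vancouver"},
    1: {"Q1", "Q2"},
    2: {"Mobile", "Modem"},
})
```

The slice result has one dimension fewer than the cube, while the dice result keeps all three dimensions and only narrows their ranges.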
Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.

In this example, the item and location axes of the 2-D slice are rotated.
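For a 2-D slice, the rotation amounts to swapping the row and column axes. The sketch below uses a small made-up table with items as rows and locations as columns.

```python
# Hypothetical 2-D slice: rows are items, columns are locations.
items = ["Mobile", "Modem"]
locations = ["Toronto", "Vancouver", "Chicago"]
values = [
    [605, 825, 440],  # Mobile
    [14, 31, 512],    # Modem
]

def pivot(row_labels, col_labels, table):
    """Rotate the axes: the old columns become rows and the old rows become columns."""
    rotated = [list(column) for column in zip(*table)]
    return col_labels, row_labels, rotated

new_rows, new_cols, rotated = pivot(items, locations, values)
for label, row in zip(new_rows, rotated):
    print(label, row)
```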
OLAP vs OLTP

SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | This involves historical processing of information. | This involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | This is used to analyse the business. | This is used to run the business.
4 | It focuses on information out. | It focuses on data in.
5 | This is based on Star Schema, Snowflake Schema and Fact Constellation Schema. | This is based on the Entity Relationship Model.
6 | This is subject oriented. | This is application oriented.
7 | This contains historical data. | This contains current data.
8 | This provides summarized and consolidated data. | This provides primitive and highly detailed data.
9 | This provides a summarized and multidimensional view of data. | This provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to TB. | The database size is from 100 MB to GB.
13 | These are highly flexible. | These provide high performance.
Data Warehousing - Relational OLAP

Introduction

Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, relational OLAP uses a relational or extended-relational DBMS. ROLAP includes the following:

- Implementation of aggregation navigation logic.
- Optimization for each DBMS back end.
- Additional tools and services.

Note: ROLAP servers are highly scalable.

Points to remember

- ROLAP tools need to analyze large volumes of data across multiple dimensions.
- ROLAP tools need to store and analyze highly volatile and changeable data.

Relational OLAP Architecture

ROLAP includes the following components:

- Database server
- ROLAP server
- Front-end tool

Advantages

- ROLAP servers are highly scalable.
- They can be easily used with the existing RDBMS.
- Data can be stored efficiently, since zero facts are not stored.
- ROLAP tools do not use pre-calculated data cubes.
- The DSS Server of MicroStrategy adopts the ROLAP approach.

Disadvantages

- Poor query performance.
- Some limitations of scalability, depending on the technology architecture that is utilized.
Data Warehousing - Multidimensional OLAP

Introduction

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to remember:

- MOLAP tools need to process information with a consistent response time regardless of the level of summarization or the calculations selected.
- MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.
- MOLAP tools need the fastest possible performance.
- A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets.
- Denser subcubes are identified and stored as array structures.
- Sparse subcubes employ compression technology.
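The two-level storage idea can be sketched as a function that keeps a dense subcube as a full array and a sparse one as a map of its non-empty cells only. The 0.5 density threshold is arbitrary, and the dictionary is only a stand-in for the compression a real MOLAP server would apply.

```python
def store_subcube(cells, shape, dense_threshold=0.5):
    """Choose a representation for a 2-D subcube based on how full it is.

    cells maps (row, column) -> value for the non-empty cells only.
    """
    total_cells = shape[0] * shape[1]
    density = len(cells) / total_cells
    if density >= dense_threshold:
        # Dense: materialise the whole array, filling empty cells with 0.
        array = [[0] * shape[1] for _ in range(shape[0])]
        for (row, col), value in cells.items():
            array[row][col] = value
        return "dense", array
    # Sparse: keep only the non-empty cells.
    return "sparse", dict(cells)

dense_kind, dense_data = store_subcube({(0, 0): 1, (0, 1): 2, (1, 0): 3}, (2, 2))
sparse_kind, sparse_data = store_subcube({(0, 0): 1}, (10, 10))
```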
MOLAP Architecture

MOLAP includes the following components:

- Database server
- MOLAP server
- Front-end tool

Advantages

Here is the list of advantages of multidimensional OLAP:

- MOLAP allows the fastest indexing to the precomputed summarized data.
- It helps users who are connected to a network and need to analyze larger, less-defined data.
- It is easier to use, so MOLAP is best suited to inexperienced users.

Disadvantages

- MOLAP is not capable of containing detailed data.
- The storage utilization may be low if the data set is sparse.
MOLAP vs ROLAP

SN | MOLAP | ROLAP
1 | Information retrieval is fast. | Information retrieval is comparatively slow.
2 | It uses sparse arrays to store the data sets. | It uses relational tables.
3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
4 | It maintains a separate database for data cubes. | It may not require space other than that available in the data warehouse.
5 | DBMS facility is weak. | DBMS facility is strong.
Data Warehousing - Schemas

Introduction

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Like a database, a data warehouse also requires a schema. A database uses the relational model, whereas a data warehouse uses the star, snowflake and fact constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.

Star Schema

- In a star schema each dimension is represented by only one dimension table.
- This dimension table contains the set of attributes.
- In the following diagram we show the sales data of a company with respect to four dimensions, namely time, item, branch and location.
- There is a fact table at the centre. This fact table contains the keys to each of the four dimensions.
- The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia. The entries for such cities cause data redundancy along the attributes province_or_state and country.
Snowflake Schema

- In a snowflake schema some dimension tables are normalized.
- The normalization splits the data into additional tables.
- Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
- The item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key.
- The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, redundancy is reduced. Therefore it becomes easier to maintain and saves storage space.
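The split of the item dimension into item and supplier tables can be sketched as follows. The sample rows and the way supplier keys are assigned (one key per distinct supplier type, numbered from 1) are assumptions made for illustration.

```python
# Hypothetical star-schema item dimension: supplier_type is repeated on every row.
star_item_rows = [
    {"item_key": 1, "item_name": "Mobile X", "type": "Mobile", "brand": "Acme", "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "Modem Y", "type": "Modem", "brand": "Bolt", "supplier_type": "retail"},
    {"item_key": 3, "item_name": "Mobile Z", "type": "Mobile", "brand": "Acme", "supplier_type": "wholesale"},
]

def snowflake_items(star_rows):
    """Split the item dimension into an item table and a supplier table."""
    supplier_keys = {}  # supplier_type -> generated supplier_key
    items = []
    for row in star_rows:
        key = supplier_keys.setdefault(row["supplier_type"], len(supplier_keys) + 1)
        items.append({
            "item_key": row["item_key"],
            "item_name": row["item_name"],
            "type": row["type"],
            "brand": row["brand"],
            "supplier_key": key,
        })
    suppliers = [
        {"supplier_key": key, "supplier_type": supplier_type}
        for supplier_type, key in supplier_keys.items()
    ]
    return items, suppliers

items, suppliers = snowflake_items(star_item_rows)
```

Each supplier type is now stored once in the supplier table, and item rows carry only the key.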
Fact Constellation Schema

- In a fact constellation there are multiple fact tables. This schema is also known as a galaxy schema.
- In the following diagram we have two fact tables, namely sales and shipping.
- The sales fact table is the same as that in the star schema.
- The shipping fact table also contains two measures, namely dollars cost and units shipped.
- It is also possible for dimension tables to be shared between fact tables. For example, the time, item and location dimension tables are shared between the sales and shipping fact tables.
Schema Definition

A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

Syntax for cube definition

    define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for dimension definition

    define dimension <dimension_name> as (<attribute_or_dimension_list>)
Star Schema Definition

The star schema that we have discussed can be defined using the Data Mining Query Language (DMQL) as follows:

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
Snowflake Schema Definition

The snowflake schema that we have discussed can be defined using the Data Mining Query Language (DMQL) as follows:

    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city (city_key, city, province_or_state, country))
Fact Constellation Schema Definition

The fact constellation schema that we have discussed can be defined using the Data Mining Query Language (DMQL) as follows:

    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

    define cube shipping [time, item, shipper, from_location, to_location]:
        dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales
Data Warehousing - Partitioning Strategy

Introduction

Partitioning is done to enhance performance and make management easy. Partitioning also helps in balancing the various requirements of the system. It optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions. In this chapter we will discuss the partitioning strategies.

Why to Partition

Here is the list of reasons:

- For easy management
- To assist backup/recovery
- To enhance performance

For easy management

The fact table in a data warehouse can grow to many hundreds of gigabytes in size. A fact table this large is very hard to manage as a single entity. Therefore it needs partitioning.

To assist backup/recovery

If we have not partitioned the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only the data that is required on a regular basis. This reduces the time to load and also enhances the performance of the system.

Note: To cut down on the backup size, all partitions other than the current partition can be marked read-only. We can then put these partitions into a state where they cannot be modified, and back them up. This means that only the current partition needs to be backed up regularly.

To enhance performance

By partitioning the fact table into sets of data, query procedures can be enhanced. Query performance is enhanced because the query now scans only the partitions that are relevant. It does not have to scan a large amount of irrelevant data.
Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition into monthly segments. We can reuse the partitioned tables by removing the data in them.
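Monthly partitioning can be sketched as routing each fact row to a partition named after its month, so that a date-range query only touches the partitions it overlaps. The `sales_YYYY_MM` naming scheme and the sample rows are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import date

def partition_name(sale_date):
    """Name of the monthly partition that holds a given sale date."""
    return f"sales_{sale_date:%Y_%m}"

def partition_rows(rows):
    """Route each fact row to its monthly partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_name(row["sales_date"])].append(row)
    return dict(partitions)

def partitions_for_range(start, end):
    """Names of the monthly partitions that overlap the range [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"sales_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

rows = [
    {"product_id": 30, "sales_date": date(2013, 8, 3)},
    {"product_id": 35, "sales_date": date(2013, 9, 3)},
    {"product_id": 40, "sales_date": date(2013, 9, 3)},
]
parts = partition_rows(rows)
# A month-to-date query for September scans a single partition.
to_scan = partitions_for_range(date(2013, 9, 1), date(2013, 9, 3))
```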
Partitioning by Time into Different-Sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.

Following is the list of advantages:

- The detailed information remains available online.
- The number of physical tables is kept relatively small, which reduces the operating cost.
- This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.

Following is the list of disadvantages:

- This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's have an example.

Suppose a market function is structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query the information captured within its region, it would prove more effective to partition the fact table into regional partitions. This speeds up the queries because they do not need to scan information that is not relevant.

Following is the list of advantages:

- The query does not have to scan irrelevant data, which speeds up the query process.

Following is the list of disadvantages:

- This technique is not appropriate where the dimensions are likely to change in future. So it is worth determining that the dimension does not change in future.
- If the dimension changes, then the entire fact table would have to be repartitioned.

Note: We recommend partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by size of table
When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size. We set a predetermined size as a critical point; when the table exceeds that size, a new table partition is created.
Disadvantages:
- This partitioning is complex to manage.
Note: This partitioning requires metadata to identify what data is stored in each partition.
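The size-based scheme can be sketched as follows. This is an illustration only: the class name, the row-count limit standing in for the "predetermined size", and the key-range metadata layout are all assumptions made for the example.

```python
class SizePartitionedTable:
    """Hypothetical fact table that opens a new partition at a size limit."""

    def __init__(self, max_rows):
        self.max_rows = max_rows   # the predetermined critical size
        self.partitions = [[]]     # list of partitions, each a list of rows
        self.metadata = {}         # partition number -> (lowest key, highest key)

    def insert(self, key, row):
        # Start a new partition once the current one has reached the limit.
        if len(self.partitions[-1]) >= self.max_rows:
            self.partitions.append([])
        part_no = len(self.partitions) - 1
        self.partitions[part_no].append((key, row))
        # Keep the metadata in step so the data can be located later.
        low, high = self.metadata.get(part_no, (key, key))
        self.metadata[part_no] = (min(low, key), max(high, key))

    def locate(self, key):
        """Use the metadata to find which partitions may hold the key."""
        return [p for p, (low, high) in self.metadata.items() if low <= key <= high]


table = SizePartitionedTable(max_rows=2)
for txn_id in range(1, 6):
    table.insert(txn_id, {"value": txn_id * 10})
print(len(table.partitions))  # 3
print(table.locate(4))        # [1]
```

The metadata dictionary is what makes lookups possible, which is why this scheme is more complex to manage than partitioning on a dimension.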
Partitioning Dimensions
If a dimension contains a large number of entries, it may be necessary to partition the dimension. Here we have to check the size of the dimension. Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed the old one is archived. Metadata is used to allow the user access tool to refer to the correct table partition.
Advantages:
- This technique makes it easy to automate table management facilities within the data warehouse.
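A rough sketch of round robin partitioning is shown below. The fixed number of active partitions, the period labels, and the `sales_<period>` table names are assumptions chosen for illustration.

```python
from collections import deque


class RoundRobinPartitions:
    """Hypothetical manager that keeps a fixed number of active partitions."""

    def __init__(self, max_partitions):
        self.max_partitions = max_partitions
        self.active = deque()   # periods currently online, oldest first
        self.archived = []      # periods that have been archived
        self.metadata = {}      # period -> table name, used by access tools

    def add_period(self, period):
        # When a new partition is needed and the set is full,
        # archive the oldest partition first.
        if len(self.active) == self.max_partitions:
            oldest = self.active.popleft()
            self.archived.append(oldest)
            del self.metadata[oldest]
        self.active.append(period)
        self.metadata[period] = f"sales_{period}"

    def table_for(self, period):
        """Resolve a period to its table; None if it has been archived."""
        return self.metadata.get(period)


rr = RoundRobinPartitions(max_partitions=2)
for period in ["2013_07", "2013_08", "2013_09"]:
    rr.add_period(period)
print(rr.archived)              # ['2013_07']
print(rr.table_for("2013_09"))  # sales_2013_09
```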
Vertical Partition
In vertical partitioning the data is split vertically, that is, by columns.
Vertical partitioning can be performed in the following two ways:
- Normalization
- Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, duplicate rows are collapsed into a single row, which reduces the space required.
Table before normalization
Product_id  Quantity  Value  sales_date  Store_id  Store_name  Location   Region
30          5         3.67   3-Aug-13    16        sunny       Bangalore  S
35          4         5.33   3-Sep-13    16        sunny       Bangalore  S
40          5         2.50   3-Sep-13    64        san         Mumbai     W
45          7         5.66   3-Sep-13    16        sunny       Bangalore  S
Table after normalization
Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Quantity  Value  sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
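The split shown in the tables above can be sketched in a few lines. The function name and the tuple layout are assumptions for the example; the data is the sample data from the tables, with each store's region taken from the table before normalization.

```python
# Rows of the table before normalization:
# (product_id, quantity, value, sales_date, store_id, store_name, location, region)
rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]


def normalize(rows):
    """Split the wide rows into a store table and a sales table."""
    stores = {}  # store_id -> (store_name, location, region), one entry per store
    sales = []   # (product_id, quantity, value, sales_date, store_id)
    for product_id, quantity, value, sales_date, store_id, name, location, region in rows:
        stores[store_id] = (name, location, region)  # repeated store details collapse
        sales.append((product_id, quantity, value, sales_date, store_id))
    return stores, sales


stores, sales = normalize(rows)
print(stores)
print(len(sales))  # 4
```

The store details that were repeated on three of the four rows are now held once per store, and the sales rows keep only the `store_id` reference.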
Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Note: While using vertical partitioning, make sure that there is no requirement to perform major join operations between the two partitions.
Identify Key to Partition
It is crucial to choose the right partition key. Choosing the wrong partition key will force you to reorganize the fact table. For example, suppose we want to partition the following table.
Account_Txn_Table
- transaction_id
- account_id
- transaction_type
- value
- transaction_date
- region
- branch_name
We can choose to partition on any key. The two possible keys could be:
- region
- transaction_date
Now suppose the business is organised in 30 geographical regions and each region has a different number of branches. Partitioning by region gives us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, the latest transactions from every region will be in one partition. A user who wants to look at data within his own region then has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
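To make the comparison concrete, the sketch below partitions a handful of made-up transactions by each candidate key and counts how many partitions a single-region query has to touch. The sample rows and helper names are hypothetical.

```python
from collections import defaultdict

# A few made-up rows using columns from Account_Txn_Table.
transactions = [
    {"transaction_id": 1, "region": "north", "transaction_date": "2013-09-01"},
    {"transaction_id": 2, "region": "south", "transaction_date": "2013-09-01"},
    {"transaction_id": 3, "region": "north", "transaction_date": "2013-09-02"},
    {"transaction_id": 4, "region": "south", "transaction_date": "2013-09-02"},
    {"transaction_id": 5, "region": "north", "transaction_date": "2013-09-03"},
]


def partition(rows, key):
    """Group rows into partitions by the value of the chosen key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts


def partitions_scanned(parts, region):
    """Count the partitions that hold at least one row for the region."""
    return sum(1 for rows in parts.values() if any(r["region"] == region for r in rows))


by_region = partition(transactions, "region")
by_date = partition(transactions, "transaction_date")
print(partitions_scanned(by_region, "north"))  # 1
print(partitions_scanned(by_date, "north"))    # 3
```

With the region key, a regional query stays within one partition; with the date key, the same query spreads across every date partition that contains rows for that region.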
Data Warehousing - Metadata Concepts
What is Metadata
Metadata is simply defined as data about data. Data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:
- Metadata is a road map to the data warehouse.
- Metadata in a data warehouse defines the warehouse objects.
- Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Note: In a data warehouse we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for timestamping any extracted data and for recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories:
- Business Metadata - This metadata has the data ownership information, business definitions and changing policies.
- Technical Metadata - Technical metadata includes database system names, table and column names and sizes, data types and allowed values. It also includes structural information such as primary and foreign key attributes and indices.
- Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it was migrated and the transformations applied to it.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from that of the warehouse data, yet it is just as important. The various roles of metadata are explained below.
- Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
- Metadata helps the decision support system map data when it is transformed from the operational environment to the data warehouse environment.
- Metadata helps in summarization between current detailed data and highly summarized data.
- Metadata also helps in summarization between lightly summarized data and highly summarized data.
- Metadata is used by query tools.
- Metadata is used in reporting tools.
- Metadata is used in extraction and cleansing tools.
- Metadata is used in transformation tools.
- Metadata also plays an important role in loading functions.
Diagram to understand the role of metadata.
Metadata Repository
The metadata repository is an integral part of a data warehouse system. The metadata repository holds the following metadata:
- Definition of data warehouse - This includes the description of the structure of the data warehouse. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.
- Business Metadata - This metadata has the data ownership information, business definitions and changing policies.
- Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data as it was migrated and the transformations applied to it.
- Data for mapping from operational environment to data warehouse - This metadata includes source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.
- The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation and ensures the accuracy of calculations. Metadata also enforces a consistent definition of business terms for business end users. With all these uses, metadata also brings challenges for metadata management. Some of the challenges are discussed below.
- The metadata in a big organization is scattered across the organization. It is spread across spreadsheets, databases, and applications.
- The metadata could be present in text files or multimedia files. To use this data in an information management solution, it needs to be correctly defined.
- There are no industry-wide accepted standards. The data management solution vendors have a narrow focus.
- There are no easy and accepted methods of passing metadata.
Data Warehousing - Data Marting
Why to create Datamart
The following are the reasons to create a data mart:
- To partition data in order to impose access control strategies.
- To speed up the queries by reducing the volume of data to be scanned.
- To segment data onto different hardware platforms.
- To structure data in a form suitable for a user access tool.
Note: Do not data mart for any other reason, since the operating cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Steps to determine that data mart appears to fit the bill
The following steps need to be followed to make data marting cost effective:
- Identify the Functional Splits
- Identify User Access Tool Requirements
- Identify Access Control Issues
Identify the Functional Splits
In this step we determine whether the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. For example, suppose that in a retail organization each merchant is accountable for maximizing the sales of a group of products. For this, the following information is valuable:
- sales transactions on a daily basis
- sales forecast on a weekly basis
- stock position on a daily basis
- stock movements on a daily basis
Since merchants are not interested in products they do not deal with, each data mart is a subset of the data, dealing with the product group of interest. The following diagram shows data marting for different users.
Issues in determining the functional split:
- The structure of the department may change.
- The products might switch from one department to another.
- The merchant could query the sales trend of other products to analyse what is happening to the sales.
These are issues that need to be taken into account while determining the functional split.
Note: We need to determine the business benefits and technical feasibility of using a data mart.
Identify User Access Tool Requirements
For user access tools that require their own internal data structures, we need data marts to support such tools. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis.
Some tools can populate directly from the source system, but some cannot. Therefore additional requirements outside the scope of the tool need to be identified for the future.
Note: In order to ensure consistency of data across all access tools, the data should not be populated directly from the data warehouse; rather, each tool must have its own data mart.
Identify Access Control Issues
There need to be privacy rules to ensure that the data is accessed by authorised users only. For example, in a data warehouse for a retail banking institution, ensure that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.
Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.
Designing Data Marts
The data marts should be designed as smaller versions of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control over database instances.
The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all the dimension data in the starflake schema.
Cost Of Data Marting
The following are the cost measures for data marting:
- Hardware and Software Cost
- Network Access
- Time Window Constraints
Hardware and Software Cost
Even when the data marts are created on the same hardware, they still require some additional hardware and software. Handling the user queries requires additional processing power and disk storage. If the detailed data and the data mart both exist within the data warehouse, we face the additional cost of storing and managing replicated data.
Note: Data marting is more expensive than aggregations, therefore it should be used as an additional strategy, not as an alternative strategy.
Network Access
A data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
Time Window Constraints
The extent to which the data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The feasible number of data marts depends on:
- Network capacity
- Time window available
- Volume of data being transferred
- Mechanisms being used to insert data into the data mart
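A back-of-the-envelope check of whether a data mart load fits its time window might look like the sketch below. The formula (transfer time plus transformation time) is a deliberate simplification, and all figures are hypothetical.

```python
def load_fits_window(volume_gb, network_gb_per_hour, transform_hours, window_hours):
    """Estimate the load duration and whether it fits the available window.

    Simplifying assumption: transfer and transformation run one after the
    other, so their durations simply add up.
    """
    transfer_hours = volume_gb / network_gb_per_hour
    total_hours = transfer_hours + transform_hours
    return total_hours, total_hours <= window_hours


# Hypothetical figures: 120 GB over a 60 GB/hour link, 1.5 hours of
# transformations, and a 4 hour overnight window.
print(load_fits_window(120, 60, 1.5, 4))  # (3.5, True)
print(load_fits_window(300, 60, 1.5, 4))  # (6.5, False)
```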
Data Warehousing - System Managers
Introduction
System management is a must for the successful implementation of a data warehouse. In this chapter we will discuss the most important system managers, which are listed below.
- System Configuration Manager
- System Scheduling Manager
- System Event Manager
- System Database Manager
- System Backup Recovery Manager
System Configuration Manager
The system configuration manager is responsible for the management of the setup and configuration of the data warehouse.
The structure of the configuration manager varies from operating system to operating system.
On Unix, the structure of the configuration manager varies from vendor to vendor.
The interface of the configuration manager allows us to control all aspects of the system.
Note: The most important configuration tool is the I/O manager.
System Scheduling Manager
The system scheduling manager is also responsible for the successful implementation of the data warehouse. The purpose of the scheduling manager is to schedule the ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The features of a system scheduling manager are as follows:
- Work across cluster or MPP boundaries.
- Deal with international time differences.
- Handle job failure.
- Handle multiple queries.
- Support job priorities.
- Restart or requeue failed jobs.
- Notify the user or a process when a job is completed.
- Maintain the job schedules across system outages.
- Requeue jobs to other queues.
- Support the stopping and starting of queues.
- Log queued jobs.
- Deal with interqueue processing.
Note: The above are the evaluation parameters for a good scheduler.
Some important jobs that the scheduler must be able to handle are as follows:
- Daily and ad hoc query scheduling
- Execution of regular report requirements
- Data load
- Data processing
- Index creation
- Backup
- Aggregation creation
- Data transformation
Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.
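A few of the features above (job priorities, handling job failure, requeueing failed jobs, logging) can be illustrated with a small in-memory sketch. This is a toy example, not a real scheduler product; the class, the retry limit, and the job names are assumptions.

```python
import heapq
import itertools


class Scheduler:
    """Toy priority scheduler that requeues failed jobs a limited number of times."""

    def __init__(self, max_attempts=2):
        self.max_attempts = max_attempts
        self._queue = []                    # heap of (priority, sequence, name, job, attempts)
        self._sequence = itertools.count()  # tie-breaker keeps submission order stable
        self.log = []                       # (job name, outcome) in execution order

    def submit(self, name, job, priority=5, attempts=0):
        # Lower priority numbers run first.
        heapq.heappush(self._queue, (priority, next(self._sequence), name, job, attempts))

    def run(self):
        while self._queue:
            priority, _, name, job, attempts = heapq.heappop(self._queue)
            try:
                job()
                self.log.append((name, "completed"))
            except Exception:
                attempts += 1
                if attempts < self.max_attempts:
                    self.log.append((name, "requeued"))
                    self.submit(name, job, priority, attempts)
                else:
                    self.log.append((name, "failed"))


calls = {"count": 0}


def flaky_load():
    """Hypothetical data load that fails on its first run only."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("load failed")


scheduler = Scheduler()
scheduler.submit("regular_report", lambda: None, priority=5)
scheduler.submit("data_load", flaky_load, priority=1)
scheduler.run()
print(scheduler.log)
```

The data load runs first because of its priority, is requeued after its first failure, and completes before the lower-priority report is run.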
System Event Manager
The event manager is a piece of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore we need a tool that automatically handles all the events without the intervention of the user.
Note: The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
The question that arises is: what is an event? An event is simply an action generated by the user or by the system itself. An event is a measurable, observable occurrence of a defined action.
The following are common events that need to be tracked:
- Hardware failure.
- Running out of space on certain key disks.
- A process dying.
- A process returning an error.
- CPU usage exceeding an 80% threshold.
- Internal contention on database serialization points.
- Buffer cache hit ratios exceeding or falling below a threshold.
- A table reaching its maximum size.
- Excessive memory swapping.
- A table failing to extend due to lack of space.
- A disk exhibiting I/O bottlenecks.
- Usage of a temporary or sort area reaching certain thresholds.
- Any other database shared memory usage.
The most important thing about events is that they should be capable of executing on their own. There are event packages that define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever the event occurs.
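The relationship between events and event handlers can be sketched as a small registry. The class, the event name `cpu_threshold_exceeded`, and the 80% default threshold are illustrative assumptions rather than features of any particular event manager.

```python
class EventManager:
    """Minimal registry that runs the handlers associated with an event."""

    def __init__(self):
        self._handlers = {}  # event name -> list of handler functions

    def on(self, event_name, handler):
        """Associate handler code with a predefined event."""
        self._handlers.setdefault(event_name, []).append(handler)

    def raise_event(self, event_name, **details):
        """Run every handler for the event and collect their results."""
        return [handler(**details) for handler in self._handlers.get(event_name, [])]


def check_cpu(manager, usage_percent, threshold=80):
    """Raise the (hypothetical) CPU event when usage exceeds the threshold."""
    if usage_percent > threshold:
        return manager.raise_event("cpu_threshold_exceeded", usage=usage_percent)
    return []


manager = EventManager()
manager.on("cpu_threshold_exceeded", lambda usage: f"alert: CPU at {usage}%")
print(check_cpu(manager, 91))  # ['alert: CPU at 91%']
print(check_cpu(manager, 50))  # []
```

Because the handler is registered in advance, the event is dealt with as soon as it is raised, without user intervention.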
System and Database Manager
The system manager and the database manager are two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing the system and database manager are the ability to:
- increase a user's quota.
- assign and deassign roles to the users.
- assign and deassign profiles to the users.
- perform database space management.
- monitor and report on space usage.
- tidy up fragmented and unused space.
- add and expand the space.
- add and remove users.
- manage user passwords.
- manage summary or temporary tables.
- assign or deassign temporary space to and from the user.
- reclaim the space from old or out-of-date temporary tables.
- manage error and trace logs.
- browse log and trace files.
- redirect error or trace information.
- switch error and trace logging on and off.
- perform system space management.
- monitor and report on space usage.
- clean up old and unused file directories.
- add or expand space.
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back up the data. It is worth noting that the system backup manager must be integrated with the schedule manager software being used. The important features required for the management of backups are as follows:
- Scheduling
- Backup data tracking
- Database awareness
Backups are taken only to protect the data against loss. The following are the important points to remember:
- The backup software will keep some form of database of where and when each piece of data was backed up.
- The backup recovery manager must have a good front end to that database.
- The backup recovery software should be database aware.
- Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.
Data Warehousing - Process Managers
Data Warehouse Load Manager
This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from data warehouse to data warehouse.
Load Manager Architecture
The load manager performs the following functions:
- Extract the data from the source system.
- Fast load the extracted data into a temporary data store.
- Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or from external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology proves to be unsuitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be necessary to perform simple transformations. After these have been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following simple transformations:
- Strip out all the columns that are not required within the warehouse.
- Convert all the values to the required data types.
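The two steps above, stripping unneeded columns and converting values to the required types, can be sketched as follows. The column names, the ISO date format, and the sample record are assumptions made for the example.

```python
from datetime import date

# Hypothetical required columns and the type each one must be converted to.
CONVERTERS = {
    "product_id": int,
    "quantity": int,
    "value": float,
    "sales_date": date.fromisoformat,
}


def simple_transform(record):
    """Keep only the required columns and convert each value to its type."""
    return {column: convert(record[column]) for column, convert in CONVERTERS.items()}


# A raw EPOS record as it might arrive, with every value held as text.
raw = {
    "product_id": "30",
    "quantity": "5",
    "value": "3.67",
    "sales_date": "2013-08-03",
    "till_id": "T7",       # not required within the warehouse
    "cashier": "A. Rao",   # not required within the warehouse
}
print(simple_transform(raw))
```

A record with a missing column or an unconvertible value raises an exception here; a real load manager would more likely route such records to a reject area for the complex checks that follow.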
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. The warehouse manager consists of third-party system software, C programs and shell scripts. The size and complexity of the warehouse manager varies between specific solutions.
Warehouse Manager Architecture
The warehouse manager includes the following:
- The controlling process
- Stored procedures or C with SQL
- Backup/Recovery tool
- SQL scripts
Operations Performed by Warehouse Manager
- The warehouse manager analyses the data to perform consistency and referential integrity checks.
- It creates the indexes, business views and partition views against the base data.
- It generates new aggregations and also updates the existing aggregations.
- It generates the normalizations.
- It transforms and merges the source data from the temporary store into the published data warehouse.
- It backs up the data in the data warehouse.
- It archives the data that has reached the end of its captured life.
Note: The warehouse manager also analyses query profiles to determine which indexes and aggregations are appropriate.
Query Manager
The query manager is responsible for directing the queries to the suitable tables. By directing the queries to the appropriate tables, the query request and response process is speeded up.
The query manager is also responsible for scheduling the execution of the queries posed by the user.
Query Manager Architecture
The query manager includes the following:
- Query redirection via C tool or RDBMS.
- Stored procedures.
- Query management tool.
- Query scheduling via C tool or RDBMS.
- Query scheduling via third-party software.
Operations Performed by Query Manager
- The query manager directs queries to the appropriate tables.
- The query manager schedules the execution of the queries posed by the end user.
- The query manager stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
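One way to picture query redirection and query profiling is the sketch below, which routes a query to the smallest table that contains all the columns it needs and counts how often each column set is requested. The table catalogue, row counts, and routing rule are assumptions for illustration; real query managers use far richer information.

```python
from collections import Counter

# Hypothetical catalogue: the detailed fact table and one aggregation of it.
TABLES = {
    "sales_fact": {
        "columns": {"product_id", "store_id", "sales_date", "value"},
        "rows": 1_000_000,
    },
    "sales_by_store": {
        "columns": {"store_id", "value"},
        "rows": 5_000,
    },
}


class QueryManager:
    def __init__(self, tables):
        self.tables = tables
        self.profiles = Counter()  # column set -> number of times it was requested

    def route(self, needed_columns):
        """Direct the query to the smallest table that covers its columns."""
        needed = frozenset(needed_columns)
        self.profiles[needed] += 1  # stored for later index/aggregation analysis
        candidates = [
            name for name, info in self.tables.items() if needed <= info["columns"]
        ]
        if not candidates:
            raise LookupError(f"no table covers columns {sorted(needed)}")
        return min(candidates, key=lambda name: self.tables[name]["rows"])


qm = QueryManager(TABLES)
print(qm.route({"store_id", "value"}))    # sales_by_store
print(qm.route({"product_id", "value"}))  # sales_fact
```

The stored profiles show which column combinations are requested most often, which is the kind of information the warehouse manager could use when deciding on new aggregations.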
Data Warehousing - Security
Introduction
The objective of a data warehouse is to make large amounts of data easily accessible to the users, allowing them to extract information about the business as a whole. But security restrictions applied to the data can prove an obstacle to accessing that information. If an analyst has a restricted view of the data, it is impossible to capture a complete picture of the trends within the business.
The data from each analyst can be summarised and passed on to management, where the different summaries can be aggregated. As an aggregation of summaries is not the same as an aggregation of the data as a whole, it is possible to miss some information trends in the data unless someone is analysing the data as a whole.
Requirements
Adding security will affect the performance of the data warehouse, therefore it is worth determining the security requirements as early as possible. Adding security after the data warehouse has gone live is very difficult.
During the design phase of the data warehouse we should keep in mind what data sources may be added later and what the impact of adding those data sources would be. We should consider the following possibilities during the design phase:
- Will the new data sources require new security and/or audit restrictions to be implemented?
- Will new users be added who have restricted access to data that is already generally available?
This situation arises when the future users and the data sources are not well known. In such a situation we need to use our knowledge of the business and the objective of the data warehouse to work out the likely requirements.
Factors to Consider for Security Requirements
The following are the areas affected by security, hence it is worth considering these factors:
- User Access
- Data Load
- Data Movement
- Query Generation
User Access
We need to classify the data first and then classify the users by what data they can access. In other words, the users are classified according to the data they can access.
Data Classification
The following two approaches can be used to classify the data:
- The data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted and less sensitive data is classified as less restricted.
- The data can also be classified according to job function. This restriction allows only specific users to view particular data. Here we restrict users to viewing only that part of the data in which they are interested and for which they are responsible.
There are some issues with the second approach. To understand them, consider an example: suppose you are building a data warehouse for a bank, and suppose further that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is who is allowed to see the transaction data. The solution lies in classifying the data according to function.
User Classification
The following approaches can be used to classify the users:
- The users can be classified as per the hierarchy of users in an organisation, i.e. users can be classified by department, section, group, and so on.
- The users can also be classified according to their role, with people grouped across departments based on their role.
Classification on basis of Department
Consider a data warehouse where the users are from the sales and marketing departments. We can design the security with a top-down company view, with access centred around the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for each department separately. This can be achieved with departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure.
Classification on basis of Role
If the data is generally available to all the departments, then it is worth following the role access hierarchy. In other words, if the data is generally accessed by all the departments, then apply the security restrictions as per the role of the user. The role access hierarchy is shown in the following figure.
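A role access hierarchy can be approximated by giving each role a level and each data set a minimum level. The roles, data set names, and level numbers below are invented for illustration and do not come from the figure.

```python
# Hypothetical role hierarchy: a higher level can see everything a lower one can.
ROLE_LEVEL = {"analyst": 1, "manager": 2, "director": 3}

# Hypothetical minimum role level required to view each data set.
MIN_LEVEL = {"sales_summary": 1, "sales_detail": 2, "salary_data": 3}


def can_access(role, dataset):
    """Return True if the role sits high enough in the hierarchy."""
    return ROLE_LEVEL[role] >= MIN_LEVEL[dataset]


def visible_datasets(role):
    """List the data sets the role is allowed to view."""
    return sorted(name for name in MIN_LEVEL if can_access(role, name))


print(visible_datasets("analyst"))   # ['sales_summary']
print(visible_datasets("manager"))   # ['sales_detail', 'sales_summary']
print(can_access("analyst", "salary_data"))  # False
```

The restrictions follow the user's role rather than the department, so people in different departments with the same role see the same data.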
Audit Requirements
Auditing is a subset of security. Auditing is a costly activity, therefore it is worth understanding the audit requirements and the reason for each audit requirement. Auditing can cause heavy overheads on the system. To complete auditing in time we require more hardware, therefore it is recommended that, where possible, auditing should be switched off. Audit requirements can be categorized into the following:
- Connections
- Disconnections
- Data access
- Data change
Note: For each of the above categories it is necessary to audit success, failure or both. From a security perspective, the auditing of failures is very important because failures can highlight unauthorised or fraudulent access.
Network Requirements
Network security is as important as other kinds of security. We cannot ignore the network security requirements. We need to consider the following issues:
- Is it necessary to encrypt data before transferring it to the data warehouse machine?
- Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. The following are the points to remember:
- The process of encryption and decryption will increase the overheads. It requires more processing power and processing time.
- The cost of encryption can be high if the source system is already heavily loaded, because the encryption is borne by the source system.
Data Movement
There are potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised:
- Where is the flat file stored?
- Who has access to that disk space?
If we talk about the backup of these flat files, the following questions are raised:
- Do you back up encrypted or decrypted versions?
- Do these backups need to be made to special tapes that are stored separately?
- Who has access to these tapes?
Some other forms of data movement, such as query result sets, also need to be considered. The questions raised when creating a temporary table are as follows:
- Where is that temporary table to be held?
- How do you make such a table visible?
We should avoid the accidental flouting of security restrictions. If a user with access to restricted data can generate accessible temporary tables, data can be made visible to non-authorised users. We can overcome this by having a separate temporary area for users with access to restricted data.
Documentation
The audit and security requirements need to be properly documented. This will be treated as part of the justification. This document can contain all the information gathered on the following:
- Data classification
- User classification
- Network requirements
- Data movement and storage requirements
- All auditable actions
Impact of Security on Design
Security affects the application code and the development timescales. Security affects the following:
- Application development
- Database design
- Testing
Application Development
Security affects the overall application development, and it also affects the design of the important components of the data warehouse such as the load manager, warehouse manager and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. There may also be a requirement for extra metadata to handle any extra objects.
To create and maintain the extra views, the warehouse manager may require extra code to enforce the security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions. The query manager will need to be aware of all the extra views and aggregations.
Database Design
The database layout is also affected, because when security is added there is an increase in the number of views and tables. Adding security increases the size of the database and hence increases the complexity of the database design and management. It will also add complexity to the backup management and recovery plan.
Testing
Testing a data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time and complexity. It affects the testing in the following two ways:
- It will increase the time required for integration and system testing.
- There is added functionality to be tested, which will increase the size of the testing suite.
Data Warehousing - Backup

Introduction

A data warehouse holds a large volume of data and the data warehouse system is very complex, so it is important to have a backup of all the data that is available for recovery in the future as required. In this chapter I will discuss the issues in designing a backup strategy.

Backup Terminologies

Before proceeding further we should know some of the backup terminology discussed below.

Complete backup - In a complete backup the entire database is backed up at the same time. This backup includes all the database files, control files and journal files.
Partial backup - A partial backup is not a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-by-day basis, so that the whole database is effectively backed up once a week.
Cold backup - A cold backup is taken while the database is completely shut down. In a multi-instance environment all the instances should be shut down.
Hot backup - A hot backup is taken while the database engine is up and running. The requirements that need to be considered for hot backups vary from RDBMS to RDBMS. Hot backups are extremely useful.
Online backup - This is the same as a hot backup.

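The round-robin partial backup strategy described above can be sketched as follows. This is only an illustration: the part names (tablespace_01 and so on) and the choice of ten parts are assumptions, not something prescribed by any particular RDBMS.

```python
from itertools import cycle

def round_robin_schedule(parts, days):
    """Assign each database part to a day, cycling through the days in order."""
    schedule = {day: [] for day in days}
    for part, day in zip(parts, cycle(days)):
        schedule[day].append(part)
    return schedule

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Hypothetical database parts; in practice these would be real tablespaces or files.
parts = [f"tablespace_{i:02d}" for i in range(1, 11)]

schedule = round_robin_schedule(parts, days)
for day in days:
    print(day, schedule[day])
```

Because every part lands on exactly one day, the whole database is covered once per week, while each nightly backup touches only a fraction of it.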
Hardware Backup

It is important to decide which hardware to use for the backup. The choice of hardware places an upper bound on the speed at which a backup can be processed. The speed of processing a backup or restore depends not only on the hardware being used but also on how the hardware is connected, the bandwidth of the network, the backup software and the speed of the server's I/O system. Here I will discuss some of the hardware choices that are available and their pros and cons. These choices are as follows:
- Tape technology
- Disk backups

Tape Technology

The tape choices can be categorized into the following:
- Tape media
- Standalone tape drives
- Tape stackers
- Tape silos

Tape Media

Several varieties of tape media exist. Some tape media standards are listed in the table below:

Tape Media    Capacity    I/O rates
DLT           40 GB       3 MB/s
3490e         1.6 GB      3 MB/s
8 mm          14 GB       1 MB/s

Other factors that need to be considered are the following:
- Reliability of the tape medium.
- Cost of the tape medium per unit.
- Scalability.
- Cost of upgrades to the tape system.
- Shelf life of the tape medium.

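To get a feel for the capacities and I/O rates in the tape media table above, the sketch below estimates how long it takes to write one full tape of each type. It assumes 1 GB = 1000 MB and a sustained transfer at the quoted rate, both simplifications.

```python
# (capacity in GB, I/O rate in MB/s), taken from the tape media table.
TAPE_MEDIA = {
    "DLT": (40.0, 3.0),
    "3490e": (1.6, 3.0),
    "8 mm": (14.0, 1.0),
}

MB_PER_GB = 1000  # assumption: decimal units

def hours_to_fill(capacity_gb, rate_mb_per_s):
    """Hours needed to write a full tape at a sustained I/O rate."""
    seconds = capacity_gb * MB_PER_GB / rate_mb_per_s
    return seconds / 3600

fill_hours = {
    name: hours_to_fill(capacity, rate)
    for name, (capacity, rate) in TAPE_MEDIA.items()
}

for name, hours in fill_hours.items():
    print(f"{name}: about {hours:.2f} hours to write one full tape")
```

Under these assumptions a DLT tape takes roughly 3.7 hours to fill, which already consumes a large part of a typical overnight window.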
Standalone Tape Drives

Tape drives can be connected in the following ways:
- Directly to the server.
- As network-available devices.
- Remotely to another machine.

Issues of connecting the tape drives

Suppose the server is a 48-node MPP machine. Which node do you connect the tape drive to, and how do you spread the drives over the server nodes to get optimal performance with the least disruption of the server and the least internal I/O latency?

Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates needed. Make sure that sufficient bandwidth is available during the time you require it.

Connecting the tape drives remotely also requires high bandwidth.

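Whether the network or the drives are "up to the job" can be estimated with simple arithmetic: divide the data volume by the backup window. The sketch below does this; the 500 GB volume and 6-hour window are invented figures, and 1 GB = 1000 MB is assumed.

```python
import math

def required_rate_mb_s(data_gb, window_hours, mb_per_gb=1000):
    """Sustained transfer rate needed to move data_gb within window_hours."""
    return data_gb * mb_per_gb / (window_hours * 3600)

def drives_needed(data_gb, window_hours, drive_rate_mb_s):
    """Number of drives running in parallel to meet the window (rounded up)."""
    return math.ceil(required_rate_mb_s(data_gb, window_hours) / drive_rate_mb_s)

# Hypothetical example: 500 GB to back up in a 6-hour overnight window,
# using drives that sustain 3 MB/s each.
rate = required_rate_mb_s(500, 6)
drives = drives_needed(500, 6, 3.0)

print(f"required rate: {rate:.1f} MB/s, drives needed: {drives}")
```

The same required rate is what the network link must sustain if the drives are attached as network-available or remote devices.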
Tape Stackers

A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape, hence only one tape is available to be accessed at a time. Prices and capabilities vary, but the common ability is that stackers can perform unattended backups.

Tape Silos

Tape silos provide large storage capacities. They can store and manage thousands of tapes and can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for the silo to be connected remotely over a network or a dedicated link. We should ensure that the bandwidth of that connection is up to the job.

Other Technologies

The technologies other than tape are mentioned below:
- Disk backups
- Optical jukeboxes

Disk Backups

The methods of disk backup are listed below:
- Disk-to-disk backups
- Mirror breaking

These methods are used in OLTP systems. They minimize database downtime and maximize availability.

Disk-to-disk backups

In this kind of backup, the backup is taken onto disk rather than onto tape. The reasons for doing disk-to-disk backups are:
- Speed of the initial backup
- Speed of restore

Backing up data from disk to disk is much faster than backing up to tape. However, it is an intermediate step; the data is later backed up to tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup.

Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.

Note: The database may need to be shut down to guarantee the consistency of the backup.

Optical Jukeboxes

Optical jukeboxes allow data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or tape silo. The drawback of this technique is that its write speed is slower than that of disks. However, optical media provide long life and reliability, which makes them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These tools come as packages. They not only take backups but also effectively manage and control the backup strategies. There are many software packages available in the market. Some of them are listed in the following table:

Package Name    Vendor
Networker       Legato
ADSM            IBM
Epoch           Epoch Systems
Omniback II     HP
Alexandria      Sequent

Criteria For Choosing Software Packages

The criteria for choosing the best software package are listed below:
- How scalable is the product as tape drives are added?
- Does the package have a client-server option, or must it run on the database server itself?
- Will it work in cluster and MPP environments?
- What degree of parallelism is required?
- What platforms are supported by the package?
- Does the package support easy access to information about tape contents?
- Is the package database aware?
- What tape drives and tape media are supported by the package?

Data Warehousing - Tuning

Introduction

The data warehouse evolves over time, and it is unpredictable what queries users will produce in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter we will discuss how to tune the different aspects of a data warehouse such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

Here is the list of difficulties that can occur while tuning the data warehouse:
- The data warehouse never remains constant over time.
- It is very difficult to predict what queries users will produce in the future.
- The needs of the business also change with time.
- The users and their profiles never remain the same over time.
- A user can switch from one group to another.
- The data load on the warehouse also changes with time.

Note: It is very important to have complete knowledge of the data warehouse.

Performance Assessment

Here is the list of objective measures of performance:
- Average query response time
- Scan rates
- Time used per query per day
- Memory usage per process
- I/O throughput rates

Following are the points to be remembered:
- It is necessary to specify the measures in the service level agreement (SLA).
- It is of no use trying to tune response times if they are already better than those required.
- It is essential to have realistic expectations during performance assessment.
- It is also essential that the users have feasible expectations.
- To hide the complexity of the system from the user, aggregations and views should be used.
- It is also possible that a user will write a query you had not tuned for.

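The point about not tuning response times that already beat the SLA can be sketched as a simple check. The sample timings and the 4-second target below are invented for illustration; a real assessment would use measured values and the figures agreed in the SLA.

```python
from statistics import mean

def assess(response_times_s, sla_target_s):
    """Compare the average query response time with the SLA target."""
    average = mean(response_times_s)
    return {"average_s": average, "meets_sla": average <= sla_target_s}

# Hypothetical measured response times (seconds) for one class of query.
samples = [2.0, 3.5, 1.5, 5.0]
result = assess(samples, sla_target_s=4.0)

print(result)
if result["meets_sla"]:
    print("Already within the SLA; further tuning of this measure is not needed.")
```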
Data Load Tuning

Data load is a very critical part of overnight processing. Nothing else can run until the data load is complete. This is the entry point into the system.

Note: If there is a delay in transferring the data or in the arrival of data, then the entire system is badly affected. Therefore it is very important to tune the data load first.

There are various approaches to tuning the data load, which are discussed below:
- The most common approach is to insert data using the SQL layer. In this approach the normal checks and constraints need to be performed. When data is inserted into a table, code runs to check whether there is enough space available to insert the data. If sufficient space is not available, more space may have to be allocated to the table. These checks take time to perform and are costly in CPU, but this approach packs the data tightly by making maximal use of space.
- The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data, which can lead to some space wastage.
- The third approach is that, while loading data into a table that already contains data, we can maintain the indexes.
- The fourth approach is that, to load data into tables that already contain data, we drop the indexes and recreate them when the data load is complete.

Which of the third and fourth approaches is better depends on how much data is already loaded and how many indexes need to be rebuilt.

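The fourth approach (drop the indexes, load, then recreate them) can be sketched with Python's built-in sqlite3 module. The table sales and the index idx_sales_region are hypothetical names; production warehouses would normally use their RDBMS's bulk loader rather than row inserts.

```python
import sqlite3

def bulk_load_with_index_rebuild(conn, rows):
    """Drop the index, load the rows, then rebuild the index once at the end."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP INDEX IF EXISTS idx_sales_region")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Set up a table that already contains data and an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.execute("INSERT INTO sales VALUES ('north', 10.0)")
conn.commit()

new_rows = [("south", 20.0), ("east", 30.0), ("north", 5.0)]
bulk_load_with_index_rebuild(conn, new_rows)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
index_names = [
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(row_count, index_names)
```

The third approach would simply skip the DROP INDEX and CREATE INDEX steps and let the database maintain the index on every insert; which one is cheaper depends, as noted above, on the volume already loaded and the number of indexes.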
Integrity Checks

Integrity checking greatly affects the performance of the load. Following are the points to be remembered:
- Integrity checks need to be limited because the processing required can be heavy.
- Integrity checks should be applied on the source system to avoid degrading the performance of the data load.

Tuning Queries

We have two kinds of queries in a data warehouse:
- Fixed queries
- Ad hoc queries

Fixed Queries

Fixed queries are well defined. The following are examples of fixed queries:
- Regular reports
- Canned queries
- Common aggregations

Tuning fixed queries in a data warehouse is the same as in relational database systems. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing the fixed queries. Storing these execution plans will allow us to spot changing data size and data skew, as these will cause the execution plan to change.

Note: Little more can be done on the fact table, but when dealing with dimension tables or aggregations, the usual collection of SQL tweaking, storage mechanisms and access methods can be used to tune these queries.

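Storing the execution plan of a fixed query and comparing it later, as described above, might look like the following sketch. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the table and index names are hypothetical, and adding an index is used here only as a stand-in for whatever (data growth, skew) makes the optimizer pick a different plan in a real warehouse.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan for a query as a single comparable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

fixed_query = "SELECT SUM(amount) FROM sales WHERE region = ?"

# Store the plan observed while testing the fixed query.
baseline_plan = query_plan(conn, fixed_query, ("north",))

# Something changes in the warehouse (simulated here by a new index).
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

current_plan = query_plan(conn, fixed_query, ("north",))
plan_changed = current_plan != baseline_plan

print("baseline:", baseline_plan)
print("current: ", current_plan)
print("plan changed:", plan_changed)
```

A changed plan is not necessarily a problem, but it is a cheap signal that a fixed query deserves another look.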
Ad hoc Queries

To understand ad hoc queries it is important to know the ad hoc users of the data warehouse. Here is the list of points that need to be understood about the users of the data warehouse:
- The number of users in the group.
- Whether they use ad hoc queries at regular intervals of time.
- Whether they use ad hoc queries frequently.
- Whether they use ad hoc queries occasionally at unknown intervals.
- The maximum size of query they tend to run.
- The average size of query they tend to run.
- Whether they require drill-down access to the base data.
- The elapsed login time per day.
- The peak time of daily usage.
- The number of queries they run per peak hour.

Following are the points to be remembered:
- It is important to track the users' profiles and identify the queries that are run on a regular basis.
- It is also important to verify that any tuning performed does not adversely affect performance.
- Identify similar ad hoc queries that are frequently run.
- If these queries are identified, the database can be changed and new indexes can be added for those queries.
- If these queries are identified, new aggregations can be created specifically for those queries, which would result in their efficient execution.

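Identifying similar ad hoc queries that are run frequently can be approximated by normalizing each logged query and counting the results. The sketch below is a rough illustration: the query log is invented, and the regular-expression normalization is a simplification that a real SQL parser would do more reliably.

```python
import re
from collections import Counter

def normalize(sql):
    """Replace literals and collapse whitespace so similar queries group together."""
    sql = sql.strip().lower()
    sql = re.sub(r"'[^']*'", "?", sql)          # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)  # numeric literals
    return re.sub(r"\s+", " ", sql)

def frequent_queries(query_log, min_count=2):
    """Return (query shape, count) pairs seen at least min_count times."""
    counts = Counter(normalize(q) for q in query_log)
    return [(q, n) for q, n in counts.most_common() if n >= min_count]

# Hypothetical log of ad hoc queries.
query_log = [
    "SELECT SUM(amount) FROM sales WHERE region = 'north'",
    "select sum(amount) from sales where region = 'south'",
    "SELECT COUNT(*) FROM customers WHERE signup_year = 2019",
    "SELECT  SUM(amount)  FROM sales WHERE region = 'east'",
]

candidates = frequent_queries(query_log)
for shape, count in candidates:
    print(count, shape)
```

Query shapes that show up repeatedly are the natural candidates for a new index or a dedicated aggregation.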
Data Warehousing - Testing

Introduction

Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic levels of testing, listed below:
- Unit testing
- Integration testing
- System testing

Unit Testing

In unit testing each component is tested separately. In this kind of testing each module, i.e. procedure, program, SQL script or Unix shell script, is tested. This testing is performed by the developer.

Integration Testing

In this kind of testing the various modules of the application are brought together and then tested against a number of inputs. It is performed to test whether the various components work well together after integration.

System Testing

In this kind of testing the whole data warehouse application is tested together. The purpose of this testing is to check whether the entire system works correctly together or not. This testing is performed by the testing team. Since the whole data warehouse is very large, it is usually possible to perform only minimal system testing before the test plan proper can be enacted.

Test Schedule

First of all, the test schedule is created as part of developing the test plan. In it we estimate the time required for testing the entire data warehouse system.

Difficulties in Scheduling the Testing

There are different methodologies available, but none of them is perfect because the data warehouse is very complex and large, and the data warehouse system is also evolving in nature.
- A simple problem may involve a very large query that can take a day or more to complete, i.e. the query does not complete in the desired time scale.
- There may be a hardware failure such as losing a disk, or a human error such as accidentally deleting a table or overwriting a large table.

Note: Due to the above-mentioned difficulties it is recommended that you always double the amount of time you would normally allow for testing.

Testing the Backup Recovery

This is very important testing that needs to be performed. Here is the list of scenarios for which this testing is needed:
- Media failure.
- Loss or damage of a table space or data file.
- Loss or damage of a redo log file.
- Loss or damage of a control file.
- Instance failure.
- Loss or damage of an archive file.
- Loss or damage of a table.
- Failure during data movement.

Testing the Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

Security - A separate security document is required for security testing. This document contains the list of disallowed operations, and a test is devised for each.
Scheduler - Scheduling software is required to control the daily operations of the data warehouse. It needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, which needs the scheduler to control the overnight processing and the management of aggregations.
Disk configuration - The disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
Management tools - All the management tools need to be tested during system testing. Here is the list of tools that need to be tested:
- Event manager
- System manager
- Database manager
- Configuration manager
- Backup recovery manager

Testing the Database

There are three sets of tests, listed below.

Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running and management of a test database.
Testing database features - Here is the list of features that we have to test:
- Querying in parallel
- Creating indexes in parallel
- Loading data in parallel
Testing database performance - Query execution plays a very important role in data warehouse performance measures. There is a set of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take the time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.

Testing the Application

All the managers should be integrated correctly and work together to ensure that the end-to-end load, index, aggregate and query operations work as expected. Each function of each manager should work correctly. It is also necessary to test the application over a period of time. The week-end and month-end tasks should also be tested.

Logistics of the Test

The question is: what are you really testing? The answer is that you are testing a suite of data warehouse application code. The aim of the system test is to test all of the following areas:
- Scheduling software
- Day-to-day operational procedures
- Backup recovery strategy
- Management and scheduling tools
- Overnight processing
- Query performance

Note: The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing - Future Aspects

Following are the future aspects of data warehousing:
- As we have seen, the size of open databases has grown to approximately double its magnitude in the last few years. This change in magnitude is of great significance.
- As the size of databases grows, the estimate of what constitutes a very large database continues to grow.
- The hardware and software available today do not allow a large amount of data to be kept online. For example, Telco call records require 10 TB of data to be kept online, which is the size of just one month's records. If records of sales, marketing, customers, employees, etc. also have to be kept, the size will be more than 100 TB.
- Records contain not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.
- Apart from size, planning, building and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system.
- With the growth of the internet, users are required to access data online.

Hence the future shape of the data warehouse will be very different from what is being created today.

Data Warehousing - Interview Questions

Dear readers, these Data Warehousing interview questions have been designed especially to acquaint you with the nature of the questions you may encounter during an interview on the subject of data warehousing. In my experience, good interviewers rarely plan to ask any particular question during an interview; normally questions start with some basic concept of the subject and then continue based on further discussion and your answers:

Q: Define data warehouse.
A: A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process.

Q: What does a subject-oriented data warehouse signify?
A: Subject oriented signifies that the data warehouse stores information around a particular subject such as product, customer, sales, etc.

Q: List any five applications of a data warehouse.
A: Some applications include financial services, banking services, consumer goods, retail sectors and controlled manufacturing.

Q: What do OLAP and OLTP stand for?
A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transactional Processing.

Q: What is the very basic difference between a data warehouse and operational databases?
A: A data warehouse contains historical information that is made available for analysis of the business, whereas an operational database contains the current information that is required to run the business.

Q: List the schemas that a data warehouse system implements.
A: A data warehouse can implement a star schema, a snowflake schema or a fact constellation schema.

Q: What is data warehousing?
A: Data warehousing is the process of constructing and using the data warehouse.

Q: List the processes that are involved in data warehousing.
A: Data warehousing involves data cleaning, data integration and data consolidation.

Q: List the functions of data warehouse tools and utilities.
A: The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading and refreshing.

Q: What do you mean by data extraction?
A: Data extraction means gathering data from multiple heterogeneous sources.

Q: Define metadata.
A: Metadata is simply defined as data about data. In other words, metadata is the summarized data that leads us to the detailed data.

Q: What does the metadata repository contain?
A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.

Q: How does a data cube help?
A: A data cube helps us to represent data in multiple dimensions. The data cube is defined by dimensions and facts.

Q: Define dimension.
A: Dimensions are the entities with respect to which an enterprise keeps its records.

Q: Explain data mart.
A: A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group within the organisation. In other words, a data mart contains only the data that is specific to a particular group.

Q: What is a virtual warehouse?
A: A view over an operational data warehouse is known as a virtual warehouse.

Q: List the phases involved in the data warehouse delivery process.
A: The stages are IT strategy, education, business case analysis, technical blueprint, build the version, history load, ad hoc query, requirement evolution, automation and extending scope.

Q: Explain load manager.
A: This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from data warehouse to data warehouse.

Q: Define the functions of the load manager.
A: Extract the data from the source system. Fast-load the extracted data into a temporary data store. Perform simple transformations into a structure similar to the one in the data warehouse.

Q: Explain warehouse manager.
A: The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs and shell scripts. The size and complexity of the warehouse manager varies between specific solutions.

Q: Define the functions of the warehouse manager.
A: The warehouse manager performs consistency and referential integrity checks; creates the indexes, business views and partition views against the base data; transforms and merges the source data in the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.

Q: What is summary information?
A: Summary information is the area in the data warehouse where the predefined aggregations are kept.

Q: What is the query manager responsible for?
A: The query manager is responsible for directing queries to the suitable tables.

Q: List the types of OLAP server.
A: There are four types of OLAP server, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP and Specialized SQL Servers.

Q: Which one is faster, Multidimensional OLAP or Relational OLAP?
A: Multidimensional OLAP is faster than Relational OLAP.

Q: List the functions performed by OLAP.
A: Functions such as roll-up, drill-down, slice, dice and pivot are performed by OLAP.

Q: How many dimensions are selected in the slice operation?
A: Only one dimension is selected for the slice operation.

Q: How many dimensions are selected in the dice operation?
A: For the dice operation two or more dimensions are selected for a given cube.

Q: How many fact tables are there in a star schema?
A: There is only one fact table in a star schema.

Q: What is normalization?
A: Normalization splits up the data into additional tables.

Q: Out of the star schema and the snowflake schema, in which is the dimension table normalised?
A: The snowflake schema uses the concept of normalization.

Q: What is the benefit of normalization?
A: Normalization helps to reduce data redundancy.

Q: Which language is used for schema definition?
A: Data Mining Query Language (DMQL) is used for schema definition.

Q: What language is the base of DMQL?
A: DMQL is based on Structured Query Language (SQL).

Q: What are the reasons for partitioning?
A: Partitioning is done for various reasons, such as easy management, to assist backup recovery and to enhance performance.

Q: What kinds of costs are involved in data marting?
A: Data marting involves hardware and software cost, network access cost and time cost.

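The difference between the slice and dice answers above can be illustrated with a toy cube. In this sketch the cube is a plain Python dictionary keyed by (time, location, item); the dimension names and figures are invented, and a real OLAP server would of course not store a cube this way.

```python
# Toy cube: (time, location, item) -> units sold. Values are invented.
DIMENSIONS = ("time", "location", "item")
cube = {
    ("Q1", "north", "mobile"): 100,
    ("Q1", "south", "mobile"): 80,
    ("Q1", "north", "modem"): 40,
    ("Q2", "north", "mobile"): 120,
    ("Q2", "south", "modem"): 60,
}

def slice_cube(cube, dimension, value):
    """Slice: fix ONE dimension to a single value and drop it from the keys."""
    i = DIMENSIONS.index(dimension)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, selections):
    """Dice: select values on TWO OR MORE dimensions, keeping a smaller cube."""
    wanted = {DIMENSIONS.index(d): set(vals) for d, vals in selections.items()}
    return {
        k: v for k, v in cube.items()
        if all(k[i] in vals for i, vals in wanted.items())
    }

q1_slice = slice_cube(cube, "time", "Q1")
north_mobile = dice_cube(cube, {"location": ["north"], "item": ["mobile"]})

print(q1_slice)
print(north_mobile)
```

The slice result has one dimension fewer than the original cube, while the dice result keeps all three dimensions but only the selected cells.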
What is Next?

Further, you can go through the past assignments you have done on the subject and make sure you are able to speak confidently about them. If you are a fresher, the interviewer does not expect you to answer very complex questions; rather, you have to make your basic concepts very strong.