Big data is an evolving term that describes any voluminous amount of structured, semistructured and unstructured data that has the potential to be mined for information.
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “History of Hadoop”. 1. IBM and ________ have announced a major initiative to use Hadoop to support university courses in distributed computer programming. a) Google Latitude b) Android (operating system) c) Google Variations d) Google Answer:d Explanation:In 2007, Google and IBM announced a joint university initiative to address internet-scale computing, using Hadoop in courses on distributed programming. 2. Point out the correct statement: a) Hadoop is an ideal environment for extracting and transforming small volumes of data b) Hadoop stores data in HDFS and supports data compression/decompression c) The Giraph framework is less useful than a MapReduce job for solving graph and machine learning problems d) None of the mentioned Answer:b Explanation:Data compression can be achieved using compression algorithms like bzip2, gzip, LZO, etc. Different algorithms can be used in different scenarios based on their capabilities. 3. What license is Hadoop distributed under? a) Apache License 2.0 b) Mozilla Public License c) Shareware d) Commercial Answer:a Explanation:Hadoop is open source, released under the Apache License 2.0. 4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional Hadoop cluster using a live CD. a) OpenOffice.org b) OpenSolaris c) GNU d) Linux
Answer:b Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image. 5. Which of the following genres does Hadoop produce ? a) Distributed file system b) JAX-RS c) Java Message Service d) Relational Database Management System Answer:a Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user. 6. What was Hadoop written in ? a) Java (software platform) b) Perl c) Java (programming language) d) Lua (programming language) Answer:c Explanation: The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell-scripts. 7. Which of the following platforms does Hadoop run on ? a) Bare metal b) Debian c) Cross-platform d) Unix-like Answer:c Explanation:Hadoop has support for cross platform operating system. 8. Hadoop achieves reliability by replicating the data across multiple hosts, and hence does not require ________ storage on hosts. a) RAID b) Standard RAID levels c) ZFS d) Operating system
Answer:a Explanation:With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. 9. Above the file systems comes the ________ engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. a) MapReduce b) Google c) Functional programming d) Facebook Answer:a Explanation:The MapReduce engine is used to distribute work around the cluster.
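As an illustration of how a client application submits a job to the JobTracker, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API. The class name, job name and the use of the bundled IdentityMapper/IdentityReducer pass-through classes are only an example, not part of the quiz.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughDriver.class);
        conf.setJobName("pass-through");

        // With the default TextInputFormat, keys are byte offsets (LongWritable) and values are lines (Text).
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Identity classes simply forward their input; a real job would plug in its own Mapper and Reducer.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // JobClient submits the configured job to the JobTracker and waits for it to complete.
        JobClient.runJob(conf);
    }
}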
10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and matrix operations. a) Machine learning b) Pattern recognition c) Statistical classification d) Artificial intelligence Answer:a Explanation: The Apache Mahout project’s goal is to build a scalable machine learning tool. 1. As companies move past the experimental phase with Hadoop, many cite the need for additional capabilities, including: a) Improved data storage and information retrieval b) Improved extract, transform and load features for data integration c) Improved data warehousing functionality d) Improved security, workload management and SQL support Answer:d Explanation:Adding security to Hadoop is challenging because not all of the interactions follow the classic client-server pattern. 2. Point out the correct statement: a) Hadoop needs specialized hardware to process the data b) Hadoop 2.0 allows live stream processing of real-time data c) In the Hadoop programming framework output files are divided into lines or records
d) None of the mentioned Answer:b Explanation:Hadoop batch-processes data distributed over clusters ranging from hundreds to thousands of computers. 3. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with big data technologies like Hadoop? a) Big data management and data mining b) Data warehousing and business intelligence c) Management of Hadoop clusters d) Collecting and storing unstructured data Answer:a Explanation:Data warehousing integrated with Hadoop would give a better understanding of the data. 4. Hadoop is a framework that works with a variety of related tools. Common cohorts include: a) MapReduce, Hive and HBase b) MapReduce, MySQL and Google Apps c) MapReduce, Hummer and Iguana d) MapReduce, Heron and Trumpet Answer:a Explanation:To use Hive with HBase you’ll typically want to launch two clusters, one to run HBase and the other to run Hive. 5. Point out the wrong statement: a) Hadoop’s processing capabilities are huge and its real advantage lies in the ability to process terabytes & petabytes of data b) Hadoop uses a programming model called “MapReduce”, and all programs should conform to this model in order to work on the Hadoop platform c) The programming model, MapReduce, used by Hadoop is difficult to write and test d) All of the mentioned Answer:c Explanation: The programming model, MapReduce, used by Hadoop is simple to write and test. 6. What was Hadoop named after? a) Creator Doug Cutting’s favorite circus act b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son d) A sound Cutting’s laptop made during Hadoop’s development Answer:c Explanation:Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed toy elephant. 7. All of the following accurately describe Hadoop, EXCEPT: a) Open source b) Real-time c) Java-based d) Distributed computing approach Answer:b Explanation:Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. 8. __________ can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of data. a) MapReduce b) Mahout c) Oozie d) All of the mentioned Answer:a Explanation:MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm. 9. __________ has the world’s largest Hadoop cluster. a) Apple b) Datamatics c) Facebook d) None of the mentioned Answer:c Explanation:Facebook has many Hadoop clusters, the largest among them is the one that is used for Data warehousing. 10. Facebook Tackles Big Data With _______ based on Hadoop. a) ‘Project Prism’
b) ‘Prism’ c) ‘Project Big’ d) ‘Project Data’ Answer:a Explanation:Prism automatically replicates and moves data wherever it’s needed across a vast network of computing facilities. Hadoop Questions and Answers – Hadoop Ecosystem This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Ecosystem”. 1. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. a) Pig Latin b) Oozie c) Pig d) Hive Answer:c Explanation:Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. 2. Point out the correct statement : a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data b) Hive is a relational database with SQL support c) Pig is a relational database with SQL support d) All of the mentioned Answer:a Explanation:Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. 3. _________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading. a) Scalding b) HCatalog c) Cascalog
d) All of the mentioned Answer:c Explanation:Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the name “Cascalog” is a contraction of Cascading and Datalog. 4. Hive also supports custom extensions written in: a) C# b) Java c) C d) C++ Answer:b Explanation:Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats. 5. Point out the wrong statement: a) Elastic MapReduce (EMR) is Facebook’s packaged Hadoop offering b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop offering c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate d) All of the mentioned Answer:a Explanation:Rather than building Hadoop deployments manually on EC2 (Elastic Compute Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation commands, either through the AWS Web Console or through command-line tools. 6. ________ is the most popular high-level Java API in the Hadoop Ecosystem. a) Scalding b) HCatalog c) Cascalog d) Cascading Answer:d Explanation:Cascading hides many of the complexities of MapReduce programming behind more intuitive pipes and data flow abstractions. 7. ___________ is a general-purpose computing model and runtime system for distributed data analytics. a) Mapreduce
b) Drill c) Oozie d) None of the mentioned Answer:a Explanation:Mapreduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms. 8. The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to : a) SQL b) JSON c) XML d) All of the mentioned Answer:a Explanation:Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce. 9. _______ jobs are optimized for scalability but not latency. a) Mapreduce b) Drill c) Oozie d) Hive Answer:d Explanation:Hive Queries are translated to MapReduce jobs to exploit the scalability of MapReduce.
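Because Hive queries are compiled into MapReduce jobs, as noted above, they are usually issued through HiveServer2 rather than coded by hand. Below is a minimal sketch, not a definitive recipe: the server address localhost:10000, the default database, and the pageviews table are all assumptions for illustration, and the Apache hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; host, port and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             // Hive compiles this aggregate query into one or more MapReduce jobs behind the scenes.
             ResultSet rs = stmt.executeQuery("SELECT url, COUNT(*) FROM pageviews GROUP BY url")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}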
10. ______ is a framework for performing remote procedure calls and data serialization. a) Drill b) BigTop c) Avro d) Chukwa Answer:c Explanation:In the context of Hadoop, Avro can be used to pass data from one program or language to another.
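To make the Avro item concrete, here is a small sketch of serializing a single record to Avro's binary format with the generic Java API; the User schema and its fields are invented purely for illustration.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeSketch {
    public static void main(String[] args) throws Exception {
        // A schema defined inline; real projects usually keep schemas in .avsc files.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"clicks\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("clicks", 42);

        // Serialize to a language-neutral binary form that another program (or language) can decode.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized " + out.size() + " bytes");
    }
}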
Hadoop Questions and Answers – Introduction to Mapreduce This set of Multiple Choice Questions & Answers (MCQs) focuses on “MapReduce”. 1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. a) MapReduce b) Mapper c) TaskTracker d) JobTracker Answer:c Explanation:TaskTracker receives the information necessary for execution of a Task from JobTracker, Executes the Task, and Sends the Results back to JobTracker. 2. Point out the correct statement : a) MapReduce tries to place the data and the compute as close as possible b) Map Task in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned Answer:a Explanation:This feature of MapReduce is “Data Locality”. 3. ___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results. a) Maptask b) Mapper c) Task execution d) All of the mentioned Answer:a Explanation:Map Task in MapReduce is performed using the Map() function. 4. _________ function is responsible for consolidating the results produced by each of the Map() functions/tasks. a) Reduce b) Map c) Reducer
d) All of the mentioned Answer:a Explanation:Reduce function collates the work and resolves the results. 5. Point out the wrong statement : a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned Answer:d Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and reexecutes the failed tasks. 6. Although the Hadoop framework is implemented in Java , MapReduce applications need not be written in : a) Java b) C c) C# d) None of the mentioned Answer:a Explanation:Hadoop Pipes is a SWIG- compatible C++ API to implement MapReduce applications (non JNITM based). 7. ________ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. a) Hadoop Strdata b) Hadoop Streaming c) Hadoop Stream d) None of the mentioned Answer:b Explanation:Hadoop streaming is one of the most important utilities in the Apache Hadoop distribution.
8. __________ maps input key/value pairs to a set of intermediate key/value pairs. a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned Answer:a Explanation:Maps are the individual tasks that transform input records into intermediate records. 9. The number of maps is usually driven by the total size of : a) inputs b) outputs c) tasks d) None of the mentioned Answer:a Explanation:Total size of inputs means total number of blocks of the input files. 10. _________ is the default Partitioner for partitioning key space. a) HashPar b) Partitioner c) HashPartitioner d) None of the mentioned Answer:c Explanation: The default partitioner in Hadoop is the HashPartitioner which has a method called getPartition to partition. Hadoop Questions and Answers – Analyzing Data with Hadoop This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Analyzing Data with Hadoop”. 1. Mapper implementations are passed the JobConf for the job via the ________ method a) JobConfigure.configure b) JobConfigurable.configure c) JobConfigurable.configureable d) None of the mentioned
Answer:b Explanation:JobConfigurable.configure method is overridden to initialize themselves. 2. Point out the correct statement : a) Applications can use the Reporter to report progress b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format d) All of the mentioned Answer: d Explanation:Reporters can be used to set application-level status messages and update Counters. 3. Input to the _______ is the sorted output of the mappers. a) Reducer b) Mapper c) Shuffle d) All of the mentioned Answer:a Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP. 4. The right number of reduces seems to be : a) 0.90 b) 0.80 c) 0.36 d) 0.95 Answer:d Explanation: The right number of reduces seems to be 0.95 or 1.75. 5. Point out the wrong statement : a) Reducer has 2 primary phases b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures c) It is legal to set the number of reduce-tasks to zero if no reduction is desired d) The framework groups Reducer inputs by keys (since different mappers may have output the
same key) in sort stage Answer:a Explanation:Reducer has 3 primary phases: shuffle, sort and reduce. 6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop. a) Mapper b) Cascader c) Scalding d) None of the mentioned Answer:d Explanation: The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted. 7. Which of the following phases occur simultaneously? a) Shuffle and Sort b) Reduce and Sort c) Shuffle and Map d) All of the mentioned Answer:a Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged. 8. Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive. a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned Answer:c Explanation:The Reporter is the facility provided to report progress or indicate liveness. 9. __________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer. a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned
Answer:b Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners. 10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. a) Map Parameters b) JobConf c) MemoryConf d) None of the mentioned Answer:b Explanation:JobConf represents a MapReduce job configuration. Hadoop Questions and Answers – Scaling out in Hadoop This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Scaling out in Hadoop”. 1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes. a) NoSQL b) NewSQL c) SQL d) All of the mentioned Answer:a Explanation: NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation. 2. Point out the correct statement : a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload b) HDFS runs on a small cluster of commodity-class nodes c) NEWSQL is frequently the collection point for big data d) None of the mentioned Answer:a Explanation:Hadoop together with a relational data warehouse, they can form very effective data warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record values with schema applied on read based on: a) HCatalog b) Hive c) Hbase d) All of the mentioned Answer:a Explanation:Other means of tagging the values also can be used. 4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments a) EMR b) Isilon solutions c) AWS d) None of the mentioned Answer:b Explanation:enterprise data protection and security options including file system auditing and data-at-rest encryption to address compliance requirements is also provided by Isilon solution. 5. Point out the wrong statement : a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform b) Isilon’s native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure c) NoSQL systems do provide high latency access and accommodate less concurrent users d) None of the mentioned Answer:c Explanation:NoSQL systems do provide low latency access and accommodate many concurrent users. 6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to : a) Scale out b) Scale up c) Both Scale out and up d) None of the mentioned
Answer:a Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up. 7. Which is the most popular NoSQL database for scalable big data store with Hadoop ? a) Hbase b) MongoDB c) Cassandra d) None of the mentioned Answer:a Explanation:HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. 8. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks. a) DataCache b) DistributedData c) DistributedCache d) All of the mentioned Answer:c Explanation: The child-jvm always has its current working directory added to the java.library.path and LD_LIBRARY_PATH. 9. HBase provides ___________ like capabilities on top of Hadoop and HDFS. a) TopTable b) BigTop c) Bigtable d) None of the mentioned Answer:c Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System. 10. _______ refers to incremental costs with no major impact on solution design, performance and complexity. a) Scale-out
b) Scale-down c) Scale-up d) None of the mentioned Answer:c Explanation:Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part of a cluster does not require additional network switches. Hadoop Questions and Answers – Hadoop Streaming This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Streaming”. 1. Streaming supports streaming command options as well as _________ command options. a) generic b) tool c) library d) task Answer:a Explanation:Place the generic options before the streaming options, otherwise the command will fail. 2. Point out the correct statement: a) You can specify any executable as the mapper and/or the reducer b) You cannot supply a Java class as the mapper and/or the reducer c) The class you supply for the output format should return key/value pairs of Text class d) All of the mentioned Answer:a Explanation:If you do not specify an input format class, the TextInputFormat is used as the default. 3. Which of the following Hadoop streaming command option parameters is required? a) output directoryname b) mapper executable c) input directoryname d) All of the mentioned
Answer:d Explanation:The input location, the output location and the mapper executable are all required parameters. 4. To set an environment variable in a streaming command use: a) -cmden EXAMPLE_DIR=/home/example/dictionaries/ b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/ Answer:c Explanation:Environment variables are set using the -cmdenv option. 5. Point out the wrong statement: a) Hadoop has a library package called Aggregate b) Aggregate allows you to define a mapper plugin class that is expected to generate “aggregatable items” for each input key/value pair of the mappers c) To use Aggregate, simply specify “-mapper aggregate” d) None of the mentioned Answer:c Explanation:To use Aggregate, simply specify “-reducer aggregate”. 6. The ________ option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files. a) archives b) files c) task d) None of the mentioned Answer:a Explanation:The -archives option is also a generic option. 7. ______________ class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. a) KeyFieldPartitioner b) KeyFieldBasedPartitioner c) KeyFieldBased d) None of the mentioned
Answer:b Explanation: The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting. 8. Which of the following class provides a subset of features provided by the Unix/GNU Sort ? a) KeyFieldBased b) KeyFieldComparator c) KeyFieldBasedComparator d) All of the mentioned Answer:c Explanation:Hadoop has a library class, KeyFieldBasedComparator, that is useful for many applications. 9. Which of the following class is provided by Aggregate package ? a) Map b) Reducer c) Reduce d) None of the mentioned Answer:b Explanation:Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as “sum”, “max”, “min” and so on over a sequence of values. 10.Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the unix ______ utility. a) Copy b) Cut c) Paste d) Move Answer:b Explanation: The map function defined in the class treats each input key/value pair as a list of fields. Hadoop Questions and Answers – Introduction to HDFS This set of Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Filesystem – HDFS”.
1. A ________ serves as the master and there is only one NameNode per cluster. a) Data Node b) NameNode c) Data block d) Replication Answer:b Explanation:All the metadata related to HDFS, including the information about data nodes, files stored on HDFS, and replication, is stored and maintained on the NameNode. 2. Point out the correct statement: a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks b) Each incoming file is broken into 32 MB by default c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance d) None of the mentioned Answer:a Explanation:There can be any number of DataNodes in a Hadoop Cluster. 3. HDFS works in a __________ fashion. a) master-worker b) master-slave c) worker/slave d) All of the mentioned Answer:a Explanation:The NameNode serves as the master and each DataNode serves as a worker/slave. 4. ________ NameNode is used when the Primary NameNode goes down. a) Rack b) Data c) Secondary d) None of the mentioned Answer:c Explanation:The Secondary NameNode is used for all-time availability and reliability. 5. Point out the wrong statement: a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode c) User data is stored on the local file system of DataNodes d) DataNode is aware of the files to which the blocks stored on it belong to Answer:d Explanation:It is the NameNode, not the DataNode, that is aware of the files to which the blocks stored on a DataNode belong. 6. Which of the following scenarios may not be a good fit for HDFS? a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file b) HDFS is suitable for storing data related to applications requiring low latency data access c) HDFS is suitable for storing data related to applications requiring low latency data access d) None of the mentioned Answer:a Explanation:HDFS can be used for storing archive data since it is cheaper, as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of fault tolerance. 7. The need for data replication can arise in various scenarios like: a) Replication Factor is changed b) DataNode goes down c) Data Blocks get corrupted d) All of the mentioned Answer:d Explanation:Data is replicated across different DataNodes to ensure a high degree of fault tolerance. 8. ________ is the slave/worker node and holds the user data in the form of Data Blocks. a) DataNode b) NameNode c) Data block d) Replication Answer:a Explanation:A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.
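The NameNode's view of blocks, replicas and their DataNode locations can be inspected from client code. A minimal sketch follows, assuming the cluster configuration (core-site.xml/hdfs-site.xml) is on the classpath; the path /user/example/data.txt is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/data.txt");  // placeholder path

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}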
9. HDFS provides a command line interface called __________ used to interact with HDFS. a) “HDFS Shell” b) “FS Shell” c) “DFS Shell” d) None of the mentioned Answer:b Explanation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS). 10. HDFS is implemented in _____________ programming language. a) C++ b) Java c) Scala d) None of the mentioned Answer:b Explanation:HDFS is implemented in Java and any computer which can run Java can host a NameNode/DataNode on it. Hadoop Questions and Answers – Java Interface This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Java Interface”. 1. In order to read any file in HDFS, an instance of __________ is required. a) filesystem b) datastream c) outstream d) inputstream Answer:a Explanation:A FileSystem instance is required; its open() method returns an FSDataInputStream that is used to read data from the file. 2. Point out the correct statement: a) The framework groups Reducer inputs by keys b) The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged c) Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values d) All of the mentioned
Answer:d Explanation:If equivalence rules for keys while grouping the intermediates are different from those for grouping keys before reduction, then one may specify a Comparator. 3. ______________ provides a method to copy bytes from an input stream to any other stream in Hadoop. a) IOUtils b) Utils c) IUtils d) All of the mentioned Answer:a Explanation:IOUtils is a utility class in the Java interface whose static copyBytes() method copies between streams. 4. _____________ is used to read data from byte buffers. a) write() b) read() c) readwrite() d) All of the mentioned Answer:b Explanation:The readFully() method can also be used instead of the read() method. 5. Point out the wrong statement: a) The framework calls reduce method for each pair in the grouped inputs b) The output of the Reducer is re-sorted c) reduce method reduces values for a given key d) None of the mentioned Answer:b Explanation: The output of the Reducer is not re-sorted. 6. Interface ____________ reduces a set of intermediate values which share a key to a smaller set of values. a) Mapper b) Reducer c) Writable d) Readable Answer:b Explanation:Reducer implementations can access the JobConf for the job.
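Tying together the FileSystem, IOUtils and read() items above, a minimal read of an HDFS file might look like the sketch below; the path /user/example/input.txt is a placeholder.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // a FileSystem instance is required to read any HDFS file

        InputStream in = null;
        try {
            // Placeholder path; open() returns an FSDataInputStream.
            in = fs.open(new Path("/user/example/input.txt"));
            // IOUtils.copyBytes copies from the input stream to any other stream (here, stdout).
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}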
7. Reducer is input the grouped output of a: a) Mapper b) Reducer c) Writable d) Readable Answer:a Explanation:In the shuffle phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP. 8. The output of the reduce task is typically written to the FileSystem via: a) OutputCollector b) InputCollector c) OutputCollect d) All of the mentioned Answer:a Explanation:In the reduce phase the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each pair in the grouped inputs. 9. Applications can use the _________ provided to report progress or just indicate that they are alive. a) Collector b) Reporter c) Dashboard d) None of the mentioned Answer:b Explanation:In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed out and kill that task. 10. Which of the following parameters is used to collect keys and combined values? a) key b) values c) reporter d) output
Answer:d Explanation:The output parameter collects keys and combined values; the reporter parameter provides the facility to report progress. Hadoop Questions and Answers – Data Flow This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Data Flow”. 1. ________ is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. a) Hive b) MapReduce c) Pig d) Lucene Answer:b Explanation:MapReduce is the heart of Hadoop. 2. Point out the correct statement: a) Data locality means movement of the algorithm to the data instead of data to the algorithm b) When the processing is done on the data, the algorithm is moved across the Action Nodes rather than the data to the algorithm c) Moving Computation is more expensive than Moving Data d) None of the mentioned Answer:a Explanation:The data flow framework possesses the feature of data locality. 3. The daemons associated with the MapReduce phase are ________ and task-trackers. a) job-tracker b) map-tracker c) reduce-tracker d) All of the mentioned Answer:a Explanation:Map-Reduce jobs are submitted to the job-tracker. 4. The JobTracker pushes work out to available _______ nodes in the cluster, striving to keep the work as close to the data as possible. a) DataNodes b) TaskTracker c) ActionNodes
d) All of the mentioned Answer:b Explanation:A heartbeat is sent from the TaskTracker to the JobTracker every few seconds so the JobTracker knows whether the node is dead or alive. 5. Point out the wrong statement: a) The map function in Hadoop MapReduce has the following general form: map: (K1, V1) → list(K2, V2) b) The reduce function in Hadoop MapReduce has the following general form: reduce: (K2, list(V2)) → list(K3, V3) c) MapReduce has a complex model of data processing: inputs and outputs for the map and reduce functions are key-value pairs d) None of the mentioned Answer:c Explanation:MapReduce is a relatively simple model to implement in Hadoop. 6. InputFormat class calls the ________ function and computes splits for each file and then sends them to the jobtracker. a) puts b) gets c) getSplits d) All of the mentioned Answer:c Explanation:The jobtracker uses the splits’ storage locations to schedule map tasks to process them on the tasktrackers. 7. On a tasktracker, the map task passes the split to the createRecordReader() method on InputFormat to obtain a _________ for that split. a) InputReader b) RecordReader c) OutputReader d) None of the mentioned Answer:b Explanation: The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
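The map: (K1, V1) → list(K2, V2) and reduce: (K2, list(V2)) → list(K3, V3) forms above, together with the OutputCollector and Reporter parameters discussed earlier, come together in the classic word-count pair of classes. This is a sketch against the old org.apache.hadoop.mapred interfaces; the class names are chosen for illustration only.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// map: (byte offset, line) -> list(word, 1)
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);  // emit intermediate (word, 1) pairs
            }
        }
    }
}

// reduce: (word, list(counts)) -> list(word, total)
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}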
8. The default InputFormat is __________, which treats each line of the input as a value, with the byte offset in the file as the associated key. a) TextFormat b) TextInputFormat c) InputFormat d) All of the mentioned Answer:b Explanation:A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs. 9. __________ controls the partitioning of the keys of the intermediate map-outputs. a) Collector b) Partitioner c) InputFormat d) None of the mentioned Answer:b Explanation: The output of the mapper is sent to the partitioner. 10. Output of the mapper is first written on the local disk for sorting and the _________ process. a) shuffling b) secondary sorting c) forking d) reducing Answer:a Explanation:All values corresponding to the same key will go to the same reducer. Hadoop Questions and Answers – Hadoop Archives This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Archives”. 1. _________ is the name of the archive you would like to create. a) archive b) archiveName c) Name d) None of the mentioned
Answer:b Explanation: The name should have a *.har extension. 2. Point out the correct statement: a) A Hadoop archive maps to a file system directory b) Hadoop archives are special format archives c) A Hadoop archive always has a *.har extension d) All of the mentioned Answer:d Explanation:A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. 3. Using Hadoop Archives in __________ is as easy as specifying a different input filesystem than the default file system. a) Hive b) Pig c) MapReduce d) All of the mentioned Answer:c Explanation:Because Hadoop Archives are exposed as a file system, MapReduce is able to use all the logical input files in Hadoop Archives as input. 4. The __________ guarantees that excess resources taken from a queue will be restored to it within N minutes of its need for them. a) capacitor b) scheduler c) datanode d) None of the mentioned Answer:b Explanation:Free resources can be allocated to any queue beyond its guaranteed capacity. 5. Point out the wrong statement: a) The Hadoop archive exposes itself as a file system layer b) Hadoop archives are immutable c) Archive renames, deletes and creates return an error d) None of the mentioned
Answer:d Explanation:All the fs shell commands in the archives work but with a different URI. 6. _________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share large clusters. a) Flow Scheduler b) Data Scheduler c) Capacity Scheduler d) None of the mentioned Answer:c Explanation: The Capacity Scheduler supports multiple queues, where a job is submitted to a queue. 7. Which of the following parameters describes the destination directory that would contain the archive? a) -archiveName b)