Latest Hadoop Interview Questions Part – 2
1) Apache Hadoop framework is composed of which modules?
Hadoop Common, Hadoop YARN, Hadoop MapReduce and HDFS (Hadoop Distributed File System)
4) What does the term “Replication factor” denote?
The replication factor is the number of copies of each block of a file that HDFS keeps across the cluster's DataNodes.
3) What is the default replication factor in HDFS?
Three
4) What is the typical size of an HDFS block?
64 MB by default in Hadoop 1.x and 128 MB by default in Hadoop 2.x and later. The block size is configurable per cluster or per file.
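As a rough illustration of how block size and replication factor interact, the sketch below computes how many blocks a file occupies and how much raw disk space its replicas consume. The function names are hypothetical helpers for this example, not part of any Hadoop API, and the defaults assume a 128 MB block size and a replication factor of three.

```python
MB = 1024 * 1024

def hdfs_block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks a file is divided into (the last block may be partial)."""
    # Integer ceiling division; a zero-length file needs no blocks.
    return -(-file_size_bytes // block_size_bytes)

def raw_storage_bytes(file_size_bytes, replication_factor=3):
    """Disk space used across the cluster: every byte is stored once per replica."""
    return file_size_bytes * replication_factor

print(hdfs_block_count(300 * MB))           # 3 blocks at 128 MB
print(hdfs_block_count(300 * MB, 64 * MB))  # 5 blocks at 64 MB
print(raw_storage_bytes(300 * MB) // MB)    # 900 MB of raw storage
```

A partial last block only takes up as much disk space as the data it holds, which is why the storage figure depends on the file size rather than on the block count.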
5) Explain HDFS Functionality in brief?
HDFS is a scalable, distributed file system used for storing large amounts of data, with each block replicated across several nodes.
6) What is NameNode?
The NameNode is the bookkeeper of HDFS. It keeps track of how data files are split into blocks, which nodes store each block, and the overall health of the distributed file system. The administrative functions of the NameNode are highly memory and I/O intensive.
7) Explain the key functionalities of the Secondary NameNode?
At intervals specified in the cluster configuration, the Secondary NameNode (SNN) communicates with the NameNode to take snapshots (checkpoints) of the HDFS metadata. The NameNode is the single point of failure in a classic Hadoop cluster, and the SNN snapshots help minimize downtime and loss of data if it fails. The SNN is not a hot standby and does not take over automatically.
8) What is meant by Rack Awareness in Hadoop?
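Rack awareness means the NN (NameNode) knows which rack each DataNode belongs to and uses that topology, together with its metadata about which node holds each block, when placing replicas. Spreading replicas across racks lets the data survive the loss of a whole rack while limiting cross-rack traffic.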
9) Which component in Hadoop is responsible for Job scheduling and monitoring?
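JobTracker (in Hadoop 1.x; under YARN this role is split between the ResourceManager and a per-application ApplicationMaster)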
10) Name the structure provided by MR (MapReduce)?
Dynamic schema
11) Explain the heartbeat mechanism in Hadoop?
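DataNodes send a heartbeat to the NameNode at regular intervals, every three seconds by default, to signal that they are alive and to report storage capacity and free space. A DataNode that stops sending heartbeats is eventually marked dead, and its blocks are re-replicated on other nodes.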
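The heartbeat bookkeeping can be sketched as a small monitor that records when each DataNode was last heard from. `HeartbeatMonitor` is a hypothetical class for illustration, not Hadoop code. The timeout formula assumes the commonly documented defaults (`dfs.heartbeat.interval` of 3 seconds and `dfs.namenode.heartbeat.recheck-interval` of 5 minutes), which give roughly 10.5 minutes before a silent node is treated as dead.

```python
HEARTBEAT_INTERVAL_S = 3    # assumed default for dfs.heartbeat.interval
RECHECK_INTERVAL_S = 300    # assumed default for dfs.namenode.heartbeat.recheck-interval

# Assumed rule: 2 * recheck interval + 10 * heartbeat interval = 630 seconds.
DEAD_TIMEOUT_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S

class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node (illustrative only)."""

    def __init__(self, dead_timeout_s=DEAD_TIMEOUT_S):
        self.dead_timeout_s = dead_timeout_s
        self.last_seen = {}  # node id -> time of last heartbeat, in seconds

    def record_heartbeat(self, node_id, now_s):
        self.last_seen[node_id] = now_s

    def dead_nodes(self, now_s):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now_s - seen > self.dead_timeout_s)

monitor = HeartbeatMonitor()
monitor.record_heartbeat("dn1", now_s=0)
monitor.record_heartbeat("dn2", now_s=0)
monitor.record_heartbeat("dn1", now_s=700)  # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now_s=700))        # ['dn2']
```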
12) Explain failover fence in Hadoop?
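Fencing applies to NameNode failover in a high-availability setup: before a standby NameNode takes over, the previously active NameNode is cut off so that it cannot keep modifying shared state. It should not be confused with decommissioning of DataNodes, where we remove certain DataNodes from a cluster, for example because of malfunction or load optimization issues.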
13) List all the daemons required to run the Hadoop cluster?
NameNode
DataNode
JobTracker
TaskTracker
14) What is HDFS federation?
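HDFS federation runs multiple independent NameNodes in one cluster, each managing its own part of the namespace while sharing the DataNodes for storage. Its purpose is to scale the namespace horizontally and isolate workloads; backup and automatic recovery of a NameNode are handled by high availability, not by federation.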
15) Assume that Hadoop spawned 50 tasks for a job and one of the tasks failed. What will Hadoop do?
If a task fails, Hadoop reruns it, preferably on a different TaskTracker. If the same task fails four times (the default limit), Hadoop marks the job as failed. The maximum number of attempts per task can be changed in the job configuration.
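The retry behaviour can be sketched as a loop that reruns a failed task on another node until an attempt limit is reached. `run_with_retries` and the node names are hypothetical, and the default of four attempts is an assumption matching the usual MapReduce default. Real scheduling is considerably more involved.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node), moving to the next node after each failure.

    Returns (result, attempts_used); raises RuntimeError once the
    attempt limit is exhausted, which corresponds to the job failing.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        node = nodes[(attempt - 1) % len(nodes)]  # rotate through the nodes
        try:
            return task(node), attempt
        except Exception as err:
            last_error = err
    raise RuntimeError(f"task failed {max_attempts} times; job fails") from last_error

def flaky_task(node):
    # Hypothetical task that only succeeds on node "n3".
    if node != "n3":
        raise IOError(f"disk error on {node}")
    return f"done on {node}"

print(run_with_retries(flaky_task, ["n1", "n2", "n3"]))  # ('done on n3', 3)
```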
16) In what format does MR process the data?
MR processes data as key-value pairs.
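To make the key-value model concrete, here is a minimal in-memory word count that mimics the map, shuffle and reduce phases. It is only a sketch in plain Python, and the function names are illustrative rather than Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the values of one key into a single (key, total) pair."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))

print(word_count(["Hadoop stores data", "Hadoop processes data"]))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```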
17) How many input splits will the Hadoop framework create for the scenario given below?
An MR system with an HDFS block size of 128 MB has three files of sizes 64 KB, 300 MB and 127 MB, with FileInputFormat as the input format.
Hadoop will create five splits:
1 split for the 64 KB file
3 splits for the 300 MB file (128 MB, 128 MB and 44 MB)
1 split for the 127 MB file
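The split count above can be checked with a short calculation. This sketch assumes the split size equals the block size (the default minimum and maximum split settings) and includes the 10% slack that FileInputFormat is understood to allow on the last split. The file labels are made up for the example.

```python
KB = 1024
MB = 1024 * KB
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than the split size

def split_sizes(file_size, split_size=128 * MB):
    """Sizes in bytes of the input splits created for one file."""
    sizes = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        sizes.append(split_size)
        remaining -= split_size
    if remaining > 0:
        sizes.append(remaining)  # final, possibly smaller, split
    return sizes

files = {"small": 64 * KB, "large": 300 * MB, "medium": 127 * MB}
for name, size in files.items():
    print(name, len(split_sizes(size)))

total = sum(len(split_sizes(size)) for size in files.values())
print(total)  # 5
```

Each file gets at least one split of its own, so many small files produce many small splits regardless of the block size.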
18) Explain Speculative Execution?
If one task runs much more slowly than the other tasks of the same job, the JT launches a duplicate copy of it on another node in parallel. Whichever copy finishes first is used and the other is killed. This is known as Speculative Execution; it guards against slow nodes delaying the whole job.
19) What is Hadoop Streaming?
The Hadoop Streaming API allows programmers to use programs written in various programming languages as Hadoop mapper and reducer implementations. The programs read input from standard input and write output to standard output.
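A streaming mapper and reducer might look like the word-count sketch below, kept in one hypothetical script that takes `map` or `reduce` as its first argument. It assumes the usual streaming convention of tab-separated key and value on each line, with the reducer input already sorted by key.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Sum the counts per word; expects lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Only runs when invoked as a script with an explicit "map" or "reduce" argument.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

With the streaming jar, a script like this would typically be passed through the `-mapper` and `-reducer` options together with `-input` and `-output` paths; the location of the jar depends on the installation.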
20) What is Distributed Cache in Hadoop?
The MapReduce framework provides Distributed Cache functionality to cache the files (text, jars, archives, etc.) required by the applications during job execution. Before starting any tasks of a job on a node, the framework copies the required files to that slave node.