1) What is Hadoop?
Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and MapReduce.
2) What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version is good for Hadoop, preferably from Sun. Linux and Windows are the officially supported operating systems for Hadoop, but BSD, Mac OS X, and Solaris are also known to work.
3) What kind of hardware is best for Hadoop?
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact configuration depends on the workflow's needs.
4) What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the default input format.
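As an illustration, the input format can be set explicitly on the Job object. This is a minimal sketch using the newer org.apache.hadoop.mapreduce API; the job name and input path are assumed values, and in this API the key/value variant is called KeyValueTextInputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        // TextInputFormat is the default; setting it here just makes the choice explicit.
        job.setInputFormatClass(TextInputFormat.class);
        // Alternative: split each line into a key and a value at the first tab.
        // job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/data/input")); // assumed path
    }
}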
5) What is InputSplit in Hadoop? Explain.
When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.
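For instance, a job can put bounds on the split size through FileInputFormat; the actual splits still depend on the HDFS block size and the file sizes. A minimal sketch using the newer API, with the path and the 64/128 MB bounds as assumed values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path("/user/data/input")); // assumed path
        // No split smaller than 64 MB, none larger than 128 MB.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}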
6) How many InputSplits are made by the Hadoop framework?
Suppose the input consists of one 64 KB file, one 65 MB file, and one 127 MB file, and the HDFS block size is the default 64 MB. Hadoop will make 5 splits:
- One split for the 64 KB file
- Two splits for the 65 MB file (64 MB + 1 MB), and
- Two splits for the 127 MB file (64 MB + 63 MB)
7) What is the use of RecordReader in Hadoop?
An InputSplit is assigned a slice of the work but doesn't know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
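As a sketch of that division of labour, here is a hypothetical custom InputFormat that simply hands each split to Hadoop's built-in LineRecordReader, which turns the raw bytes into (byte offset, line) pairs for the Mapper:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example class: the InputFormat decides which RecordReader to use.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // The RecordReader does the actual reading and key/value conversion.
        return new LineRecordReader();
    }
}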
8) What is JobTracker in Hadoop?
JobTracker is a service within Hadoop that runs MapReduce jobs on the cluster.
9) What are the functionalities of JobTracker?
These are the main tasks of JobTracker:
- To accept jobs from clients.
- To communicate with the NameNode to determine the location of the data.
- To locate TaskTracker nodes with available slots.
- To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.
10) Define TaskTracker.
TaskTracker is a node in the cluster that accepts tasks such as Map, Reduce, and Shuffle operations from a JobTracker.
11) What is a Map/Reduce job in Hadoop?
Map/Reduce is a programming paradigm used to achieve massive scalability across thousands of servers.
MapReduce actually refers to two different and distinct tasks that Hadoop performs. In the first step, the Map job takes a set of data and converts it into another set of data, and in the second step, the Reduce job takes the output of the map as its input and combines those data tuples into a smaller set of tuples.
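The canonical example is word counting. A minimal sketch of the two steps using the newer API (the class names are made up for illustration; job setup is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: convert each input line into a set of (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce step: take the map output and combine the tuples into one count per word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}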
12) What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run Map/Reduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper and reducer.
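For example, the classic invocation from the Hadoop Streaming documentation uses ordinary Unix tools as the mapper and reducer; the jar location varies by installation, and the input/output paths below are assumed:

hadoop jar hadoop-streaming.jar \
    -input /user/data/input \
    -output /user/data/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc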
13) What is a combiner in Hadoop?
A Combiner is a mini-reduce process that operates only on data generated by a Mapper. When the Mapper emits its data, the Combiner receives it as input and sends its output to the Reducer.
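A Combiner is usually wired in by reusing the Reducer class, which is safe when the reduce function is commutative and associative, as a sum is. A minimal sketch, reusing the hypothetical word-count classes from question 11:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setMapperClass(TokenMapper.class);
        // Run the reduce logic on the map side first to shrink the data sent to reducers.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
    }
}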
14) Is it necessary to know Java to learn Hadoop?
A background in any programming language, such as C, C++, PHP, Python, or Java, can be really helpful, but if you know no Java at all, it is necessary to learn Java and also to get basic knowledge of SQL.
15) How to debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
- By using Counters (see the sketch after this list).
- By using the web interface provided by the Hadoop framework.
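A sketch of the Counters approach: a Mapper increments a custom counter when it meets a suspect record, and the totals appear in the job's status output. The group and counter names here are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AuditMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // Count bad records instead of failing on them; inspect the total after the run.
            context.getCounter("Debug", "EmptyRecords").increment(1);
            return;
        }
        context.write(value, NullWritable.get());
    }
}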
16) Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format classes provide methods to add multiple directories as input to a Hadoop job.
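For example, FileInputFormat accepts repeated addInputPath calls, so several directories feed the same job; the paths below are assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiInputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input-demo");
        // Both directories are read as input by the same job.
        FileInputFormat.addInputPath(job, new Path("/user/data/logs-2023")); // assumed
        FileInputFormat.addInputPath(job, new Path("/user/data/logs-2024")); // assumed
    }
}

The MultipleInputs helper class goes further and can bind a different mapper to each input path.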
17) What is the relation between a job and a task in Hadoop?
In Hadoop, a job is divided into multiple smaller parts known as tasks.
18) What is the distributed cache in Hadoop?
The distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, etc.) needed during the execution of a job. The framework copies the necessary files to each slave node before any task executes on that node.
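A minimal sketch using the newer Job API (older code uses the DistributedCache class directly); the cached file's URI is an assumed value:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");
        // The framework copies this file to every slave node before tasks run there.
        job.addCacheFile(new URI("/user/data/lookup.txt")); // assumed HDFS path
    }
}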
19) What commands are used to see all jobs running in the Hadoop cluster and to kill a job in Linux?
hadoop job -list
hadoop job -kill jobID
20) What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on a Hadoop cluster?
JobTracker is a giant service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop:
- When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
- It locates TaskTracker nodes with available slots for the data.
- It assigns the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that task as one to avoid.
21) How does the JobTracker assign tasks to the TaskTracker?
The TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is still alive. These messages also inform the JobTracker of the number of available slots, which keeps the JobTracker up to date on where in the cluster new tasks can be scheduled.
22) Is it necessary to write jobs for Hadoop in the Java language?
No, there are many ways to deal with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.