When you learn about Big Data you will sooner or later come
across this odd sounding word: Hadoop – but what exactly is it?
Put simply, Hadoop can be thought of as a set of open source
programs and procedures (meaning essentially they are free for anyone to use or
modify, with a few exceptions) which anyone can use as the “backbone” of their
big data operations.
I’ll try to keep things simple as I know a lot of people reading
this aren’t software engineers, so I hope I don’t over-simplify anything –
think of this as a brief guide for someone who wants to know a bit more about
the nuts and bolts that make big data analysis possible.
The 4 Modules of Hadoop
Hadoop is made up of “modules”, each of which carries out a
particular task essential for a computer system designed for big data
analytics.
1. Distributed File-System
The most important two are the Distributed File System, which
allows data to be stored in an easily accessible format, across a large number
of linked storage devices, and the MapReduce – which provides the basic tools
for poking around in the data.
(A “file system” is the method used by a computer to store data,
so it can be found and used. Normally this is determined by the computer’s
operating system, however a Hadoop system uses its own file system which sits
“above” the file system of the host computer – meaning it can be accessed using
any computer running any supported OS).
2. MapReduce
MapReduce is named after the two basic operations this module
carries out – reading data from the database, putting it into a format suitable
for analysis (map), and performing mathematical operations i.e counting the
number of males aged 30+ in a customer database (reduce).
3. Hadoop Common
The other module is Hadoop Common, which provides the tools (in
Java) needed for the user’s computer systems (Windows, Unix or whatever) to
read data stored under the Hadoop file system.
4. YARN
The final module is YARN, which manages resources of the systems
storing the data and running the analysis.
Various other procedures, libraries or features have come to be
considered part of the Hadoop “framework” over recent years, but Hadoop
Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are
the principle four.
How Hadoop Came About
Development of Hadoop began when forward-thinking software
engineers realised that it was quickly becoming useful for anybody to be able
to store and analyze datasets far larger than can practically be stored and
accessed on one physical storage device (such as a hard disk).
This is partly because as physical storage devices become bigger
it takes longer for the component that reads the data from the disk (which in a
hard disk, would be the “head”) to move to a specified segment. Instead, many
smaller devices working in parallel are more efficient than one large one.
It was released in 2005 by the Apache Software Foundation, a
non-profit organization which produces open source software which powers much
of the Internet behind the scenes. And if you’re wondering where the odd name
came from, it was the name given to a toy elephant belonging to the son of one
of the original creators!
The Usage of Hadoop
The flexible nature of a Hadoop system means companies can add
to or modify their data system as their needs change, using cheap and
readily-available parts from any IT vendor.
Today, it is the most widely used system for providing data
storage and processing across “commodity” hardware – relatively inexpensive, off-the-shelf
systems linked together, as opposed to expensive, bespoke systems custom-made
for the job in hand. In fact it is claimed that more than half of the companies
in the Fortune 500 make use of it.
Just about all of the big online names use it, and as anyone is
free to alter it for their own purposes, modifications made to the software by
expert engineers at, for example, Amazon and Google, are fed back to the
development community, where they are often used to improve the “official”
product. This form of collaborative development between volunteer and
commercial users is a key feature of open source software.
In its “raw” state – using the basic modules supplied here by
Apache, it can be very complex, even for IT professionals – which is why
various commercial versions have been developed such as Cloudera which simplify
the task of installing and running a Hadoop system, as well as offering
training and support services.
So that, in a (fairly large) nutshell, is Hadoop. Thanks
to the flexible nature of the system, companies can expand and adjust their
data analysis operations as their business expands. And the support and
enthusiasm of the open source community behind it has led to great strides
towards making big data analysis more accessible for everyone.

