The language of data is SQL, so naturally lots of tools have been developed to bring SQL to Hadoop. They range from simple wrappers on top of Map-Reduce to full data warehouse implementations built on top of HDFS, and everything in between. There are more tools than you might think, so this is my attempt at listing them all and hopefully providing some insight into what each of them actually does. I've tried to order them by 'installation friction', so the more complex products are towards the bottom. I'll cover the following technologies:
Hive is the original SQL-on-Hadoop solution.
Hive is an open-source Java project which converts SQL into a series of Map-Reduce jobs that run on standard Hadoop tasktrackers. It tries to look like MySQL by using a metastore (itself a database) to store table schemas, partitions, and locations. It largely supports MySQL syntax and organizes datasets using familiar database/table/view conventions. Hive provides:
· A SQL-like query interface called HiveQL, loosely modelled after MySQL
· A command-line client
· Metadata sharing via a central service
· JDBC drivers
· A Java API for creating custom functions and transformations
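To give a feel for HiveQL, here is a minimal sketch of defining a table over files already in HDFS and querying it; the table, columns, and HDFS path are made up for illustration:

```sql
-- Hypothetical example: an external table over tab-delimited files in HDFS.
-- The schema lives in the metastore; the data stays where it is.
CREATE EXTERNAL TABLE page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  url       STRING
)
PARTITIONED BY (view_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- A familiar-looking aggregate; Hive compiles this into Map-Reduce jobs.
SELECT view_date, COUNT(*) AS views
FROM page_views
GROUP BY view_date;
```

The query looks like ordinary SQL, but each statement is planned as one or more Map-Reduce jobs behind the scenes, which is where the latency discussed below comes from.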
SHOULD YOU USE IT?
Hive is the de-facto standard, installed on almost all Hadoop clusters. It's simple to set up and doesn't require much infrastructure to get started with. Given the small cost of trying it, there's pretty much no reason not to.
That said, queries performed with Hive are usually very slow because of the overhead associated with using Map-Reduce.
THE FUTURE OF HIVE
Hortonworks has been pushing the development of Apache Tez as a new back-end for Hive, to provide fast response times currently unachievable using Map-Reduce.
