
Friday, 30 October 2015

Profile: Hadoop Stack Developer and Administrator
“Transforming large, unruly data sets into competitive advantages”
Purveyor of competitive intelligence and holistic, timely analyses of Big Data made possible by the successful installation, configuration and administration of Hadoop ecosystem components and architecture.
·         Two years' experience installing, configuring and testing Hadoop ecosystem components.
·         Capable of processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture.
·         Able to assess business rules, collaborate with stakeholders and perform source-to-target data mapping, design and review.
·         Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing. Experience optimizing ETL workflows.
·         Hortonworks Certified Hadoop Developer, Cloudera Certified Hadoop Developer and Certified Hadoop Administrator.
Areas of Expertise:
·         Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Pig, Sqoop, Cassandra, Oozie, Flume, Chukwa, Pentaho Kettle and Talend
·         Programming Languages: Java, C/C++, eVB, Assembly Language (8085/8086)
·         Scripting Languages: JSP & Servlets, PHP, JavaScript, XML, HTML, Python and Bash
·         Databases: NoSQL, Oracle
·         UNIX Tools: Apache, Yum, RPM
·         Tools: Eclipse, JDeveloper, JProbe, CVS, Ant, MS Visual Studio
·         Platforms: Windows (2000/XP), Linux, Solaris, AIX, HP-UX
·         Application Servers: Apache Tomcat 5.x/6.0, JBoss 4.0
·         Testing Tools: NetBeans, Eclipse, WSAD, RAD
·         Methodologies: Agile, UML, Design Patterns

Professional Experience:
Hadoop Developer
Investor Online Network, Englewood Cliff, New Jersey, 2013 to present
Facilitated insightful daily analyses of 60 to 80 GB of website data collected by external sources, spawning recommendations and tips that increased traffic 38% and advertising revenue 16% for this online provider of financial market intelligence.
·         Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
·         Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
·         Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
·         Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
·         Managed and reviewed Hadoop log files.
·         Tested raw data and executed performance scripts.
·         Shared responsibility for administration of Hadoop, Hive and Pig.
Hadoop Developer/Administrator
Bank of the East, Yonkers, New York, 2012 to 2013
Helped this regional bank streamline business processes by developing, installing and configuring Hadoop ecosystem components that moved data from individual servers to HDFS.
·         Installed and configured MapReduce, Hive and HDFS; implemented a CDH3 Hadoop cluster on CentOS. Assisted with performance tuning and monitoring.
·         Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
·         Supported code/design analysis, strategy development and project planning.
·         Created reports for the BI team using Sqoop to import data into HDFS and Hive.
·         Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
·         Assisted with data capacity planning and node forecasting.
·         Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
·         Administrator for Pig, Hive and HBase, installing updates, patches and upgrades.
Java Developer
New York Bank, New York, New York, 2010 to 2012
Improved user satisfaction and adoption rates by designing, coding, debugging, documenting, maintaining and modifying a number of apps and programs for ATM and online banking. Participated in Hadoop training and development as part of a cross-training program.
·         Led the migration of monthly statements from a UNIX platform to an MVC Web-based Windows application using Java, JSP and Struts technology.
·         Prepared use cases, designed and developed object models and class diagrams.
·         Developed SQL statements to improve back-end communications.
·         Incorporated a custom logging mechanism for tracing errors, resolving all issues and bugs before deploying the application to the WebSphere server.
·         Received praise from users, shareholders and analysts for developing a highly interactive and intuitive UI using JSP, AJAX, JSF and jQuery techniques.
·         View samples at www.myportfolio.com/aburke

Education, Training and Professional Development
New Jersey Institute of Technology, BS Computer Science
Hadoop Training
Accelebrate: “Hadoop Administration Training”
Cloudera University Courses: “Hadoop Essentials” and “Hadoop Fundamentals I & II”
MapReduce Courses: “Introduction to Apache MapReduce and HDFS,” “Writing MapReduce Applications” and “Intro to Cluster Administration”
Nitesh Jain: “Become a Certified Hadoop Developer”
Member, Hadoop Users Group of New Jersey


Jackson Lindsey
123 Main Street, San Francisco, CA 94122
Home: 000-000-0000 Cell: 000-000-0000
email@example.com
Professional Summary
Experienced Hadoop Developer with a strong background in distributed file systems in a big data arena. Understands the complex processing needs of big data and has experience developing code and modules to address those needs. Brings a Master's degree in Computer Science along with certification as a Developer using Apache Hadoop.
Core Qualifications
·         Good experience with Python, Pig, Sqoop, Oozie, Hadoop Streaming and Hive
·         Solid understanding of the Hadoop Distributed File System
·         Extensive knowledge of ETL, including Ab Initio and Informatica
·         Excellent oral and written communication skills
·         Collaborates well across technology groups
·         Vast experience with Java, Puppet, Chef, Linux, Perl and Python
·         In-depth understanding of MapReduce and the Hadoop Infrastructure
·         Focuses on the big picture with problem-solving
Experience
June 2009 to July 2014
Simpson Technological Solutions, New Cityland, CA
Hadoop Developer
·         Communicated across traditional technology group boundaries to allow for collaborative delivery.
·         Conducted code reviews to ensure correct system operation.
·         Utilized high-level information architecture to design modules for complex programs.
·         Prepared code modules for staging.
August 2005 to May 2009
RMS Innovation, New Cityland, CA
Hadoop Developer
·         Authored extremely detailed specifications manuals.
·         Completed integration testing; tracked and resolved defects.
·         Performed problem-solving in a big data arena.
·         Programmed code within distributed file networks.
·         Always followed standards and procedures for documentation.
Education
2007 California Pacific University, New Cityland, CA, Master of Science in Computer Science
2005 California Pacific University, New Cityland, CA, Bachelor of Science in Computer Programming
2005 Certified Developer for Apache Hadoop through Cloudera


Hadoop Developer - HADOOP BIG DATA MANAGEMENT

• 1.9 years of professional experience at Izon Technosoft working as a Hadoop Developer.
• Experience writing queries in Pig and Hive.
• Executed jobs in Hadoop local mode, pseudo-distributed mode and cluster mode for production.
• Exposure to writing SQL queries.
• Experience developing MapReduce applications for Hadoop.
• Optimized several MapReduce algorithms in Java according to client requirements for big data analytics.
• Followed the given standard approaches for debugging and error handling.
• Exposure to Hadoop's query programming models such as Hive, Pig and SQL.
• Wrote unit test cases in JUnit and submitted unit test results as per the quality process.
• Knowledge of Hadoop architecture, the Hadoop Distributed File System and the Hadoop ecosystem.
• Good understanding of HBase, ZooKeeper and Oozie.
 
The project is meant to provide a one-stop solution for all the major actions required while working
with big data and the Hadoop ecosystem.
 
Development of enterprise-level data lakes and making data available.
• Creation of various data zones (landing, staging, processing and raw) to make data available for the different phases.
• Data ingestion from the landing zone to staging.
• Making data available in the raw data zone and integrating it into Hive tables for further analysis.
Did POCs in the sandbox environment (a minimal command sketch follows this list):
• Required a dedicated environment for data analysis.
• Creation of a dedicated HDFS folder.
• Creation of a Hive database over it.
• Creation of tables/views over it.
• Maintenance of permissions and ownership.
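
A minimal command sketch of that sandbox setup, assuming hypothetical paths, database, table and user names (the real zone layout would follow the project's conventions):

  hadoop fs -mkdir -p /data/poc/raw
  hadoop fs -chown analyst:analysts /data/poc/raw
  hive -e "CREATE DATABASE IF NOT EXISTS poc_db LOCATION '/data/poc';"
  hive -e "CREATE EXTERNAL TABLE poc_db.events (id STRING, payload STRING)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
           LOCATION '/data/poc/raw';"
  hive -e "CREATE VIEW poc_db.valid_events AS
           SELECT * FROM poc_db.events WHERE id IS NOT NULL;"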
Roles and Responsibilities:
Worked as a Hadoop developer using the Hadoop framework, MapReduce, Pig, Hive, HBase and core Java.
• Handled bad records while processing big datasets.
• Installed Hadoop and Hive.
• Exposure to Hadoop's query programming models: Hive, Pig and SQL.

Work Experience

Hadoop Developer
HADOOP BIG DATA MANAGEMENT
January 2014 to Present
Platform: Hadoop.
Technology: Hadoop ecosystem including HDFS, Hive, Pig.
Hadoop Developer
Izon Technosoft

Education

B-Tech in Electrical and Electronics
Vidya College Of Engineering, Meerut, Uttar Pradesh
CBSE
Translam Academy International, Meerut, Uttar Pradesh
CBSE
Rainbow Public School, Saharanpur, Uttar Pradesh

Additional Information

TECHNICAL SKILLS

Platform: Hadoop
Operating Systems: Windows, Linux.
Languages: Java 1.7
Course: CCNA (Routing and Switching)


1) What is Hadoop?
Hadoop is a distributed computing platform. It is written in Java. It consists of features modeled on the Google File System (as HDFS) and MapReduce.

2) What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version, preferably from Sun, is good for Hadoop. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known to work.

3) What kind of Hardware is best for Hadoop?
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact requirements depend on the workflow's needs.

4) What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
  1. TextInputFormat
  2. KeyValueInputFormat
  3. SequenceFileInputFormat
TextInputFormat is the default input format.
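
For illustration, a minimal driver sketch (using the newer org.apache.hadoop.mapreduce API; class and path names are hypothetical) showing that the input format is chosen per job, with TextInputFormat applying whenever nothing is set:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class InputFormatDemo {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "input-format-demo");
          job.setJarByClass(InputFormatDemo.class);
          // Override the default TextInputFormat: treat each line as a
          // tab-separated key/value pair instead of (offset, line).
          job.setInputFormatClass(KeyValueTextInputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }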

5) What is InputSplit in Hadoop? Explain.
When a Hadoop job runs, it splits the input files into chunks and assigns each split to a mapper for processing. Each such chunk is called an InputSplit.

6) How many InputSplits are made by the Hadoop framework for input files of 64 KB, 65 MB and 127 MB, given a 64 MB block size?
Hadoop will make 5 splits, as follows:
  • One split for the 64 KB file,
  • Two splits for the 65 MB file (64 MB + 1 MB), and
  • Two splits for the 127 MB file (64 MB + 63 MB)

7) What is the use of RecordReader in Hadoop?
An InputSplit is assigned work but doesn't know how to access the data. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
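
To make that relationship concrete, here is a minimal sketch (mirroring what TextInputFormat itself does) of an InputFormat supplying a RecordReader for each split; the class name is hypothetical:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

  public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
              InputSplit split, TaskAttemptContext context) {
          // The LineRecordReader turns the split's raw bytes into
          // (byte offset, line text) pairs for the Mapper to consume.
          return new LineRecordReader();
      }
  }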

8) What is JobTracker in Hadoop?
JobTracker is a service within Hadoop that runs MapReduce jobs on the cluster.

9) What are the functionalities of JobTracker?
These are the main tasks of JobTracker:
  • To accept jobs from clients.
  • To communicate with the NameNode to determine the location of the data.
  • To locate TaskTracker nodes with available slots.
  • To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.

10) Define TaskTracker.
TaskTracker is a node in the cluster that accepts tasks such as Map, Reduce and Shuffle operations from a JobTracker.

11) What is Map/Reduce job in Hadoop?
Map/Reduce is a programming paradigm used to achieve massive scalability across thousands of servers.
MapReduce actually refers to two different and distinct tasks that Hadoop performs. In the first step, the map job takes a set of data and converts it into another set of data. In the second step, the reduce job takes the output from the map as its input and compresses those data tuples into a smaller set of tuples.
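
The canonical illustration is word count, sketched below with the standard Hadoop Java API: the map step turns each line into (word, 1) pairs, and the reduce step compresses each word's pairs into a single (word, total) tuple.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {
      public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);   // map output: (word, 1)
              }
          }
      }

      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values,
                  Context context) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);    // reduce output: (word, total)
          }
      }
  }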

12) What is Hadoop Streaming?
Hadoop Streaming is a utility that lets you create and run map/reduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper or reducer.
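
For example, the streaming jar ships with the Hadoop distribution, and ordinary Unix programs can act as the mapper and reducer (the input and output paths below are placeholders):

  hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper /bin/cat \
      -reducer /usr/bin/wc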

13) What is a combiner in Hadoop?
A Combiner is a mini-reduce process that operates only on data generated by a Mapper. When the Mapper emits its data, the Combiner receives it as input and sends its output on to the Reducer.
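
In the Java API the combiner is set in the job driver. A reducer that only sums can usually double as the combiner, because addition is associative and commutative; a driver fragment reusing the WordCount classes sketched above:

  job.setMapperClass(WordCount.TokenizerMapper.class);
  // Runs on each mapper's local output before the shuffle,
  // shrinking the data sent across the network to the reducers.
  job.setCombinerClass(WordCount.IntSumReducer.class);
  job.setReducerClass(WordCount.IntSumReducer.class);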

14) Is it necessary to know java to learn Hadoop?
A background in any programming language like C, C++, PHP or Python can be really helpful, but if you know no Java at all, it is necessary to learn Java and also gain a basic knowledge of SQL.

15) How to debug Hadoop code?
There are many ways to debug Hadoop code, but the most popular methods are:
  • Using Counters (sketched below).
  • Using the web interface provided by the Hadoop framework.
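
A short sketch of the Counter approach (the group and counter names are arbitrary examples): every increment is aggregated by the framework and displayed on the job's web UI, which makes it a cheap way to spot bad records.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class AuditMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          if (value.toString().trim().isEmpty()) {
              // Counted across all tasks; visible in the job UI and history.
              context.getCounter("Audit", "EMPTY_LINES").increment(1);
              return;
          }
          context.write(new Text("line"), value);
      }
  }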

16) Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.
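
For instance, with the newer API a driver can register several input directories, optionally with a different mapper per source; the paths and mapper classes below are hypothetical:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  // Inside the job driver: same format and mapper for every path,
  FileInputFormat.addInputPath(job, new Path("/data/day1"));
  FileInputFormat.addInputPath(job, new Path("/data/day2"));

  // or a different mapper per data source:
  MultipleInputs.addInputPath(job, new Path("/data/orders"),
          TextInputFormat.class, OrdersMapper.class);
  MultipleInputs.addInputPath(job, new Path("/data/customers"),
          TextInputFormat.class, CustomersMapper.class);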

17) What is the relation between job and task in Hadoop?
In Hadoop, a job is divided into multiple small parts known as tasks.

18) What is distributed cache in Hadoop?
The distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, etc.) needed at job-execution time. The framework copies the necessary files to a slave node before any task executes on that node.
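
A sketch of the usual pattern with the newer Java API (the file path, the "#stopwords" alias and the class names are hypothetical): the driver registers the file once, and each mapper reads its local copy in setup().

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.net.URI;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;

  public class CacheDemo {
      // Driver side: register the HDFS file; the framework copies it
      // to each slave node before tasks start there.
      static void registerCache(Job job) throws Exception {
          job.addCacheFile(new URI("/apps/lookup/stopwords.txt#stopwords"));
      }

      public static class FilterMapper
              extends Mapper<LongWritable, Text, Text, Text> {
          private final Set<String> stopwords = new HashSet<String>();

          @Override
          protected void setup(Context context) throws IOException {
              // Read the locally cached copy through its "#stopwords" alias.
              BufferedReader in = new BufferedReader(new FileReader("stopwords"));
              String line;
              while ((line = in.readLine()) != null) {
                  stopwords.add(line.trim());
              }
              in.close();
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              if (!stopwords.contains(value.toString().trim())) {
                  context.write(new Text("kept"), value);
              }
          }
      }
  }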

19) What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?
hadoop job -list
hadoop job -kill <jobID>

20) What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on Hadoop cluster?
JobTracker is a daemon service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.
Functionalities of JobTracker in Hadoop:
  • When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
  • It locates TaskTracker nodes with available slots at or near the data.
  • It assigns the work to the chosen TaskTracker nodes.
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do: it may resubmit the task on another node, or it may mark that task as one to avoid.

21) How does the JobTracker assign tasks to the TaskTracker?
The TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is alive. These messages also inform the JobTracker of the number of available slots, which lets the JobTracker know where tasks can be scheduled.

22) Is it necessary to write jobs for Hadoop in Java language?
No, there are many ways to deal with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.