- Hadoop is an open-source framework used for storing and processing big data.
- Doug Cutting and Michael Cafarella developed Hadoop between 2004 and 2008, with much of the work done at Yahoo.
- In 2006, Yahoo donated the Hadoop project to Apache.
- It is mainly used to map data and reduce it across a cluster (the MapReduce model).
- Hadoop is a Linux-based framework.
- Hadoop is infrastructure software and a cluster-based system.
- Hadoop is owned and maintained by the Apache Software Foundation.
- Organizations where Hadoop is used:
- Amazon
- Yahoo
- IBM
- eBay
- Hadoop has two kinds of users:
- Users – who design, import/export, and work with applications
- Administrators – who install, monitor, manage, and tune the cluster
- Hadoop breaks data into equal-sized blocks for computation.
- It has three components:
- MapReduce (a word-count sketch follows this list)
- File System [HDFS]
- Projects (a set of tools)
- The project tools are:
- Hive
- HBase
- Mahout
- Pig
- Oozie
- Sqoop
- Flume
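To make the MapReduce component concrete, here is a minimal word-count job in Java, essentially the classic Hadoop tutorial example. The class name is illustrative; the input and output directories come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop splits the input across mappers, shuffles the (word, count) pairs by key, and hands each key's values to a reducer.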
- Hadoop is fault tolerant by default: hardware failure is handled by HDFS, which keeps copies of each block on other nodes so that lost data can be recovered (a replication sketch follows these points).
- If a TaskTracker fails, the JobTracker asks another TaskTracker to perform the operation requested by the application.
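A minimal sketch of the replication idea, assuming a configured HDFS client on the classpath; the file path is a placeholder, and 3 is HDFS's default replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep 3 copies of each block of files written by this client;
    // if a DataNode dies, the NameNode re-replicates from the surviving copies.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) { // placeholder path
      out.writeUTF("fault tolerant by replication");
    }
  }
}
```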
- Hadoop projects provide additional functionality:
A. Flume – A service for collecting, aggregating, and moving large amounts of data. Its simple, flexible architecture suits online analytic applications.
B. Hive – Data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis (a JDBC sketch follows this list).
C. HBase – An open-source, non-relational distributed database that runs on top of HDFS (a client read/write sketch follows this list).
D. Mahout – Distributed and scalable machine learning algorithms on the Hadoop platform (e.g., user recommendation tasks).
E. Oozie – A Java-based web application used as a workflow scheduler system to manage Hadoop jobs.
F. Pig – A high-level platform for creating MapReduce programs in a language called Pig Latin, which plays a role similar to SQL in an RDBMS (an embedded example follows this list).
G. Sqoop – A tool designed for transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
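The Hive sketch referenced above: a minimal HiveQL query sent through the HiveServer2 JDBC driver. The endpoint, empty credentials, and the `words` table are placeholders, not part of the original notes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", ""); // placeholder endpoint
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL, but Hive executes it as jobs over data in HDFS
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS n FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
      }
    }
  }
}
```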
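The HBase sketch: one write and one read through the standard HBase Java client. The `users` table, `info` column family, and the values are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) { // placeholder table
      // Write one cell: row key "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Read it back by row key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}
```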
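And the Pig example: the same word count written in Pig Latin and driven from Java through Pig's embedded `PigServer` API in local mode. The file names are placeholders; Pig compiles the pipeline into MapReduce jobs.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL); // local mode for illustration
    // Each line of Pig Latin builds a relation; nothing runs until store()
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);"); // placeholder file
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcount_out"); // placeholder output directory
  }
}
```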