Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.
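The map-reduce data flow can be sketched in plain Python. This is a conceptual illustration only; a real Hadoop job implements Mapper and Reducer classes and runs distributed across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # 2
```

In a real cluster the shuffle is performed by the framework between distributed map and reduce tasks; here it is a single in-memory grouping.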

A cluster has at least one Name Node, and usually more than one for redundancy. The Name Node accepts requests coming in from client applications and distributes the processing work across the Data Nodes, of which there are typically many. The Data Nodes can share that work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).
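As a rough illustration of how a large file is broken into blocks and spread across Data Nodes (the block size and node names below are illustrative assumptions, not the real HDFS protocol, which also handles replication and rack awareness):

```python
# HDFS splits files into fixed-size blocks; the default is 128 MB,
# scaled down to 128 bytes here for demonstration.
BLOCK_SIZE = 128
DATA_NODES = ["datanode1", "datanode2", "datanode3"]  # hypothetical node names

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATA_NODES):
    """Round-robin block placement; the real NameNode also tracks replicas."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

data = b"x" * 300
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
print(len(blocks), placement[0])  # 3 datanode1
```

The Name Node's metadata role corresponds to the `placement` map here: it knows which node holds which block, while the blocks themselves live on the Data Nodes.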

Description of Hadoop cluster

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
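Hadoop selects a filesystem implementation from the scheme of a path's URI. The schemes below (hdfs, s3a, file) are real; the host, bucket, and paths are hypothetical examples, parsed here with the standard library just to show how one URI names both the filesystem and the location within it:

```python
from urllib.parse import urlparse

paths = [
    "hdfs://namenode:8020/user/data/input.txt",  # HDFS: authority is the namenode
    "s3a://my-bucket/logs/part-0000",            # S3: authority is the bucket
    "file:///tmp/local-copy.txt",                # local filesystem: no authority
]

parsed = [urlparse(p) for p in paths]
for u in parsed:
    print(u.scheme, u.netloc, u.path)
```

The same MapReduce or Spark job can usually run against any of these by changing only the input URI, since Hadoop dispatches on the scheme.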

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and for visually inspecting MapReduce, Pig, and Hive applications, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, designed to reduce the boilerplate code that MapReduce programmers would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.


Related Technology

Commercial support is available from a variety of companies.

44316 questions
50 votes, 10 answers

Spark iterate HDFS directory

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?
Jon
49 votes, 2 answers

Schema evolution in parquet format

Currently we are using Avro data format in production. Out of several good points using Avro, we know that it is good in schema evolution. Now we are evaluating Parquet format because of its efficiency while reading random columns. So before moving…
ToBeSparkShark
49 votes, 6 answers

Hadoop/Hive : Loading data from .csv on a local machine

As this is coming from a newbie... I had Hadoop and Hive set up for me, so I can run Hive queries on my computer accessing data on AWS cluster. Can I run Hive queries with .csv data stored on my computer, like I did with MS SQL Server? How do I…
mel
49 votes, 7 answers

Why do I need to source bash_profile every time

I have installed Hadoop and every time I want to run it, first I have to do this: source ~/.bash_profile or it won't recognize the command hadoop Why is that? I am on OSX 10.8
user1899082
48 votes, 10 answers

What is RDD in spark

Definition says: RDD is immutable distributed collection of objects I don't quite understand what does it mean. Is it like data (partitioned objects) stored on hard disk If so then how come RDD's can have user-defined classes (Such as java,…
kittu
48 votes, 9 answers

What is a container in YARN?

What is a container in YARN? Is it same as the child JVM in which the tasks on the nodemanager run or is it different?
rahul
47 votes, 3 answers

Is it better to use the mapred or the mapreduce package to create a Hadoop Job?

To create MapReduce jobs you can either use the old org.apache.hadoop.mapred package or the newer org.apache.hadoop.mapreduce package for Mappers and Reducers, Jobs ... The first one had been marked as deprecated but this got reverted meanwhile. Now…
momo13
47 votes, 12 answers

http://localhost:50070 does not work HADOOP

I already installed Hadoop on my machine "Ubuntu 13.05" and now I have an error when browsing localhost:50070 the browser says that the page does not exist.
deltascience
47 votes, 11 answers

Hive query output to file

I run hive query by java code. Example: "SELECT * FROM table WHERE id > 100" How to export result to hdfs file.
cldo
47 votes, 29 answers

Datanode process not running in Hadoop

I set up and configured a multi-node Hadoop cluster using this tutorial. When I type in the start-all.sh command, it shows all the processes initializing properly as follows: starting namenode, logging to…
Jawwad Zakaria
46 votes, 3 answers

SparkSQL vs Hive on Spark - Difference and pros and cons?

SparkSQL CLI internally uses HiveQL and in case Hive on spark(HIVE-7292) , hive uses spark as backend engine. Can somebody throw some more light, how exactly these two scenarios are different and pros and cons of both approaches?
Gaurav Khare
46 votes, 10 answers

Default Namenode port of HDFS is 50070.But I have come across at some places 8020 or 9000

When I setup the hadoop cluster, I read the namenode runs on 50070 and I set up accordingly and it's running fine. But in some books I have come across name node address : hdfs://localhost:9000/ or hdfs://localhost:8020 What exactly is the proper…
Kumar
46 votes, 3 answers

What is best way to start and stop hadoop ecosystem, with command line?

I see there are several ways we can start hadoop ecosystem, start-all.sh & stop-all.sh Which say it's deprecated use start-dfs.sh & start-yarn.sh. start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh hadoop-daemon.sh namenode/datanode and…
twid
46 votes, 1 answer

hadoop.mapred vs hadoop.mapreduce?

Why are there two separate packages map-reduce package in Apache's hadoop package tree: org.apache.hadoop.mapred…
bartonm
45 votes, 5 answers

Hive installation issues: Hive metastore database is not initialized

I tried to install hive on a raspberry pi 2. I installed Hive by uncompress zipped Hive package and configure $HADOOP_HOME and $HIVE_HOME manually under hduser user-group I created. When running hive, I got the following error message: hive ERROR…
As high as honor