Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and farms the processing work out to DataNodes, of which there are typically many, sharing the work between them. The DataNodes can cooperate because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).
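A typical client interaction, sketched below assuming a running cluster and the standard `hdfs` CLI on the path (paths and filenames are placeholders): the client asks the NameNode for metadata, then streams file blocks to and from DataNodes.

```shell
# Create a directory and upload a file; HDFS splits it into blocks
# and replicates them across DataNodes.
hdfs dfs -mkdir -p /user/alice/input
hdfs dfs -put local.txt /user/alice/input

# Listing is answered from NameNode metadata; reading streams the
# blocks back from whichever DataNodes hold replicas.
hdfs dfs -ls /user/alice/input
hdfs dfs -cat /user/alice/input/local.txt
```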

(Diagram: description of a Hadoop cluster)

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
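As a minimal sketch, switching the default filesystem is a matter of configuration; the bucket name below is a placeholder, and the `s3a` connector additionally requires the `hadoop-aws` module and credentials to be set up:

```xml
<!-- core-site.xml: point Hadoop's default filesystem at an S3 bucket
     instead of HDFS (example-bucket is a placeholder) -->
<property>
  <name>fs.defaultFS</name>
  <value>s3a://example-bucket/</value>
</property>
```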

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
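Whichever of these engines submits work, YARN's ResourceManager tracks the cluster's nodes and applications. A sketch of inspecting it with the standard `yarn` CLI, assuming a running cluster (the application id is a placeholder):

```shell
# NodeManagers currently registered with the ResourceManager
yarn node -list

# Applications running on the cluster (MapReduce, Spark, Tez, ...)
yarn application -list

# Stop an application and release its containers
yarn application -kill application_1425000000000_0001
```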


Commercial support is available from a variety of companies.

44316 questions