Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and farms the processing work out to DataNodes, of which there are typically many, sharing the work between them. The DataNodes can cooperate because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).
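A typical client interaction, sketched below assuming a running cluster and the standard `hdfs` CLI on the path (paths and filenames are placeholders): the client asks the NameNode for metadata, then streams file blocks to and from DataNodes.

```shell
# Create a directory and upload a file; HDFS splits it into blocks
# and replicates them across DataNodes.
hdfs dfs -mkdir -p /user/alice/input
hdfs dfs -put local.txt /user/alice/input

# Listing is answered from NameNode metadata; reading streams the
# blocks back from whichever DataNodes hold replicas.
hdfs dfs -ls /user/alice/input
hdfs dfs -cat /user/alice/input/local.txt
```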

(Diagram: description of a Hadoop cluster)

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
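As a minimal sketch, switching the default filesystem is a matter of configuration; the bucket name below is a placeholder, and the `s3a` connector additionally requires the `hadoop-aws` module and credentials to be set up:

```xml
<!-- core-site.xml: point Hadoop's default filesystem at an S3 bucket
     instead of HDFS (example-bucket is a placeholder) -->
<property>
  <name>fs.defaultFS</name>
  <value>s3a://example-bucket/</value>
</property>
```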

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and for viewing MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
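Whichever of these engines submits work, YARN's ResourceManager tracks the cluster's nodes and applications. A sketch of inspecting it with the standard `yarn` CLI, assuming a running cluster (the application id is a placeholder):

```shell
# NodeManagers currently registered with the ResourceManager
yarn node -list

# Applications running on the cluster (MapReduce, Spark, Tez, ...)
yarn application -list

# Stop an application and release its containers
yarn application -kill application_1425000000000_0001
```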


Commercial support is available from a variety of companies.

44316 questions