Questions tagged [hadoop2]

Hadoop 2 represents the second generation of the popular open source distributed platform Apache Hadoop.

Apache Hadoop 2.x brings significant improvements over the previous stable release, Hadoop 1.x. Several major enhancements have been made to both building blocks of Hadoop, namely HDFS and MapReduce:

  1. HDFS Federation: To scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces.

  2. MapReduce NextGen (also known as YARN or MRv2): The new architecture splits the two major functions of the JobTracker, resource management and job life-cycle management, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.
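
The division of labor described in item 2 can be illustrated with a toy model. This is only a sketch: the class names mirror the YARN component names, but every method, field, and the "most free slots" placement policy are hypothetical simplifications and are not the real YARN API or scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine daemon: tracks free capacity and what runs on the machine."""
    name: str
    free_slots: int  # hypothetical stand-in for memory/CPU capacity
    running: list = field(default_factory=list)

    def launch(self, app_id: str) -> bool:
        if self.free_slots == 0:
            return False
        self.free_slots -= 1
        self.running.append(app_id)
        return True


class ResourceManager:
    """Global arbiter: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, count: int) -> list:
        granted = []
        for _ in range(count):
            # Toy policy (an assumption): pick the node with the most free slots.
            node = max(self.nodes, key=lambda n: n.free_slots)
            if not node.launch(app_id):
                break  # no capacity left anywhere in the cluster
            granted.append(node.name)
        return granted


class ApplicationMaster:
    """Per-application coordinator: requests containers for its own tasks."""

    def __init__(self, app_id: str, rm: ResourceManager):
        self.app_id = app_id
        self.rm = rm

    def run(self, num_tasks: int) -> list:
        return self.rm.allocate(self.app_id, num_tasks)


rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 1)])
placements = ApplicationMaster("app_0001", rm).run(4)
print(placements)
```

The application asks for four containers but the toy cluster only has three slots, so the ResourceManager grants three and the ApplicationMaster would have to wait or retry for the fourth.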

For more information on Hadoop 2, see the official Hadoop 2 homepage.

2047 questions
11 votes, 5 answers

Hadoop fs -du-h sorting by size for M, G, T, P, E, Z, Y

I am running this command: sudo -u hdfs hadoop fs -du -h /user | sort -nr and the output is not sorted in terms of gigabytes, terabytes, and so on. I found this command: hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT";…
Mayur Narang
  • 111
  • 1
  • 1
  • 5
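
For the size-sorting question above, the underlying idea (convert each human-readable size to a plain number, then sort on that number) can be sketched as follows. This is a minimal sketch, not an answer taken from the thread: it assumes each line has the shape `<number> [<unit>] <path>` with binary (1024-based) units, and the sample paths are made up.

```python
import re

# Binary multipliers for the unit letters named in the question title.
UNITS = {u: 1024 ** i for i, u in enumerate(["", "K", "M", "G", "T", "P", "E", "Z", "Y"])}

# Assumed line shape: "<number> [<unit>] <path>", e.g. "1.5 G /user/b".
LINE = re.compile(r"^\s*([\d.]+)\s*([KMGTPEZY]?)\s+(\S.*)$")


def size_in_bytes(line: str) -> float:
    """Return the size at the start of the line as a number of bytes."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    number, unit, _path = match.groups()
    return float(number) * UNITS[unit]


def sort_by_size(lines):
    """Sort lines from largest to smallest size."""
    return sorted(lines, key=size_in_bytes, reverse=True)


# Hypothetical sample lines for illustration only.
sample = ["512 M /user/a", "1.5 G /user/b", "900 /user/c", "2 T /user/d"]
for line in sort_by_size(sample):
    print(line)
```

A plain numeric sort compares only the leading number, which is why `512 M` would otherwise rank above `2 T`.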
11 votes, 2 answers

Importing CSV file into Hadoop

I am new to Hadoop. I have a file to import into Hadoop via the command line (I access the machine through SSH). How can I import the file into Hadoop? How can I check afterward (with which command)?
akaliza
  • 3,641
  • 6
  • 24
  • 31
11 votes, 3 answers

Could not find or load main class com.sun.tools.javac.Main hadoop mapreduce

I am trying to learn MapReduce but I am a little lost right now. http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Usage Particularly this set of instructions: Compile WordCount.java and…
Liondancer
  • 15,721
  • 51
  • 149
  • 255
11 votes, 1 answer

namespace image and edit log

From the book "Hadoop The Definitive Guide", under the topic Namenodes and Datanodes it is mentioned that: The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the…
user4221591
  • 2,084
  • 7
  • 34
  • 68
10 votes, 1 answer

Should Hadoop FileSystem be closed?

I'm building a Spring Boot-powered service that writes data to Hadoop using the FileSystem API. Some data is written to Parquet files and large blocks are cached in memory, so when the service is shut down, potentially several hundred MB of data have to…
epsylon
  • 357
  • 3
  • 13
10 votes, 2 answers

Working with input splits (Hadoop)

I have a .txt file as follows: This is xyz This is my home This is my PC This is my room This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx (ignoring the blank line after each record) I have set the…
User9523
  • 415
  • 1
  • 4
  • 18
10 votes, 5 answers

Where is the classpath set for hadoop

Where is the classpath for hadoop set? When I run the below command it gives me the classpath. Where is the classpath set? bin/hadoop classpath I'm using hadoop 2.6.0
Bourne
  • 1,905
  • 13
  • 35
  • 53
10 votes, 5 answers

Difference between a ring buffer and a queue

What is the difference between a ring (circular) buffer and a queue? Both support FIFO, so in what scenarios should I use a ring buffer over a queue, and why? Relevance to Hadoop: the map phase uses a ring buffer to store intermediate key-value pairs.…
Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
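
As background for the ring-buffer question above, here is a generic sketch of a ring buffer: a FIFO with fixed capacity, backed by a preallocated array and wrap-around indices. It is illustrative only. The `RingBuffer` class and its overwrite-oldest-when-full policy are assumptions chosen for the example and do not describe how Hadoop's map-side buffer is implemented.

```python
class RingBuffer:
    """Fixed-capacity FIFO backed by a preallocated list."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity
        self._head = 0  # index of the oldest item
        self._size = 0  # number of items currently stored

    def push(self, item) -> None:
        capacity = len(self._slots)
        tail = (self._head + self._size) % capacity
        self._slots[tail] = item
        if self._size == capacity:
            # Full: the write just replaced the oldest item, so advance head.
            self._head = (self._head + 1) % capacity
        else:
            self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty ring buffer")
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return item


buf = RingBuffer(3)
for value in [1, 2, 3, 4]:
    buf.push(value)
drained = [buf.pop() for _ in range(3)]
print(drained)
```

Unlike an unbounded queue, the buffer never allocates after construction. The cost is that a full buffer must either drop data (as here), block, or hand data off elsewhere.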
9 votes, 2 answers

Spark/Yarn: File does not exist on HDFS

I have a Hadoop/YARN cluster set up on AWS with one master and 3 slaves. I have verified that I have 3 live nodes running on ports 50070 and 8088. I tested a Spark job in client deploy mode and everything works fine. When I try to spark-submit a job using…
user1187968
  • 7,154
  • 16
  • 81
  • 152
9 votes, 5 answers

Can Apache YARN be used without HDFS?

I want to use Apache YARN as a cluster and resource manager for running a framework where resources would be shared across different tasks of the same framework. I want to use my own distributed off-heap file system. Is it possible to use any other…
Amar Gajbhiye
  • 484
  • 5
  • 17
9 votes, 2 answers

Namenode high availability client request

Can anyone please tell me: if I am using a Java application to request file upload/download operations on HDFS with a Namenode HA setup, where does this request go first? I mean, how would the client know which namenode is active? It would be…
user2846382
  • 385
  • 1
  • 3
  • 16
9 votes, 1 answer

could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation

I don't know how to fix this error: Vertex failed, vertexName=initialmap, vertexId=vertex_1449805139484_0001_1_00, diagnostics=[Task failed, taskId=task_1449805139484_0001_1_00_000003, diagnostics=[AttemptID:attempt_1449805139484_0001_1_00_000003_0…
Mona Jalal
  • 34,860
  • 64
  • 239
  • 408
9 votes, 2 answers

How to optimize shuffling/sorting phase in a hadoop job

I'm doing some data preparation using a single-node Hadoop job. The mapper/combiner in my job outputs many keys (more than 5M or 6M) and obviously the job proceeds slowly or even fails. The mapping phase runs up to 120 mappers and there is just one…
HHH
  • 6,085
  • 20
  • 92
  • 164
9 votes, 4 answers

Is there the equivalent for a `find` command in `hadoop`?

I know that from the terminal, one can do a find command to find files such as : find . -type d -name "*something*" -maxdepth 4 But, when I am in the hadoop file system, I have not found a way to do this. hadoop fs -find .... throws an error. How…
makansij
  • 9,303
  • 37
  • 105
  • 183
9 votes, 2 answers

Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability

After reading the Apache Hadoop documentation, I am slightly confused about the responsibilities of the secondary node and the checkpoint node. I am clear on the Namenode's role and responsibilities: The NameNode stores modifications to the file system…
Ravindra babu
  • 37,698
  • 11
  • 250
  • 211