Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and farms the work out to DataNodes; a typical cluster has many DataNodes that share the processing load between them. They can do this because they all have access to a shared file system, referred to as the Hadoop Distributed File System (HDFS).

(Diagram: description of a Hadoop cluster)
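
A minimal client sketch in Java may make this concrete. The NameNode address and file path below are placeholders; the FileSystem API hides which DataNodes actually serve the blocks:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address: point fs.defaultFS at your NameNode.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);
            // The NameNode resolves the path to block locations; the bytes
            // themselves stream from whichever DataNodes hold the blocks.
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }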

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop includes a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo: a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, with features to diagnose their performance in a user-friendly manner.
  • Avro: a data serialization system based on JSON schemas.
  • Cassandra: a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa: a data collection system for managing large distributed systems.
  • Cascading: a software abstraction layer for Apache Hadoop, aimed mainly at Java developers and designed to spare MapReduce programmers the boilerplate code that jobs otherwise require.
  • Flink: a fast and reliable large-scale data processing engine.
  • Giraph: an iterative graph processing framework built on top of Apache Hadoop.
  • HBase: a scalable, distributed database that supports structured data storage for large tables.
  • Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig: a platform and programming language for authoring parallelizable jobs.
  • Spark: a fast and general engine for large-scale data processing.
  • Storm: a system for real-time and stream processing.
  • Tez: an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper: a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
37 votes, 8 answers

Hadoop DistributedCache is deprecated - what is the preferred API?

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows: // In the driver JobConf conf = new…
asked by DNA

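For reference, a hedged sketch of the replacement API on Hadoop 2.x+: Job.addCacheFile on the driver and Context.getCacheFiles in the task (the cached path is a placeholder):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Driver side: Job.addCacheFile replaces DistributedCache.addCacheFile.
    Job job = Job.getInstance(new Configuration(), "example");
    job.addCacheFile(new URI("/config/lookup.dat")); // placeholder path

    // Mapper side, e.g. in setup(): the URIs registered above come back
    // from the task context:
    //   URI[] cached = context.getCacheFiles();
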
37 votes, 7 answers

How to find the size of a HDFS file

How do I find the size of an HDFS file? What command should be used to find the size of any file in HDFS?
asked by priya

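One way to read the size programmatically, sketched against the Java FileSystem API (the path is a placeholder); on the command line, hadoop fs -du -h <path> reports the same information:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // getLen() returns the file length in bytes.
    long bytes = fs.getFileStatus(new Path("/data/input.txt")).getLen(); // placeholder path
    System.out.println("size in bytes: " + bytes);
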
37 votes, 5 answers

putting a remote file into hadoop without copying it to local disk

I am writing a shell script to put data into Hadoop as soon as it is generated. I can ssh to my master node, copy the files to a folder there, and then put them into Hadoop. I am looking for a shell command to get rid of copying the file to…
asked by reza

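The usual trick is to stream the bytes rather than stage them: hdfs dfs -put - <dest> reads from stdin, so the file can be piped over ssh. The same idea in Java, with a placeholder destination path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    FileSystem fs = FileSystem.get(new Configuration());
    // Copy stdin (e.g. the tail of an ssh pipe) straight into an HDFS file;
    // nothing is written to the local disk along the way.
    IOUtils.copyBytes(System.in, fs.create(new Path("/data/incoming.dat")), 4096, true);
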
36 votes, 5 answers

HDFS_NAMENODE_USER, HDFS_DATANODE_USER & HDFS_SECONDARYNAMENODE_USER not defined

I am new to Hadoop. I'm trying to install Hadoop on my laptop in pseudo-distributed mode. I am running it as the root user, but I'm getting the error below. root@debdutta-Lenovo-G50-80:~# $HADOOP_PREFIX/sbin/start-dfs.sh WARNING: HADOOP_PREFIX has…
asked by Sujata Roy

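Hadoop 3.x refuses to start daemons unless the operating users are declared. A sketch of the commonly cited fix, set in hadoop-env.sh or the shell before running start-dfs.sh; root mirrors the question, though a dedicated hdfs user is the safer choice:

    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
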
36 votes, 2 answers

hdfs dfs -mkdir, No such file or directory

Hi, I am new to Hadoop and trying to create a directory in HDFS called twitter_data. I have set up my VM on SoftLayer and installed and started Hadoop successfully. This is the command I am trying to run: hdfs dfs -mkdir…
asked by 2D_

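A common cause: a relative path such as twitter_data resolves under /user/<username>, which does not exist on a fresh install. hdfs dfs -mkdir -p creates the missing parents; the Java equivalent, with a placeholder user name, is sketched below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // mkdirs() creates missing parent directories, like `hdfs dfs -mkdir -p`.
    fs.mkdirs(new Path("/user/someuser/twitter_data")); // placeholder user
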
36 votes, 11 answers

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database

I am trying to run Spark SQL: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) But the error I'm getting is below: ... 125 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the…
asked by Amaresh

36 votes, 9 answers

apache spark - check if file exists

I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, which is a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing…
asked by Chandra

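A sketch of a common answer: reuse Spark's Hadoop configuration and ask HDFS directly. Here sc stands in for an existing JavaSparkContext and the path is a placeholder:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reuse the Hadoop configuration Spark is already carrying around.
    FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
    if (fs.exists(new Path("/pipeline/SUCCESS.txt"))) { // placeholder path
        // safe to start the second step
    }
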
36 votes, 6 answers

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once…
asked by KARASZI István

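One commonly suggested pattern, sketched with MultipleOutputs: keep a single map and reduce phase, but route each algorithm's result to its own named output (the names and types here are placeholders):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Driver: declare one named output per reduce algorithm.
    MultipleOutputs.addNamedOutput(job, "algoA", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "algoB", TextOutputFormat.class, Text.class, Text.class);

    // Reducer: write each algorithm's result to its own output, e.g.
    //   mos.write("algoA", key, resultA);
    //   mos.write("algoB", key, resultB);
    // where mos is a MultipleOutputs created in setup() and closed in cleanup().
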
36 votes, 7 answers

How to list only the file names in HDFS

I would like to know whether there is any command/expression to get only the file name in Hadoop. I need to fetch only the name of the file; when I do hadoop fs -ls it prints the whole path. I tried the below but am wondering if there is a better way to do it. hadoop…
asked by Navneet Kumar

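A sketch in Java: FileStatus.getPath().getName() yields the bare file name without the directory prefix (the directory is a placeholder); shell-side, piping hadoop fs -ls through awk is the usual workaround:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path("/data"))) { // placeholder dir
        System.out.println(status.getPath().getName()); // bare name, no path
    }
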
36 votes, 4 answers

Why does Hadoop need classes like Text or IntWritable instead of String or Integer?

Why does Hadoop need to introduce these new classes? They just seem to complicate the interface
asked by Casebash

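The short version: Writables are Hadoop's own serialization types, compact on the wire and mutable, so one instance can be reset and reused per record instead of allocating a new object the way immutable String/Integer would force. A sketch of the reuse pattern:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // One pair of objects, reused across millions of records.
    Text word = new Text();
    IntWritable one = new IntWritable(1);
    word.set("hadoop");           // mutate in place; no per-record allocation
    // context.write(word, one); // inside a Mapper
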
35 votes, 12 answers

Just enough Java for Hadoop

I have been a C++ developer for about 10 years. I need to pick up Java just for Hadoop. I doubt I will be doing anything else in Java. So, I would like a list of things I would need to pick up. Of course, I would need to learn the core language,…
asked by Nikhil

35 votes, 1 answer

Is there a hdfs command to list files in HDFS directory as per timestamp

Is there an hdfs command to list files in an HDFS directory by timestamp, ascending or descending? By default, the hdfs dfs -ls command gives an unsorted list of files. When I searched for answers, what I got was a workaround, i.e. hdfs dfs -ls /tmp | sort…
asked by PradeepKumbhar

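Recent releases accept hdfs dfs -ls -t to sort by modification time; where that flag is missing, the listing can be sorted through the Java API (the directory is a placeholder):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] files = fs.listStatus(new Path("/tmp"));
    // Ascending by modification time; reverse the comparator for descending.
    Arrays.sort(files, Comparator.comparingLong(FileStatus::getModificationTime));
    for (FileStatus f : files) {
        System.out.println(f.getModificationTime() + "\t" + f.getPath());
    }
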
35 votes, 12 answers

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application: could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this…
asked by DJ180

35 votes, 4 answers

Primary keys with Apache Spark

I have a JDBC connection between Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode, I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
asked by Nhor

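A frequently suggested approach, sketched in Java: have Spark generate unique (though not consecutive) ids with monotonically_increasing_id before writing; df stands in for an existing Dataset<Row>:

    import static org.apache.spark.sql.functions.monotonically_increasing_id;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Unique across the DataFrame but not gap-free: the value embeds the
    // partition id in its upper bits, so ids jump between partitions.
    Dataset<Row> withId = df.withColumn("id", monotonically_increasing_id());
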
35 votes, 2 answers

How can I force Spark to execute code?

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation? I have tried to put cache() with the map call but that still doesn't do the trick. My map method actually uploads results…
asked by MetallicPriest

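The gist of the usual answer: map is a lazy transformation and cache() is lazy too, so only an action triggers work. If the map exists for its side effects (such as an upload), foreach is the idiomatic action; rdd and upload() below are placeholders:

    import org.apache.spark.api.java.JavaRDD;

    // rdd stands in for an existing JavaRDD<String>. foreach is an action,
    // so the side-effecting work runs immediately on the executors.
    rdd.foreach(record -> upload(record)); // upload() is a placeholder method

    // Alternatively, keep the map and force it with a cheap action:
    //   rdd.map(record -> process(record)).count();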