Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and farms the work out to DataNodes; a typical cluster has many DataNodes that share the processing load between them. They can do this because they all have access to a shared file system, referred to as the Hadoop Distributed File System (HDFS).

(Diagram: description of a Hadoop cluster)
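
A minimal client sketch in Java may make this concrete. The NameNode address and file path below are placeholders; the FileSystem API hides which DataNodes actually serve the blocks:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address: point fs.defaultFS at your NameNode.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);
            // The NameNode resolves the path to block locations; the bytes
            // themselves stream from whichever DataNodes hold the blocks.
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }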

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.

Since version 0.23, Hadoop includes a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo: a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, with features to diagnose their performance in a user-friendly manner.
  • Avro: a data serialization system based on JSON schemas.
  • Cassandra: a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa: a data collection system for managing large distributed systems.
  • Cascading: a software abstraction layer for Apache Hadoop, aimed mainly at Java developers and designed to spare MapReduce programmers the boilerplate code that jobs otherwise require.
  • Flink: a fast and reliable large-scale data processing engine.
  • Giraph: an iterative graph processing framework built on top of Apache Hadoop.
  • HBase: a scalable, distributed database that supports structured data storage for large tables.
  • Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout: a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig: a platform and programming language for authoring parallelizable jobs.
  • Spark: a fast and general engine for large-scale data processing.
  • Storm: a system for real-time and stream processing.
  • Tez: an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper: a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
37 votes, 8 answers

Hadoop DistributedCache is deprecated - what is the preferred API?

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows: // In the driver JobConf conf = new…
asked by DNA

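For reference, a hedged sketch of the replacement API on Hadoop 2.x+: Job.addCacheFile on the driver and Context.getCacheFiles in the task (the cached path is a placeholder):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Driver side: Job.addCacheFile replaces DistributedCache.addCacheFile.
    Job job = Job.getInstance(new Configuration(), "example");
    job.addCacheFile(new URI("/config/lookup.dat")); // placeholder path

    // Mapper side, e.g. in setup(): the URIs registered above come back
    // from the task context:
    //   URI[] cached = context.getCacheFiles();
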
37 votes, 7 answers

How to find the size of a HDFS file

How do I find the size of an HDFS file? What command should be used to find the size of any file in HDFS?
asked by priya

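One way to read the size programmatically, sketched against the Java FileSystem API (the path is a placeholder); on the command line, hadoop fs -du -h <path> reports the same information:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // getLen() returns the file length in bytes.
    long bytes = fs.getFileStatus(new Path("/data/input.txt")).getLen(); // placeholder path
    System.out.println("size in bytes: " + bytes);
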
37 votes, 5 answers

putting a remote file into hadoop without copying it to local disk

I am writing a shell script to put data into Hadoop as soon as it is generated. I can ssh to my master node, copy the files to a folder there, and then put them into Hadoop. I am looking for a shell command to get rid of copying the file to…
asked by reza

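The usual trick is to stream the bytes rather than stage them: hdfs dfs -put - <dest> reads from stdin, so the file can be piped over ssh. The same idea in Java, with a placeholder destination path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    FileSystem fs = FileSystem.get(new Configuration());
    // Copy stdin (e.g. the tail of an ssh pipe) straight into an HDFS file;
    // nothing is written to the local disk along the way.
    IOUtils.copyBytes(System.in, fs.create(new Path("/data/incoming.dat")), 4096, true);
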
36 votes, 5 answers

HDFS_NAMENODE_USER, HDFS_DATANODE_USER & HDFS_SECONDARYNAMENODE_USER not defined

I am new to Hadoop. I'm trying to install Hadoop on my laptop in pseudo-distributed mode. I am running it as the root user, but I'm getting the error below. root@debdutta-Lenovo-G50-80:~# $HADOOP_PREFIX/sbin/start-dfs.sh WARNING: HADOOP_PREFIX has…
asked by Sujata Roy

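Hadoop 3.x refuses to start daemons unless the operating users are declared. A sketch of the commonly cited fix, set in hadoop-env.sh or the shell before running start-dfs.sh; root mirrors the question, though a dedicated hdfs user is the safer choice:

    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
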
36 votes, 2 answers

hdfs dfs -mkdir, No such file or directory

Hi, I am new to Hadoop and trying to create a directory in HDFS called twitter_data. I have set up my VM on SoftLayer and installed and started Hadoop successfully. This is the command I am trying to run: hdfs dfs -mkdir…
asked by 2D_

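A common cause: a relative path such as twitter_data resolves under /user/<username>, which does not exist on a fresh install. hdfs dfs -mkdir -p creates the missing parents; the Java equivalent, with a placeholder user name, is sketched below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    // mkdirs() creates missing parent directories, like `hdfs dfs -mkdir -p`.
    fs.mkdirs(new Path("/user/someuser/twitter_data")); // placeholder user
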
36 votes, 11 answers

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database

I am trying to run Spark SQL: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) But the error I'm getting is below: ... 125 more Caused by: java.sql.SQLException: Another instance of Derby may have already booted the…
asked by Amaresh

36 votes, 9 answers

apache spark - check if file exists

I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, which is a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing…
asked by Chandra

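A sketch of a common answer: reuse Spark's Hadoop configuration and ask HDFS directly. Here sc stands in for an existing JavaSparkContext and the path is a placeholder:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reuse the Hadoop configuration Spark is already carrying around.
    FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
    if (fs.exists(new Path("/pipeline/SUCCESS.txt"))) { // placeholder path
        // safe to start the second step
    }
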
36 votes, 6 answers

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once…
asked by KARASZI István

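One commonly suggested pattern, sketched with MultipleOutputs: keep a single map and reduce phase, but route each algorithm's result to its own named output (the names and types here are placeholders):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Driver: declare one named output per reduce algorithm.
    MultipleOutputs.addNamedOutput(job, "algoA", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "algoB", TextOutputFormat.class, Text.class, Text.class);

    // Reducer: write each algorithm's result to its own output, e.g.
    //   mos.write("algoA", key, resultA);
    //   mos.write("algoB", key, resultB);
    // where mos is a MultipleOutputs created in setup() and closed in cleanup().
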
36 votes, 7 answers

How to list only the file names in HDFS

I would like to know whether there is any command/expression to get only the file name in Hadoop. I need to fetch only the name of the file; when I do hadoop fs -ls it prints the whole path. I tried the below but am wondering if there is a better way to do it. hadoop…
asked by Navneet Kumar

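A sketch in Java: FileStatus.getPath().getName() yields the bare file name without the directory prefix (the directory is a placeholder); shell-side, piping hadoop fs -ls through awk is the usual workaround:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path("/data"))) { // placeholder dir
        System.out.println(status.getPath().getName()); // bare name, no path
    }
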
36 votes, 4 answers

Why does Hadoop need classes like Text or IntWritable instead of String or Integer?

Why does Hadoop need to introduce these new classes? They just seem to complicate the interface
asked by Casebash

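The short version: Writables are Hadoop's own serialization types, compact on the wire and mutable, so one instance can be reset and reused per record instead of allocating a new object the way immutable String/Integer would force. A sketch of the reuse pattern:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    // One pair of objects, reused across millions of records.
    Text word = new Text();
    IntWritable one = new IntWritable(1);
    word.set("hadoop");           // mutate in place; no per-record allocation
    // context.write(word, one); // inside a Mapper
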
35 votes, 12 answers

Just enough Java for Hadoop

I have been a C++ developer for about 10 years. I need to pick up Java just for Hadoop. I doubt I will be doing anything else in Java. So, I would like a list of things I would need to pick up. Of course, I would need to learn the core language,…
asked by Nikhil

35 votes, 1 answer

Is there a hdfs command to list files in HDFS directory as per timestamp

Is there an hdfs command to list files in an HDFS directory by timestamp, ascending or descending? By default, the hdfs dfs -ls command gives an unsorted list of files. When I searched for answers, what I got was a workaround, i.e. hdfs dfs -ls /tmp | sort…
asked by PradeepKumbhar

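Recent releases accept hdfs dfs -ls -t to sort by modification time; where that flag is missing, the listing can be sorted through the Java API (the directory is a placeholder):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] files = fs.listStatus(new Path("/tmp"));
    // Ascending by modification time; reverse the comparator for descending.
    Arrays.sort(files, Comparator.comparingLong(FileStatus::getModificationTime));
    for (FileStatus f : files) {
        System.out.println(f.getModificationTime() + "\t" + f.getPath());
    }
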
35 votes, 12 answers

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application: could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this…
asked by DJ180

35 votes, 4 answers

Primary keys with Apache Spark

I have a JDBC connection between Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode, I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
asked by Nhor

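A frequently suggested approach, sketched in Java: have Spark generate unique (though not consecutive) ids with monotonically_increasing_id before writing; df stands in for an existing Dataset<Row>:

    import static org.apache.spark.sql.functions.monotonically_increasing_id;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Unique across the DataFrame but not gap-free: the value embeds the
    // partition id in its upper bits, so ids jump between partitions.
    Dataset<Row> withId = df.withColumn("id", monotonically_increasing_id());
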
35 votes, 2 answers

How can I force Spark to execute code?

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation? I have tried to put cache() with the map call but that still doesn't do the trick. My map method actually uploads results…
asked by MetallicPriest

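The gist of the usual answer: map is a lazy transformation and cache() is lazy too, so only an action triggers work. If the map exists for its side effects (such as an upload), foreach is the idiomatic action; rdd and upload() below are placeholders:

    import org.apache.spark.api.java.JavaRDD;

    // rdd stands in for an existing JavaRDD<String>. foreach is an action,
    // so the side-effecting work runs immediately on the executors.
    rdd.foreach(record -> upload(record)); // upload() is a placeholder method

    // Alternatively, keep the map and force it with a cheap action:
    //   rdd.map(record -> process(record)).count();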