Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the MapReduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

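To make the MapReduce model concrete, here is a minimal word-count sketch written in Scala against the standard org.apache.hadoop.mapreduce API. This is an illustrative sketch, not the project's own example: the class names and command-line paths are placeholders, and the Hadoop client libraries are assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map phase: emit (word, 1) for every whitespace-separated token.
    class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
          word.set(t)
          ctx.write(word, one)
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it  = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenizerMapper])
        job.setMapperClass(classOf[TokenizerMapper])
        job.setCombinerClass(classOf[IntSumReducer]) // optional local pre-aggregation
        job.setReducerClass(classOf[IntSumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args(1))) // must not exist yet
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }

Packaged into a jar, this would be submitted with something like hadoop jar wordcount.jar WordCount <in> <out>; the framework handles splitting the input, scheduling map and reduce tasks across the cluster, and regrouping the intermediate pairs by key.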
A cluster has a NameNode, and usually a second one for redundancy. The NameNode accepts requests coming in from client applications and keeps the filesystem metadata: which files exist and which blocks make them up. The actual blocks are stored on DataNodes, of which there are typically many, and the processing work is shared out across them. Together these nodes implement what is typically referred to as the Hadoop Distributed File System, or HDFS.
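
As an illustration of that split between metadata and data, the sketch below (Scala; the path /data/events.log is hypothetical, and a reachable cluster configuration is assumed) asks the NameNode for a file's block layout and prints which DataNodes host each block. Reads and writes of the block contents themselves then go directly to those DataNodes.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ShowBlockLocations {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs   = FileSystem.get(conf)
        val path = new Path("/data/events.log") // hypothetical file

        // The file status and block locations are metadata served by the NameNode.
        val status = fs.getFileStatus(path)
        fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
          println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(", ")}")
        }
        fs.close()
      }
    }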

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
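
Because the backing store is selected by the URI scheme, the same FileSystem client API can be pointed at these alternatives without code changes. A small sketch (the bucket name is made up; s3a additionally needs the hadoop-aws module and credentials configured):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object SchemeDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()

        // Same client API; the URI scheme picks the implementation.
        val local = FileSystem.get(new URI("file:///"), conf)
        local.listStatus(new Path("/tmp")).foreach(s => println(s.getPath))

        val s3 = FileSystem.get(new URI("s3a://my-bucket/"), conf) // hypothetical bucket
        s3.listStatus(new Path("s3a://my-bucket/logs")).foreach(s => println(s.getPath))
      }
    }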

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to run other modules and processing engines alongside MapReduce, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and for viewing MapReduce, Pig and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it reduces the boilerplate code that MapReduce programmers would otherwise have to write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44,316 questions

42 votes, 5 answers
Cascading examples failed to compile?
In shell I typed gradle cleanJar in the Impatient/part1 directory. The output is below. The error is "class file for org.apache.hadoop.mapred.JobConf not found". Why did it fail to compile? :clean UP-TO-DATE :compileJava Download…
asked by Treper

41 votes, 15 answers
Setting the number of map tasks and reduce tasks
I am currently running a job where I fixed the number of map tasks to 20 but am getting a higher number. I also set the number of reduce tasks to zero but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not…
asked by asembereng

41 votes, 1 answer
Difference between `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`?
What is the difference between yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb? I see both of these in yarn-site.xml and I see the explanations here. yarn.scheduler.maximum-allocation-mb is given the following definition:…
asked by makansij

41 votes, 2 answers
How to get started with Big Data Analysis
I've been a long-time user of R and have recently started working with Python. Using conventional RDBMS systems for data warehousing, and R/Python for number-crunching, I feel the need now to get my hands dirty with Big Data Analysis. I'd like to…
asked by harshsinghal

41 votes, 1 answer
Easiest way to install Python dependencies on Spark executor nodes?
I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)? Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or…

41 votes, 10 answers
How does Hadoop perform input splits?
This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of the form <k, v>, where k is the offset of the line from the beginning and…
asked by Deepak

41 votes, 4 answers
Free Large datasets to experiment with Hadoop
Do you know of any large dataset to experiment with Hadoop that is free/low-cost? Any related pointers/links are appreciated. Preference: at least one GB of data; production log data of a webserver. A few that I have found so far: Wikipedia…
asked by Sundar

40 votes, 9 answers
Spark Scala list folders in directory
I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this by using the command: hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/ I tried it with: val conf = new Configuration() val fs = FileSystem.get(new…
asked by AlexL

40 votes, 12 answers
Why spark-shell fails with NullPointerException?
I try to execute spark-shell on Windows 10, but I keep getting this error every time I run it. I used both latest and spark-1.5.0-bin-hadoop2.4 versions. 15/09/22 18:46:24 WARN Connection: BoneCP specified but not present in CLASSPATH (or one…
asked by Nick

39 votes, 1 answer
HBase REST Filter (SingleColumnValueFilter)
I cannot figure out how to use filters in the HBase REST interface (HBase 0.90.4-cdh3u3). The documentation just gives me a schema definition for a "string", but doesn't show how to use it. So, I'm able to do this: curl -v -H 'Content-Type:…
asked by Mario

39 votes, 1 answer
How to choose between Cassandra, Membase, Hadoop, MongoDB, RDBMS etc.?
Is there a paper/blog-post on when to use Cassandra or Membase or Hadoop or plain old relational databases? Is there a paper discussing the strengths/weaknesses of each, and on what scenarios either of these technologies should be chosen? I am…
asked by Sankar

39 votes, 3 answers
What is the relationship between Spark, Hadoop and Cassandra
My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship. Secondly, Spark…
asked by Shahbaz

39 votes, 8 answers
Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?
I am trying to set up a single-node Hadoop 2.6.0 cluster on my PC. On visiting http://localhost:8088/cluster, I find that my node is listed as an "unhealthy node". In the health report, it provides the error: 1/1 local-dirs are bad:…
asked by Ra41P

39 votes, 5 answers
List the namenode and datanodes of a cluster from any node?
From any node in a Hadoop cluster, what is the command to identify the running namenode, and to identify all running datanodes? I have looked through the commands manual and have not found this.
asked by T. Webster

39 votes, 7 answers
How do you make a HIVE table out of JSON data?
I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the hive console to get…
asked by nickponline