Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and hands the work to DataNodes; a cluster typically has many DataNodes, which share the processing work between them. They can do this because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).

[Diagram: a Hadoop cluster — NameNode, DataNodes, and HDFS]
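To make the read/write path concrete, here is a minimal client sketch against the org.apache.hadoop.fs API. The path and contents are made up for illustration, and the Configuration is assumed to pick up a valid fs.defaultFS from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // The client asks the NameNode where blocks live, then streams
    // block data directly to and from the DataNodes.
    Configuration conf = new Configuration(); // reads core-site.xml / fs.defaultFS
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.writeBytes("hello, hdfs\n");
    }
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}
```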

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
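The same FileSystem API abstracts over those backends, selected by URI scheme. A hedged sketch — hostname, port, and bucket are placeholders, and the s3a:// scheme additionally requires the hadoop-aws module and credentials on the classpath:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One API, different backing stores, chosen by URI scheme.
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    // s3a:// fails here unless hadoop-aws and AWS credentials are configured.
    FileSystem s3    = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    System.out.println(local.getUri() + " " + hdfs.getUri() + " " + s3.getUri());
  }
}
```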

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine (a minimal MapReduce sketch follows the list), such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it was developed to reduce the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
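As a concrete anchor for the MapReduce engine these modules run alongside, here is a minimal word-count job in the org.apache.hadoop.mapreduce API. Class names and paths are illustrative; on a YARN cluster the job's tasks run in containers negotiated by the resource manager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the line.
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) ctx.write(new Text(word), ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // safe: reducer is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, such a job would be submitted with hadoop jar wordcount.jar WordCount <input> <output>.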

Commercial support is available from a variety of companies.

44316 questions
55 votes, 22 answers

Namenode not getting started

I was using Hadoop in pseudo-distributed mode and everything was working fine. But then I had to restart my computer for some reason, and now when I try to start the NameNode and DataNode, only the DataNode is running. Could anyone…
— user886908
55 votes, 5 answers

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script, and without going into much trouble tried to put some print statements in my program. But I can't seem to find them in any of the logs.
— jason
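For readers landing on this question: System.out output from a task is captured, but into the per-task-attempt stdout log (visible through the JobTracker/ResourceManager web UI, not the client console). A hedged sketch of the usual alternative, logging through the log4j logger Hadoop already configures (class names here are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class DebugMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  // Messages logged here land in the task attempt's syslog file,
  // browsable per task in the web UI's log links.
  private static final Logger LOG = Logger.getLogger(DebugMapper.class);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    LOG.info("processing offset " + key + ": " + value);
    // System.out.println(...) would go to the task's stdout file instead.
    context.write(new Text("lines"), new LongWritable(1));
  }
}
```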
55 votes, 4 answers

Where are logs in Spark on YARN?

I'm new to Spark. I can now run Spark 0.9.1 on YARN (2.0.0-cdh4.2.1), but there is no log after execution. The following command is used to run a Spark example, but logs are not found in the history server as with a normal MapReduce…
— DeepNightTwo
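A commonly cited answer, assuming log aggregation is enabled (yarn.log-aggregation-enable=true) and the application has finished; the application ID below is a placeholder:

```sh
# Fetch the aggregated container logs for a finished YARN application.
yarn logs -applicationId application_1400000000000_0001
```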
55 votes, 8 answers

How to overwrite the existing files using hadoop fs -copyToLocal command

Is there any way to overwrite existing files while copying from HDFS using: hadoop fs -copyToLocal
— hjamali52
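Newer shells accept a -f (force) flag on -get/-copyToLocal; where that is unavailable, the Java FileSystem API can overwrite explicitly. A hedged sketch using FileUtil.copy with overwrite=true (paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyToLocalOverwrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs  = FileSystem.get(conf);       // source: HDFS
    FileSystem local = FileSystem.getLocal(conf);  // destination: local FS

    // overwrite=true replaces an existing local file instead of failing.
    FileUtil.copy(hdfs, new Path("/user/me/data.txt"),
                  local, new Path("/tmp/data.txt"),
                  false /* deleteSource */, true /* overwrite */, conf);
  }
}
```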
54 votes, 12 answers

Hadoop: «ERROR : JAVA_HOME is not set»

I'm trying to install Hadoop on Ubuntu 11.10. I set the JAVA_HOME variable in the file conf/hadoop-env.sh to: # export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk and then I execute these commands (Standalone Operation): $ mkdir input $ cp…
— koukou
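The excerpt suggests the likely culprit: the line in conf/hadoop-env.sh still begins with #, so it is still a comment. A minimal sketch of the fix, using the JDK path from the question:

```sh
# conf/hadoop-env.sh — the leading '#' must be removed for the
# variable to take effect.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk
```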
54 votes, 18 answers

How to Access Hive via Python?

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated. When I add this to /etc/profile: export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py I can then do the imports as listed in the link, with the…
— Matthew Moisen
52 votes, 4 answers

hdfs dfs -put with overwrite?

I am using hdfs dfs -put myfile mypath and for some files I get put: 'myfile': File Exists. Does that mean there is a file with the same name, or that the same exact file (size, content) is already there? How can I specify an -overwrite…
— ℕʘʘḆḽḘ
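The File Exists error refers only to the destination name, not its content. A hedged example using the -f flag that Hadoop 2.x shells accept on -put (the file and path names are the question's own):

```sh
# -f overwrites the destination if it already exists.
hdfs dfs -put -f myfile mypath
```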
52 votes, 8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet; but it doesn't uncompress the file since it doesn't recognise the…
— Super_John
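The gunzip failure is expected: Parquet compresses individual data pages internally, and the schema lives in the file footer. A hedged sketch that prints the schema with a recent parquet-mr (org.apache.parquet) API; the parquet-tools jar's schema command is the usual CLI equivalent:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("part-m-00000.gz.parquet"); // file from the question
    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      // The schema is stored in the footer metadata, which is why
      // running gunzip on the whole file cannot recover it.
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      System.out.println(schema);
    }
  }
}
```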
52 votes, 7 answers

How to export data from Spark SQL to CSV

This command works with HiveQL: insert overwrite directory '/data/home.csv' select * from testtable; But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace: java.lang.RuntimeException: Unsupported language…
— shashankS
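With a modern Spark (2.x+) the DataFrame writer can emit CSV directly; the question predates this API, so the following is a hedged sketch of the current approach, reusing the question's table and output path:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExportToCsv {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ExportToCsv")
        .enableHiveSupport()  // so 'testtable' resolves via the Hive metastore
        .getOrCreate();

    Dataset<Row> df = spark.sql("SELECT * FROM testtable");
    // Writes a directory of part files in CSV format, like the
    // HiveQL "insert overwrite directory" the question started from.
    df.write().option("header", "true").csv("/data/home.csv");
    spark.stop();
  }
}
```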
52 votes, 2 answers

What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?

I would like to know the relation between the mapreduce.map.memory.mb and mapred.map.child.java.opts parameters. Is mapreduce.map.memory.mb > mapred.map.child.java.opts?
— yedapoda
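In short: mapreduce.map.memory.mb is the YARN container size for a map task, while the java.opts value carries the JVM flags (heap) of the process inside that container, so the heap must be strictly smaller — roughly 80% of the container is a common rule of thumb. A hedged sketch using the current property names (mapred.map.child.java.opts is the deprecated pre-YARN name for mapreduce.map.java.opts); the sizes are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class MapMemorySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Container limit enforced by YARN: the NodeManager kills the task
    // if its total physical memory use exceeds this value.
    conf.set("mapreduce.map.memory.mb", "2048");
    // JVM flags for the map task; the heap must fit inside the container
    // with headroom for non-heap memory (2048 * 0.8 ≈ 1638).
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
  }
}
```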
52 votes, 6 answers

Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so: CREATE TABLE mytable ( num1 INT, text1 STRING, num2 INT, text2 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable; …
— Martijn Lenderink
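FIELDS TERMINATED BY ',' splits on every comma, quoted or not. A commonly cited fix is the OpenCSVSerde that ships with Hive 0.14+; a hedged sketch reusing the question's table — note that this SerDe reads every column as STRING:

```sql
CREATE TABLE mytable (
  num1 STRING, text1 STRING, num2 STRING, text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
);

LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable;
```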
51 votes, 7 answers

java.net.URISyntaxException when starting HIVE

I am new to HIVE. I have already set up Hadoop and it works well, and I want to set up Hive. When I start Hive, it shows an error: Caused by: java.net.URISyntaxException: Relative path in absolute URI:…
— Exia
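If the relative path in the truncated message is the frequently reported ${system:java.io.tmpdir}/${system:user.name} one, a widely cited workaround is to pin those variables to absolute values in hive-site.xml. This is an assumption about the elided error, not a universal fix; the directory is a placeholder:

```xml
<!-- hive-site.xml: give the scratch-dir variables absolute values -->
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
```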
51 votes, 8 answers

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on disk is generally 4KB. What does a 64MB block size mean? Does it mean that the smallest unit of reading from disk is 64MB? If yes, what is the advantage of doing that? -> easy…
— dykw
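The block size is per-file metadata describing how HDFS splits and places the file across DataNodes; it does not change the disk's own 4KB reads. Large blocks amortize seek time over long sequential scans and keep the NameNode's per-block bookkeeping small. A hedged sketch showing that the block size can even be chosen per file at create time (path and sizes are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Chosen per file at create time; controls how the file is split
    // across DataNodes, not how the local disk performs reads.
    long blockSize = 128L * 1024 * 1024; // 128 MB for this one file
    try (FSDataOutputStream out = fs.create(
            new Path("/tmp/bigfile.bin"), // illustrative path
            true,        // overwrite
            4096,        // io buffer size
            (short) 3,   // replication factor
            blockSize)) {
      out.writeUTF("hello");
    }
  }
}
```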
51 votes, 3 answers

Java vs Python on Hadoop

I am working on a project using Hadoop, and it seems to natively incorporate Java and provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go…
— jnoss
50 votes, 4 answers

Thrift, Avro, Protocolbuffers - Are they all dead?

Working on a pet project (Cassandra, Spark, Hadoop, Kafka) I need a data serialization framework. Checking out the common three frameworks - namely Thrift, Avro and Protocolbuffers - I noticed most of them seem to be dead-alive, having 2 minor…
— dominik