Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and hands the work to DataNodes; a cluster typically has many DataNodes, which share the processing work between them. They can do this because they all have access to a shared file system, the Hadoop Distributed File System (HDFS).

[Diagram: a Hadoop cluster — NameNode, DataNodes, and HDFS]
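To make the read/write path concrete, here is a minimal client sketch against the org.apache.hadoop.fs API. The path and contents are made up for illustration, and the Configuration is assumed to pick up a valid fs.defaultFS from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // The client asks the NameNode where blocks live, then streams
    // block data directly to and from the DataNodes.
    Configuration conf = new Configuration(); // reads core-site.xml / fs.defaultFS
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.writeBytes("hello, hdfs\n");
    }
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}
```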

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
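The same FileSystem API abstracts over those backends, selected by URI scheme. A hedged sketch — hostname, port, and bucket are placeholders, and the s3a:// scheme additionally requires the hadoop-aws module and credentials on the classpath:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One API, different backing stores, chosen by URI scheme.
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    // s3a:// fails here unless hadoop-aws and AWS credentials are configured.
    FileSystem s3    = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    System.out.println(local.getUri() + " " + hdfs.getUri() + " " + s3.getUri());
  }
}
```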

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine (a minimal MapReduce sketch follows the list), such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it was developed to reduce the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
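As a concrete anchor for the MapReduce engine these modules run alongside, here is a minimal word-count job in the org.apache.hadoop.mapreduce API. Class names and paths are illustrative; on a YARN cluster the job's tasks run in containers negotiated by the resource manager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the line.
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) ctx.write(new Text(word), ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // safe: reducer is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, such a job would be submitted with hadoop jar wordcount.jar WordCount <input> <output>.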

Commercial support is available from a variety of companies.

44316 questions
55 votes, 22 answers

Namenode not getting started

I was using Hadoop in pseudo-distributed mode and everything was working fine. But then I had to restart my computer for some reason, and now when I try to start the NameNode and DataNode, only the DataNode is running. Could anyone…
— user886908
55 votes, 5 answers

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script, and without going into much trouble tried to put some print statements in my program. But I can't seem to find them in any of the logs.
— jason
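For readers landing on this question: System.out output from a task is captured, but into the per-task-attempt stdout log (visible through the JobTracker/ResourceManager web UI, not the client console). A hedged sketch of the usual alternative, logging through the log4j logger Hadoop already configures (class names here are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class DebugMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  // Messages logged here land in the task attempt's syslog file,
  // browsable per task in the web UI's log links.
  private static final Logger LOG = Logger.getLogger(DebugMapper.class);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    LOG.info("processing offset " + key + ": " + value);
    // System.out.println(...) would go to the task's stdout file instead.
    context.write(new Text("lines"), new LongWritable(1));
  }
}
```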
55 votes, 4 answers

Where are logs in Spark on YARN?

I'm new to Spark. I can now run Spark 0.9.1 on YARN (2.0.0-cdh4.2.1), but there is no log after execution. The following command is used to run a Spark example, but logs are not found in the history server as with a normal MapReduce…
— DeepNightTwo
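A commonly cited answer, assuming log aggregation is enabled (yarn.log-aggregation-enable=true) and the application has finished; the application ID below is a placeholder:

```sh
# Fetch the aggregated container logs for a finished YARN application.
yarn logs -applicationId application_1400000000000_0001
```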
55 votes, 8 answers

How to overwrite the existing files using hadoop fs -copyToLocal command

Is there any way to overwrite existing files while copying from HDFS using: hadoop fs -copyToLocal
— hjamali52
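Newer shells accept a -f (force) flag on -get/-copyToLocal; where that is unavailable, the Java FileSystem API can overwrite explicitly. A hedged sketch using FileUtil.copy with overwrite=true (paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyToLocalOverwrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs  = FileSystem.get(conf);       // source: HDFS
    FileSystem local = FileSystem.getLocal(conf);  // destination: local FS

    // overwrite=true replaces an existing local file instead of failing.
    FileUtil.copy(hdfs, new Path("/user/me/data.txt"),
                  local, new Path("/tmp/data.txt"),
                  false /* deleteSource */, true /* overwrite */, conf);
  }
}
```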
54 votes, 12 answers

Hadoop: «ERROR : JAVA_HOME is not set»

I'm trying to install Hadoop on Ubuntu 11.10. I set the JAVA_HOME variable in the file conf/hadoop-env.sh to: # export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk and then I execute these commands (Standalone Operation): $ mkdir input $ cp…
— koukou
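The excerpt suggests the likely culprit: the line in conf/hadoop-env.sh still begins with #, so it is still a comment. A minimal sketch of the fix, using the JDK path from the question:

```sh
# conf/hadoop-env.sh — the leading '#' must be removed for the
# variable to take effect.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk
```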
54 votes, 18 answers

How to Access Hive via Python?

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated. When I add this to /etc/profile: export PYTHONPATH=$PYTHONPATH:/usr/lib/hive/lib/py I can then do the imports as listed in the link, with the…
— Matthew Moisen
52 votes, 4 answers

hdfs dfs -put with overwrite?

I am using hdfs dfs -put myfile mypath and for some files I get put: 'myfile': File Exists. Does that mean there is a file with the same name, or that the same exact file (size, content) is already there? How can I specify an -overwrite…
— ℕʘʘḆḽḘ
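The File Exists error refers only to the destination name, not its content. A hedged example using the -f flag that Hadoop 2.x shells accept on -put (the file and path names are the question's own):

```sh
# -f overwrites the destination if it already exists.
hdfs dfs -put -f myfile mypath
```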
52 votes, 8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet; but it doesn't uncompress the file since it doesn't recognise the…
— Super_John
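The gunzip failure is expected: Parquet compresses individual data pages internally, and the schema lives in the file footer. A hedged sketch that prints the schema with a recent parquet-mr (org.apache.parquet) API; the parquet-tools jar's schema command is the usual CLI equivalent:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("part-m-00000.gz.parquet"); // file from the question
    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      // The schema is stored in the footer metadata, which is why
      // running gunzip on the whole file cannot recover it.
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      System.out.println(schema);
    }
  }
}
```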
52 votes, 7 answers

How to export data from Spark SQL to CSV

This command works with HiveQL: insert overwrite directory '/data/home.csv' select * from testtable; But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace: java.lang.RuntimeException: Unsupported language…
— shashankS
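With a modern Spark (2.x+) the DataFrame writer can emit CSV directly; the question predates this API, so the following is a hedged sketch of the current approach, reusing the question's table and output path:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExportToCsv {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ExportToCsv")
        .enableHiveSupport()  // so 'testtable' resolves via the Hive metastore
        .getOrCreate();

    Dataset<Row> df = spark.sql("SELECT * FROM testtable");
    // Writes a directory of part files in CSV format, like the
    // HiveQL "insert overwrite directory" the question started from.
    df.write().option("header", "true").csv("/data/home.csv");
    spark.stop();
  }
}
```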
52 votes, 2 answers

What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?

I would like to know the relation between the mapreduce.map.memory.mb and mapred.map.child.java.opts parameters. Is mapreduce.map.memory.mb > mapred.map.child.java.opts?
— yedapoda
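In short: mapreduce.map.memory.mb is the YARN container size for a map task, while the java.opts value carries the JVM flags (heap) of the process inside that container, so the heap must be strictly smaller — roughly 80% of the container is a common rule of thumb. A hedged sketch using the current property names (mapred.map.child.java.opts is the deprecated pre-YARN name for mapreduce.map.java.opts); the sizes are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class MapMemorySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Container limit enforced by YARN: the NodeManager kills the task
    // if its total physical memory use exceeds this value.
    conf.set("mapreduce.map.memory.mb", "2048");
    // JVM flags for the map task; the heap must fit inside the container
    // with headroom for non-heap memory (2048 * 0.8 ≈ 1638).
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
  }
}
```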
52 votes, 6 answers

Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so: CREATE TABLE mytable ( num1 INT, text1 STRING, num2 INT, text2 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable; …
— Martijn Lenderink
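FIELDS TERMINATED BY ',' splits on every comma, quoted or not. A commonly cited fix is the OpenCSVSerde that ships with Hive 0.14+; a hedged sketch reusing the question's table — note that this SerDe reads every column as STRING:

```sql
CREATE TABLE mytable (
  num1 STRING, text1 STRING, num2 STRING, text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
);

LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable;
```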
51 votes, 7 answers

java.net.URISyntaxException when starting HIVE

I am new to HIVE. I have already set up Hadoop and it works well, and I want to set up Hive. When I start Hive, it shows an error: Caused by: java.net.URISyntaxException: Relative path in absolute URI:…
— Exia
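If the relative path in the truncated message is the frequently reported ${system:java.io.tmpdir}/${system:user.name} one, a widely cited workaround is to pin those variables to absolute values in hive-site.xml. This is an assumption about the elided error, not a universal fix; the directory is a placeholder:

```xml
<!-- hive-site.xml: give the scratch-dir variables absolute values -->
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
```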
51 votes, 8 answers

data block size in HDFS, why 64MB?

The default data block size of HDFS/Hadoop is 64MB, while the block size on disk is generally 4KB. What does a 64MB block size mean? Does it mean that the smallest unit of reading from disk is 64MB? If yes, what is the advantage of doing that? -> easy…
— dykw
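The block size is per-file metadata describing how HDFS splits and places the file across DataNodes; it does not change the disk's own 4KB reads. Large blocks amortize seek time over long sequential scans and keep the NameNode's per-block bookkeeping small. A hedged sketch showing that the block size can even be chosen per file at create time (path and sizes are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Chosen per file at create time; controls how the file is split
    // across DataNodes, not how the local disk performs reads.
    long blockSize = 128L * 1024 * 1024; // 128 MB for this one file
    try (FSDataOutputStream out = fs.create(
            new Path("/tmp/bigfile.bin"), // illustrative path
            true,        // overwrite
            4096,        // io buffer size
            (short) 3,   // replication factor
            blockSize)) {
      out.writeUTF("hello");
    }
  }
}
```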
51 votes, 3 answers

Java vs Python on Hadoop

I am working on a project using Hadoop, and it seems to natively incorporate Java and provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go…
— jnoss
50 votes, 4 answers

Thrift, Avro, Protocolbuffers - Are they all dead?

Working on a pet project (Cassandra, Spark, Hadoop, Kafka) I need a data serialization framework. Checking out the common three frameworks - namely Thrift, Avro and Protocolbuffers - I noticed most of them seem to be dead-alive, having 2 minor…
— dominik