Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.


"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A Hadoop cluster has at least one NameNode, and usually more for redundancy. The NameNode accepts requests from client applications and distributes the processing work across many DataNodes. The DataNodes can share that work because they all have access to a common file system, the Hadoop Distributed File System (HDFS).

(Diagram: description of a Hadoop cluster)

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blob stores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
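Hadoop picks the filesystem implementation from the scheme of the path URI (configured via `fs.<scheme>.impl` entries; an empty scheme falls back to `fs.defaultFS`). A plain-Python sketch of that dispatch, with an illustrative mapping only:

```python
from urllib.parse import urlparse

# Illustrative mapping only -- the real table lives in Hadoop's
# core-default.xml / core-site.xml (fs.<scheme>.impl entries).
SCHEME_TO_FS = {
    "hdfs": "Hadoop Distributed File System",
    "s3a":  "Amazon S3 (S3A connector)",
    "wasb": "Azure Blob storage",
    "file": "platform-specific local filesystem",
}

def filesystem_for(uri, default_scheme="file"):
    # Hadoop resolves the scheme of a path URI to pick the filesystem
    # implementation; a scheme-less path falls back to the default.
    scheme = urlparse(uri).scheme or default_scheme
    return SCHEME_TO_FS.get(scheme, "unknown scheme: " + scheme)

print(filesystem_for("hdfs://namenode:8020/user/data"))  # → Hadoop Distributed File System
print(filesystem_for("s3a://bucket/logs/2016/"))         # → Amazon S3 (S3A connector)
print(filesystem_for("/tmp/local.txt"))                  # → platform-specific local filesystem
```

This is why the same `FileSystem` API call in a Hadoop program can transparently read from HDFS, S3, or local disk depending only on the path it is given.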

Since version 0.23, Hadoop has shipped with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and lets you view MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers, built to spare MapReduce programmers the effort of writing boilerplate code.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Commercial support is available from a variety of companies.

44316 questions
8 votes, 2 answers

Different ways to import files into HDFS

I want to know what are the different ways through which I can bring data into HDFS. I am a newbie to Hadoop and was a Java web developer until now. I want to know, if I have a web application that is creating log files, how can I import the log…
Gaurav (81)
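For one-off loads, the usual route is the `hadoop fs -put` (or `-copyFromLocal`) shell command; for continuous log ingestion, tools such as Flume are commonly used. A minimal sketch of driving `-put` from Python; the paths are made up for the example, and actually running the command requires a configured Hadoop client:

```python
def hdfs_put_command(local_paths, hdfs_dir):
    # Build the `hadoop fs -put` invocation that copies local files
    # (e.g. rotated web-app log files) into an HDFS directory.
    return ["hadoop", "fs", "-put"] + list(local_paths) + [hdfs_dir]

# Hypothetical paths, chosen for the example:
cmd = hdfs_put_command(["/var/log/app/access.log"], "/user/gaurav/logs/")
print(" ".join(cmd))
# → hadoop fs -put /var/log/app/access.log /user/gaurav/logs/

# On a machine with a Hadoop client on the PATH, this would run it:
# import subprocess; subprocess.run(cmd, check=True)
```

A cron job around such a command is often enough for periodic log import; higher-volume pipelines usually graduate to Flume or a similar collector.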
8 votes, 4 answers

Hive Internal Error: java.lang.ClassNotFoundException(org.apache.atlas.hive.hook.HiveHook)

I am running a Hive query through Oozie using Hue. I am creating a table through a Hue-Oozie workflow. My job is failing, but when I check in Hive, the table is created. The log shows the error below: 16157 [main] INFO …
Amaresh (3,231)
8 votes, 2 answers

how to find JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar

I'm practicing a video tutorial from Pluralsight about Amazon EMR. I am stuck and cannot proceed because I am getting this error: Not a valid JAR: /home/hadoop/contrib/streaming/hadoop-streaming.jar. Please note that the tutorial is old and it is using a…
harshil bhatt (152)
8 votes, 1 answer

Could not find uri with key dfs.encryption.key.provider.uri to create a keyProvider in HDFS encryption for CDH 5.4

CDH Version: CDH 5.4.5. Issue: when HDFS encryption is enabled using the KMS available in Hadoop CDH 5.4, I get an error while putting a file into an encryption zone. Steps for encryption of Hadoop are as follows: Creating a key [SUCCESS] [tester@master…
Jack Sparrow (81)
8 votes, 0 answers

Hadoop Counters vs Spark Accumulators (or what's a best way to gather statistics from hadoop mr and spark applications)

I'd like to understand what the best practices are to gather statistics of job execution in standard Hadoop map-reduce and Spark. Given 1. A number of files in HDFS (each directory, i.e. dataset1, dataset2, etc. is the name of the dataset from the…
szhem (4,672)
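Both mechanisms solve the same problem: write-only named tallies that individual tasks increment as a side effect and the driver aggregates and reads at the end. A plain-Python sketch of the pattern, with made-up record handling for illustration:

```python
from collections import Counter

def process_partition(records, stats):
    # Each task increments named tallies as a side effect of the real
    # work -- the same pattern as context.getCounter(...).increment(1)
    # in MapReduce or accumulator.add(1) in Spark.
    out = []
    for rec in records:
        if not rec.strip():
            stats["malformed"] += 1
            continue
        stats["processed"] += 1
        out.append(rec.strip())
    return out

stats = Counter()
for partition in [["a", " ", "b"], ["", "c"]]:
    process_partition(partition, stats)
print(dict(stats))  # → {'processed': 3, 'malformed': 2}
```

The distributed versions add what this sketch lacks: the framework ships each task's tallies back and merges them, and (in Spark's case) guards against double-counting when tasks are retried only for accumulators used inside actions.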
8 votes, 1 answer

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster? The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
Marsellus Wallace (17,991)
8 votes, 3 answers

pyspark : how to check if a file exists in hdfs

I want to check if several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried os.system("hadoop fs -test -e %s" %path) but as I have a lot of paths to check, the job crashed. I tried also sc.wholeTextFiles(parent_path) and…
A7med (451)
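One way to avoid spawning a process per path is to list the parent directory once (a single `hadoop fs -ls` call via subprocess) and check membership locally. A sketch of the parsing side in plain Python; the sample listing below is fabricated for the example, not captured from a real cluster:

```python
def paths_in_listing(listing):
    # Extract the path column (last field) from `hadoop fs -ls` output,
    # skipping the "Found N items" header line.
    paths = set()
    for line in listing.splitlines():
        if not line or line.startswith("Found"):
            continue
        paths.add(line.split()[-1])
    return paths

# Fabricated sample in the shape `hadoop fs -ls` prints:
sample = """Found 2 items
-rw-r--r--   3 user group       1024 2016-01-01 12:00 /data/part-00000
-rw-r--r--   3 user group       2048 2016-01-01 12:00 /data/part-00001"""

existing = paths_in_listing(sample)
print("/data/part-00000" in existing)  # → True
print("/data/part-00042" in existing)  # → False
```

With the listing in hand, thousands of existence checks become set lookups instead of thousands of JVM-spawning shell invocations.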
8 votes, 1 answer

Unable to connect to Spark UI on EMR

I have set up my SSH tunnel as per the instructions on the EMR console using ssh -i ~/SparkTest.pem -ND 8157 hadoop@ec2-52-1-245-67.compute-1.amazonaws.com. I have also set up FoxyProxy as per the instructions. I can access the Hadoop…
Rory Byrne (923)
8 votes, 2 answers

How to use the ResourceManager web interface as a user

Every time I try to use the Hadoop ResourceManager web interface (http://resource-manger.host:8088/cluster/) I show up logged in as dr.who. My question: how can I log in as another user? In this case I want to log in as myself and have a higher lever…
SQL.injection (2,607)
8 votes, 2 answers

Pyspark: shuffle RDD

I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join by those integers. However, pyspark falls over with only 100000000 integers. I'm using the code…
Marcin (48,559)
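An alternative to the zip-and-join approach is to key each element by an independent random number and sort by that key (in PySpark, `rdd.sortBy(lambda _: random.random())`), which avoids building and joining a second RDD of integers. A plain-Python sketch of the idea:

```python
import random

def shuffle_by_random_key(elements, seed=None):
    # Attach an independent random key to every element and sort by it --
    # the same idea as rdd.sortBy(lambda _: random.random()) in PySpark.
    rng = random.Random(seed)
    keyed = [(rng.random(), x) for x in elements]
    keyed.sort(key=lambda kv: kv[0])
    return [x for _, x in keyed]

data = list(range(10))
print(shuffle_by_random_key(data, seed=42))
```

In the distributed setting the sort is what triggers the shuffle, so the random keys end up spreading elements across partitions as well as reordering them within each one.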
8 votes, 3 answers

"Wrong FS... expected: file:///" when trying to read file from HDFS in Java

I am unable to read a file from HDFS using Java: String hdfsUrl = "hdfs://:"; Configuration configuration = new Configuration(); configuration.set("fs.defaultFS", hdfsUrl); FileSystem fs = FileSystem.get(configuration); Path filePath = new…
jds (7,910)
8 votes, 2 answers

Hive collect_list() does not collect NULL values

I am trying to collect a column with NULLs along with some values in that column... but collect_list ignores the NULLs and collects only the ones with values. Is there a way to retrieve the NULLs along with the other values? SELECT col1, col2,…
lalith kkvn (310)
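This matches Hive's documented behavior: `collect_list` drops NULLs. A common workaround is to wrap the column in `coalesce(col, 'NULL')` (or a CASE expression) so a sentinel value survives collection. A plain-Python sketch of both behaviors, modeling NULL as `None`:

```python
def collect_list(values):
    # Mimics Hive's collect_list(): NULL (None) values are dropped.
    return [v for v in values if v is not None]

def collect_list_keep_nulls(values, sentinel="NULL"):
    # Analogue of collect_list(coalesce(col, 'NULL')): replace NULLs
    # with a sentinel so they survive collection.
    return [sentinel if v is None else v for v in values]

col = ["a", None, "b", None]
print(collect_list(col))             # → ['a', 'b']
print(collect_list_keep_nulls(col))  # → ['a', 'NULL', 'b', 'NULL']
```

The sentinel approach only works when the sentinel cannot collide with real data; otherwise a struct wrapper around the value is the safer variant.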
8 votes, 1 answer

Hive: Is there a better way to percentile rank a column?

Currently, to percentile rank a column in Hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value from 0 to 1 to each item. The code below assigns a value from 0 to…
Charlie Haley (4,152)
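Hive 0.11 and later provide the windowing function `PERCENT_RANK() OVER (ORDER BY col)`, which assigns (rank - 1) / (n - 1) to each row. A plain-Python sketch of that formula (assumes more than one row; ties share the rank of their first occurrence, as in Hive):

```python
def percent_rank(values):
    # Hive's PERCENT_RANK() over ORDER BY: (rank - 1) / (n - 1),
    # where rank is the 1-based position of the first equal value.
    ordered = sorted(values)
    n = len(values)
    ranks = {}
    for i, v in enumerate(ordered):
        if v not in ranks:        # ties share the first rank seen
            ranks[v] = i          # stored 0-based, i.e. rank - 1
    return [ranks[v] / (n - 1) for v in values]

print(percent_rank([10, 20, 30, 40, 50]))
# → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Using the built-in window function pushes this computation into Hive itself and avoids the self-join or subquery gymnastics that pre-0.11 percentile ranking required.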
8 votes, 1 answer

How to set up Hadoop in Docker Swarm?

I would like to be able to start a Hadoop cluster in Docker, distributing the Hadoop nodes to the different physical nodes, using swarm. I have found the sequenceiq image that lets me run hadoop in a docker container, but this doesn't allow me to…
SGer (544)
8 votes, 2 answers

Apache hive MSCK REPAIR TABLE new partition not added

I am new to Apache Hive. While working on external table partitions, if I add a new partition directly to HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below are the codes I tried, -- creating external table hive> create…
Green (111)