Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode, and usually more than one for redundancy. The NameNode accepts requests coming in from client applications and coordinates the DataNodes, of which there are typically many, so that the work is shared across them. What ties them together is a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.
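
A minimal sketch of a client using HDFS through the FileSystem API follows: the NameNode serves the metadata, while the file bytes themselves flow to and from DataNodes. The NameNode URI and file path are illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        // Illustrative NameNode URI; in practice it comes from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write: the NameNode allocates blocks, the bytes go to DataNodes.
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode reports which DataNodes hold the blocks.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }
      }
    }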

[Figure: description of a Hadoop cluster]

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
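
Because FileSystem implementations are pluggable, the same client code can target a blobstore. A sketch assuming the s3a connector from the hadoop-aws module is on the classpath; the bucket name is illustrative and credentials are omitted.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aListing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials are usually supplied via fs.s3a.access.key /
        // fs.s3a.secret.key or an instance profile; omitted here.
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        for (FileStatus st : s3.listStatus(new Path("s3a://example-bucket/data/"))) {
          System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
      }
    }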

Since version 0.23, Hadoop ships with a standalone resource manager: YARN.

This resource manager makes it easier to use other modules alongside the MapReduce engine (a brief configuration sketch follows the list), such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system based on JSON schemas.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; the framework was developed to reduce the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform and programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.
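
From an individual job's point of view, running on YARN is mostly a matter of configuration rather than code. A minimal sketch, assuming the standard Hadoop 2.x property names; the ResourceManager host is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class YarnJobConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Select YARN instead of the "local" runner.
        conf.set("mapreduce.framework.name", "yarn");
        // Illustrative ResourceManager address; normally read from yarn-site.xml.
        conf.set("yarn.resourcemanager.address", "rm-host:8032");
        Job job = Job.getInstance(conf, "runs-on-yarn");
        // ... set mapper, reducer, and input/output paths as usual;
        // YARN then schedules the job's containers across the cluster.
      }
    }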

Commercial support is available from a variety of companies.

44316 questions
8 votes, 2 answers

Oozie shell action not running as submitting user

I've written an Oozie workflow that runs a Bash shell script to do some Hive queries and perform some actions on the results. The script runs but throws a permission error when accessing some of the HDFS data. The user that submitted the Oozie… — Blake

8 votes, 2 answers

Using Hive ntile results in where clause

I want to get summary data of the first quartile for a table in Hive. Below is a query to get the maximum number of views in each quartile: SELECT NTILE(4) OVER (ORDER BY total_views) AS quartile, MAX(total_views) FROM view_data GROUP BY… — Nadine

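Window functions such as NTILE cannot be referenced directly in a WHERE clause, so a common pattern is to compute the quartile in a subquery and filter on it in the outer query. A sketch over the HiveServer2 JDBC driver, reusing the question's view_data table; the connection URL is illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FirstQuartile {
      public static void main(String[] args) throws Exception {
        // Assumes the hive-jdbc driver is on the classpath.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement()) {
          // Compute NTILE in a subquery, then filter on it outside.
          ResultSet rs = stmt.executeQuery(
              "SELECT MAX(total_views) " +
              "FROM (SELECT total_views, " +
              "             NTILE(4) OVER (ORDER BY total_views) AS quartile " +
              "      FROM view_data) t " +
              "WHERE quartile = 1");
          if (rs.next()) System.out.println("Max views in Q1: " + rs.getLong(1));
        }
      }
    }
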
8 votes, 2 answers

hadoop map reduce taking forever to complete

I am new to the world of MapReduce. I have run a job and it seems to be taking forever to complete given that it is a relatively small task, so I am guessing something has not gone according to plan. I am using Hadoop version 2.6; here is some info… — godzilla

8 votes, 1 answer

How to interpret MapReduce Performance Counters

To be more specific: in task counters, the CPU spent comes from /proc/stat's utime + stime, which means things like IOWait are not counted. Is that right? The elapsed time for the whole task is a lot longer than the CPU time spent counter; does it mean… — user1192878

8 votes, 3 answers

merge multiple small files into few larger files in Spark

I am using Hive through Spark. I have an "insert into partitioned table" query in my Spark code. The input data is 200+ GB. When Spark is writing to a partitioned table, it is spitting out very small files (files in KBs). So now the output partitioned table… — dheee

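Each writing task produces its own output files, so a highly parallel insert can leave thousands of tiny files behind. One common mitigation is to reduce the number of output partitions with coalesce() before writing. A sketch against the newer SparkSession API (the question predates it); table names and the partition count are illustrative.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CompactInsert {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("compact-insert")
            .enableHiveSupport()
            .getOrCreate();

        Dataset<Row> input = spark.sql("SELECT * FROM staging_table");
        // Fewer partitions => fewer, larger files per table partition.
        // Dynamic partition inserts may also need
        // hive.exec.dynamic.partition.mode=nonstrict.
        input.coalesce(64)
             .write()
             .mode("append")
             .insertInto("target_partitioned_table");
      }
    }
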
8 votes, 6 answers

Hadoop on Windows. YARN fails to start with java.lang.UnsatisfiedLinkError

I have installed/configured Hadoop (hadoop-2.7.0) on Windows. I could successfully run the "sbin\start-dfs" command; the DataNode and NameNode started, and I could create a directory and add a file into the Hadoop system. But now when I try "sbin/start-yarn" on… — Kaushik Lele

8 votes, 2 answers

Save flume output to hive table with Hive Sink

I am trying to configure Flume with Hive to save Flume output to a Hive table with the Hive Sink type. I have a single-node cluster and use the MapR Hadoop distribution. Here is my flume.conf: agent1.sources = source1 agent1.channels = channel1 agent1.sinks =… — Andrey Braslavskiy

8 votes, 2 answers

Hive clustered by on more than one column

I understand that when a Hive table is clustered by one column, Hive performs a hash function on that bucketed column and puts each row of data into one of the buckets, and there is a file for each bucket, i.e. if there are 32 buckets… — Manikandan Kannan

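For reference, when a table is CLUSTERED BY two columns, Hive hashes the combination of both values to choose a single bucket; there is still one file per bucket, not one per column. A sketch issuing such a DDL over JDBC; all names are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TwoColumnBuckets {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement()) {
          // Rows are assigned to one of 32 buckets by hashing
          // the (user_id, event) pair together.
          stmt.execute(
              "CREATE TABLE user_events (user_id BIGINT, event STRING) " +
              "CLUSTERED BY (user_id, event) INTO 32 BUCKETS " +
              "STORED AS ORC");
        }
      }
    }
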
8 votes, 1 answer

How to flatMap a function on GroupedDataSet in Apache Flink

I want to apply a function via flatMap to each group produced by DataSet.groupBy. Trying to call flatMap I get the compiler error: error: value flatMap is not a member of org.apache.flink.api.scala.GroupedDataSet My code: var mapped =… — Willi Müller

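GroupedDataSet does not define flatMap; the per-group equivalent is reduceGroup, whose GroupReduceFunction can emit any number of records through a Collector. A sketch using the DataSet API's Java counterpart; the data and types are illustrative.

    import org.apache.flink.api.common.functions.GroupReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class GroupFlatMap {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> data = env.fromElements(
            Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3));

        data.groupBy(0)   // group on the first tuple field
            .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, String>() {
              @Override
              public void reduce(Iterable<Tuple2<String, Integer>> group,
                                 Collector<String> out) {
                int sum = 0;
                String key = null;
                for (Tuple2<String, Integer> t : group) { key = t.f0; sum += t.f1; }
                out.collect(key + " -> " + sum);  // may emit 0..n records per group
              }
            })
            .print();
      }
    }
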
8 votes, 2 answers

Change Hive Database location

Is there a way to alter the location that a database points to? I tried the following ways: alter database set DBPROPERTIES('hive.warehouse.dir'=''); alter database set DBPROPERTIES('location'=''); alter… — Harman

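In Hive releases that include HIVE-8472, ALTER DATABASE ... SET LOCATION changes the database's default parent directory, but it only affects tables created afterwards; existing tables keep their current locations. A sketch over JDBC; the database name and path are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MoveDatabaseDefault {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default");
             Statement stmt = conn.createStatement()) {
          // Only new tables in 'analytics' will default to this path.
          stmt.execute("ALTER DATABASE analytics " +
                       "SET LOCATION 'hdfs://namenode:8020/warehouse/analytics.db'");
        }
      }
    }
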
8 votes, 4 answers

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3

I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log in to the master node and submit the following command: bin/hadoop jar .jar It throws the following errors (not at the… — Deepak

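The programmatic equivalent of hadoop fs -put is FileSystem.copyFromLocalFile. A minimal sketch; the NameNode URI and both paths are illustrative.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutLocalData {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode:8020"), new Configuration());
        // Copy a local file into the cluster, like `hadoop fs -put`.
        fs.copyFromLocalFile(new Path("file:///home/user/data.csv"),
                             new Path("/input/data.csv"));
      }
    }
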
8 votes, 5 answers

Hadoop - java.net.ConnectException: Connection refused

I want to connect to HDFS (on localhost) and I get an error: Call From despubuntu-ThinkPad-E420/127.0.1.1 to localhost:54310 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: … — Alex

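"Connection refused" usually means nothing is listening at the host:port the client derived from fs.defaultFS, for example because the NameNode was never started or listens on a different port. A small smoke test, assuming core-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Verify the address the client is actually trying to reach.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        try (FileSystem fs = FileSystem.get(conf)) {
          // Throws ConnectException if no NameNode answers at that address.
          System.out.println("HDFS root exists: " + fs.exists(new Path("/")));
        }
      }
    }
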
8 votes, 4 answers

Pig keeps trying to connect to job history server (and fails)

I'm running a Pig job that fails to connect to the Hadoop job history server. The task (usually any task with GROUP BY) runs for a while and then it starts with a message like: 2015-04-21 19:05:22,825 [main] INFO … — badroit

8 votes, 3 answers

Change Block size of existing files in Hadoop

Consider a Hadoop cluster where the default block size is 64 MB in hdfs-site.xml. However, later on the team decides to change this to 128 MB. Here are my questions for the above scenario: will this change require a restart of the cluster, or will it be… — divinedragon

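A changed dfs.blocksize applies only to files written afterwards; existing files keep their old block size until they are rewritten (for example with distcp). The block size can also be chosen per file at create time, as sketched below; all values and paths are illustrative.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWith128MbBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode:8020"), new Configuration());
        long blockSize = 128L * 1024 * 1024;  // 128 MB for this file only
        short replication = 3;
        int bufferSize = 4096;
        try (FSDataOutputStream out = fs.create(
                new Path("/data/big.bin"), true, bufferSize, replication, blockSize)) {
          out.write(new byte[]{1, 2, 3});
        }
      }
    }
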
8 votes, 4 answers

Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

When running the Spark 1.3.0 Pi example on YARN (Hadoop 2.6.0.2.2.0.0-2041) with the following script: # Run on a YARN cluster export HADOOP_CONF_DIR=/etc/hadoop/conf /var/home2/test/spark/bin/spark-submit \ --class org.apache.spark.examples.SparkPi… — zork