Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

"Hadoop" typically refers to the software in the project that implements the map-reduce data analysis framework, plus the distributed file system (HDFS) that underlies it.

A cluster has at least one NameNode (usually more than one, for redundancy), which accepts requests from client applications and farms the work out to the DataNodes. There are typically many DataNodes, and they share the processing work among themselves. They can do so because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS.

(Figure: description of a Hadoop cluster)
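As a concrete illustration of the client/NameNode/DataNode interaction, here is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem API; the NameNode address and directory are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            // The NameNode answers this metadata request; the DataNodes
            // hold the actual blocks of each file listed.
            for (FileStatus status : fs.listStatus(new Path("/user/example"))) {
                System.out.println(status.getPath() + " " + status.getLen());
            }
            fs.close();
        }
    }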

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, blobstores such as Amazon S3 and Azure Storage, and alternative distributed filesystems. See: Filesystem Compatibility with Apache Hadoop.
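Because the FileSystem API is pluggable, the same client code can target a blobstore simply by using a different URI scheme. A sketch, assuming the hadoop-aws module (which provides the s3a connector) is on the classpath; the bucket name is hypothetical:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aListExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials normally come from core-site.xml or the environment.
            FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
            for (FileStatus st : s3.listStatus(new Path("/data"))) {
                System.out.println(st.getPath());
            }
        }
    }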

Since version 0.23, Hadoop has included a standalone resource manager: YARN.

This resource manager makes it easier to run other modules alongside the MapReduce engine, such as:

  • Accumulo, a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
  • Ambari, a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (heatmaps, for example) and lets you inspect MapReduce, Pig, and Hive applications visually, with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro, a data serialization system whose schemas are defined in JSON.
  • Cassandra, a replicated, fault-tolerant, decentralized, and scalable database system.
  • Chukwa, a data collection system for managing large distributed systems.
  • Cascading, a software abstraction layer for Apache Hadoop aimed mainly at Java developers; it was developed to reduce the boilerplate code that MapReduce programmers with Java skills would otherwise write.
  • Flink, a fast and reliable large-scale data processing engine.
  • Giraph, an iterative graph processing framework built on top of Apache Hadoop.
  • HBase, a scalable, distributed database that supports structured data storage for large tables.
  • Hive, a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout, a library of machine learning algorithms compatible with the MapReduce paradigm.
  • Oozie, a workflow scheduler system to manage Apache Hadoop jobs.
  • Pig, a platform/programming language for authoring parallelizable jobs.
  • Spark, a fast and general engine for large-scale data processing.
  • Storm, a system for real-time and stream processing.
  • Tez, an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN.
  • ZooKeeper, a system for coordinating distributed nodes, similar to Google's Chubby.

Related Technology

Commercial support is available from a variety of companies.

44,316 questions

8 votes, 1 answer
Spark pulling data into RDD or dataframe or dataset

I'm trying to put into simple terms when Spark pulls data through the driver, and then when Spark doesn't need to pull data through the driver. I have 3 questions - Let's say you have a 20 TB flat file stored in HDFS and from a driver…
— uh_big_mike_boi

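The core distinction the question is after: actions such as collect() funnel data back through the driver, while counts and distributed writes stay on the executors. A minimal sketch in Spark's Java API (the input and output paths are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DriverVsExecutors {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("driver-vs-executors").getOrCreate();

            // Hypothetical 20 TB input; the read is lazy and distributed.
            Dataset<Row> df = spark.read().text("hdfs:///data/huge_flat_file");

            // Runs entirely on the executors; only one long travels to the driver.
            long rows = df.count();

            // Also distributed: each executor writes its own partitions.
            df.write().parquet("hdfs:///data/out");

            // DANGER on input this size: collect() pulls every row into
            // driver memory.
            // List<Row> all = df.collectAsList();

            System.out.println("rows = " + rows);
            spark.stop();
        }
    }
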
8 votes, 1 answer
Running yarn with spark not working with Java 8

I have a cluster with 1 master and 6 slaves which uses the pre-built version of hadoop 2.6.0 and spark 1.6.2. I was running hadoop MR and spark jobs without any problem with openjdk 7 installed on all the nodes. However, when I upgraded openjdk 7 to…
— jmoa

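A common culprit here (the truncated excerpt cannot confirm it) is YARN's virtual-memory check: Java 8 reserves a much larger virtual-memory footprint than Java 7, so containers get killed shortly after launch. A frequently cited workaround is to relax the check in yarn-site.xml on every node; both property names below are real YARN settings, but the values are illustrative:

    <!-- yarn-site.xml fragment: disables the vmem check that Java 8's
         larger virtual-memory footprint tends to trip. -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>
    <!-- Alternative: keep the check but allow more virtual memory
         per MB of physical memory. -->
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>4</value>
    </property>
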
8 votes, 1 answer
How to count number of files under specific directory in hadoop?

I'm new to the map-reduce framework. I want to find out the number of files under a specific directory by providing the name of that directory. e.g. suppose we have 3 directories A, B, C, each having 20, 30, 40 part-r files respectively. So…
— Prasanna

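No MapReduce job is required for this: the NameNode can answer the question directly through the FileSystem API (the shell command hdfs dfs -count reports the same numbers). A sketch, with the directory path hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountFiles {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Recursively counts files and directories under the given path.
            ContentSummary summary = fs.getContentSummary(new Path("/user/data/A"));
            System.out.println("files: " + summary.getFileCount()
                    + ", dirs: " + summary.getDirectoryCount());
        }
    }
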
8 votes, 1 answer
Dynamically load partitions in Hive with predicate pushdown

I have a very large table in Hive, from which we need to load a subset of partitions. It looks something like this: CREATE EXTERNAL TABLE table1 ( col1 STRING ) PARTITIONED BY (p_key STRING); I can load specific partitions like this: SELECT *…
— KennethJ

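For context on what does and does not push the partition predicate down, a hedged HiveQL sketch (the keys_to_load table is hypothetical):

    -- Prunes partitions: the predicate on p_key is a constant the
    -- compiler can evaluate at planning time.
    SELECT * FROM table1 WHERE p_key IN ('2016-01-01', '2016-01-02');

    -- Generally does NOT prune at compile time: the partition list comes
    -- from another table, so all partitions may be scanned (newer Hive
    -- on Tez can recover some of this with dynamic partition pruning).
    SELECT t.*
    FROM table1 t
    JOIN keys_to_load k ON t.p_key = k.p_key;
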
8 votes, 3 answers
Hadoop NoSuchMethodError apache.commons.cli

I'm using hadoop-2.7.2 and I built a MapReduce job with IntelliJ. In my job, I'm using apache.commons.cli-1.3.1 and I put the lib in the jar. When I run the job on my Hadoop cluster I get a NoSuchMethodError: Exception in thread "main"…
— Antonin

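A NoSuchMethodError like this usually means the older commons-cli that Hadoop itself ships is shadowing the 1.3.1 copy bundled in the job jar. One hedged mitigation for the task side is to flip the classpath precedence:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ClasspathFirstExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask the framework to prefer classes from the user's job jar
            // over the versions Hadoop itself ships (affects task JVMs).
            conf.setBoolean("mapreduce.job.user.classpath.first", true);
            Job job = Job.getInstance(conf, "my-job"); // rest of job setup elided
        }
    }

Because this particular exception is thrown in the client's main thread, relocating (shading) commons-cli with the Maven Shade plugin is often the more reliable fix.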
8 votes, 1 answer
java.io.EOFException: Premature EOF: no length prefix available in Spark on Hadoop

I'm getting this weird exception. I'm using Spark 1.6.0 on Hadoop 2.6.4, submitting a Spark job on a YARN cluster. 16/07/23 20:05:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block…
— ChikuMiku

8 votes, 6 answers
Use SparkContext hadoop configuration within RDD methods/closures, like foreachPartition

I am using Spark to read a bunch of files, elaborating on them and then saving all of them as a Sequence file. What I wanted was to have 1 sequence file per partition, so I did this: SparkConf sparkConf = new SparkConf().setAppName("writingHDFS") …
— Vale

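The usual stumbling block in this pattern is that Hadoop's Configuration is not serializable, so it cannot be captured by the foreachPartition closure. A sketch of one workaround, rebuilding the configuration on the executors:

    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.spark.api.java.JavaRDD;

    public class PerPartitionWrite {
        static void writePartitions(JavaRDD<String> rdd) {
            rdd.foreachPartition((Iterator<String> rows) -> {
                // Configuration is not Serializable, so don't capture the
                // driver's sc.hadoopConfiguration() here; build a fresh one
                // on the executor (it reads core-site.xml/hdfs-site.xml
                // from the cluster classpath).
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                while (rows.hasNext()) {
                    String row = rows.next();
                    // ... append row to a per-partition SequenceFile (elided) ...
                }
            });
        }
    }

Settings that exist only in the driver's configuration can alternatively be shipped as plain broadcast strings and re-applied inside the closure.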
8 votes, 1 answer
Kerberos Authentication Error - When loading Hadoop Config Files from SharedPath

I am developing a Java application that saves its result data to HDFS. The application should run on my Windows machine. We use Kerberos authentication, with a keytab file placed on a NAS drive, and we saved the Hadoop config…

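For keytab-based logins from a client machine, the usual entry point is UserGroupInformation. A minimal sketch; the principal, share, and file names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Config files mounted from the shared path (hypothetical locations).
            conf.addResource(new Path("//nas/share/hadoop-conf/core-site.xml"));
            conf.addResource(new Path("//nas/share/hadoop-conf/hdfs-site.xml"));
            conf.set("hadoop.security.authentication", "kerberos");

            UserGroupInformation.setConfiguration(conf);
            // Log in explicitly from the keytab on the NAS drive.
            UserGroupInformation.loginUserFromKeytab(
                    "appuser@EXAMPLE.COM", "//nas/share/keytabs/appuser.keytab");
            // ... FileSystem.get(conf) and write results as usual ...
        }
    }
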
8 votes, 1 answer
Forward fill missing values in Spark/Python

I am attempting to fill in missing values in my Spark dataframe with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this…

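The standard approach is a window function with last(..., ignoreNulls): the most recent non-null value up to the current row is exactly a forward fill. A sketch in Spark's Java API (the same functions exist in PySpark); the column names are hypothetical:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.last;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    public class ForwardFill {
        static Dataset<Row> forwardFill(Dataset<Row> df) {
            // Window per id, ordered by time, spanning from the first row
            // up to the current one.
            WindowSpec w = Window.partitionBy("id")
                                 .orderBy("ts")
                                 .rowsBetween(Long.MIN_VALUE, 0);
            // last(..., true) ignores nulls, i.e. returns the most recent
            // non-null value seen so far: a forward fill.
            Column filled = last(col("value"), true).over(w);
            return df.withColumn("value_filled", filled);
        }
    }
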
8 votes, 2 answers
Insert timestamp into Hive

Hi, I'm new to Hive and I want to insert the current timestamp into my table along with a row of data. Here is an example of my team table: team_id int, fname string, lname string, time timestamp. I have looked at some other examples, How to…
— Frostie_the_snowman

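A hedged sketch of one common approach, assuming Hive 1.2+ (which provides current_timestamp() and accepts a FROM-less SELECT); the sample values are made up:

    -- Compute the timestamp at insert time rather than supplying a literal.
    INSERT INTO TABLE team
    SELECT 1, 'John', 'Smith', current_timestamp();

    -- On older Hive versions, from_unixtime(unix_timestamp()) is the
    -- usual substitute for current_timestamp().
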
8 votes, 4 answers
How to select all columns of a dataframe in join - Spark-scala

I am doing a join of 2 data frames and selecting all columns of the left frame, for example: val join_df = first_df.join(second_df, first_df("id") === second_df("id"), "left_outer") In the above I want to select first_df.*. How can I select all columns of…
— user2895589

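One approach is to alias each side before the join and qualify the star afterwards. Shown here with Spark's Java API; the Scala version is the same calls with the === sugar from the excerpt:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class SelectLeftColumns {
        static Dataset<Row> leftColumns(Dataset<Row> first, Dataset<Row> second) {
            // Alias both sides so the star can be qualified after the join.
            return first.as("a")
                    .join(second.as("b"), col("a.id").equalTo(col("b.id")), "left_outer")
                    .selectExpr("a.*"); // keeps only the left frame's columns
        }
    }
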
8 votes, 2 answers
Hadoop MR source: HDFS vs HBase. Benefits of each?

If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs sourcing data from either HDFS or HBase. Assuming the previous assumption is correct, why would I choose one over the other? Is there a benefit of performance, reliability,…
— Andre

8 votes, 1 answer
How to copy files from HDFS to S3 effectively programmatically

My hadoop job generates a large number of files on HDFS and I want to write a separate thread which will copy these files from HDFS to S3. Could anyone point me to a Java API that handles it. Thanks
— RandomQuestion

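For copying a modest number of files between any two FileSystem implementations, org.apache.hadoop.fs.FileUtil.copy works; for bulk transfers, Hadoop's DistCp tool is the usual answer. A sketch, with bucket and paths hypothetical (hadoop-aws or an equivalent S3 connector must be on the classpath):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HdfsToS3 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);
            FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

            // Copies a file (or directory tree) between filesystems;
            // 'false' means don't delete the source afterwards.
            FileUtil.copy(hdfs, new Path("/jobs/output/part-r-00000"),
                          s3, new Path("/backup/part-r-00000"),
                          false, conf);
        }
    }
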
8 votes, 5 answers
How to access files in Hadoop HDFS?

I have a .jar file (containing a Java project that I want to modify) in my Hadoop HDFS that I want to open in Eclipse. When I type hdfs dfs -ls /user/... I can see that the .jar file is there - however, when I open up Eclipse and try to import it I…
— wj1091

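Eclipse cannot import a file that lives inside HDFS; the jar first has to be copied to the local filesystem (hdfs dfs -get does the same from the shell). A sketch with hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchJar {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Copy the jar out of HDFS onto local disk, then import the
            // local copy into Eclipse.
            fs.copyToLocalFile(new Path("/user/someone/project.jar"),
                               new Path("/tmp/project.jar"));
        }
    }
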
8 votes, 2 answers
Hive Utf-8 Encoding number of characters supported?

Hi, actually the problem is as follows: the data I want to insert into a Hive table has Latin words and is in UTF-8 encoded format. But Hive still does not display it properly. Actual Data:- Data Inserted in hive I changed the encoding of the table to…
— Chetan Pulate
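
When the files' encoding and the table's serde disagree, one commonly suggested fix is to declare the encoding on the serde. A hedged HiveQL sketch; the table name and charset are assumptions:

    -- Tell LazySimpleSerDe what encoding the underlying files actually use.
    ALTER TABLE my_table
    SET SERDEPROPERTIES ('serialization.encoding' = 'ISO-8859-1');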