Questions tagged [hadoop]

Hadoop is an open-source project that provides a distributed, replicated file system (HDFS) and a production-grade MapReduce system. A series of complementary projects such as Hive, Pig, and HBase help get more out of a Hadoop-powered cluster.

Hadoop is an Apache Software Foundation project, with commercial support provided by multiple vendors, including Cloudera, Hortonworks, and MapR. The Apache project documents a more complete list of commercial offerings.

Standard components and complementary additions include:

  • Hadoop Distributed File System, HDFS (standard)
  • The MapReduce architecture (standard)
  • Hive, which provides a SQL-like interface to the MapReduce architecture
  • HBase, a distributed key-value store
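
As a rough illustration of the MapReduce architecture named in the list above, here is a minimal in-memory word-count sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only mirror the three conceptual stages that a real cluster distributes across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    On a real cluster the framework does this across machines;
    here it is a plain dictionary in one process.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The split into independent map and reduce functions is what allows a framework such as Hadoop to run each stage in parallel on different nodes; this sketch keeps everything in one process purely for readability.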

Recommended reference sources:

261 questions
0
votes
1 answer

What is the ideal instance type for a Hadoop namenode?

For a relatively small, one-terabyte cluster (2TB actual after replication), I was trying to nail down what the namenode's ideal memory/CPU size would be. Having worked with Hadoop off and on as an end user, I can't imagine it being too crazy... but…
David
0
votes
1 answer

How do I grant a user permission to use Hadoop via Kerberos?

I've set up Hadoop to use Kerberos (following the Cloudera security guide), but it is unclear how I connect to Hadoop as a regular user (e.g. username=myuser). Currently I have authenticated myself with Kerberos using my Kerberos admin user (via…
Dolan Antenucci
0
votes
1 answer

Hadoop + NAT scenario

I have a situation where I'd like to run Hadoop spread across 2 clusters. The first cluster (ClusterA) is normal and all nodes are publicly accessible. The second cluster (ClusterB) is behind a NAT. Nodes in ClusterA will be running both Mapred…
BigChief
0
votes
1 answer

Hadoop streaming job on EC2 stays in "pending" state

I'm trying to experiment with Hadoop and Streaming using the Cloudera distribution CDH3 on Ubuntu. I have valid data in hdfs:// ready for processing, and I wrote a little streaming mapper in Python. When I launch a mapper-only job using: hadoop jar…
liamf
0
votes
2 answers

Adding smaller nodes to pseudo-distributed nutch/hadoop cluster

I have nutch/hadoop running fine in pseudo-distributed mode. I want to add processing capacity by adding new nodes that are smaller than the master (HD 3 times smaller) and, of course, cheaper. Since the default HDFS replication factor is 3, after balancing the data…
millebii
0
votes
1 answer

Datanode not showing in web interface

Newbie on Hadoop clusters. I have set up my two-node configuration as described by M. G. Noll here. The datanode has datanode & tasktracker running (the jps command shows them). However, in the web UI I only see one node for the DFS Live Node : 1 Dead Node :…
millebii
0
votes
1 answer

Need help to build a strategy

I am a Junior System Administrator with one of the Engineering Schools. One of the Professors got a donation of 45 servers (Dell Poweredge 1690) from Yahoo. Following are his requirements: hadoop (mapreduce) on Linux (which flavor of Linux and…
Anup
0
votes
2 answers

How to use combined CPU/Memory power of a Windows cluster

I have 5 Windows machines (dual-core, 3GB) in a LAN, all joined to a domain. I have a program which needs 8 cores and 10 GB to run in a given SLA time. What platform/tool can I use to harness the combined CPU/memory and other resources of these…
Munish Goyal
0
votes
1 answer

Compiling hdfs-fuse bundled with Hadoop

I am trying to compile the hdfs-fuse extension from Hadoop 0.20.2 on a machine running Fedora 14. Below are the packages I have installed: fuse-2.8.5-2.fc14.x86_64 fuse-libs-2.8.5-2.fc14.x86_64 fuse-devel-2.8.5-2.fc14.x86_64 Then, I have…
Laurent
0
votes
1 answer

Can overriding of -Xmx be prevented for hadoop jobs?

I have a shared cluster running Hadoop-0.20.2. Occasionally users don't realize that the default memory settings chosen are based on the amount of available memory. Can I enforce a maximum value for Xmx?
Dan R
0
votes
0 answers

ZooKeeper error: unrecognized host name for local configuration

I am using Kylin 4+ and want to run it locally on Windows (without Hadoop). I followed the tutorial on their website, which states that the ZooKeeper config must be set to local like so: kylin.env.zookeeper-is-local=true, which supposes that Kylin won't…
0
votes
1 answer

Can a VM replace a physical machine?

We have 254 physical servers, all of them Dell R740 machines. The servers are part of a Hadoop cluster. Most of them hold the HDFS filesystem and run the data node & node manager services; some of them are Kafka machines. The OS installed on the…
King David
0
votes
0 answers

How to force Hadoop daemon or JVM to use a given hostname instead of the node's actual hostname

I have a 5-node Hadoop cluster with different FQDNs under the domain xyz.com, like node1.xyz.com, node2.xyz.com ... node5.xyz.com. The hostnames are configured with this domain, so if we run the hostname command in a Linux terminal it returns…
0
votes
1 answer

Clear RAM Memory Cache and buffer on production Hadoop cluster with HDFS filesystem

We have a Hadoop cluster with 265 Linux RHEL machines. Of the 265 machines, 230 are data nodes with the HDFS filesystem. Total memory on each data node is 128G, and we run many Spark applications on these machines. Last month we added…
King David
0
votes
1 answer

Hadoop datanodes Using "{Hostname}/{IP address}:9000" to try to connect to nameNode

I have a cluster of Pis that I'm using to experiment with Hadoop. masternode is set to .190, p1 to 191 ... p4 to 194. All nodes are up and running. start-dfs.sh, stop-all.sh, etc from the master successfully start and stop the datanodes. However, on…
Snap E Tom