1

I am running my Hadoop jobs on a cluster consisting of multiple machines whose specifications I do not know (main memory, number of cores, storage, etc. per machine). Without using any OS-specific libraries (*.so files, I mean), is there any class or tool in Hadoop itself, or in some additional library, with which I could collect the following information while the Hadoop MR jobs are being executed:

  1. Total Number of cores / number of cores employed by the job
  2. Total available main memory / allocated available main memory
  3. Total Storage space on each machine/allocated storage space

I don't have the hardware information or the specs of the cluster, which is why I want to collect this kind of information programmatically in my Hadoop code.
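
For illustration, plain Java can at least report the local JVM's view of some of these numbers (a minimal sketch; the /tmp path is only a placeholder), but that is not the cluster-wide, per-machine information I am after:

import java.io.File;
import java.net.InetAddress;

public class LocalJvmView {
    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();
        System.out.println("host:            " + InetAddress.getLocalHost().getHostName());
        // cores visible to this JVM, not necessarily the machine's total
        System.out.println("processors:      " + rt.availableProcessors());
        // heap limits of this JVM, not the machine's total main memory
        System.out.println("max heap bytes:  " + rt.maxMemory());
        System.out.println("free heap bytes: " + rt.freeMemory());
        // disk space of whatever partition the given path lives on (placeholder path)
        File dir = new File("/tmp");
        System.out.println("total bytes:     " + dir.getTotalSpace());
        System.out.println("usable bytes:    " + dir.getUsableSpace());
    }
}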

How can I achieve this? I want this kind of information for several reasons. One reason is the following error: I want to know which machine ran out of space.

12/07/17 14:28:25 INFO mapred.JobClient: Task Id : attempt_201205221754_0208_m_001087_0, Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill2.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:121)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1247)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1155)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.
Bob
  • 991
  • 8
  • 23
  • 40

3 Answers

1

The master node would have ssh access to all the slaves, and the list of all the nodes should be in the slaves file. So, write a script which iterates through the list of nodes in the slaves file and copies the files to the master using scp.

Something like this script should work

# copy each slave's CPU and memory info to the master, one file per host
for i in `cat /home/praveensripati/Installations/hadoop-0.21.0/conf/slaves`;
do
    scp praveensripati@$i:/proc/cpuinfo cpuinfo_$i
    scp praveensripati@$i:/proc/meminfo meminfo_$i
done

The host name/IP ($i) would be appended to the cpuinfo and meminfo file names. An MR job would be overkill for this task.

Praveen Sripati
  • 32,799
  • 16
  • 80
  • 117
0

Assuming your cluster is deployed on Linux nodes, you can extract the CPU and memory information from the /proc/cpuinfo and /proc/meminfo files. You'll need to write a custom input format that ensures you touch each node in the cluster (or just process a text file with a split size that forces enough map tasks to be generated so that each task tracker node gets at least one task to execute).

You can output the information as (hostname, info) pairs from the mapper and dedup in the reducer.

Note that cpuinfo will report the number of hyperthreaded cores (if you have a compatible CPU) rather than the number of physical cores, so a 4-core hyperthreaded CPU will probably show 8 'processors' in /proc/cpuinfo.
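
Something like the following sketch could implement that approach (a hedged example, not from the original answer: it assumes Linux nodes and the new org.apache.hadoop.mapreduce API, the class names are made up, and the driver/job wiring plus the dummy input file that spreads map tasks across the nodes are omitted):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NodeInfoCollector {

    // read a small node-local file such as /proc/cpuinfo into a String
    static String readFile(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    public static class NodeInfoMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text host = new Text();
        private final Text info = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // read the node-local files once per task, not once per input record
            host.set(InetAddress.getLocalHost().getHostName());
            info.set(readFile("/proc/cpuinfo") + readFile("/proc/meminfo"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // the input records only exist to spread tasks across the nodes; ignore them
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit one (hostname, info) pair per map task
            context.write(host, info);
        }
    }

    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text host, Iterable<Text> infos, Context context)
                throws IOException, InterruptedException {
            // several map tasks may have run on the same node; keep one copy per hostname
            for (Text info : infos) {
                context.write(host, info);
                break;
            }
        }
    }
}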

Chris White
  • 29,949
  • 4
  • 71
  • 93
  • Your idea is good, but how can I access the hostname? Should I process it as a classical map reduce job? Can't I just open the file in the main class and write it out to stdout? – Bob Jul 17 '12 at 16:26
  • Acquiring the hostname - http://stackoverflow.com/questions/5596788/get-hostname-of-local-machine – Chris White Jul 17 '12 at 17:57
  • If you want to know the specs of every node in your cluster then you need to run it as a MR job (unless of course you have ssh access to each node, in which case a shell script will be much easier). Running in the main(String args[]) method will only get you the information for the machine you are currently on – Chris White Jul 17 '12 at 17:58
  • I have access to the hostname. Now the problem is that Hadoop is throwing a too many open files exception when I read /proc/meminfo or cpuinfo from my mapper class. I think it is related to concurrent threads trying to read the file while map is being executed. I made my readfile method synchronized. How can I avoid this problem? The Hadoop job fails at the moment because of this. – Bob Jul 18 '12 at 10:39
  • sync tag is not going to help you as the mappers run in separate JVMs. How many map tasks does your job currently run? – Chris White Jul 18 '12 at 10:51
  • like 100, therefore 100 * default 64MB per block is the size of the input file. How can I make sure that the files are read properly? – Bob Jul 18 '12 at 11:06
  • Just to be clear - you're not doing this in the map method are you? it should be done in the setup method, and you can probably override the run method so that the actual processing of each line in the input file is skipped – Chris White Jul 18 '12 at 22:35
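
Following Chris White's last comment, a minimal sketch of that run() override (again illustrative, building on the hypothetical NodeInfoMapper above and assumed to live in the same package): setup() reads the /proc files once, the per-record loop is skipped entirely, and cleanup() emits the single pair, so nothing reopens the files while records are being processed.

import java.io.IOException;

public class SetupOnlyNodeInfoMapper extends NodeInfoCollector.NodeInfoMapper {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);    // reads /proc/cpuinfo and /proc/meminfo and resolves the hostname
        // the usual while (context.nextKeyValue()) { map(...) } loop is intentionally skipped
        cleanup(context);  // writes the single (hostname, info) pair
    }
}
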
0

The ohai library (part of Opscode Chef) is superb; it will output a JSON dump of all sorts of stats from the machine.

There used to be a flag -- mapred.max.maps.per.node -- to limit the number of tasks any one job could run concurrently on one node, but it was removed. Boo. You would have to run a modified Scheduler to provide that functionality.

mrflip
  • 822
  • 6
  • 7