I am quite new to big data tools like Hadoop. I want to replay a publicly available cluster trace (https://github.com/google/cluster-data) on YARN or the YARN simulator.

One way to do this is to feed the input into YARN via Gridmix.

The input format that Gridmix (https://hadoop.apache.org/docs/r2.8.3/hadoop-gridmix/GridMix.html) accepts is basically the output of Rumen, and Rumen (https://hadoop.apache.org/docs/r2.8.3/hadoop-rumen/Rumen.html) takes the JobHistory logs generated by a MapReduce cluster as its input.

The Google trace is not a MapReduce trace. However, I was wondering: if I could transform it into the same format that Gridmix takes as input, then I could use Gridmix.

Can anyone here point me to the input format of Gridmix (or the output format of Rumen)?

Or suggest another way to accomplish what I want to do?

Thanks.

PHcoDer

1 Answer

Rumen produces two output files: a job-trace file and a cluster-topology file.

Both are in JSON format. The job-trace file looks like this:

{
  "jobID" : "job_1546949851050_53464",
  "user" : "mammut",
  "computonsPerMapInputByte" : -1,
  "computonsPerMapOutputByte" : -1,
  "computonsPerReduceInputByte" : -1,
  "computonsPerReduceOutputByte" : -1,
  "submitTime" : 1551801585141,
  "launchTime" : 1551801594958,
  "finishTime" : 1551801630228,
  "heapMegabytes" : 200,
  "totalMaps" : 2,
  "totalReduces" : 1,
  "outcome" : "SUCCESS",
  "jobtype" : "JAVA",
  "priority" : "NORMAL",
  "directDependantJobs" : [ ],
  "mapTasks" : [ {
    "inputBytes" : 25599927,
    ...}]
  ...
}
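So one approach is to generate a job-trace file like the one above directly from the Google trace's job and task event tables. Here is a minimal sketch of such a conversion; the field mapping is an assumption (the Google trace has no map/reduce split, so every task is crudely counted as a map), and the epoch anchor and function names are made up for illustration:

```python
import json

# Assumed wall-clock anchor for the trace start; Google trace timestamps
# are microseconds relative to the beginning of the trace.
TRACE_EPOCH_MS = 1551801000000

def google_job_to_rumen(job_id, user, submit_us, start_us, finish_us,
                        num_tasks):
    """Build a minimal Rumen-style job-trace dict from Google trace fields.

    This is a sketch: Rumen field names follow the sample above, while the
    Google-side fields and their interpretation are assumptions.
    """
    to_ms = lambda us: TRACE_EPOCH_MS + us // 1000
    return {
        "jobID": f"job_{job_id}",
        "user": user,
        "computonsPerMapInputByte": -1,
        "computonsPerMapOutputByte": -1,
        "computonsPerReduceInputByte": -1,
        "computonsPerReduceOutputByte": -1,
        "submitTime": to_ms(submit_us),
        "launchTime": to_ms(start_us),
        "finishTime": to_ms(finish_us),
        "heapMegabytes": 200,
        "totalMaps": num_tasks,   # crude: treat every Google task as a map
        "totalReduces": 0,
        "outcome": "SUCCESS",
        "jobtype": "JAVA",
        "priority": "NORMAL",
        "directDependantJobs": [],
        "mapTasks": [],           # would be filled from the task/usage tables
    }

record = google_job_to_rumen("6251234", "user42",
                             submit_us=585_141_000,
                             start_us=594_958_000,
                             finish_us=630_228_000,
                             num_tasks=2)
print(json.dumps(record))
```

Whether Gridmix accepts the result still depends on filling in the per-task fields (`mapTasks`, input/output byte counts) with plausible values, since the Google trace does not record them.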

And the cluster-topology file looks like this:

{
  "name" : "<root>",
  "children" : [ {
    "name" : "rack-01",
    "children" : [ {
      "name" : "",
      "children" : null
    }, {
      "name" : "",
      "children" : null
    }, {
      "name" : "",
      "children" : null
    } ]
  }, {
    "name" : "default-rack",
    "children" : [ {
      "name" : "x",
      "children" : null
    } ]
  } ]
}
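The topology file could likewise be generated from a machine-to-rack mapping. A sketch, assuming you can derive such a mapping from the trace's machine events (the host and rack names below are made up):

```python
import json

def build_topology(rack_of):
    """Build a Rumen-style cluster-topology dict from a host -> rack map."""
    racks = {}
    for host, rack in rack_of.items():
        racks.setdefault(rack, []).append(host)
    return {
        "name": "<root>",
        "children": [
            {"name": rack,
             "children": [{"name": h, "children": None}
                          for h in sorted(hosts)]}
            for rack, hosts in sorted(racks.items())
        ],
    }

topo = build_topology({"node-01": "rack-01", "node-02": "rack-01",
                       "node-03": "default-rack"})
print(json.dumps(topo, indent=2))
```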
lin0Xu