
I'd like to understand the best practices for gathering statistics about job execution in standard Hadoop MapReduce and Spark.

Given

1. A number of files in HDFS (each directory, i.e. dataset1, dataset2, etc., is named after a dataset referenced in the configuration from point 3)

/user/tester/dataset1/part-0000*.avro
/user/tester/dataset2/part-0000*.avro
/user/tester/dataset3/part-0000*.avro
/user/tester/dataset4/part-0000*.avro

2. Each file contains Avro records with ~1000 attributes

| id | attr_1  | attr_2  | attr_3  | ... | attr_N  |
----------------------------------------------------
| 1  | val_1_1 | val_1_2 | val_1_3 | ... | val_1_N |
| 2  | val_2_1 | val_2_2 | val_2_3 | ... | val_2_N |
| M  | val_M_1 | val_M_2 | val_M_3 | ... | val_M_N |

3. There is a configuration file that specifies which attributes/columns to take from which dataset and how each dataset should be filtered, like the following one (a parsing sketch follows the XML)

<datasets>
    <dataset>
        <id>dataset1</id>
        <attributes>
            <attribute>attr_1</attribute>
            <attribute>attr_3</attribute>
        </attributes>
        <filter>attr_1 gt 50 and attr_3 eq 100</filter>
    </dataset>
    <dataset>
        <id>dataset2</id>
        <attributes>
            <attribute>attr_2</attribute>
            <attribute>attr_5</attribute>
            <attribute>attr_8</attribute>
        </attributes>
        <filter>attr_2 gteq 71</filter>
    </dataset>
    ...
</datasets>
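
For concreteness, a minimal sketch (in Scala, using scala-xml) of how such a configuration could be loaded; DatasetConfig and parseConfig are hypothetical names introduced only so the later sketches can refer to them:

    import scala.xml.XML

    // Hypothetical container for one <dataset> entry of the configuration above.
    case class DatasetConfig(id: String, attributes: Seq[String], filter: String)

    // Load the <datasets> document and turn every <dataset> into a DatasetConfig.
    def parseConfig(path: String): Seq[DatasetConfig] = {
      val root = XML.loadFile(path)
      (root \ "dataset").map { ds =>
        DatasetConfig(
          id         = (ds \ "id").text.trim,
          attributes = (ds \ "attributes" \ "attribute").map(_.text.trim),
          filter     = (ds \ "filter").text.trim
        )
      }
    }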

Problem

  1. Filter all the datasets and keep only the necessary attributes according to the configuration from point 3, then group the datasets by the id attribute and save the resulting dataset into a file (the implementation is pretty clear here).
  2. By the end of the job, determine the total number of records read from each dataset.
  3. By the end of the job, determine the number of records in each dataset after filtering.
  4. By the end of the job, determine how many times each non-empty attribute occurred in each dataset, according to the configuration from point 3.
  5. By the end of the job, determine how many times each non-empty attribute occurred in the final dataset (a possible DataFrame-based sketch of points 1-4 follows this list).
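
Below is a minimal Spark sketch of how points 1-4 could be computed with plain DataFrame operations instead of counters/accumulators. It assumes the spark-avro data source is on the classpath, that the filter grammar maps onto Spark SQL expressions (toSqlFilter is a hypothetical helper for that), and it reuses parseConfig from the sketch above; point 5 would be the same aggregation applied to the final, grouped dataset.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count}

    val spark = SparkSession.builder().appName("dataset-stats").getOrCreate()

    // Hypothetical translation of "attr_1 gt 50 and attr_3 eq 100" into Spark SQL syntax.
    def toSqlFilter(expr: String): String =
      expr.replaceAll("\\bgteq\\b", ">=")
          .replaceAll("\\bgt\\b", ">")
          .replaceAll("\\beq\\b", "=")

    for (c <- parseConfig("config.xml")) {
      val raw      = spark.read.format("avro").load(s"/user/tester/${c.id}")
      val selected = raw.select("id", c.attributes: _*)
      val filtered = selected.filter(toSqlFilter(c.filter))

      // points 2 and 3: records read vs. records left after filtering
      val totalRead   = raw.count()
      val keptRecords = filtered.count()

      // point 4: non-null occurrences per configured attribute (count(col) skips nulls)
      val attrCounts = c.attributes.map(a => count(col(a)).as(a))
      val nonEmpty   = filtered.agg(attrCounts.head, attrCounts.tail: _*).first()

      println(s"${c.id}: read=$totalRead, kept=$keptRecords, non-empty=$nonEmpty")
    }

This trades counter-style bookkeeping for extra passes over the data (or a cache on the filtered DataFrame), but it is not affected by task restarts or speculative execution.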

So

What is the best way to calculate such statistics, given that:

  1. Hadoop guarantees that, by job completion, each counter update will have been applied exactly once, even if

    1. the corresponding task is restarted, or
    2. speculative execution is enabled.
  2. Hadoop counters are not intended to be used for statistics.

  3. Regarding Spark,

    1. for accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will be applied only once, i.e. restarted tasks will not update the value, and
    2. in transformations, each task’s update may be applied more than once if tasks or job stages are re-executed (see the accumulator sketch after this list)?
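
For comparison, a minimal sketch of the accumulator-based variant (names are illustrative, and attr_1 is assumed to be a numeric column). Because the counts are bumped inside a transformation, they may be over-counted when tasks or stages are retried, which is exactly the caveat from point 3.2:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("accumulator-stats").getOrCreate()
    val sc    = spark.sparkContext

    // Named accumulators show up per job in the Spark UI.
    val recordsRead = sc.longAccumulator("dataset1.records.read")
    val recordsKept = sc.longAccumulator("dataset1.records.kept")

    val rows = spark.read.format("avro").load("/user/tester/dataset1").rdd

    val filtered = rows.filter { row =>
      recordsRead.add(1)                          // updated inside a transformation:
      val keep = row.getAs[Long]("attr_1") > 50   // may be applied more than once on retries
      if (keep) recordsKept.add(1)
      keep
    }

    filtered.count()                              // action that actually runs the pipeline
    println(s"read=${recordsRead.value}, kept=${recordsKept.value}")

The Hadoop-side equivalent would be incrementing counters from map()/reduce() via context.getCounter(group, name).increment(1), which is covered by the exactly-once guarantee from point 1, but it runs into the "counters are not for statistics" concern (and the per-job limit on the number of counters) when there are ~1000 attributes per dataset.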