2

I am a bit stuck with a simple task of aggregating two example sets using rapidminer and need an advise on a process and possible operators.

I have two CSVs like:

key     A   B   C   D   E
first   2   2   2   2   2
second  3   3   3   3   3

and

key     D   E   F   G   H   I   J   K
first   4   4   4   4   4   99  99  99
third   5   5   5   5   5   77  77  77

What I want to have is an aggregated dataset which is basically a Union of these two, but also I would like to make the 'key' attribute a uniqie one in the resulting dataset. Since the 'first' value appears in both datasets, I need to sum the corresponding values, so the resulting dataset would be the following:

key     A   B   C   D   E   F   G   H   I   J   K
first   2   2   2   6   6   4   4   4   99  99  99
second  3   3   3   3   3   ?   ?   ?   ?   ?   ?
 third  ?   ?   ?   5   5   5   5   5   77  77  77

Note that 'First D' and 'First E' cells have become 6 (2+4).

Question marks are missing values as produced by the Union operator and will be replaced later.

Thanks in advance.

kypexin
  • 40
  • 4

1 Answers1

0

Start with the Union operator.

With this input

key        B        C
first      4        5
third      4        5

and this

key        A        B
first      3        4
second     3        4

Union will do this.

key        A        B       C
first      3        4       ?
second     3        4       ?
first      ?        4       5
third      ?        4       5

Then use Aggregate to get this.

key        sum(A)   sum(B)  sum(C)
first      3        8       5
second     3        4       0
third      0        4       5

The parameters needed for this are use default aggregation set to true, default aggregation function set to sum and group by attributes set to key.

A final step is to use Rename by Replacing to change the attribute names to be more user friendly. The parameters needed are replace what set to sum\((.*)\) and replace by set to $1.

Andrew Chisholm
  • 6,362
  • 2
  • 22
  • 41