Why are duplicate columns created after grouping on multiple columns in pydatatable?

Question

I have a pydatatable Frame:

import datatable as dt
from datatable import f, by, first

DT = dt.Frame(
    A=[1, 3, 2, 1, 4, 2, 1],
    B=['A', 'B', 'C', 'A', 'D', 'B', 'A'],
    C=['myamulla', 'skumar', 'cary', 'myamulla', 'api', 'skumar', 'myamulla'])

Out[7]: 
   |  A  B   C       
-- + --  --  --------
 0 |  1  A   myamulla
 1 |  3  B   skumar  
 2 |  2  C   cary    
 3 |  1  A   myamulla
 4 |  4  D   api     
 5 |  2  B   skumar  
 6 |  1  A   myamulla

[7 rows x 3 columns]

I'm trying to filter out the duplicate rows with:

DT[:, first(f[1:]), by([f[0],f[1],f[2]])]

It gives this output:

Out[10]: 
   |  A  B   C         B.0  C.0     
-- + --  --  --------  ---  --------
 0 |  1  A   myamulla  A    myamulla
 1 |  2  B   skumar    B    skumar  
 2 |  2  C   cary      C    cary    
 3 |  3  B   skumar    B    skumar  
 4 |  4  D   api       D    api     

[5 rows x 5 columns]

It has removed the duplicate rows, but why does it create duplicates of columns B and C, named B.0 and C.0?

I'd say it is a bug; you should raise it on the github page. Hopefully there will be a dedicated function for duplicate rows — sammywemmy, Feb 11 '21 at 08:28

score 0 · Answer 1 · answered Feb 11 '21 at 20:10

The by() function adds its argument columns to the output frame, so that one can clearly see the group column(s) and the corresponding result of the j computation. In your example, you are grouping by 3 columns, f[0], f[1] and f[2], and then, within each group, computing first(f[1]) and first(f[2]), which is what first(f[1:]) expands to. So this is exactly what the output shows: first the 3 "by" columns, and then the 2 "first()" columns. The latter two columns would normally be auto-named "B" and "C", but since the columns "A", "B", "C" already exist in the output, they get renamed to "B.0" and "C.0".
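
The naming behaviour described above can be modelled in a few lines of plain Python. This is only a sketch of the rule as stated in this answer, not datatable's actual implementation: `unique_name` and `output_names` are hypothetical helpers, and the suffix rule for any collision beyond the ".0" case shown in the question is an assumption.

```python
def unique_name(name, taken):
    """Return `name`, or `name.0`, `name.1`, ... if it is already taken."""
    if name not in taken:
        return name
    n = 0
    while f"{name}.{n}" in taken:
        n += 1
    return f"{name}.{n}"


def output_names(by_cols, j_cols):
    """Model of the result header: the by columns first, then the j columns."""
    result = []
    for col in list(by_cols) + list(j_cols):
        result.append(unique_name(col, result))
    return result


# Grouping by A, B, C and computing first() of B and C, as in the question.
print(output_names(["A", "B", "C"], ["B", "C"]))  # ['A', 'B', 'C', 'B.0', 'C.0']
```

With no by columns added to the output, the same model returns just "B" and "C", because nothing collides with them.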

Now, if you don't want the columns "A", "B", "C" to be auto-added to the output, you can pass add_columns=False to by():

>>> DT[:, first(f[1:]), by([f[0],f[1],f[2]], add_columns=False)]
   | B      C       
   | str32  str32   
-- + -----  --------
 0 | A      myamulla
 1 | B      skumar  
 2 | C      cary    
 3 | B      skumar  
 4 | D      api     
[5 rows x 2 columns]

That's true, but in the result dataset i can't see the column A, it only returns column B and C. — myamulla_ciencia, Feb 12 '21 at 03:12
Of course, because the expression asks for the first() of all columns starting from the second (remember that arrays are 0-based in python). So changing the j part into `first(f[:])` will produce all 3 columns as in the original. — Pasha, Feb 12 '21 at 20:10
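
The difference between the two slices can be illustrated without datatable. The following is a plain-Python model of "take the first row of each group, then keep the selected columns"; `first_per_group` is a hypothetical helper written for this illustration, and the sorted group order is chosen to match the output shown in the answer rather than taken from datatable's internals.

```python
rows = [
    (1, "A", "myamulla"), (3, "B", "skumar"), (2, "C", "cary"),
    (1, "A", "myamulla"), (4, "D", "api"), (2, "B", "skumar"),
    (1, "A", "myamulla"),
]
header = ("A", "B", "C")


def first_per_group(rows, cols):
    """Keep the first row of each group of identical rows, then apply `cols`.

    The whole row is used as the group key, and groups come back in
    sorted key order.
    """
    firsts = {}
    for row in rows:
        firsts.setdefault(row, row)  # remember only the first occurrence
    return [firsts[key][cols] for key in sorted(firsts)]


# Slice [1:] starts at the second column, so A is dropped.
print(header[1:], first_per_group(rows, slice(1, None)))
# Slice [:] keeps every column, so A, B and C all appear.
print(header[:], first_per_group(rows, slice(None)))
```

The first call yields two-element tuples for B and C, and the second yields the full three-element rows, which mirrors the change from `first(f[1:])` to `first(f[:])`.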
