I have a pydatatable Frame:

from datatable import dt, f, by, first

DT = dt.Frame(
     A=[1, 3, 2, 1, 4, 2, 1],
     B=['A','B','C','A','D','B','A'],
     C=['myamulla','skumar','cary','myamulla','api','skumar','myamulla'])
Out[7]: 
   |  A  B   C       
-- + --  --  --------
 0 |  1  A   myamulla
 1 |  3  B   skumar  
 2 |  2  C   cary    
 3 |  1  A   myamulla
 4 |  4  D   api     
 5 |  2  B   skumar  
 6 |  1  A   myamulla

[7 rows x 3 columns]

I'm trying to filter out the duplicate rows with:

DT[:, first(f[1:]), by([f[0],f[1],f[2]])]

It gives the following output:

Out[10]: 
   |  A  B   C         B.0  C.0     
-- + --  --  --------  ---  --------
 0 |  1  A   myamulla  A    myamulla
 1 |  2  B   skumar    B    skumar  
 2 |  2  C   cary      C    cary    
 3 |  3  B   skumar    B    skumar  
 4 |  4  D   api       D    api     

[5 rows x 5 columns]

It has removed the duplicate rows, but why is it creating duplicate columns for B and C, named B.0 and C.0?

myamulla_ciencia
  • I'd say it is a bug; you should raise it on the github page. Hopefully there will be a dedicated function for duplicate rows – sammywemmy Feb 11 '21 at 08:28

1 Answer

The by() function adds its argument columns to the output frame, so that one can clearly see the group column(s) next to the corresponding result of the j computation. In your example, you are grouping by 3 columns, f[0], f[1] and f[2], and then within each group computing first(f[1]) and first(f[2]). This is exactly what the output shows: first the 3 "by" columns, and then the 2 "first()" columns. The latter two columns would normally be auto-named "B" and "C", but because columns "A", "B", "C" already exist in the output, they get renamed to "B.0" and "C.0".

Now, if you don't want the columns "A", "B", "C" to be auto-added to the output, you can use by()'s add_columns=False parameter:

>>> DT[:, first(f[1:]), by([f[0],f[1],f[2]], add_columns=False)]
   | B      C       
   | str32  str32   
-- + -----  --------
 0 | A      myamulla
 1 | B      skumar  
 2 | C      cary    
 3 | B      skumar  
 4 | D      api     
[5 rows x 2 columns]
Pasha
  • That's true, but in the resulting frame I can't see column A; it only returns columns B and C. – myamulla_ciencia Feb 12 '21 at 03:12
  • Of course, because the expression asks for the first() of all columns starting from the second (remember that arrays are 0-based in python). So changing the j part into `first(f[:])` will produce all 3 columns as in the original. – Pasha Feb 12 '21 at 20:10
  • Yeah, that's correct, and thanks for that – myamulla_ciencia Feb 13 '21 at 01:32