0

I have to get the number of times that a complete line appears repeated in my data frame, to then only show those lines that appear repetitions and show in the last column how many times those lines appear repetitions.

Input values for creating output correct table:

dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good

I just have to keep those rows that are totally equal.

This is the table result:

      dur       wage1   wage2   wage3   cola    hours  pension  stby_pay    shift_diff  num_reps
 6    3.0        2.0    3.0     NaN      tcf      NaN   empl_contr  NaN      NaN           4
 8    1.0        2.8    NaN     NaN      none    38.0   empl_contr  2.0      3.0           2
 9    1.0        5.7    NaN     NaN      none    40.0   empl_contr  NaN      4.0           3
 43   2.0        2.5    3.0     NaN      NaN     40.0    none       NaN      NaN           2

As you can see on this table, we keep for example line with index 6 because on line 6 and 17 from the input table to read, both lines are the same.

With my current code:

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])

    x = data[data.duplicated(keep=False)].drop_duplicates()

    return x

I get the result correctly, however I do not know how to count the repeated rows and then add it in the column 'nums_rep' at the end of the table.

This is my result, without the last column that counts the number of repeated rows:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

How can I perform a correct count, based on the equality of all the data in the column, then add it and show it in the column 'num_reps'?

fiticida
  • 664
  • 1
  • 10
  • 24
  • Possible duplicate of [How to count duplicate rows in pandas dataframe?](https://stackoverflow.com/questions/35584085/how-to-count-duplicate-rows-in-pandas-dataframe). Or maybe [Pandas DataFrame count duplicate rows and fill in column](https://stackoverflow.com/q/43015345/2823755) – wwii Nov 16 '17 at 20:34
  • Can you provide a sample of the input table showing what rows are being dropped and why? – Paula Livingstone Nov 16 '17 at 21:13
  • Ofc, I update the post, but as I said it's a csv, where I have a lot of differentes rows, I just have to keep those rows that are totally equalin – fiticida Nov 16 '17 at 21:16

0 Answers0