
I have updated my question to provide a clearer example.

Is it possible to use the drop_duplicates method in Pandas to remove duplicate rows based on a column whose values include lists? Consider column 'three', where some rows hold a list of two items. Is there a way to drop the duplicate rows without iterating over them (which is my current workaround)?

I have outlined my problem by providing the following example:

import pandas as pd

data = [
{'one': 50, 'two': '5:00', 'three': 'february'}, 
{'one': 25, 'two': '6:00', 'three': ['february', 'january']},
{'one': 25, 'two': '6:00', 'three': ['february', 'january']},
{'one': 25, 'two': '6:00', 'three': ['february', 'january']},
{'one': 90, 'two': '9:00', 'three': 'january'}
]

df = pd.DataFrame(data)

print(df)

   one                three   two
0   50             february  5:00
1   25  [february, january]  6:00
2   25  [february, january]  6:00
3   25  [february, january]  6:00
4   90              january  9:00

df.drop_duplicates(['three'])

Results in the following error:

TypeError: type object argument after * must be a sequence, not map
archienorman
  • you want `df_two = df_one.drop_duplicates('ID')` or specifically `df_two = df_one.drop_duplicates(subset=['ID'])` – EdChum Jun 13 '16 at 14:58
  • afraid that has not resolved the issue. still seeing the same error – archienorman Jun 13 '16 at 15:07
  • so does `df_two = df_one.drop_duplicates()` work? – EdChum Jun 13 '16 at 15:08
  • unfortunately not, get the same error – archienorman Jun 13 '16 at 15:11
  • You'll have to post raw data and code that reproduces this error then as it seems that this is not the issue – EdChum Jun 13 '16 at 15:12
  • You had the relevant part of the error posted earlier - the rest of the error that you added to your post doesn't help us address your question. I think that @EdChum meant it would be more helpful if you posted the contents (i.e., raw data) of `df` and `df_one` and any code you used to create `df` as well. – Vladislav Martin Jun 13 '16 at 15:18
  • It's not really clear: is `ID` a column in your `DataFrame`s or is it a separate entity? If you'd like some general help with understanding how to use the `drop_duplicates` function on a `DataFrame`, maybe this [StackOverflow question](http://stackoverflow.com/q/13035764/5209610) would help you... – Vladislav Martin Jun 13 '16 at 15:38
  • 'ID' is the column on which to remove duplicate rows. I have updated the post to make this clearer. – archienorman Jun 13 '16 at 15:40

1 Answer


I think it's because lists aren't hashable, and the duplicate check needs hashable values to compare rows. As a workaround you could cast the lists to tuples, which are hashable, like so:

# Convert list values to tuples; leave plain strings untouched
df['four'] = df['three'].apply(lambda x: tuple(x) if isinstance(x, list) else x)
df.drop_duplicates('four')

   one                three   two                 four
0   50             february  5:00             february
1   25  [february, january]  6:00  (february, january)
4   90              january  9:00              january
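
If you'd rather not keep the helper column around, a variation on the same idea is to build the tuple key as a separate Series and filter with `duplicated()`. This is only a sketch under the same assumption as above (the cells in 'three' are either strings or lists of hashable items); the names `key` and `deduped` are just illustrative:

```python
import pandas as pd

data = [
    {'one': 50, 'two': '5:00', 'three': 'february'},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 25, 'two': '6:00', 'three': ['february', 'january']},
    {'one': 90, 'two': '9:00', 'three': 'january'},
]
df = pd.DataFrame(data)

# Build a hashable key per row without adding a column to df:
# lists become tuples, everything else is left as-is.
key = df['three'].apply(lambda x: tuple(x) if isinstance(x, list) else x)

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps only the first row for each key.
deduped = df[~key.duplicated()]
print(deduped)
```

The original lists in 'three' are left untouched here; only the temporary key is converted.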
Matthew