Questions tagged [drop-duplicates]

Questions related to removing (or dropping) unwanted duplicate values.

A duplicate is any re-occurrence of an item in a collection. This can be as simple as two identical strings in a list of strings, or multiple complex objects which are treated as the same object when compared to each other.

Use this tag for questions about detecting and removing such unwanted duplicates.

144 questions
2 votes · 2 answers

How to drop duplicates in one column based on values in 2 other columns in DataFrame in Python Pandas?

I have a DataFrame in Python Pandas like below (data types: ID - int, TYPE - object, TG_A - int, TG_B - int): ID TYPE TG_A TG_B 111 A 1 0 111 B 1 0 222 B 1 0 222 A 1 0 333 B 0 1 333 A 0 1 And I need to drop duplicates in the above…
dingaro
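The rule that decides which duplicate survives is cut off in the excerpt above. As a minimal sketch, assuming the goal is one row per ID and that sorting controls which duplicate is kept (the data is taken from the question; the preference for TYPE "A" is an assumption):

```python
import pandas as pd

# Data from the question above.
df = pd.DataFrame({
    "ID":   [111, 111, 222, 222, 333, 333],
    "TYPE": ["A", "B", "B", "A", "B", "A"],
    "TG_A": [1, 1, 1, 1, 0, 0],
    "TG_B": [0, 0, 0, 0, 1, 1],
})

# Sorting first controls which duplicate survives; here we (arbitrarily)
# prefer TYPE "A" within each ID before dropping the later duplicates.
out = (df.sort_values(["ID", "TYPE"])
         .drop_duplicates(subset=["ID"], keep="first"))
print(out)
```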
2 votes · 2 answers

Pandas multiindex duplicated only for particular indices

Say I have a Pandas dataframe with multiple indices: arrays = [["UK", "UK", "US", "FR"], ["Firm1", "Firm1", "Firm2", "Firm1"], ["Andy", "Peter", "Peter", "Andy"]] idx = pd.MultiIndex.from_arrays(arrays, names = ("Country", "Firm", "Responsible")) df…
W. Walter
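A minimal sketch of one way to flag duplicates using only some index levels, built on the MultiIndex from the question (the `value` column and the choice of Country/Firm as the relevant levels are assumptions):

```python
import pandas as pd

arrays = [["UK", "UK", "US", "FR"],
          ["Firm1", "Firm1", "Firm2", "Firm1"],
          ["Andy", "Peter", "Peter", "Andy"]]
idx = pd.MultiIndex.from_arrays(arrays, names=("Country", "Firm", "Responsible"))
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)  # placeholder data

# Flag duplicates considering only the Country and Firm levels of the index.
dup_mask = idx.to_frame(index=False).duplicated(subset=["Country", "Firm"])
print(df[~dup_mask.to_numpy()])  # keeps the first row of each (Country, Firm) pair
```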
2 votes · 2 answers

Summing values of (dropped) duplicate rows Pandas DataFrame

For a time series analysis, I have to drop instances that occur on the same date, but keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below is a short example of part of my dataset. z =…
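A minimal sketch of the usual pattern for this: instead of dropping the rows that share a date, aggregate them so the surviving row carries the summed information (the column names are placeholders, not the asker's):

```python
import pandas as pd

z = pd.DataFrame({
    "date":  ["2021-01-01", "2021-01-01", "2021-01-02"],  # duplicate dates
    "value": [10, 5, 7],
})

# One row per date; the values of the "dropped" duplicates are summed
# into the row that remains.
out = z.groupby("date", as_index=False)["value"].sum()
print(out)
```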
2 votes · 2 answers

Remove duplicates from all columns of a data frame with the condition value > 0

I need to remove duplicates from all of the columns. My data: id country publisher weak A B C 123 US X 1 6.77 0 0 123 US X 1 0 1.23 88.7 456 BZ Y …
Lili
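A hedged sketch of one interpretation: the duplicated (id, country, publisher) rows are collapsed into one row that keeps the positive value from whichever duplicate carries it (treating zeros as placeholders is an assumption, as are the exact key columns and the values in the last row):

```python
import pandas as pd

df = pd.DataFrame({
    "id":        [123, 123, 456],
    "country":   ["US", "US", "BZ"],
    "publisher": ["X", "X", "Y"],
    "weak":      [1, 1, 0],
    "A": [6.77, 0.0, 2.0],   # last row's A/B/C values are made up
    "B": [0.0, 1.23, 0.0],
    "C": [0.0, 88.7, 5.0],
})

# Collapse duplicate keys; max() keeps the positive value wherever the
# other duplicate holds a zero placeholder.
out = df.groupby(["id", "country", "publisher", "weak"], as_index=False).max()
print(out)
```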
2 votes · 2 answers

Drop almost-duplicate rows based on timestamp

I'm trying to remove some near-duplicate data. I'm looking for a way to detect the closest (edited_at) trips made by the user without losing information. So I want to solve this problem by calculating the difference between successive timestamps…
Adil Blanco
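A minimal sketch of the successive-timestamp idea described above: compute the gap to the previous row per user and drop rows that fall inside a threshold (the column names and the one-minute threshold are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":   [1, 1, 1, 2],
    "edited_at": pd.to_datetime(["2020-01-01 10:00:00",
                                 "2020-01-01 10:00:30",
                                 "2020-01-01 12:00:00",
                                 "2020-01-01 10:00:00"]),
})

df = df.sort_values(["user_id", "edited_at"])
# Gap to the previous trip of the same user; rows closer than the threshold
# are treated as near-duplicates of the preceding trip and dropped.
gap = df.groupby("user_id")["edited_at"].diff()
out = df[gap.isna() | (gap > pd.Timedelta(minutes=1))]
print(out)
```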
2 votes · 2 answers

Python Dataframe: Dropping duplicates based on certain conditions

Dataframe with duplicate Shop IDs, where some Shop IDs occurred twice and some occurred thrice: I only want to keep unique Shop IDs based on the shortest Shop Distance assigned to its Area. Area Shop Name Shop Distance Shop ID 0 AAA Ly …
est
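A minimal sketch of the usual sort-then-drop pattern for "keep the row with the smallest value per key" (the sample data is made up to match the description):

```python
import pandas as pd

df = pd.DataFrame({
    "Area":          ["AAA", "AAA", "BBB", "BBB"],
    "Shop Name":     ["Ly", "Ly", "Mo", "Mo"],
    "Shop Distance": [5, 2, 7, 3],
    "Shop ID":       [101, 101, 202, 202],
})

# Sort so the shortest distance comes first, then keep the first row per Shop ID.
out = (df.sort_values("Shop Distance")
         .drop_duplicates(subset=["Shop ID"], keep="first")
         .sort_index())
print(out)
```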
2 votes · 2 answers

How to drop duplicate rows based on values of two columns?

I have a data frame like this: Category Date_1 Score_1 Date_2 Score_2 A 13/11/2019 5 13/11/2019 10 A 13/11/2019 5 14/11/2019 55 A 13/11/2019 5 15/11/2019 45 …
Sara
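A minimal sketch, assuming the two defining columns are Category and Date_1 (the excerpt is truncated, so that pairing is a guess); the data is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "A"],
    "Date_1":   ["13/11/2019", "13/11/2019", "13/11/2019"],
    "Score_1":  [5, 5, 5],
    "Date_2":   ["13/11/2019", "14/11/2019", "15/11/2019"],
    "Score_2":  [10, 55, 45],
})

# Rows are duplicates when they share Category and Date_1; keep the first one.
out = df.drop_duplicates(subset=["Category", "Date_1"], keep="first")
print(out)
```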
2 votes · 0 answers

Pandas drop_duplicates -> Fatal Python error: deallocating None

I have code that checks an Excel sheet and, if it finds some changes, takes a snapshot (Pandas DataFrame) of the whole sheet and saves it to a CSV with a timestamp. It has been running all day doing its job correctly, but usually once or…
user10466538
2 votes · 1 answer

Drop all group rows when a condition is met?

I have a pandas data frame with a two-level grouping based on 'col10' and 'col1'. All I want to do is drop all of a group's rows if a specified value in another column is repeated, or if this value does not exist in the group (keep the group in which the specified value…
Sidhom
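A hedged sketch using groupby().filter(): keep only the groups in which a specified value occurs exactly once, and drop groups where it repeats or never appears (the column holding the value and the sample data are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "col10": [1, 1, 1, 2, 2],
    "col1":  ["a", "a", "a", "b", "b"],
    "col2":  ["x", "y", "x", "x", "y"],  # column holding the specified value
})

target = "x"
# Keep a group only if the target value appears exactly once in it;
# groups where it repeats, or is absent, are dropped entirely.
out = df.groupby(["col10", "col1"]).filter(lambda g: (g["col2"] == target).sum() == 1)
print(out)
```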
2 votes · 1 answer

pandas inclusive unique values from two columns

I can't find any elegant way to select unique rows from column A and column B, but not jointly and not in a sequence. This is in order to keep an "inclusive" intersection of unique values from these two columns. My aim is to keep as many unique values…
Alex
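One possible reading of "inclusive", sketched very tentatively: keep a row if it contributes a not-yet-seen value in column A or in column B, rather than requiring the (A, B) pair to be new (the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1],
                   "B": ["x", "y", "y", "x"]})

# A row survives if its A value or its B value has not been seen before;
# only rows duplicated in both columns are dropped.
keep = ~df.duplicated(subset=["A"]) | ~df.duplicated(subset=["B"])
print(df[keep])
```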
2 votes · 1 answer

Custom logic for dropping duplicates

I have the following dataset that I'm hoping to apply some custom logic to: data = pd.DataFrame({'ID': ['A','B','B','C','C','D','D'], 'Date':…
Dfeld
1 vote · 0 answers

In PySpark, how do I avoid an error when using exceptAll after a dropDuplicates (with subset)?

I am working on a sequence of transformations in PySpark (version 3.3.1). At a certain point I have a dropDuplicates(subset=[X]) followed by an exceptAll, and I get an error. Here is a reproducible pipeline: from pyspark.sql import SparkSession spark…
PMHM
1 vote · 1 answer

Python Pandas groupby agg or transform on max value_counts to drop duplicate rows

I have this df and want to drop duplicates based on the max value counts of 'rating' (it's a binary field). None of my attempts combining drop_duplicates with groupby, max, and count is fetching the desired output. Any suggestion is highly appreciated. df =…
1 vote · 3 answers

Dropping duplicate rows in a Pandas DataFrame based on multiple column values

In a dataframe I need to drop/filter out duplicate rows based on the combined columns A and B. In the example DataFrame A B C D 0 1 1 3 9 1 1 2 4 8 2 1 3 5 7 3 1 3 4 6 4 1 4 5 5 5 1 4 6 4 rows…
leofer
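A minimal sketch using the data from the question: the `keep=` argument controls whether one representative of each duplicated (A, B) pair survives or whether every duplicated pair is removed outright:

```python
import pandas as pd

# Data from the question above.
df = pd.DataFrame({"A": [1, 1, 1, 1, 1, 1],
                   "B": [1, 2, 3, 3, 4, 4],
                   "C": [3, 4, 5, 4, 5, 6],
                   "D": [9, 8, 7, 6, 5, 4]})

first_only = df.drop_duplicates(subset=["A", "B"], keep="first")  # keep one of each pair
no_dups    = df.drop_duplicates(subset=["A", "B"], keep=False)    # drop every duplicated pair
print(first_only)
print(no_dups)
```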
1 vote · 3 answers

Remove duplicate values across columns in pandas dataframe, without removing entire row

I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row. Dataframe: A B C 0 foo g A 1 foo g G 2 yes y B 3 bar y B Desired result: A B C 0 foo g …
btroppo
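A minimal sketch using the data from the question: blank out the values in the subset columns on rows whose (A, B) pair has already appeared, while keeping the rest of the row (blanking with an empty string rather than NaN is an assumption):

```python
import pandas as pd

# Data from the question above.
df = pd.DataFrame({"A": ["foo", "foo", "yes", "bar"],
                   "B": ["g", "g", "y", "y"],
                   "C": ["A", "G", "B", "B"]})

# Rows whose (A, B) combination has been seen before get those two cells
# cleared, but the row itself (and column C) stays in place.
dup = df.duplicated(subset=["A", "B"])
df.loc[dup, ["A", "B"]] = ""
print(df)
```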