Questions tagged [drop-duplicates]

Questions related to removing (or dropping) unwanted duplicate values.

A duplicate is any re-occurrence of an item in a collection. This can be as simple as two identical strings in a list of strings, or as involved as multiple complex objects that compare as equal.

Use this tag for questions about detecting and removing such unwanted duplicates (for example with pandas DataFrame.drop_duplicates or Spark dropDuplicates).
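A minimal pandas sketch of the idea (the data and column names are made up):

```python
import pandas as pd

# The second "Alice" row is an exact duplicate of the first.
df = pd.DataFrame({"name": ["Alice", "Alice", "Bob"], "age": [5, 5, 7]})

print(df.drop_duplicates())             # keeps the first "Alice" row
print(df.drop_duplicates(keep=False))   # drops every member of the duplicated pair
```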


144 questions
1 vote, 2 answers

Python Pandas: Drop Consecutive Duplicate Rows Where a Period (.) at the End is the Only Differentiator

Hi, I have a section of my pandas dataframe that has duplicates, but the difference is minor: the only differentiator is a period at the end. Header A First First. I just want to drop the row that has a duplicate that does not have a…
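A hedged sketch of one way to read the (truncated) requirement, assuming the column is called Header and that "First" and "First." should be treated as the same value: strip the trailing period before checking for duplicates.

```python
import pandas as pd

# Hypothetical data; "Header" is an assumed column name based on the excerpt.
df = pd.DataFrame({"Header": ["First", "First.", "Second"]})

# Compare values with any trailing period stripped, then drop rows whose
# normalized value has already appeared. Which occurrence to keep depends on
# the truncated requirement; keep="first" is only one possible choice.
normalized = df["Header"].str.rstrip(".")
deduped = df[~normalized.duplicated(keep="first")]
print(deduped)
```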
1 vote, 0 answers

remove duplicates from only one column while keeping the rest of the row intact (pandas DataFrame)

I have 2 DataFrames, the first is for real estate (A) and the second for business types (B). They both have longitude and latitude. First I measured the distance between the coordinates to filter them by the closest ones (a radius of 2.5km). Now, I…
Albint3r
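If the goal is to blank out repeated values in a single column while leaving the rest of each row untouched, one possible sketch (the column names are assumptions, not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "business_id":    [10, 10, 11],      # assumed column
    "real_estate_id": [1, 2, 3],         # assumed column
    "distance_km":    [1.2, 2.0, 0.8],   # assumed column
})

# Replace every repeat of business_id with NaN; the other columns stay intact.
df["business_id"] = df["business_id"].mask(df["business_id"].duplicated(keep="first"))
print(df)
```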
1 vote, 1 answer

pandas select rows with condition in priority order

I'm new to pandas. So I have a dataframe that looks like this: id car date color 1 2 bmw 2021-05-21 black 2 3 bmw 2021-05-21 yellow 3 4 mercedes 2021-06-21 red 4 5 toyota 2021-11-01 pink 5 6 toyota 2021-09-06 …
Sara
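For "priority order" selection, a common pandas pattern is to encode the priority as an ordered categorical, sort by it, and keep the first row per key. A sketch under assumed column names and an assumed priority:

```python
import pandas as pd

df = pd.DataFrame({
    "car":   ["bmw", "bmw", "toyota", "toyota"],
    "color": ["black", "yellow", "pink", "black"],
})

# Assumed priority: black beats yellow, which beats pink.
priority = ["black", "yellow", "pink"]
df["color"] = pd.Categorical(df["color"], categories=priority, ordered=True)

# Sorting by the ordered categorical puts the highest-priority row first,
# so drop_duplicates("car") keeps exactly that row for each car.
best = df.sort_values("color").drop_duplicates("car", keep="first")
print(best)
```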
1 vote, 2 answers

Is there a way to modify this code to reduce run time?

So I am looking to modify this code to reduce the runtime of the fuzzywuzzy library. At present, it's taking about an hour for a dataset with 800 rows, and when I used this on a dataset with 4.5K rows, it kept running for almost 6 hours, still no result. I…
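A frequently suggested fix for fuzzywuzzy runtime problems is to switch to rapidfuzz, which exposes a largely compatible API and is considerably faster. A hedged sketch (the data and cutoff are made up):

```python
from rapidfuzz import fuzz, process

names = ["Acme Corp", "Acme Corporation", "Globex LLC"]

for name in names:
    # score_cutoff lets rapidfuzz abandon low-scoring comparisons early, which is
    # a large part of the speed-up. Matching a list against itself also returns
    # the self-match, which real deduplication code would exclude.
    match = process.extractOne(name, names, scorer=fuzz.token_sort_ratio, score_cutoff=85)
    print(name, "->", match)
```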
1 vote, 1 answer

Dask Dataframe: Remove duplicates by columns A, keeping the row with the highest value in column B

Basically this is answered for pandas in python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B. In pandas I adopted the solution df.sort_values('B', ascending=False).drop_duplicates('A').sort_index() but…
Michael E.
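A global sort followed by drop_duplicates can be expensive in Dask. One Dask-friendly alternative (a sketch, not the only option) is to compute the per-A maximum of B and join it back:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 1, 2], "B": [10, 30, 20], "C": ["x", "y", "z"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Per-A maximum of B, joined back so the whole winning row is kept.
# Note: if several rows tie on the maximum B for the same A, all of them survive.
max_b = (ddf.groupby("A").agg({"B": "max"})
            .rename(columns={"B": "B_max"})
            .reset_index())
result = ddf.merge(max_b, on="A")
result = result[result["B"] == result["B_max"]].drop(columns="B_max")
print(result.compute())
```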
1 vote, 1 answer

Pandas drop_duplicates with multiple conditions

I have some measurement data that needs to be filtered. I read it in as a dataframe, like this: df RequestTime RequestID ResponseTime ResponseID 0 150 14 103 101 1 150 15 …
Sun Jar
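The actual matching rule in the question is truncated, but when the duplicate criterion spans several columns, subset accepts a list. A minimal, purely illustrative sketch with the column names from the excerpt:

```python
import pandas as pd

df = pd.DataFrame({
    "RequestTime":  [150, 150, 160],
    "RequestID":    [14, 14, 15],
    "ResponseTime": [103, 104, 110],
    "ResponseID":   [101, 102, 103],
})

# Rows only count as duplicates when they agree on every column in `subset`.
filtered = df.drop_duplicates(subset=["RequestTime", "RequestID"], keep="first")
print(filtered)
```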
1 vote, 1 answer

counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears overall in the list, but how many times it appeared consecutively; I used…
Kshtj
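The usual pandas pattern for this kind of run-length counting is to start a new group whenever the value changes and then number the rows inside each run. A sketch with a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"value": ["a", "a", "b", "b", "b", "a"]})

# A new run starts whenever the value differs from the previous row.
run_id = (df["value"] != df["value"].shift()).cumsum()

# Position within the current run (1, 2, 3, ...) stored in a new column.
df["consecutive_count"] = df.groupby(run_id).cumcount() + 1
print(df)
```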
1 vote, 1 answer

remove duplicates while adding a column to a CSV file using Python

I have a CSV file that looks like this:

| innings | bowler  |
|---------|---------|
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | Z Khan …
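One reading of the (truncated) question is to add a per-(innings, bowler) count and then keep a single row per pair; the new column name "balls" is an assumption:

```python
import pandas as pd

# Rows reconstructed from the table in the excerpt.
df = pd.DataFrame({
    "innings": [1, 1, 1, 1, 1],
    "bowler":  ["P Kumar", "P Kumar", "P Kumar", "P Kumar", "Z Khan"],
})

# Count how many rows each (innings, bowler) pair has, then keep one row per pair.
df["balls"] = df.groupby(["innings", "bowler"])["bowler"].transform("size")
deduped = df.drop_duplicates(subset=["innings", "bowler"], keep="first")
print(deduped)
```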
1 vote, 2 answers

Keep last when using dropDuplicates?

I want to keep the last record, not the first. However, the keep="last" option does not seem to work. For example, on the following: from pyspark.sql import Row df = sc.parallelize([ \ Row(name='Alice', age=5, height=80), \ Row(name='Alice',…
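PySpark's dropDuplicates has no keep argument, so "last" has to be expressed through an explicit ordering. A common sketch uses a window function; ordering by age is only an assumption (any timestamp or sequence column works the same way):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 5, 80), ("Alice", 10, 80), ("Bob", 7, 90)],
    ["name", "age", "height"],
)

# Rank rows per name by descending age and keep only the top-ranked one.
w = Window.partitionBy("name").orderBy(F.col("age").desc())
last_per_name = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
last_per_name.show()
```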
1 vote, 1 answer

Drop duplicate rows in dataframe based on multplie columns with list values

I have a DataFrame with multiple columns, and a few columns contain list values. Considering only the columns with list values, duplicate rows have to be deleted. Current Dataframe: ID col1 col2 col3 col4 1 …
KPH
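Python lists are unhashable, so duplicated/drop_duplicates cannot operate on list columns directly; a common workaround is to build a hashable key from those columns first. A sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 3],
    "col1": [[1, 2], [1, 2], [3]],
    "col2": [["a"], ["a"], ["b"]],
})

# Turn each list cell into a tuple so the row key becomes hashable,
# then drop rows whose key has already been seen.
list_cols = ["col1", "col2"]
key = df[list_cols].apply(lambda row: tuple(tuple(cell) for cell in row), axis=1)
deduped = df[~key.duplicated(keep="first")]
print(deduped)
```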
1 vote, 1 answer

How can I drop duplicates in pandas without dropping NaN values

I have a dataframe which I query, and I want to get only unique values out of a certain column. I tried to do that by executing this code: database = pd.read_csv(db_file, sep='\t') query =…
Eliran Turgeman
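drop_duplicates treats NaN as equal to NaN, so repeated missing values collapse into one. If every NaN row should survive, one option is to deduplicate only the non-missing values; a sketch with an assumed column name:

```python
import pandas as pd

df = pd.DataFrame({"gene": ["BRCA1", "BRCA1", None, None, "TP53"]})  # assumed column

# Keep a row if its value is missing, or if it is the first occurrence of its value.
keep = df["gene"].isna() | ~df["gene"].duplicated(keep="first")
print(df[keep])
```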
1 vote, 1 answer

How to get dropped rows when using drop_duplicates (Pandas DataFrame)?

I use pandas.DataFrame.drop_duplicates() to drop duplicates of rows where all column values are identical; however, for data quality analysis, I need to produce a DataFrame with the dropped duplicate rows. How can I identify which are the rows to be…
Code Ninja 2C4U
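DataFrame.duplicated flags exactly the rows that drop_duplicates would remove for the same keep setting, so the dropped rows can be captured directly:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

kept = df.drop_duplicates(keep="first")
# The complement: rows removed because an identical row appeared earlier.
dropped = df[df.duplicated(keep="first")]
print(dropped)
```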
1 vote, 2 answers

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example if below are the micro batches that I get, then I want to…
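Streaming dropDuplicates always keeps the first row it sees per key. One hedged workaround, which only handles duplicates within each micro-batch rather than across the whole stream, is to pick the latest row per key inside foreachBatch with a window function; the column names, stream_df, and the sink path below are all assumptions:

```python
from pyspark.sql import Window, functions as F

def keep_latest(batch_df, batch_id):
    # Within this micro-batch, keep only the most recent row per key.
    w = Window.partitionBy("key").orderBy(F.col("event_time").desc())
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
                      .filter(F.col("rn") == 1)
                      .drop("rn"))
    latest.write.mode("append").parquet("/tmp/deduped")  # hypothetical sink

# stream_df is assumed to be the streaming DataFrame from the question.
query = stream_df.writeStream.foreachBatch(keep_latest).start()
```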
1 vote, 1 answer

How to reset the indices of remaining dataframe values after removing duplicate rows?

I have a dataframe and have removed duplicate rows using the .drop_duplicates() method. But the initial indices of the rows are still the same. data = data.drop_duplicates(keep=False, inplace=True) id Name Designation DOB 2 7934 …
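Two details matter here: drop_duplicates(..., inplace=True) returns None, so assigning it back wipes out the DataFrame, and the surviving rows keep their old labels until reset_index is called. A minimal sketch with made-up data:

```python
import pandas as pd

data = pd.DataFrame({"id": [7934, 7934, 8102], "Name": ["A", "A", "B"]})

# Don't combine assignment with inplace=True; chain reset_index(drop=True)
# to renumber the remaining rows 0, 1, 2, ...
data = data.drop_duplicates(keep=False).reset_index(drop=True)
print(data)
```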
1 vote, 2 answers

why does python not drop all duplicates?

This is my original data frame. I want to remove the duplicates for the columns 'head_x' and 'head_y' and the columns 'cost_x' and 'cost_y'. This is my code: df=df.astype(str) df.drop_duplicates(subset={'head_x','head_y'}, keep=False,…
KristelK
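keep=False removes every member of a duplicated group, and rows only count as duplicates when they agree on all columns named in subset; any column left out of subset is ignored by the comparison. A sketch with the column names from the excerpt and made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "head_x": ["1", "1", "2"],
    "head_y": ["5", "5", "6"],
    "cost_x": ["10", "10", "30"],
    "cost_y": ["7", "8", "9"],
})

# On head_x/head_y alone the first two rows are duplicates, so keep=False drops both.
print(df.drop_duplicates(subset=["head_x", "head_y"], keep=False))

# Adding cost_y to the subset makes those rows distinct again, so nothing is dropped.
print(df.drop_duplicates(subset=["head_x", "head_y", "cost_y"], keep=False))
```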