Questions tagged [drop-duplicates]

Questions related to removing (or dropping) unwanted duplicate values.

A duplicate is any re-occurrence of an item in a collection. This can be as simple as two identical strings in a list of strings, or as involved as multiple complex objects that compare as equal.

Use this tag for questions about detecting and removing such unwanted duplicates (for example with pandas DataFrame.drop_duplicates or Spark dropDuplicates).
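A minimal pandas sketch of the idea (the data and column names are made up):

```python
import pandas as pd

# The second "Alice" row is an exact duplicate of the first.
df = pd.DataFrame({"name": ["Alice", "Alice", "Bob"], "age": [5, 5, 7]})

print(df.drop_duplicates())             # keeps the first "Alice" row
print(df.drop_duplicates(keep=False))   # drops every member of the duplicated pair
```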


144 questions
1 vote, 2 answers

Python Pandas: Drop Consecutive Duplicate Rows Where a Period (.) at the End is the Only Differentiator

Hi, I have a section of my pandas dataframe that has duplicates, but the difference is minor: the only differentiator is a period at the end. Header A First First. I just want to drop the row that has a duplicate that does not have a…
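A hedged sketch of one way to read the (truncated) requirement, assuming the column is called Header and that "First" and "First." should be treated as the same value: strip the trailing period before checking for duplicates.

```python
import pandas as pd

# Hypothetical data; "Header" is an assumed column name based on the excerpt.
df = pd.DataFrame({"Header": ["First", "First.", "Second"]})

# Compare values with any trailing period stripped, then drop rows whose
# normalized value has already appeared. Which occurrence to keep depends on
# the truncated requirement; keep="first" is only one possible choice.
normalized = df["Header"].str.rstrip(".")
deduped = df[~normalized.duplicated(keep="first")]
print(deduped)
```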
1 vote, 0 answers

remove duplicates from only one column while keeping the rest of the row intact (pandas DataFrame)

I have 2 DataFrames, the first is for real estate (A) and the second for business types (B). They both have longitude and latitude. First I measured the distance between the coordinates to filter them by the closest ones (a radius of 2.5km). Now, I…
Albint3r
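If the goal is to blank out repeated values in a single column while leaving the rest of each row untouched, one possible sketch (the column names are assumptions, not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "business_id":    [10, 10, 11],      # assumed column
    "real_estate_id": [1, 2, 3],         # assumed column
    "distance_km":    [1.2, 2.0, 0.8],   # assumed column
})

# Replace every repeat of business_id with NaN; the other columns stay intact.
df["business_id"] = df["business_id"].mask(df["business_id"].duplicated(keep="first"))
print(df)
```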
1 vote, 1 answer

pandas select rows with condition in priority order

I'm new to pandas. So I have a dataframe that looks like this: id car date color 1 2 bmw 2021-05-21 black 2 3 bmw 2021-05-21 yellow 3 4 mercedes 2021-06-21 red 4 5 toyota 2021-11-01 pink 5 6 toyota 2021-09-06 …
Sara
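For "priority order" selection, a common pandas pattern is to encode the priority as an ordered categorical, sort by it, and keep the first row per key. A sketch under assumed column names and an assumed priority:

```python
import pandas as pd

df = pd.DataFrame({
    "car":   ["bmw", "bmw", "toyota", "toyota"],
    "color": ["black", "yellow", "pink", "black"],
})

# Assumed priority: black beats yellow, which beats pink.
priority = ["black", "yellow", "pink"]
df["color"] = pd.Categorical(df["color"], categories=priority, ordered=True)

# Sorting by the ordered categorical puts the highest-priority row first,
# so drop_duplicates("car") keeps exactly that row for each car.
best = df.sort_values("color").drop_duplicates("car", keep="first")
print(best)
```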
1 vote, 2 answers

Is there a way to modify this code to reduce run time?

So I am looking to modify this code to reduce the runtime of the fuzzywuzzy library. At present, it's taking about an hour for a dataset with 800 rows, and when I used this on a dataset with 4.5K rows, it kept running for almost 6 hours, still no result. I…
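A frequently suggested fix for fuzzywuzzy runtime problems is to switch to rapidfuzz, which exposes a largely compatible API and is considerably faster. A hedged sketch (the data and cutoff are made up):

```python
from rapidfuzz import fuzz, process

names = ["Acme Corp", "Acme Corporation", "Globex LLC"]

for name in names:
    # score_cutoff lets rapidfuzz abandon low-scoring comparisons early, which is
    # a large part of the speed-up. Matching a list against itself also returns
    # the self-match, which real deduplication code would exclude.
    match = process.extractOne(name, names, scorer=fuzz.token_sort_ratio, score_cutoff=85)
    print(name, "->", match)
```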
1 vote, 1 answer

Dask Dataframe: Remove duplicates by columns A, keeping the row with the highest value in column B

Basically this is answered for pandas in python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B. In pandas I adopted the solution df.sort_values('B', ascending=False).drop_duplicates('A').sort_index() but…
Michael E.
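A global sort followed by drop_duplicates can be expensive in Dask. One Dask-friendly alternative (a sketch, not the only option) is to compute the per-A maximum of B and join it back:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 1, 2], "B": [10, 30, 20], "C": ["x", "y", "z"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Per-A maximum of B, joined back so the whole winning row is kept.
# Note: if several rows tie on the maximum B for the same A, all of them survive.
max_b = (ddf.groupby("A").agg({"B": "max"})
            .rename(columns={"B": "B_max"})
            .reset_index())
result = ddf.merge(max_b, on="A")
result = result[result["B"] == result["B_max"]].drop(columns="B_max")
print(result.compute())
```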
1 vote, 1 answer

Pandas drop_duplicates with multiple conditions

I have some measurement data that needs to be filtered. I read it in as a dataframe, like this: df RequestTime RequestID ResponseTime ResponseID 0 150 14 103 101 1 150 15 …
Sun Jar
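The actual matching rule in the question is truncated, but when the duplicate criterion spans several columns, subset accepts a list. A minimal, purely illustrative sketch with the column names from the excerpt:

```python
import pandas as pd

df = pd.DataFrame({
    "RequestTime":  [150, 150, 160],
    "RequestID":    [14, 14, 15],
    "ResponseTime": [103, 104, 110],
    "ResponseID":   [101, 102, 103],
})

# Rows only count as duplicates when they agree on every column in `subset`.
filtered = df.drop_duplicates(subset=["RequestTime", "RequestID"], keep="first")
print(filtered)
```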
1 vote, 1 answer

counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears overall in the list, but how many times it appeared consecutively; I used…
Kshtj
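The usual pandas pattern for this kind of run-length counting is to start a new group whenever the value changes and then number the rows inside each run. A sketch with a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"value": ["a", "a", "b", "b", "b", "a"]})

# A new run starts whenever the value differs from the previous row.
run_id = (df["value"] != df["value"].shift()).cumsum()

# Position within the current run (1, 2, 3, ...) stored in a new column.
df["consecutive_count"] = df.groupby(run_id).cumcount() + 1
print(df)
```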
1 vote, 1 answer

remove duplicates while adding a column to a CSV file using Python

I have a CSV file that looks like this:

| innings | bowler  |
|---------|---------|
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | P Kumar |
| 1       | Z Khan …
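One reading of the (truncated) question is to add a per-(innings, bowler) count and then keep a single row per pair; the new column name "balls" is an assumption:

```python
import pandas as pd

# Rows reconstructed from the table in the excerpt.
df = pd.DataFrame({
    "innings": [1, 1, 1, 1, 1],
    "bowler":  ["P Kumar", "P Kumar", "P Kumar", "P Kumar", "Z Khan"],
})

# Count how many rows each (innings, bowler) pair has, then keep one row per pair.
df["balls"] = df.groupby(["innings", "bowler"])["bowler"].transform("size")
deduped = df.drop_duplicates(subset=["innings", "bowler"], keep="first")
print(deduped)
```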
1 vote, 2 answers

Keep last when using dropDuplicates?

I want to keep the last record, not the first. However, the keep="last" option does not seem to work. For example, on the following: from pyspark.sql import Row df = sc.parallelize([ \ Row(name='Alice', age=5, height=80), \ Row(name='Alice',…
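PySpark's dropDuplicates has no keep argument, so "last" has to be expressed through an explicit ordering. A common sketch uses a window function; ordering by age is only an assumption (any timestamp or sequence column works the same way):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 5, 80), ("Alice", 10, 80), ("Bob", 7, 90)],
    ["name", "age", "height"],
)

# Rank rows per name by descending age and keep only the top-ranked one.
w = Window.partitionBy("name").orderBy(F.col("age").desc())
last_per_name = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
last_per_name.show()
```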
1 vote, 1 answer

Drop duplicate rows in dataframe based on multplie columns with list values

I have a DataFrame with multiple columns, and a few columns contain list values. Considering only the columns with list values, duplicate rows have to be deleted. Current Dataframe: ID col1 col2 col3 col4 1 …
KPH
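Python lists are unhashable, so duplicated/drop_duplicates cannot operate on list columns directly; a common workaround is to build a hashable key from those columns first. A sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 3],
    "col1": [[1, 2], [1, 2], [3]],
    "col2": [["a"], ["a"], ["b"]],
})

# Turn each list cell into a tuple so the row key becomes hashable,
# then drop rows whose key has already been seen.
list_cols = ["col1", "col2"]
key = df[list_cols].apply(lambda row: tuple(tuple(cell) for cell in row), axis=1)
deduped = df[~key.duplicated(keep="first")]
print(deduped)
```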
1 vote, 1 answer

How can I drop duplicates in pandas without dropping NaN values

I have a dataframe which I query, and I want to get only unique values out of a certain column. I tried to do that by executing this code: database = pd.read_csv(db_file, sep='\t') query =…
Eliran Turgeman
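drop_duplicates treats NaN as equal to NaN, so repeated missing values collapse into one. If every NaN row should survive, one option is to deduplicate only the non-missing values; a sketch with an assumed column name:

```python
import pandas as pd

df = pd.DataFrame({"gene": ["BRCA1", "BRCA1", None, None, "TP53"]})  # assumed column

# Keep a row if its value is missing, or if it is the first occurrence of its value.
keep = df["gene"].isna() | ~df["gene"].duplicated(keep="first")
print(df[keep])
```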
1 vote, 1 answer

How to get dropped rows when using drop_duplicates (Pandas DataFrame)?

I use pandas.DataFrame.drop_duplicates() to drop duplicates of rows where all column values are identical; however, for data quality analysis, I need to produce a DataFrame with the dropped duplicate rows. How can I identify which are the rows to be…
Code Ninja 2C4U
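DataFrame.duplicated flags exactly the rows that drop_duplicates would remove for the same keep setting, so the dropped rows can be captured directly:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})

kept = df.drop_duplicates(keep="first")
# The complement: rows removed because an identical row appeared earlier.
dropped = df[df.duplicated(keep="first")]
print(dropped)
```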
1 vote, 2 answers

How to deduplicate and keep latest based on timestamp field in spark structured streaming?

Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example if below are the micro batches that I get, then I want to…
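Streaming dropDuplicates always keeps the first row it sees per key. One hedged workaround, which only handles duplicates within each micro-batch rather than across the whole stream, is to pick the latest row per key inside foreachBatch with a window function; the column names, stream_df, and the sink path below are all assumptions:

```python
from pyspark.sql import Window, functions as F

def keep_latest(batch_df, batch_id):
    # Within this micro-batch, keep only the most recent row per key.
    w = Window.partitionBy("key").orderBy(F.col("event_time").desc())
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
                      .filter(F.col("rn") == 1)
                      .drop("rn"))
    latest.write.mode("append").parquet("/tmp/deduped")  # hypothetical sink

# stream_df is assumed to be the streaming DataFrame from the question.
query = stream_df.writeStream.foreachBatch(keep_latest).start()
```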
1 vote, 1 answer

How to reset the indices of remaining dataframe values after removing duplicate rows?

I have a dataframe and have removed duplicate rows using the .drop_duplicates() method. But the initial indices of the rows are still the same. data = data.drop_duplicates(keep=False, inplace=True) id Name Designation DOB 2 7934 …
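Two details matter here: drop_duplicates(..., inplace=True) returns None, so assigning it back wipes out the DataFrame, and the surviving rows keep their old labels until reset_index is called. A minimal sketch with made-up data:

```python
import pandas as pd

data = pd.DataFrame({"id": [7934, 7934, 8102], "Name": ["A", "A", "B"]})

# Don't combine assignment with inplace=True; chain reset_index(drop=True)
# to renumber the remaining rows 0, 1, 2, ...
data = data.drop_duplicates(keep=False).reset_index(drop=True)
print(data)
```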
1 vote, 2 answers

why does python not drop all duplicates?

This is my original data frame. I want to remove the duplicates for the columns 'head_x' and 'head_y' and the columns 'cost_x' and 'cost_y'. This is my code: df=df.astype(str) df.drop_duplicates(subset={'head_x','head_y'}, keep=False,…
KristelK
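keep=False removes every member of a duplicated group, and rows only count as duplicates when they agree on all columns named in subset; any column left out of subset is ignored by the comparison. A sketch with the column names from the excerpt and made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "head_x": ["1", "1", "2"],
    "head_y": ["5", "5", "6"],
    "cost_x": ["10", "10", "30"],
    "cost_y": ["7", "8", "9"],
})

# On head_x/head_y alone the first two rows are duplicates, so keep=False drops both.
print(df.drop_duplicates(subset=["head_x", "head_y"], keep=False))

# Adding cost_y to the subset makes those rows distinct again, so nothing is dropped.
print(df.drop_duplicates(subset=["head_x", "head_y", "cost_y"], keep=False))
```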