Questions tagged [drop-duplicates]

Questions related to removing (or dropping) unwanted duplicate values.

A duplicate is any re-occurrence of an item in a collection. This can be as simple as two identical strings in a list of strings, or multiple complex objects which are treated as the same object when compared to each other.

This tag may pertain to questions about removing unwanted duplicates.
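As a rough illustration of the two cases described above (identical strings, and complex objects treated as the same when compared), here is a minimal Python sketch. The `User` record, the `drop_duplicates_by` helper, and the choice of `user_id` as the comparison key are all hypothetical names made up for this example; they do not come from any particular library.

```python
from dataclasses import dataclass

# Case 1: identical strings in a list.
# dict.fromkeys keeps the first occurrence of each value and preserves order.
names = ["ann", "bob", "ann", "cy", "bob"]
unique_names = list(dict.fromkeys(names))


# Case 2: complex objects treated as the same when compared.
# Hypothetical record type: two users are assumed to be duplicates
# when their user_id matches, even if other fields differ.
@dataclass
class User:
    user_id: int
    name: str


def drop_duplicates_by(items, key):
    """Return items with later duplicates (by key) removed, keeping order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


users = [User(1, "Ann"), User(2, "Bob"), User(1, "Ann B.")]
unique_users = drop_duplicates_by(users, key=lambda u: u.user_id)
```

Keeping the first occurrence is only one possible policy; keeping the last occurrence, or dropping every item that has a duplicate, are equally valid depending on the question being asked.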

144 questions
0
votes
1 answer

Applying PySpark dropDuplicates method messes up the sorting of the data frame

I'm not sure why this is the behaviour, but when I apply dropDuplicates to a sorted data frame, the sorting order is disrupted. See the following two tables in comparison. The following table is the output of sorted_df.show(), in which the sorting…
0
votes
1 answer

python pandas: duplicated rows using sort_values and drop_duplicates

I have this dataframe, where the column stage has 4 values: I have duplicate rows in this dataframe and want to drop them, for example: I want to keep row #8015 and not have 2 rows with the same stage and the same tweet_id, for example: I…
Asma
  • 137
  • 3
  • 15
0
votes
1 answer

AttributeError: module 'pandas' has no attribute 'drop_duplicates'

The error: "AttributeError: module 'pandas' has no attribute 'drop_duplicates'" This is a new error on a section of code that has been working fine, the code in question: def The_function(): file = os.getcwd() + "the_file.xls" xls =…
11l
  • 73
  • 1
  • 9
0
votes
0 answers

Concatenate two dataframes and then drop all duplicates- not working properly?

I want to create a dataframe which has entries from df dataframe which don't exist in any of the other dataframes (dfA, dfB, dfC, dfD). Basically, entries from dfA, dfB, dfC, dfD are also contained in df, i.e. df is the superset of them and n(df) =…
0
votes
1 answer

pandas DataFrame select specific data

I want to build a for loop to select only row 5, row 10 and row 14 in pandas. The actual file includes thousands of rows in a similar format. Please teach me a function that can go over the entire file. Many Thanks…
Yumeng Xu
  • 179
  • 1
  • 2
  • 11
0
votes
2 answers

Pandas one-to-one row merge, maintaining the structure on the left hand side?

A similar question to an unresolved SO question (Can one perform a left join in pandas that selects only the first match on the right?), but slightly more complex and with no obvious workaround. I am hoping that there may be some fresh functionality…
0
votes
1 answer

I am trying to remove duplicate consecutive elements and keep the last value in a data frame using pandas

There are two columns in the data frame, and I am trying to remove the consecutive elements from column "a" and the corresponding elements from column "b" while keeping only the last element. import pandas as…
Kshtj
  • 91
  • 3
  • 11
0
votes
0 answers

Python-Deleting duplicate rows Pandas (Specifically)

Here is the data set I'm working on, which looks like this. Basically, I want to delete duplicate rows specifically. I know the drop_duplicate command, but I need some help. Let me show you by sorting the data so that it'll give you a clear…
kirti purohit
  • 401
  • 1
  • 4
  • 18
0
votes
1 answer

Is there a quick way to subset columns in PANDAS?

I am trying to set up a PANDAS project that I can use to compare and return the differences in Excel and CSV files over time. Currently I load the Excel/CSV files into pandas and assign them a "Version" column. I assign them a "Version" column because…
TBergy
  • 3
  • 1
0
votes
0 answers

KeyError: Float64Index when running drop_duplicates

I have a dataframe with duplicates in the t column; however, when I run the drop_duplicates function, the following error is returned. Could someone please explain how to fix this? print(df.columns) Index(['t', 'ax', 'ay', 'az', 'gx', 'gy', 'gz',…
gee_whiz
  • 9
  • 1
0
votes
1 answer

Drop duplicates and complete NaN with oldest values and optimise running time

I'm working on a database with some columns, and I drop duplicates after sorting values by date (format Y-m-d). My df looks like the following: id date name firstname 01 2020-04-01 max smith 04 2020-08-04 georges …
mathilde
  • 194
  • 1
  • 11
0
votes
1 answer

Does drop_duplicate guarantee to keep the first row and drop the rest of the rows after sorting the dataframe in Spark?

I have a dataframe, read from an Avro file in Hadoop, with three columns (a,b,c), where one is a key column and, of the other two columns, one is of integer type and the other is of date type. I am ordering the frame by the integer column and date column…
0
votes
2 answers

Drop duplicates row in Spark SQL based on custom function on a column in Java

I am trying to remove duplicates from my Dataset in Spark SQL in Java. My dataset has three columns. Let's say the names of the columns are name, timestamp, and score. The name is the String representation of the employee name and timestamp is in long…
Ajay Kr Choudhary
  • 1,304
  • 1
  • 14
  • 23
0
votes
0 answers

Pandas drop_duplicates function not working properly

I'm working on building a bridge table between two tables with a many-to-many relationship. Table A contains employee IDs and job assignments (one employee can have more than one assignment). Table B has the same structure. The aim is to build a…
0
votes
1 answer

Making df columns unique to one column permanently outside of df loop function

Note: Correction - the code returns AttributeError: 'str' object has no attribute 'drop_duplicates' I am trying to loop through a number of dfs and reduce my 'user_id' column to only unique values using the df.drop_duplicates(subset…