I have a disk.frame object with many duplicate rows. How can I remove them? (The original data frame is 10 GB.)
Please provide enough code so others can better understand or reproduce the problem. – Community Jul 04 '22 at 14:23
1 Answer
You can do it in base R:
# Remove rows that are duplicated across the entire data frame:
df[!duplicated(df), ]
# Or remove rows that are duplicated in specific column(s), keeping the first occurrence:
df[!duplicated(df[c('ColumnX')]), ]
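To make the two base R calls concrete, here is a minimal sketch on a toy data frame. The column names and values are made up for illustration, and it assumes the data fits in memory as an ordinary data frame; it does not show how to apply this to a 10 GB disk.frame.

```r
# Toy data frame standing in for the real data (hypothetical columns and values).
df <- data.frame(
  ColumnX = c("a", "a", "b", "b", "c"),
  ColumnY = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row,
# so negating it keeps the first occurrence of each distinct row.
dedup_all <- df[!duplicated(df), ]

# Restricting duplicated() to ColumnX keeps the first row per ColumnX value.
dedup_x <- df[!duplicated(df[c("ColumnX")]), ]

print(dedup_all)
print(dedup_x)
```

Only the second "a" row is an exact duplicate, so the whole-frame version drops one row, while the column-based version also drops the second "b" row even though its ColumnY differs.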
If you want to do it with dplyr, the same applies, either across the entire data frame or on a specific column:
df %>% distinct(.keep_all = TRUE)
# Or:
df %>% distinct(ColumnX, .keep_all = TRUE)
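The dplyr variant can be tried the same way. This is a sketch on the same kind of made-up toy data, assuming dplyr is installed and the data is an ordinary in-memory data frame rather than a disk.frame.

```r
library(dplyr)

# Toy data frame standing in for the real data (hypothetical columns and values).
df <- data.frame(
  ColumnX = c("a", "a", "b", "b", "c"),
  ColumnY = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# With no columns named, distinct() compares whole rows.
dedup_all <- df %>% distinct()

# With a column named, .keep_all = TRUE retains the other columns
# from the first row found for each ColumnX value.
dedup_x <- df %>% distinct(ColumnX, .keep_all = TRUE)

print(dedup_all)
print(dedup_x)
```

When no columns are named, `distinct()` already returns every column, so `.keep_all = TRUE` only matters in the column-specific call.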

Archeologist