I have a disk.frame object with many duplicate rows. How can I remove them? (The original data frame is 10 GB in size.)

Irene M

1 Answer

You can do it in base R:

# Remove rows that are duplicated across all columns:
df[!duplicated(df), ]

# Or remove rows that are duplicated in specific column(s), keeping the first occurrence:
df[!duplicated(df[c('ColumnX')]), ]
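
To see what the two base R calls actually keep, here is a minimal sketch on a toy in-memory data frame. The data and the column names ColumnX and ColumnY are made up for illustration and are not taken from the question.

```r
# Toy data frame (hypothetical values, for illustration only).
df <- data.frame(
  ColumnX = c("a", "a", "b", "b"),
  ColumnY = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Rows 1 and 2 match in every column, so only the first of them is kept.
dedup_all <- df[!duplicated(df), ]

# Only the first row for each distinct value of ColumnX is kept.
dedup_x <- df[!duplicated(df[c("ColumnX")]), ]

print(dedup_all)
print(dedup_x)
```

Note that duplicated() marks the second and later occurrences, so the first row of each group is the one that survives.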

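If you prefer dplyr, distinct() does the same, either across the entire data frame or on specific column(s):

library(dplyr)

# Across all columns (.keep_all is not needed here, since every column is already used):
df %>% distinct()

# Or on specific column(s), keeping the remaining columns of the first matching row:
df %>% distinct(ColumnX, .keep_all = TRUE)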
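
The effect of .keep_all is easiest to see side by side. This is a small sketch assuming dplyr is installed and using a made-up in-memory data frame; ColumnX and ColumnY are placeholder names.

```r
library(dplyr)

# Toy data frame (hypothetical values, for illustration only).
df <- data.frame(
  ColumnX = c("a", "a", "b", "b"),
  ColumnY = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Distinct rows across all columns.
all_cols <- df %>% distinct()

# Distinct values of ColumnX only; the other columns are dropped.
x_only <- df %>% distinct(ColumnX)

# Distinct values of ColumnX, keeping the other columns of the first matching row.
x_keep <- df %>% distinct(ColumnX, .keep_all = TRUE)

print(all_cols)
print(x_only)
print(x_keep)
```

Without .keep_all = TRUE, distinct(ColumnX) returns just that column, which is usually not what you want when deduplicating whole records.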
Archeologist