Note (correction): the code raises AttributeError: 'str' object has no attribute 'drop_duplicates'

I am trying to loop through a number of dfs and reduce each one to rows with unique 'user_id' values using df.drop_duplicates(subset=['user_id']).

I need this to be a global change and am trying to incorporate it into my function that imports .csv files and assigns each one to a variable named after its file. This works perfectly, but when I try to add the drop_duplicates call, it doesn't seem to do anything:

def assign_vars(files = os.listdir()):
    # Make list of variable names using file name
    variables = [make_var(file) for file in files]
    # Start list to place dfs into
    dfs = []
    for var,file in zip(variables,files):
        # Use globals to assign dfs to file names
        globals()[var] = pd.read_csv(file)
        #<<1>>
        # Add each newly made df var to a list
        dfs.append(var.drop_duplicates(subset =['user_id'])) # rmv duplicates
    return print('Your variables are: ',sorted(dfs))

This raises an AttributeError. It seems that var is being treated as a str instead of a df.
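The error can be reproduced in isolation. The following is a minimal sketch, assuming a small made-up DataFrame in place of an imported CSV and a hypothetical variable name "users"; it shows that var holds only the name, while the DataFrame lives at globals()[var].

```python
import pandas as pd

# Hypothetical data standing in for one imported CSV.
frame = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3]})

var = "users"            # the variable name: a plain str
globals()[var] = frame   # the DataFrame is stored under that name

name_type = type(var).__name__                # the name is a 'str'
has_method = hasattr(var, "drop_duplicates")  # a str has no such method

# Calling the method on the object stored under the name works.
deduped = globals()[var].drop_duplicates(subset=["user_id"])
deduped_len = len(deduped)
```

Calling var.drop_duplicates(...) therefore fails, while globals()[var].drop_duplicates(...) operates on the DataFrame.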

When I call len() on a df, the result is the same as before, even though each df shrinks by about 70% when I call df.drop_duplicates on it individually.

Alternatively, I have tried creating an intermediate object at <<1>> and then calling .drop_duplicates on it. This doesn't work, and I believe it's because the change stays local.
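One likely reason the lengths stay the same is that drop_duplicates returns a new DataFrame by default and leaves the original untouched. A short sketch with made-up data (the names df and unique_df are illustrative only):

```python
import pandas as pd

# Hypothetical data with repeated user_id values.
df = pd.DataFrame({"user_id": [10, 10, 10, 20, 30, 30]})

# The result is a new object; df itself is not modified.
unique_df = df.drop_duplicates(subset=["user_id"])
len_original = len(df)
len_unique = len(unique_df)

# To keep the change, assign the result back to the name.
df = df.drop_duplicates(subset=["user_id"])
len_after_reassign = len(df)
```

So the de-duplicated result has to be stored back under the global name, or it is discarded when the loop moves on.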

rpatt97
  • dfs = dfs.append(var.drop_duplicates(subset =['user_id'])). You need to assign the modified df back. Can you make this change and check once. – Anshul May 29 '20 at 11:32
  • Thanks so much for your fast response. However, this fix returns: AttributeError: 'str' object has no attribute 'drop_duplicates'. I realised that this error also arises with my previous code; I just hadn't called the function correctly. – rpatt97 May 29 '20 at 12:21

1 Answer

Fix

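Adding .drop_duplicates(subset=['user_id']) to the end of the CSV import, at the point where each df is assigned to globals, did the trick. In the original code, var is only the variable name (a str), so the method has to be called on the DataFrame returned by pd.read_csv instead.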

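import os
import pandas as pd

def assign_vars(files=None):
    # List the current directory at call time, not once at definition time
    if files is None:
        files = os.listdir()
    # Make list of variable names using file name (make_var is my own helper, defined elsewhere)
    variables = [make_var(file) for file in files]
    # Start list to collect the variable names
    names = []
    for var, file in zip(variables, files):
        # Drop duplicates on the df itself, then use globals to assign it to its file name
        globals()[var] = pd.read_csv(file).drop_duplicates(subset=['user_id'])
        # var is only the name (a str), so this list holds names, not dfs
        names.append(var)
    print('Your variables are: ', sorted(names))
    return sorted(names)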
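As an alternative to writing into globals(), the de-duplicated frames could be kept in a dictionary keyed by file name. This is only a sketch under assumptions, not the approach used above: load_unique is a hypothetical helper, the file name stem stands in for make_var, and the two CSV files are throwaway data written to a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_unique(folder, key="user_id"):
    """Read every .csv in folder into a dict of de-duplicated DataFrames."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # Path.stem is the file name without its extension.
        frames[path.stem] = pd.read_csv(path).drop_duplicates(subset=[key])
    return frames


# Demonstration with throwaway files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "signups.csv").write_text("user_id,plan\n1,a\n1,b\n2,a\n")
    Path(tmp, "logins.csv").write_text("user_id,day\n5,mon\n5,tue\n5,wed\n")
    frames = load_unique(tmp)

names = sorted(frames)
sizes = {name: len(df) for name, df in frames.items()}
```

Each frame is then reached as frames["signups"] rather than through a generated global name, which avoids mixing up the name and the object.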
rpatt97