I have this dataframe and I am trying to remove the duplicate elements from each array in Column 2, storing the resulting array in Column 3, as shown here:

Column1  Column 2                                            Column3
0        [ABC|QWER|12345, ABC|QWER|12345]                    [ABC|QWER|12345]
1        [TBC|WERT|567890, TBC|WERT|567890]                  [TBC|WERT|567890]
2        [ERT|TYIO|9845366, ERT|TYIO|9845366, ERT|TYIO|5]    [ERT|TYIO|9845366, ERT|TYIO|5]
3        NaN                                                 NaN
4        [SAR|QWPO|34564557, SAR|QWPO|3456455]               [SAR|QWPO|34564557, SAR|QWPO|3456455]
5        NaN                                                 NaN
6        [SE|WERT|12233412]                                  [SE|WERT|12233412]
7        NaN                                                 NaN

I'm using the following code, but it raises a "malformed node or string" error. Please help me solve this.

import ast

def ddpe(a):
    return list(dict.fromkeys(ast.literal_eval(a)))

df['column3'] = df['column2'].apply(ddpe)
Cuckoo

1 Answer

I'm assuming the values of 'column2' are strings, since you are trying to use ast.literal_eval. In that case, try this instead:

import pandas as pd
import numpy as np

def ddpe(str_val):
    if pd.isna(str_val):  # return NaN if value is NaN
        return np.nan  
    # Remove the square brackets, split on ',' and strip any
    # whitespace around each element
    vals = [v.strip() for v in str_val.strip('[]').split(',')]
    # remove duplicates keeping the original order
    return list(dict.fromkeys(vals))

df['column3'] = df['column2'].apply(ddpe)
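
As a quick check of the string case, here is a minimal sketch on a few rows taken from the question's table. The way the DataFrame is built is my assumption (each value is a string that looks like a list); your real data may be constructed differently.

```python
import numpy as np
import pandas as pd

def ddpe(str_val):
    if pd.isna(str_val):  # return NaN if value is NaN
        return np.nan
    # Remove the brackets, split on ',' and strip surrounding whitespace
    vals = [v.strip() for v in str_val.strip('[]').split(',')]
    # Remove duplicates, keeping the original order
    return list(dict.fromkeys(vals))

# Assumed sample: string representations of lists, plus a missing value
df = pd.DataFrame({
    'column2': [
        '[ABC|QWER|12345, ABC|QWER|12345]',
        np.nan,
        '[ERT|TYIO|9845366, ERT|TYIO|9845366,ERT|TYIO|5]',
    ]
})

df['column3'] = df['column2'].apply(ddpe)
print(df['column3'].tolist())
```

Note that splitting on ',' only works as long as the elements themselves never contain a comma.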

If the column values are lists already, you just need:

def ddpe(lst_val):
    # Return NaN if the value is not a list,
    # assuming lists and NaN are the only two possibilities.
    if not isinstance(lst_val, list):   
        return np.nan  
    return list(dict.fromkeys(lst_val))

df['column3'] = df['column2'].apply(ddpe)
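
If you are unsure which case applies, a sketch like the following may help. It checks the element types first and then applies the list version. The sample rows come from the question's table, but building them as real Python lists is my assumption. It also shows why a `pd.isna` check is not suitable for list values: on a list, `pd.isna` returns an elementwise boolean array, and using that array in an `if` raises a `ValueError` when it has more than one element.

```python
import numpy as np
import pandas as pd

def ddpe(lst_val):
    # Return NaN if the value is not a list
    if not isinstance(lst_val, list):
        return np.nan
    # Remove duplicates, keeping the original order
    return list(dict.fromkeys(lst_val))

# Assumed sample: actual Python lists, plus a missing value
df = pd.DataFrame({
    'column2': [
        ['ABC|QWER|12345', 'ABC|QWER|12345'],
        np.nan,
        ['ERT|TYIO|9845366', 'ERT|TYIO|9845366', 'ERT|TYIO|5'],
    ]
})

# Inspect the type of each element (NaN is a float)
element_types = set(df['column2'].map(type).unique())
print(element_types)

# pd.isna on a multi-element list gives an array, so `if` is ambiguous
try:
    if pd.isna(df['column2'].iloc[0]):
        pass
    isna_is_ambiguous = False
except ValueError:
    isna_is_ambiguous = True

df['column3'] = df['column2'].apply(ddpe)
print(df['column3'].tolist())
```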
Rodalm
  • Hey, it's giving the error `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`. – Cuckoo Nov 17 '21 at 19:46
  • @Cuckoo are the values lists or string representations of lists? I'm assuming the latter, as your solution attempt suggests by using `ast.literal_eval`. However, the error suggests the values are lists. This type of information should be very clear in the post description to avoid misunderstandings. – Rodalm Nov 17 '21 at 19:54
  • So, I concatenated a few columns of the dataset to get column2, and its dtype is object. Sorry for the misunderstanding. – Cuckoo Nov 17 '21 at 20:10
  • @Cuckoo I updated the answer with the two cases. Does it work now? – Rodalm Nov 17 '21 at 20:11
  • @Cuckoo I meant the data type of the elements themselves. Are they strings or lists? For instance, what is the output of `df['column2'].map(type).unique()`? – Rodalm Nov 17 '21 at 20:13
  • `array([<class 'list'>, <class 'float'>], dtype=object)` Hey, this is the output, and thanks, that's working. – Cuckoo Nov 17 '21 at 20:16
  • It would also be great if you could add a few links for better understanding, as I'm a novice in this field. Thanks. – Cuckoo Nov 17 '21 at 20:17
  • @Cuckoo That means the elements are lists, as I suspected (the float comes from the fact that NaNs are floats). You're welcome, I'm glad it worked! Do you want resources to learn `pandas`? I highly recommend the [official guides](https://pandas.pydata.org/docs/) in the documentation. – Rodalm Nov 17 '21 at 20:26