
I have a dataframe that contains answers to many questions.

Each row represents a respondent, and the columns hold that respondent's answers to the questions. Because people often spam these questionnaires, some respondents give the same answer many times, like "yes good", "yes good", ...

I would like to remove the rows where the same answer is repeated more than once or twice (a single repetition could be a coincidence).

My dataframe looks like this. The questions differ from file to file, but column 0 is always the ID and all remaining columns are questions; the number of questions varies.

ID  , Question 1 , Question 2   , Question 3   , Question 4   , ...
Id1 , Ans. str1  , Ans. string2 , Ans. string3 , Ans. string4 , ...
Id2 , Ans. str1  , Ans. string2 , Ans. string3 , Ans. string4 , ...
Id3 , Ans. str1  , Ans. string2 , Ans. string3 , Ans. string4 , ...
Id4 , Ans. str1  , Ans. string2 , Ans. string3 , Ans. string4 , ...

What I need is to drop rows that contain the same answer to more than one question. Ideally, I would like to be able to adjust the number of identical answers that must be found for a row to be dropped, because in a big questionnaire two answers can be the same without the respondent being a spammer. If that is not easy, let's drop a row when any two of its answers are the same.

  • Hi Poulos! Could you please add sample input and output to your question. It would greatly help people in giving a solution – Mohsin hasan Aug 28 '20 at 15:44

1 Answer

# Import pandas
import pandas as pd

# Sample data: column 0 is the ID, the remaining columns are questions
data = {'ID':         ['Id1', 'Id2', 'Id3', 'Id4'],
        'Question 1': ['Ans. str1', 'Ans. string1', 'Ans. string1', 'Ans. string1'],
        'Question 2': ['Ans. str2', 'Ans. string2', 'Ans. string2', 'Ans. string2'],
        'Question 3': ['Ans. str3', 'Ans. string3', 'Ans. string3', 'Ans. string3'],
        'Question 4': ['Ans. str4', 'Ans. string4', 'Ans. string4', 'Ans. string4']
       }

df = pd.DataFrame(data)
print(df)

Output:

    ID    Question 1    Question 2    Question 3    Question 4
0  Id1     Ans. str1     Ans. str2     Ans. str3     Ans. str4
1  Id2  Ans. string1  Ans. string2  Ans. string3  Ans. string4
2  Id3  Ans. string1  Ans. string2  Ans. string3  Ans. string4
3  Id4  Ans. string1  Ans. string2  Ans. string3  Ans. string4

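Drop every row whose answers duplicate another row's answers, ignoring the ID column:

# keep=False drops all copies of a duplicated row; the ID column is left out of the comparison
question_cols = df.columns[1:].tolist()
df = df.drop_duplicates(subset=question_cols, keep=False)
print(df)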

    ID Question 1 Question 2 Question 3 Question 4
0  Id1  Ans. str1  Ans. str2  Ans. str3  Ans. str4
john taylor
  • Thanks John, but I don't want to drop duplicates. What I need is to drop rows that contain the same answer to more than one question – Poulos Spyros Aug 28 '20 at 15:52
  • I changed the answer. Do you mean like this? – john taylor Aug 28 '20 at 16:03
  • Thanks for trying, but I don't want the comparison for duplicates to be among the elements of a column, but among the elements of a row. I also edited the question a bit to be more explanatory. Does it help? Any row in which Ans. stringX appears more than once should be dropped. – Poulos Spyros Aug 28 '20 at 16:15
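
The row-wise check described in the question and the last comment could be sketched roughly as follows. This is only one possible approach, not the method from the answer above. It assumes that column 0 is the ID, that every other column is a question, and that answers count as "the same" only when they match exactly (no case or whitespace normalization). The names `drop_repetitive_rows`, `max_repeat`, and `max_repeats` are hypothetical and chosen for illustration.

```python
from collections import Counter

import pandas as pd


def max_repeat(values):
    """Return how often the most frequent non-missing answer occurs in one row."""
    counts = Counter(v for v in values if pd.notna(v))
    return max(counts.values(), default=0)


def drop_repetitive_rows(df, max_repeats=1):
    """Keep rows in which no single answer appears more than `max_repeats` times.

    Hypothetical helper. Assumption: column 0 is the ID and all
    remaining columns are questions.
    """
    answers = df.iloc[:, 1:]
    # One count per row: occurrences of that row's most repeated answer.
    top_counts = [max_repeat(row) for row in answers.to_numpy()]
    keep = [count <= max_repeats for count in top_counts]
    return df[keep]


# Made-up sample data for illustration only.
df = pd.DataFrame({
    "ID": ["Id1", "Id2", "Id3"],
    "Question 1": ["yes good", "yes", "fine"],
    "Question 2": ["yes good", "no", "fine"],
    "Question 3": ["yes good", "maybe", "slow"],
    "Question 4": ["yes good", "often", "cheap"],
})

strict = drop_repetitive_rows(df, max_repeats=1)   # any repeated answer drops the row
lenient = drop_repetitive_rows(df, max_repeats=2)  # one coincidental repeat is tolerated
print(strict["ID"].tolist())
print(lenient["ID"].tolist())
```

With `max_repeats=1` a row is dropped as soon as any two of its answers match, which is the fallback behaviour asked for in the question. Raising the threshold tolerates coincidental repeats in long questionnaires.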