
Hello, thanks in advance for all answers; I really appreciate the community's help.

Here is my dataframe, from a csv containing data scraped from car classified ads:

 Unnamed: 0                      NameYear  \
0           0             BMW 7 серия, 2007   
1           1                  BMW X3, 2021   
2           2  BMW 2 серия Gran Coupe, 2021   
3           3                  BMW X5, 2021   
4           4                  BMW X1, 2021   

                                    Price  \
0                               520 000 ₽   
1  от 4 810 000 ₽\n4 960 000 ₽ без скидки   
2                             2 560 000 ₽   
3  от 9 259 800 ₽\n9 974 800 ₽ без скидки   
4  от 3 130 000 ₽\n3 220 000 ₽ без скидки   

                                          CarParams  \
0  187 000 км, AT (445 л.с.), седан, задний, бензин   
1    2.0 AT (190 л.с.), внедорожник, полный, дизель   
2       1.5 AMT (140 л.с.), седан, передний, бензин   
3    3.0 AT (400 л.с.), внедорожник, полный, дизель   
4    2.0 AT (192 л.с.), внедорожник, полный, бензин   

                                                 url  
0  https://www.avito.ru/moskva/avtomobili/bmw_7_s...  
1  https://www.avito.ru/moskva/avtomobili/bmw_x3_...  
2  https://www.avito.ru/moskva/avtomobili/bmw_2_s...  
3  https://www.avito.ru/moskva/avtomobili/bmw_x5_...  
4  https://www.avito.ru/moskva/avtomobili/bmw_x1_...
  • THE TASK - I want to know if there are duplicate rows, i.e. whether the SAME car advertisement appears twice. The most reliable column is probably url, because it should be unique (CarParams and NameYear can repeat), so I will check nunique and duplicated on the url column, roughly as in the sketch below.
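
This is roughly the check I have in mind (a quick sketch; df is the dataframe loaded from the csv above):

# compare the number of distinct urls with the number of rows
print(df['url'].nunique(), len(df))

# show the rows whose url has already appeared earlier in the dataframe
print(df[df.duplicated(subset=['url'])])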

Screenshot to visually inspect the result of duplicated():

[screenshot of the duplicated() output]

  • THE ISSUE: Visual inspection (sorry for the unprofessional jargon) shows these urls are not the SAME, but I wanted to get only exactly identical urls so I could check for repeated data. I also tried setting keep=False.
  • Anurag Dabas thanks for the post edits, and I remembered your comment from last time about how to print a dataframe and paste it in here – data_runner Mar 24 '21 at 08:36
  • I don't really understand what you are trying to achieve... can you clarify your case? – jan-seins Mar 24 '21 at 09:23
  • Thank you for the response jan-seins. Basically I want to understand if my dataframe column 'url' has duplicate values. I tried to achieve this using df.duplicated, which outputs the screenshot in my post. It is supposed to give the duplicated values, but I see that the urls are not duplicates, as highlighted in the screenshot. Again, excuse my rookie language – data_runner Mar 24 '21 at 09:44

2 Answers


Try:

df.duplicated(subset=["url"], keep=False)
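
If you want to see the offending rows rather than just the boolean mask, a small follow-up sketch (assuming df is the question's dataframe):

# keep only the rows whose url occurs more than once
mask = df.duplicated(subset=["url"], keep=False)
print(df[mask])
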
Laurent

df.duplicated() gives you a pd.Series with boolean values.

Here is an example that you could probably use:

from random import randint
import pandas as pd

# a few example urls; urls later in the list have a higher chance of being repeated
urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy',
        'http://bla.com']

# build one or more rows per url, each with a random customer id
d = []
for i, url in enumerate(urls):
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))

df = pd.DataFrame(d)

# mark every row whose url occurs more than once
df['dups'] = df['url'].duplicated(keep=False)
print(df)

resulting in the following df:

  customer                          url   dups
0       89        http://www.google.com  False
1       43  http://www.stackoverfow.com  False
2       36                http://bla.xy   True
3       86                http://bla.xy   True
4       32               http://bla.com  False

The column dups shows you which urls exist more than once. In my example data it is only the url http://bla.xy.
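
If you only want to look at the flagged rows, you can filter on that column (a small sketch using the example df above):

# keep only the rows marked as duplicates
print(df[df['dups']])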

The important thing is that you check what the parameter keep does:

keep{‘first’, ‘last’, False}, default ‘first’
  Determines which duplicates (if any) to mark.
  first : Mark duplicates as True except for the first occurrence.
  last : Mark duplicates as True except for the last occurrence.
  False : Mark all duplicates as True.

In my case I used keep=False to get all the duplicated values.
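
For illustration, a small sketch of the difference on the same example data:

# default keep='first': the first http://bla.xy row stays False
print(df['url'].duplicated())
# keep=False: every http://bla.xy row is marked True
print(df['url'].duplicated(keep=False))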

jan-seins