
In Django, is it possible to find duplicates with queryset and regex?

Django select only rows with duplicate field values shows how to find duplicates without using a regex:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model:

class Company(models.Model):
    Name = models.CharField(max_length=255, blank=True)
    Website = models.URLField(blank=True, null=True)

I want to find duplicates with a regex.

For example:

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')

Both of these are the same website. I'd like to use a regex so that the query returns these companies as duplicates.

I know there are filters that take a regex. I'm not sure how to update this query to use one:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I'd like to do something like:

Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Update: There was some confusion, so I'll give an example.

Here is what my DB looks like:

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:

  1. http://example.org (from name=a) and http://example.org (from name=b)

This misses the fact that example.com and www.example.com are the same website.

I want to use a regex so that I can tell Django that example.com and www.example.com are the same website.

I want to modify:

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

so that it returns the duplicates:

  1. http://example.org (from name=a) and http://example.org (from name=b)

  2. example.com and www.example.com

brian

2 Answers

1

Use __icontains:

Company.objects.filter(Website__icontains='example.com')

Which will produce:

ILIKE '%example.com%'

It will thus return the following records, if they exist in the Company table:

 http://example.com, http://www.example.com
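
For example, a quick sketch in the Django shell (using the Company rows from the question); the exact QuerySet repr may differ, but both URLs should come back:

>>> Company.objects.filter(Website__icontains='example.com').values_list('Website', flat=True)
<QuerySet ['http://example.com', 'http://www.example.com']>
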
dan-klasson
  • I'm looking for a way to find the duplicates. In my example I want a way to find the duplicates where Django will realize "example.com" and "www.example.com" are the same website. – brian Jan 22 '15 at 16:41
  • And with the above queryset it will. – dan-klasson Jan 22 '15 at 16:47
  • I want to run a queryset that finds the duplicate websites. Your code assumes I know the duplicate URLs. Also, it creates an object, which I assume is a typo. I want to modify this expression so that it will realize www and non-www are the same website: values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1) – brian Jan 22 '15 at 19:48
  • Updated my answer. And I was a bit fast copying there and did not see your first queryset was a create. – dan-klasson Jan 23 '15 at 01:49
  • There was some confusion about what I want. I updated the post. – brian Jan 23 '15 at 13:50
0

I have a similar DB structure in one of my projects - I store URLs of some entities. To find duplicates I also store the 'domain' of each URL.

So regarding your examples, the DB structure would be:

id |           url          | domain
-----------------------------------------
1  | http://www.example.com | example.com
2  | http://example.com     | example.com

It's then easy to find duplicates, or to find URLs/entities related to a particular domain.
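
With that column in place, your duplicate query keeps the same shape as the one in your question, just grouped on domain instead of Website. A rough sketch, assuming the extra domain field has been added to your Company model:

from django.db.models import Count

# Group by the normalized domain and keep only domains that occur more than once.
duplicate_domains = (
    Company.objects
    .values('domain')
    .annotate(count=Count('id'))
    .order_by()
    .filter(count__gt=1)
    .values_list('domain', flat=True)
)

# Fetch every company whose domain is one of the duplicated ones.
duplicates = Company.objects.filter(domain__in=list(duplicate_domains))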

You may think that such an approach is overkill for detecting duplicates.

But your approach has 2 big drawbacks:

1) It's impossible to write a correct regexp that will match all domain variations.

Matching "www.example.com" and "example.com" is easy. But how about "example.co.uk" and "www.example.co.uk", or maybe "www.старт.рф" and "старт.рф"? Those are all valid domain names.

2) You are shooting yourself in the foot in the long run - running complex regexps against ever-growing DB tables will kill your performance.

P.S. I use the "tldextract" lib to get the domain of a URL.
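
As a minimal sketch of what that looks like (assuming the domain column proposed above has been added to your Company model; tldextract's registered_domain gives the domain plus public suffix):

import tldextract

def domain_of(url):
    # 'http://www.example.com'   -> 'example.com'
    # 'http://www.example.co.uk' -> 'example.co.uk'
    return tldextract.extract(url).registered_domain

# One-off backfill of the proposed domain column.
for company in Company.objects.all():
    company.domain = domain_of(company.Website or '')
    company.save(update_fields=['domain'])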

Termos