
In Django, is it possible to find duplicates with queryset and regex?

Django select only rows with duplicate field values shows how to find duplicates without using a regex:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model:

class Company(models.Model):
    Name = models.CharField(max_length=255, blank=True)
    Website = models.URLField(blank=True, null=True)

I want to find duplicates with a regex.

For example:

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')

Both of these are the same website. I'd like to use a regex so that the query returns these companies as duplicates.

I know there are filters that take a regex. I'm not sure how to update this query to use one:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I'd like to do something like:

Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Update: There was some confusion, so I'll give an example.

Here is what my DB looks like:

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:

  1. http://example.org (from name=a) and http://example.org (from name=b)

This misses the fact that example.com and www.example.com are the same website.

I want to use a regex so that I can tell Django that example.com and www.example.com are the same website.

I want to modify:

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

so that it returns the duplicates:

  1. http://example.org (from name=a) and http://example.org (from name=b)

  2. example.com and www.example.com

brian

2 Answers

1

Use __icontains:

Company.objects.filter(Website__icontains='example.com')

Which will produce:

ILIKE '%example.com%'

It will thus return the following records, if they exist in the Company table:

 http://example.com, http://www.example.com
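
For example, a quick sketch in the Django shell (using the Company rows from the question); the exact QuerySet repr may differ, but both URLs should come back:

>>> Company.objects.filter(Website__icontains='example.com').values_list('Website', flat=True)
<QuerySet ['http://example.com', 'http://www.example.com']>
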
dan-klasson
  • I'm looking for a way to find the duplicates. In my example I want a way to find the duplicates where Django will realize "example.com" and "www.example.com" are the same website. – brian Jan 22 '15 at 16:41
  • And with the above queryset it will. – dan-klasson Jan 22 '15 at 16:47
  • I want to run a queryset that finds the duplicate websites. Your code assumes I know the duplicate URLs. Also, it creates an object, which I assume is a typo. I want to modify this expression so that it will realize www and non-www are the same website: values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1) – brian Jan 22 '15 at 19:48
  • Updated my answer. And I was a bit fast copying there and did not see your first queryset was a create. – dan-klasson Jan 23 '15 at 01:49
  • There was some confusion about what I want. I updated the post. – brian Jan 23 '15 at 13:50
0

I have a similar DB structure in one of my projects - I store URLs of some entities. To find duplicates I also store the 'domain' of each URL.

So regarding your examples, the DB structure would be:

id |           url          | domain
-----------------------------------------
1  | http://www.example.com | example.com
2  | http://example.com     | example.com

It's then easy to find duplicates, or to find URLs/entities related to a particular domain.
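
With that column in place, your duplicate query keeps the same shape as the one in your question, just grouped on domain instead of Website. A rough sketch, assuming the extra domain field has been added to your Company model:

from django.db.models import Count

# Group by the normalized domain and keep only domains that occur more than once.
duplicate_domains = (
    Company.objects
    .values('domain')
    .annotate(count=Count('id'))
    .order_by()
    .filter(count__gt=1)
    .values_list('domain', flat=True)
)

# Fetch every company whose domain is one of the duplicated ones.
duplicates = Company.objects.filter(domain__in=list(duplicate_domains))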

You may think that such an approach is overkill for detecting duplicates.

But your approach has 2 big drawbacks:

1) It's impossible to write a correct regexp that will match all domain variations.

Matching "www.example.com" and "example.com" is easy. But how about "example.co.uk" and "www.example.co.uk", or maybe "www.старт.рф" and "старт.рф"? Those are all valid domain names.

2) You are shooting yourself in the foot in the long run - running complex regexps against ever-growing DB tables will kill your performance.

P.S. I use the "tldextract" lib to get the domain of a URL.
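
As a minimal sketch of what that looks like (assuming the domain column proposed above has been added to your Company model; tldextract's registered_domain gives the domain plus public suffix):

import tldextract

def domain_of(url):
    # 'http://www.example.com'   -> 'example.com'
    # 'http://www.example.co.uk' -> 'example.co.uk'
    return tldextract.extract(url).registered_domain

# One-off backfill of the proposed domain column.
for company in Company.objects.all():
    company.domain = domain_of(company.Website or '')
    company.save(update_fields=['domain'])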

Termos