Eliminate duplicate cities from database

Question

Background

Over 5300 duplicate rows:

"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"

Additional details:

PostgreSQL 8.4 Database
Linux

Problem

Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):

"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"

The goal is to eliminate all duplicates.

Questions

Given a table of values such as the above CSV data:

How would you eliminate duplicates?
What geo-centric PostgreSQL functions would you use?
What other criteria would you use to whittle down the duplicates?

Update

Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):

select
  c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.country = 'BE' and
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
  (c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
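  -- earth_distance() returns meters when earth() keeps its default radius,
  -- so 10 km is written as 10000 (assumes earth() has not been redefined).
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10000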
order by
  country, name

Ideas

Two phase approach:

Eliminate the obvious duplicates (same country, region, and city name) by removing the row with the min(id) from each group.
Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.
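
The first phase can be previewed before anything is deleted. A minimal sketch, assuming the climate.maxmind_city table and the country, region_id, and name columns used in the query above:

```sql
-- Phase 1 preview: groups of rows that share country, region, and name.
-- Nothing is deleted; this only reports how many copies each group has.
select
  country,
  region_id,
  name,
  count(*) as copies,
  min(id)  as lowest_id
from
  climate.maxmind_city
group by
  country, region_id, name
having
  count(*) > 1
order by
  country, name;
```

Groups with more than two copies need more than one pass if only a single id is removed per group at a time.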

Thank you!

Considering "Ab Barik", how do you know the values for latitude and longitude are more reliable than the values for the regions? — Mike Sherrill 'Cat Recall', Apr 28 '11 at 11:27
@Catcall: I do not. However, by looking at cities of the same name in close proximity to each other, it does not matter which gets deleted (for the purpose I have in mind). One problem is determining when they are sufficiently far from each other to be considered a different city. That should be taken care of by using one of the geographical functions PostgreSQL offers to compare distances. — Dave Jarvis, Apr 28 '11 at 20:05
Be careful, Kansas City, Kansas, and Kansas City, Missouri, have the same name, are in the same country, and are very, very close to one another. — unpythonic, Apr 28 '11 at 23:35
@Mark: Thanks, Mark. I put Kansas City MO back. There are probably a few others like this that have been wiped out. For my purposes, though, it is not a huge issue. — Dave Jarvis, Apr 29 '11 at 00:41

score 1 · Answer 1 · answered Apr 28 '11 at 10:46

Finding duplicates is simple:

select
  max(id) as this_should_stay,
  latitude,
  longitude,
  country,
  region,
  city
FROM
  your_table
group by
  latitude,
  longitude,
  country,
  region,
  city
having count(*) > 1;

Adding code to remove duplicates based on this is simple:

delete from your_table where id not in (
    select
      max(id) as this_should_stay
    FROM
      your_table
    group by
      latitude,
      longitude,
      country,
      region,
      city
)

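Note the lack of a having clause in the delete query: every group, including groups with a single row, must contribute its max(id) to the list, otherwise the unique rows would be deleted as well.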
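
If copies of a city differ slightly in their coordinates, grouping on latitude and longitude will not catch them. A variant that groups only on country, region, and city could look like the sketch below. Here your_table is the same placeholder name as above, and the transaction wrapper is an addition so the result can be checked before it is made permanent:

```sql
begin;

-- Keep the highest id in each (country, region, city) group; delete the rest.
delete from your_table where id not in (
    select max(id)
    from your_table
    group by country, region, city
);

-- Check how many rows remain before deciding.
select count(*) from your_table;

-- Replace with "commit" once the result looks right.
rollback;
```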

Thank you. Won't that only find exact duplicates? From what I can see of the data, the Abay lines have different lat & long values. – Dave Jarvis Apr 28 '11 at 19:28
Well sure, but I think that removing 2 fields from the queries is a pretty simple operation :) – Apr 28 '11 at 19:36
I don't think I expressed the problem well enough. Some of the duplicates are duplicates because their lat & long are in close proximity, not exact. To determine their proximity (e.g., Abbasabad), I will need to look for duplicates using some geographical distance functions. If they are not sufficiently close (using a set threshold), then they could be different cities. I'm not really looking for SQL code (as you point out, the code is simple), but problems I might encounter trying to remove the duplicates. – Dave Jarvis Apr 28 '11 at 19:57

score 1 · Accepted Answer · answered Apr 28 '11 at 23:03

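This deletes the row with the highest id among same-named cities in the same country that lie within close proximity of one another. Each run removes at most one row per country and name, so repeat it until no rows are deleted: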

delete from climate.maxmind_city where id in (
select
  -- one id per (country, name) group: the highest id that has a close neighbor
  max(c1.id)
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
  -- the threshold is in the units of earth(): meters by default
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35
group by
  c1.country, c1.name
)
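
Before running the delete, it may help to list the pairs it would consider. The following is only a sketch: it reuses the table, columns, and threshold from the statement above, and the output names higher_id, lower_id, and distance are made up for illustration. The distance column shows the raw value from earth_distance(), which is in meters when earth() keeps its default radius, so it also reveals whether the threshold is in the unit you expect:

```sql
-- Preview: same-named cities in the same country within the threshold.
select
  c1.id as higher_id,
  c2.id as lower_id,
  c1.country,
  c1.name,
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) as distance
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.id > c2.id and   -- report each pair once instead of twice
  c1.country = c2.country and
  c1.name = c2.name and
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35
order by
  c1.country, c1.name, distance;
```

Pairs such as Kansas City, Kansas, and Kansas City, Missouri, would show up here if they fall within the threshold, which gives a chance to exclude them by hand.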

score 0 · Answer 3 · answered Apr 28 '11 at 10:34

If your data was imported from CSV files by code (PHP, for example), you can prevent duplicate entries with a condition in that code: if the city being inserted already exists, skip the current record and continue the loop with the next one.

Try this if you import data into the database this way.

Thanks.