1

I appreciate that questions similar to this one have been asked on here before but I have thus far been unable to implement the answers provided into my code both because of wanting to distinguish duplicates in one column only whilst the other stays the same and the INNER JOIN in my code. The INNER JOIN is problematic because most of the provided answers use the PARTITION function and, being a novice with SQL, I do not know how to integrate this with it. Advice just on using INNER JOIN with PARTITION would be useful.

Whilst I could do this post-export in Python (where I will be using the desired output), this code currently outputs ~2 million rows, making it time-consuming to work with and check. Here is the code:

SELECT client_ip_address, language_enum_code
FROM vw_user_session_log AS usl
INNER JOIN vw_user_topic_ownership AS uto
ON usl.user_id = uto.user_id

Using SELECT DISTINCT instead of SELECT gets me closer to the desired output but rather than leaving one duplicate row behind it removes all of them. Advice on using this function whilst preserving one of the duplicate rows would be preferred. I am on a read-only connection to the database so the DELETE FROM approach seen here would only be viable if I could make a temporary query-able table from the query output which I don't think is possible and seems clumsy.

Raw data sample:

user_id:    client_ip_address:   language_enum_code:          (other stuff...)
    4          194:4:62:18              107
    2          101:9:23:34              14
    3          180:4:87:99              15
    3          194:4:62:18              15
    4          166:1:19:27              107
    2          166:1:19:27              14

Desired result:

user_id:    client_ip_address:   language_enum_code:          (other stuff...)
    4          194:4:62:18              107
    2          101:9:23:34              14
    3          180:4:87:99              15

As you can see, any id-enum combination should be filtered to occur only once. The reason this is not any ip-enum combination is that multiple users can connect through the same IP address.

Community
  • 1
  • 1
Sadie LaBounty
  • 379
  • 1
  • 5
  • 23
  • provide sample of raw data and expected result please – Alex Mar 08 '17 at 13:33
  • Done and still looking for ideal answer. – Sadie LaBounty Mar 08 '17 at 15:29
  • If I understand correctly, you want a unique user_id, language_enum_code with an arbitrary client_ip_address for that combination? Because in your 'desired result' you have thrown away data, for example "166:1:19:27". How did you choose which IP addresses to keep and which to throw away? – Matt Mar 08 '17 at 15:39
  • Once this is exported I will be running the IP's through a database that converts them into countries. It is unlikely that the same user will use the app from two different countries so I am neglecting the IP(s) that ends up being lost. – Sadie LaBounty Mar 08 '17 at 15:45
  • your raw data is not complete I guess. since you have 2 tables joined in your query, but only 1 table data presented. – Alex Mar 08 '17 at 15:48
  • And another question: why `4-194:4:62:18-107` is in desired result but `4-166:1:19:27-107` is not? what is criteria? – Alex Mar 08 '17 at 15:54
  • As explained in my previous comment, the IP is not relevant to the filter (so long as it is not null). – Sadie LaBounty Mar 08 '17 at 16:09
  • if it is not relevant - why it is in desired result? – Alex Mar 08 '17 at 16:11
  • It is not important which IP I get but it is important that I get AN IP. – Sadie LaBounty Mar 08 '17 at 16:14

2 Answers2

0

Do you simply want aggregation?

SELECT client_ip_address, GROUP_CONCAT(DISTINCT language_enum_code)
FROM vw_user_session_log usl INNER JOIN
     vw_user_topic_ownership uto
     ON usl.user_id = uto.user_id
GROUP BY client_ip_address;

This will return one row per client_ip_address with each language code in a comma delimited list.

You can also use MIN() or MAX() to get an arbitrary value for language_enum_code for each client_ip_address.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

If you don't care which IP address you keep for each user_id / enum combo, then something like this should do:

SELECT user_id, min(client_ip_address), language_enum_code
FROM vw_user_session_log AS usl
INNER JOIN vw_user_topic_ownership AS uto
ON usl.user_id = uto.user_id
where client_ip_address is not null
group by user_id, language_enum_code
Matt
  • 3,303
  • 5
  • 31
  • 53
  • That works, but produces a new problem. There are null IP values that must be filtered out before this filter. I don't know where to put the 'WHERE IP != NULL' statement - if you can help me here this will be the answer. – Sadie LaBounty Mar 08 '17 at 16:03
  • Done, try that now – Matt Mar 08 '17 at 16:28
  • Great! Do you happen to know why != NULL doesn't work? – Sadie LaBounty Mar 08 '17 at 16:36
  • It's a syntax thing, some databases will accept != NULL, but because NULL is not a value, rather it's the absence of any value, it is treated differently by different databases. I just learned from experience that if the database will accept the "is not null" syntax, it's generally the safest bet. Similarly for the opposite "is null". – Matt Mar 08 '17 at 16:57