1

My query returns the following in a list:

"Alex";"275467125";"2015-02-03 02:55:36-05";"1"
"Alex";"275467125";"2015-01-13 02:09:39-05";"1"
"Alex";"275467125";"2015-01-05 04:13:35-05";"1"
"Alex";"275467125";"2014-12-27 04:55:47-05";"1"
"Alex";"275467125";"2014-12-27 04:54:52-05";"1"
"Alex";"275467125";"2014-12-07 03:13:24-05";"1"
"Alex";"275467125";"2014-12-04 03:34:56-05";"1"
"Alex";"275467125";"2014-12-02 04:16:33-05";"1"
"Ali";"275464747";"2016-02-17 10:52:12-05";"2"
"Alladin";"275467455";"2016-03-13 06:51:52-04";"2"
"Alladin";"275467455";"2016-03-13 06:51:47-04";"2"
"Anna";"275467401";"2016-03-26 03:56:41-04";"1"
"Anna";"275467401";"2016-03-26 03:55:21-04";"1"
"Anna";"275467401";"2016-03-21 23:04:28-04";"1"
"Anna";"275467401";"2016-02-12 13:24:44-05";"1"
"Anna";"275467401";"2015-12-03 08:20:35-05";"1"
"Anna";"275467401";"2015-11-09 04:18:27-05";"1"
"Anna";"275467401";"2015-11-09 04:11:59-05";"1"
"Anna";"275467401";"2015-09-13 21:27:12-04";"1"

I want to create a dictionary of person's name with the oldest record they have. I've figured out:

oldestlist = {d[0]:d[2] for d in records}

This returns a correct answer but my worry is that if I am presented a list that is not formatted in a descending order of date/time it will not provide the correct answer. What is the best way to create a dictionary with a name and the oldest date?

4 Answers4

2

You don't need to sort any data, just use a defaultdict and check the current date vs any new date and update accordingly:

s = """"Alex";"275467125";"2015-02-03 02:55:36-05";"1"
"Alex";"275467125";"2015-01-13 02:09:39-05";"1"
"Alex";"275467125";"2015-01-05 04:13:35-05";"1"
"Alex";"275467125";"2014-12-27 04:55:47-05";"1"
"Alex";"275467125";"2014-12-27 04:54:52-05";"1"
"Alex";"275467125";"2014-12-07 03:13:24-05";"1"
"Alex";"275467125";"2014-12-04 03:34:56-05";"1"
"Alex";"275467125";"2014-12-02 04:16:33-05";"1"
"Ali";"275464747";"2016-02-17 10:52:12-05";"2"
"Alladin";"275467455";"2016-03-13 06:51:52-04";"2"
"Alladin";"275467455";"2016-03-13 06:51:47-04";"2"
"Anna";"275467401";"2016-03-26 03:56:41-04";"1"
"Anna";"275467401";"2016-03-26 03:55:21-04";"1"
"Anna";"275467401";"2016-03-21 23:04:28-04";"1"
"Anna";"275467401";"2016-02-12 13:24:44-05";"1"
"Anna";"275467401";"2015-12-03 08:20:35-05";"1"
"Anna";"275467401";"2015-11-09 04:18:27-05";"1"
"Anna";"275467401";"2015-11-09 04:11:59-05";"1"
"Anna";"275467401";"2015-09-13 21:27:12-04";"1"
"""

import  csv
from collections import defaultdict

d = defaultdict(str)
for name,_, date, _ in csv.reader(s.splitlines(), delimiter=";"):
    if not d[name] or d[name] > date:
        d[name] = date


from pprint import pprint as pp

pp(dict(d))

Output:

 {'Alex': '2014-12-02 04:16:33-05',
 'Ali': '2016-02-17 10:52:12-05',
'Alladin': '2016-03-13 06:51:47-04',
'Anna': '2015-09-13 21:27:12-04'}

because the dates are in the y-m-d time format it is safe to do a lexicographical comparison.

Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • But surely this only works if the input is ordered... and that was the concern in the original post? – roganjosh Sep 27 '16 at 21:30
  • @roganjosh, why would it need to be ordered? we only update the users value if the date was newer. – Padraic Cunningham Sep 27 '16 at 21:30
  • @PadraicCunningham, I'm sorry, but don't you think that timezones (-04, -05, at least they look like timezones) will interfere? – Pavel Gurkov Sep 27 '16 at 21:33
  • I'm reluctant to challenge you but you appear to display the opposite of what the OP expects (they want the _oldest_ record, while you display the newest) and they are concerned that their results will not be returned in chronological order, so they want some level of sorting - "This returns a correct answer but my worry is that if I am presented a list that is not formatted in a descending order of date/time it will not provide the correct answer.". So it works fine with their current data, but only because that happens to be sorted. – roganjosh Sep 27 '16 at 21:35
  • @roganjosh, I had it backwards initially, I edited before you comment to get the oldest, sorting makes no difference at all, Where are you getting the notion that the order is relevant, if I did `max([1,3,1,2,5,1])` would I not get the max value? – Padraic Cunningham Sep 27 '16 at 21:39
  • @PadraicCunningham I need to go away and test `if not d[name] or d[name] > date: d[name] = date` to check my understanding :) – roganjosh Sep 27 '16 at 21:42
  • @roganjosh, it is basically the same as setting `mx = None` then `for ele in some_iterable:if ele is None or ele > mx: mx = ele`, we don't care about order as we are guaranteed to find the max/greatest value – Padraic Cunningham Sep 27 '16 at 21:43
  • 1
    Upvote, you taught me something and it's cleaner than my suggestion by far :) – roganjosh Sep 27 '16 at 21:48
  • @PavelGurkov, there may be some edge cases but I don't think so, we are comparing by year-m-day time first, we would have to have matching values for everything up to the second before we would even hit the offset and if we did we would be in the same offset. Worst case we could consider the offset and still do it in O(n) time – Padraic Cunningham Sep 27 '16 at 21:49
  • 1
    @roganjosh, no worries. – Padraic Cunningham Sep 27 '16 at 21:50
  • @PadraicCunningham, what I meant is that in the sample dataset there are records with different offsets for the same user. So while lexicographical ordering works for this dataset, it may not be correct for any, for example for dates `2016-10-10 10:00:00-04` and `2016-10-10 09:00:00-06` (2PM vs 3PM UTC) – Pavel Gurkov Sep 27 '16 at 22:01
  • @PavelGurkov, the user would have to be in two different places with different offsets, we are not comparing different users against each other, we are comparing each users own dates. As I mentioned we can take the offset into account anyway if needed but until the OP replies with more info then we are only guessing. – Padraic Cunningham Sep 27 '16 at 22:12
  • @PadraicCunningham I really like this answer and you are right I'm only looking for the posters first post date which then gets written to a field in my database table. I have other processes and functions that do other data time analysis on my data. – Echelon_One Sep 29 '16 at 15:14
1

It was a bit frustrating to get your given "list" into an actual list format. If you can't deal with this task in the query itself, you could try:

from itertools import groupby
from operator import itemgetter

lst = '''"Alex";"275467125";"2015-02-03 02:55:36-05";"1",
"Alex";"275467125";"2015-01-13 02:09:39-05";"1",
"Alex";"275467125";"2015-01-05 04:13:35-05";"1",
"Alex";"275467125";"2014-12-27 04:55:47-05";"1",
"Alex";"275467125";"2014-12-27 04:54:52-05";"1",
"Alex";"275467125";"2014-12-07 03:13:24-05";"1",
"Alex";"275467125";"2014-12-04 03:34:56-05";"1",
"Alex";"275467125";"2014-12-02 04:16:33-05";"1",
"Ali";"275464747";"2016-02-17 10:52:12-05";"2",
"Alladin";"275467455";"2016-03-13 06:51:52-04";"2",
"Alladin";"275467455";"2016-03-13 06:51:47-04";"2",
"Anna";"275467401";"2016-03-26 03:56:41-04";"1",
"Anna";"275467401";"2016-03-26 03:55:21-04";"1",
"Anna";"275467401";"2016-03-21 23:04:28-04";"1",
"Anna";"275467401";"2016-02-12 13:24:44-05";"1",
"Anna";"275467401";"2015-12-03 08:20:35-05";"1",
"Anna";"275467401";"2015-11-09 04:18:27-05";"1",
"Anna";"275467401";"2015-11-09 04:11:59-05";"1",
"Anna";"275467401";"2015-09-13 21:27:12-04";"1"'''

broken_list = lst.split(',')
stripped = [item.replace('\n', '') for item in broken_list]

rebuilt = []
for line in stripped:
    line = line.split(';')
    rebuilt.append([item.strip('"') for item in line])

# Now actually sorting this        
grouped = []
for key, group in groupby(rebuilt, key=itemgetter(0)):
    grouped.append(list(group))

sort_grouped = [sorted(item, key=itemgetter(2)) for item in grouped]
#sort_grouped = 

oldestlist = {d[0][0]:d[0][2] for d in sort_grouped}
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
roganjosh
  • 12,594
  • 4
  • 29
  • 46
0

Since you need last record for each name instead of doing it explicitly with the dict make your queryset to perform GROUP BY on name. In Django you can do that using .annotate as is mentioned here: Django Orm get latest for each group

Hence, your querset should be like:

YourModel.objects.values('name_column').annotate(latest_date=Max('date'))

Additional piece of information, you should be using order_by(-your_date_column) with your queryset in order to ensure data is always returned in desc order where - ensure desc order when list is needed.

Community
  • 1
  • 1
Moinuddin Quadri
  • 46,825
  • 13
  • 96
  • 126
0

I was pretty close. The answer I found that works best was a tweak of my original code but using the sorted() function.

For the newest I'd do:

newestlist = {d[0]:d[2] for d in sorted(records)}

For the oldest I'd do:

oldestlist = {d[0]:d[2] for d in sorted(records, reverse=True)}

Thanks to everyone who answered. I'll keep in mind the django references for if I use a queryset.

  • But have you checked what this sorts by? Muddle your records up, what makes it sort by date and not name? – roganjosh Sep 27 '16 at 21:58
  • After I started testing further I realized if my records were muddled to begin with I'd run into this exact issue. – Echelon_One Sep 29 '16 at 15:09