I am looking to group similar items in a list based on the first three characters in the string. For example:

test = ['abc_1_2', 'abc_2_2', 'hij_1_1', 'xyz_1_2', 'xyz_2_2']

How can I group the above list items into groups based on the first grouping of letters (e.g. 'abc')? The following is the intended output:

output = {1: ('abc_1_2', 'abc_2_2'), 2: ('hij_1_1',), 3: ('xyz_1_2', 'xyz_2_2')}

or

output = [['abc_1_2', 'abc_2_2'], ['hij_1_1'], ['xyz_1_2', 'xyz_2_2']]

I have tried using itertools.groupby to accomplish this without success:

>>> import os, itertools
>>> test = ['abc_1_2', 'abc_2_2', 'hij_1_1', 'xyz_1_2', 'xyz_2_2']
>>> [list(g) for k.split("_")[0], g in itertools.groupby(test)]
[['abc_1_2'], ['abc_2_2'], ['hij_1_1'], ['xyz_1_2'], ['xyz_2_2']]

I have looked at the following posts without success:

How to merge similar items in a list. The example groups similar items (e.g. 'house' and 'Hose') using an approach that is overly complicated for my example.

How can I group equivalent items together in a Python list?. This is where I found the idea for the list comprehension.

ekad
Borealis

1 Answer

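The .split("_")[0] part should be inside a single-argument function that you pass as the second argument (the key function) to itertools.groupby. groupby calls that function on every item and puts consecutive items with the same result into one group.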

>>> import itertools
>>> test = ['abc_1_2', 'abc_2_2', 'hij_1_1', 'xyz_1_2', 'xyz_2_2']
>>> [list(g) for _, g in itertools.groupby(test, lambda x: x.split('_')[0])]
[['abc_1_2', 'abc_2_2'], ['hij_1_1'], ['xyz_1_2', 'xyz_2_2']]
>>>

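Having it in the for ... part does nothing useful: there it acts as an assignment target, so each key is written into a temporary list that is immediately discarded. groupby never sees it, and without a key function it groups items by their full value, which is why every string ended up in its own group.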
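Both output shapes requested in the question can be built from the same groupby call. The following is a minimal sketch, assuming the input is already ordered so that items sharing a prefix sit next to each other; the names `prefix`, `as_lists`, and `as_dict` are illustrative and not part of the original post:

```python
import itertools

test = ['abc_1_2', 'abc_2_2', 'hij_1_1', 'xyz_1_2', 'xyz_2_2']

def prefix(item):
    # Key function (hypothetical helper): text before the first underscore.
    return item.split('_')[0]

# List-of-lists form: one inner list per run of equal prefixes.
as_lists = [list(g) for _, g in itertools.groupby(test, prefix)]

# Dict form: 1-based group numbers mapped to tuples of items.
as_dict = {
    n: tuple(g)
    for n, (_, g) in enumerate(itertools.groupby(test, prefix), start=1)
}

print(as_lists)
print(as_dict)
```

The dict keys here are just running group numbers from `enumerate`; keying the dict by the prefix itself would work the same way by keeping the first element of each pair instead of discarding it.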


Also, it would be slightly more efficient to use str.partition when you only want a single split:

[list(g) for _, g in itertools.groupby(test, lambda x: x.partition('_')[0])]

Demo:

>>> from timeit import timeit
>>> timeit("'hij_1_1'.split('_')")
1.3149855638076913
>>> timeit("'hij_1_1'.partition('_')")
0.7576401470019234
>>>

This isn't a major concern as both methods are pretty fast on small strings, but I figured I'd mention it.
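To make the difference between the two methods concrete, here is a small sketch of what each one returns. The helper names `head_split`, `head_split_once`, and `head_partition` are made up for illustration; `split` with `maxsplit=1` is included only as a third option for comparison and is not something the answer above measured:

```python
def head_split(s):
    # Splits on every underscore, then keeps only the first piece.
    return s.split('_')[0]

def head_split_once(s):
    # Stops after the first underscore (maxsplit=1).
    return s.split('_', 1)[0]

def head_partition(s):
    # partition always returns a 3-tuple: (head, separator, tail).
    return s.partition('_')[0]

s = 'hij_1_1'
print(s.split('_'))      # every piece of the string
print(s.partition('_'))  # head, separator, and the untouched remainder

# All three helpers agree, including on strings with no underscore.
for sample in ['hij_1_1', 'abc_2_2', 'nounderscore']:
    assert head_split(sample) == head_split_once(sample) == head_partition(sample)
```

Because the helpers return the same prefix, any of them can be used as the key function for `itertools.groupby` without changing the grouping.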

  • 1
    Thanks, this works great. I recently found that it is a good practice to make sure the input list is sorted e.g. `test = sorted(['abc_1_2', 'abc_2_2', 'hij_1_1', 'xyz_1_2', 'xyz_2_2'])`. Otherwise, if the input list is not sorted, `itertools.groupby` will not work as expected. – Borealis Dec 31 '14 at 20:29
  • 1
    Yes, sorting the list first is a good practice when using `itertools.groupby`. This is because `groupby` only groups consecutive items that share a key, so items with the same key that are not adjacent end up in separate groups. I didn't bother mentioning this in my post though because the main focus was on how to use `groupby` and also your list was already sorted. –  Dec 31 '14 at 20:32
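The point made in the comments can be shown with a shuffled copy of the list. This is a sketch only: the `shuffled` list and the `prefix` helper are invented for the example, and the `defaultdict` variant is an alternative that was not proposed in the thread, shown because it does not need sorted input:

```python
import itertools
from collections import defaultdict

def prefix(item):
    # Key function (hypothetical helper): text before the first underscore.
    return item.partition('_')[0]

# Same items as in the question, deliberately out of order.
shuffled = ['xyz_1_2', 'abc_1_2', 'hij_1_1', 'abc_2_2', 'xyz_2_2']

# Without sorting, no two neighbours share a prefix, so every item
# becomes its own group.
naive = [list(g) for _, g in itertools.groupby(shuffled, prefix)]

# Sorting by the same key function makes equal prefixes adjacent.
by_sorting = [
    list(g)
    for _, g in itertools.groupby(sorted(shuffled, key=prefix), prefix)
]

# Alternative without sorting: collect items into per-prefix buckets.
buckets = defaultdict(list)
for item in shuffled:
    buckets[prefix(item)].append(item)
by_buckets = list(buckets.values())

print(naive)
print(by_sorting)
print(by_buckets)
```

Sorting with `key=prefix` rather than by the whole string keeps the sort consistent with the grouping key. The bucket version keeps groups in order of first appearance instead of sorted order.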