counting unique values and ignoring numbers

Question

I have lists of elements that look like something below:

lst=['A1','A2','B','B','B','B']

I want to write Python code whose output looks like a chemical formula: AB2

So if there are numbers, it should ignore them and treat A1 and A2 as A, then remove the repetition proportionally.

I tried this, but it is definitely not correct:

output=''.join(np.unique(lst))

Can you please explain the rule for the `['A1','A2','B','B','B','B']` -> `'AB2'` transformation? — Guru Stron, Mar 07 '23 at 22:04
Unrelated but important note: do not name a variable `list`. `list` is the name of the function that creates lists; naming a list `list` will cause confusion and bugs, even in small tests and examples. Name it `my_list` if you don't have a better, more descriptive name for it. — scotscotmcc, Mar 07 '23 at 22:05

Pranav Hosangadi · Accepted Answer · 2023-03-07T22:25:28.830

First, you want to strip out all numbers from the list of elements. You can do this using a regular expression. The regular expression I use here matches a run of letters (A-Z and a-z) at the start of each item.

import re

elements = []
for item in lst:
    elem = re.match(r"([A-Za-z]+)", item).group(0)
    elements.append(elem)

(or you can write the loop as a list comprehension)

elements = [re.match(r"([A-Za-z]+)", item).group(0) for item in lst]

which gives:

elements = ['A', 'A', 'B', 'B', 'B', 'B']
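One caveat: re.match returns None when an item does not start with a letter, so calling .group(0) on it raises an AttributeError. Below is a minimal sketch of a guarded version. The helper name strip_digits is made up for illustration, and returning None for such items is just one possible choice.

```python
import re

def strip_digits(item):
    """Return the leading run of letters in item, or None if there is none."""
    # strip_digits is a hypothetical helper, not part of the answer above.
    match = re.match(r"[A-Za-z]+", item)
    return match.group(0) if match else None

lst = ['A1', 'A2', 'B', 'B', 'B', 'B']
elements = [strip_digits(item) for item in lst]
print(elements)  # ['A', 'A', 'B', 'B', 'B', 'B']
```

If your data can contain items like '12', you would then decide whether to skip the None entries or raise an error.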

Next, you want to count how many of each element are in the list. You can do this using collections.Counter:

import collections

element_counts = collections.Counter(elements)

which gives:

element_counts = Counter({'A': 2, 'B': 4})

Note: you can combine this step with the previous step. This way, you avoid creating the elements list and you only need one iteration over all items in your original list instead of two (one to create elements, one to count them all):

element_counts = collections.Counter(re.match(r"([A-Za-z]+)", item).group(0) for item in lst)

Now, you need to figure out the greatest common divisor (GCD) of all the values in the counter. Happily, math.gcd is part of the standard library. Since GCD is associative, we can find the GCD of more than two numbers using functools.reduce. (In Python 3.9+, math.gcd accepts any number of arguments, so reduce is not needed.)

import functools
import math

gcd = functools.reduce(math.gcd, element_counts.values())

# Or for Py3.9+
gcd = math.gcd(*element_counts.values())

For our element_counts, we get gcd = 2
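To see why reduce works here, this small sketch checks the associativity claim on three made-up counts (the values 2, 4 and 6 are only an illustration, not from the question):

```python
import functools
import math

# Hypothetical counts for three elements, chosen only for illustration.
counts = [2, 4, 6]

# Grouping the pairwise GCDs either way gives the same result...
left = math.gcd(math.gcd(2, 4), 6)
right = math.gcd(2, math.gcd(4, 6))

# ...which is why folding math.gcd over the whole list is safe.
overall = functools.reduce(math.gcd, counts)

print(left, right, overall)  # 2 2 2
```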

Finally, divide the count of each element by the GCD, and join it into a single string:

compound_string = "".join([f"{elem}{count//gcd}" for elem, count in element_counts.items()])

which gives compound_string = 'A1B2'. Oops! Elements with a single atom don't need a number. Let's handle that by writing a function that does the formatting, instead of a list comprehension with an f-string:

def elem_to_str(elem, count):
    if count == 1:
        return elem
    return f"{elem}{count}"

compound_string = "".join(elem_to_str(elem, count//gcd) for elem, count in element_counts.items())

Finally, we have our desired output: compound_string = 'AB2'
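Putting the steps together, here is one way it might look as a single function. This is a sketch under a few assumptions: the name compound_formula is made up, every item starts with at least one letter, and an empty list should give an empty string.

```python
import collections
import functools
import math
import re

def compound_formula(items):
    """Build a reduced formula string such as 'AB2' from a list of labels."""
    # Count elements by their leading letters, ignoring any trailing digits.
    # Assumes every item starts with a letter (re.match would return None otherwise).
    counts = collections.Counter(
        re.match(r"[A-Za-z]+", item).group(0) for item in items
    )
    if not counts:
        return ""  # assumption: an empty input gives an empty formula
    divisor = functools.reduce(math.gcd, counts.values())
    # Omit the number when the reduced count is 1.
    return "".join(
        elem if count // divisor == 1 else f"{elem}{count // divisor}"
        for elem, count in counts.items()
    )

print(compound_formula(['A1', 'A2', 'B', 'B', 'B', 'B']))  # AB2
```

Elements appear in the output in the order they first occur in the input, because Counter keeps insertion order.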

score 0 · Answer 2 · answered Mar 07 '23 at 22:17

Break it down into steps:

1. Extract the keys such as A, B and C

For this you can just use x[0] to take the first character of the string x (this assumes every key is a single letter).

2. Count up the number of repeats

The easiest way for beginners is to create an empty dictionary and then loop through the keys. For each key, if it is not present in the dictionary, add it with a count of 0. Then add 1 to its count.

There are faster ways with the Python Counter class, but don't worry about that for now.

3. Plan your "proportionate reduction"

Calculate the greatest common divisor of all the character counts

4. Divide the character counts by the greatest common divisor

import math

lst=['A1','A2','B','B','B','B']

keys = [x[0] for x in lst]

dct = {}
for key in keys:
    if key not in dct:
        dct[key]=0
    dct[key]+=1

counts = list(dct.values())

divisor = counts[0]
for count in counts:
    divisor = math.gcd(divisor, count)

for key,count in dct.items():
    print(key, count // divisor)
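The loop above prints each key next to its reduced count. If you want a single string like AB2 instead, something along these lines should work. It is only a sketch: formula_from_counts is a made-up name, and it expects a dictionary shaped like dct above.

```python
import math

def formula_from_counts(dct):
    """Join a {key: count} dictionary into a reduced formula string."""
    # formula_from_counts is a hypothetical helper for illustration.
    # Starting from 0 works because math.gcd(0, n) == n.
    divisor = 0
    for count in dct.values():
        divisor = math.gcd(divisor, count)

    parts = []
    for key, count in dct.items():
        reduced = count // divisor
        # Leave the number off when the reduced count is 1.
        parts.append(key if reduced == 1 else f"{key}{reduced}")
    return "".join(parts)

print(formula_from_counts({'A': 2, 'B': 4}))  # AB2
```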
