0

I have a list of reference IDs which are alphanumeric. They have 3 digits which are zero-padded to the left, followed by a letter, followed by 3 more digits, again, zero padded to the left.

eg.

original_ref_list = ["005a004",
                     "018b003",
                     "007a029",
                     "105a015"]

As you can see, both sets of digits are padded with zeros. I want to get the same references without the zero padding on either side of the letter, but not to remove all zeros.

eg.

fixed_ref_list = ["5a4",
                  "18b3",
                  "7a29",
                  "105a15"]

I can do this by searching for three regex patterns, combining the results and appending this to a list:

import re

fixed_ref_list = list()
for i in original_ref_list:
    first_refpat = re.compile(r'[1-9]\d*[a-z]\d+')
    # search the current reference string
    first_refpatiter = first_refpat.finditer(i)
    for first_ref_find in first_refpatiter:
        first_ref = first_ref_find.group()
        second_refpat = re.compile(r'[a-z]\d+')
        second_refpatiter = second_refpat.finditer(first_ref)
        for second_ref_find in second_refpatiter:
            second_ref = second_ref_find.group()[1:]
            third_refpat = re.compile(r'[1-9]\d*')
            third_refpatiter = third_refpat.finditer(second_ref)
            for third_ref_find in third_refpatiter:
                third_ref = third_ref_find.group()
    fixed_ref_list.append(first_ref[:-len(second_ref)] + third_ref)

But this seems like an awkward solution. Is there a built-in way to return only part of a regex match, or to remove the padding before returning the result? Alternatively, is there a less messy way to do what I want?

AdeDoyle
  • Are there always two digits in your string? e.g `005a` and `004`, or `018b` and `003`? – Andrej Kesely Jul 06 '20 at 16:00
  • Not always, I've boiled the problem down for the sake of asking the question. I can except the type of references that aren't in this format and treat them differently. – AdeDoyle Jul 06 '20 at 16:20
  • Yes, add them to question please, along with the expected output. – Andrej Kesely Jul 06 '20 at 16:22
  • They're not necessary. I only want to know how to clean the part of the code that deals with this type. The two answers below work very nicely. – AdeDoyle Jul 06 '20 at 16:50

3 Answers

1

You can just group your matches using parentheses like this:

re.match('([0-9a-f]{3})([0-9a-f])([0-9a-f]{3})', '005a004').groups()
> ('005', 'a', '004')

Now you have a tuple to work with. To remove the leading zeros, match them with the pattern ^0+, where ^ anchors the match to the beginning of the string, and replace them with an empty string '':

re.sub('^0+', '', '004')
> '4'

That should give you all you need to make this more compact and readable.
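As a rough sketch of how the two steps above might be combined, the snippet below groups the reference into three parts and strips the leading zeros from each numeric part. The helper name `strip_padding` and the stricter pattern (digits, one lowercase letter, digits) are assumptions made for illustration, as is the choice to keep a single `0` when a field is all zeros.

```python
import re

# Assumed format: three digits, one lowercase letter, three digits.
REF_PATTERN = re.compile(r'(\d{3})([a-z])(\d{3})')


def strip_padding(ref):
    """Hypothetical helper: remove zero padding on both sides of the letter."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        # References in some other format are returned unchanged.
        return ref
    left, letter, right = match.groups()
    # ^0+ removes leading zeros; "or '0'" keeps one zero if the field was "000".
    left = re.sub(r'^0+', '', left) or '0'
    right = re.sub(r'^0+', '', right) or '0'
    return left + letter + right


original_ref_list = ["005a004", "018b003", "007a029", "105a015"]
fixed_ref_list = [strip_padding(ref) for ref in original_ref_list]
print(fixed_ref_list)
```

Returning unmatched references unchanged is only one option; raising an error may suit cases where every reference is expected to follow the format.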

devsnd
1

Using a list comprehension

fixed_ref_list = [str(int(x[:3])) + x[3] + str(int(x[4:])) for x in original_ref_list]

Result

print(fixed_ref_list)

Output

['5a4', '18b3', '7a29', '105a15']

Explanation

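Assuming each numeric field contains only the digits 0-9, int(...) drops the leading zeros and str(...) turns the result back into text. The slices x[:3], x[3] and x[4:] rely on the letter always being the fourth character.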
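If the numeric fields are not always exactly three digits wide, the same int(...) idea could be paired with a regex that finds the letter instead of fixed slices. This is only a sketch: the function name `unpad` and the assumption that a reference is digits, one lowercase letter, then digits are not part of the original answer.

```python
import re


def unpad(ref):
    """Hypothetical helper: strip zero padding using int() on each numeric field."""
    # Assumed format: one or more digits, one lowercase letter, one or more digits.
    match = re.fullmatch(r'(\d+)([a-z])(\d+)', ref)
    if match is None:
        raise ValueError(f"unexpected reference format: {ref!r}")
    left, letter, right = match.groups()
    # int() discards leading zeros; an all-zero field becomes 0.
    return f"{int(left)}{letter}{int(right)}"


original_ref_list = ["005a004", "018b003", "007a029", "105a015"]
fixed_ref_list = [unpad(ref) for ref in original_ref_list]
print(fixed_ref_list)
```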

DarrylG
1

Just use the following pattern "0+ and substitute it with ". See demo.

Be careful, as you have not said what you want to happen in the last case here.

If you want an all-zero hex number such as "00000" to become "0", you can use

"0*([0-9a-fA-F]+)"

as shown here.
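A minimal sketch of this substitution in Python, with one deliberate change: the hex character class is swapped for \d. The letters in the question's references (a, b) are themselves hex digits, so the hex class would swallow the letter and leave the zeros after it in place. The replacement string r'\1' keeps only the captured digits.

```python
import re

original_ref_list = ["005a004", "018b003", "007a029", "105a015"]

# 0* consumes the padding; (\d+) must keep at least one digit,
# so a field of "000" collapses to "0" rather than disappearing.
fixed_ref_list = [re.sub(r'0*(\d+)', r'\1', ref) for ref in original_ref_list]
print(fixed_ref_list)
```

Because the pattern is applied to every run of digits, it also handles references with more or fewer than two numeric fields.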

Luis Colorado