Regex for capturing digits in a string (Python)

Question

I'm new to Python and I'm processing a text file with regular expressions to extract IDs and append them to a list. I wrote the Python below intending to construct a list that looks like this:

["10073710","10074302","10079203","10082213"...and so on]

Instead I'm seeing a list whose entries include a bunch of verbose tags. I'm assuming this is normal behavior and that the finditer function adds these tags when it finds matches. But the result is a bit messy and I'm not sure how to turn off or delete the added tags.

Can anyone please help me modify the code below so I can achieve the intended structure for the list?

import re

#create a list of strings
company_id = []

#open file contents into a variable
company_data = open(r'C:\Users\etherealessence\Desktop\company_data_test.json', 'r', encoding="utf-8")

#read the line structure into a variable
line_list = company_data.readlines()

#stringify the contents so regex operations can be performed
line_list = str(line_list)

#close the file
company_data.close()

#assign the regex pattern to a variable
pattern = re.compile(r'"id":([^,]+)')

#find all instances of the pattern and append the list
#https://stackoverflow.com/questions/12870178/looping-through-python-regex-matches
for id in re.finditer(pattern, line_list): 
  #print(id)
  company_id.append(id)

#test view the list of company id strings
#print(line_list)
print(company_id)

Regex is a terrible solution here. You probably want something like JSON Path to query the structure directly. — jpmc26, May 27 '19 at 23:37

dmitryro · Answer 1 · 2019-05-12T00:47:51.103

1

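To get the value, use id.group(1):

for id in re.finditer(pattern, line_list):
  company_id.append(id.group(1))

When you append just id, you're storing the re.Match object rather than the matched text. Note that id.string would not give the value either: it returns the whole input string that was searched.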

edited May 12 '19 at 00:47

answered May 12 '19 at 00:40


Emma · Answer 2 · 2019-05-12T00:47:50.467

If your data is in JSON, you might just want to simply parse it.
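A minimal sketch of the parsing route, assuming the file holds valid JSON in which each company is an object with an "id" key somewhere in a nested structure. The sample data and the helper name collect_ids are hypothetical, since the question does not show the file's layout.

```python
import json

def collect_ids(node, found=None):
    """Recursively gather every value stored under an "id" key (hypothetical helper)."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "id":
                found.append(str(value))
            else:
                collect_ids(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_ids(item, found)
    return found

# Hypothetical sample; with a real file, json.load(file_object) would replace json.loads
sample = '{"companies": [{"id": "10073710", "name": "A"}, {"id": 10074302, "name": "B"}]}'
company_id = collect_ids(json.loads(sample))
print(company_id)
```

Converting each value with str() keeps the result a list of strings whether the file stores the IDs as numbers or as quoted text.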

If you wish to use a regular expression, you can simplify your expression and use three capturing groups to get the desired ID more easily. Put one capturing group on each side of the ID, and the middle capturing group then holds the ID itself, for example:

("id":")([0-9]+)(")


Python Testing

import re

# \x22 is the hex escape for a double quote, so this matches "id":"<digits>"
regex = r"(\x22id\x22:\x22)([0-9]+)(\x22)"

test_str = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"

for match_num, match in enumerate(re.finditer(regex, test_str), start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Groups are numbered from 1; group 2 holds the digits of the ID
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

Python Test

# -*- coding: UTF-8 -*-
import re

string = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"
expression = r'(\x22id\x22:\x22)([0-9]+)(\x22)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match  ")
else: 
    print(' Sorry! No matches!')

Output:

YAAAY! "10480132" is a match
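re.search stops at the first match. To collect every ID into a list, one option is re.findall with a single capturing group, which returns the captured strings directly. The sketch below assumes the IDs appear as quoted digit strings, as in the test string above; the shortened sample text is made up for illustration.

```python
import re

# Shortened, hypothetical sample in the same shape as the test string above
text = 'x"id":"10073710",y"id":"10074302",z"id":"10079203"'

# With exactly one capturing group, findall returns a list of that group's text
company_id = re.findall(r'"id":"([0-9]+)"', text)
print(company_id)
```

With more than one capturing group, findall returns a list of tuples instead, which is why this version keeps only the group around the digits.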

score 1 · Accepted Answer · answered May 12 '19 at 00:42

re.finditer returns an iterator of re.Match objects.

If you want to extract the actual match (and more specifically, the captured group, to get rid of the leading "id":), you can do something like this:

for match in re.finditer(pattern, line_list):
    company_id.append(match.group(1))
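Put together with the pattern from the question, a sketch might look like the following. It assumes the IDs are unquoted numbers followed by a comma; the sample string is a made-up stand-in for the file contents. If the IDs are quoted in the real file, group(1) will include the quotes and the pattern would need adjusting.

```python
import re

# Made-up stand-in for the text read from the file
line_list = '{"id":10073710,"name":"A"},{"id":10074302,"name":"B"}'

pattern = re.compile(r'"id":([^,]+)')

company_id = []
for match in pattern.finditer(line_list):
    # match is an re.Match object: group(0) is the whole match,
    # group(1) is only the part captured by the parentheses
    company_id.append(match.group(1))

print(company_id)
```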
