I'm new to Python and I'm processing a text file with regular expressions to extract IDs and append them to a list. I wrote the code below, intending to build a list that looks like this:

["10073710","10074302","10079203","10082213"...and so on]

Instead, I get a list whose entries carry a lot of verbose extra information. I assume this is normal behavior and that finditer attaches this information whenever it finds a match, but the output is messy and I'm not sure how to turn it off or strip it out.

Can anyone please help me modify the code below so I can achieve the intended structure for the list?

import re

#create a list of strings
company_id = []

#open file contents into a variable
company_data = open(r'C:\Users\etherealessence\Desktop\company_data_test.json', 'r', encoding="utf-8")

#read the line structure into a variable
line_list = company_data.readlines()

#stringify the contents so regex operations can be performed
line_list = str(line_list)

#close the file
company_data.close()

#assign the regex pattern to a variable
pattern = re.compile(r'"id":([^,]+)')

#find all instances of the pattern and append the list
#https://stackoverflow.com/questions/12870178/looping-through-python-regex-matches
for id in re.finditer(pattern, line_list): 
  #print(id)
  company_id.append(id)

#test view the list of company id strings
#print(line_list)
print(company_id)
Emma
emalcolmb
Regex is a terrible solution here. You probably want something like JSON Path to query the structure directly. – jpmc26 May 27 '19 at 23:37

3 Answers

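To get the value, call group(1) on each match object:

for match in re.finditer(pattern, line_list):
  company_id.append(match.group(1))

re.finditer yields match objects, so appending the loop variable itself stores the object rather than the captured text. Note that the match object's string attribute would not help here: it holds the entire string that was searched, not the matched part.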
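As a quick check of what a match object exposes, here is a minimal sketch. The sample text is hypothetical and only stands in for the file contents, which the question does not show; the pattern is the one from the question.

```python
import re

# Hypothetical sample text standing in for the file contents.
sample = '{"id":10073710,"name":"a"},{"id":10074302,"name":"b"}'
pattern = re.compile(r'"id":([^,]+)')

first = pattern.search(sample)

whole_input = first.string   # the entire string that was searched
full_match = first.group(0)  # the text matched by the whole pattern
captured = first.group(1)    # the text matched by the first capturing group

print(whole_input == sample)
print(full_match)
print(captured)
```

Only group(1) yields the bare ID; group(0) still includes the leading "id": text.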

dmitryro

If your data is JSON, you might simply want to parse it.
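A minimal sketch of the parsing approach, using the standard json module. The structure of the real file is not shown in the question, so the sample data and the helper name collect_ids are assumptions made for illustration; the helper walks whatever nesting it finds and gathers every value stored under an "id" key.

```python
import json

def collect_ids(node, key="id"):
    """Recursively gather every scalar value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and not isinstance(v, (dict, list)):
                found.append(str(v))
            # Keep descending in case the value holds nested objects.
            found.extend(collect_ids(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_ids(item, key))
    return found

# Hypothetical sample; the real file's layout may differ.
sample = '[{"id": 10073710, "name": "a"}, {"id": "10074302", "info": {"id": 5}}]'
ids = collect_ids(json.loads(sample))
print(ids)
```

For a file on disk, the same helper could be fed the result of json.load(f) inside a with open(...) block instead of json.loads on a string.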


If you wish to use a regular expression, you can simplify the pattern by using three capturing groups: one for the text to the left of the ID, one for the text to the right, and a middle group that captures the ID itself. Assuming the IDs are quoted strings of digits, something like this expression may work:

("id":")([0-9]+)(") 

Python Testing

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\x22id\x22:\x22)([0-9]+)(\x22)"

test_str = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Group numbers start at 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Python Test

# -*- coding: UTF-8 -*-
import re

string = "some other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON datasome other JSON data'\"id\":\"10480132\"'>some other JSON data"
expression = r'(\x22id\x22:\x22)([0-9]+)(\x22)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match  ")
else: 
    print(' Sorry! No matches!')

Output:

YAAAY! "10480132" is a match 
Emma

re.finditer returns an iterator of re.Match objects.

If you want to extract the actual match (and more specifically, the captured group, to get rid of the leading "id":), you can do something like this:

for match in re.finditer(pattern, line_list):
    company_id.append(match.group(1))
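A runnable sketch of the loop above, plus a possible shortcut. The sample text is hypothetical and stands in for the stringified file contents; the pattern is the one from the question. When a pattern has exactly one capturing group, findall returns the text of that group for each match, so it produces the same list without an explicit loop.

```python
import re

# Hypothetical sample text standing in for the stringified file contents.
line_list = '{"id":10073710,"x":1},{"id":10074302,"x":2},{"id":10079203,"x":3}'
pattern = re.compile(r'"id":([^,]+)')

# Loop form: pull the first capturing group out of each match object.
company_id = [match.group(1) for match in pattern.finditer(line_list)]

# Shortcut: with exactly one capturing group, findall returns a list
# containing that group's text for every match.
company_id_findall = pattern.findall(line_list)

print(company_id)
print(company_id == company_id_findall)
```

If the IDs in the real file are wrapped in quotes, the captured text would include those quotes with this pattern, so the pattern may need adjusting.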
Matias Cicero