I'm trying to find a reliable regex pattern for parsing a PCI address from a listing in sysfs.

For example:

import re

s = """
# total 0
# drwxr-xr-x   7 root root    0 Mar 22 21:30 .
# drwxr-xr-x 121 root root    0 Mar 22 21:27 ..
# drwxr-xr-x   2 root root    0 Mar 22 21:27 0000:13:45.6:pcie001
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.5
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.6
# -r--r--r--   1 root root 4096 Mar 22 21:29 aer_dev_correctable
"""
pattern = r'SOME MAGIC'
list_of_addrs = re.findall(pattern, s, re.MULTILINE)

where I expect list_of_addrs = ['0000:13:45.6:pcie001', '0000:12:34.5', '0000:12:34.6']

The pattern I'm approximately trying to encode as a regular expression is:

# Starts with a set of 4 hex characters, [0-9a-fA-F]
# :
# Set of 2 hex characters
# :
# Set of 2 hex characters
# .
# 1 hex character
# Anything else, up to the next whitespace
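
One way to keep a pattern readable next to a rule list like the one above is `re.VERBOSE`, which lets each rule sit beside its piece of the regex. The following is only a sketch: the literal dot before the last digit is an assumption taken from the sample names, the field labels (domain, bus, device, function) are the usual PCI terms rather than anything stated in the question, and `PCI_ADDR` and `sample` are names made up for illustration.

```python
import re

# Each comment mirrors one rule from the list; whitespace inside the
# pattern is ignored because of re.VERBOSE.
PCI_ADDR = re.compile(
    r"""
    \b[0-9a-fA-F]{4}   # domain: 4 hex characters
    :
    [0-9a-fA-F]{2}     # bus: 2 hex characters
    :
    [0-9a-fA-F]{2}     # device: 2 hex characters
    \.                 # literal dot (assumed from the sample names)
    [0-9a-fA-F]        # function: 1 hex character
    \S*                # anything else, up to the next whitespace
    """,
    re.VERBOSE,
)

# A shortened copy of the listing from the question.
sample = "\n".join([
    "drwxr-xr-x   7 root root    0 Mar 22 21:30 .",
    "drwxr-xr-x   2 root root    0 Mar 22 21:27 0000:13:45.6:pcie001",
    "drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.5",
    "-r--r--r--   1 root root 4096 Mar 22 21:29 aer_dev_correctable",
])

found = PCI_ADDR.findall(sample)
print(found)
```

Because the pattern has no capturing group, `findall` returns the whole match for each hit.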
tarabyte
What data checking do you want your regex to do? From your example it appears you want to match all strings at the end of lines that begin and end with a digit and contain no whitespace. If that is sufficient, you could simply match [r" \d\S+\d$](https://regex101.com/r/acR2s4/1/). That's the problem with asking a question in terms of a single example. You need to specify the *rules* to apply to extract the information of interest. – Cary Swoveland Mar 23 '20 at 07:32
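
For illustration, the loose rule from this comment might be tried as below. This is a sketch rather than the commenter's exact code: a capturing group is added so the leading space is not part of the result, and `listing` and `loose` are names invented here.

```python
import re

# A shortened copy of the listing from the question.
listing = """\
total 0
drwxr-xr-x   7 root root    0 Mar 22 21:30 .
drwxr-xr-x   2 root root    0 Mar 22 21:27 0000:13:45.6:pcie001
drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.5
-r--r--r--   1 root root 4096 Mar 22 21:29 aer_dev_correctable
"""

# Last field of a line that starts and ends with a digit and has no
# whitespace in between. re.MULTILINE makes $ match at each line end.
loose = re.compile(r" (\d\S+\d)$", re.MULTILINE)

names = loose.findall(listing)
print(names)
```

This performs no validation of the address layout: any final field that begins and ends with a digit would be accepted.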

2 Answers

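Try the pattern r'\b(0{0,4}:\d{2}:\d{2}\.\d:?\w*)'. Note that it only matches an all-zero domain and decimal digits in the bus and device fields, which is enough for the sample listing but not for hex values such as 0000:3b:00.0.

Ex:

import re

s = """
# total 0
# drwxr-xr-x   7 root root    0 Mar 22 21:30 .
# drwxr-xr-x 121 root root    0 Mar 22 21:27 ..
# drwxr-xr-x   2 root root    0 Mar 22 21:27 0000:13:45.6:pcie001
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.5
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.6
# -r--r--r--   1 root root 4096 Mar 22 21:29 aer_dev_correctable
"""
# The dot is escaped so that it matches a literal '.' only.
pattern = r'\b(0{0,4}:\d{2}:\d{2}\.\d:?\w*)'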
list_of_addrs = re.findall(pattern, s, re.MULTILINE)
print(list_of_addrs)

Output:

['0000:13:45.6:pcie001', '0000:12:34.5', '0000:12:34.6']
Rakesh

Input:

import re
s = """
# total 0
# drwxr-xr-x   7 root root    0 Mar 22 21:30 .
# drwxr-xr-x 121 root root    0 Mar 22 21:27 ..
# drwxr-xr-x   2 root root    0 Mar 22 21:27 0000:13:45.6:pcie001
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.5
# drwxr-xr-x   2 root root    0 Mar 22 21:30 0000:12:34.6
# -r--r--r--   1 root root 4096 Mar 22 21:29 aer_dev_correctable
"""

# Begins with 4 hex characters
# :
# 2 hex characters
# :
# 2 hex characters
# . (escaped, so it matches a literal dot only)
# 1 decimal character
# 0 or more occurrences of anything other than whitespace
pattern = r'\b([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.\d\S*)'
print(re.findall(pattern, s))

Output:

['0000:13:45.6:pcie001', '0000:12:34.5', '0000:12:34.6']
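
If the individual fields are needed rather than the raw string, named groups are one option. The sketch below is an assumption-laden variation, not part of the answer above: `PciAddress` and `parse_pci_address` are hypothetical names, the function digit is restricted to 0-7 on the assumption that PCI function numbers are 3 bits wide, and whatever follows the address (such as `:pcie001`) is kept as an opaque suffix.

```python
import re
from typing import NamedTuple, Optional


class PciAddress(NamedTuple):
    """Hypothetical container for the parsed fields."""
    domain: int
    bus: int
    device: int
    function: int
    suffix: str


_PCI_RE = re.compile(
    r"(?P<domain>[0-9a-fA-F]{4}):"
    r"(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<device>[0-9a-fA-F]{2})\."
    r"(?P<function>[0-7])"      # assumption: function is a single digit 0-7
    r"(?P<suffix>\S*)"          # e.g. ':pcie001', or empty
)


def parse_pci_address(name: str) -> Optional[PciAddress]:
    """Return the parsed address, or None if name is not a PCI address."""
    m = _PCI_RE.fullmatch(name)
    if m is None:
        return None
    return PciAddress(
        domain=int(m.group("domain"), 16),
        bus=int(m.group("bus"), 16),
        device=int(m.group("device"), 16),
        function=int(m.group("function")),
        suffix=m.group("suffix"),
    )


print(parse_pci_address("0000:13:45.6:pcie001"))
print(parse_pci_address("aer_dev_correctable"))
```

Since `fullmatch` checks a whole name, this would be applied to the last field of each listing line, or to names obtained directly from something like `os.listdir`, instead of scanning the full `ls` output.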

See also: https://www.w3schools.com/python/python_regex.asp

tarabyte