
For my project, I need to read a file, match its contents against my constants, and, once something matches, store it in a dictionary. I am going to show a sample of my data and what I have so far below.

My data:

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2

This data has 5 important parts I want to gather:

  1. Date right after timestamp: 1579051725

  2. Num (first part of the numbers, up until 128, 129, 130, etc.): .1.3.4.1.2.1.1.1.1.1.1

  3. Num2 (second part): 128 or 129 or 130 or others in my larger data set

  4. Syntax: In this case it is named: STRING

  5. Counter: In this case they are strings: AA1, Eth1, or Eth2

I also have (need to have) a constant Num dictionary within the program that holds the value above, as well as a constant Syntax.

I want to read through the data file,

  1. If Num matches the constant I have within the program,

  2. grab Num2,

  3. check if Syntax matches the constant syntax within the program

  4. grab Counter

When I say grab, I mean put that data under the corresponding dictionary key.

In short, I want to read through the data file, split out 5 variables from each matching line, match 2 of them against constant dictionary values, and store the remaining 3 variables (including the time) in a dictionary.

I have trouble with splitting the data right now: I can split everything except Num and Num2. I am also not sure how to create the constant dictionaries or how to store values under them.

I would love to use a regular expression instead of if statements, but I could not figure out which symbols to use, since the data includes many dots within the words.
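For what it's worth, literal dots need escaping as `\.` in a regular expression; a minimal sketch of splitting one sample line into the four pieces (assuming the line format shown above):

```python
import re

line = ".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1"

# (?:\.\d+)+ greedily matches the dotted prefix, and backtracking lets
# \.(\d+) claim the final number (128) as a separate group.
m = re.match(r"((?:\.\d+)+)\.(\d+) = (\w+): (\w+)", line)
if m:
    num, num2, syntax, counter = m.groups()
    print(num, num2, syntax, counter)  # .1.3.4.1.2.1.1.1.1.1.1 128 STRING AA1
```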

I have the following so far:

constant_dic1 = {[".1.3.4.1.2.1.1.1.1.1.1"]["STRING" ]}
data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}
fileN = args.File_Name
with open (fileN, 'r') as f:

    for lines in f:
        if lines.startswith('.'):
            if ': ' in lines:
                lines=lines.split("=")
                first_part = lines[0].split()
                second_part = lines[1].split()
                for i in first_part:
                    f_f = i.split("{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.{}.")
                print (f_f[0])

Once I run the program, I receive the error "TypeError: list indices must be integers or slices, not str".

When I comment out the dictionary part, the output contains Num together with Num2; the value does not get split, so the Num part alone is never printed.
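For reference, the TypeError comes from the `constant_dic1` line: `{[".1.3.4.1.2.1.1.1.1.1.1"]["STRING"]}` builds a list and then tries to index it with the string "STRING", which raises exactly that error. A dictionary literal needs key: value pairs; a minimal sketch of a valid constant dictionary (the pairing of Num to Syntax here is illustrative):

```python
# Map each constant InterfaceNum to the Syntax keyword expected for it.
constant_dic1 = {".1.3.4.1.2.1.1.1.1.1.1": "STRING"}

num, syntax = ".1.3.4.1.2.1.1.1.1.1.1", "STRING"

# Matching then becomes a membership test plus a value comparison.
matches = num in constant_dic1 and constant_dic1[num] == syntax
print(matches)  # True
```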

Any help is appreciated! If there's another source I should look at, please let me know below. Please also let me know if the question needs any updates rather than downvoting. Thanks!

UPDATED CODE

import pandas as pd
import io
import matplotlib
matplotlib.use('TkAgg') # backend option for matplotlib #TkAgg #Qt4Agg #Qt5Agg
import matplotlib.pyplot as plt
import re # regular expression
import argparse # for optional arguments
parser = argparse.ArgumentParser()
parser.add_argument('File_Name', help="Enter the file name | At least one file is required to graph")
args=parser.parse_args()

data_cols = {'InterfaceNum':[],"IndexNum":[],"SyntaxName":[],"Counter":[],"TimeStamp":[]}
fileN = args.File_Name
input_data = fileN
expr = r"""
    TIMESTAMP:\s(\d+)           # date    - TimeStamp
    |                           # ** OR **
    ((?:\.\d+)+)                # num     - InterfaceNum
        \.(\d+)\s=\s            # num2    - IndexNum
            (\w+):\s            # syntax  - SyntaxName
                (\w+)           # counter - Counter
    """
expr = re.compile(expr, re.VERBOSE)
data = {}
keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']


with io.StringIO(input_data) as data_file:
    for line in data_file:
        try:
            find_data = expr.findall(line)[0]
            vals = [date, num, num2, syntax, counter] = list(find_data)
            if date:
                cur_date = date
                data[cur_date] = {k: [] for k in keys}
            elif num:
                vals[0] = cur_date
                for k, v in zip(keys, vals):
                    data[cur_date][k].append(v)
        except IndexError:
            # expr.findall(...)[0] indexes an empty list when there's no
            # match.
            pass

data_frames = [pd.DataFrame.from_dict(v) for v in data.values()]

print(data_frames[0])

ERROR I GET

Traceback (most recent call last):
  File "v1.py", line 47, in <module>
    print(data_frames[0])
IndexError: list index out of range
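A likely cause: `io.StringIO(input_data)` wraps the file *name*, not the file's contents, so the loop iterates over the name string, nothing matches, and `data_frames` stays empty. A small demonstration of the difference (the file name is illustrative):

```python
import io

file_name = "s_data.txt"  # illustrative path

# io.StringIO treats its argument as the data itself, so iterating over it
# yields the file *name* rather than the file's lines.
lines = list(io.StringIO(file_name))
print(lines)  # ['s_data.txt']

# Reading the actual file contents requires open() instead:
# with open(file_name, "r") as data_file:
#     for line in data_file:
#         ...
```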


NEW DATA

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2
.1.2.3.4.5.6.7.8.9.10.11.131 = INT32: A

UPDATED CODE (v2)

import pandas as pd
import io
import matplotlib
import re # regular expression

file = r"/home/rusif.eyvazli/Python_Projects/network-switch-packet-loss/s_data.txt"



def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None, 
                 syntax=None, counter=None):

    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num   = idx_num   or r'\d+'
    syntax    = syntax    or r'\w+'
    counter   = counter   or r'\w+'

#     expr = r"""
#         TIMESTAMP:\s({timestamp})   # date    - TimeStamp
#         |                           # ** OR **
#         ({iface_num})               # num     - InterfaceNum
#             \.({idx_num})\s=\s      # num2    - IndexNum
#                 ({syntax}):\s       # syntax  - SyntaxName
#                     ({counter})     # counter - Counter
#         """

    expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"

#   expr = re.compile(expr, re.VERBOSE)

    expr = re.compile(expr)

    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:

                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                         cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass

    return rows, cols

const_num    = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

df = pd.DataFrame.from_dict(master_dict)

df = df.loc[(df['InterfaceNum'] == '.1.2.3.4.5.6.7.8.9.10.11') & (df['SyntaxName'] == 'INT32' )] 

print(f"\n{df}")

OUTPUT

    TimeStamp              InterfaceNum IndexNum SyntaxName Counter
3  1579051725  .1.2.3.4.5.6.7.8.9.10.11      131      INT32       A
r_e
  • @rahlf23 , i updated. When I comment out the dictionary part, output is Num as well as Num2. It does not get split and does not print just the Num part. – r_e Mar 12 '20 at 20:13
  • hi @r_e, see if there's anything you can use from the answer I posted. – Todd Mar 13 '20 at 00:28
  • hi @Todd , I am going to check it out in a few. Due to time zone difference, I could not reply any earlier, my bad – r_e Mar 13 '20 at 12:12
  • @r_e, how come you selected the answer that only says, 'please use regular expressions'?? That answer doesn't even go into any detail. That's hardly fair. You're using my code above, so i assume you got it to work for you. – Todd Mar 20 '20 at 19:52
  • Oh, @Todd, that's my mistake. I meant to choose your answer as the correct answer. Due to my laptop's screen size, I chose it wrong. I re-applied the "correct answer" and chose your answer as the correct answer. My mistake.. – r_e Mar 21 '20 at 12:16
  • Ahh.. Thank you for fixing that. I know it's funny that it concerns me, but I enjoy the validation that points give me on this site =) thanks @r_e I'm glad you got your code working. Feel free to remove the parts of it you don't need. looks like you can remove the commented out verbose expression and just keep one of the input parameters. – Todd Mar 21 '20 at 15:59
  • @Todd , of course. Please, upvote the question if you liked to answer it, as well as some of my comments. (it helps with reputation as well as showing that they worth to be in here :) ) Thanks! – r_e Mar 26 '20 at 13:51

2 Answers


Please use Python's "re" package for regular expressions. This package makes regular expressions easy to use in Python, and its various functions can achieve what you need. Use this link to read the documentation: https://docs.python.org/3/library/re.html#module-contents

There is a function called re.Pattern.match() which can be used to match patterns as you need; try it out.
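For example, a compiled pattern's match() method tries to match at the start of the string and returns a match object (or None on failure) whose groups hold the captured pieces; a sketch using a sample line from the question:

```python
import re

pattern = re.compile(r"((?:\.\d+)+)\.(\d+) = (\w+): (\w+)")

m = pattern.match(".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1")
if m is not None:
    print(m.group(1))  # .1.3.4.1.2.1.1.1.1.1.1 -> Num
    print(m.group(2))  # 128                    -> Num2
    print(m.group(3))  # STRING                 -> Syntax
    print(m.group(4))  # AA1                    -> Counter
```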

  • Thanks for the link! I will try again to use RE. Meanwhile, if you could, would you be able to let me know how can I split .1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1 to 4 pieces using RE as asked in the question? I have tried but symbols are so confusing since there are repeated. – r_e Mar 12 '20 at 20:20
  • How do you want to split .1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1? Give me an example so that I can help you. –  Mar 12 '20 at 20:44
  • Sure. i want to split .1.3.4.1.2.1.1.1.1.1.1 then 128 then STRING then AA1. After splitting, I will match my constant dictionary (After fixing the dictionary) with .1.3.4.1.2.1.1.1.1.1.1, if matches, will store 128 within that dictionary. Then will match STRING with the constant dictionary. And if STRING will match, then will store AA1 within that dictionary. But again, i want to split .1.3.4.1.2.1.1.1.1.1.1 then 128 then STRING then AA1. THANKS A LOT! – r_e Mar 12 '20 at 20:48
  • a = ".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1" charcaters_splitted = a.split() >>> charcaters_splitted ['.1.3.4.1.2.1.1.1.1.1.1.128', '=', 'STRING:', 'AA1'] >>> number_splitted = charcaters_splitted[0].split('.') >>> number_splitted ['', '1', '3', '4', '1', '2', '1', '1', '1', '1', '1', '1', '128'] >>> number_len_one = [i for i in number_splitted if len(i) < 2] >>> number_len_one ['', '1', '3', '4', '1', '2', '1', '1', '1', '1', '1', '1'] joined = ".".join(number_len_one) >>> joined '.1.3.4.1.2.1.1.1.1.1.1' This is how I split the number. –  Mar 12 '20 at 21:57

Parsing raw file input using Regular Expressions

The function below is an example of how to parse raw file input with regular expressions.

The regular expression capture groups are looped over to build records. This is a reusable pattern that can be applied in many cases. There's more info on how it works in the 'Groupings in compound regular expressions' section.

The function will filter records that match the parameter values. Leaving them to their defaults, the function returns all the rows of data.

def get_dev_data(file_path, timestamp=None, iface_num=None, idx_num=None, 
                 syntax=None, counter=None):
    timestamp = timestamp or r'\d+'
    iface_num = iface_num or r'(?:\.\d+)+'
    idx_num   = idx_num   or r'\d+'
    syntax    = syntax    or r'\w+'
    counter   = counter   or r'\w+'
    expr = rf"""
        TIMESTAMP:\s({timestamp})   # date    - TimeStamp
        |                           # ** OR **
        ({iface_num})               # num     - InterfaceNum
            \.({idx_num})\s=\s      # num2    - IndexNum
                ({syntax}):\s       # syntax  - SyntaxName
                    ({counter})     # counter - Counter
        """
    expr = re.compile(expr, re.VERBOSE)
    rows = []
    keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']
    cols = {k: [] for k in keys}

    with open(file_path, 'r') as data_file:
        for line in data_file:
            try:
                find_data = expr.findall(line)[0]
                vals = [tstamp, num, num2, sntx, ctr] = list(find_data)
                if tstamp:
                    cur_tstamp = tstamp
                elif num:
                    vals[0] = cur_tstamp
                    rows.append(vals)
                    for k, v in zip(keys, vals):
                        cols[k].append(v)
            except IndexError:
                # expr.findall(line)[0] indexes an empty list when no match.
                pass
    return rows, cols

A tuple is returned. The first item, rows, is a list of rows of data in simple format; the second item, cols, is a dictionary keyed by column name with a list of row data per key. Both contain the same data and are each digestible by Pandas with pd.DataFrame.from_records() or pd.DataFrame.from_dict() respectively.

filtering example

This shows how records can be filtered using the function parameters. I think the last one, result_4, fits the description in the question. Assume that iface_num is set to your const_num, and syntax to your const_syntax values. Only records that match will be returned.

if __name__ == '__main__':

    file = r"/test/inputdata.txt"

    result_1 = get_dev_data(file)[0]
    result_2 = get_dev_data(file, counter='Eth2')[0]
    result_3 = get_dev_data(file, counter='Eth2|AA1')[0]
    result_4 = get_dev_data(file,
                           iface_num='.1.3.4.1.2.1.1.1.1.1.1', syntax='STRING')[0]

    for var_name, var_val in zip(['result_1', 'result_2', 'result_3', 'result_4'],
                                 [ result_1,   result_2,   result_3,   result_4]):

        print(f"{var_name} = {var_val}")

Output

result_1 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_2 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_3 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]
result_4 = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '129', 'STRING', 'Eth1'], ['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '130', 'STRING', 'Eth2']]

Using the first returned tuple item, column data can be accessed from the returned records using their offsets. For instance TimeStamp would be accessed like first_item[0][0] - first row, first column. Or, the rows can be converted into a dataframe and accessed that way.
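A sketch of both access styles, using one sample record in the row format above:

```python
import pandas as pd

rows = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1']]
keys = ['TimeStamp', 'InterfaceNum', 'IndexNum', 'SyntaxName', 'Counter']

# By offset: first row, first column is the TimeStamp.
print(rows[0][0])  # 1579051725

# Or via a dataframe, by column name.
df = pd.DataFrame.from_records(rows, columns=keys)
print(df['Counter'][0])  # AA1
```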

Input file /test/inputdata.txt

TIMESTAMP: 1579051725 20100114-202845
.1.2.3.4.5.6.7.8.9 = 234567890
ifTb: name-nam-na
.1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1
.1.3.4.1.2.1.1.1.1.1.1.129 = STRING: Eth1
.1.3.4.1.2.1.1.1.1.1.1.130 = STRING: Eth2

Convert row data into a Pandas dataframe

The first tuple item in the output of the function will be rows of data corresponding to columns we've defined. This format can be converted into a Pandas dataframe using pd.DataFrame.from_records():

>>> row_data = [['1579051725', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1']]
>>>
>>> column_names = ['TimeStamp', 'InterfaceNum', 'IndexNum', 
...                 'SyntaxName', 'Counter']
>>>
>>> pd.DataFrame.from_records(row_data, columns=column_names)
    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
>>> 

Convert column data into a Pandas dataframe

The function also produces a dictionary as the second item of the returned tuple containing the same data, which could also produce the same dataframe using pd.DataFrame.from_dict().

>>> col_data = {'TimeStamp': ['1579051725'], 
...             'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1'], 
...             'IndexNum': ['128'], 'SyntaxName': ['STRING'], 
...             'Counter': ['AA1']}
>>> 
>>> pd.DataFrame.from_dict(col_data)
    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
>>> 

Dictionary example

Here are a few examples of filtering file data, initializing a persistent dictionary. Then filtering for more data and adding it to the persistent dictionary. I think this is also close to what's described in the question.

const_num    = '.1.3.4.1.2.1.1.1.1.1.1'
const_syntax = 'STRING'

result_5 = get_dev_data(file, iface_num=const_num, syntax=const_syntax)

# Use the results of the first dict retrieved to initialize the master
# dictionary.
master_dict = result_5[1]

print(f"master_dict = {master_dict}")

result_6 = get_dev_data(file, counter='Eth2|AA1')

# Add more records to the master dictionary.
for k, v in result_6[1].items():
    master_dict[k].extend(v)

print(f"master_dict = {master_dict}")

df = pd.DataFrame.from_dict(master_dict)

print(f"\n{df}")

Output

master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2']}
master_dict = {'TimeStamp': ['1579051725', '1579051725', '1579051725', '1579051725', '1579051725'], 'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1', '.1.3.4.1.2.1.1.1.1.1.1'], 'IndexNum': ['128', '129', '130', '128', '130'], 'SyntaxName': ['STRING', 'STRING', 'STRING', 'STRING', 'STRING'], 'Counter': ['AA1', 'Eth1', 'Eth2', 'AA1', 'Eth2']}

    TimeStamp            InterfaceNum IndexNum SyntaxName Counter
0  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
1  1579051725  .1.3.4.1.2.1.1.1.1.1.1      129     STRING    Eth1
2  1579051725  .1.3.4.1.2.1.1.1.1.1.1      130     STRING    Eth2
3  1579051725  .1.3.4.1.2.1.1.1.1.1.1      128     STRING     AA1
4  1579051725  .1.3.4.1.2.1.1.1.1.1.1      130     STRING    Eth2

If all columns of the dictionary data aren't needed, keys in it can be dispensed with using <dict>.pop(<key>). Or you could drop columns from any dataframe created off the data.
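A sketch of both options, using the column data format returned by the function (values are from the sample data):

```python
import pandas as pd

cols = {'TimeStamp': ['1579051725'],
        'InterfaceNum': ['.1.3.4.1.2.1.1.1.1.1.1'],
        'IndexNum': ['128'],
        'SyntaxName': ['STRING'],
        'Counter': ['AA1']}

# Option 1: drop the key from the dictionary before building the frame.
cols.pop('SyntaxName')

# Option 2: build the frame, then drop an unwanted column from it.
df = pd.DataFrame.from_dict(cols).drop(columns=['IndexNum'])
print(df.columns.tolist())  # ['TimeStamp', 'InterfaceNum', 'Counter']
```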


Groupings in compound regular expressions

This is the expression that's evaluated in the function when all its parameters are left at their default values.

expr = r"""
    TIMESTAMP:\s(\d+)           # date    - TimeStamp
    |                           # ** OR **
    ((?:\.\d+)+)                # num     - InterfaceNum
        \.(\d+)\s=\s            # num2    - IndexNum
            (\w+):\s            # syntax  - SyntaxName
                (\w+)           # counter - Counter
    """

In the regular expression above, there are two alternative statements separated by the OR, | operator. These alternatives match either a line of timestamp data, or device data. And within these subexpressions are groupings to capture specific pieces of the string data. Match groups are created by putting parenthesis, (...), around a subexpression. The syntax for non-grouping parenthesis is (?:...).

No matter which alternative subexpression matches, there will still be the same number of match groups returned per successful call to re.findall(). Maybe a bit counterintuitive, but this is just how it works.

However, this feature does make it easy to write code to extract which fields of the match you've captured since you know the positions the groups should be at regardless of the subexpression matched:

     [<tstamp>, <num>, <num2>, <syntax>, <counter>]
     # ^expr1^  ^.............expr2..............^

And since we have a predictable number of match groups regardless of which subexpression matches, it enables a pattern of looping that can be applied in many scenarios. By testing whether single match groups are empty or not, we know which branch within the loop to take to process the data for whichever subexpression got the hit.

        if tstamp:
            # First expression hit.
        elif num:
            # Second alt expression hit.

When the expression matches against the line of text that has the timestamp, the first subexpression hits, and its groups will be populated.

>>> re.findall(expr, "TIMESTAMP: 1579051725 20100114-202845", re.VERBOSE)
[('1579051725', '', '', '', '')]

Here, the first grouping from the expression is filled in and the other groups are blank. The other groupings belong to the other subexpression.

Now when the expression matches against the first line of device data, the second subexpression gets a hit, and its groups are populated. The timestamp groups are blank.

>>> re.findall(expr, ".1.3.4.1.2.1.1.1.1.1.1.128 = STRING: AA1", re.VERBOSE)
[('', '.1.3.4.1.2.1.1.1.1.1.1', '128', 'STRING', 'AA1')]

And finally, when neither subexpression matches, then the entire expression doesn't get a hit. In this case we get an empty list.

>>> re.findall(expr, "ifTb: name-nam-na", re.VERBOSE)
[]
>>> 

For contrast, here's the expression without verbose syntax and documentation:

expr = r"TIMESTAMP:\s(\d+)|((?:\.\d+)+)\.(\d+)\s=\s(\w+):\s(\w+)"
Todd
  • Hi @Todd, thanks for the great explanation! I appreciate it! To implement it to read the data from the file, I assigned input_data to fileN. But it gives me IndexError "list index out of range" I have updated my question with updated code. Could you please check on it? How may i fix such issue? – r_e Mar 13 '20 at 15:01
  • @r_e you're going to have to replace the statement `io.StringIO(input_data)` with the file you want to read in: `with open(file_path, 'r') as data_file:` `io.StringIO` is only there to demo the algorithm using the data you showed in your question as a string. – Todd Mar 13 '20 at 19:06
  • thanks for the information. One last thing, could you please explain the if statement? I am not sure if I understand the `data[cur_date] = {k: [] for k in keys}`part as well as the else if part. Appreciate it! – r_e Mar 14 '20 at 13:10
  • Also, if I want to print out a specific set of `num`s or `syntaxname`s, would I just need to add if statement as `if num = ".1.3.4.2.1.1.1.1.1.1" print (data_frames[0])` ? – r_e Mar 14 '20 at 13:46
  • I tried this, but it does not work: `if data_frames[0]['InterfaceNum'] = ".1.3.4.2.1.1.1.1.1.1"): print (data_frames[0])` – r_e Mar 14 '20 at 14:54
  • I simplified the code somewhat @r_e. Hopefully, it's easier to follow. By the way, the code you say doesn't work, shouldn't work. What it appears to be attempting is assigning a value to a dataframe column. Anyway, the new code might be clearer and you can determine how to access the data a little easier. – Todd Mar 15 '20 at 05:13
  • Hi @Todd, I am going to try it in a few. Thank you very much! I appreciate your answer and support! Will update you soon if it works – r_e Mar 15 '20 at 13:26
  • Hi @Todd, for result_5: `result_5 = get_dev_data(file, iface_num=const_num, syntax=const_syntax)` , should it not filter the data by those const values? I added a new data with different const_num, and that one still is in the output. Could you please help with this? I will update the code above with what I have. I will also create github soon for better way to communicate, if you do not mind – r_e Mar 20 '20 at 14:08
  • here's the link to the github page indeed. Appreciate the support and help! https://github.com/RusifE/Network_Analysis/tree/master – r_e Mar 20 '20 at 14:58
  • I got it. I added the following line: `df = df.loc[(df['InterfaceNum'] == '.1.2.3.4.5.6.7.8.9.10.11') & (df['SyntaxName'] == 'INT32' )] ` Now, I need to figure out how to ask user to input those lines instead of hardcoding which lines to print. – r_e Mar 20 '20 at 15:54