I'm trying to perform a sed/awk style regex substitution with python3's re module.

You can see it works fine here with a hardcoded test string:

#!/usr/bin/env python3

import re

regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "

line = ("21:21:54.165651  stat64   this/                       0.000012         THiNG1.12471\n"
        "21:21:54.165652  stat64   /that                       0.000012         2thIng.12472\n"
        "21:21:54.165653  stat64   /and/the  other  thing.xml  0.000012  With  S paces.12473\n"
        "21:21:54.165654  stat64   /and/the_other_thing.xml    0.000012    without_em_.12474\n"
        "21:59:57.774616  fstat64           F=90               0.000002            tmux.4129\n")

result = re.sub(regex, subst, line, 0, re.MULTILINE)

if result:
        print(result)

But I'm having some trouble getting it to work the same way with input from stdin:

#!/usr/bin/env python3

import sys, re

regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "

for line in str(sys.stdin):
        #sys.stdout.write(line)
        result = re.sub(regex, subst, line, 0, re.MULTILINE)

if result:
        print(result,end='')

I'd like to be able to pipe input straight into it from another utility, like is common with grep and similar CLI utilities.

Any idea what the issue is here?


Addendum

I tried to keep the question simple and generalized in the hope that answers might be useful to more people, including those in similar but different situations. However, the details might shed some more light on the problem, so here are the exact details of my current scenario:

The desired input to my script is actually the output stream from a utility called fs_usage. It's similar to utilities like ps, but provides a constant stream of system calls and filesystem operations. It tells you which files are being read from, written to, etc. in real time.

From the manual:

NAME
fs_usage -- report system calls and page faults related to filesystem activity in real-time

DESCRIPTION
The fs_usage utility presents an ongoing display of system call usage information pertaining to filesystem activity. It requires root privileges due to the kernel tracing facility it uses to operate.

By default, the activity monitored includes all system processes except for:
fs_usage, Terminal.app, telnetd, telnet, sshd, rlogind, tcsh, csh, sh, zsh. These defaults can be overridden such that output is limited to include or exclude (-e) a list of processes specified by the user.

The output presented by fs_usage is formatted according to the size of your window.
A narrow window will display fewer columns. Use a wide window for maximum data display.
You may override the formatting restrictions by forcing a wide display with the -w option.
In this case, the data displayed will wrap when the window is not wide enough.

I hacked together a crude little bash script to rip the process names from the stream and dump them to a temporary log file. You can think of it as a filter or an extractor. Here it is as a function that will dump straight to stdout (remove the comment on the last line to dump to a file).

proc_enum ()
  {
  while true; do
  sudo fs_usage -w -e 'grep' 'awk' | 
    grep -E -o '(?:\d\.\d{6})\s{3}\S+\.\d+' | 
    awk '{print $2}' | 
    awk -F '.' '{print $1}' \
      #>/tmp/proc_names.logx
  done
  }
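
For comparison, the two awk stages at the end of that pipeline could be sketched in Python roughly as follows. This is only an illustration: the helper name proc_name and the sample chunk are assumptions, standing in for whatever the grep stage actually emits.

```python
# Hypothetical sample of one chunk as emitted by the grep stage
chunk = "0.000012   THiNG1.12471"

def proc_name(chunk):
    """Return the process name from a 'span   name.pid' chunk."""
    second_field = chunk.split()[1]       # like awk '{print $2}'
    return second_field.split(".")[0]     # like awk -F '.' '{print $1}'

print(proc_name(chunk))
```

Like the awk version, this splits on whitespace, so a process name containing spaces would be cut short.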

voices

1 Answer

The problem is str(sys.stdin). What Python does in the for loop is this:

 i = iter(str(sys.stdin))
 # then in every iteration
 next(i)

Here you are converting the stream object to a str. On my computer the result is:

str(sys.stdin) == "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='cp1256'>"

You are not looping over the lines received on stdin; you are looping over the characters of the string representation of the stream object.
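
A small sketch of the difference, using io.StringIO as a stand-in for sys.stdin (that substitution is an assumption made so the example runs without piped input; the sample text is made up):

```python
import io

# io.StringIO stands in for sys.stdin here (assumption for illustration)
stream = io.StringIO("first line\nsecond line\n")

# Iterating over str(stream) walks the characters of the object's repr
chars = list(str(stream))[:5]

# Iterating over the stream itself yields one line at a time
lines = list(stream)

print(chars)   # the first few characters of "<_io.StringIO object at 0x...>"
print(lines)   # the actual lines, each still ending in "\n"
```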

There is another difference: in the first example you apply re.sub to the entire text, but here you apply it to each line separately. So you should either concatenate the result of each line, or join the lines into a single text before applying re.sub.

import sys, re

regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "

result = ''
for line in sys.stdin:
    # sys.stdin yields str in text mode, so this conversion is optional
    line = str(line)
    result += re.sub(regex, subst, line, 0, re.MULTILINE)


if result:
    print(result, end='')
Charif DZ
  • Actually, I didn't mean to call `str()` there in my example. That's just a thing I tried when troubleshooting. I noticed it seems a bit like reading from a file object with `open()` (maybe analogous to calling `open(/dev/stdin)` or similar?). I also tried a few other things e.g., `sys.stdin(readline())` (I think that's right, I don't have my notes in front of me). That's good information though, thanks for the confirmation. – voices Sep 29 '19 at 10:46
  • This will work if `sys.stdin` is receiving text. If the first command is not producing text, you may need to convert it first with `line.decode('utf8')`. What is the command exactly? – Charif DZ Sep 29 '19 at 10:50
  • So you're saying that the reason it doesn't behave like the hardcoded string, is because each line is treated separately instead of one single big chunk? The input is actually a continuous stream. I'll add a little more info to the question – I tried to keep the question simple, but maybe I left out an important detail. Just a sec. – voices Sep 29 '19 at 10:57
  • As I said, if this is the problem, `you should concatenate the result of each line or concatenate the lines in single text before applying re.sub` – Charif DZ Sep 29 '19 at 11:05
  • Difficult when the input is a constant stream though. – voices Oct 01 '19 at 21:26
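
On the constant-stream point, one possible approach is to write each substituted line out as soon as it is read instead of accumulating a result, since every line the iterator yields already ends in the "\n" that the pattern expects. The following is a minimal sketch, not a tested fs_usage filter: the function name filter_stream is an assumption, and io.StringIO objects stand in for the real streams so the example runs on its own. The regex, the replacement and the sample lines are copied from the question.

```python
import io
import re

# Pattern and replacement copied from the question
regex = re.compile(
    r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n",
    re.MULTILINE,
)
subst = "'\\g<name>', "

def filter_stream(src, dst):
    """Substitute each line from src and write it to dst straight away."""
    for line in src:                        # one line at a time, "\n" included
        dst.write(regex.sub(subst, line))   # non-matching lines pass through unchanged
        dst.flush()                         # do not hold output back until EOF

# Stand-ins for stdin/stdout, using two sample lines from the question
sample = io.StringIO(
    "21:21:54.165651  stat64   this/                       0.000012         THiNG1.12471\n"
    "21:59:57.774616  fstat64           F=90               0.000002            tmux.4129\n"
)
out = io.StringIO()
filter_stream(sample, out)
print(out.getvalue())
```

Calling filter_stream(sys.stdin, sys.stdout) after import sys would turn this into a pipe filter. Whether lines arrive promptly also depends on how the upstream command buffers its own output.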