1

I received some multi-line data via HTTP and have it in a single string. I need to keep only the lines that contain specific keywords and write them to a file.

How do I process the individual lines without consuming excessive memory, i.e. without splitting the input string at newlines and then processing the resulting list?

Jython-specific solutions are welcome, too.

user323094

4 Answers

1

Since there is no iterator version of str.split, your best bet is to emulate it using the re module:

import re
for match in re.finditer('.*?\n', data):
    line = match.group()  # finditer yields match objects, not strings
    # do stuff

However, note that this leaves the trailing newlines in place, unlike the regular split method, and that a final line without a trailing newline is not matched by this pattern.
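
If the last line may lack a trailing newline, the pattern can be extended to catch it. The following is only a sketch of that idea; the helper name `iter_lines` and the adjusted pattern are my own assumptions, not part of the `re` module:

```python
import re

def iter_lines(data):
    """Yield the lines of data one at a time, keeping trailing newlines.

    iter_lines is a hypothetical helper name. The first alternative matches
    a line that ends in a newline; the second catches a final line that has
    no newline. Neither alternative can match the empty string.
    """
    for match in re.finditer(r'[^\n]*\n|[^\n]+', data):
        yield match.group()

if __name__ == '__main__':
    for line in iter_lines("first\nsecond\n\nlast"):
        print(repr(line))
```

Only one line is held as a separate string at a time, although the regex machinery has its own overhead.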

lvc
0

You can try using compiled regular expressions; see Python's re module.
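
One way to read this suggestion is to compile a single pattern that matches whole lines containing any of the keywords, so that non-matching lines are never turned into separate strings. This is a sketch under my own assumptions: `matching_lines` is a hypothetical name, the keywords are treated as literal substrings, and the keyword list is assumed to be non-empty.

```python
import re

def matching_lines(data, keywords):
    """Yield each line of data that contains at least one keyword.

    matching_lines is a hypothetical helper. Lines are yielded without
    their trailing newline, because '.' does not match '\n'.
    """
    # Escape the keywords so they are matched literally, not as regex syntax.
    alternatives = '|'.join(re.escape(k) for k in keywords)
    # With re.MULTILINE, ^ and $ anchor at the start and end of every line.
    pattern = re.compile(r'^.*(?:%s).*$' % alternatives, re.MULTILINE)
    for match in pattern.finditer(data):
        yield match.group()

if __name__ == '__main__':
    sample = "a foo\nb bar\nc baz\n"
    for line in matching_lines(sample, ["foo", "baz"]):
        print(line)
```

When writing the results to a file, the newline has to be added back, e.g. `out.write(line + '\n')`.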

Denis
0

Use the StringIO module to access your string as a file-like object. Then you can iterate over lines as you would do for a file.
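
A minimal sketch of how that could look for the keyword filter from the question. The function name `filter_lines` and its parameters are made up for illustration; the import fallback is there because the `StringIO` module exists on Python 2 and Jython, while Python 3 only has `io.StringIO`:

```python
try:
    from StringIO import StringIO  # Python 2 / Jython
except ImportError:
    from io import StringIO  # Python 3

def filter_lines(data, keywords, out):
    """Write every line of data that contains a keyword to out.

    filter_lines is a hypothetical helper; out is any object with a
    write() method, such as an open file.
    """
    for line in StringIO(data):  # iterates line by line, newlines kept
        if any(keyword in line for keyword in keywords):
            out.write(line)

if __name__ == '__main__':
    result = StringIO()
    filter_lines("a foo\nb bar\nc baz\n", ["foo", "baz"], result)
    print(result.getvalue())
```

In real use, `out` would be the target file, e.g. one opened with `open('filtered.txt', 'w')` (the file name is only an example).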

Avaris
  • `StringIO` isn't necessarily better than `.split()` memory-wise since, being mutable, it isn't necessarily backed by the *same string* you pass it as an initial value. – lvc Mar 30 '12 at 11:01
  • @lvc: Not really. If you don't write on it, it keeps the same string. Even if you write, if you don't read it won't consume more memory. Check the source code if you'd like to be sure. Just creating the `StringIO` object and reading doesn't require extra memory, and my tests confirm that. – Avaris Mar 30 '12 at 12:09
  • Ok, having looked at Jython's source code, it seems that the `StringIO` module does indeed behave as you suggest, but that the newer `io.StringIO` class does *not*, and nor does Jython's implementation of `cStringIO` (and I haven't checked CPython's `cStringIO`). – lvc Mar 30 '12 at 12:45
0

I have now measured the memory requirements of data.split('\n'), re.finditer('.*?\n', data) and StringIO.readline() in Jython. To my surprise, split() did not increase used memory (PS Old Gen) at all. On Jython 2.5.1+, StringIO came second and re third; on Jython 2.2.1, those two swapped places.

Jython 2.5.1+:
  split()  +0 x data
  StringIO +2 x data
  re       +4 x data

Jython 2.2.1:
  split()  +0 x data
  re       +2 x data
  StringIO +7 x data

StringIO didn't use additional memory after the .write() call, i.e. it seems to be backed by the same string in Jython.

I didn't test speed.

user323094