3

Recently, I was handling a large text file (~10GB) and trying to replace some characters in Python.

I tried these two versions:

# Version 1: chained calls
f = open('myFile.txt', 'r')
filedata = f.read()
filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')

# Version 2: one call per statement
f = open('myFile.txt', 'r')
filedata = f.read()
filedata = filedata.replace(',', ' ')
filedata = filedata.replace('-', ' ')
filedata = filedata.replace('_', ' ')

When I tried the first one, the process was killed during the replace method. However, the process was not killed when I used the second one. (Screenshot.)

>>> f = open('myFile.txt', 'r')
... filedata = f.read()
... filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')
Killed

>>> f = open('myFile.txt', 'r')
... filedata = f.read()
... filedata = filedata.replace(',', ' ')
... filedata = filedata.replace('-', ' ')
... filedata = filedata.replace('_', ' ')
... print("Success!")
Success!

I don't think there is a significant difference in time and space complexity. Does anyone know what is going on under the hood?

Mateen Ulhaq
Ricky
  • 2
    When you say the process was killed, was there an error or anything? Also, if you're just replacing characters with spaces, you could use a regular expression instead of calling `.replace` multiple times, e.g. `filedata = re.sub('[,_-]', ' ', filedata)` – Rolv Apneseth Jun 30 '21 at 10:49
  • There is no error. Just "Killed" is shown in the terminal. As for the regular expression, I didn't know about that. Is it faster than the usual replace method? – Ricky Jun 30 '21 at 19:42
  • I just attached the screenshot to the original post. I hope this helps! – Ricky Jun 30 '21 at 19:56
  • 2
    As an aside, if it is all single-character replacements, you should use `str.translate`; it will be much faster – juanpa.arrivillaga Jun 30 '21 at 20:02
  • Please include the error as text rather than an image if possible. I believe the `Killed` message usually means it's running out of RAM. My guess is since they're chained, the `filedata` does not get reallocated like in the second example, but everything is held in memory for all chained calls of `.replace()`, leading to you running out of RAM – Rolv Apneseth Jun 30 '21 at 20:06
  • But yes I think in terms of time they should be equivalent – Rolv Apneseth Jun 30 '21 at 20:07
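
A note on the regular-expression route mentioned in the comments: inside a character class, a hyphen placed between two characters is read as a range, so `[,-_]` covers everything from `,` to `_` in ASCII order rather than just the three punctuation characters. The sketch below compares that with a class where the hyphen comes last; `sample` is a made-up string chosen only for illustration.

```python
import re

sample = "aB1,-_"

# Hyphen in the middle: ',-_' is the range ',' (0x2C) through '_' (0x5F),
# which also covers digits and uppercase ASCII letters.
range_result = re.sub(r"[,-_]", " ", sample)

# Hyphen last: only the three literal characters ',', '_' and '-' match.
literal_result = re.sub(r"[,_-]", " ", sample)

print(repr(range_result))    # 'a' followed by five spaces
print(repr(literal_result))  # 'aB1' followed by three spaces
```

Escaping the hyphen (`[,\-_]`) should behave the same as placing it last.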

2 Answers

6

This isn't a problem with chained calls in general. It happens here because you keep the original reference around:

 filedata = f.read()

So:

filedata = filedata.replace(',', ' ').replace('-', ' ').replace('_', ' ')

The original str read from the file has to stay in memory, alongside the intermediate .replace results, until the assignment at the end, where its reference count finally reaches 0. Even a single replace that doesn't change the length of the string needs twice as much memory, because the original string and the new string exist at the same time. So during your second replace, you have the original string, the once-replaced string, and the new, twice-replaced string in memory at once.

On the other hand,

filedata = filedata.replace(',', ' ')
filedata = filedata.replace('-', ' ')
filedata = filedata.replace('_', ' ')

Here, each step requires at most 2 times the memory of the original string: each assignment drops the last reference to the previous string, so it is freed before the next .replace runs. Importantly, the original doesn't stay in memory.
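
The lifetime difference described above can be sketched without any large strings. This assumes CPython, where an object is freed as soon as its reference count reaches zero; `Tracked` is a made-up stand-in for the big string, and `step()` plays the role of `.replace` by returning a fresh object.

```python
class Tracked:
    """Hypothetical stand-in for a large string; counts live instances."""
    live = 0
    peak = 0

    def __init__(self):
        Tracked.live += 1
        Tracked.peak = max(Tracked.peak, Tracked.live)

    def __del__(self):
        # On CPython this runs as soon as the last reference goes away.
        Tracked.live -= 1

    def step(self):
        # Like str.replace: returns a new object, leaves self untouched.
        return Tracked()


def chained_peak():
    Tracked.live = Tracked.peak = 0
    x = Tracked()
    # The name x keeps the original alive until the final assignment.
    x = x.step().step()
    return Tracked.peak


def sequential_peak():
    Tracked.live = Tracked.peak = 0
    x = Tracked()
    x = x.step()  # rebinding x frees the original here
    x = x.step()
    return Tracked.peak


print(chained_peak(), sequential_peak())
```

On CPython this should report a peak of 3 live objects for the chained form and 2 for the sequential form, mirroring the 3x versus 2x memory argument. Other interpreters with different garbage collectors may behave differently.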

If what I say is true, then the following should work:

filedata = f.read().replace(',', ' ').replace('-', ' ').replace('_', ' ')

But the Pythonic way to do this is to avoid .replace altogether in this instance, because you are doing multiple single-character replacements.

For that, you should use str.translate.

filedata = f.read()
# Map each character's code point to its replacement.
table = {ord(','): ' ', ord('-'): ' ', ord('_'): ' '}
filedata = filedata.translate(table)
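
As a quick sanity check on a toy input, here is a minimal sketch showing that a single `translate` pass gives the same result as the chained `.replace` calls. It builds the table with `str.maketrans` instead of spelling out the `ord()` keys; the sample `text` is made up for illustration.

```python
text = "a,b-c_d"

# str.maketrans accepts a dict keyed by single characters and converts
# the keys to code points, so there is no need to call ord() by hand.
table = str.maketrans({",": " ", "-": " ", "_": " "})

translated = text.translate(table)  # one pass, one new string
chained = text.replace(",", " ").replace("-", " ").replace("_", " ")

print(translated)
print(translated == chained)
```

Because `translate` builds one result string in a single pass, it only ever needs the input and one output in memory at a time.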

Here is some empirical evidence:

import tracemalloc

tracemalloc.start()

result = "abcdefghij"*1_000_000
result = (
    result.replace('a', '*')
          .replace('b', '*')
          .replace('c', '*')
)

size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")

del result
tracemalloc.reset_peak()

result = "abcdefghij"*1_000_000
result = result.replace('a', '*')
result = result.replace('b', '*')
result = result.replace('c', '*')

size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")

del result
tracemalloc.reset_peak()

result = ("abcdefghij"*1_000_000).replace('a', '*').replace('b', '*').replace('c', '*')

size, peak = tracemalloc.get_traced_memory()
print(f"{size=}, {peak=}")

The above outputs what I would expect:

size=10000625, peak=30000723
size=10000681, peak=20000730
size=10000681, peak=20000730
juanpa.arrivillaga
2

Consider these two functions:

>>> import dis
>>> def chain(s):
...     return s.replace(',', ' ').replace('-', ' ').replace('_', ' ')
>>> def line(s):
...     s = s.replace(',', ' ')
...     s = s.replace('-', ' ')
...     s = s.replace('_', ' ')
...     return s

Here are the bytecodes:

>>> dis.dis(chain)
  2           0 LOAD_FAST                0 (s)
              2 LOAD_METHOD              0 (replace)
              4 LOAD_CONST               1 (',')
              6 LOAD_CONST               2 (' ')
              8 CALL_METHOD              2
             10 LOAD_METHOD              0 (replace)
             12 LOAD_CONST               3 ('-')
             14 LOAD_CONST               2 (' ')
             16 CALL_METHOD              2
             18 LOAD_METHOD              0 (replace)
             20 LOAD_CONST               4 ('_')
             22 LOAD_CONST               2 (' ')
             24 CALL_METHOD              2
             26 RETURN_VALUE
>>> dis.dis(line)
  2           0 LOAD_FAST                0 (s)
              2 LOAD_METHOD              0 (replace)
              4 LOAD_CONST               1 (',')
              6 LOAD_CONST               2 (' ')
              8 CALL_METHOD              2
             10 STORE_FAST               0 (s)

  3          12 LOAD_FAST                0 (s)
             14 LOAD_METHOD              0 (replace)
             16 LOAD_CONST               3 ('-')
             18 LOAD_CONST               2 (' ')
             20 CALL_METHOD              2
             22 STORE_FAST               0 (s)

  4          24 LOAD_FAST                0 (s)
             26 LOAD_METHOD              0 (replace)
             28 LOAD_CONST               4 ('_')
             30 LOAD_CONST               2 (' ')
             32 CALL_METHOD              2
             34 STORE_FAST               0 (s)

  5          36 LOAD_FAST                0 (s)
             38 RETURN_VALUE

The only difference between the two is that line stores each intermediate result back into s and immediately reloads it, adding a STORE_FAST / LOAD_FAST pair after each call:

             10 STORE_FAST               0 (s)

  3          12 LOAD_FAST                0 (s)

So, as @juanpa.arrivillaga describes in his answer, the only remaining difference must be related to memory usage. If the program still holds an explicit reference to an object, that memory cannot be freed, even if the object is never used again. This is what happens for chain, provided that the caller keeps an explicit reference to s.
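
Opcode names and offsets vary between CPython versions, so the listing above may not match your interpreter exactly. A version-tolerant sketch is to count the local-variable stores rather than compare listings. The helper `count_stores` is a name introduced here for illustration, and matching on the substring is an assumption meant to cover versions that fuse a store with the following load into one opcode.

```python
import dis


def chain(s):
    return s.replace(',', ' ').replace('-', ' ').replace('_', ' ')


def line(s):
    s = s.replace(',', ' ')
    s = s.replace('-', ' ')
    s = s.replace('_', ' ')
    return s


def count_stores(func):
    # Count instructions that store into a local variable. Substring
    # matching also catches fused opcodes whose names contain STORE_FAST.
    return sum("STORE_FAST" in ins.opname for ins in dis.get_instructions(func))


print(count_stores(chain), count_stores(line))
print(chain("a,b-c_d") == line("a,b-c_d"))
```

chain never rebinds s, so it should contain no local stores at all, while line has one per statement; the two functions still return the same result.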

Mateen Ulhaq