9

From looking at the CPython implementation, it seems that the return value of a string split() is a list of newly allocated strings. However, since strings are immutable, it seems one could instead have created substrings that refer into the original string by storing offsets.

Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all of its substrings are.

san
    What does the CPython string look like? Is it `NULL` terminated? – JohanL May 08 '17 at 06:29
  • @Kasramvd Read it as offsets. Editing out the redundancy – san May 08 '17 at 06:32
  • GvR has rejected view slices for strings because of the problem of space recovery. Numerical Python does use view slices, because there are usually just a few master arrays which need to be kept around for a while anyway. If you are in a similar position, with a few master strings and many slices, you can write a substring view class; a minimal sketch follows below. There may even be one on PyPI. – Terry Jan Reedy May 13 '17 at 23:23
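
A minimal sketch of such a view class, assuming only slicing, length, and materialization are needed (the `StrView` name and its interface are illustrative, not an existing package):

```python
class StrView:
    """A read-only view into a parent string, stored as (parent, start, stop)."""

    __slots__ = ("_parent", "_start", "_stop")

    def __init__(self, parent, start=0, stop=None):
        self._parent = parent  # keeps the parent string alive via its refcount
        self._start = start
        self._stop = len(parent) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __str__(self):
        # Materializing copies the characters; slicing below only stores offsets.
        return self._parent[self._start:self._stop]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            return StrView(self._parent, self._start + start, self._start + stop)
        return self._parent[self._start + key]  # single (non-negative) index


text = "GET /index.html HTTP/1.1"
path = StrView(text, 4, 15)  # refers to "/index.html" without copying it
print(str(path), len(path))  # /index.html 11
```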

3 Answers

6

Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.

The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read an 8 GB HTTP access log file into memory to analyze which user agents access my site the most, and I do that just by calling fp.read() and then running a regex over the whole file at once, rather than going one line at a time.

I want to know the top 10 most common user agents, so I keep those matched substrings around in a list.

Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
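
The retention hazard is reproducible in current Python with `bytes` and `memoryview`, which really do share memory (sizes shrunk to 100 MB here; the `log` and `agent` names are just for illustration):

```python
log = b"x" * (100 * 1024 * 1024)  # stand-in for the 8 GB log file
agent = memoryview(log)[:20]      # a 20-byte slice sharing log's buffer

del log                           # the name is gone, but the tiny slice
print(len(agent.obj))             # still pins all 104857600 bytes alive
agent.release()                   # only now can the buffer be freed
```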

Java used this sharing technique prior to Java 7, so the same reasoning applies. See Java 7 String - substring complexity and JDK-4513622: (str) keeping a substring of a field prevents GC for object.

Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
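
This inline layout is observable from Python: `sys.getsizeof` reports a single allocation whose size grows with the string's length, because the character data sits right after the header (exact numbers vary by CPython version and platform):

```python
import sys

print(sys.getsizeof(""))         # header only, e.g. 49 bytes on 64-bit CPython
print(sys.getsizeof("a" * 10))   # header + 10 one-byte characters, e.g. 59
print(sys.getsizeof("a" * 100))  # header + 100 one-byte characters, e.g. 149
```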

Dietrich Epp
  • Yeah, this is probably the reason (which I anticipated in my question). – san May 08 '17 at 06:35
  • Interesting, never thought of that. Seems like a problem that should/could be fixed at the GC level by just splitting the string into new logical memory blocks though. – monoceres May 08 '17 at 06:37
  • @monoceres: Perhaps you could write a GC implementation for us that would do that... it sounds like no small task. – Dietrich Epp May 08 '17 at 06:37
  • @monoceres For reference, when Oracle removed string sharing from Java, they had GC experts working on improving the GC for almost 20 years and had what is in my mind the best GC on the planet, bar none. – Dietrich Epp May 08 '17 at 06:40
  • @DietrichEpp I wasn't trying to say I know better than the Oracle/Python developers about their GC, but rather to point out that this is not a problem with string sharing but with the memory allocator / GC. (Also, depending on the implementation, it could be either trivial or impossible to do the split). – monoceres May 08 '17 at 06:44
  • When you are using `fp.read()` you already have `fp`, which is a file object and an iterator-like object: not only an iterable but also very optimized in terms of memory use. So I think this is not a good example. I think one of the most important reasons is the level of abstraction, not only in a high-level language like Python but also in low-level ones like C. This is what makes it impossible even for the GC to solve this problem. – Mazdak May 08 '17 at 06:55
    @Kasramvd: That kind of misses the point of the example... it illustrates how a problem could arise. But there are a couple of other reasons why the example is reasonable. You may want to do multiple passes of analysis, so using `fp` as an iterator would make you do twice as much IO. Additionally, even though file iteration is fast, doing a regex or string search on a contiguous buffer is faster than doing it line-by-line. – Dietrich Epp May 08 '17 at 07:34
  • @monoceres: I wasn't meaning to accuse. But this is really a question of API design and string slicing, since "just fix it with smarter GC" is not really an option as far as I know. – Dietrich Epp May 08 '17 at 07:51
3

In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.

A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
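
A toy illustration of that lifetime obligation, where `Sub` is a hypothetical stand-in for a substring type that holds a strong reference to its parent:

```python
big = "x" * 1_000_000  # a large parent string

class Sub:
    # Hypothetical substring type: a reference to the parent plus offsets.
    def __init__(self, parent, start, stop):
        self.parent, self.start, self.stop = parent, start, stop

s = Sub(big, 0, 5)    # a 5-character "substring"
del big               # the only other reference is gone...
print(len(s.parent))  # ...yet all 1000000 characters are still alive

del s                 # in C, this is the point where the substring's deallocator
                      # would have to Py_DECREF() the parent so it can be freed
```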

Kevin
  • That was very informative, thank you. I assumed live substrings could just bump up the reference count of the parent string. So the problem is that reference counting is not a very efficient way of doing garbage collection: it trades throughput for latency. – san May 08 '17 at 07:35
    Typical implementation is to have one type for string slices and one type for the backing array, so you wouldn't ever worry about cycles and you wouldn't need anything more than reference-counting. Branching may be expensive but reference-counting requires branching already, and `free()` is much more expensive than a branch to begin with. – Dietrich Epp May 08 '17 at 07:57
  • @Dietrich: You got me, there's actually a free list. Adding an extra branch really would make a difference. – Kevin May 08 '17 at 16:08
0

CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.

You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
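
From Python you can watch CPython pick a storage width per string based on its widest code point, as PEP 393 specifies (the sizes shown are typical for 64-bit CPython 3.x and may differ on your build):

```python
import sys

print(sys.getsizeof("a" * 1000))           # 1-byte (latin-1) storage, ~1049
print(sys.getsizeof("\u0100" * 1000))      # 2-byte (UCS-2) storage, ~2074
print(sys.getsizeof("\U00010000" * 1000))  # 4-byte (UCS-4) storage, ~4076
```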

Many Python extension modules expect a NUL-terminated string. It would be difficult to implement substrings as slices into a larger string while preserving this low-level C API. Not impossible: it could be done by copying on C-API access, or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems other languages have encountered with subslice references, as Dietrich Epp described.

I see little in Kevin's answer which is applicable to this question. The decision had nothing to do with the lack of cyclic garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement, as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.

Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another type, then Python's normal type dispatcher would automatically use the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive": Python 3.3 (because of PEP 393) has those 4 back-end Unicode types, and decides what to do based on branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.

It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
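
The interning behavior is easy to observe from Python; note that it is a CPython implementation detail, not a language guarantee:

```python
import sys

a = "user_agent"
b = "user_" + "agent"            # folded to a constant at compile time, interned
print(a is b)                    # True on CPython

c = "".join(["user_", "agent"])  # built at run time, not automatically interned
print(a is c)                    # False
print(a is sys.intern(c))        # True: sys.intern returns the canonical object
```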

Andrew Dalke