
Primer: This question is quite long, because I want to give an overview of my current understanding of the inner mechanisms of MRI and how I came to my conclusions. I want to understand the code better, so please correct me if any assumption I'm making is wrong.

I'm trying to find out where MRI Ruby stores the data part (aka the contents) of a String, because I'd like to create String objects which reuse memory allocated by another binary (same allocator of course).

Here's what I know so far:

RString: internal representation of a String.

struct RString {
    struct RBasic basic;
    union {
        struct {
            long len;
            char *ptr;
            union {
                long capa;
                VALUE shared;
            } aux;
        } heap;
        char ary[RSTRING_EMBED_LEN_MAX + 1];
    } as;
};

reference

From the above snippet I conclude that there are 2 ways the data can be stored:

  1. on the heap via the heap struct (ptr points to data)
  2. in the ary char array directly (probably some optimization)

I'm only interested in the heap case.

str_new0() seems to be the most common way to create a String from a pointer to some string data and a length.

static VALUE
str_new0(VALUE klass, const char *ptr, long len, int termlen)
{
    VALUE str;

    if (len < 0) {
        rb_raise(rb_eArgError, "negative string size (or size too big)");
    }

    RUBY_DTRACE_CREATE_HOOK(STRING, len);

    str = str_alloc(klass);
    if (len > RSTRING_EMBED_LEN_MAX) {
        RSTRING(str)->as.heap.aux.capa = len;
        RSTRING(str)->as.heap.ptr = ALLOC_N(char, len + termlen);
        STR_SET_NOEMBED(str);
    }
    else if (len == 0) {
        ENC_CODERANGE_SET(str, ENC_CODERANGE_7BIT);
    }
    if (ptr) {
        memcpy(RSTRING_PTR(str), ptr, len);
    }
    STR_SET_LEN(str, len);
    TERM_FILL(RSTRING_PTR(str) + len, termlen);
    return str;
}

reference

Memory is allocated with the macro ALLOC_N, an alias for RB_ALLOC_N, which expands to ruby_xmalloc2(), which calls objspace_xmalloc2(), which in turn calls objspace_xmalloc0().

Phew

static void *
objspace_xmalloc0(rb_objspace_t *objspace, size_t size)
{
    void *mem;

    size = objspace_malloc_prepare(objspace, size);
    TRY_WITH_GC(mem = malloc(size));
    size = objspace_malloc_size(objspace, mem, size);
    objspace_malloc_increase(objspace, mem, size, 0, MEMOP_TYPE_MALLOC);
    return objspace_malloc_fixup(objspace, mem, size);
}

reference

So here we are. As far as I can tell, TRY_WITH_GC checks whether the allocation mem = malloc(size) succeeds and, if it does not, retries it once after a GC run.

#define TRY_WITH_GC(alloc) do { \
        objspace_malloc_gc_stress(objspace); \
        if (!(alloc) && \
        (!garbage_collect_with_gvl(objspace, TRUE, TRUE, TRUE, GPR_FLAG_MALLOC) || /* full/immediate mark && immediate sweep */ \
         !(alloc))) { \
            ruby_memerror(); \
        } \
    } while (0)

reference

Here's the first thing I'm unsure about: it seems to just malloc some memory (important: not in objspace). Is this the case? I don't know whether malloc has been overridden somewhere to allocate in a GC-friendly way or whatever.

OK, after that they mutate objspace with objspace_malloc_increase() and friends. I don't understand what these functions do. They do not seem to store the pointer mem in objspace, but maybe I overlooked it. I need clarification here.

As noted in the beginning, I want to write code that creates a Ruby String which uses memory allocated by some other binary, e.g. C via FFI, using the system allocator of course. Do I have to register my "foreign" memory via the objspace_* functions? If yes, how exactly does that work? And are there subtleties when it comes to freeing the memory again? (I guess the GC does that, but what conditions must hold for this to work?)

I hope my question is not too vague; I can ask more precisely if necessary!

Thanks in advance!

le_me
  • "I'd like to create String objects which reuse memory allocated by another binary." This sounds like a recipe for disaster. I'd strongly encourage you to make a string work-alike that uses your custom allocator. Ruby will want to take ownership of any memory it's given and if this is allocated outside of the Ruby VM you may encounter seriously unpredictable behaviour. Duck typing is your friend here. Your string alternative might be transparently compatible if you do it right. – tadman May 04 '16 at 19:02
  • The problem with duck typing is that rewriting the entire String API is a lot of work; it's huge! Can you elaborate on why it is problematic to allocate (and fill) the memory outside of the Ruby VM? – le_me May 04 '16 at 19:09
  • What if your string gets modified using `<<` which requires a buffer allocation? What about other forms of concatenation? What if a future version of Ruby rewrites how all of this is done for performance reasons or otherwise? There's a ton of things that can go wrong here. Writing your own work-alike that's 100% feature complete might not be easy, but covering the functions you're actually going to use wouldn't take long, and the rest can be handled by calling `custom_string.to_s` which emits a regular Ruby string. – tadman May 04 '16 at 19:14
  • Your class could behave pretty closely to a frozen String object as you probably don't want people editing it. Depends on what restrictions that memory has if it's shared. – tadman May 04 '16 at 19:14
  • I see. The memory is not shared, as nothing but the Ruby process would touch it after the String object is created. I think any implementation of String will have a pointer to the actual data; I guess future Ruby versions won't change that. I want to use all the goodies a (frozen) Ruby String has to offer, like Regex matching etc., but strictly without copying the memory (unless a user modifies it of course). Is this not possible? – le_me May 04 '16 at 19:29
  • I'm just expressing caution here. The only way to find out for sure is to try. – tadman May 04 '16 at 20:13
  • Maybe you're right... I initially thought it would be good to combine Ruby with a far more low-level language like Rust, for example, but the way Ruby handles data is fundamentally inefficient regarding FFI, especially strings. I'm not sure it's worth the trouble... but I still want to know how this aspect of MRI works, out of curiosity! – le_me May 04 '16 at 21:03
