How to implement a variable-length ‘string’-y in C

Question

I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.

I have a struct describing the string type, and then a create function that allocates such a ‘string’:

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.

Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?

This was a genuine question :D Don't turn things into wikis unless the subject dictates it. Rep is an incentive and typing out a good answer is time-consuming :D — Hassan Syed, Feb 11 '10 at 09:23
I don’t understand what you mean. I enable the ‘community wiki’ feature on almost every post I make; I’m community-minded in that way, I suppose… — ELLIOTTCABLE, Feb 11 '10 at 09:34
@elliotcable No one gets any reputation on a community wiki question. — Yacoby, Feb 11 '10 at 10:06
Oh, I didn’t know that! o_o Explains why my reputation stays the same. That’s a … really bad mechanic. An incentive to make Stack Overflow much less collaborative, and therefore, less useful… eh, but what do I know. Thanks, Yacoby and Hassan… — ELLIOTTCABLE, Feb 11 '10 at 10:12
Stack Overflow is already collaborative. People can edit your (non community wiki) questions if they really need to. The incentive in SO is the points. Without the points, there is no "incentive" other than good nature. It seems like you've gotten the completely wrong end of the stick when it comes to SO and the idea of a collaborative system. — Pod, Feb 11 '10 at 10:18
Yeah, it does sound like that. So people can still fix typos and such (i.e. improve the quality of this page for latecomers and Googlers) even if I don’t set ‘community wiki?’ What, then, is the purpose of ‘community wiki?’ Also, does the same apply to responses? Should I also not mark them as ‘community wiki,’ and will doing so preclude my reception of reputation? — ELLIOTTCABLE, Feb 11 '10 at 10:20
Without community wiki, fewer people can fix your typos, as the reputation needed to edit is higher - see http://stackoverflow.com/faq. In practice this works well enough, though. Therefore I agree: you should leave your questions and answers at the default setting. — Suma, Feb 11 '10 at 10:57

MSalters · Accepted Answer · 2010-02-12T13:00:29.807

12

Many C++ std::string implementations now use a "Small String Optimization". In pseudo-code:

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

The idea is that strings of up to 12 characters are stored in the shortString array. Such a string is contiguous with its length field and uses only a single cache line. Longer strings are stored on the heap, which leaves the 12 bytes of the union free for other uses. The pointer doesn't take all of them, so you can also remember how much memory you've allocated on the heap (>= length). That helps to support scenarios in which you grow a string in small increments.
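The pseudo-code above could be written in plain C roughly as follows. This is only a sketch: the names `sso_string`, `u.small`, `u.big`, `sso_init`, `sso_data` and `sso_free` are made up for illustration, and the union members are named so the code does not depend on anonymous structs and unions (which C only gained in C11).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SSO_MAX 12  /* inline capacity, matching the pseudo-code above */

/* Hypothetical layout: all names here are illustrative only. */
typedef struct sso_string {
    uint32_t length;              /* characters stored; no terminator is kept */
    union {
        char small[SSO_MAX];      /* used when length <= SSO_MAX */
        struct {
            char    *data;        /* heap buffer, used when length > SSO_MAX */
            uint32_t reserved;    /* bytes allocated on the heap (>= length) */
        } big;
    } u;
} sso_string;

/* Returns a pointer to the characters, wherever they live. */
const char *sso_data(const sso_string *s) {
    return s->length <= SSO_MAX ? s->u.small : s->u.big.data;
}

/* Builds a string from a C string; returns 0 on success, -1 on failure. */
int sso_init(sso_string *s, const char *src) {
    size_t n = strlen(src);
    if (n > UINT32_MAX)
        return -1;                        /* does not fit the length field */
    s->length = (uint32_t)n;
    if (n <= SSO_MAX) {
        memcpy(s->u.small, src, n);       /* short: stored inline */
        return 0;
    }
    s->u.big.data = malloc(n);            /* long: stored on the heap */
    if (!s->u.big.data) {
        s->length = 0;
        return -1;
    }
    memcpy(s->u.big.data, src, n);
    s->u.big.reserved = (uint32_t)n;
    return 0;
}

/* Releases the heap buffer, if there is one. */
void sso_free(sso_string *s) {
    if (s->length > SSO_MAX)
        free(s->u.big.data);
    s->length = 0;
}
```

Because no terminator is stored, callers would print with something like `printf("%.*s", (int)s.length, sso_data(&s))`. Note that on a typical 64-bit target the pointer alone is 8 bytes, so the union is likely to be padded to 16 bytes rather than 12.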

edited Feb 12 '10 at 13:00

answered Feb 11 '10 at 10:01

MSalters


I don’t quite understand how that’s legal; how do you access the struct portion of the union, if it has no name? – ELLIOTTCABLE Feb 11 '10 at 10:16
Also, this solution is really neat. I’m *definitely* going to use this—+1, and thanks for responding ^_^ – ELLIOTTCABLE Feb 11 '10 at 10:19
Hey, you're designing the language - the language itself doesn't need to access things by name. Furthermore, your language could support the "anonymous struct" rule: names of an anonymous struct are visible in the enclosing scope instead. Same thing as the anonymous union I used, really. – MSalters Feb 11 '10 at 10:24
Oh, you misunderstood me. I’m not writing a C-like language; I’m writing a language interpreter *in* C. I was asking how, in C, I access the anonymous struct/unions you used there. **Edit:** Ah, scratch that. I just did a bit of googling, found out it’s a C++ feature. Thanks! – ELLIOTTCABLE Feb 11 '10 at 10:30
@elliottcable: This answer talks about how this is typically done in C++, which also allows anonymous unions. In C, just add an inner name to the union member of the struct and you'll be fine. – unwind Feb 11 '10 at 10:33

outis · Answer 2 · 2010-02-11T09:25:21.443

The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.

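#include <stdlib.h>
#include <string.h>

typedef struct string {
    size_t length;   /* number of characters, excluding the terminator */
    char *native;    /* heap-allocated, null-terminated copy */
} string_t;

string_t String__create(const char native[]) {
    string_t this;
    this.length = strlen(native);
    this.native = malloc(this.length + 1);
    if (this.native) {
        /* the length is already known, so copy the terminator as well */
        memcpy(this.native, native, this.length + 1);
    } else {
        this.length = 0;   /* allocation failed: empty string, NULL buffer */
    }
    return this;
}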

If you want to dynamically allocate the string_t itself, use this version of String__create instead:

string_t* String__create(const char native[]) {
    string_t* this = malloc(sizeof(*this));
    if (this) {
        this->length = strlen(native);
        this->native = malloc(this->length + 1);
        if (this->native) {
            memcpy(this->native, native, this->length + 1);
        } else {
            /* don't hand back a half-built string */
            free(this);
            this = NULL;
        }
    }
    return this;
}

void String__delete(string_t** str) {
    if (str && *str) {
        free((*str)->native);
        free(*str);
        *str = NULL;   /* guard against use after free */
    }
}

Baltasarq · Answer 3 · 2010-02-11T11:52:59.397

In addition to what others have told you, I'd also include the concept of "capacity". You cannot ask the allocator how large an allocated block is, so you must store that yourself. Applying sizeof to the String struct below only gives you the size of its three fields (12 bytes with 4-byte pointers and size_t, 24 bytes on typical 64-bit platforms, possibly more due to padding), and applying sizeof to the pointer gives you the size of the pointer, not the length of the memory it points to.

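typedef struct mystring {
        char * native;     /* heap buffer holding the characters */
        size_t size;       /* characters in use, excluding the terminator */
        size_t capacity;   /* bytes allocated for native */
} String;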

This way, capacity always holds the actual size of the chunk your string lives in. If your string gets shorter, you don't have to realloc to make the capacity match the size exactly. You can also allocate up front the number of characters you expect the string to reach, rather than just the characters the initial string has. Finally, you can mimic the dynamic character array behind C++'s std::string and double the capacity each time the string grows beyond the capacity limit. All of these keep memory operations to a minimum, which translates into better performance (you will also waste some memory, but for typical short strings far less than the fixed 1 KB).

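String String__create(const char native[], size_t capacity) {
  String this;

  this.size = strlen( native );
  if ( capacity < ( this.size + 1 ) )
        this.capacity = this.size + 1;   /* always leave room for the terminator */
  else  this.capacity = capacity;

  /* allocate the adjusted capacity, not the raw argument */
  this.native = malloc( this.capacity );
  if ( this.native )
        strcpy( this.native, native );
  else  this.size = this.capacity = 0;   /* allocation failed */

  return this;
}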

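String * String__set(String *this, const char native[]) {
    size_t size = strlen( native );

    if ( size >= this->capacity ) {
        size_t capacity = this->capacity ? this->capacity : 1;
        char * grown;

        /* double until there is room for the string and its terminator */
        while ( size >= capacity )
            capacity <<= 1;

        grown = realloc( this->native, capacity );
        if ( !grown )
            return NULL;   /* the old contents are left untouched */
        this->native = grown;
        this->capacity = capacity;
    }

    strcpy( this->native, native );
    this->size = size;

    return this;
}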

void String__delete(String *this)
{
    free( this->native );
}
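To illustrate the point about sizeof made at the top of this answer, here is a small sketch (both helper names are hypothetical). The STRCPY macro in the question only works because `native` is a fixed-size array there; once `native` becomes a pointer to heap memory, sizeof no longer reports the buffer size.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: what sizeof reports for a fixed-size array. */
size_t sizeof_fixed_array(void) {
    char fixed[1024];
    return sizeof(fixed);           /* the whole array: 1024 bytes */
}

/* Hypothetical helper: what sizeof reports for a heap buffer of `bytes` bytes. */
size_t sizeof_heap_pointer(size_t bytes) {
    char *buffer = malloc(bytes);
    size_t seen = sizeof(buffer);   /* size of the pointer itself, not of the allocation */
    free(buffer);                   /* free(NULL) is harmless if malloc failed */
    return seen;
}
```

Whatever size is requested, `sizeof_heap_pointer` returns the same value, `sizeof(char *)`, which is why the capacity has to live in its own field.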

Why is this necessary/useful? Cannot I simply ‘sizeof’ the string to get the current capacity? — ELLIOTTCABLE, Feb 11 '10 at 10:21
Actually not. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof. — Baltasarq, Feb 11 '10 at 11:07
Ah, forgive me. I didn’t know that. I can’t upvote your answer again unless you edit it; if you do so, I’ll gladly add a +1 (-: — ELLIOTTCABLE, Feb 11 '10 at 11:23

score 2 · Answer 4 · answered Feb 11 '10 at 09:18

realloc will move your data to a bigger buffer when necessary, so calling that function is all it takes to resize your string. Use the following struct for your string:

struct mystring
{
    char * string;
    size_t size;
};

The important part is keeping track of the size of your string, and having every string manipulation function check that the size makes sense.

You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full; you have to implement all the functions for this yourself. It is preferable (far less error-prone, and more secure) to mutate a string by moving from one immutable string to another (using the semantics of realloc).
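The "pre-allocate and only realloc when full" approach might look something like this. It is a minimal sketch, not a complete string library: `struct growbuf` and `growbuf_append` are hypothetical names, the struct is the one above with an assumed extra `capacity` field, and the starting capacity of 16 and the doubling policy are arbitrary choices.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: the struct above plus a capacity field. */
struct growbuf {
    char  *string;     /* heap buffer, null-terminated once non-NULL */
    size_t size;       /* characters in use, excluding the terminator */
    size_t capacity;   /* bytes allocated */
};

/* Appends `text` (which must not point into b->string).
   Returns 0 on success, or -1 if allocation fails, in which case
   the buffer is left unchanged. Overflow of the capacity
   arithmetic is not checked in this sketch. */
int growbuf_append(struct growbuf *b, const char *text) {
    size_t extra  = strlen(text);
    size_t needed = b->size + extra + 1;          /* +1 for the terminator */

    if (needed > b->capacity) {
        size_t new_capacity = b->capacity ? b->capacity : 16;
        while (new_capacity < needed)
            new_capacity *= 2;                    /* grow geometrically */

        char *grown = realloc(b->string, new_capacity);
        if (!grown)
            return -1;                            /* old buffer is still valid */
        b->string   = grown;
        b->capacity = new_capacity;
    }

    memcpy(b->string + b->size, text, extra + 1); /* copies the terminator too */
    b->size += extra;
    return 0;
}
```

Starting from a zero-initialised `struct growbuf`, repeated calls to `growbuf_append` only reach realloc when the buffer is actually full; the caller releases the memory with `free(b.string)`.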

Yes, but how can I declare this in the struct? Could you include some code, based on the example I provided? I’d be much appreciative. — ELLIOTTCABLE, Feb 11 '10 at 09:21
Ah, I spoke too soon. Thanks for the edit, and useful information. (-: — ELLIOTTCABLE, Feb 11 '10 at 09:22

score 0 · Answer 5 · answered Jun 03 '11 at 03:43

Some people prefer to use the "rope" data structure to store a string of characters of unbounded length, rather than a contiguous string (C string).

A simplified rope can be defined something like:

#include <stdio.h>

struct non_leaf_rope_node{
    char zero;
    union rope* left_substring;
    union rope* right_substring;
    // real rope implementations have a few more items here
};
#define rope_leaf_max ( sizeof( struct non_leaf_rope_node ) )

typedef union rope {
    char rope_data[ rope_leaf_max ];
    struct non_leaf_rope_node pointers;
} rope;

void print( const union rope *node ){
    if( node->rope_data[0] != '\0' ){
        // leaf node: short literal data (assumed to be null-terminated)
        fputs( node->rope_data, stdout );
    }else{
        // non-leaf node: print both halves in order
        print( node->pointers.left_substring );
        print( node->pointers.right_substring );
    }
}
// other details involving malloc() and free() go here

int main(void){
    rope left = { "Hello," };
    rope right = { " World!" };
    rope root = { {0} };  // a zero first byte marks a non-leaf node
    root.pointers.left_substring = &left;
    root.pointers.right_substring = &right;
    print( &root );

    return 0;
}

A rope with fewer than rope_leaf_max characters is stored the same way as a null-terminated C string. A rope containing more than rope_leaf_max characters is stored as a root non_leaf_rope_node pointing to the left and right sub-strings (which may in turn point to left and right sub-sub-strings), eventually pointing to leaf nodes; each leaf node contains at least one character of the full string.

A rope always stores at least one character, so we can always tell: If the first byte of a rope node is non-zero, that node is a leaf node storing literal characters. If the first byte of a rope node is zero, that node stores pointers to left and right sub-strings. (Real rope implementations often have a third kind of rope node).

Often using ropes requires less total RAM space than using C strings. (A node containing a phrase such as "New York City" can be re-used multiple times in one rope, or in some implementations shared between two ropes). Sometimes using ropes is faster than using C strings.
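The sharing mentioned above can be sketched with the same simplified node layout. The helpers `rope_length` and `rope_join` are hypothetical additions for illustration, and leaves are assumed to be null-terminated, so a leaf holds at most rope_leaf_max - 1 characters here.

```c
#include <stddef.h>
#include <string.h>

struct non_leaf_rope_node {
    char zero;                       /* 0 marks a non-leaf node */
    union rope *left_substring;
    union rope *right_substring;
};
#define rope_leaf_max (sizeof(struct non_leaf_rope_node))

typedef union rope {
    char rope_data[rope_leaf_max];
    struct non_leaf_rope_node pointers;
} rope;

/* Hypothetical helper: counts the characters reachable from `node`. */
size_t rope_length(const rope *node) {
    if (node->rope_data[0] != '\0')
        return strlen(node->rope_data);   /* leaf: assumed null-terminated */
    return rope_length(node->pointers.left_substring)
         + rope_length(node->pointers.right_substring);
}

/* Hypothetical helper: makes `out` a non-leaf node joining two existing ropes.
   Nothing is copied, so `left` and `right` may be shared with other ropes. */
void rope_join(rope *out, rope *left, rope *right) {
    memset(out, 0, sizeof *out);          /* zero first byte: non-leaf */
    out->pointers.left_substring  = left;
    out->pointers.right_substring = right;
}
```

Joining a leaf "New York" with a leaf ", " and then joining the result with the same "New York" leaf again yields an 18-character rope in which the characters of "New York" are stored only once.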
