Best way to represent formatted text in memory? C++

Question

I'm writing a basic text editor; well, it's really an edit control box in which I want to write code, numerical values and expressions for my main program.

The way I'm currently doing it is that I feed character strings into the edit control. In the edit control I have a class that breaks the string up into “glyphs” like words, numbers, line breaks, tabs, format tokens, etc. The word glyphs for example contain a string representing a literal word and a short integer that represents the number of trailing white spaces. The glyphs also contain info needed when drawing the text and calculating line wrapping.

For example, the text line “My name is Karl” would become a linked list of glyphs like this: NewLineGlyph → WordGlyph(“My”, 1 whitespace) → WordGlyph(“name”, 1 whitespace) → WordGlyph(“is”, 1 whitespace) → WordGlyph(“Karl”, 0 whitespace) → NULL.

So instead of storing the string in memory as a continuous block of chars (or WCHARs), it is stored in small chunks with potentially lots of small allocations and deallocations.
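To make the layout described above concrete, here is a minimal sketch of the word-glyph idea. The names WordGlyph and toGlyphs are made up for illustration, std::list stands in for the hand-rolled linked list, and only the blank space character is treated as a separator; the other glyph kinds (line breaks, tabs, format tokens) are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical glyph type: a literal word plus the number of blank spaces
// that follow it. Drawing and wrapping info is omitted in this sketch.
struct WordGlyph {
    std::wstring text;
    unsigned short trailingSpaces;
};

// Break one line into word glyphs. Each glyph is a separate list node,
// so every word costs at least one small heap allocation.
// A line that starts with spaces yields a first glyph with an empty word.
std::list<WordGlyph> toGlyphs(const std::wstring& line) {
    std::list<WordGlyph> glyphs;
    std::size_t i = 0;
    while (i < line.size()) {
        // The word runs up to the next space (or the end of the line).
        std::size_t wordEnd = line.find(L' ', i);
        if (wordEnd == std::wstring::npos) wordEnd = line.size();

        // Count the spaces that follow the word.
        std::size_t next = wordEnd;
        while (next < line.size() && line[next] == L' ') ++next;
        assert(next - wordEnd <= 0xFFFF);  // must fit in the short counter

        glyphs.push_back({line.substr(i, wordEnd - i),
                          static_cast<unsigned short>(next - wordEnd)});
        i = next;
    }
    return glyphs;
}
```

Calling toGlyphs(L"My name is Karl") gives four nodes, matching the chain in the example above minus the leading NewLineGlyph.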

My question is: should I be concerned about heap fragmentation when doing it this way? Do you have any tips on making this more efficient? Or is there a completely different way of doing it? :)

PS. I'm working in C++ on Win7.

I'm curious: Why do you need to store the number of trailing whitespace characters? — Rudy Velthuis, Sep 02 '11 at 15:50
Just convenience really, I didn't think they deserved their own glyph. That way if there are many white spaces I can represent that with just one number which is the same size as a wchar. — Karl Hansson, Sep 02 '11 at 16:11
@Karl remember that you are already making a simplification. Many languages support many different characters. For example in C# white space is (other than space): Any character with Unicode class Zs, Horizontal tab character (U+0009), Vertical tab character (U+000B), Form feed character (U+000C) — xanatos, Sep 02 '11 at 20:06
Hi xanatos. That is a good point I didn't realize that. When I said white space I meant the blank "space" character in particular. The horizontal tab and other white spaces would typically have their own glyphs. — Karl Hansson, Sep 04 '11 at 08:46

score 2 · Accepted Answer · answered Sep 02 '11 at 19:30

Should you be concerned about fragmentation? The answer likely depends on how large your documents are (e.g., number of words), and how much editing will occur and the nature of those edits. The approach you have outlined might be reasonable for a static (read-only) document where you can "parse" the document once, but I imagine there will be a fair amount of work that needs to happen behind the scenes to keep your data structures in the correct state as a user is making arbitrary edits. Also, you'll have to decide on what a "word" is, which isn't necessarily obvious/consistent in every case. For example, is "hard-working" one word or two? If it's one, does that mean you will never word wrap at the hyphen? Or, consider the case where a "word" will not fit on a single line. In that case, will you simply truncate, or would you want to force break the word across lines?

My recommendation would be to store the text as a block, and store the line breaks separately (as offsets into the text block), then recalculate line breaks as needed each time there is a change. If you're concerned about fragmentation and minimizing the number of allocations/deallocations, you could allocate fixed-size blocks and then manage memory inside of those blocks yourself. Here's what I've done in the past:

Text is stored as a block of characters, but rather than having a single contiguous block for the entire document, I maintain a linked list of blocks that are always allocated 4KB (i.e., either 4K single-byte chars, or 2K WCHARs). In other words, the text is stored as a linked list of arrays, where each array is allocated to a constant size.
Each block keeps track of how much space (i.e., how many characters) is used/free within that block.
When inserting one or more characters, if there is space in the current block, I can simply shift memory within that block (no allocation/deallocation required). If no space is available in the current block, but space is available in the adjacent block, then again I can just shift memory between existing blocks (no allocation/deallocation required). If both blocks are full, only then do I allocate a new 4KB block and add it at the appropriate position in the linked list.
When deleting one or more characters, I simply need to shift memory (at most 4KB) rather than the entire document text. I may also have to deallocate and remove any block(s) that become completely empty.
I also do some "garbage collection" to coalesce free space at appropriate times. This is fairly straightforward and involves moving characters from one block to another so that some blocks become empty and can be removed.
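The insertion rules in the list above could be sketched roughly as follows. This is not the answerer's actual code: BlockText and its members are hypothetical names, the block size is a template parameter so that small values can be tried out, only the following block is checked for free space (not the preceding one), and deletion and the coalescing pass are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <list>
#include <string>

// Text stored as a linked list of fixed-capacity character blocks.
template <std::size_t BlockSize = 4096>
class BlockText {
    struct Block {
        char data[BlockSize];
        std::size_t used = 0;  // number of characters stored in this block
    };
    std::list<Block> blocks_;
    std::size_t size_ = 0;

public:
    std::size_t size() const { return size_; }
    std::size_t blockCount() const { return blocks_.size(); }

    // Insert one character so that it ends up at document offset pos.
    void insert(std::size_t pos, char c) {
        assert(pos <= size_);
        if (blocks_.empty()) blocks_.emplace_back();

        // Find the block that holds pos; pos becomes an offset inside it.
        auto it = blocks_.begin();
        while (pos > it->used && std::next(it) != blocks_.end()) {
            pos -= it->used;
            ++it;
        }

        if (it->used == BlockSize) {
            auto next = std::next(it);
            // Only when this block and the next are both full do we allocate.
            if (next == blocks_.end() || next->used == BlockSize)
                next = blocks_.emplace(next);
            if (pos == BlockSize) {
                // Appending to a full block: the character belongs at the
                // front of the next block instead.
                it = next;
                pos = 0;
            } else {
                // Spill this block's last character into the next block.
                std::memmove(next->data + 1, next->data, next->used);
                next->data[0] = it->data[BlockSize - 1];
                ++next->used;
                --it->used;
            }
        }

        // The block now has room: shift its tail right by one and store c.
        std::memmove(it->data + pos + 1, it->data + pos, it->used - pos);
        it->data[pos] = c;
        ++it->used;
        ++size_;
    }

    // Concatenate all blocks; handy for checking the contents.
    std::string str() const {
        std::string out;
        out.reserve(size_);
        for (const Block& b : blocks_) out.append(b.data, b.used);
        return out;
    }
};
```

With a tiny block size such as BlockText<4>, appending eight characters fills two blocks, and a further insert in the middle of the first block allocates a third block while moving at most one block's worth of characters.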

From the OS and/or runtime library's perspective, all of the allocations/deallocations are the same size (4KB), so there is no fragmentation. And since I manage the contents of that memory, I can avoid fragmentation within my allocated space by shifting memory contents to eliminate wasted space. The other advantage is that it minimizes the number of alloc/dealloc calls, which can be a performance concern depending on what allocator you are using. So, it's an optimization for both speed and size -- how often does that happen? :-)
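As for the other half of the recommendation, keeping line breaks as offsets into the text and recalculating them after a change, a very small sketch might look like this. The function name lineStarts and the maxCols parameter are assumptions for illustration; it assumes a fixed-width font, takes the text as one contiguous string for simplicity, and force-breaks a line only when it runs past maxCols characters, with no attempt to break at word boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the offset of the first character of every display line.
// A new line starts after each '\n', and also wherever a line would
// exceed maxCols characters (forced break).
std::vector<std::size_t> lineStarts(const std::string& text,
                                    std::size_t maxCols) {
    assert(maxCols > 0);
    std::vector<std::size_t> starts;
    starts.push_back(0);  // the first line always starts at offset 0
    std::size_t col = 0;  // characters already placed on the current line
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n') {
            starts.push_back(i + 1);  // next line begins after the newline
            col = 0;
        } else if (col == maxCols) {
            starts.push_back(i);      // line is full: break before text[i]
            col = 1;
        } else {
            ++col;
        }
    }
    return starts;
}
```

Because the result is just a vector of offsets, it can be thrown away and rebuilt after an edit, or rebuilt only from the first affected line onward if a full pass turns out to be too slow.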

Hi cbranch. Many thanks for your reply, you've got some really good points there. I like your way of managing a dedicated area of the memory for the purpose of text. I'm already toying with ideas in this direction and I'll be looking for information on this. :) — Karl Hansson, Sep 04 '11 at 08:13
@cbranch (continued): The main purpose of my text boxes is to store and display expressions and code-style text, so at the moment I'm not thinking of creating a fully fledged rich text editor. I would, though, like to have features like syntax highlighting and different fonts and colours throughout the text. Since it is code I want to display first and foremost, word wrapping would only occur when a word doesn't fit on a single line. But then again, since I'm writing this text box I might as well do it properly and plan ahead so I can add more advanced rich text features to it later. — Karl Hansson, Sep 04 '11 at 08:13

score 1 · Answer 2 · answered Sep 02 '11 at 19:49

I wouldn't worry about heap fragmentation; modern heap managers are pretty good at dealing with that.

I might worry about poor data locality, though. With each glyph as a separate allocation in a linked list (especially a non-invasive list like std::list), any sort of pass through the document is going to jump all over memory in a potentially non-cache-friendly way.
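One possible way to improve locality, sketched here purely as an illustration and not something this answer prescribes, is to keep the characters in one contiguous buffer and store the glyphs as small records in a std::vector that refer to it by offset. GlyphRef, GlyphTable, buildGlyphTable and wordAt are all hypothetical names, and only blank spaces are treated as separators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A glyph that refers to its word by position instead of owning a string.
struct GlyphRef {
    std::uint32_t offset;          // index of the word's first character
    std::uint32_t length;          // number of characters in the word
    std::uint16_t trailingSpaces;  // blank spaces following the word
};

// All characters live in one string and all glyphs in one vector, so a
// pass over the document walks two contiguous arrays.
struct GlyphTable {
    std::string text;
    std::vector<GlyphRef> glyphs;
};

GlyphTable buildGlyphTable(const std::string& line) {
    assert(line.size() <= UINT32_MAX);  // offsets must fit in 32 bits
    GlyphTable table;
    table.text = line;
    std::size_t i = 0;
    while (i < line.size()) {
        std::size_t wordEnd = line.find(' ', i);
        if (wordEnd == std::string::npos) wordEnd = line.size();
        std::size_t next = wordEnd;
        while (next < line.size() && line[next] == ' ') ++next;
        table.glyphs.push_back({static_cast<std::uint32_t>(i),
                                static_cast<std::uint32_t>(wordEnd - i),
                                static_cast<std::uint16_t>(next - wordEnd)});
        i = next;
    }
    return table;
}

// Recover the literal word for a glyph when it is actually needed.
std::string wordAt(const GlyphTable& t, std::size_t index) {
    assert(index < t.glyphs.size());
    const GlyphRef& g = t.glyphs[index];
    return t.text.substr(g.offset, g.length);
}
```

The trade-off is that inserting text in the middle shifts the offsets of every later glyph, which is exactly the kind of operation cost that should be weighed against how often each operation occurs.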

Text editors are harder than they seem at first blush. There are a lot of specialized data structures out there for representing blocks of text and structured documents. They each optimize for different types of operations. I recommend searching for explanations of them and then considering the types of operations you'll have to do most.

This paper is old, but it has a lot of good information: http://www.cs.unm.edu/~crowley/papers/sds.pdf

Hi Adrian. Thanks for your reply. I'm a bit concerned about poor data locality as well. I'm looking at ways to store the text in more contiguous blocks. My text editor will be more of a code editor, so things like syntax highlighting, bracket matching and easy parsing of the code are my main concerns. Performance is also a big concern. I'll try to look for example data structures that deal with this. Also, thanks for the paper on data structures for text; I've already started reading it. :) — Karl Hansson, Sep 04 '11 at 08:29
