Abstracting by data format, with examples from problems in 4coder

Allen Webster
Hey there everyone! This post is only a 4coder progress update in the sense that it features some code samples from recently written parts of the 4coder engine update "4.1.X".

This is more of a technical write up for discussion purposes. I have a neat thought about "abstractions" which we discuss a lot in the Handmade community, and to explain it I'd like to talk about the abstractions I have found useful in 4coder, and the abstractions that hurt more than they helped.

Of course I realize I might just be wrong, or my ideas here might just be a poorly developed version of something people have already thought through. So if you've got any suggested reading for me, I'd be happy to take a look. On the other hand if this is somehow useful to anyone else's thought process that is great!

Types of Abstractions
I want to make a distinction between two types of abstractions. One, which I will call code semantic abstraction, is made by creating reusable bits of code with language features. This includes things like a virtual function table, passing callbacks, switching on an ID, wrapping statements in a macro, or pulling code out into a new function. I call this "code semantic" abstraction because you are building a boundary between two parts of your solution, and the communication between those parts happens through some semantics in your language.

Another type of abstraction, which I will call data format abstraction, is made by defining a meaning for a data format. This includes things as simple as defining a struct, a union, or an enum, or more advanced data like a file format or a command buffer format. I think of these as abstractions because many data formats are intermediates that neither come directly from input nor have intrinsic meaning to output hardware. I see those intermediate formats as abstractions because, like code semantic abstraction, they allow two pieces of code to communicate in such a way that each side can mostly be altered without altering the other, so long as the boundary doesn't change. The only difference is that the communication happens by "reads" and "writes" instead of by passing parameters and returns.

A Toy Problem First
Okay, that was all really abstract (the pun was unavoidable). Let's look at something concrete. I wish I could get straight into real code, but I want a concrete example of something that I think is a bad abstraction choice, and 4coder's bad abstractions don't make very good examples. (There are plenty of bad abstractions in 4coder, but paradoxically the ones that make good examples are also the ones small and contained enough to fix right away, which I would do as soon as I found them, thus never completing this blog post.)

So let's turn our attention to a toy example instead. Imagine you are working on the physics for a 2D game. The game designer initially wants two collision shapes: a circle and an axis-aligned bounding box (AABB).

This approach is fairly unpopular in the Handmade community, but let's imagine what an "OOP" solution would look like:
class Collision_Circle;
class Collision_AABB;

class Collision_Shape{
public:
    virtual bool shapes_collide(Collision_Shape *other);
    virtual bool shape_circle_collide(Collision_Circle *other);
    virtual bool shape_aabb_collide(Collision_AABB *other);
    // NOTE(Allen): This isn't very "C++ OOP" but I don't feel like writing out getters and setters.
    Vec2 pos;
};

class Collision_Circle : public Collision_Shape{
public:
    bool shapes_collide(Collision_Shape *other){
        return other->shape_circle_collide(this);
    }
    bool shape_circle_collide(Collision_Circle *other){
        // logic for circle/circle collision
    }
    bool shape_aabb_collide(Collision_AABB *other){
        // logic for circle/AABB collision
    }
    f32 radius;
};

class Collision_AABB : public Collision_Shape{
public:
    bool shapes_collide(Collision_Shape *other){
        return other->shape_aabb_collide(this);
    }
    bool shape_circle_collide(Collision_Circle *other){
        // logic for AABB/circle collision
    }
    bool shape_aabb_collide(Collision_AABB *other){
        // logic for AABB/AABB collision
    }
    f32 width;
    f32 height;
};


I can hear the groans from handmade developers reading already. Obviously a switch on a discriminated union will be way better, right? Let's see it.

enum Shape_Type{
    Shape_Circle,
    Shape_AABB,
};

struct Shape{
    Shape_Type type;
    Vec2 pos;
    union{
        struct{
            f32 radius;
        };
        struct{
            f32 width;
            f32 height;
        };
    };
};

bool shapes_collide(Shape *a, Shape *b){
    bool result = false;
    switch (a->type){
        case Shape_Circle:
        switch (b->type){
            case Shape_Circle:
            // logic for circle/circle collision
            break;

            case Shape_AABB:
            // logic for circle/AABB collision
            break;
        }
        break;

        case Shape_AABB:
        switch (b->type){
            case Shape_Circle:
            // logic for AABB/circle collision
            break;

            case Shape_AABB:
            // logic for AABB/AABB collision
            break;
        }
        break;
    }
    return(result);
}


Sure these are different in a lot of ways:

The organization and length of the boilerplate code is different.
The speeds are probably different.
Depending on whether you are in a 32-bit or 64-bit program, one option might pack more tightly than the other.
Although compressed serialization will take about as much work for both, the discriminated union offers a convenient serialization method with plain old memcpy's.
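
As a sketch of that last point (my own minimal example, not engine code), serializing an array of the discriminated-union Shape really is just one memcpy in each direction, because the struct is plain old data:

```cpp
#include <cstring>
#include <cstdint>

typedef float f32;
typedef uint32_t u32;
typedef uint8_t u8;
struct Vec2 { f32 x, y; };

enum Shape_Type{
    Shape_Circle,
    Shape_AABB,
};

struct Shape{
    Shape_Type type;
    Vec2 pos;
    union{
        struct{ f32 radius; };
        struct{ f32 width; f32 height; };
    };
};

// Shape is plain old data, so a whole array round-trips through a
// byte buffer without any per-field packing code.
void serialize_shapes(Shape *shapes, u32 count, u8 *out){
    memcpy(out, shapes, count*sizeof(Shape));
}

void deserialize_shapes(u8 *in, u32 count, Shape *out){
    memcpy(out, in, count*sizeof(Shape));
}
```

Of course this ties the stored format to the in-memory layout (padding, endianness), which is part of why a compressed or portable serialization still takes real work in both versions.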

Requirements Change
Okay okay, I'm sure that could go on for a while. But now I want to highlight the very important way in which they are exactly the same. Let's imagine now that our game designer has an epiphany. Suppose after playing around with the level design options the designer wants two new collision shapes: horizontal lines with a radius and vertical lines with a radius.

What changes do we have to make now?

Well, both versions have this sort of idea of "forward declaring" the types. So those both change in similar ways.
class Collision_Circle;
class Collision_AABB;
class Collision_Vertical_Line;
class Collision_Horizontal_Line;

enum Shape_Type{
    Shape_Circle,
    Shape_AABB,
    Shape_Vertical_Line,
    Shape_Horizontal_Line,
};


The Shape struct grows a little to have new "radius & height" and "radius & width" versions.
struct Shape{
    Shape_Type type;
    Vec2 pos;
    union{
        struct{
            f32 radius;
        } circle;
        struct{
            f32 width;
            f32 height;
        } aabb;
        struct{
            f32 radius;
            f32 height;
        } v_line;
        struct{
            f32 radius;
            f32 width;
        } h_line;
    };
};


The expansion of the OOP boilerplate is too painful for me to write out, but you need two new virtual functions for the second dispatch, and there need to be two new classes inheriting from Collision_Shape with the "radius & height" members and the "radius & width" members.

And finally, the big similarity, the OOP and discriminated union solutions BOTH need a 4x4 double dispatch, which is horrible to write out whether it goes into virtual functions or into a doubly nested switch.

The Data Format Abstraction Way
Both of the above solutions are examples of "code semantic abstraction". The specific problem is that given any two shapes we want to determine whether they are colliding. We have tried to solve this problem by making a single procedure (or something that looks like a single procedure) that can answer that question for any two types of shape. In other words our "collide" function abstracts the fact that there actually are multiple types of shapes in our data that might want to collide. The programmer on the caller side may think about physics or gameplay logic and if shape types are irrelevant the programmer doesn't have to think about them. On the implementation side, the programmer, in addition to writing collision logic, is sorting out what to do with the given shape types.

If you look at my (suspiciously simplistic) set of shapes again, you might see there are really only three fields that the union compresses: "radius", "width", and "height". What would happen if we didn't use a union and just let those members all mix together in a struct?

struct Shape{
    Shape_Type type;
    Vec2 pos;
    f32 radius;
    f32 width;
    f32 height;
};


So we've reduced boilerplate, but we still have "type" so there will still be a 4x4 dispatch grid. Can we eliminate type? We could eliminate type and just use zero or negative values in other fields to indicate type, but then we're still going to have a huge 4x4 grid, just with an if tree instead of switches. So we need to eliminate type in a way that actually reduces this down to one type of shape.

What if an AABB could have a width and height of zero, and the radius was the radius around the border of AABB? Then a circle would just be a shape where the AABB portion has a width and height of zero. The AABBs would have a radius of zero. Vertical lines just have a width of zero, and horizontal lines just have a height of zero.

Instead of implementing ten different collision routines, now we just implement one "collide" routine that takes in two rounded rectangles and behaves well so long as the shape has width >= 0, height >= 0, and radius >= 0. This idea also has the bonus that if the designer ever wants a rounded rectangle, or unrounded lines, that will already be done too!

struct Shape{
    Vec2 pos;
    f32 radius;
    f32 width;
    f32 height;
};


This also offers a lot of flexibility. If the designer is uneasy about the idea that circles are going to have width and height controls, then you can still use the discriminated union in the level editor and convert sometime later in the pipeline. If you have network programmers who liked the compression they were getting on shapes because circles are very common, then you can convert back to the compressed format. Speaking of which, if you know that circle-circle collision is a very common case because a majority of the shapes are still circles, then you can still special case your logic to accelerate that one case. In a sense, before, you were special casing every single case even though you might only need to accelerate one; this gives you that option:

bool shapes_collide(Shape *a, Shape *b){
    bool result = false;
    if (a->width == 0 && a->height == 0 && b->width == 0 && b->height == 0){
        // NOTE(allen): Pull out circle/circle because sometimes
        // we might have pre knowledge about the shapes that allows us to skip the if.
        result = circle_circle_collide(a->pos, a->radius, b->pos, b->radius);
    }
    else{
        // general rounded AABB collision logic.
    }
    return(result);
}
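
The "general rounded AABB collision logic" in that else branch is elided above; here is my own sketch of one way it could work (an illustration, not 4coder or engine code): the shapes touch when the distance between their core rectangles is at most the sum of their radii.

```cpp
#include <cmath>

typedef float f32;
struct Vec2 { f32 x, y; };

struct Shape{
    Vec2 pos;
    f32 radius;
    f32 width;
    f32 height;
};

// Gap between the two core rectangles along one axis;
// zero when the rectangles already overlap on that axis.
static f32 axis_gap(f32 delta, f32 half_a, f32 half_b){
    f32 gap = fabsf(delta) - (half_a + half_b);
    return (gap > 0) ? gap : 0;
}

bool rounded_collide(Shape *a, Shape *b){
    f32 gx = axis_gap(b->pos.x - a->pos.x, a->width*0.5f, b->width*0.5f);
    f32 gy = axis_gap(b->pos.y - a->pos.y, a->height*0.5f, b->height*0.5f);
    f32 r = a->radius + b->radius;
    // Compare squared distances to avoid the square root.
    return (gx*gx + gy*gy) <= r*r;
}
```

With width == height == 0 this degenerates to circle/circle, with radius == 0 it is plain AABB/AABB, and the line shapes fall out of zeroing one extent.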


Both types of abstraction, "code semantic" and "data format", are useful and even required. We started with some of both and we ended with some of both. But we moved from heavily relying on code semantics to heavily relying on data formats. This is an example of how polymorphism can be simplified with this mindset, but other types of code semantics can be addressed this way too, which you will see in the 4coder example now!

Real problems in 4coder
If you go back several posts on this blog, I describe in detail the primary problem that has to be solved by the buffer engine: the layout and coordinates for a buffer. I am not talking about the problem of choosing the underlying data structure to store the data in a buffer, such as a gap buffer, an array of lines, or any other fancy idea. I'm talking about the problem of converting between different "coordinate systems" such as byte index, character index, line/character number, and x/y pixel position. The coordinates problem basically requires the same "meta" data structures no matter what the underlying data structure is.

If you haven't read that post I encourage you to read it here at some point, because it describes in a lot of detail the problem that I am talking about now. I will try to summarize enough to be clear here though.

What I didn't include in that post was the structure of the code that I wrote for solving the problem. The code was too big and sprawling to really pull out meaningful parts. Basically, several different subsystems all had to be programmed to agree with one another about the layout. The rendering system output a series of characters and the screen positions where those characters needed to be rendered. The cursor positioning system also had to figure out the positions of characters, but it worked in all coordinates, not just in the few that the rendering system cared about, and it only emitted the full set of coordinates for one character. Then there was the metadata measurement system, which would look at edits to the text and figure out the layout of characters in order to generate a few arrays of metadata that were used to accelerate the other two systems; this, too, relies on accurately discovering the layout of the buffer.

It had evolved naturally from a time when the duplication of the layout logic was so minor it didn't look like duplication and I didn't have the experience to recognize that it would be a problem... I'm still not sure I would be able to see it early on. By the time it became clear that this was becoming a duplication problem, there were many assumptions about input/output patterns of these things throughout the code and so many subtle differences between their internal logic, that the cost of maintaining the duplication seemed lower than the cost of rewriting everything.

It wasn't until I piled on code wrapping and UTF-8 that it became clear that the rewrite was unavoidable. I did manage to get all the current features working by just piling in more layers of "code semantic" abstraction for each feature until it worked. I don't feel like digging into the repository to pull out an example of what that looked like, and I probably couldn't fit a reasonable portion of it on the screen and explain it anyway. Instead I will sketch out how the whole thing worked with some pseudo code, and point out where I am using "code semantic" abstraction.

Because the buffer stored UTF-8 and every feature needed to understand that, the outermost layer of the process was to do a translation:
void foo(u8 *data, u32 length){
    i32 pos = 0;
    u32 apparent_character = 0;
    for (; extract_character(data, length, &pos, &apparent_character);){
        // TODO(allen): Process the apparent character
    }
}


The idea here is that we are looping over input bytes and getting output characters. It is actually not this simple though, because the buffer is not a contiguous series of bytes. So it actually gets wrapped in the streaming API I talked about in the "4.0.18" post, which looks something like this:

void foo(Buffer *buffer){
    u32 i = 0;
    for (Buffer_Iterator it = begin_stream(buffer, i, buffer->size);
         it.is_good;
         it = step_stream(it)){
        for (; i < it.end;){
            Step_Emits emits = translate(it.data, &i, it.end, buffer->size);
            for (u32 j = 0; j < emits.count; ++j){
                // TODO(allen): Process the apparent character "emits.step[j]"
            }
        }
    }
}


So now this supports a gap buffer and any other non-perfectly-contiguous data structure. We had to handle the possibility that the number of emits from one step of translation isn't always 1. We might not get any if our chunk cuts off in the middle of a series that looks like it could be a valid character but depends on the next chunk to be completed. We also want to handle the case where a series of bytes is consumed and discovered to all be invalid. We want to emit them all at once, rather than redoing that section of the loop multiple times to refigure out the same thing, and we already have to handle the case where the number of emits is not always 1, so why not?
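
To make the chunk-boundary issue concrete, here is a toy decoder step of my own (not the actual translate implementation) that reports when a multi-byte sequence is cut off by the end of a chunk:

```cpp
#include <cstdint>

typedef uint8_t u8;
typedef uint32_t u32;

enum Translate_Result{
    Translate_Emit,
    Translate_NeedMoreInput,
    Translate_Invalid,
};

// Decode one UTF-8 codepoint from data[*pos .. end). On success this
// advances *pos past the sequence and emits the codepoint; a sequence
// that runs off the end of the chunk yields Translate_NeedMoreInput so
// the caller can retry once the next chunk arrives.
Translate_Result translate_step(u8 *data, u32 end, u32 *pos, u32 *codepoint_out){
    static const u8 masks[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    u8 first = data[*pos];
    u32 len = 1;
    if (first >= 0xF0) len = 4;
    else if (first >= 0xE0) len = 3;
    else if (first >= 0xC0) len = 2;
    else if (first >= 0x80){
        *pos += 1;
        return Translate_Invalid; // stray continuation byte
    }
    if (*pos + len > end){
        return Translate_NeedMoreInput; // sequence crosses the chunk boundary
    }
    u32 cp = first & masks[len];
    for (u32 k = 1; k < len; ++k){
        u8 b = data[*pos + k];
        if ((b & 0xC0) != 0x80){
            *pos += 1;
            return Translate_Invalid;
        }
        cp = (cp << 6) | (b & 0x3F);
    }
    *pos += len;
    *codepoint_out = cp;
    return Translate_Emit;
}
```

A real version also has to remember the dangling bytes between calls and batch the invalid-run emits, which is exactly the state that makes the three duplicated loops so intricate.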

So here's a little problem. There are at least three major places in the code that will all have this exact structure, and that looks intricate enough that it would be nice to deduplicate... but the part that varies between the three major pieces of code is everything in the main body of the loop, and to make matters worse those three things all want very different logic about when to break out of the various levels of the triple loop structure. Basically we're looking at polymorphically changing the loop body and setting up some return codes to control the loop breaking, or wrapping the loops in some kind of macro, or just living with the duplication. That is, those are the "code semantic" options. I chose to live with the duplication.

The Metadata Measuring Problem
Now, luckily, the metadata acts sort of as a data format abstraction, which means the rest of the logic for seeking and rendering can just read the metadata and doesn't have to duplicate the line wrapping logic. However, I'd like to look at how the metadata measurement works, because there are a few more code semantic abstractions there.

Measuring metadata just means generating a few arrays. One stores the index of the first character for each line. One stores the wrapped y-value of each line. The unwrapped y-value is just "line_height*line_index", but with wrapping each line can span an arbitrary multiple of line_height. In order to enable y-coordinate seeking in wrapped buffers, we need to be able to convert between line index and wrapped y-value. Then for virtual whitespace we need an array that stores the starting x-value for each wrapped line.

The array of line indexes is easily separated from everything else and solved in a single traversal of the buffer. The other two are more difficult to separate, especially for code wrapping, because they both rely on this somewhat abstract idea of where wrapping points actually are. For instance consider the following two identical lines of code, wrapped at two different locations:
                foobar(param1, param2,
                       param3, param4);

vs
                foobar(
                    param1, param2, param3, param4);


As you can see, the simple act of changing the wrap point changes the x-shift of the wrapped line. Furthermore, future wrap points might depend on the x-shift that was chosen for the previous line!

So wrap points, wrapped y-values, and x-shifts all have relations to one another that make them inseparable, or at least not apparently separable. It was also a goal (which I have not yet achieved) that the wrapping and shifting of lines would be programmable logic that any 4coder user could customize. So, keeping that in mind, I wanted a separation between the wrapped y-value logic (which can be determined from wrap points alone) and the logic that picks wrap points and x-shift values. So I end up with something like this:

void foo(Buffer *buffer, f32 width){
    Wrap_State wrap_state = make_wrap_state(width);
    bool do_backtrack = false;
    u32 i = 0;
    for (Buffer_Iterator it = begin_stream(buffer, i, buffer->size);
         it.is_good;
         it = step_stream(it)){
        for (; i < it.end;){
            Step_Emits emits = translate(it.data, &i, it.end, buffer->size);
            for (u32 j = 0; j < emits.count; ++j){
                Wrap_Update update = wrap_update(&wrap_state, emits.step[j]);
                // TODO(allen): Pull all of this out into a callback.
                switch (update.type){
                    case Wrap_NeedsWrapPoint:
                    {
                        // Examine all potential wrap points, evaluate which would be most visually pleasing.
                        // Determine the x_shift for the newly wrapped line.
                        // Emit metadata about x_shift
                        wrap_at_point(&wrap_state, chosen_index, x_shift);
                        // Backup the reading position to the chosen_index, we cannot be sure the potential wrap points will
                        // be evaluated the same again.
                        do_backtrack = true;
                    }break;
                    case Wrap_NeedsNextCheckPoint:
                    {
                        // Zoom ahead to the next position that might be good for wrapping (usually whitespace)
                        wrap_check_point(&wrap_state, next_check_point);
                        // Emit a potential wrap point with some info about the state of the indentation that would be applied here.
                        // By the way, you're going to have to be reading tokens and synchronizing that read position with the next_check_point.
                    }break;
                }
                if (do_backtrack) goto doublebreak;
            }
        }
        doublebreak:;
        if (do_backtrack){
            // reset the stream position to chosen_index
            do_backtrack = false;
        }
    }
}


Do you agree with me that this looks unruly? I think the real code was even worse; there was a third case on the switch, and there were a ton of variables for managing the synchronization between the token reading, the indenter logic, and the streaming position. And I haven't even pulled out the wrapping logic into a callback yet! At this point there isn't anything about this that uses runtime polymorphism, and yet it's super complicated. What's going on?

Some might think it would be simpler if I pulled out some of this logic into a separate procedure; perhaps if everything under the "j < emits.count" loop were a separate procedure it wouldn't be so bad... but I think it would be exactly the same thing. Perhaps the popular "handmade" attitude would say I've already abstracted away too many details, and this Wrap_State stuff should all be written out inline in the loop to clear things up. But I don't know how anyone will ever be able to write their own wrap logic if they are responsible for understanding and maintaining a very specific internal data structure that doesn't directly relate to what they care about, so it has to be abstracted for them somehow.

Now wait a minute, there's that word "abstract". Where am I using abstraction in this code?

I am abstracting the y-position logic from the wrap point logic with some calls and a "Wrap_State". In a sense I also abstract the idea of UTF-8 translation. The wrapping itself doesn't think about the fact that it wraps UTF-8 encoded text; it just sees pre-decoded codepoints in a format that it is set up to understand. And the translation process doesn't think much about the underlying data structure; the idea of a chunk of text is abstracted from it. This is just a series of abstractions. All of these abstractions rely on some idea of data conversion already, but they rely more heavily on some kind of semantic of my code. What happens if I shift my attention to the data formats in the abstractions?

I could abstract the idea of a series of data chunks by forming and passing an array of structs like this:
struct Buffer_Chunk{
    u8 *data;
    u32 length;
};


And then I can just write a procedure that takes in a buffer and returns the array of data chunks.
u32 buffer_get_chunk_count(Buffer *buffer);
void buffer_get_chunks(Buffer *buffer, Buffer_Chunk *out_chunks);


As you can see, I'm still using code semantic abstraction, but I've shifted the focus to the data conversion. Now a loop over chunks is just a loop over an array, and doesn't involve a buffer chunk iterator.

(Note: this is real code from 4coder now, except that those two procedures aren't forward declared anywhere. Their implementation is fairly boring; I omit it because it would involve explaining details about the gap buffer data structure that are off topic.)
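
For illustration, a plausible implementation over a bare-bones gap buffer might look like this (my own stand-in, deliberately simpler than the real 4coder structure):

```cpp
#include <cstdint>

typedef uint8_t u8;
typedef uint32_t u32;

struct Buffer_Chunk{
    u8 *data;
    u32 length;
};

// Stand-in gap buffer: text occupies [0, gap_start) and [gap_end, size).
struct Buffer{
    u8 *memory;
    u32 gap_start;
    u32 gap_end;
    u32 size;
};

u32 buffer_get_chunk_count(Buffer *buffer){
    u32 count = 0;
    if (buffer->gap_start > 0) count += 1;
    if (buffer->gap_end < buffer->size) count += 1;
    return count;
}

void buffer_get_chunks(Buffer *buffer, Buffer_Chunk *out_chunks){
    u32 i = 0;
    if (buffer->gap_start > 0){
        out_chunks[i].data = buffer->memory;
        out_chunks[i].length = buffer->gap_start;
        i += 1;
    }
    if (buffer->gap_end < buffer->size){
        out_chunks[i].data = buffer->memory + buffer->gap_end;
        out_chunks[i].length = buffer->size - buffer->gap_end;
    }
}
```

A consumer then walks the returned array with two plain loops; there is no iterator state to carry between the levels.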

Alright let's look at the translation phase. No matter what I do I have to translate every character. What if I just translate them all and store them, then loop over them later?

This API may undergo more cleanup, because I think I've found a better way to handle output arrays, but here is the current version:
typedef u32 Buffer_Model_Step;

internal Translation_Result
translating_consume_chunk(Translation_Chunk_State *state, u8 *in_chunk, u32 in_chunk_size, Buffer_Model_Step *out_chunk, u32 out_chunk_size, Translation_Chunk_Emits *emits);


This still looks a bit complex, but the idea is to no longer emit one step at a time. This procedure emits as many steps as it can. If it runs out of input or output space it returns a code requesting more. If it finishes the input, it returns a code that states the translation is complete. It is designed this way so that the caller can be in charge of the allocation scheme. I think now it would be better to pass in a temporary memory arena, and have output arrays pushed onto the arena. This still allows as much flexibility of allocation, but I am finding it is more convenient because in a lot of cases there is already an arena available to the caller with plenty of space. Anyway, the point is that now translation is done all at once with a series of input chunks and output chunks.
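
For reference, the arena scheme I have in mind is nothing fancier than a bump allocator along these lines (hypothetical names, not the current 4coder API):

```cpp
#include <cstdint>

typedef uint8_t u8;
typedef uint32_t u32;

struct Memory_Arena{
    u8 *base;
    u32 pos;
    u32 cap;
};

// Bump-allocate size bytes from the arena; everything pushed is freed
// at once later by resetting pos, which suits temporary output arrays.
void *arena_push(Memory_Arena *arena, u32 size){
    if (arena->pos + size > arena->cap){
        return 0;
    }
    void *result = arena->base + arena->pos;
    arena->pos += size;
    return result;
}

#define push_array(arena, T, count) ((T*)arena_push((arena), sizeof(T)*(count)))
```

The translator would then push its output steps onto whatever scratch arena the caller already has open, instead of negotiating buffer sizes through the return codes.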

What we have so far looks like this (I have simplified the allocation parts because they're boring):

void foo(Buffer *buffer){
    // Get an array of chunks
    Buffer_Chunk chunks[2];
    u32 chunk_count = buffer_get_chunk_count(buffer);
    Assert(chunk_count <= ArrayCount(chunks));
    buffer_get_chunks(buffer, chunks);
    
    // Get an array of steps
    u32 step_max = maximum_needed_steps();
    Buffer_Model_Step *steps = allocate_steps(step_max);
    u32 step_count = 0;
    Translation_Chunk_State state = {0};
    for (u32 i = 0; i < chunk_count; ++i){
        Translation_Chunk_Emits emits = {0};
        Translation_Result result = translating_consume_chunk(
            &state, chunks[i].data, chunks[i].length,
            steps + step_count, step_max - step_count, &emits);
        if (i + 1 < chunk_count) Assert(result == Translating_NeedInput);
        else Assert(result == Translating_Finished);
        step_count += emits.count;
    }
    
    // Handle wrapping
    for (u32 i = 0; i < step_count; ++i){
        Wrap_Update update = wrap_update(&wrap_state, steps[i]);
        // TODO(allen): Pull all of this out into a callback.
        switch (update.type){
            case Wrap_NeedsWrapPoint:
            {
                // Examine all potential wrap points, evaluate which would be most visually pleasing.
                // Determine the x_shift for the newly wrapped line.
                // Emit metadata about x_shift
                wrap_at_point(&wrap_state, chosen_index, x_shift);
                // Backup the reading position to the chosen_index, we cannot be sure the potential wrap points will
                // be evaluated the same again.
                do_backtrack = true;
            }break;
            case Wrap_NeedsNextCheckPoint:
            {
                // Zoom ahead to the next position that might be good for wrapping (usually whitespace)
                wrap_check_point(&wrap_state, next_check_point);
                // Emit a potential wrap point with some info about the state of the indentation that would be applied here.
                // By the way, you're going to have to be reading tokens and synchronizing that read position with the next_check_point.
            }break;
        }
        if (do_backtrack){
            i = chosen_index;
            do_backtrack = false;
        }
    }
}


Wrapping still looks really complicated, but notice something interesting that has happened. I said it would be nice to abstract away the outer triple loop that does the streaming and translation, but it was tough because it surrounded the code that varies between uses, and there would be subtle control flow things going on with breaking at different levels. Well, instead of trying to abstract the code semantics, I have abstracted by introducing an intermediate format that the inner code reads, and now all of that upper code can quite easily be shared. Also notice that the control flow for backtracking looks less scary now in the wrapping code.

Now how do I want to untangle the wrapping logic? Well, there are a few things here that are really data but are never stored explicitly; they are implicitly computed and used within the code semantics. For one thing, there are wrap points. Wrap points are not stored, but we do compute them so that we can figure out x-shift and wrapped y-values. But we can't just generate an array of wrap points without also considering their x-shifts, as we discussed earlier. So those must be figured together in one stage; it is unavoidable. I also said that wrapped y-values are inseparable from this logic, but that was under the assumption that I would not be storing wrap points. If I store wrap points, then wrapped y-values can be computed in a single pass through the wrap points later.
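
With the wrap points stored, that later pass is simple; here is a sketch of how I picture it (my own illustration, not the shipped code):

```cpp
#include <cstdint>

typedef uint32_t u32;
typedef float f32;

struct Buffer_Wrap_Point{
    u32 byte_index;
    f32 x_shift;
};

// line_starts[i] is the byte index where real line i begins; wrap_points
// is sorted by byte_index. Each wrap point inside a line adds one extra
// visual line, so wrapped y-values fall out of a single merged walk.
void measure_wrapped_y(u32 *line_starts, u32 line_count,
                       Buffer_Wrap_Point *wrap_points, u32 wrap_count,
                       f32 line_height, f32 *out_y){
    u32 w = 0;
    f32 y = 0;
    for (u32 i = 0; i < line_count; ++i){
        out_y[i] = y;
        u32 line_end = (i + 1 < line_count) ? line_starts[i + 1] : 0xFFFFFFFFu;
        u32 extra = 0;
        for (; w < wrap_count && wrap_points[w].byte_index < line_end; ++w){
            extra += 1;
        }
        y += line_height*(1 + extra);
    }
}
```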

Furthermore, we need a line shift for each real line and for each wrap point. Much like finding that a rounded rectangle is a single type representing all of the shapes in our toy example, I realized I could output a homogeneous array to represent the solution to the wrapping problem:
struct Buffer_Wrap_Point{
    u32 byte_index;
    f32 x_shift;
};


So we will still have a callback from the user, but all it has to do is read an input array that specifies potential wrap points and write an output array that specifies the desired wrapping:

u32 wrap_proc(Line_Wrap_Parameters params, Buffer_Model_Word *words, u32 word_count, Buffer_Allocator *allocator);


The bit that says "Buffer_Allocator *allocator" is another example of me flailing around a bit looking for the best way to do array output; I think that will change too, in favor of just a scratch memory arena. Let's imagine I've cleaned that up and it looks more like this:

struct Wrap_Result{
    Buffer_Wrap_Point *points;
    u32 count;
};
Wrap_Result wrap_proc(Memory_Arena *output, Line_Wrap_Parameters params, Buffer_Model_Word *words, u32 word_count);


Now what is that little thing that says "Buffer_Model_Word"? I haven't described that yet. It abstracts a few things simultaneously. One responsibility of the old wrapper was to keep track of the x value after each consumed character to determine when a wrap is needed. Another responsibility was to identify good potential wrap points. I didn't want to pass both of those responsibilities on to the author(s) of the wrap_proc(s), so I had to invent a way to abstract them. The code semantic way would be to have the wrapping procedure take an array of "Buffer_Model_Step" and then provide helpers for tracking the x value, and maybe let the user come up with their own rule for potential wrap positions. After some experimentation I realized the only wrap positions that feel good are ones that occur at non-whitespace right after whitespace, so if wrap_proc is responsible for potential wrap points it will probably lead to a lot of duplication in identifying those points, since everyone ends up wanting that behavior. Therefore I decided to choose potential points for the user, simplifying a lot of things and avoiding pushing work onto users.

So what data format describes a potential wrap point and tracks x values? Something like this:

typedef u32 Buffer_Model_Word_Type;
enum Buffer_Model_Word_Type_{
    BufferWordType_Normal,
    BufferWordType_Newline,
};
struct Buffer_Model_Word{
    Buffer_Model_Word_Type type;
    u32 byte_start;
    u32 character_start;
    f32 width;
    f32 unwrapped_x_start;
};


An array of this type can be computed from the Buffer_Model_Step array and contains all the information that the line wrapping stage needs. The potential wrap points are the "byte_start" values of each word. The only x values we actually need in any of the line wrapping logic I've seen are the x values at potential wrap points. We also need the width of each word to determine which word is the first word that is too long to remain on the current line. And since we can also compute character starts at this point, and will need them later, those got thrown in too (they don't pertain much to line wrapping; I worked that out by trial and error in the last stage of this process, which also looks at words).

I also needed to preserve information about where REAL line breaks exist, hence the type field that can be "Normal" or "Newline".
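To make the division of labor concrete, here is a guess at what a minimal user wrap_proc could do with this word array: a plain greedy wrapper. The Memory_Arena and Line_Wrap_Parameters types are still in flux in this post, so this sketch takes a wrap width and a caller-provided output array instead; only the two structs are copied from above.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef float f32;

// Types copied from the post.
typedef u32 Buffer_Model_Word_Type;
enum {
    BufferWordType_Normal,
    BufferWordType_Newline,
};
typedef struct {
    Buffer_Model_Word_Type type;
    u32 byte_start;
    u32 character_start;
    f32 width;
    f32 unwrapped_x_start;
} Buffer_Model_Word;

typedef struct {
    u32 byte_index;
    f32 x_shift;
} Buffer_Wrap_Point;

// Greedy rule: whenever a word would end past wrap_width on the current
// (wrapped) line, wrap before it. Assumes unwrapped_x_start restarts at
// zero after each real newline. out must have room for word_count points.
static u32 greedy_wrap(f32 wrap_width,
                       const Buffer_Model_Word *words, u32 word_count,
                       Buffer_Wrap_Point *out){
    u32 count = 0;
    f32 line_start_x = 0;
    for (u32 i = 0; i < word_count; ++i){
        const Buffer_Model_Word *w = &words[i];
        if (w->type == BufferWordType_Newline){
            line_start_x = 0; // a real line break resets the x position
            continue;
        }
        f32 x = w->unwrapped_x_start - line_start_x;
        if (x > 0 && x + w->width > wrap_width){
            out[count].byte_index = w->byte_start;
            out[count].x_shift = 0; // no virtual indentation in this sketch
            ++count;
            line_start_x = w->unwrapped_x_start;
        }
    }
    return count;
}
```

Notice that this routine never touches the buffer text, a font, or a translation state: it answers exactly one question from exactly one array.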

After breaking down wrapping like this the measurement routine looks something like this (again simplified for an imaginary better allocation scheme that I haven't written yet).

void foo(Memory_Arena *scratch, Buffer *buffer, Wrap_Function *wrap_func){
    // Get an array of chunks
    Buffer_Chunk chunks[2];
    u32 chunk_count = buffer_get_chunk_count(buffer);
    Assert(chunk_count <= ArrayCount(chunks));
    buffer_get_chunks(buffer, chunks);
    
    // Get an array of steps
    u32 step_max = maximum_needed_steps();
    Buffer_Model_Step *steps = allocate_steps(step_max);
    u32 step_count = 0;
    Translation_State state = {0};
    for (u32 i = 0; i < chunk_count; ++i){
        Translation_Chunk_Emits emits = {0};
        Translation_Result result = translating_consume_chunk(
            &state, chunks[i].data, chunks[i].size,
            steps + step_count, step_max - step_count, &emits);
        if (i + 1 < chunk_count) Assert(result == Translating_NeedInput);
        else Assert(result == Translating_Finished);
        step_count += emits.count;
    }
    
    // Get an array of words
    u32 word_max = maximum_needed_words();
    Buffer_Model_Word *words = allocate_words(word_max);
    Word_Break_State word_state = {0}; // the word break stage keeps its own state
    u32 word_count = word_breaks_font_consume_chunk(&word_state, steps, step_count, true, words, word_max);

    // Get an array of wrap points from the user defined rule
    Wrap_Result wraps = wrap_func(scratch, wrap_params, words, word_count);

    // Generate the rest of the metadata (line begin indexes and wrapped y-values).
    // This API is not real because I went full-crazy with the allocation here in the real code base.
    // I have a plan to do it more like this in the real version, but the point is it just outputs
    // two arrays at once of different lengths, so it confused me at first. I hope I can find a way
    // to make a single homogenous array of output here.
    Metadata_Result result = compute_metadata(scratch, words, word_count, wraps.points, wraps.count);
    store_metadata(&buffer->metadata, result);
}


Is this better? How is this better?
What did I gain and what did I lose by making these changes? Well this clearly takes more memory because it stores lots of temporary buffers between stages. That might mean the top speed is not as good as the old version, because it basically does all the same computations but also does more loads and stores, with possibly more cache misses than when all of the memory is either in the original input, the final output, or the stack. On the other hand, all of the code is much simpler to profile and optimize with this change, and so in practice I already have it working faster than the old system, which was a hell of a hard time to optimize.

I have also gained a clear way in which wrapping will actually be programmable for users. I didn't actually eliminate the code semantic abstraction that allows for that either; the plan is still to use a callback. But the emphasis of what that callback does is now shifted towards a data conversion problem, where it gets just enough information to answer the question of how to wrap the text, and is only expected to answer that one question by outputting a simple homogenous array. Furthermore it only gets called once (hopefully; this depends on allocator nonsense). Before, it had to interact back and forth with basically every stage of this process: deciding when to do backtracking, computing x values, finding potential wrap points, and responding when the wrapped y-value system wanted a new wrap point to be chosen.

You may also notice that I've chosen to do a lot of stages with input chunks and output chunks of arbitrary length. This means I could spin up several threads and do the whole process in a pipeline if I decide I want to. I could even eliminate the chunks to increase the top speed of the single threaded option, but still use a pipeline approach when there are multiple files.

I can now clearly imagine ways of supporting more than UTF-8 too. I just have to write a translation routine from other formats to the Buffer_Model_Step array. Before, that would have required some kind of polymorphic switch on a field in the buffer, because I didn't have a way to tease out JUST the translation stage from the whole series of nested loops, so I would have had to pass down something polymorphic into the loop that decides how to translate.

I haven't done everything I need for virtual whitespace yet, but I have a clear path for it: adding a filter to the translation procedures that removes the leading whitespace on each line, which I think will be much cleaner than having explicit logic in all three systems for tracking when whitespace is being skipped. The data format also provides a simple way to improve the flexibility of the virtual whitespace system, which should fix problems with comments that have gone unfixed in the current version because of complexity.
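As a rough illustration of that filter idea: the real Buffer_Model_Step fields aren't shown in this post, so this stand-in only carries a codepoint, but the shape of the filter is the same either way.

```c
#include <stdint.h>

typedef uint32_t u32;

// Stand-in for Buffer_Model_Step; the real struct surely carries more
// (byte sizes, widths, ...), but a codepoint is enough for the filter idea.
typedef struct {
    u32 codepoint;
} Buffer_Model_Step;

// Remove the whitespace steps at the start of each line so that no later
// stage ever has to know about virtual whitespace. Returns the new count.
static u32 filter_leading_whitespace(const Buffer_Model_Step *in, u32 count,
                                     Buffer_Model_Step *out){
    u32 out_count = 0;
    int at_line_start = 1;
    for (u32 i = 0; i < count; ++i){
        u32 c = in[i].codepoint;
        if (at_line_start && (c == ' ' || c == '\t')){
            continue; // swallow the leading whitespace
        }
        out[out_count++] = in[i];
        at_line_start = (c == '\n');
    }
    return out_count;
}
```

Because the filter is just one more array-to-array conversion, it slots between the existing stages without any of them knowing it is there.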

I've been very happy with the results so far.

Related topics
Well that's all I have to say about this idea and how I've been trying to put it to use in 4coder. There are a lot of follow up topics that I can clearly see relate to this in one way or another. I thought I'd leave a list so you can consider it for yourself:

1. In CS theory, the "reduction" is the bread and butter tool used in many proofs. A reduction is a way of solving your unsolved problem, by converting it to a *solved problem. A special type of reduction, the "mapping reduction", is a reduction where you just convert the input of your problem into input for a different problem in such a way that the solution to the different problem will also be the solution to your problem. That is what I am using in the toy example with shapes where I had a set of collision problems, and I reduced them all to the problem of collisions between rounded rectangles. I would say understanding reductions is possibly the best practical skill I got from studying the foundations of theoretical CS.

*Actually the "solved" problem is often only hypothetically solved for the purpose of proofs.

2. My point might just be a rehash of Data Oriented Programming. However, I don't think of it that way. I see Data Oriented Programming as an attitude of honesty about what the real problem is, and dealing with the problem in the best way you can. I think of the various abstraction types as techniques to consider when you are looking for the "best way".

3. It is unclear to me that I could have just jumped straight to the data format abstractions I have now, and it seems possible that I would have gotten it dreadfully wrong if I had tried to jump straight there. Like any abstraction I think introducing a new intermediate data format too early is bad. Perhaps a more experienced programmer can see these things coming down the line sooner and therefore get a nice boundary sooner, but I don't have any tips on how you could do that.

4. What does this mean for the "code is data" perspective? As I consider it, I think it actually supports that perspective. If code semantics and data formats are two useful ways to abstract problems, then being able to treat code as data allows you to take advantage of both mindsets in more cases. One problem is that code is usually written in a Turing complete language, so when you treat code as data, you might start dancing around theoretical limits like the halting problem. It feels like there is something interesting to say here about the continuum of information between data and code, and the complexity of data as a function of the minimum power a machine needs to interpret the data, but the concrete implications of this line of thought elude me.
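Coming back to item 1, the mapping reduction from the shapes example can be sketched concretely. Every shape maps to a rounded rect (center, half extents, corner radius): a circle has zero half extents, an AABB has zero radius, and then one rounded-rect test answers all the original collision questions. The names here are mine, not from the post:

```c
typedef float f32;

// The target of the reduction: every shape becomes one of these.
// A circle is (center, 0, 0, r); an AABB is (center, hw, hh, 0).
typedef struct {
    f32 x, y;   // center
    f32 hw, hh; // half extents of the core rectangle
    f32 r;      // corner radius
} Rounded_Rect;

static f32 clamp_f32(f32 v, f32 lo, f32 hi){
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

// Two rounded rects collide iff the distance between their core
// rectangles (found via the Minkowski sum of the cores) is at most
// the sum of the two corner radii.
static int rounded_rects_collide(Rounded_Rect a, Rounded_Rect b){
    f32 dx = b.x - a.x, dy = b.y - a.y;
    f32 ex = a.hw + b.hw, ey = a.hh + b.hh; // combined half extents
    f32 px = dx - clamp_f32(dx, -ex, ex);   // separation vector from the
    f32 py = dy - clamp_f32(dy, -ey, ey);   // summed core rect to b's center
    f32 rr = a.r + b.r;
    return (px*px + py*py <= rr*rr);
}
```

The reduction is the two mapping rules in the comment; once they exist, circle/circle, circle/AABB, and AABB/AABB all collapse into the single solved problem.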

Bye bye
Thanks for checking this out everyone! A 4coder progress update and news post will be coming sometime soon, later!
Interesting, thanks for sharing.

There are code tag problems in the post, around "Wrapping still looks really complicated,..." and "After breaking down wrapping like this the measurement...".
Ahh thanks!
Wow Allen, fantastic post! Thanks for sharing these insights, I feel like I've learned a lot about 4coder and clarified some of my own thoughts about programming.
Thank you for this insightful post about how your journey to tackle this problem went. I think it is encouraging to see that code that once worked, for a task that is growing and maturing, wasn't bad just because it isn't the optimal one. Seeing how iteration and simply tackling the problem by actively coding helps more than starting from a theoretically optimal solution was interesting. Thank you!
This is a very good post, and I think it highlights well the pitfalls of exaggerating the conceptual aspects of types to the detriment of the functional aspects of the code. I think your final example in the section about shape collision, where you have a single struct with all the fields and if/else on the values of them to direct program flow, is elegant, and I think I'm going to make use of something similar.

That said, I think you overstated the benefit there in terms of avoiding a table of cases in the shape_collision function. With the same approach for the struct, you could very well have ended up with something like:

bool shapes_collide(Shape *a, Shape *b) {
    bool result = false;
    if (a->width == 0 && a->height == 0) {
        if (b->width == 0 && b->height == 0) {
            // NOTE(allen): Pull out circle/circle because sometimes
            // we might have pre knowledge about the shapes that allows us to skip the if.
            result = circle_circle_collide(a->pos, a->radius, b->pos, b->radius);
        } else {
            //circle/aabb logic
        }
    } else {
        if (b->width == 0 && b->height == 0) {
            //aabb/circle logic
        } else {
            //aabb/aabb logic
        }
    }
    return(result);
}


So the mere act of arranging the struct without unions and enums doesn't make you immune to overnesting: you have to see beyond and recognize there are benefits in not doing it like this, as you noted with regard to automatically covering future expansion of shapes.

Similarly, the enum and union approach can yield code with an equivalent low degree of nesting:

bool shapes_collide2(Shape *a, Shape *b) {
    bool result = false;
    if (a->type == Shape_Circle && b->type == Shape_Circle) {
        // NOTE(allen): Pull out circle/circle because sometimes
        // we might have pre knowledge about the shapes that allows us to skip the if.
        result = circle_circle_collide(a->pos, a->radius, b->pos, b->radius);
    } else {
        // general rounded AABB collision logic.
    }
    return(result);
}


You can achieve the same with a switch statement, but that doesn't expand as well with more than two types of shapes, and you end up with a table of cases again. But that's just switch being bad and inferior to if/else and everyone who uses switch should feel bad because they are terrible at coding - nothing new there :P .

As a final note, I just want to reemphasize that I don't mean this to detract from your simple struct solution: stripping it of the enum/union approach to dynamic dispatch is a clear improvement in my opinion, because it doesn't pre-emptively over-typify and over-conceptualize the solution to the problem. But I wanted to point out that that is also the "only" (clear) benefit it has in this case, and that the other property of simplifying the nesting is achievable with other approaches as well.

You probably were aware of this already, but your post made me consider this for a while, and this is what I came up with as a result of that thinking.
Thanks!

Yes I think that is sort of what I was getting at when I was saying this is a flexible approach. Once you realize that you can solve your problem without the enum-union-double-dispatch you have a whole host of choices of how you structure the solution. Whereas without that insight, at least to me, it feels like everything was locked into one possible way of working.