
Example of Custom Lexer?

Does anyone have an example of a simple custom lexer? I want to add syntax highlighting for Markdown and Org mode files.

I was looking at Ryan Fleury's Metadesk parser, and it uses the C++ lexer.

From https://github.com/4coder-archive/4coder_fleury/blob/master/4coder_fleury_lang_list.h:

    F4_RegisterLanguage(extensions[i],
                        F4_MD_IndexFile,
                        lex_full_input_cpp_init,
                        lex_full_input_cpp_breaks,
                        F4_MD_PosContext,
                        F4_MD_Highlight,
                        Lex_State_Cpp);

He has some code to lex a Metadesk file in F4_MD_LexFullInput: https://github.com/4coder-archive/4coder_fleury/blob/0441af06d4b1b71b27519254235ecf54e1c7a582/4coder_fleury_lang_metadesk.cpp#L57, but it doesn't seem to work when I swap it in for lex_full_input_cpp_breaks.

He also uses the custom lexer generator for Jai, but I haven't looked into how to use it yet.
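
To be concrete, my understanding is that plugging in a handrolled lexer just means registering different init/breaks functions in that same call. A rough sketch of the shape, where markdown_lex_init, markdown_lex_breaks, and Lex_State_Markdown are placeholder names for a lexer that doesn't exist yet (SCu8 is 4coder's String_Const_u8 helper):

    F4_RegisterLanguage(SCu8("md"),           // file extension to bind
                        F4_MD_IndexFile,      // reuse Ryan's indexer
                        markdown_lex_init,    // placeholder: custom lexer's init
                        markdown_lex_breaks,  // placeholder: custom lexer's breaks
                        F4_MD_PosContext,     // reuse Ryan's pos-context
                        F4_MD_Highlight,      // reuse Ryan's highlighter
                        Lex_State_Markdown);  // placeholder: custom lex state type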

You'd probably get more answers on the 4coder Discord. I don't think F4_MD_LexFullInput works, but I've also never tried to plug in a custom handrolled lexer. The lexer generator is also very complicated, but I have more experience toying with it, so I may have the answer if you have any questions about the generator. There are also stream archives on Allen's channel where he first implemented the generator.

Oh, OK, I found the 4coder Discord server.

I asked Ryan about it and he confirmed that he never got his handrolled lexer to work.

I will fiddle around with the lexer generator.

Thanks!

I've been trying out Metadesk, and your question piqued my interest, so I took a look at Ryan's custom lexer. After some debugging, it turns out it's not that hard to handroll a custom lexer; it's just that Ryan's implementation has a couple of bugs. A proper lexer makes every byte in a file belong linearly to a token: if you have 3 tokens a, b, and c, then a.pos + a.size == b.pos and b.pos + b.size == c.pos. So a simple way to know whether or not your lexer has a bug is to loop through every token and assert this (and also assert(last_token.pos + last_token.size == state.string.size)); there's a sketch of that check after the fix below. Ryan's lexer has basically 3 bugs:

  1. Change i < strmax && state->at + i < state->one_past_last; in the loop's condition to i < strmax && state->at < state->one_past_last; (state->at already tracks the lexer's position, so adding i on top of it double-counts the offset and cuts the loop short).
  2. In the "Multi-line String Literal" and "Multi-line Char Literal" sections, change { i, 1, TokenBaseKind_LiteralString, 0 } to { i, 3, TokenBaseKind_LiteralString, 0 } and for(i64 j = i+1; to for(i64 j = i+3; (these literals open with a 3-character delimiter, so the token starts out 3 bytes wide and the scan has to resume 3 bytes in).
  3. If you have a string literal that you never close (e.g. it has an opening quote mark but no closing quote mark), that token's size will end up 1 or 3 bytes bigger than the file (because of all the token.size += 3; and token.size += 1; lines in the String/Char Literal sections). So I just put a simple cap in the final else:
        else
        {
            state->at = state->string.str + i;
            emit_counter += 1;
            if(emit_counter >= max)
            {
                goto end;
            }

            // add this: if we've run off the end of the buffer, clamp the
            // last emitted token so it ends exactly at the end of the file
            if (state->at >= state->one_past_last)
            {
                if (list->last)
                {
                    Token* last = list->last->tokens + list->last->count - 1;
                    last->size = state->string.size - last->pos;
                }
            }
        }
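
And here's the coverage check I mentioned above, as a minimal sketch. It assumes you've flattened the lexed Token_List into a plain Token array; check_token_coverage, tokens, count, and text_size are just illustrative names, not part of 4coder's API:

        // Minimal sketch: assert that the tokens tile the whole buffer with
        // no gaps or overlaps. Assumes a flat Token array; the function and
        // parameter names here are illustrative, not 4coder API.
        static void
        check_token_coverage(Token *tokens, i64 count, i64 text_size)
        {
            for(i64 i = 0; i + 1 < count; i += 1)
            {
                // each token must end exactly where the next one begins
                Assert(tokens[i].pos + tokens[i].size == tokens[i + 1].pos);
            }
            if(count > 0)
            {
                // the last token must end exactly at the end of the buffer
                Token *last = &tokens[count - 1];
                Assert(last->pos + last->size == text_size);
            }
        }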


I finally got around to writing some basic parsers for Org and Markdown.

https://github.com/zhedye/4cc/blob/23832d7ad213598935ed4a8b4b087afb9a40a985/code/custom/4coder_edye/4coder_edye.cpp#L1687

Thank you very much.

[Screenshot: 4coder highlighting (Screenshot 2024-09-14 021152.png)]