Does anyone have an example of a simple custom lexer? I want to add highlighting for Markdown and Org mode files.
I was looking at Ryan Fleury's Metadesk parser, and he uses the C++ lexer.
https://github.com/4coder-archive/4coder_fleury/blob/master/4coder_fleury_lang_list.h :
F4_RegisterLanguage(extensions[i], F4_MD_IndexFile, lex_full_input_cpp_init, lex_full_input_cpp_breaks, F4_MD_PosContext, F4_MD_Highlight, Lex_State_Cpp);
He has some code to parse a Metadesk file in F4_MD_LexFullInput (https://github.com/4coder-archive/4coder_fleury/blob/0441af06d4b1b71b27519254235ecf54e1c7a582/4coder_fleury_lang_metadesk.cpp#L57), but it doesn't seem to work when I try to replace lex_full_input_cpp_breaks with it.
He also uses the custom lexer generator for Jai, but I haven't looked into how to use the lexer generator.
You'll probably get more answers on the 4coder Discord. I don't think F4_MD_LexFullInput works, but I've also never tried to plug in a custom hand-rolled lexer. The lexer generator is also very complicated, but I have more experience toying with it, so I may have an answer if you have any questions about the generator. There are also stream archives on Allen's channel where he first implemented the generator.
Oh, ok I found the 4coder discord server.
I asked Ryan about it and he confirmed that he never got his handrolled lexer to work.
I will fiddle around with the lexer generator.
Thanks!
I've been trying out Metadesk and your question piqued my interest, so I took a look at Ryan's custom lexer. After some debugging, it turns out it's not that hard to hand-roll a custom lexer; it's just that Ryan's implementation has a couple of bugs. A proper lexer makes every byte in a file belong to exactly one token, in order, with no gaps or overlaps. If you have 3 tokens a, b, and c, then a.pos + a.size == b.pos and b.pos + b.size == c.pos. So a simple way to know whether or not your lexer has a bug is to loop through every token and assert this (also assert(last_token.pos + last_token.size == state.string.size)). Ryan's lexer has basically 3 bugs:
Change i < strmax && state->at + i < state->one_past_last; in the loop's condition to i < strmax && state->at < state->one_past_last;
Change { i, 1, TokenBaseKind_LiteralString, 0 } to { i, 3, TokenBaseKind_LiteralString, 0 } and for(i64 j = i+1; to for(i64 j = i+3;.
token.size += 3; and token.size += 1; in all the String/Char Literal sections).
So I just put a simple cap on in the final else.

else
{
    state->at = state->string.str + i;
    emit_counter += 1;
    if(emit_counter >= max)
    {
        goto end;
    }
    // add this
    if (state->at >= state->one_past_last)
    {
        if (list->last)
        {
            Token* last = list->last->tokens + list->last->count - 1;
            last->size = state->string.size - last->pos;
        }
    }
}
I finally got around to writing some basic parsers for org and markdown.
Thank you very much.