REGEX consists of 2 nano-DSLs¹:
1. Inhale text (aka "parsing")
2. Exhale text (aka "rewriting")
… but, REGEX uses flat parsing instead of recursive-descent parsing. Flat parsing is restricted to parsing on a line-by-line basis. Recursive-descent parsing can inhale nested (non-line-oriented) languages like C, Javascript, Haskell, Lisp, etc.
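To make the distinction concrete, here is a small illustrative sketch (the patterns and strings are invented for this example): a regex comfortably rewrites a single flat line, but a single regex cannot reliably match balanced, nested brackets.

```javascript
// Flat, line-oriented inhale/exhale: regex works well.
const line = "name: Alice";
const greeting = line.replace(/^name: (\w+)$/, "Hello, $1!");
console.log(greeting); // Hello, Alice!

// Nested input: a regex cannot count brackets.
// The lazy match stops at the FIRST ')', splitting the balanced group.
const nested = "(a (b c) d)";
const inner = nested.match(/\((.*?)\)/)[1];
console.log(inner); // "a (b c"  <- not the balanced contents
```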
OhmJS[1] consists of 2 parts:
1. A nano-DSL for Inhaling text. This nano-DSL looks like a BNF grammar
2. Punting to Javascript for Exhaling text and for creating other kinds of data.
… and, OhmJS generates recursive-descent parsers. This is important. Since the formalization of PEG[2], it has become easier² to create parsers that can match bracketed pairs, which makes it easier to parse nested textual languages.
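For contrast with the regex limitation above, a hand-written recursive-descent recognizer for balanced parentheses takes only a few lines. This is a minimal sketch of the idea that tools like OhmJS automate, not OhmJS code itself (the function name is invented):

```javascript
// PEG-style rule, written by hand:  group <- "(" group* ")"
// Each rule is a function that consumes input and recurses for nesting.
function parseBalanced(s, i = 0) {
  if (s[i] !== "(") return null; // expect an opening bracket
  i++;
  while (s[i] === "(") {         // zero or more nested groups
    const next = parseBalanced(s, i);
    if (next === null) return null;
    i = next;
  }
  if (s[i] !== ")") return null; // expect the matching close
  return i + 1;                  // index just past the matched group
}

console.log(parseBalanced("(()())")); // 6 (the whole string matched)
console.log(parseBalanced("(()"));    // null (unbalanced input rejected)
```

The recursion is what regex lacks: each nested group is matched by re-entering the same rule.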
Parsers written using parsing combinators, also, consist of 2 parts:
1. Punting to <your favourite general purpose language> for Inhaling text
2. Punting to <your favourite general purpose language> for Exhaling text and for creating other kinds of data.
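A toy illustration of the combinator style (the names `char` and `seq` are invented for this sketch, not taken from any real library): both halves live as ordinary functions in the host general-purpose language.

```javascript
// A parser is a function: input -> [result, rest-of-input] or null.
const char = (c) => (input) =>
  input.startsWith(c) ? [c, input.slice(1)] : null;

// seq combines parsers: run each in order, threading the leftover input.
const seq = (...parsers) => (input) => {
  const results = [];
  for (const p of parsers) {
    const r = p(input);
    if (r === null) return null;
    results.push(r[0]);
    input = r[1];
  }
  return [results, input];
};

const ab = seq(char("a"), char("b"));
console.log(ab("abc")); // [["a", "b"], "c"]
console.log(ab("xyz")); // null
```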
To go beyond REGEX, we need:
1. A nano-DSL to express text inhalation.
2. A nano-DSL to express text rewriting and exhalation³.
3. Recursive descent parsing.
We need to think in terms of creating nano-DSLs in only a few minutes.
Akin to liberally using REGEXs in code, creation of nano-DSLs for specific purposes should not become mega-projects. There’s a tipping point. If a technique is easy to use, it will be used. If the technique creates resistance to use, e.g. by requiring too much detail, then it will be avoided.
If we had a better REGEX that could inhale and exhale nested languages, we could concentrate on writing programs that write programs, i.e. code generation⁴.
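Even plain regex supports simple code generation on flat input; the one-line "::" notation below is invented for this sketch. The thesis is that nested input needs more than this.

```javascript
// A hypothetical one-line nano-DSL:  <name> :: <expression>
// Exhaled as Javascript source text via a single rewrite rule.
const spec = "square :: x * x";
const code = spec.replace(
  /^(\w+) :: (.+)$/,
  "function $1(x) { return $2; }"
);
console.log(code); // function square(x) { return x * x; }
```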
I believe that general purpose languages are a dead-end⁵. Instead, we need to build special purpose languages, e.g. nano-DSLs. We haven't been doing this because of the belief/reality that building DSLs using CFGs (like YACC+LEX) is hard, too detailed, and devolves into mega-projects.
We now have the technology to easily build nano-DSLs using recursive-descent parsing techniques, i.e. tools like OhmJS.
OhmJS handles only one half of the problem, i.e. Inhaling ("parsing"). We need to develop better tools for the other half of the problem - tools that don't force you to use general purpose languages for rewriting the input text and exhaling new text.
Aside: I've been experimenting with these kinds of ideas. I've written, and rewritten, nano-DSLs for Exhaling, implemented using OhmJS+Javascript. My WIP is called t2t[3] (text to text), in case you want to look over my shoulder and/or participate.
See Also
Leanpub [WIP]
Twitter: @paul_tarvydas
Bibliography
[1] OhmJS from https://ohmjs.org
[2] Parsing Expression Grammar from https://en.wikipedia.org/wiki/Parsing_expression_grammar
[3] t2t from https://github.com/guitarvydas/t2t
Footnotes
¹ Elsewhere, I use the term SCN instead of nano-DSL. SCN means “Solution Centric Notation”.
² It’s been possible to manually write recursive-descent parsers for a long time. PEG formalized the concept, and PEG libraries and languages - like Ohm - make it easier to write recursive-descent parsers.
³ Note that I didn’t include the creation of other kinds of data. Transpiling text to text is powerful enough. KISS, no need to add in more complication. Making things complicated is a hallmark of the general purpose language approach. Making things simpler is a hallmark of the special purpose language approach. If you need to create another kind of data (a big if), then simply create yet another nano-DSL for the purpose. Avoid complicating specific nano-DSLs by overloading them with extra features (for example, like stuffing assignment into programming languages which started out life based on functional principles).
⁴ Aside: compilers are just glorified text-to-text transpilers. Compilers inhale some high-level syntax and exhale line-oriented assembler syntax. Some compilers add the extra complication of a second transpilation step - transpiling assembler code to binary data. Early compilers didn’t bother with this extra complication.
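A hedged sketch of the "compiler as transpiler" view, small enough to fit here (the statement syntax and instruction names are invented; real compilers do far more):

```javascript
// Inhale one infix assignment, exhale line-oriented pseudo-assembly.
function compile(stmt) {
  const m = stmt.match(/^(\w+) = (\w+) \+ (\w+)$/);
  if (!m) throw new Error("unsupported statement");
  const [, dst, a, b] = m;
  return [`LOAD ${a}`, `ADD ${b}`, `STORE ${dst}`].join("\n");
}

console.log(compile("sum = x + y"));
// LOAD x
// ADD y
// STORE sum
```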
⁵ General purpose languages contain a union of features. Taken to the extreme, such unions become bland and incapable of doing any single task well. Writing code generators that transpile well-chosen, sub-paradigm-specific syntax to something more general provides a more focussed approach to problem solving. This approach is taught in Physics classes under the term “simplifying assumptions”.