Why Distributed Systems Need Better Text Parsing
Beyond Line-Based Parsing
2025-11-09
When components communicate over a network or through message passing, they need to agree on a format.
HTML
Runtime parsing is already a common practice. Browsers parse strings containing HTML into DOM tree data structures. Rather than relying on pre-parsed data structures, browsers accept large, long character strings and parse them as they arrive.
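The idea of parsing a string "as it arrives" can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not how a browser does it: the `TreeBuilder` class, the dict-based tree, and the chunk boundaries are assumptions made for the example, and there is no HTML5 error recovery.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree from HTML that is fed in arbitrary chunks."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path of currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element (no error recovery in this sketch).
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        children = self.stack[-1]["children"]
        # Text may be delivered in pieces, so merge adjacent text nodes.
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)


builder = TreeBuilder()
# Chunks are split mid-word and mid-tag, as they might arrive from a socket.
for chunk in ["<ul><li>on", "e</li><l", "i>two</li></ul>"]:
    builder.feed(chunk)
builder.close()
tree = builder.root
```

The receiver never needs the whole document up front: each `feed` call advances the parse as far as the available characters allow, and the tree grows as more text arrives.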
Ohm, being a structured way to write recursive descent parsers, is quite well aligned with how HTML parsing naturally works.
The real advantage of purpose-built HTML libraries isn’t that they use some fundamentally different approach; it’s that someone has already done the tedious work of encoding the HTML5 spec’s edge cases. Architecturally, you could build something similar with Ohm.
Looking at the broader question of structured text parsing in distributed systems
Truly distributed systems need to rely on simple transport mechanisms and on runtime parsing, like what browsers currently do.
OhmJS is excellent for exploring this because you can:
Prototype different parsing strategies quickly
Experiment with custom formats optimized for your use case
Understand the trade-offs between simplicity and expressiveness
See where line-based parsing breaks down and tree-based parsing becomes necessary
Examples where this comes up:
Log aggregation systems (structured vs. unstructured logs)
Protocol buffers vs. JSON vs. custom formats
Command languages for distributed actors
Configuration file formats
[In my opinion, the third item, “Command languages for distributed actors”, stands out the most.]
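To make the "command language" idea concrete, here is a rough sketch of a receiver parsing a nested command at runtime. The S-expression syntax, the `send`/`resize` command names, and the function names are all hypothetical, chosen only for illustration.

```python
import re

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\s*([()]|[^\s()]+)")


def tokenize(text):
    return TOKEN.findall(text)


def parse(tokens, pos=0):
    """Return (tree, next_pos) for the expression starting at tokens[pos]."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = parse(tokens, pos)  # recurse for nested commands
            items.append(item)
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        return items, pos + 1
    if tok == ")":
        raise ValueError("unexpected ')'")
    return (int(tok) if tok.isdigit() else tok), pos + 1


def parse_command(text):
    tokens = tokenize(text)
    tree, pos = parse(tokens)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return tree


# A message as it might arrive at an actor (hypothetical format).
command = parse_command("(send worker-1 (resize 800 600))")
```

The nesting is the point: a command can carry another command as an argument, which a line-by-line split cannot represent directly.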
The spectrum of parsing complexity
Line-based (grep, awk, etc.): Split on newlines, process each line
Simple delimiters (CSV, TSV): Split on specific characters with maybe some escaping
Nested/structured formats (JSON, XML, HTML, S-expressions): Need real parsing with tree structure
Context-sensitive formats: Where meaning changes based on state
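The first three levels of this spectrum can be sketched with Python's standard library. The sample data is made up for the example, and the sketch does not cover context-sensitive formats.

```python
import csv
import io
import json

# 1. Line-based: one record per line, split on newlines.
log = "start worker-1\nstop worker-1\n"
lines = log.splitlines()

# 2. Simple delimiters: a quoted field may contain the delimiter, so a
#    naive split(",") miscounts fields while the csv module does not.
row_text = 'worker-1,"resize, then redraw",ok'
naive_fields = row_text.split(",")
csv_fields = next(csv.reader(io.StringIO(row_text)))

# 3. Nested formats: one value may span several lines, so no single
#    line is a valid document on its own.
pretty = '{\n  "actor": "worker-1",\n  "args": [800, 600]\n}'


def parses_as_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


per_line = [parses_as_json(line) for line in pretty.splitlines()]
whole = json.loads(pretty)
```

Here `naive_fields` has four entries while `csv_fields` has the intended three, and every entry of `per_line` is `False` even though the text as a whole parses. That is the point at which line-based processing stops being enough.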
Why this matters for distributed code
When components communicate over a network or through message passing, they need to agree on a format. The complexity trade-offs are:
Simple formats (line-based, JSON): Easy to parse, widely supported, but may be verbose or inflexible
Complex formats (custom binary protocols, HTML-like markup): More expressive, but each node needs a proper parser
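One side of this trade-off, verbosity versus the need for a shared parser, can be sketched by encoding the same message twice. The message fields and the fixed binary layout below are assumptions invented for the example, not a real protocol.

```python
import json
import struct

# Hypothetical message: actor id, command code, two integer arguments.
message = {"actor": 7, "command": 2, "args": [800, 600]}

# JSON: self-describing and widely supported, but verbose.
as_json = json.dumps(message).encode("utf-8")

# Assumed custom layout in network byte order:
# u16 actor, u8 command, u16 width, u16 height.
LAYOUT = struct.Struct("!HBHH")
as_binary = LAYOUT.pack(message["actor"], message["command"], *message["args"])


def decode_binary(payload):
    """Every receiving node needs this layout knowledge to read the bytes."""
    actor, command, width, height = LAYOUT.unpack(payload)
    return {"actor": actor, "command": command, "args": [width, height]}


sizes = {"json": len(as_json), "binary": len(as_binary)}
```

The binary form is much smaller, but it is meaningless to any node that does not carry a matching decoder, whereas the JSON form can be read by any JSON parser.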
The PT Pascal compiler, built using S/SL, demonstrates that using complex formats is an entirely reasonable approach to writing large, complicated programs (in this case, a compiler).
Top-down parsing
Any parser that starts from the root/start symbol and works down toward the terminals. This is a broad category.
Ohm is top-down:
It’s based on Parsing Expression Grammars (PEGs)
It starts from the start rule and works down toward matching the input
It uses recursive descent with backtracking
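The three points above can be sketched with a few hand-written parser combinators. This is not Ohm's API or implementation, only a minimal illustration of PEG-style ordered choice with backtracking; the combinator names and the tiny `stop` grammar are assumptions made for the example.

```python
import re


def literal(text):
    def parse(src, pos):
        if src.startswith(text, pos):
            return text, pos + len(text)
        return None  # failure
    return parse


def regex(pattern):
    compiled = re.compile(pattern)

    def parse(src, pos):
        m = compiled.match(src, pos)
        return (m.group(), m.end()) if m else None
    return parse


def sequence(*parsers):
    def parse(src, pos):
        values = []
        for p in parsers:
            result = p(src, pos)
            if result is None:
                return None  # the caller still holds the original position
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def choice(*parsers):
    """PEG ordered choice: the first alternative that matches wins."""
    def parse(src, pos):
        for p in parsers:
            result = p(src, pos)  # each alternative restarts at pos (backtracking)
            if result is not None:
                return result
        return None
    return parse


# Hypothetical grammar:  Command <- "stop" Space "all" / "stop" Space Name
space = regex(r" +")
name = regex(r"[a-z][a-z0-9-]*")
command = choice(
    sequence(literal("stop"), space, literal("all")),
    sequence(literal("stop"), space, name),
)


def parse_command(text):
    result = command(text, 0)  # start from the start rule at position 0
    if result is None or result[1] != len(text):
        raise ValueError(f"cannot parse {text!r}")
    return result[0]
```

For "stop worker-1" the first alternative matches "stop" and the space, fails on "all", and the parser backs up to position 0 to try the second alternative. Ordered choice also commits to the first alternative that succeeds, so in this sketch "stop allocator" is rejected rather than retried with the second alternative.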
YACC, Bison, etc. are bottom-up:
They’re based on LR parsing (LALR specifically for YACC/Bison)
They build the parse tree from the leaves up to the root
They use a shift-reduce approach with a parse table
They read input left-to-right and produce a rightmost derivation in reverse
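For contrast, here is a rough sketch of the shift-reduce idea for the left-recursive grammar `E -> E + n | n`. It is a simplification: YACC and Bison generate an LALR parse table, whereas this hypothetical `shift_reduce` function hard-codes the two handles by inspecting the top of the stack.

```python
def shift_reduce(tokens):
    """Parse tokens such as ["1", "+", "2"]; any token other than "+" counts as n."""
    stack = []       # grammar symbols recognized so far
    trees = []       # parse tree fragment for each stack entry
    reductions = []  # rules applied, in the order they were applied
    for tok in tokens:
        # Shift: push the next input symbol.
        stack.append("+" if tok == "+" else "n")
        trees.append(tok)
        # Reduce: replace a handle on top of the stack with its left-hand side.
        while True:
            if stack[-3:] == ["E", "+", "n"]:
                right = trees.pop()
                trees.pop()  # discard the "+" leaf
                left = trees.pop()
                del stack[-3:]
                stack.append("E")
                trees.append(("+", left, right))
                reductions.append("E -> E + n")
            elif stack == ["n"]:
                stack[-1] = "E"  # the leaf itself serves as the tree
                reductions.append("E -> n")
            else:
                break
    if stack != ["E"]:
        raise ValueError("syntax error")
    return trees[0], reductions


tree, reductions = shift_reduce(["1", "+", "2", "+", "3"])
```

The tree is assembled from the leaves upward, and the left-recursive rule yields a left-nested result without any special handling. Reading `reductions` backwards gives the rightmost derivation: E, then E + 3, then E + 2 + 3, then 1 + 2 + 3.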
The historical context: Bottom-up parsers (YACC, Bison) were favoured historically because:
They can handle left-recursive grammars naturally
They’re efficient (deterministic, no backtracking)
They were easier to generate automatically with the tools available
Top-down parsers (like PEGs/Ohm, recursive descent) have become more popular recently because:
They’re often more intuitive to write and understand
Modern machines handle the backtracking cost better
They integrate more naturally with hand-written code
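PEGs avoid some ambiguity issues that plague CFGs: ordered choice always takes the first alternative that matches, so a given input has at most one parse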
This is a key distinction between traditional parsing tools and modern ones like Ohm!
See Also
Email: ptcomputingsimplicity@gmail.com
Substack: paultarvydas.substack.com
Videos: https://www.youtube.com/@programmingsimplicity2980
Discord: https://discord.gg/65YZUh6Jpq
Leanpub: [WIP] https://leanpub.com/u/paul-tarvydas
Twitter: @paul_tarvydas
Bluesky: @paultarvydas.bsky.social
Mastodon: @paultarvydas
(earlier) Blog: guitarvydas.github.io
References: https://guitarvydas.github.io/2024/01/06/References.html

