Why Distributed Systems Need Better Text Parsing

Beyond Line-Based Parsing 2025-11-09

When components communicate over a network or through message passing, they need to agree on a format.

HTML

Runtime parsing is already common practice. Browsers parse strings of HTML into DOM trees. Rather than relying on pre-parsed data structures, they accept long character strings and parse them as they arrive.
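
To make "parse them as they arrive" concrete, here is a minimal sketch using Python's standard `html.parser` module. It is not how a browser is implemented; the `TreeBuilder` class and `parse_chunks` function are hypothetical names for illustration, and the end-tag handling assumes well-formed nesting, which real HTML does not guarantee.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree from HTML fed in arbitrary chunks."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: only pop when the end tag matches the open element.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append(text)


def parse_chunks(chunks):
    """Feed chunks as they 'arrive'; incomplete tags are buffered by the parser."""
    builder = TreeBuilder()
    for chunk in chunks:
        builder.feed(chunk)
    builder.close()
    return builder.root


# The chunk boundary falls in the middle of a closing tag.
tree = parse_chunks(["<ul><li>a</l", "i><li>b</li></ul>"])
```

The receiver never needs the whole document up front: each chunk advances the tree as far as the data allows, and the rest waits for the next chunk.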

Ohm, being a structured way to write recursive descent parsers, is quite well aligned with how HTML parsing naturally works.

The real advantage of purpose-built libraries isn’t that they use some fundamentally different approach; it’s that someone has already done the tedious work of encoding the HTML5 spec’s edge cases. Architecturally, though, you could build something similar with Ohm.

The broader question: structured text parsing in distributed systems

Truly distributed systems need to rely on simple transport mechanisms and on runtime parsing, as browsers do today.

OhmJS is excellent for exploring this because you can:

Prototype different parsing strategies quickly
Experiment with custom formats optimized for your use case
Understand the trade-offs between simplicity and expressiveness
See where line-based parsing breaks down and tree-based parsing becomes necessary

Examples where this comes up:

Log aggregation systems (structured vs. unstructured logs)
Protocol buffers vs. JSON vs. custom formats
Command languages for distributed actors
Configuration file formats

[In my opinion, the third item stands out the most: “Command languages for distributed actors”.]

The spectrum of parsing complexity

Line-based (grep, awk, etc.): Split on newlines, process each line
Simple delimiters (CSV, TSV): Split on specific characters with maybe some escaping
Nested/structured formats (JSON, XML, HTML, S-expressions): Need real parsing with tree structure
Context-sensitive formats: Where meaning changes based on state
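
The jump from the first two levels to the third can be seen in a small sketch. The message below is a made-up command for a hypothetical actor (`worker-1`, `resize`, and `crop` are illustrative names, not part of any real protocol), and the `tokenize` and `parse` functions are a minimal S-expression reader written for this example rather than Ohm's API.

```python
def tokenize(text):
    """Split S-expression text into '(', ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one expression from the front of the token list (consumed in place)."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        children = []
        while tokens and tokens[0] != ")":
            children.append(parse(tokens))  # recursion mirrors the nesting
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the closing ')'
        return children
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return token


message = """(send worker-1
  (resize 800 600)
  (crop (origin 0 0) (size 400 300)))"""

# Line-based view: three fragments. The first opens a list it never closes,
# so no single line carries the structure of the command.
lines = message.splitlines()

# Tree-based view: one nested structure, wherever the newlines happen to fall.
tree = parse(tokenize(message))
```

Splitting on newlines is enough while every record fits on one line. Once a record can nest, the line boundaries stop meaning anything and the parser has to track depth.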

Why this matters for distributed code

When components communicate over a network or through message passing, they need to agree on a format. The complexity trade-offs are:

Simple formats (line-based, JSON): Easy to parse, widely supported, but may be verbose or inflexible
Complex formats (custom binary protocols, HTML-like markup): More expressive, but each node needs a proper parser

The PT Pascal compiler, built using S/SL (Syntax/Semantic Language), demonstrates that using complex formats is an entirely reasonable approach to writing large, complicated programs (in this case, a compiler).

Top-down parsing

A top-down parser is any parser that starts from the root (start) symbol and works down toward the terminals. This is a broad category.
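
As an illustration of "starts from the start symbol and works down", here is a hand-written recursive descent parser for a small arithmetic grammar. This is a sketch of the general technique, not Ohm's API; the grammar and the names `Parser`, `expr`, `term`, and `factor` are assumptions chosen for the example.

```python
import re

# Grammar assumed for this sketch:
#   Expr   -> Term (("+" | "-") Term)*
#   Term   -> Factor (("*" | "/") Factor)*
#   Factor -> NUMBER | "(" Expr ")"
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(source):
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):  # start symbol: parsing begins here
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):  # bottom of the descent: terminals live here
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(source):
    parser = Parser(tokenize(source))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at token {parser.pos}")
    return tree


tree = parse("1 + 2 * (3 - 4)")
```

Each grammar rule becomes one method, and the call stack follows the parse tree from the root (`expr`) down to the numbers and parentheses that `factor` consumes.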

Ohm is top-down: