Questioning Whether Code Is Data

Aug 26, 2024

Overview

Code is data.

We often hear that.

But a running program is not just data. A running program is an interpretation of the code data, i.e. interpreter + data.
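
To make "interpreter + data" concrete, here is a minimal sketch. The opcode names and the `run` function are my own invention for illustration, not taken from any real system. The list called `program` is inert data; nothing happens until an interpreter walks over it.

```python
# The "code": plain data, a list of (opcode, operand) pairs.
# On its own it does nothing; it is just a list.
program = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]


def run(code):
    """A tiny stack-machine interpreter (hypothetical, for illustration only)."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


# Only interpreter + data together produce behaviour: (2 + 3) * 4
result = run(program)
```

Note that `run` is itself data being interpreted by the Python engine, which in turn is interpreted by a CPU.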

Something new is going on here, and I question whether we’ve put our finger on it.

I don’t have the answers, but I have some questions and observations.

Interpreters Interpreting Interpreters

A running program is an interpreter.

And the interpreter is usually interpreted by another interpreter - a CPU chip.

Some examples...

Prolog code is often compiled into WAM bytecodes.

The WAM bytecodes are interpreted by a Prolog engine.
The Prolog engine is essentially written in assembler.
The assembler opcodes are interpreted by a hardware CPU chip.

A web page is often written in assembly languages, like JavaScript and HTML.

Browsers implement web-rendering engines. The code for the engines is written in a way that ultimately feeds assembler into a CPU.
Web page programmers write in web-page assembly language, like JavaScript and HTML, which the browser-rendering engine inhales and interprets.

Likewise, in function-based programming, code is written using function-based instructions.

Function-based programming was a fiction invented by Lisp 1.5 and espoused later in higher-level languages like C and ALGOL. Function-based programming has evolved into the current fad of Functional Programming.
Function-based instructions are interpreted by a function-based-paradigm-engine.
It may look like the function-based instructions are CPU bytecodes, but that’s not quite accurate. Typically, function-based instructions do not appear as single bytecodes but are rolled out as groups and sequences of lower-level assembler instructions, sometimes wrapped in subroutines. Even though there doesn’t seem to be an explicit engine, it’s there in little clumps of rolled-out opcodes.
The function-based-paradigm-engine is interpreted by a CPU chip.
Function-based instructions implement things like function calls and context switching and memory protection and operating systems. CPUs do not directly implement these things. CPUs implement subroutines using simple techniques like mutation of RAM using a limited callstack, and shared, global, mutable, fast memory cells called registers. We use these lower-level constructs to build up the scaffolding for implementing higher-level constructs like function calls, context switching, memory protection, etc. It takes extra code and hardware to construct this scaffolding. That extra code looks like opcodes but is really a little (?) engine that implements the function-based paradigm, on top of which we can stack concepts like referential transparency.
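
Here is a rough sketch of that scaffolding, assuming a made-up machine with one register and five opcodes (`load`, `addi`, `call`, `ret`, `halt`), none of which correspond to a real CPU. The machine only knows how to mutate a register and push or pop return addresses. The "function" `add_ten` exists purely as a convention layered on top: argument in `r0`, result in `r0`.

```python
def run_machine(code, entry=0):
    """Simulate a toy CPU (hypothetical instruction set, for illustration only)."""
    regs = {"r0": 0}  # shared, global, mutable "register"
    stack = []        # call stack holding return addresses
    pc = entry        # program counter
    while True:
        op, *args = code[pc]
        pc += 1
        if op == "load":      # regs[reg] = immediate value
            regs[args[0]] = args[1]
        elif op == "addi":    # regs[reg] += immediate value
            regs[args[0]] += args[1]
        elif op == "call":    # remember where to come back to, then jump
            stack.append(pc)
            pc = args[0]
        elif op == "ret":     # jump back to the remembered address
            pc = stack.pop()
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown opcode: {op}")


code = [
    ("load", "r0", 5),   # 0: r0 = 5
    ("call", 4),         # 1: "call add_ten"
    ("call", 4),         # 2: "call add_ten" again
    ("halt",),           # 3: stop
    ("addi", "r0", 10),  # 4: add_ten: r0 += 10 (mutates the global register)
    ("ret",),            # 5: return to caller
]

final_regs = run_machine(code)  # r0 ends up as 5 + 10 + 10
```

Nothing in the machine knows what a function is. The calling convention, spread across the `call`, `addi` and `ret` clumps, is the little engine.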

What’s New Here?

Programming isn’t very new. We’ve been building machines for centuries. Programming a machine is the same as designing and implementing a machine.

Reprogramming is something relatively new. We have designed a meta-machine - a CPU - that can be scripted to perform specific tasks. Then, we can change the scripts. We can reprogram the meta-machine. In fact, we’ve done that in the past, e.g. with player pianos. We’ve just found a yet cheaper way to encode scripts - electrical impulses instead of rolls of paper. When you make something 10x easier to do, your approach to problem solving can change.

Programming is the act of creating a script that will ultimately run a machine. The nuance, here, is that instead of fashioning the script out of mechanical gears and pulleys, we are creating the scripts in a way that can be easily modified.

We build one machine - the meta-machine - and we use it for a variety of purposes.

Code Is Data ≡ Text to Text Transpilation

If code is data, then I suggest that the ultimate form of functional programming devolves into simple text-to-text transpilation.

Mathematical notation is based on the idea of referential transparency. Microsoft Word calls that find-and-replace. Hardware designers call it pin-for-pin-compatibility.

We already know how to do that kind of thing and we have a wide body of existing literature on the subject. But we called the results compilers instead of referring to them by what they really are: text-to-text transpilers. Compiler researchers worked out concepts like pattern-matching decades ago.

Originally, it was quite obvious that compilers were simply text-to-text transpilers. Compilers inhaled high-level code and exhaled assembler in text form. Later, compilers hid that transpilation step under the covers and converted the assembler text directly into blobs of bits called object code. If you step back and look at the “rules” of functional programming, you can see that all of the rules - like “no side effects” - add up to making elaborate find-and-replace possible.
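
As a small illustration of referential transparency as find-and-replace, the sketch below expands names into their definitions using nothing but text substitution. The function name `inline_definitions` and the sample definitions are hypothetical, and the sketch assumes the definitions are not recursive (a recursive definition would loop forever).

```python
import re


def inline_definitions(expr, definitions):
    """Replace each defined name in expr with its parenthesized definition,
    repeating until no defined names remain. Assumes no recursive definitions."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, definitions)) + r")\b")
    while pattern.search(expr):
        expr = pattern.sub(lambda m: "(" + definitions[m.group(1)] + ")", expr)
    return expr


definitions = {"area": "width * height", "width": "3", "height": "4 + 1"}

# "area * 2" -> "(width * height) * 2" -> "((3) * (4 + 1)) * 2"
expanded = inline_definitions("area * 2", definitions)
```

The substitution is only safe because none of the definitions has side effects. If `width` meant "read the next number from the keyboard", replacing the name by its definition in two places would change the meaning of the program.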

Write Code That Writes Code

PEG raises the bar on text-to-text transpilation techniques. Now, we can parse recursively-nested lumps of text using PEG instead of parsing flat lumps of text using CFG techniques.

It is commonly believed that macros are the domain of Lisp-like languages that use lists composed of CONS cells instead of text composed of characters. It is also commonly believed that implementing macros for other languages leads to mega-projects.

If you separate the concept of text-to-text transpilation (t2t) away from programming languages and implement stand-alone t2t transpilers, the ideas become easier to reason about and to implement. After all, a programming language is just an IDE for programming. We might consider using more modern versions of IDEs that bolt t2t into the workflow instead of piling every possible operation into an old-fashioned, 1950s notion of what an IDE should be. A more modern IDE workflow would be one in which programmers develop programs using multiple paradigms, each with its own syntax (not necessarily a textual syntax), then combine all of the parts together into a single program.

My current impression is that OhmJS is the best manifestation of PEG that I’ve encountered.

PEG used to be called recursive descent parsing, but PEG formalizes the concept and gives a better warm-and-fuzzy feeling to those who care. PEG is a DSL for describing parsers. CFG-based techniques are DSLs for describing languages. PEG describes interpretation of code. CFG does something haughtier and is harder to use.
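
To show what "a DSL for describing parsers" can look like, here is a minimal sketch of PEG-style building blocks written as plain Python functions. This is not OhmJS and not a complete PEG implementation; the names `lit`, `seq`, `choice` and `many` are my own. Each parser takes a text and a position and returns the new position on success, or `None` on failure.

```python
def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*parsers):
    """Match every parser in order; fail if any one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    """PEG ordered choice: try alternatives left to right, first success wins."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


def many(p):
    """Match p zero or more times, greedily."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


# Grammar, written PEG-style:
#   group <- "(" item* ")"
#   item  <- group / "x"
def item(text, pos):
    return choice(group, lit("x"))(text, pos)


def group(text, pos):
    return seq(lit("("), many(item), lit(")"))(text, pos)


def matches(text):
    """True if the whole text is one well-formed group."""
    return group(text, 0) == len(text)
```

For example, `matches("(x(xx))")` succeeds and `matches("(x(x)")` fails. The grammar reads as a description of how to parse, with the recursion in `group` handling the nesting.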

PEG parsing is yet another engine built on top of lowly CPU hardware.

Text rewriting based on parsing results is yet another engine built on top of lowly CPU hardware.

Text-rewriting should be part of development workflow. In fact, it already is - many programming text editors implement batch find-and-replace operations. PEG allows us to become more serious about the matching and rewriting aspects.
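
A sketch of matching plus rewriting, assuming a made-up prefix notation with two operator names (`add`, `mul`) that I invented for this example. The code parses nested text by recursive descent and emits new text in infix form as it goes. It does no error handling; malformed input simply raises an exception.

```python
import re

# Tokens are either a single parenthesis or a run of non-space, non-paren characters.
TOKEN = re.compile(r"\s*([()]|[^\s()]+)")

# Hypothetical source-language operator names mapped to target-language symbols.
OPERATORS = {"add": "+", "mul": "*"}


def tokenize(text):
    return TOKEN.findall(text)


def rewrite(tokens, pos=0):
    """Parse one expression starting at tokens[pos].
    Return (rewritten_text, position_after_expression)."""
    tok = tokens[pos]
    if tok != "(":
        return tok, pos + 1            # an atom is emitted unchanged
    op = OPERATORS[tokens[pos + 1]]    # operator name follows the open paren
    pos += 2
    parts = []
    while tokens[pos] != ")":          # recurse into each nested operand
        text, pos = rewrite(tokens, pos)
        parts.append(text)
    return "(" + f" {op} ".join(parts) + ")", pos + 1


def transpile(source):
    text, _ = rewrite(tokenize(source))
    return text


infix = transpile("(add 1 (mul 2 3))")  # "(1 + (2 * 3))"
```

The input is text and the output is text. The output here happens to be valid Python, so it can be handed to the next engine in the chain.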

Conclusion

I argue that a running program is not just data. The phrase “code is data” produces the wrong impression of what is going on.

We have two things:

A reprogrammable meta-machine.
A way to create scripts for the meta-machine, and a way to reprogram those scripts.

Just because the meta-machine inhales scripts in textual form does not mean that we have to use programming tools that are strictly based on text.

We can treat the “code” part of a program as “data”. We need to attack this notion more heavily. I suggest that OhmJS and PEG and diagram editors give us new ways to think about this aspect of creating running programs. We can build new syntaxes and workflows on top of already-existing programming languages.

References

OhmJS: https://ohmjs.org/

t2t: [WIP] https://github.com/guitarvydas/t2t and https://github.com/guitarvydas/build-t2t

0D: [WIP] https://github.com/guitarvydas/0D

draw.io: https://app.diagrams.net/

PEG: https://bford.info/pub/lang/peg/

WAM: https://github.com/a-yiorgos/wambook

Kinopio2md: [WIP] https://github.com/guitarvydas/kinopio2md

Prolog For Programmers: