RT Transpiler

Towards Higher Level Syntax for Programming Languages 2024-12-07

Dec 08, 2024

Goal

The goal of this project is to create a VHLL - Very High Level Language - that can be used to describe a full mutual multi-tasking kernel, called 0D, and to have the VHLL transpiler compile the code to Python, Common Lisp and Node.js.

The VHLL is currently named ‘RT’.

Process

An overview of the process is provided in video form, below.

The multi-tasking kernel was first written by hand in about 1,400 lines of Python (and, prior to that, in the Odin programming language).

The PLWB - Programming Language Workbench, written in draw.io[1], using t2t[2] - is used as the backbone of the project.

Firstly, RT was written to mimic the hand-written Python code and to generate runnable Python - creating what is essentially a no-op. The RT code was then used to automatically generate Python equivalent to the hand-written original. The generated Python code was regression-tested against a small drawware project called ‘rtlarson’. Rtlarson is a Larson Scanner written as drawware using draw.io[1], that contains several simple software components - Containers and Leaves. The Container components are coded as draw.io diagrams, whereas the Leaf components are coded as text in RT. The various components can be seen in the video included below.

Generating legal Python code comes with several gotchas, like correctly formatting indentation-based code and inserting line number comments that relate original RT source code to generated Python code. Single RT statements can generate multiple Python statements, so mapping original source code to generated code is not obvious to human readers without the addition of comments.

Secondly, the RT program was tweaked to generate Common Lisp. Common Lisp, as a language, is at a different extreme from Python. Common Lisp is indentation-insensitive and does not have the same semantics as Python.

Unlike most existing programming languages, Common Lisp has a recursive syntax that requires the use of recursive parsing rules instead of simple use of the standard BNF (and OhmJS) parsing operators, such as ‘+’, ‘*’ and ‘?’. Once the grammar was expressed in recursive form, generating Lisp code - with enclosing parentheses - was simplified.

It was necessary to invent the concept of ‘ExternalPhrases’ to accommodate some of the drastic differences between Python syntax and Common Lisp syntax. ExternalPhrases appear in RT as function macros - preceded by an octothorpe ‘#’ character.

Subsequently, a Javascript generator was created. Javascript syntax and semantics fall in between the two extremes of Python and Common Lisp. Creation of a node.js generator took only several hours, spread over two days.

The programming language workbench is a pipeline that uses ‘divide and conquer’ to chop up the task of writing a transpiler into smaller steps, to reduce cognitive load during design and development. The main workhorse of the pipeline is the emit transpiler which uses a common grammar - ‘emit.ohm’ - along with specialized emitters for Python, Common Lisp and Javascript.

The PLWB pipeline is composed of several phases:

Inhale
Syntax Check
Semantics Check
RT transpiler (‘RT to Python’, ‘RT to Common Lisp’, ‘RT to Javascript’)
Exhale.
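
The phases listed above each take text in and hand text on, so the pipeline as a whole can be pictured as a chain of string-to-string functions. The following is only a minimal sketch of that shape in Python. The phase bodies are hypothetical stand-ins (the real phases are driven by t2t with Ohm grammars and .rewrite specifications), and the names ‘run_pipeline’ and ‘PHASES’ are assumptions made for illustration.

```python
from typing import Callable, List

# Each phase is assumed to be a plain text-to-text function.
Phase = Callable[[str], str]


def run_pipeline(source: str, phases: List[Phase]) -> str:
    """Feed the output of each phase into the next one."""
    text = source
    for phase in phases:
        text = phase(text)
    return text


# Hypothetical stand-ins; the real phases do far more than this.
def inhale(text: str) -> str:
    return text.replace("\r\n", "\n")  # normalize line endings only


def syntax_check(text: str) -> str:
    return text  # pass the input through unaltered


def semantics_check(text: str) -> str:
    return text  # pass the input through unaltered


def transpile_identity(text: str) -> str:
    return text  # 'identity mode': emit exactly what came in


def exhale(text: str) -> str:
    return text.rstrip() + "\n"  # end the output with a single newline


PHASES = [inhale, syntax_check, semantics_check, transpile_identity, exhale]
```

Swapping ‘transpile_identity’ for a target-specific function is, in this picture, the only change needed to retarget the pipeline.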

The inhale step converts the incoming RT source code into an internal form, used in the remainder of the pipeline. The internal form is designed for machine-readability. Several Unicode characters are inserted into the source code to make parsing easier in subsequent passes. Namely, identifiers are bracketed by Unicode brackets “❲” and “❳”. Such bracketing makes it easier to use Ohm’s syntactic rules in later passes without the need for further tokenization. Strings and comments are encoded in a way that preserves whitespace in Ohm syntactic rules. Syntactic rules help to make grammars more human-readable, but they skip and delete whitespace, which can cause unexpected results in strings, in comments and in certain combinations of identifiers. It is possible to write complete grammars in purely syntactic form, but this involves workarounds and a cognitive load that interferes with rapid design and MVI (Minimum Viable Implementation) approaches. Interrupting MVI with consideration for workarounds is a form of premature optimization. As is well-known, it is better to “get it working correctly” before spending time on workarounds, optimization and tightening up of code.
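
To make the bracketing idea concrete, here is a rough Python approximation. It is not the actual inhale step, which is specified in ‘internalize.ohm’ and ‘internalize.rewrite’. The regular expression, the function name ‘bracket_identifiers’, and the choice to leave double-quoted strings untouched (rather than encoding them) are all assumptions made for this sketch.

```python
import re

# Match either a double-quoted string (left alone) or an identifier.
# Listing the string alternative first keeps words inside strings
# from being bracketed.
TOKEN = re.compile(r'"[^"\n]*"|[A-Za-z_][A-Za-z0-9_]*')


def bracket_identifiers(source: str) -> str:
    """Wrap every identifier outside of strings in ❲ and ❳."""
    def rewrite(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token.startswith('"'):
            return token  # string literal: pass through unchanged
        return "❲" + token + "❳"

    return TOKEN.sub(rewrite, source)


print(bracket_identifiers('name ≡ mangle_name (template.name)'))
```

Once identifiers carry explicit delimiters, a later grammar can recognize them by the brackets alone, regardless of how whitespace is skipped around them.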

The inhale step simply reads in the source code and uses t2t to convert the input to internal form. T2t requires two specifications - a grammar specification (‘internalize.ohm’) and a rewrite specification (‘internalize.rewrite’). The grammar specification is a DSL which is documented in the OhmJS[3] documentation. The rewrite specification is a DSL which is documented in the Text-to-Text article[4].

The syntax check step simply parses the internalized input using a grammar (‘syntax.ohm’) and spits out the unaltered input, or, error messages in the case of syntax errors.

The semantics check step parses the incoming RT source code and collects up information about the program. This is where type information would be collected and checked. In the current implementation, very little work is done in the semantics pass, the gathering and checking work being “left up to the reader”. At this point, checks are only of a general nature and are not specific to the target output languages. One semantic check is implemented as a demonstration of what can be done. We check to see that the LHS of the “≡” (assign-once) operator is a plain identifier, not a general expression, as allowed by the grammar as written. When such a semantic error is found, an error message is inserted into the code and the pipeline does not propagate messages downstream. The actual code to handle this case can be seen in the rule “Defsynonym” in semantics.ohm. The code in semantics.rewrite inserts an error message into the stream of characters in the rule “Defsynonym_illegal”. The existence of semantic errors is detected by a pattern match (similar to “grep”) for a particular pattern (“r’(>>>.*?<<<)’”).
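
The grep-like check can be sketched in a few lines of Python using the pattern quoted above. The function names and the wording of the sample error message are hypothetical. Only the “>>>...<<<” pattern comes from the project.

```python
import re

# The pattern quoted in the text. '.' does not match a newline here,
# so each error marker is assumed to sit on a single line.
ERROR_MARKER = re.compile(r'(>>>.*?<<<)')


def find_semantic_errors(text: str) -> list:
    """Return every '>>>...<<<' marker found in the text."""
    return ERROR_MARKER.findall(text)


def require_clean(text: str) -> str:
    """Pass the text on unchanged, or stop if any marker is present."""
    errors = find_semantic_errors(text)
    if errors:
        raise ValueError("; ".join(errors))
    return text


# Hypothetical output of the semantics pass with one inserted message.
sample = 'a ≡ 1\n>>>LHS of ≡ must be an identifier<<<\nb ≡ 2\n'
print(find_semantic_errors(sample))
```

Because the error travels inside the text stream itself, no side channel is needed between passes. A plain text search is enough to decide whether to continue.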

The transpilation step occurs after the semantics pass. At this point, the incoming code has been deemed to be correct. The only work done by the transpiler is to map incoming code to target code, without the need to worry about correctness of input source. Dividing-and-conquering the pipeline in this manner makes each step simpler (and quicker) to write.

The commonality between the emitters is abstracted in ‘emit.ohm’ using pattern-matching. The transpilation becomes specialized in custom ‘.rewrite’ specifications for each target language (Python, Common Lisp, Javascript). At first, I just got emit.ohm to work in identity mode - the inhaled text was simply exhaled verbatim. After that, I simply copied the identity rewrite code for each target - Python, Common Lisp, Javascript - and “hacked” on the code until the desired output was created. The hacking consisted mostly of two kinds of operations:

deleting syntactic sugar from the RT code
writing snippets of .rewrite code that would transform pattern-match capture trees into legal code for a given target, e.g. using “concatenate” to generate string concatenation for Common Lisp generation, while using “str” and “+” to generate string concatenation for Python code generation.

As mentioned earlier, the syntax of the .rewrite DSL is explained in other articles[4]. Essentially, the .rewrite code is written as a set of rules, corresponding 1:1 with the grammar rules. Rewriting consists of creating strings using constant characters, string interpolation, and calls out to support code that returns strings.

Fly-Over

A fly-over of the RT transpiler is contained in the following video:

Snippets

Below, I’ve included some selected snippets of code written in RT, and shown how they are translated into Python, Common Lisp and Javascript.

RT

name ≡ mangle_name (template.name)

This construct is one of two different kinds of assignment allowed in RT. This form, utilizing the “≡” operator, is an assign-once assignment. It allows the code emitter to treat the assignment as a macro instead of creating a temporary variable. This feature affects how code is generated for Python vs. how code is generated for Common Lisp. In Python, we simply use the stock form of assignment “=” whereas in Common Lisp we use let instead of setf in this case.

Python

name = mangle_name ( template.name) #line 44

The mapping from RT to Python is fairly straight-forward.

Note that I insert a source-code line number as a comment in the generated code. The line number corresponds to the line in the RT code that caused the given code to be generated. In this case the mapping is 1:1, but that is not always the case. One line of RT might generate multiple lines of code in the target language.

Python provides only for comments that end with a newline. Care has been taken to ensure that generated line numbers do not interfere with other generated Python code. In some cases, line number comments are cascaded at the end of a line of generated code.

Common Lisp

(let ((name (funcall (quote mangle_name) (slot-value template 'name) #|line 44|#)))

Obviously, the generated Common Lisp code is quite different from the generated Python code.

Note that Lisp requires that subsequent statement blocks be wrapped in enclosing parentheses. This is quite different from the usual line-oriented assemblers and popular programming languages. Making this kind of thing work required rewriting of the grammar rules to use recursive rules instead of the usual BNF-ish Kleene operators like ‘+’, ‘*’ and ‘?’.

Note that Common Lisp allows multi-line comments (“#| ... |#”), so line number comments can be directly embedded in generated code. The generated code is meant only for machine-readability. Embedded line number comments are less human-readable, but, that doesn’t really matter. For now, users can refer to the line numbers manually if required. In the future, tools and IDEs might automatically map between RT source and generated code by parsing the line number comments.

JS

let name = mangle_name ( template.name)/* line 44 */;

The generated Javascript code is fairly straight-forward.

Javascript allows for embedded multi-line comments of the form “/* ... */”.

Care was taken to introduce variables using let statements. The semantic rules for introducing variables differ between Python and Javascript. These differences are reflected in how the RT code is written and how the corresponding code generators work.
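
The three line-number comment styles seen in the generated snippets (“#line 44”, “#|line 44|#” and “/* line 44 */”) are regular enough that a tool could recover the RT line numbers from generated code. The following Python sketch shows one way that might look. It is not part of the project, the function name ‘rt_lines’ is made up, and it naively matches comment-like text even inside string literals.

```python
import re

# One alternative per target language; exactly one group matches.
LINE_COMMENT = re.compile(
    r'#\|\s*line\s+(\d+)\s*\|#'    # Common Lisp:  #|line 44|#
    r'|/\*\s*line\s+(\d+)\s*\*/'   # Javascript:   /* line 44 */
    r'|#\s*line\s+(\d+)'           # Python:       #line 44
)


def rt_lines(generated: str) -> list:
    """Return the RT line numbers mentioned in one piece of generated code."""
    numbers = []
    for match in LINE_COMMENT.finditer(generated):
        # Pick whichever of the three groups actually captured digits.
        digits = next(g for g in match.groups() if g is not None)
        numbers.append(int(digits))
    return numbers


print(rt_lines("name = mangle_name ( template.name) #line 44"))
print(rt_lines("let name = mangle_name ( template.name)/* line 44 */;"))
```

Cascaded comments on a single generated line simply yield more than one number in the returned list.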

RT

reg.templates@name ⇐ template

This is the second form of assignment supported by RT. It performs mutation utilizing the “⇐” operator. The variable, or data slot, must exist and be mutable. Currently, such thorough checking is not performed, being left “up to the reader” in future variations of the semantics pass.

The “@” operator is a key/value lookup operator for a dictionary object. The LHS - left hand side - is a variable/slot, the RHS is an expression. For example, if we wanted to find the value associated with the “id” key in a variable x, we would write x@“id”.

Python

reg.templates [name] = template #line 49

Common Lisp

(setf (gethash name (slot-value reg 'templates)) template) #|line 49|#

The Common Lisp code generator uses hash tables to implement dictionaries.

JS

reg.templates [name] = template;/* line 49 */

RT

reg != ϕ

Python

reg!= None

Common Lisp

(not (equal reg nil))

JS

( reg!= null)

RT

instance_name ⇐ #strcons (owner_name, #strcons (“.”, template_name))

This is an example of an External Phrase. It has the syntax of a function call, where the name of the function is preceded by the octothorpe (“#”) character.

The syntax pass of the RT transpiler passes any legally formatted External Phrase. The emit phase checks to see that only supported combinations are allowed. The semantic pass should - but currently doesn’t - check the general veracity of arguments passed to External Phrases.

In general, the full syntax of RT is essentially a macro-processor. It accepts legal phrases and encodes them in ways that are compatible with the target languages. The “#” External Phrases are meant to handle cases that cannot be easily generalized into RT syntax, the fall-through being to express External Phrases in standard function call syntax.

#strcons is meant for concatenating pairs of strings. It is possible to use #strcons in a recursive manner, allowing for concatenation of any number of strings.
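
To illustrate how a nested #strcons might be expanded for each target, here is a small Python sketch. The real emitters are .rewrite rules rather than Python functions, so the tuple representation and the name ‘emit_strcons’ are assumptions. The output shapes follow the generated Python, Common Lisp and Javascript code shown for this example, apart from spacing.

```python
def emit_strcons(expr, target: str) -> str:
    """Expand a (possibly nested) strcons tree into target-language text.

    A leaf is a string holding an already-emitted expression.
    A node is a tuple ("strcons", left, right).
    """
    if isinstance(expr, str):
        return expr
    _, left, right = expr
    a = emit_strcons(left, target)
    b = emit_strcons(right, target)
    if target == "python":
        return f"str({a}) + {b}"
    if target == "lisp":
        return f"(concatenate 'string {a} {b})"
    if target == "js":
        # Build a template string without using a Python f-string,
        # to avoid clashing with the ${...} syntax.
        return "`${" + a + "}${" + b + "}`"
    raise ValueError(f"unsupported target: {target}")


tree = ("strcons", "owner_name", ("strcons", '"."', "template_name"))
for target in ("python", "lisp", "js"):
    print(emit_strcons(tree, target))
```

Because the expansion is recursive, concatenating more strings only requires nesting another pair in the right-hand position.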

Python

instance_name = str( owner_name) + str( ".") + template_name #line 66

In Python, we use ‘str’ and ‘+’ to implement #strcons.

Common Lisp

(setf instance_name (concatenate 'string owner_name (concatenate 'string "." template_name))) #|line 66|# )

In Common Lisp, we use ‘concatenate’ to implement #strcons.

JS

instance_name = `${ owner_name}${ `${ "."}${ template_name}` }` ;/* line 66 */

In Javascript, we use template strings and string interpolation to implement #strcons.

RT

#print_stdout (“*** PALETTE ***”)

Another example of a simple External Phrase.

Python

print ( "*** PALETTE ***") #line 78

Common Lisp

(format *standard-output* "~a" "*** PALETTE ***") #|line 78|#

JS

console.log ( "*** PALETTE ***");/* line 78 */

Indent and Exdent

Unicode characters “⤷”, “⤶” are used to signify indents and exdents in the code generators. This matters most for generating legal Python, since Python is indentation sensitive.

Defobj [ _defobj ident Formals line1? lb line2? init+ rb line3?] = ‛\nclass «ident»:⤷\ndef __init__ (self,«Formals»):«line1»⤷«line2»«init»«line3»⤶⤶\n’

I use indents and exdents in the other code generators (Common Lisp, Javascript), even though the target languages are not indentation sensitive. Code indentation helps during debugging of the code generators by making the generated code more human readable. During the final steps of debugging, it is possible to use more sophisticated indenters, like that of emacs.

The indents are stripped at the last moment during the exhale pass. It took about 50 lines of code (written in Javascript) to properly indent the generated code based on the unicode markers. The fact that the indenter is written in Javascript does not matter to the final result, since the goal is only to rewrite incoming text as different outgoing text. The Javascript indenter is used in the Python generation pipeline and other pipelines.
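
The project's indenter is written in Javascript and is not reproduced here. As a rough idea of what such a pass involves, the following Python sketch consumes the “⤷” and “⤶” markers and emits leading spaces. The function name, the four-space width and the clamping of the depth at zero are assumptions.

```python
INDENT, EXDENT = "⤷", "⤶"


def apply_indentation(text: str, width: int = 4) -> str:
    """Strip indent/exdent markers and indent each line accordingly."""
    out = []
    depth = 0
    at_line_start = True
    for ch in text:
        if ch == INDENT:
            depth += 1
        elif ch == EXDENT:
            depth = max(depth - 1, 0)  # never go below column zero
        elif ch == "\n":
            out.append(ch)
            at_line_start = True
        else:
            if at_line_start:
                # Indentation is emitted lazily, at the first real
                # character of a line, using the depth at that moment.
                out.append(" " * (width * depth))
                at_line_start = False
            out.append(ch)
    return "".join(out)


marked = "class Point:⤷\ndef __init__ (self,x):⤷\nself.x = x⤶⤶\n"
print(apply_indentation(marked))
```

Emitting the spaces lazily means a marker placed anywhere before the next line's first character still takes effect on that line, which suits rewrite rules that append markers at the end of a generated fragment.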

Bibliography

[1] Draw.io from https://app.diagrams.net

[2] t2t from https://github.com/guitarvydas/t2t

[3] OhmJS from https://ohmjs.org

[4] Experiments With Text to Text Transpilation from https://programmingsimplicity.substack.com/p/experiments-with-text-to-text-transpilation?r=1egdky
