Why Do We Use Whitespace To Separate Identifiers in Programming Languages?

2024-09-15

Paul Tarvydas


The concept of using textual programming languages was invented in the early days of computing.

At that time, only two kinds of character sets were in common use:

EBCDIC
ASCII

Unicode did not yet exist.

Both ASCII and EBCDIC had a very limited range of characters available. EBCDIC was based on 8-bit bytes and ASCII was based on 7-bit characters (the 8th bit in a byte was often reserved for a simple kind of error check - the “parity bit”).

ASCII eventually won out as the character set of choice.

ASCII devotes 33 code values - 0 through 31, plus 127 (DEL) - to non-printable control codes such as TAB, VERTICAL TAB, NUL, etc. A blank (space) is coded as 32 (0x20) in ASCII. That leaves only 95 (128-33) printable characters, counting the blank, for representing all characters in ASCII-based programming languages.
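The arithmetic can be checked with a short sketch. It assumes the usual convention that the control codes are 0 through 31 plus 127 (DEL), and that the blank at 32 counts as printable; the variable names are only illustrative.

```python
# Classify the 128 ASCII code values (0..127).
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]

print(len(control))    # 33
print(len(printable))  # 95
print(hex(ord(" ")))   # 0x20
```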

At the time, parsing technology was fairly crude and parsers were often programmed in assembler.

Most unprintable characters were used for controlling output devices and for simplified networking. Such uses became obsolete as hardware was improved and extended.

It turned out that most of the control characters were not used in programming languages. Only four invisible characters remained in use:

TAB (0x09)
LINEFEED (0x0A)
CARRIAGE RETURN (0x0D)
SPACE (0x20)
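These code values can be confirmed directly. In standard ASCII, TAB is 0x09 and LINEFEED is 0x0A (0x08 is BACKSPACE). The dictionary name below is just for illustration.

```python
# The four whitespace characters and their ASCII code values.
WHITESPACE = {
    "TAB": "\t",
    "LINEFEED": "\n",
    "CARRIAGE RETURN": "\r",
    "SPACE": " ",
}

for name, ch in WHITESPACE.items():
    print(f"{name}: 0x{ord(ch):02X}")
```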

Programming languages were designed by using a very few select words from the English language, like “IF”, “THEN”, “WHILE”, etc.

The idea of parsing phrases composed of words separated by blanks was frowned upon because it would entail backtracking during parsing. At the time, hardware was fairly slow and backtracking was deemed too inefficient for practical use, even though backtracking parsing was already known[1].
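To make the cost concrete, here is a minimal sketch of why multi-word phrases invite backtracking. The phrase list and the function name are hypothetical, chosen only for illustration. The matcher tries each alternative in order and, on a mismatch, restarts from the same position with the next alternative, re-reading words it has already looked at.

```python
# Hypothetical multi-word command phrases, longest alternatives first.
PHRASES = [("end", "if"), ("end", "while"), ("end",)]

def match_phrase(words, pos):
    """Try each phrase in order; on failure, back up to pos and try the next."""
    for phrase in PHRASES:
        end = pos + len(phrase)
        if tuple(words[pos:end]) == phrase:
            return " ".join(phrase), end
        # Mismatch: discard any progress and retry from pos (backtrack).
    return None, pos

print(match_phrase("end while x".split(), 0))  # ('end while', 2)
```

Here the word "end" is examined twice before "end while" succeeds. With single-word commands, one look at the current word is enough and nothing is ever re-read.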

The combination of

Using single words from the English language
The influence of the one-command-per-line mentality of assembler instructions
Abhorrence of “inefficiency”, including backtracking parsing
A large number of unprintable characters in the ASCII character set
Fairly primitive parsing technology
An emphasis on human readability over machine readability

led to the acceptance of whitespace - spaces, tabs, newlines (LINEFEED, or CARRIAGE RETURN followed by LINEFEED) - as separators between command words in programming languages.

The idea of not using separators was tried. Early FORTRAN parsed “IFXYZ” as an “IF” control word followed by an identifier “XYZ”. This made it possible for coders to easily make mistakes (“typos”). Error checking was improved by the insistence that coders use separator characters to more fully specify their intentions. Hence, “IFXYZ” was deemed to be an identifier, whereas “IF XYZ” was deemed to be an “IF” control word followed by the identifier “XYZ”.
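The difference between the two readings can be sketched with two toy lexers. This is not how any real FORTRAN compiler worked; the keyword list and function names are assumptions made for the example. The first lexer ignores blanks and peels a keyword off the front of the text, the second insists on whitespace between words.

```python
KEYWORDS = ("WHILE", "THEN", "IF")  # longest first

def lex_ignoring_blanks(text):
    """Blank-insensitive reading: drop blanks, then split off a leading keyword."""
    squeezed = text.replace(" ", "")
    if squeezed in KEYWORDS:
        return [("keyword", squeezed)]
    for kw in KEYWORDS:
        if squeezed.startswith(kw):
            return [("keyword", kw), ("identifier", squeezed[len(kw):])]
    return [("identifier", squeezed)]

def lex_with_separators(text):
    """Separator-based reading: whitespace decides where each word ends."""
    return [("keyword" if w in KEYWORDS else "identifier", w)
            for w in text.split()]

print(lex_ignoring_blanks("IFXYZ"))   # [('keyword', 'IF'), ('identifier', 'XYZ')]
print(lex_with_separators("IFXYZ"))   # [('identifier', 'IFXYZ')]
print(lex_with_separators("IF XYZ"))  # [('keyword', 'IF'), ('identifier', 'XYZ')]
```

Under the blank-insensitive reading, a coder who meant to write an identifier called IFXYZ silently gets an IF statement. The separator-based reading turns that ambiguity into an explicit choice.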

This early attitude further resulted in the idea that double-quote characters (0x22) would serve double-duty as the pre and post markers for character strings. This meant that strings could not contain strings unless escape sequences were used.
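A small sketch shows the nesting problem. When the opening and closing markers are the same character, a scanner cannot tell an inner opening quote from the outer closing quote. With distinct opening and closing markers it can simply count depth. The use of the Unicode quotes “ and ” as the paired markers, and both function names, are assumptions made for the example.

```python
def scan_same_delimiter(text, start=0):
    """Read a string that opens and closes with the same character (0x22)."""
    assert text[start] == '"'
    end = text.index('"', start + 1)  # the first quote found ends the string
    return text[start + 1:end]

def scan_paired_delimiters(text, start=0, open_q="“", close_q="”"):
    """Read a string with distinct open/close markers, tracking nesting depth."""
    assert text[start] == open_q
    depth = 0
    for i in range(start, len(text)):
        if text[i] == open_q:
            depth += 1
        elif text[i] == close_q:
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    raise ValueError("unterminated string")

print(repr(scan_same_delimiter('"say "hi" now"')))  # 'say '
print(scan_paired_delimiters("“say “hi” now”"))     # say “hi” now
```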

Today

We can reconsider the design of programming languages from the perspective of fast hardware, Unicode, the acceptability of backtracking, and the existence of the PEG[2] formalization.

We simply need to bracket identifiers with various kinds of quotes and parentheses.
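As a rough sketch of what bracketing buys, the tokenizer below treats anything between a pair of brackets as a single identifier, so identifiers may contain blanks. The choice of the Unicode characters ❲ and ❳ as brackets, the token names, and the function name are all assumptions made for this example, not a definitive design.

```python
import re

# Either a bracketed identifier (blanks allowed inside) or a bare word.
TOKEN = re.compile(r"❲([^❲❳]*)❳|([^\s❲❳]+)")

def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            tokens.append(("identifier", m.group(1)))
        else:
            tokens.append(("word", m.group(2)))
    return tokens

print(tokenize("if ❲door is open❳ then ❲close door❳"))
# [('word', 'if'), ('identifier', 'door is open'), ('word', 'then'), ('identifier', 'close door')]
```

Whitespace still separates the bare words here, but it no longer has to mark where an identifier ends. The brackets do that job.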

Bibliography

[1] Earley Parser from https://en.wikipedia.org/wiki/Earley_parser

[2] Parsing Expression Grammar from https://en.wikipedia.org/wiki/Parsing_expression_grammar

Josh Marinacci

Jul 22

While we certainly could use whitespace as an operator using modern technology, why would you want to do that? How would it help? I find even Python’s significant indentation to be confusing. While I love the idea of using whitespace for consistent formatting, making it mandatory seems like the path to confusing compiler errors.
