Why Do We Use Whitespace To Separate Identifiers in Programming Languages?
The concept of using textual programming languages was invented in the early days of computing.
At that time, only two kinds of character sets were in common use:
EBCDIC
ASCII
Unicode did not yet exist.
Both ASCII and EBCDIC had a very limited range of characters available. EBCDIC was based on 8-bit bytes, and ASCII was based on 7-bit characters (the 8th bit in a byte was often reserved for a simple kind of error check - the “parity bit”).
ASCII eventually won out as the character set of choice.
ASCII devotes 33 of its 128 character values to non-printable control codes such as TAB, VERTICAL TAB, NUL, etc.: values 0 through 31, plus 127 (DEL). A blank (space) is coded as 32 (0x20) in ASCII. That leaves only 95 characters (codes 32 through 126, counting the blank) for representing all characters in ASCII-based programming languages.
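The arithmetic can be checked with a short Python sketch. It follows the usual convention in which codes 0 through 31 plus 127 (DEL) are control codes and the blank (32) is counted among the printable characters; the variable names are purely illustrative.

```python
# Classify the 128 ASCII code points (7-bit values 0..127).
control = [c for c in range(128) if c < 32 or c == 127]  # control codes, including DEL
printable = [c for c in range(128) if 32 <= c <= 126]    # blank through tilde

print(len(control))    # 33
print(len(printable))  # 95
print(hex(ord(" ")))   # 0x20
```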
At the time, parsing technology was fairly crude and parsers were often programmed in assembler.
Most unprintable characters were used for controlling output devices and for simplified networking. Such uses became obsolete as hardware was improved and extended.
It turned out that most of the unprintable characters were not used in programming languages. Only four characters with no visible glyph remained in programming languages:
TAB (0x09)
LINEFEED (0x0A)
CARRIAGE RETURN (0x0D)
SPACE (0x20)
Programming languages were designed using a few select words from the English language, like “IF”, “THEN”, “WHILE”, etc.
The idea of parsing phrases composed of words separated by blanks was frowned upon because it would entail backtracking during parsing. At the time, hardware was fairly slow, and backtracking was deemed too inefficient for practical use, even though backtracking parsing was already known[1].
The combination of
Using single words from the English language
The influence of the one-command-per-line mentality of assembler instructions
Abhorrence of “inefficiency”, including backtracking parsing
A large number of unprintable characters in the ASCII character set
Fairly primitive parsing technology
An emphasis on human readability over machine readability
led to the acceptance of whitespace - spaces, tabs, newlines (on some systems, CARRIAGE RETURN followed by LINEFEED) - as separators between command words in programming languages.
The idea of not using separators was tried. Early FORTRAN parsed “IFXYZ” as an “IF” control word followed by an identifier “XYZ”. This made it possible for coders to easily make mistakes (“typos”). Error checking was improved by the insistence that coders use separator characters to more fully specify their intentions. Hence, “IFXYZ” was deemed to be an identifier, whereas “IF XYZ” was deemed to be an “IF” control word followed by the identifier “XYZ”.
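The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration, not a reconstruction of how any FORTRAN compiler actually worked; the keyword set and both function names are assumptions made up for the example.

```python
KEYWORDS = {"IF", "THEN", "WHILE"}  # illustrative keyword set

def tokenize_without_separators(text):
    """Blank-insensitive style: a leading keyword is split off even with no blank."""
    compact = text.replace(" ", "")
    for kw in sorted(KEYWORDS, key=len, reverse=True):
        if compact.startswith(kw) and len(compact) > len(kw):
            return [("KEYWORD", kw), ("IDENT", compact[len(kw):])]
    kind = "KEYWORD" if compact in KEYWORDS else "IDENT"
    return [(kind, compact)]

def tokenize_with_separators(text):
    """Whitespace-delimited style: each run of non-blank characters is one token."""
    return [("KEYWORD" if w in KEYWORDS else "IDENT", w) for w in text.split()]

print(tokenize_without_separators("IFXYZ"))  # keyword IF, then identifier XYZ
print(tokenize_with_separators("IFXYZ"))     # one identifier, IFXYZ
print(tokenize_with_separators("IF XYZ"))    # keyword IF, then identifier XYZ
```

In the first function a typo such as a missing blank silently changes the meaning of the text; in the second, the coder's blank is what decides where one word ends and the next begins.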
This early attitude further resulted in the idea that the double-quote character (0x22) would serve double-duty as both the opening and the closing marker for character strings. This meant that strings could not contain strings unless escape sequences were used.
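The nesting problem goes away when the opening and closing markers are different characters. As a minimal sketch, assuming the Unicode curly quotes “ and ” as delimiters (a choice made here only for illustration, as is the function name), a reader can track nesting depth with a simple counter:

```python
def read_nested(text, open_q="“", close_q="”"):
    """Return the contents of the quoted string that starts at text[0], allowing nesting."""
    if not text.startswith(open_q):
        raise ValueError("expected opening quote")
    depth = 0
    for i, ch in enumerate(text):
        if ch == open_q:
            depth += 1
        elif ch == close_q:
            depth -= 1
            if depth == 0:
                return text[1:i]  # drop the outermost pair of quotes
    raise ValueError("unterminated string")

print(read_nested("“say “hi” now” rest"))  # say “hi” now
```

With a single character such as 0x22 doing both jobs, the reader cannot tell an inner opening quote from the outer closing quote, which is why escape sequences became necessary.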
Today
We can reconsider the design of programming languages from the perspective of fast hardware, Unicode, the acceptability of backtracking, and the existence of the PEG[2] formalization.
We simply need to bracket identifiers with various kinds of quotes and parentheses.
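As a rough sketch of what this could look like, the Python below treats anything wrapped in a pair of Unicode brackets as a single identifier, blanks included. The bracket characters ❲ and ❳, the token names, and the `tokenize` function are all assumptions chosen for illustration, not a convention prescribed by the text above.

```python
import re

# Hypothetical convention: identifiers are wrapped in ❲ ❳ and may contain blanks.
# Anything else is split into plain words at whitespace or at a bracket.
TOKEN = re.compile(r"❲([^❲❳]*)❳|([^\s❲❳]+)")

def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            tokens.append(("IDENT", m.group(1)))  # bracketed identifier
        else:
            tokens.append(("WORD", m.group(2)))   # bare word
    return tokens

print(tokenize("IF❲total count❳THEN❲reset❳"))
```

Because the brackets mark where each identifier starts and stops, the blank inside “total count” is part of the name, and no whitespace is needed between the bare words and the identifiers.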
See Also
References: https://guitarvydas.github.io/2024/01/06/References.html
Blog: https://guitarvydas.github.io
Videos: https://www.youtube.com/@programmingsimplicity2980
Discord: https://discord.gg/65YZUh6Jpq
Leanpub: [WIP] https://leanpub.com/u/paul-tarvydas
Gumroad: https://tarvydas.gumroad.com
Twitter: @paul_tarvydas
Bibliography
[1] Earley Parser from https://en.wikipedia.org/wiki/Earley_parser
[2] Parsing Expression Grammar from https://en.wikipedia.org/wiki/Parsing_expression_grammar