2023-02-27

streamlined scheme syntax

over time, scheme has been extended with various inconsistent read-syntax forms that became part of the official standard or a quasi-standard in implementations. these read-syntax forms differ from other syntax in that they are not based on round bracket delimited expressions. they break the regularity of the fundamental bracket syntax and necessitate extra complexity in the parser, which is usually a non-scheme, or at least scheme independent, part of the implementation. read-syntax is otherwise mainly of concern for serialisation, where it avoids additional deserialisation processing.

efforts to introduce new read-syntax typically seem guided by an interest in decreasing the typing effort for specific constructs, while inadvertently increasing the number of different semantic patterns and permutations. other motives seem to be giving in to personal habit, or the notion that scheme would be better if it were more like other popular languages - as if there were not enough alternatives that already share similar non-scheme syntax - or that a language needs to be made especially comfortable for users of other languages, or that the current in-language possibilities are insufficient.

with syntax, details that may seem like nitpicks matter. single characters are important: a single character can break a program. specific combinations of characters carry the semantics, and they are repeated with everything they encode: read, written and held in mind frequently. imagine, for example, the difference it would make to have to prefix every variable with a dollar sign.

the following describes a small number of reductions and renamings in an attempt to simplify things. the result is still completely compatible with current interpreters

listing

removed

  • #F #T
  • square-bracket-sexp [ ]
  • scsh-block-comment #! !#
  • srfi30-block-comment #| |#
  • upper-case-symbol
  • ' , #' #, `
  • multiple-return-values

renamed

  • first car
  • tail cdr
  • pair cons
  • pairs cons*

aliased

  • q quote
  • qq quasiquote
  • l lambda

code-comments

  • #;() the nestable block comments which are already part of the official standard
  • ; line comments

additional read syntax

#! hash-bang

modified form

(let (name value) body ...) for single bindings

other

  • allowing utf8 in identifiers could be problematic because using different alphabets leads to inaccessible knowledge and might support further international separation of people by language. furthermore, apart from cultural conservatism, it does not seem to be a technical progression
  • symbols and identifiers should be lowercase. the set of upper case characters is not necessary
  • hash-comma readers are a positive and supported extension. they are a small addition to the language that allows custom read-syntaxes generically (which means not using bound identifiers), without adding more and more syntactically relevant special prefix characters and structures that have to be tediously learnt and deciphered to understand the code

rationale

syntax

removed

alternative delimiters for s-expressions like square brackets

  • the main problems with these are added syntactic noise, superfluousness and repeated bracket type alteration
  • they are a complication for reading and learning
  • they have the exact same meaning as (), but still require additional processing from the human reader. a newcomer does not know about them and will wonder what these square brackets mean, and might not even find out without documentation, even though they mean nothing new. other languages use square brackets for literal array definitions, which adds unnecessary confusion
  • they make reading, understanding and editing more difficult with regard to successive opening or closing delimiters at the beginning or end of expressions. a mix of square and round brackets removes the practicality of adding to or moving round brackets that appear in succession to quickly change the nesting of expressions (re-purposing brackets to be paired with other brackets for different expressions). having this freedom is an elegant property that follows from the simplicity of the original syntax. without square brackets, only the count of opening and closing brackets matters for ensuring valid nesting of the delimited expressions. square brackets add a completely new condition that has to be accounted for: the order and type of brackets. this is all relatively high cost
  • even the additional set of characters and keys that have to be alternated between, and the attention necessary to do that, are sufficient reasons against them
square brackets may be an attempt to make finding the closing delimiter easier, but they make it harder:
[(lambda (a b) (+ a b))]
((lambda (a b) (+ a b)))

my guess is that the real reason for adding them is rooted in a mistake, because it goes so much against the general simplicity of scheme's design. or it might be about increasing the number of possible permutations to make the view of code more entertaining

quote syntax

  • at least discouraged
  • it sacrifices some of the simplicity of dealing with regular bracket list-syntax, where elements matter instead of characters in front of the opening delimiter, and it can all too easily be replaced with that regular syntax
  • the elusive appearance that clutters the code with particularly small, non-alphanumerically cryptic, hard to discern special symbols outside of lists makes it harder to read. usually identifiers at the beginning of read-syntax-lists tend to describe the list contents, like for example (syntax (a b)). #'(a b) hardly looks as clear and helpful
  • "display" uses (quote) syntax for parsed code, which can be confusing. it shows the ambiguity and complication, and it should
  • short bindings like "q" are as easy to read as any other s-expression, are not much more difficult to write using structural editing, and may be even simpler to manage because of that. no extra complexity has to be built into the structural-editing algorithms
  • the backtick in particular is an indistinct invention. sidenote

alternative block comment syntax

  • the standard-specified syntax with hash, semicolon and round brackets is sufficient. it is elegant because it retains the fundamental syntax and only transforms it in a simple way: it starts with the # prefix that is known from special read-syntax constructs, followed by a semicolon that is already in use for line comments
  • it can be quickly added in front of any bracket expression, nested or not. paredit-mode may not be able to handle it, but smartparens-mode is
  • guile uses #! !# for block comments because it starts like a hash-bang commonly found at the beginning of shell executed scripts, but this still requires the closing part on a separate line for hash-bangs in scripts
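
as a small illustration of the #; prefix described above, here is a minimal sketch. the procedure name "increment" is made up for this example, and it assumes a reader that supports datum comments:

```scheme
; the whole nested expression after #; is skipped by the reader
(define (increment a)
  #;(begin
      (display "debug: ")
      (display a))
  (+ a 1))

(increment 1) ; 2

; it works the same way inside data
(list 1 #;(2 3) 4) ; (1 4)
```

removing the two prefix characters re-enables the expression, no closing marker has to be found and deleted.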

uppercase false and true

not needed

additions

hash-bang: seems necessary for creating shell executable scripts. this format is the standard for shell executable scripts which are important because they allow scheme programs to be used as simple commands on most systems. the syntax is one line starting with #!

semantics

multiple return values

  • it leads the programmer to think of a low-level optimisation in the form of tedious-to-work-with, ambiguity creating syntax
  • it does not enable a very useful new way of expression, because everything could be specified using lists and the basic "pattern matching" that lambda application provides. these are concepts that likely receive a high investment in compiler optimisation because they are ubiquitous. example, passing "multiple return values" to a procedure and binding them to identifiers: (apply (lambda (a b . c) #t) (list 1 2 3 4 5))
  • it doubles the possible syntax and semantics for result value destructuring. you have to learn call-with-values, values, let-values and the new "too few arguments" problems you will be dealing with, and you will repeatedly rewrite one way of passing multiple values into the other, as of course you will still be working with procedures that are well applied with lists
  • continuation-passing-style could be a better alternative for all cases where multiple-return-values are deemed useful. and it still keeps the many-to-one relation between arguments and result, the simplicity of which is not to be underestimated. anecdote: in an automated testing library i wrote, input and expected output arguments are specified alternately. input arguments can be lists to specify multiple input arguments. that means input arguments that are lists always have to be wrapped in a list to designate them as a first argument without ambiguity. an analogous complication would have arisen had i implemented the same interpretation for output arguments
  • the execution time when using multiple return values was 6 times longer when i tested it with guile 2. now what was the reason for using them again? that it can be slower might say something about its implementation complexity
  • the theoretical performance benefit is relevant when value destructuring happens often, which i have seen in mathematics related algorithms. it is questionable though, if the use of multiple-return-values in existing procedures like partition, span or list-diff+intersection can lead to an appropriate performance benefit
  • in a case for mrv, what seems missing syntactically is a feature where the values are spliced into the arguments of the standard application form, like so: (proc a b (mv-producer) e). something like "apply-values" could also be useful: (apply-values (l (a b . c) (+ a b)) mv-producer)
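
to make the comparison above concrete, here is a minimal sketch that puts the list-based style next to the multiple-return-values style. the procedure names "min-max-list" and "min-max-values" are made up for this example, and this "apply-values" is only one possible reading of the idea, assuming the producer is passed as a procedure without arguments:

```scheme
; returning a list and destructuring it with lambda application
(define (min-max-list a)
  (list (apply min a) (apply max a)))

(apply (lambda (low high) (- high low))
       (min-max-list (list 3 1 4 1 5))) ; 4

; the same with multiple return values
(define (min-max-values a)
  (values (apply min a) (apply max a)))

(call-with-values
  (lambda () (min-max-values (list 3 1 4 1 5)))
  (lambda (low high) (- high low))) ; 4

; a possible "apply-values": consumer first, producer second
(define (apply-values consumer producer)
  (call-with-values producer consumer))

(apply-values (lambda (a b . c) (+ a b))
              (lambda () (values 1 2 3 4 5))) ; 3
```

the list version needs nothing beyond apply and lambda, while the values version needs call-with-values and a wrapping procedure around the producer.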

renamings

a few names have been changed for increased clarity. it should be the goal of a language to have a consistent naming scheme with regular plain english names, not abbreviations, that make it easy to understand what they mean, to use new terms only if it is absolutely necessary, and to remain open to improvement

first

  • "car" and "cdr" are absolutely opaque words. even knowing about the etymology, coming from "contents of the address part of register number" and "contents of the decrement part of register number", does not really help to infer the meaning
  • the word increases vocabulary while not really adding a distinct meaning, adding ambiguity
  • "car" is about referencing the first pair element in a list, or the first list element. that is why first is a relevant name
  • the next best word might be "head", but this could be taken to mean multiple elements, while "first" is really just about the first element of a pair or list. as a sidenote, the linux command-line tool "head" selects one or multiple elements
  • "left" and "right" may be even better because they somewhat emphasize two-valuedness, and would avoid the figurative aspect of the following renaming "tail"
  • the word "first" can be considered short. one could use abbreviations for it, but i would not bother, because losing the clarity obtained by using common english words is not worth it
  • there is "last" in srfi-1 for lists, so "first" is the opposite end
  • names are usually vague, but we should strive for keeping the vagueness low without having to invent new words. sometimes concepts are so different, a new word is appropriate. but not in this case. appeals to tradition, or arguments like that it supposedly sounds better over the phone, do not cut it

tail

  • "tail" is already a common name for its result
  • the name "cdr" has the same problems as "car"
  • an alternative could be "rest" for "rest of list", but rest has a broader meaning and might lead to confusion more easily, because of the existence of rest-arguments and because the word could be read as "rest of something" instead of "rest of list-elements"
  • as mentioned above for "first", "right" could be another viable name

pair

  • regardless of historical or technical explanations for "cons cells" and the like, the name comes from "construct", which is too general and does not say what we specifically do when using cons
  • we are creating pairs, 2-tuples with a left/first element and a right/last/second element. the verb for this is "to pair", which coincidentally is also the word for the result
  • sometimes the word "cons" is used to mean "prepend"

pairs

  • like "pair" but chains the pairs with their second element
  • (pairs 1 2 3) is equivalent to (pair 1 (pair 2 3))
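
the renamings above can be sketched as plain definitions on top of the existing bindings. this is only an illustration and not necessarily how an implementation would provide them. "pairs" is written out here instead of being taken from srfi-1, to show the chaining:

```scheme
(define first car)
(define tail cdr)
(define pair cons)

; like pair, but chains the arguments. the last argument becomes the final tail
(define (pairs a . rest)
  (if (null? rest)
      a
      (pair a (apply pairs rest))))

(first (pair 1 2)) ; 1
(tail (pair 1 2)) ; 2
(pairs 1 2 3) ; (1 2 . 3)
(pairs 1 2 (list 3 4)) ; (1 2 3 4)
```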

l q qq

  • these are the shortest renamings. because they are only few, and fundamental syntax forms, used very often (read: in almost every code file), and the result is visible literally in the arguments, it should be acceptable to have short, opaque binding names for them. it is in any case better than special chars

examples

'test (q test)
'(a b c) (q (a b c))
(map (lambda (e) (+ 1 e)) mylist)
(map (l (e) (+ 1 e)) mylist)
  • it is not easy to add an abbreviation for "unquote", for example "uq", without redefining quasiquote, because the macro definition includes the longer keyword
  • l is certainly better than using the greek lambda character as a special symbol. it is a character that is not included on most keyboards, is used in very few world languages if any, is not english, and needs to be in the font
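
the three aliases can be sketched with syntax-rules. this is a minimal illustration that assumes a scheme with standard hygienic macros, it is not a statement about how any particular library defines them:

```scheme
(define-syntax q
  (syntax-rules ()
    ((_ a) (quote a))))

(define-syntax qq
  (syntax-rules ()
    ((_ a) (quasiquote a))))

(define-syntax l
  (syntax-rules ()
    ((_ formals body ...) (lambda formals body ...))))

(q test) ; test
(q (a b c)) ; (a b c)
(map (l (e) (+ 1 e)) (q (1 2 3))) ; (2 3 4)
; unquote still has to be written with its long name
(qq (1 (unquote (+ 1 2)))) ; (1 3)
```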

each

this is optional, the original "for-each" might actually be the better name. each is shorter for bindings like each-integer

modified let

for single bindings, instead of

(let ((testname testvalue)) body ...)

the following can be used

(let (testname testvalue) body ...)

so far i have not found any kind of conflict. it works well, making the code simpler
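
the modified form can be sketched as a macro that tries the standard binding list first and falls back to the single binding. the name "slet" is hypothetical and only used here to avoid shadowing the standard let in the example. the proposal itself is about let:

```scheme
(define-syntax slet
  (syntax-rules ()
    ; standard form: a list of bindings
    ((_ ((name value) ...) body ...)
     (let ((name value) ...) body ...))
    ; short form: a single binding without the extra brackets
    ((_ (name value) body ...)
     (let ((name value)) body ...))
    ; named let
    ((_ tag ((name value) ...) body ...)
     (let tag ((name value) ...) body ...))))

(slet (a 1) (+ a 1)) ; 2
(slet ((a 1) (b 2)) (+ a b)) ; 3
(slet loop ((i 0)) (if (< i 3) (loop (+ i 1)) i)) ; 3
```

the two forms can be told apart because in the short form the first element of the binding is an identifier and not a list, so the first pattern does not match it.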