parsing
we’ll also define a few helper functions to make reading text a little easier, without having to perform any bounds checks whenever we read tokens.
export const eof = "end of file"; lexer.current = (state) => { return state.position < state.input.length ? state.input.charAt(state.position) : eof; }; lexer.advance = (state) => ++state.position;
our lexer will run in a loop, producing tokens until it hits the end of input or an error.
export function lex(input) { let tokens = []; let state = lexer.init(input); while (true) { let start = state.position; let kind = lexer.nextToken(state); let end = state.position; tokens.push({ kind, start, end }); if (kind == eof || kind == "error") break; } return tokens; }
remember that error handling is important! we mustn’t forget that the user can produce invalid input - such as this string:
{example}
haku does not have curly braces in its syntax, so that’s clearly an error! reporting this to the user will be a much better experience than, perhaps… getting stuck in an infinite loop.
now for the most important part - that
lexer.nextToken
we used will be responsible for reading back a token from the input, and returning what kind of token it has read.for now, let’s make it detect parentheses. we of course also need to handle end of input - whenever our lexer runs out of characters to consume, as well as when it encounters any characters we don’t expect.
lexer.nextToken = (state) => { let c = lexer.current(state); if (c == "(" || c == ")") { lexer.advance(state); return c; } if (c == eof) return eof; lexer.advance(state); return "error"; };
with all that frameworking in place, let’s test if our lexer works!
export function printTokens(input) { let tokens = lex(input); for (let { kind, start, end } of tokens) { if (kind == "error") { let errorString = input.substring(start, end); console.log(`unexpected characters at ${start}..${end}: '${errorString}'`); } else { console.log(`${kind} @ ${start}..${end}`); } } } printTokens(`()((()))`);
( @ 0..1 ) @ 1..2 ( @ 2..3 ( @ 3..4 ( @ 4..5 ) @ 5..6 ) @ 6..7 ) @ 7..8 end of file @ 8..8
…seems pretty perfect!
so let’s write another function that will lex those.
lexer.skipWhitespaceAndComments = (state) => { while (true) { let c = lexer.current(state); if (c == " " || c == "\t" || c == "\n" || c == "\r") { lexer.advance(state); continue; } if (c == ";") { while ( lexer.current(state) != "\n" && lexer.current(state) != eof ) { lexer.advance(state); } lexer.advance(state); // skip over newline, too continue; } break; } };
except instead of looking at whitespace and comments in the main token reading function, we’ll do that outside of it, to avoid getting whitespace caught up in the actual tokens’
start
..end
spans.export function lex(input) { let tokens = []; let state = lexer.init(input); while (true) { lexer.skipWhitespaceAndComments(state); // <-- let start = state.position; let kind = lexer.nextToken(state); let end = state.position; tokens.push({ kind, start, end }); if (kind == eof || kind == "error") break; } return tokens; }
we’ll introduce a function that will tell us if a given character is a valid character in an identifier.
since S-expressions are so minimal, it is typical to allow all sorts of characters in identifiers - in our case, we’ll allow alphanumerics, as well as a bunch of symbols that seem useful. and funky!
export const isIdentifier = (c) => /^[a-zA-Z0-9+~!@$%^&*=<>+?/.,:\\|-]$/.test(c);
now we can add identifiers to
nextToken
:lexer.nextToken = (state) => { let c = lexer.current(state); if (isIdentifier(c)) { lexer.advanceWhile(state, isIdentifier); return "identifier"; } if (c == "(" || c == ")") { lexer.advance(state); return c; } if (c == eof) return eof; lexer.advance(state); return "error"; };
defining integers is going to be a similar errand to identifiers, so I’ll spare you the details and just dump all the code at you:
export const isDigit = (c) => c >= "0" && c <= "9"; lexer.nextToken = (state) => { let c = lexer.current(state); if (isDigit(c)) { lexer.advanceWhile(state, isDigit); return "integer"; } if (isIdentifier(c)) { lexer.advanceWhile(state, isIdentifier); return "identifier"; } if (c == "(" || c == ")") { lexer.advance(state); return c; } if (c == eof) return eof; lexer.advance(state); return "error"; };
an amen break
:bulb: for the curious: here’s why I implement lexers like this!
many tutorials will have you implementing lexers such that data is parsed into the language’s data types. for instance, integer tokens would be parsed into JavaScript
number
s.I don’t like this approach for a couple reasons.
pre-parsing data like this pollutes your lexer code with wrangling tokens into useful data types. I prefer it if the lexer is only responsible for reading back strings.
implemented my way, it can concern itself only with chewing through the source string; no need to extract substrings out of the input or anything.
there’s also a performance boost from implementing it this way: lazy parsing, as I like to call it, allows us to defer most of the parsing work until it’s actually needed. if the token never ends up being needed (e.g. due to a syntax error,) we don’t end up doing extra work eagerly!
if that doesn’t convince you, consider that now all your tokens are the exact same data structure, and you can pack them neatly into a flat array.
if you’re using a programming language with flat arrays, that is. such as Rust or C.
I’m implementing this in JavaScript of course, but it’s still neat not having to deal with mass
if
osis when extracting data from tokens - you’re always guaranteed a token will have akind
,start
, andend
.
there are many parsing strategies we could go with, but in my experience you can’t go simpler than good ol’ recursive descent.
knowing that similarity, we’ll start off with a similar set of helper functions to our lexer.
parser.init = (tokens) => { return { tokens, position: 0, }; }; parser.current = (state) => state.tokens[state.position]; parser.advance = (state) => { if (state.position < state.tokens.length - 1) { ++state.position; } };
note however that instead of letting
current
read out of bounds, we instead clampadvance
to the very last token - which is guaranteed to beend of file
.this yields the following EBNF grammar:
Expr = "integer" | "identifier" | List; List = "(" , { Expr } , ")";
we’ll start by implementing the
Expr = "integer" | "identifier"
rule. parsing integers and identifiers is as simple as reading their single token, and returning a node for it:parser.parseExpr = (state) => { let token = parser.current(state); switch (token.kind) { case "integer": case "identifier": parser.advance(state); return { ...token }; default: parser.advance(state); return { kind: "error", error: "unexpected token", start: token.start, end: token.end, }; } };
we’ll wrap initialization and
parseExpr
in another function, which will accept a list of tokens and return a syntax tree, hiding the complexity of managing the parser state underneath.parser.parseRoot = (state) => parser.parseExpr(state); export function parse(input) { let state = parser.init(input); let expr = parser.parseRoot(state); if (parser.current(state).kind != eof) { let strayToken = parser.current(state); return { kind: "error", error: `found stray '${strayToken.kind}' token after expression`, start: strayToken.start, end: strayToken.end, }; } return expr; }
this function also checks that there aren’t any tokens after we’re done parsing the root
Expr
production. it wouldn’t be very nice UX if we let the user input tokens that didn’t do anything!now it’s time to parse some lists. for that, we’ll introduce another function, which will be called by
parseExpr
with an existing(
token.its task will be to read as many expressions as it can, until it hits a closing parenthesis
)
, and then construct a node out of that.parser.parseList = (state, leftParen) => { parser.advance(state); let children = []; while (parser.current(state).kind != ")") { if (parser.current(state).kind == eof) { return { kind: "error", error: "missing closing parenthesis ')'", start: leftParen.start, end: leftParen.end, }; } children.push(parser.parseExpr(state)); } let rightParen = parser.current(state); parser.advance(state); return { kind: "list", children, start: leftParen.start, end: rightParen.end, }; };
and the last thing left to do is to hook it up to our
parseExpr
, in response to a(
token:parser.parseExpr = (state) => { let token = parser.current(state); switch (token.kind) { case "integer": case "identifier": parser.advance(state); return { ...token }; case "(": return parser.parseList(state, token); // <-- default: parser.advance(state); return { kind: "error", error: "unexpected token", start: token.start, end: token.end, }; } };
now let’s try parsing an S-expression!
printTree("(hello! ^^ (nested nest))");
{ "kind": "list", "children": [ { "kind": "identifier", "start": 1, "end": 7 }, { "kind": "identifier", "start": 8, "end": 10 }, { "kind": "list", "children": [ { "kind": "identifier", "start": 12, "end": 18 }, { "kind": "identifier", "start": 19, "end": 23 } ], "start": 11, "end": 24 } ], "start": 0, "end": 25 }
I don’t know about you, but I personally find the JSON output quite distracting and long. I can’t imagine how long it’ll be on even larger expressions!
to counteract that, let’s write an S-expression pretty printer:
export function exprToString(expr, input) { let inputSubstring = input.substring(expr.start, expr.end); switch (expr.kind) { case "integer": case "identifier": return inputSubstring; case "list": return `(${expr.children.map((expr) => exprToString(expr, input)).join(" ")})`; case "error": return `<error ${expr.start}..${expr.end} '${inputSubstring}': ${expr.error}>`; } }
let’s try something more complicated, with comments and such.
export function printTree(input) { let tokens = lex(input); let tree = parse(tokens); console.log(exprToString(tree, input)); } printTree(` (def add-two ; Add two to a number. (fn (n) (+ n 2))) `);
(def add-two (fn (n) (+ n 2)))
looks like it works!
amen break, part two
here’s a fun piece of trivia: I’m wrote a Nim S-expression parser for Rosetta Code way back in July 2019.
it’s definitely not how I would write a parser nowadays. it’s pretty similar, but the syntax tree structures are quite different - it doesn’t use the lazy parsing trick I talked about before.
interpretation
we’ll again start off by defining a function that initializes our interpreter’s state.
right now there isn’t really anything to initialize, but recall that we don’t have our tokens parsed into any meaningful data yet, so we’ll have to have access the source string to do that.
treewalk.init = (input) => { return { input }; };
in the meantime, let’s prepare a couple convenient little wrappers to run our code:
export function run(input, node) { let state = treewalk.init(input); return treewalk.eval(state, node); } export function printEvalResult(input) { try { let tokens = lex(input); let ast = parse(tokens); let result = run(input, ast); console.log(result); } catch (error) { console.log(error.toString()); } }
so let’s patch those integers in!
this is where we’ll need that source string of ours - we don’t actually have a JavaScript
number
representation of the integers, so we’ll need to parse them into place.treewalk.eval = (state, node) => { switch (node.kind) { case "integer": let sourceString = state.input.substring(node.start, node.end); return parseInt(sourceString); default: throw new Error(`unhandled node kind: ${node.kind}`); } };
traditionally, in Lisp-like languages, a list expression always represents a function application, with the head of the list being the function to call, and the tail of the function being the arguments to apply to the function.
let’s implement that logic then!
export const builtins = {}; treewalk.eval = (state, node) => { switch (node.kind) { case "integer": let sourceString = state.input.substring(node.start, node.end); return parseInt(sourceString); case "list": // <-- let functionToCall = node.children[0]; let builtin = builtins[state.input.substring(functionToCall.start, functionToCall.end)]; return builtin(state, node); default: throw new Error(`unhandled node kind: ${node.kind}`); } };
we could try this out now, except we don’t actually have any builtins! so I’ll add a few in, so that we can finally perform our glorious arithmetic:
function arithmeticBuiltin(op) { return (state, node) => { if (node.children.length < 3) throw new Error("arithmetic operations require at least two arguments"); let result = treewalk.eval(state, node.children[1]); for (let i = 2; i < node.children.length; ++i) { result = op(result, treewalk.eval(state, node.children[i])); } return result; }; } builtins["+"] = arithmeticBuiltin((a, b) => a + b); builtins["-"] = arithmeticBuiltin((a, b) => a - b); builtins["*"] = arithmeticBuiltin((a, b) => a * b); builtins["/"] = arithmeticBuiltin((a, b) => a / b);
a brief intermission
all we have to do is swap out the evaluation kernel…
import { getKernel, defaultEvalModule } from "treehouse/components/literate-programming/eval.js"; export const kernel = getKernel(); kernel.evalModule = async (state, source, language, params) => { if (language == "haku") { printEvalResult(source); return true; } else { return await defaultEvalModule(state, source, language, params); } };
anyways, it’s time to turn haku into a real programming language!
programming languages as we use them in the real world are Turing-complete - roughly speaking, a language is Turing-complete if it can simulate a Turing machine.
the same thing happens in haku (though we don’t have a runtime for this yet, so you’ll have to take my word for it.)
((fn (x) ((fn (x) x) 2)) 1)
this is perfectly fine, and the result should be 2 - not 1!
try evaluating this in your head, and you’ll see what I mean. it’s better than me telling you all about it.
so to represent scope, we’ll introduce a new variable to our interpreter’s state.
treewalk.init = (input) => { return { input, scopes: [new Map(Object.entries(builtins))] }; };
scopes
will be a stack ofMap
s, each representing a single scope.our builtins will now live at the bottom of all scopes, as an ever-present scope living in the background.
variable lookup will be performed by walking the scope stack from top to bottom, until we find a variable that matches.
treewalk.lookupVariable = (state, name) => { for (let i = state.scopes.length; i-- > 0; ) { let scope = state.scopes[i]; if (scope.has(name)) { return scope.get(name); } } throw new Error(`variable ${name} is undefined`); };
we’re stricter than JavaScript here and will error out on any variables that are not defined.
now we can go ahead and add variable lookups to our
eval
function! we’ll also go ahead and replace our bodged-in builtin support with proper evaluation of the first list element.in most cases, such as
(+ 1 2)
, this will result in a variable lookup.treewalk.eval = (state, node) => { switch (node.kind) { case "integer": let sourceString = state.input.substring(node.start, node.end); return parseInt(sourceString); case "identifier": return treewalk.lookupVariable(state, state.input.substring(node.start, node.end)); case "list": let functionToCall = treewalk.eval(state, node.children[0]); return functionToCall(state, node); default: throw new Error(`unhandled node kind: ${node.kind}`); } };
time to build our
fn
builtin.we’ll split the work into two functions: the actual builtin, which will parse the node’s structure into some useful variables…
builtins.fn = (state, node) => { if (node.children.length != 3) throw new Error("an `fn` must have an argument list and a result expression"); let params = node.children[1]; if (node.children[1].kind != "list") throw new Error("expected parameter list as second argument to `fn`"); let paramNames = []; for (let param of params.children) { if (param.kind != "identifier") { throw new Error("`fn` parameters must be identifiers"); } paramNames.push(state.input.substring(param.start, param.end)); } let expr = node.children[2]; return makeFunction(state, paramNames, expr); };
and
makeFunction
, which will take that data, and assemble it into a function that follows our(state, node) => result
calling convention.export function makeFunction(state, paramNames, bodyExpr) { return (state, node) => { if (node.children.length != paramNames.length + 1) throw new Error( `incorrect number of arguments: expected ${paramNames.length}, but got ${node.children.length - 1}`, ); let scope = new Map(); for (let i = 0; i < paramNames.length; ++i) { scope.set(paramNames[i], treewalk.eval(state, node.children[i + 1])); } state.scopes.push(scope); let result = treewalk.eval(state, bodyExpr); state.scopes.pop(); return result; }; }
so to add support for that, we’ll clone the entire scope stack into the closure, and then restore it when necessary.
export function makeFunction(state, paramNames, bodyExpr) { let capturedScopes = []; // Start from 1 to skip builtins, which are always present anyways. for (let i = 1; i < state.scopes.length; ++i) { // We don't really mutate the scopes after pushing them onto the stack, so keeping // references to them is okay. capturedScopes.push(state.scopes[i]); } return (state, node) => { if (node.children.length != paramNames.length + 1) throw new Error( `incorrect number of arguments: expected ${paramNames.length}, but got ${node.children.length - 1}`, ); let scope = new Map(); for (let i = 0; i < paramNames.length; ++i) { scope.set(paramNames[i], treewalk.eval(state, node.children[i + 1])); } state.scopes.push(...capturedScopes); // <-- state.scopes.push(scope); let result = treewalk.eval(state, bodyExpr); state.scopes.pop(); return result; }; }
with that, our program now works correctly:
((fn (f) ((f 1) 2)) (fn (x) (fn (y) (+ x y))))
3
being able to define arbitrary functions gives us some pretty neat powers! to test this out, let’s write a little program that will calculate Fibonacci numbers.
the most basic is the recursive way, which is really quite simple to do:
function fib(n) { if (n < 2) { return n; } else { return fib(n - 1) + fib(n - 2); } } console.log(fib(10));
55
the downside is that it’s really inefficient! we end up wasting a lot of time doing repeat calculations. try going through it yourself and see just how many calculations are repeated!
in either, you will notice we need to support comparisons to know when to stop iterating! so let’s add those into our builtins:
function comparisonBuiltin(op) { return (state, node) => { if (node.children.length != 3) throw new Error("comparison operators require exactly two arguments"); let a = treewalk.eval(state, node.children[1]); let b = treewalk.eval(state, node.children[2]); return op(a, b) ? 1 : 0; }; } builtins["="] = comparisonBuiltin((a, b) => a === b); builtins["<"] = comparisonBuiltin((a, b) => a < b);
it’s easy enough to
!=
,<=
,>
, and>=
from these, so we won’t bother adding those in for now.of course, we’ll also need an
if
to be able to branch on the result of our comparison operators.builtins["if"] = (state, node) => { if (node.children.length != 4) throw new Error("an `if` must have a condition, true expression, and false expression"); let condition = treewalk.eval(state, node.children[1]); if (condition !== 0) { return treewalk.eval(state, node.children[2]); } else { return treewalk.eval(state, node.children[3]); } };
now we can write ourselves a recursive Fibonacci!
((fn (fib) (fib fib 10)) ; fib (fn (fib n) (if (< n 2) n (+ (fib fib (- n 1)) (fib fib (- n 2))))))
note that in order to achieve recursion, we need to pass
fib
into itself - this is because thefib
variable we’re binding into the first function is not visible in the second function.but if we run it now:
55
we can see it works just as fine as the JavaScript version!
rememeber to remember
once again, let me sketch out what I’d like it to look like. to declare a persistent value, you use
def
:(def fib (fn (n) (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2))))))
if this looks familar, that’s because it probably is - I used the exact same example at the start of the post!
of course now we will also need to teach our whole runtime about the environment, right down to the kernel…
import { defaultEvalModule } from "treehouse/components/literate-programming/eval.js"; export function run(env, input, node) { let state = treewalk.init(env, input); return treewalk.eval(state, node); } export function printEvalResult(env, input) { try { let tokens = lex(input); let ast = parse(tokens); let result = run(env, input, ast); // NOTE: `def` will not return any value, so we'll skip printing it out. if (result !== undefined) { console.log(result); } } catch (error) { console.log(error.toString()); } } kernel.evalModule = async (state, source, language, params) => { if (language == "haku") { state.haku ??= { env: new Map() }; printEvalResult(state.haku.env, source); return true; } else { return await defaultEvalModule(state, source, language, params); } };
now for
def
- it’ll take the value on the right and insert it intoenv
, so that it can be seen in the future.builtins.def = (state, node) => { if (node.children.length != 3) throw new Error( "a `def` expects the name of the variable to assign, and the value to assign to the variable", ); if (node.children[1].kind != "identifier") throw new Error("variable name must be an identifier"); let name = node.children[1]; let value = treewalk.eval(state, node.children[2]); state.env.set(state.input.substring(name.start, name.end), value); };
this is a pretty simple augmentation to the base syntax. instead of reading a single expression, we will read a toplevel - as many expressions as possible until we hit
end of file
.parser.parseToplevel = (state) => { let children = []; while (parser.current(state).kind != eof) { children.push(parser.parseExpr(state)); } return { kind: "toplevel", children, // Don't bother with start..end for now. }; }; parser.parseRoot = (state) => parser.parseToplevel(state);
with a
toplevel
node ready, we can now handle it in our interpreter:treewalk.eval = (state, node) => { switch (node.kind) { case "integer": let sourceString = state.input.substring(node.start, node.end); return parseInt(sourceString); case "identifier": return treewalk.lookupVariable(state, state.input.substring(node.start, node.end)); case "list": { let functionToCall = treewalk.eval(state, node.children[0]); let result = functionToCall(state, node); return result; } case "toplevel": let result = undefined; for (let i = 0; i < node.children.length; ++i) { result = treewalk.eval(state, node.children[i]); if (result !== undefined && i != node.children.length - 1) throw new Error(`expression ${i + 1} had a result despite not being the last`); } return result; default: throw new Error(`unhandled node kind: ${node.kind}`); } };
but it’s never that easy is it
we’ll write a function that walks over our AST, and inserts source strings into it.
export function insertSources(node, input) { if (node.start != null) { node.source = input.substring(node.start, node.end); } if (node.children != null) { for (let child of node.children) { insertSources(child, input); } } }
now I am aware this is changing object shapes quite a lot, which is suboptimal. but I would really like to keep the interpreter simple, so bear with me.
now we can patch the relevant parts of the interpreter to read from the
node.source
field, instead ofsubstring
ing the source string passed to the interpreter. this is pretty mechanical so I’ll just dump all the relevant code here:treewalk.eval = (state, node) => { switch (node.kind) { case "integer": return parseInt(node.source); // <-- case "identifier": return treewalk.lookupVariable(state, node.source); // <-- case "list": let functionToCall = treewalk.eval(state, node.children[0]); return functionToCall(state, node); case "toplevel": let result = undefined; for (let i = 0; i < node.children.length; ++i) { result = treewalk.eval(state, node.children[i]); if (result !== undefined && i != node.children.length - 1) throw new Error(`expression ${i + 1} had a result despite not being the last`); } return result; default: throw new Error(`unhandled node kind: ${node.kind}`); } }; builtins.fn = (state, node) => { if (node.children.length != 3) throw new Error("an `fn` must have an argument list and a result expression"); let params = node.children[1]; if (node.children[1].kind != "list") throw new Error("expected parameter list as second argument to `fn`"); let paramNames = []; for (let param of params.children) { if (param.kind != "identifier") { throw new Error("`fn` parameters must be identifiers"); } paramNames.push(param.source); // <-- } let expr = node.children[2]; return makeFunction(state, paramNames, expr); }; builtins.def = (state, node) => { if (node.children.length != 3) throw new Error( "a `def` expects the name of the variable to assign, and the value to assign to the variable", ); if (node.children[1].kind != "identifier") throw new Error("variable name must be an identifier"); let name = node.children[1]; let value = treewalk.eval(state, node.children[2]); state.env.set(name.source, value); // <-- };
and of course, to top it all off, we still need to insert source information into the nodes before evaluating our tree:
import { defaultEvalModule } from "treehouse/components/literate-programming/eval.js"; export function printEvalResult(env, input) { try { let tokens = lex(input); let ast = parse(tokens); insertSources(ast, input); // <-- let result = run(env, input, ast); // NOTE: `def` will not return any value, so we'll skip printing it out. if (result !== undefined) { console.log(result); } } catch (error) { console.log(error.toString()); } } kernel.evalModule = async (state, source, language, params) => { if (language == "haku") { state.haku ??= { env: new Map() }; printEvalResult(state.haku.env, source); return true; } else { return await defaultEvalModule(state, source, language, params); } };
data structures
the coolest part about lists is that we don’t even need to do anything on the JavaScript side to implement them - we can use our good old friend Lambda calculus, along with a really cool tool called Church encoding, which allows us to encode lists using nothing but functions!
and a function to reduce a list to a single element. this function has various names in various languages, but the idea is that it allows us to walk over a list, modifying a value along the way, until we get a single, final value.
(def clist/reduce (fn (init op list) (if (= (clist/tail list) 0) (op init (clist/head list)) (clist/reduce (op init (clist/head list)) op (clist/tail list)))))
once again, the recursive logic is kind of tricky; if you draw it out, you should be able to understand it much easier!
can I just say something real quick
I’m swiftly starting to dislike my parenthesized syntax choices here. they would be fine in an editor capable of highlighting mismatched parentheses, but Helix refuses to highlight any parentheses in
.tree
files until I add atree-sitter
grammar to it.the example above took me way too long to get working than I want to admit. honestly it’s a failure of tooling on my side, (should’ve embedded source spans into all these errors so that they can be reported more cleanly!) but I really don’t want to spend too much time on what’s basically just a prototype.
string manipulation
I find the modern orthodox string syntax just fine:
"text between quotation marks."
there’ll be no escape sequences for the time being - they inflate the surface area of the lexer quite a lot, and aren’t needed when you can store a bunch of magic strings in variables:
(cat ; `cat` will be our string concatenation function "Hello, world!" \n ; notice how \n is just a regular variable! "This is another line.")
let’s add that to the lexer then!
lexer.string = (state) => { lexer.advance(state); // skip over initial quotation mark while (lexer.current(state) != '"') { if (lexer.current(state) == eof) { return "error"; } lexer.advance(state); } lexer.advance(state); // skip over closing quotation mark return "string"; }; lexer.nextToken = (state) => { let c = lexer.current(state); if (c == '"') { return lexer.string(state); // <-- } if (isDigit(c)) { lexer.advanceWhile(state, isDigit); return "integer"; } if (isIdentifier(c)) { lexer.advanceWhile(state, isIdentifier); return "identifier"; } if (c == "(" || c == ")") { lexer.advance(state); return c; } if (c == eof) return eof; lexer.advance(state); return "error"; };
and to the parser…
parser.parseExpr = (state) => { let token = parser.current(state); switch (token.kind) { case "integer": case "identifier": case "string": // <-- parser.advance(state); return { ...token }; case "(": return parser.parseList(state, token); default: parser.advance(state); return { kind: "error", error: "unexpected token", start: token.start, end: token.end, }; } };
so to top this all off, we’ll make the interpreter produce a string value for string literals, like it produces numbers for integer literals:
treewalk.eval = (state, node) => { switch (node.kind) { case "integer": return parseInt(node.source); case "string": // <-- // NOTE: We chop the quotes off of the string literal here. return node.source.substring(1, node.source.length - 1); case "identifier": return treewalk.lookupVariable(state, node.source); case "list": let functionToCall = treewalk.eval(state, node.children[0]); return functionToCall(state, node); case "toplevel": let result = undefined; for (let i = 0; i < node.children.length; ++i) { result = treewalk.eval(state, node.children[i]); if (result !== undefined && i != node.children.length - 1) throw new Error(`expression ${i + 1} had a result despite not being the last`); } return result; default: throw new Error(`unhandled node kind: ${node.kind}`); } };
"hello!"
hello!
with strings added to the language, we’ll need some basic operations to put together and break apart strings.
I don’t want to write a whole load of wrapper code every single time we want to declare a simple JavaScript function that’s available from haku, so let’s build a helper for that first:
export function wrapJavaScriptFunctionVarargs(func) { return (state, node) => { let args = Array(node.children.length - 1); for (let i = 1; i < node.children.length; ++i) { args[i - 1] = treewalk.eval(state, node.children[i]); } return func(...args); }; } export function wrapJavaScriptFunction(func) { let inner = wrapJavaScriptFunctionVarargs(func); return (state, node) => { if (node.children.length != func.length + 1) throw new Error( `\`${func.name}\` expects ${func.length} arguments, but ${node.children.length - 1} were given`, ); return inner(state, node); }; }
then there’s also
sub
, for indexing. one thing that always seemed kind of arbitrary is thatsubstring
in JavaScript and other languages defaults its argument to the string’s length; I personally think if the second argument is not provided, you almost always want to just get the character at that index.builtins.sub = wrapJavaScriptFunctionVarargs((str, start, end) => { if (typeof str != "string") throw new Error("`sub` expects a string as the first argument"); if (typeof start != "number") throw new Error("`sub` expects a number as the second argument"); end ??= start + 1; return str.substring(start, end); });
(sub "hello, world!" 0)
h
(sub "hello, world!" 0 5)
hello
and last but not least, a pair of functions to convert between numbers and strings.
builtins["to-string"] = wrapJavaScriptFunction((n) => n.toString()); builtins["to-number"] = wrapJavaScriptFunction((s) => parseInt(s));
(cat "today's magic number is: " (to-string 65))
today's magic number is: 65
(+ (to-number "60") 5)
65
we won’t care about the fallibility of
to-number
for now; we’re hand-waving away all the error handling anyways.