6502 Scripting: a Tcl Shell on the Commodore 64

The C64’s user interface is Commodore BASIC v2, which presents itself without delay as a REPL the instant the machine is powered on. BASIC lives in 8 KB of ROM of 6502 machine code. Michael Steil has worked on collecting versions of the BASIC source for various 8-bit platforms, putting them into a single source tree, and adding comments. There’s also EhBASIC by Lee Davidson, a popular BASIC interpreter used by a number of 6502-based homebrew computers. EhBASIC fits into 10 KB.

There were other interpreted languages available for the C64, e.g. LISP dialects such as LISP 64 (page in German). From a glance at the screenshots, LISP 64 is delivered as a 52-block large PRG file, so about 13 KB in size.

With scripting on resource-deprived microcontrollers becoming more popular, I was wondering if there are also new options for 6502-based systems. Alas, languages like MicroPython still require at least hundreds of KB of memory for the interpreter. Then, I came across an excellent blog post by Serge Zaitsev who was evidently on a similar mission in 2016.

Serge wrote a small interpreter for a dialect of Tcl, called Partcl. Tcl’s appeal is in its simplicity—there is not a whole lot to the language’s syntax (see Serge’s great and brief summary). I used Tcl extensively when writing my thesis since the ns-2 network simulator is based on it (or rather its object-oriented extension, OTcl. The language is easily extensible, making it great to embed. In the case of my thesis, that meant creating a lot of event sinks that can capture transmission characteristics of network packets.

Partcl implements a few of Tcl’s commands and substitutes some more complex ones (such as expr to evaluate an arithmetic expression) with others (2-ary math functions, i.e. [* [+ 1 2] 3] instead of [expr (1+2)*3]). The whole interpreter is implemented in 680 lines of C code, comes with a nice test suite, and reportedly runs using only 10 KB of flash on an STM32F051 MCU.

So how will it fare on an 8-bit 6510 running at less than 1 MHz inside a C64? To find out, the first step is to get it compiled for that target. For 6502-based CPUs such as the 6510, the cc65 cross-compiler is available. The documentation notes some deviations from the ISO C standard and some limitations due to the constrained resources of 6502-based 8-bit systems.

This is my first foray into C for the 6502, so I was curious what those constraints mean in practice and how easy it would be to get Serge’s code running on a machine from the 1980’s. The first issue to address are variable declarations. While cc65 implements some C99 features, that does not include variable declarations in the middle of blocks, presumably due to it being a single-pass compiler. Hence, all declarations need to move to the beginning of their respective blocks (including the ones in macros, specifically tcl_each in Partcl).

Next, cc65 requires struct initializers to be constant expressions and otherwise complains (“Error: Constant expression expected”). In contrast, C99 allows non-constant expressions if the struct is not declared as static. The quick fix is to initialize those respective fields with something else and then later populate them using the non-constant expressions.

The last challenge was more mystifying with cc65 proclaiming that there are “Too many local variables.” The compiler implements its own stack, it does not use the hardware one provided by the 6502. That makes sense considering the limitations of what the 6502 provides, i.e. 256 bytes of fixed stack space from $0100 to $01ff, but it comes of course with a speed penalty. By default, the C64 mode of cc65 has its software stack grow downwards from $cfff.

The compiler maintains a stack pointer and it knows the offset of local variables to that pointer at compile time. To fetch a local variable, it uses the 6502’s indirect indexed addressing mode, i.e. LDA (stack_pointer), Y. This can be done relatively fast in either 5 or 6 cycles (on top of loading the offset into Y, 2 cycles) but requires that the offset fits into the Y register, i.e. it is not larger than 255. Hence, only 256 bytes can be addressed this way.

My initial attempt was to move some local variables off the stack by making them static. That puts those variables into the data segment at fixed addresses, hence functions using it are no longer reentrant since multiple concurrent invocations will access the same memory. In many spots in the code that is not an issue. As a bonus, access is faster (immediate addressing mode, 2 cycles). The --static-locals compiler option gives this treatment to all local variables.

As it turns out, a large array buffer allocated on the stack was the main culprit. After trimming down its size, all the other errors for local declarations disappeared. Lesson learned: the line with that error message is not necessarily the source of the problem.

At this point, the interpreter runs on the C64. After some additional minor changes such as printing a prompt, it looks like this:

Tcl on the C64

Curly braces map to two PETSCII graphics characters on the C64, which actually work as decent substitutes (PETSCII has no curly braces). They are available on the “Q” and “W” keys, which also works reasonably well. But alas, the closing one is on the left key while the opening one is on the right one. That could be adjusted, of course.

Serge’s code makes it easy to extend the interpreter, so I’ve also added peek and poke commands to interact with the system, e.g. to change the screen colors. The next step is to run the test suite on the C64 to make sure it also completes there without errors. That, however, did not work so well (see screenshot below, compiled in debug mode).

Heap corruption

This took a bit longer to sort out. The culprit was a variable on the stack that was still used through a pointer after going out of scope. While this didn’t cause any issues when running the code on Linux, it tripped things up on the C64. After fixing this, there is one final hurdle. The tests recursively compute the 20th Fibonacci number. The computation has exponential complexity and therefore takes forever on the C64, so I changed it to computing the 5th one to have it complete in a reasonable amount of time. And with that, the tests complete:

Tests completing

So how well does it work? The interpreter fits into 16 KB (plus the memory needed to run programs). Not surprisingly, that’s larger than the hand-written assembler of Commodore BASIC. The tests weigh in at about 25 KB (the test binary contains the interpreter code without the REPL plus all the test cases).

While snappy on my desktop, the interpreter is very slow on the underpowered 6502. Anything non-trivial takes at least several seconds to compute. Bear in mind that the interpreter is not optimized for speed. To put this into more perspective, consider the test suite. On my desktop, it takes about 20 milliseconds to complete. For a C64 to slog through it, it takes almost 3 minutes. Nonetheless, it is fun to see Tcl code executing on a C64 and to toy with a REPL that is not BASIC.