LLX > Neil Parker > Apple II > Kaleidoscope

(There's also a version in Flash now too. The Java applet that used to be on this page is archived here.)

Those of you who have been in the Apple II world long enough to remember
the DOS 3.3 System Master disk may recall the Integer BASIC program on
it called `COLOR DEMO`. One of its menu options is
`KALEIDOSCOPE`, which draws pretty symmetrical pictures on the
lo-res screen.

I occasionally like to run `COLOR DEMO` and watch the
pretty kaleidoscope colors. Listing the source code, I noticed that it
doesn't run forever--it stops after a fixed number of iterations. But
each iteration was rather slow, and I never had the patience to watch
the whole thing from start to finish.

Then one day I studied the listing more closely, and it occurred to
me that the source code was rather inefficient, and that I could
probably speed it up quite a bit with a little rewriting--maybe by
enough to watch the whole thing. So this article is about my
adventures optimizing `KALEIDOSCOPE`.

The original Integer BASIC source code, stripped of extraneous fluff, is as follows:

700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19: FOR J=0 TO 19:K=I+J 750 COLOR=J*3/(I+3)+I*W/12 760 PLOT I,K: PLOT K,I: PLOT 40-I,40-K: PLOT 40-K,40-I 770 PLOT K,40-I: PLOT 40-I,K: PLOT I,40-K: PLOT 40-K,I 780 NEXT J,I,W

There are a couple of obvious things that can be improved in this
code. In line 750, the expression `I*W/12` is recomputed on
each pass through the `J` loop, even though it doesn't depend
on `J`. Likewise, there's no need to recompute
`40-I` and `40-K` for each `PLOT`
statement.

So here's an improved routine:

700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19: X=I*W/12: I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K 750 COLOR=J*3/(I+3)+X 760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40 770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I 780 NEXT J,I,W

(Moving `X=I*W/12` and `I40=40-I` out of the
`J` loop is called *loop invariant removal*.
Precalculating `I40` and `K40` instead of
recalculating them each time is called *common subexpression
elimination*.)

But that's not the best that can be done. Multiplication on any
6502-based hardware is always a somewhat slow operation, due to the
6502's lack of a hardware multiply instruction. The `J*3`
operation can clearly be made more efficient by writing it as
`J+J+J`. What's not quite so easy to see is that something
similar can be done with `I*W`...note that each time through
the `W` loop, `I*W` starts out with the value of
`W`, and each time through the `I` loop, it increases
its value by `W`. Thus the following code is mathematically
the same as the above, but uses only quick additions instead of slow
multiplications:

700 GR : CALL -936: FOR W=3 TO 50:IW=0: FOR I=1 TO 19:IW=IW+W: X=IW/12: I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K 750 COLOR=(J+J+J)/(I+3)+X 760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40 770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I 780 NEXT J,I,W

(The `IW` optimization is called *strength
reduction*).

This code runs noticeably faster than the original code, but I still found it too slow to wait for. Clearly more drastic measures are called for...

Machine language! That's right...my next step was to hand-compile the BASIC into into 6502 code. Not only does machine code run much faster than interpreted BASIC could ever hope to, there are additional optimizations that wouldn't be of much help to the BASIC but would help machine language.

First the variables, all on page zero for speed. All the variables
used in the BASIC code can be stored in single bytes, except for
`IW`, which needs to store values up to 950. The two
additional variables `A1L` and `A1H` are temporaries
used in the division by 12 and by `(I+3)`. Memory locations
were chosen to avoid conflict with BASIC's zero-page usage.

1 W = 6 2 I = 7 3 J = 8 4 K = 9 5 A1L = $3C 6 A1H = $3D 7 K40 = $FC 8 I40 = $FD 9 IWL = $FE 10 IWH = $FF

Necessary I/O locations:

11 KBD = $C000 12 KBDSTRB = $C010 13 LORES = $C056 14 AN3OFF = $C05F

Necessary ROM routines:

15 PLOT = $F800 16 SETCOL = $F864 17 SETGR = $FB40 18 HOME = $FC58

Now for the code. First we initialize the lo-res screen and clear
the text area at the bottom. The `LDA LORES` is done because
`SETGR` doesn't, and the `LDA AN3OFF` is in case
somebody was using double-res graphics recently.

19 ORG $2000 2000: 20 40 FB 20 JSR SETGR ;GR 2003: AD 56 C0 21 LDA LORES 2006: AD 5F C0 22 LDA AN3OFF 2009: 20 58 FC 23 JSR HOME ;HOME

Now the `W` loop. This just starts it out--the increment of
`W` and test for the final value are done at the bottom of the
loop.

200C: A9 03 24 LDA #3 ;FOR W=3 TO 50 200E: 85 06 25 STA W

Initialize `IW`. `I40` is also initialized here so
instead of subtracting `I` from 40 each time, we can just
`DEC` it each time through the `I` loop. This is another
strength reduction, which wasn't worthwhile in BASIC but is in machine
language.

2010: A9 00 26 WLP LDA #0 ;IW=0 2012: 85 FE 27 STA IWL 2014: 85 FF 28 STA IWH 2016: A9 28 29 LDA #40 ;I40=40 2018: 85 FD 30 STA I40

Now the `I` loop begins. It works just like the
`W` loop.

201A: A9 01 31 LDA #1 ;FOR I=1 TO 19 201C: 85 07 32 STA I

`IW=IW+W` is straightforward.

201E: A5 06 33 ILP LDA W ;IW=IW+W 2020: 18 34 CLC 2021: 65 FE 35 ADC IWL 2023: 85 FE 36 STA IWL 2025: 90 02 37 BCC ILP1 2027: E6 FF 38 INC IWH

Here `IW` is divided by 12, and the result is stored in the
`X` register (which is safe because nothing else in the code
clobbers it). The division is done using a standard 16-bit by 8-bit
long division algorithm (with the dividend in `A1L` and
`A1H`), coded inline because it's used only once and it saves
the cycles that `JSR/RTS` would use.

2029: 85 3C 39 ILP1 STA A1L ;X=IW/12 202B: A5 FF 40 LDA IWH 202D: 85 3D 41 STA A1H 202F: A0 10 42 LDY #16 ;(inline 16/8 division) 2031: A9 00 43 LDA #0 2033: 06 3C 44 DIV1LP ASL A1L 2035: 26 3D 45 ROL A1H 2037: 2A 46 ROL 2038: C9 0C 47 CMP #12 203A: 90 04 48 BCC DIV1A 203C: E9 0C 49 SBC #12 203E: E6 3C 50 INC A1L 2040: 88 51 DIV1A DEY 2041: D0 F0 52 BNE DIV1LP 2043: A6 3C 53 LDX A1L

Here's the previously-promised decrement of `I40`. Also,
`K40` is set up to receive the same treatment.

2045: C6 FD 54 DEC I40 ;I40=I40-1 2047: A5 FD 55 LDA I40 ;K40=I40 2049: 85 FC 56 STA K40

The beginning of the `J` loop. The end of the loop (below)
will guarantee that the accumulator holds the current value of
`J` for the `K=I+J` computation.

204B: A9 00 57 LDA #0 ;FOR J=0 TO 19 204D: 85 08 58 STA J 204F: 18 59 JLP CLC ;K=I+J 2050: 65 07 60 ADC I 2052: 85 09 61 STA K

Since you can't control-C a machine language program, this gives the user a chance to quit at any time by pressing a key.

2054: AD 00 C0 62 LDA KBD ;IF PEEK(KBD)>127 THEN DONE 2057: 10 06 63 BPL NOKEY 2059: 8D 10 C0 64 STA KBDSTRB 205C: 4C DD 20 65 JMP DONE

The color computation. The shift-and-add computation below is
quicker that adding `J` to itself three times. Note that
`J` is never greater than 19, so the carry flag is
automatically guaranteed to be clear before the first two
`ADC` instructions. Another
inline division is used (`A1L` by `A1H`, result in
`A1L`)...this one only needs to divide 8 bits by 8 bits.

205F: A5 08 66 NOKEY LDA J ;COLOR=J*3/(I+3)+X 2061: 0A 67 ASL 2062: 65 08 68 ADC J 2064: 85 3C 69 STA A1L 2066: A5 07 70 LDA I 2068: 69 03 71 ADC #3 206A: 85 3D 72 STA A1H 206C: A9 00 73 LDA #0 ;(inline 8/8 division) 206E: A0 08 74 LDY #8 2070: 06 3C 75 DIV2LP ASL A1L 2072: 2A 76 ROL 2073: C5 3D 77 CMP A1H 2075: 90 04 78 BCC DIV2A 2077: E5 3D 79 SBC A1H 2079: E6 3C 80 INC A1L 207B: 88 81 DIV2A DEY 207C: D0 F2 82 BNE DIV2LP 207E: 8A 83 TXA 207F: 18 84 CLC 2080: 65 3C 85 ADC A1L 2082: 20 64 F8 86 JSR SETCOL

The eight plot commands. These could be made faster by rearranging
them so that all the plots with the same Y coordinate are done in
groups, which minimizes the number of times the row address has to be
calculated. But I kept the plot order the same to guarantee that the
display would *identical* to the original BASIC display.

2085: A4 07 87 LDY I ;PLOT I,K: etc... 2087: A5 09 88 LDA K 2089: 20 00 F8 89 JSR PLOT 208C: A4 09 90 LDY K 208E: A5 07 91 LDA I 2090: 20 00 F8 92 JSR PLOT 2093: A4 FD 93 LDY I40 2095: A5 FC 94 LDA K40 2097: 20 00 F8 95 JSR PLOT 209A: A4 FC 96 LDY K40 209C: A5 FD 97 LDA I40 209E: 20 00 F8 98 JSR PLOT 20A1: A4 09 99 LDY K 20A3: A5 FD 100 LDA I40 20A5: 20 00 F8 101 JSR PLOT 20A8: A4 FD 102 LDY I40 20AA: A5 09 103 LDA K 20AC: 20 00 F8 104 JSR PLOT 20AF: A4 07 105 LDY I 20B1: A5 FC 106 LDA K40 20B3: 20 00 F8 107 JSR PLOT 20B6: A4 FC 108 LDY K40 20B8: A5 07 109 LDA I 20BA: 20 00 F8 110 JSR PLOT

The second half of the `K40` strength reduction, and the end
of the `J` loop:

20BD: C6 FC 111 DEC K40 ;K40=K40-1 20BF: E6 08 112 INC J ;NEXT J 20C1: A5 08 113 LDA J 20C3: C9 14 114 CMP #20 20C5: D0 88 115 BNE JLP

The end of the `I` loop. Alas, the top of the loop is too
far away to reach with a `BNE`.

20C7: E6 07 116 INC I ;NEXT I 20C9: A5 07 117 LDA I 20CB: C9 14 118 CMP #20 20CD: F0 03 119 BEQ NOILP 20CF: 4C 1E 20 120 JMP ILP

And the end of the `W` loop:

20D2: E6 06 121 NOILP INC W ;NEXT W 20D4: A5 06 122 LDA W 20D6: C9 33 123 CMP #51 20D8: F0 03 124 BEQ DONE 20DA: 4C 10 20 125 JMP WLP 20DD: 60 126 DONE RTS --End assembly, 222 bytes, Errors: 0

And that's it.

The machine-language version runs in about 20 seconds on a standard Apple II, and in about 8 seconds on a IIGS at full speed. That's plenty fast enough to satisfy my curiosity--I had no problem waiting for this version to finish.

So was it worth it? Well...that depends. I was glad to finally have a version that I could watch from start to finish, but the final pattern isn't really any more spectacular than the early stages.

I had entirely too much fun writing and optimizing the machine language version.

Here's the source code (Merlin assembler) and binary in a Shrinkit archive (1587 bytes) or a zipped 5.25-inch ProDOS disk image (2395 bytes) for those who want to try the machine language version themselves but don't want to type in the hex code manually. And here's the assembly listing for those who want to see it without the running commentary.

(Just so Applesoft programmers don't feel left out, only a few minor modifications need to be made to convert the Integer program to Applesoft.)

700 GR : HOME : FOR W = 3 TO 50:IW = 0: FOR I = 1 TO 19:IW = IW + W: X = INT (IW / 12): I4 = 40 - I: FOR J = 0 TO 19:K = I + J:K4 = 40 - K 750 COLOR= INT ((J + J + J) / (I + 3)) + X 760 PLOT I,K: PLOT K,I: PLOT I4,K4: PLOT K4,I4 770 PLOT K,I4: PLOT I4,K: PLOT I,K4: PLOT K4,I 780 NEXT : NEXT : NEXT

(This could be speeded up slightly by putting all the numeric
constants in variables, since Applesoft can look up a variable faster
than it can parse a constant. It can be speeded up a *lot* by
compiling with the Beagle
Compiler,
but nowhere near as much as the handwritten assembly code above.)

**Update:** Since writing the above, I've learned a
couple of things: First, the Integer BASIC code apparently didn't
originate in `COLOR DEMO`, but dates back to the famous *Red
Book*, the original reference manual that shipped with early Apple
II's. I've never seen an authentic *Red Book*, but apparently the
kaleidoscope program was originally called `ROD'S COLOR
PATTERN`, and was attributed to Randy Wigginton.

Second, I'm not the only person out there in Internet-land who's tried hand-compiling this code. The earliest ones I've seen were published in the Apple Assembly Line newsletter--the January 1984 issue contains a 68000 version by Bob Urschel (intended to run on an Apple II with a 68000 coprocessor card), and the March 1984 issue contains a 6502 version by Charles H. Putney. There's also a version by John B. Matthews.

None of these versions uses quite the same approach as my code. Bob
Urschel's version is a straight translation of the original Integer
BASIC, with no attempts at optimization at all. Charles Putney's
version does the loop invariant removals and the common subexpression
eliminations, but uses table lookups for `I*W/12`

,
`J*3/(I+3)`

, and the PLOTs. John Matthews's version is
somewhere between those two extremes, with no optimizations for the color
computation, but using table lookups to speed up the PLOTs. My effort
probably falls somewhere between the Matthews and Putney versions.

LLX > Neil Parker > Apple II > Kaleidoscope

Original: June 4, 2004

Modified: August 21, 2005--added Update

Modified: February 5, 2007--Apple Assembly Line is at a new URL

Modified: June 1, 2008--John B. Matthews moved to a new URL

Modified: October 20, 2008--added link to Flash version

Modified: March 8, 2017--Replaced Java applet with Javascript, because
browsers don't support the Java plugin anymore. Also, Apple Assembly Line
and John B. Matthews are at new URLs again.

Modified: April 5, 2018--Added zipped disk image