
Hacker News · Feb 23, 2026 · Collected from RSS
Article URL: https://www.colino.net/wordpress/archives/2026/02/23/the-challenges-of-porting-shufflepuck-cafe-to-the-8-bits-apple-ii/
Comments URL: https://news.ycombinator.com/item?id=47128631
Points: 7
Comments: 1
This post originally appeared in the June 2025 issue of Juiced.GS, and has been expanded with more details.

I am very proud to have succeeded in porting Shufflepuck Cafe to the 8-bit Apple II, bringing a very dynamic 1989 game to a 1979 platform without losing playability or detail. In this article, I will share the challenges that made me pause, and how I solved them. If you are interested in the game itself, please head over to the Shufflepuck Cafe for Apple II project page.

1 : Displaying sprites and moving them

The first challenge was that I didn’t even know how to cleanly display a sprite, handle the mouse, etc. I solved this one by stepping back and… starting with a Glider port, as that game is technically much less complicated than Shufflepuck. It took me a month of very late evenings and hyper-focusing through whole weekends without the kids, but I then had a much better idea of how to write a game, some foundation to build on, and a cool port of Glider.

In the process, I also learned more things that would prove useful for Shufflepuck Cafe: how to easily play sounds, and how to fit a lot of data both on a 140kB floppy disk and in memory. I had already written programs that could play sounds, but in those previous projects, the sound samples came over the serial port. Instead of manually rewriting my player, this time I did things properly and wrote a player generator, which, as I figured out later during Shufflepuck’s development, made things far easier. I’ll come back to those subjects later.

2 : A 3D table ?

As you probably know, Shufflepuck is a “3D” game.

[Figure: The Shufflepuck table]

Of course, in reality, Shufflepuck is not a 3D game. The table is a background, and sprites (two pushers and one puck) are displayed over it. The only things required to make it look 3D are some coordinate transformation and some scaling.

2a: The coordinates

For the coordinates, viewed from “inside” the code, the table is a rectangle 255 pixels wide by 192 pixels high.
I realize now that the rectangle could have been 255 pixels high, but this is not a problem as-is.

So, after lots of theory reading, I figured that what I wanted to do is a one-point perspective transformation. I drew over the background and figured out the vanishing point and other key points using a temporary layer:

[Figure: My perspective parameters]

For a given X on the rectangle, the “graphical” X (gx) on the screen will be X multiplied by x_factor% and shifted by x_shift pixels, where the factor and the shift depend on Y:

    gx = x * x_factor(y) + x_shift(y)

and gy will be Y multiplied by y_factor% (as the puck goes backward, its graphical Y is “flattened” compared to the geometric one; otherwise, it seems to accelerate the further away it is).

The magic values come from the points of interest: V (the vanishing point), F (the back of the table) and M (the front of the table). For a given x,y on the rectangle, we will have:

    depth(y)     = Fy + My - y
    y_divisor(y) = depth(y) / ((My - Vy) / Fy)
    x_factor(y)  = Fy / depth(y)
    x_shift(y)   = (-Vx * Fy) / depth(y) + Vx + Mx
    gx           = x * x_factor(y) + x_shift(y)
    gy           = My - (((My - y) * Fy) / y_divisor(y))

This was of course tested and iterated over and over using a simple SDL proof of concept:

[Figure: The table with a set of calculated gx,gy every 10 points]

Of course, this kind of computation is not something one wants to do in 6502 assembly when speed matters, so I wrote a lookup table generator, which created three tables for me to quickly find gx,gy given x,y. Using these three tables still required one multiplication and one division (neither of which the 6502 can do natively, so they are expensive) to compute gx, as the X factor depends on Y and I couldn’t make a 255*192 lookup table, of course (48kB, haha). I got rid of the division: instead of storing a percentage, I store a “per256tage”, which replaces the division with a single instruction (moving the high byte of the multiplication result to the low byte).

I am proud of that _transform_xy function that trades size for speed.
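To make the “per256tage” idea concrete, here is a minimal sketch of such a table generator, written in Python rather than 6502 assembly. The key points V, F and M are made-up values (the real ones are not given here). The table layout (one entry per Y for the X factor, the X shift and gy) is an assumption, as is clamping the factor to 255 so that it fits in one byte. All names are hypothetical.

```python
# Hypothetical key points, in pixels; NOT the values used in the actual game.
VX, VY = 127, 20    # V: vanishing point
FY = 80             # F: back of the table
MX, MY = 12, 191    # M: front of the table
TABLE_H = 192       # logical table height

def depth(y):
    return FY + MY - y

def x_factor(y):
    return FY / depth(y)

def x_shift(y):
    return (-VX * FY) / depth(y) + VX + MX

def y_divisor(y):
    return depth(y) / ((MY - VY) / FY)

def exact_gx(x, y):
    """Floating-point reference for gx, straight from the formulas."""
    return x * x_factor(y) + x_shift(y)

def exact_gy(y):
    """Floating-point reference for gy, straight from the formulas."""
    return MY - ((MY - y) * FY) / y_divisor(y)

# Three tables indexed by Y (assumed layout). The factor is stored as a
# fraction of 256 instead of a percentage, clamped so it fits in a byte.
X_FACTOR_256 = [min(255, round(x_factor(y) * 256)) for y in range(TABLE_H)]
X_SHIFT = [round(x_shift(y)) for y in range(TABLE_H)]
GY = [round(exact_gy(y)) for y in range(TABLE_H)]

def transform_xy(x, y):
    """Table-driven transform: one multiplication, no division.

    Keeping the high byte of the 16-bit product (>> 8) is what
    replaces the division by 256.
    """
    gx = ((x * X_FACTOR_256[y]) >> 8) + X_SHIFT[y]
    return gx, GY[y]
```

With these made-up points, `transform_xy(0, MY)` returns `(MX, MY)`, the front-left corner of the table, and the table-driven result stays within a couple of pixels of the floating-point formulas everywhere else. The real `_transform_xy` does the same job in 6502 assembly.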
The function takes 612 bytes (of which 580 are the lookup tables) and executes in 138 cycles. I made the first Shufflepuck commit once I got that right, because it seemed like the absolute basis for this game.

2b: The perspective

Of course, in three dimensions, the closer an object is, the bigger it is. You guessed it: it would be impossible to scale sprites in real time on a 1MHz 6502, so once again, we’ll trade size for speed. There are four sprites for the player’s pusher, six sprites for the puck, and two for the opponent’s pusher. The sprite to use is determined by its Y coordinate.

[Figure: The different versions of the pushers and puck]

Given that each pixel is a bit on the Apple II, all of these sprites are also stored 7 times, each version shifted one pixel to the right, in order to quickly position the sprite on the X axis instead of shifting everything manually. We sure did trade size for speed there: these three elements occupy 7329 bytes of memory.

3 : Draw fast enough

Drawing sprites on the Apple II is a time-consuming process. One has to select the correct version of the sprite using X modulo 7 (via a lookup table); get the first line’s start address (two more lookup tables, thanks to HGR interlacing!); add the X divided by 7 offset (a fourth lookup table!) to get the first byte to update; fetch a background byte to back it up, AND that with the sprite mask, OR this with the sprite data, and store the result on screen; do that for a full line, then iterate to the next line, etc.

The front pusher’s sprite is rather large at 49×17 pixels, and the puck when in front is also a bit large. Drawing each sprite at each frame amounted to almost 14000 cycles, and that was too much, for two reasons: the first one is that there’s a new frame to draw every ~17ms (or ~17000 cycles; the joy of 1MHz is that a cycle basically equals a microsecond.
It’s not exactly that, but it’s close enough), and in addition to drawing, one has to actually have enough cycles to run the game logic, too. And the second one: you don’t have 17000 cycles to draw a frame, unless you only draw on the last line of the screen! This page explains it really well. If you want a flicker-free, clean draw, you have to be faster than the CRT beam. Start drawing right when the beam leaves the bottom-right corner of the screen, and you have 4550 cycles to draw on the first line before it arrives there. After that, you have an extra 65 cycles per line. At 14000 cycles per update, we couldn’t win the race and our pusher flickered when it was at the end of the player’s side of the table.

[Figure: Flickering is noticeable when the player’s pusher is at the end of the table.]

At first, I solved that by updating only half the screen every frame. When the puck was on the opponent’s side of the table, I drew their pusher and the puck on even frames, and the player’s pusher on odd frames. When the puck was on our side of the table, I drew either the opponent’s pusher, or the player’s pusher plus the puck. This brought each frame down to less than 7000 cycles, but with this technique, I only had a 30 frames-per-second rate.

So, I changed the way I draw the player’s pusher. Instead of masking it with the background (which amounts to 20-25 cycles per byte: LDA screen, STA backup, AND mask, ORA sprite, STA screen), I used a well-known old technique for performance: exclusive ORing. Exclusive ORing (LDA screen, EOR sprite, STA screen) has three advantages. First, it spares cycles by not needing to save the background (-4 or 5 cycles per byte). Second, it spares cycles by not needing to mask (-4 or 5 cycles per byte). Third, as a bonus, the first two speed gains also translate into memory savings, getting rid of the background save buffer (136 bytes) and the masks (all 7 versions of all 4 sizes of the sprite: 3400 bytes!).
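The difference between the two drawing methods can be sketched in a few lines of Python, with a bytearray standing in for screen memory. This is an illustration only: the function names and sample bytes are hypothetical, and it assumes that an EOR-drawn sprite is erased by drawing it a second time at the same place, which is the usual way to use this technique.

```python
def draw_masked(screen, offset, sprite, mask):
    """Masked draw: back up the background, AND with the mask, OR the sprite.

    Returns the saved background bytes, needed later to erase the sprite.
    """
    backup = bytes(screen[offset:offset + len(sprite)])
    for i, (s, m) in enumerate(zip(sprite, mask)):
        screen[offset + i] = (screen[offset + i] & m) | s
    return backup

def restore(screen, offset, backup):
    """Erase a masked sprite by copying the saved background back."""
    screen[offset:offset + len(backup)] = backup

def draw_xor(screen, offset, sprite):
    """XOR draw: no backup buffer and no mask.

    Calling it twice with the same arguments leaves the screen unchanged,
    because (b ^ s) ^ s == b for any bytes b and s.
    """
    for i, s in enumerate(sprite):
        screen[offset + i] ^= s

# Made-up background and sprite bytes, for demonstration only.
screen = bytearray([0x55, 0x2A, 0x7F, 0x00])
sprite = bytes([0x1C, 0x3E])
draw_xor(screen, 1, sprite)   # draw
draw_xor(screen, 1, sprite)   # erase: the background is back as it was
```

The price of the XOR method is visual: wherever the background has bits set, the sprite’s pixels are inverted instead of being drawn cleanly over it.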
Finally, I linked these assets in the correct order and places so that the biggest pusher and puck sprites would be page-aligned, to avoid the page-crossing penalty and spare three to five extra cycles per byte.

Aesthetically, it was pleasing enough; and, performance-wise, it allowed me to draw the three sprites for every frame in about 11000 cycles, which was enough to win the race against the beam, provided I drew the sprites in the correct order, from top to bottom. That order was needed anyway, for simple geometry reasons, so that the sprite in front would obscure the one behind it. No more flickering. I kept the “clean” (background-masking) method for drawing the puck and the opponent’s pusher, as the EOR method was extremely ugly on them.

During all these optimisation phases, I needed to count cycles to have an idea of how much I needed to gain, or how much slack I had. Counting manually over such large functions is of course unfeasible, so I ran all of these tests with MAME’s CPU tracing log on, and used the trace logs with my debugger in Callgrind Profile mode:

[Figure: In this trace, we can see an average of 11793 cycles per draw (4.7M cycles/402 calls)]

While imperfect, this debugger helps me a lot on each project.

4 : The size of the sounds

In the original Shufflepuck Cafe, there are as many gameplay sounds as there are moving elements: three. One when a pusher hits the puck, one when the puck hits a wall, and the famous “window crash” sound when a player misses the puck. But there are variations on these sounds: their pitch is lower on the opponent’s side. I could not have multiple samples to replicate that effect: the crash sound alone, an 800ms sample at 8kHz, is 6500 bytes. I couldn’t decently use all of the Apple II memory for sprites and sounds!

Instead, I added a feature to my sound player generator, and this is where I got really happy to have it count cycles for me.
The idea was that before jumping to the sound player with the sample in the X register, I would set a “slowdown” factor in the Y register. Y=0 means no slowdown. Afterward, each of the sound player’s duty cycles would decrement that counter to 0, effectively wastin