ZXDS developer's diary

I have always been a Spectrum fan. I spent a big part of my childhood programming it, and I have lot of fond memories of those times.

When Nintendo DS appeared and I have found out it has the exact same screen resolution as ZX Spectrum had, I immediately knew that one day I will be tempted to write my own emulator for it. I always thought GBA would be perfect for portable ZX Spectrum, if only it had a bit bigger display to avoid the scaling problem. With the new hardware out of the door, it was clear I will give it a try sooner or later.

Recently, one of my colleagues really got into the homebrew stuff, and talking about it almost daily, I said, heck, why not? I have been extremely busy lately, but with our youngest getting older I thought I might finally reward myself with a little pet project of my own. I was interested in what DS can do and ZX emulator seemed as a fairly simple thing to start with, given the limited time frame I have to get by with.

Training grounds

For the start, I have digged out the sources of the Z80 emulator I have written for the PC version of DeliAY years ago. I remember I wrote the core over one weekend, using my old 680x0 emulator I have written for the Amiga before as a reference, so I pretty much knew what I was writing already. I then spent two evenings adding the proper timing support and support for undocumented flags, which my old emulator for the Amiga lacked. Strange, it all seems so long ago.

The emulator lacked ROM write protection, which was not needed for the purposes of DeliAY, but this was not a problem, as I planned to add paging support anyway. Otherwise it was already pretty much suited for my intended purposes.

So I have spent three evenings to hack up a little emulator in SDL on top of my Z80 emulator. I wanted to make sure I won't be hunting ghost bugs once I start porting it to DS, so a SDL reference seemed like a good idea. Funny, I wonder why I have never done this before, when I was debugging DeliAY. The first day I spent about two hours setting up the SDL framework and framebuffer, setup the emulator, installing the ROM including basic write protection, and basic black and white decoding of the screen. The other day I had just half an hour, which was just enough to implement keyboard support. The last day I spent about an hour adding support for loading 48k SNA snapshots. Of course the first game I tried was Head over Heels, one of my favorites. Boy, it's so good. I couldn't resist and spent the rest of the evening playing it.

Having this all set up, I downloaded and installed devkitPro for ARM. Now I just waited for the reasonably large continuous time slice to start the real work. Finally, my wife decided to go skiing this Saturday (17.3.2007), leaving with our son on Friday evening. I was left to babysit our little daughter, which is not really an ideal development environment, not to mention heavy sore throat I've got, but still it was perhaps the best opportunity I could hope for in a long time.

Knowledge is power

After putting the toddler to bed, I finally started this Friday evening. I first needed to get familiar with the NDS at all, as I haven't had read anything about NDS nor GBA internals before. I browsed through any reasonable NDS and GBA documentation I could find on the net, quickly learning what an amazing piece of hardware the NDS is. It reminded me of the good old Amiga days, when the Hardware Reference Manual was the only documentation we had for a long time to come, so we knew how to control every chip long before we learned how to open a file via dos.library. Ah, so much nostalgia.

The things I basically needed to try out were framebuffer, input, and file I/O. I didn't bother with sound yet, as I knew I definitely won't be implementing it this weekend. Following the libnds examples I quickly put together few of my own tests. It all went quite smoothly, no surprises, and I soon had working tests which I tried in both desmume as well as on my M3 Simply. Kudos to the libnds people and the entire homebrew scene, of course, for making this all possible at all.

Having tested all I needed, I spent some more time reading about the ARM processor architecture. I have never used ARM before, so I just wanted to get a bit of knowledge of what it can do. I didn't plan to write anything in the assembler if possible anyway, but it never hurts to know the target platform. As I expected from a RISC processor, there were no surprises either. Compared to SPARCs and UltraSPARCs I knew from the old UltraLinux days it was quite simple and straightforward. I went to bed pretty satisfied, knowing I am ready to start porting the other day. The only obstacle remaining might be the speed of the Z80 emulation, but I was quite positive about that. Although written in C, the Z80 emulator code was pretty well structured to transform smoothly, so the compiler should have no problem translating it for an orthogonal architecture like ARM in a nearly optimal way.

And there was light...

Getting up early on Saturday, I started programming as soon as and anytime my babysitting duties allowed. I first set up basic dual ARM processor project, filling it with my SDL quick hack. Then I started refactoring it into something which I could compile for both SDL and NDS at the same time, and slowly filled the NDS stubs with actual implementations. Framebuffer, key input, file I/O, one at a time, NDS features were added throughout the day. I tried things in desmume first, only occasionally testing them on the real hardware (messing with micro SD back and forth just to find out it doesn't work would be a real pain). All went quite smoothly, but when I got to the point I needed to load the snapshot, I hit a little problem regarding desmume.

For some reason, it seems that desmume can't read the files properly. It is really weird, as the desmume scans the directories just fine, and opens the file as well, but reads nothing. I guess I should talk to desmume people about that. I tried ideas and dualis as well, but one of them crashed and the other doesn't support I/O at all, so I sticked with desmume in the end. Given this fact, the SDL reference turned out to be priceless, as it allowed me to test the platform independent features before I had to try them on the real hardware.

Nevertheless, I slowly proceeded and in the evening I had a fairly basic 48k emulator running my favorite game. The speed was a tad bit slower than the original, but I knew that there was a huge space for improvement in the screen displaying routine, which was just a bunch of naive nested loops processing one bit at a time.

The next step then was to rewrite this routine to make it more efficient, adding color support at the same time. This is when things started to go wrong. I switched from 16bit framebuffer mode to 8bit palette mode, and just modified the inner loop to use the new mode. It didn't work. Well, it did in desmume but not on the real hardware. I wasted a lot of time finding out what is going on. I made sure I have set up the extended rotoscale affine transform and everything properly about a hundred times. It was getting late and I still didn't have the palette mode working. It took me about a dozen tests, each of which I had to copy to the real hardware first, before I tracked down the cause - even if the format is 8bit, the VRAM bus is 16bit and must be always written to as either 16 or 32 bits at a time. Sigh. One has to wonder why they don't mention this in any of those basic homebrew tutorials.

I was pretty upset that I wasted so much time due to such a basic mistake. When dealing with memory mapped regions, it is quite common they need to be accessed this way, and I simply should have realized this way earlier. I could have had the color screen already working instead. But on the other hand, it was not such a bad day after all. I had nothing on the DS in the morning, and in the evening I had at least a black and white emulator running my favorite game at just about the right speed. Guess one should be just happy with results like that.

The spectral rainbow

On Sunday, we had relatives coming for a visit, so I didn't really have much time. In the morning, I have finally written the proper code for color screen display, processing two nibbles at a time. Compared to the decoding routines I wrote for the Amiga ages ago, this was indeed a piece of cake. Although the color decoding made it more complex as a whole, the result was way faster than the primitive routine I used before. The Head over Heels became a little too fast to be conveniently playable, so I added toggle for enabling and disabling synchronization with vertical line retrace. When enabled, it slows the emulator down to 60 frames per seconds, which is still 20% faster than the original was, but it is not really that much noticeable. There are still a trick or two I might use to speed the conversion further, but I guess this will do for now. On the other hand, giving a little speed burst to Driller or Dark Side would never hurt, would it?

Having achieved this nice speed up, I knew that I can add the paging support without having to worry about the speed too much. However, it had to wait until the evening. With the visitors gone, I had finally tweaked the Z80 emulator memory access routines to support paging. As expected, the change of speed was not noticeable at all. Both ARM processor and the compiler seem to do a good job here. I want to eventually check out the generated code to see how much room is there for improvement, but given these results this doesn't bother me much.

Unfortunately, it was getting late and I had a day at work ahead of me, so with everything necessary already in place, I left the sweet cream of bringing the paging to life for the other day.

Page in, page out

After coming back from work on Monday, I turned on my computer as soon as the kids were in the bed. I had just a few hours of time to work ahead of me, so I focused mainly on finishing the 128k emulation. I changed the Z80 startup code to support both 48k and 128k mode, adding more memory and using the appropriate ROM as needed. Then I moved on to the last missing piece, the I/O support. I added hooks for both the paging as well as the AY ports, and while I was at it, I implemented the kempston joystick support and also added the missing bits to the ULA port (sans MIC, that is). I worked mostly with my SDL framework, as these changes didn't require much platform specific stuff I would need to test, but once the 128k title screen came up, I couldn't resist and I prepared a build which I ran on the hardware. It was really nice to see that initial 128k requester coming to life on my DS.

However, it made me realize that now I can no longer do without keyboard support on my DS. I had no time for preparing any graphics left, though, so I just put in a simple code which divides the screen in uniform 4x10 matrix and maps it to the keyboard bits accordingly. I also added support for resetting the emulator and toggling the 48k/128k mode at the same time. I would have loved to extend the snapshot loading code to support 128k snapshots as well, but I knew it would take more time than I had left and it was already quite late so I ordered myself to go to the bed. Not that I could fall asleep anyway, excited as I was.

Pick me, pick me!

There were two goals for Tuesday evening: file requester and 128k snapshot support. I have chosen to start with the first one, as it is more boring and uninteresting of the two and I wanted something pleasant to look forward to.

First I had to setup the bottom screen so I could actually display something on it. When deciding on what mode I would use I briefly considered using some graphics for the keyboard layout, too, but discarded the idea as I didn't want to spend more time than necessary on this. The bottom screen is something which will need proper UI design and major overhaul anyway and whatever I'll do now will be just thrown down the toilet, so in the end I decided that just the basic (but convenient) libnds console will do for now. Setting that up was a matter of minutes, as expected.

Now I had to refactor the code once again so I could implement the requester in both my SDL and NDS frameworks. This was slowing me down a bit, as I had to implement the same thing twice, basically. When implementing the SDL part, I couldn't help myself but wonder why something like reqtools.library is not included in today's OSes by default. Nevertheless, tedious and unimaginative as this work is, I proceeded slowly but steadily.

I wanted to avoid any path concatenation, so the idea was to scan the current directory for files and directories, let the user choose whatever he wants, and if it was a directory, chdir() there and rinse and repeat. It worked like a charm in SDL, however, the desmume had problems to switch to some directories, for whatever reason. I hoped it will work fine on real hardware as it always did before, but alas - this time I was wrong! Once I have tried it on the real hardware, it turned out that the diropen() can't cope with relative paths at all, at least on my M3 Simply. Damn.

Having the choices of either rewriting the whole thing once over again, or hacking up something useful from what I already had, I chose the latter for now. I was annoyed, it was already getting late and I still have had the 128k snapshot loading to do. I simply excluded directories from the list and restricted the whole thing to root directory only. Far from ideal, but it will do the job for now.

Bigger! Better?

Getting the file requester out of the way, I eagerly moved on to update the snapshot loading routine. The SNA snapshot fileformat specs are simple, however one has to wonder why they have chosen to build the 128k version on top of the 48k one. The 48k part of 128k snapshot can't be used by emulators who don't understand the 128k format anyway, so why bother? It seems it only complicates the bank layout and the possible inclusion of one page twice is particularly laughable. Nevertheless, I have added the bank filling code in no time, and was ready to test it out.

I gave it a quick ride with 128k snapshots I created from some of the games I had around in TAP format. It worked fine, however none of the games used the banks extensively. I decided to try Chase H.Q., which was the first game which came up on my mind which I knew loads levels in the 128k banks. I tried it and it didn't work. The whole thing seemed frozen in the menu screen. Quick check of the Z80 code executed revealed the following:

LOOP LD  A,0
     AND A
     JR  Z,LOOP

Seeing this it was clear that for some reason the interrupts were not enabled when they should. I double checked the documentation and my implementation, and realized that they meant that bit 2, not 2nd bit, encodes the IFF2 state. Given the format origins, it's sort of obvious, bit 2 is the P/V bit of the F register, where the IFF2 state is stored after LD A,I/R instruction. After fixing this embarrassing bug, the Chase H.Q. snapshot worked fine.

The shadow of the past

With the file selector now working, I have experimented with a few more snapshots. As I expected, my own Star Dragon hung reading port #2BFF, waiting for ULA to start rasterizing the screen. With the T counter available in my I/O routines it will be trivial to implement this properly, but as I had little time at my hands, I just put in a quick hack to alternate between 0xFF and 0x00 every time an unmapped port is read from to work around this for now.

Now Star Dragon worked fine, but it proved that games which take the ray position into account won't do without at least per-line screen sampling. Such games usually take two frames (PAL people would call that fields, but they are not distinguished as far as Spectrum is concerned) to update the screen once, first waiting for the ray to start drawing the first frame, drawing the screen background from top to bottom, than use the vertical blank to throw in the sprites as well, and spending the rest of second frame for game logic and sound generation. Without proper screen sampling, such games are very likely bound to flicker. Looks like I'll change the priority of this feature on my list, then.

Another thing which the Star Dragon snapshot reminded me was how much the snapshots suck in general. In the good old days we despised any snapshot version of a game, always looking for pristine originals. And for a reason, too. The snapshot version usually meant missing loading screen, ripped 128k music, bloated code blocks, not seldom missing level data, and sometimes even broken game itself. And all this applies more or less to today's snapshots as well. The fact that the 48k snapshot modifies the two bytes on the stack can have disastrous consequences. The reason is that popping values from the stack was one of the fastest ways of loading 16bit value to registers on Z80. Therefore it was quite common to save the real stack somewhere, point it to some data and start fetching them using the POP instructions. For example, the sequence

POP DE
LD  A,(HL)
AND D
OR  E
LD  (HL),A

was common way of putting masked sprites on the screen. Or in Star Dragon, I used something similar to the following "loop" to render the stars on the star field quickly, mixing both data and return addresses on the stack, without having to test the ending condition:

POP HL
LD  (HL),A
RET

Even without tricks like these, it is quite simple to screw up the game using a snapshot. In my particular case, the modified stack values obviously messed up the initial decompression, resulting in scrambled font.

So the lesson learned here is that the snapshots, 48k SNA ones in particular, really suck. The 128k SNA snapshots are slightly better, but as they don't contain info about the AY registers at all, they might easily screw up the AY music. As a result, I decided I won't accept any bug reports regarding game crashes until the TAP/TZX format support is added. Games are usually archived as tape files today and it is just too easy to mess up a snapshot. With so little time I have, I can't afford to waste it hunting bugs which don't exist.

Doubtly advanced arithmetics

I haven't had much time for ZXDS recently, but when I tried Tetris 2 once for a while, I soon noticed that after finishing the round one, the round which followed was not two as expected, but seven. Oh my, I thought, this means that I either set the half carry flag incorrectly somewhere, or I managed to mess up the DAA instruction. And as I was sure that Franta didn't use any fancy trick in this case, I must have messed it up really badly.

The dreaded DAA instruction. I am sure that if you ask any of the Z80 emulator developers which instruction he hates most, he will most likely say it is the DAA instruction (or maybe BIT n,(HL), depending on how advanced his emulator is). The DAA instruction is used for performing arithmetics on BCD numbers, which are simply decimal digits, each stored in one byte nibble. It is the only instruction which needs the flags N and HC, and this is perhaps where its infamous emulation history originates. Long time ago, when men were brave and computers weak, emulators usually neglected emulating these flags entirely, opting for decent speed instead. This really made the DAA instruction impossible to emulate properly. The emulators then had to resort to various tricks, like checking what instructions preceded the DAA instruction, in a weak attempt to make it work at least to some extent.

It's interesting though that nowadays, when the speed is not of such concern anymore, the DAA instruction is still causing troubles to many. You can find lot of different implementations floating around, many of which are bizarrely complex, some which are simple but incorrect, and, finally, those which are bizarrely complex and incorrect at the same time. Many implementations opted to use lookup tables to make sure they get it right. If you have seen some of the complex tables which attempt to explain how DAA works, you might understand why. But this is all really unnecessary.

The truth is that the DAA is in fact fairly simple to understand and emulate. All it takes is just to understand the theory behind how the BCD arithmetics works. The sole purpose of the DAA instruction is to handle the overflow/underflow of each decimal nibble after normal non-BCD arithmetics operation was performed. The nibble overflow/underflow occurred when the carry flag for that nibble was set (for example, after 9+9 or 0-9), or the value of the nibble itself is out of the range (that is, greater than 9, for example, after 9+1). In such case the nibble must be adjusted by 6 to get it right once again. Beware, though, that in case of addition, this correction itself may cause a carry overflow. So a nibble needs to be adjusted not only if it is already out of range, but also if it equals 9 and lower nibble is out of range. Or, simply put, both nibbles combined are greater than 0x99.

Putting this all together, the high nibble is corrected if the carry flag is set or A is greater than 0x99, and the lower nibble is corrected if half carry flag is set or lower nibble of A is greater than 9. The correction factor of both nibbles is simply assembled in a temporary place and then subtracted or added to A as a whole, depending on whether N flag is set or not, respectively. The flags are affected by the result of the outcome as usual, with the exception of carry flag, which is of course set whenever the higher nibble was corrected (this is obvious, as "was corrected" means that the decimal nibble overflowed/underflowed in the first place, and the purpose of the carry flag is to make this information available for addition/subtraction of higher order bytes if necessary). Given the simplicity of this implementation, it's no surprise that the DAA works this way even for input values which are impossible to achieve via normal BCD arithmetics, no matter what some people speculate.

Knowing all this, I couldn't believe I didn't get even the basic DAA functionality right. I checked my implementation, and the relevant line read

if(IFHC||(A&0x0f)>9) a=0x06;

as it should. I checked that I set the half carry flag correctly in both ADD and INC instructions, one of which Franta likely used prior DAA in Tetris 2 (he used the latter, in fact). As a last resort, I finally looked at the half carry macros:

    byte HC ;
#define IFHC    (H&0x10)
#define SETHC   HC=(byte)~0;
#define CLRHC   HC=0

When I finally spotted the typo, I couldn't help myself but laugh.

On the bright side, this convinced me to run the ZEXDOC and ZEXALL tests afterwards to make sure I haven't made other stupid typos like this. I really appreciated my SDL framework in this case, as it allowed the tests to take mere minutes instead of hours. The ZEXDOC passed (or better say gave the same results as real Spectrum does) right away, and ZEXALL failed only due emulation of flags 5 and 3 after BIT instructions, which I knew was missing for a reason. The fact is that the BIT n,(HL) instruction sets these flags depending on the state of Z80's special internal 16bit register known as MEMPTR, and little was known about this register when I wrote my Z80 emulator years back. It looks like last year some smart Russian fellows finally discovered how this register works thanks to CPI and CPD instructions, so implementing it exactly is now possible, but I doubt it is really worth the hassle. I at least implemented proper setting of these flags for all other BIT instructions. While messing with flags, I also improved the IN/OUT[ID](R) instructions, adding behavior which was undocumented until recently as well. So as far as the Z80 core is concerned, the ZXDS should do pretty well now.

R Type loading error

Over the past few evenings I was adding support for the tape files. I was considering whether I should do the tape files or sound first, and in the end I have chosen to start with the tape files for two reasons. First, I was curious if there will be any problems with the TZX file format, and second, I knew that once I have added the sound, I would just keep listening to my favorite tunes and I would have trouble forcing myself to do anything else. So tape files was the choice.

First evening I spent just a little time refactoring the code to support multiple file formats. This basically meant replacing the snapshot loading logic with a file format dispatcher and adding stubs for the formats I intend the support. I also added the persistent file buffer where the loaded file could be kept in its entirety. I was first considering streaming the tape files, but the relative addressing used in TZX format has convinced me it will be much easier if I will have it loaded all at once. Finally, I extended the port I/O routine with hook for sampling the tape, testing it with a stub that just generated the loader pilot tone. With the lack of sound, I just quickly mapped the border color directly to background color of the touchscreen to make sure I got that right. I intended to use this quick hack just for this test, but I sort of liked it, so I decided to leave it enabled by default until I come up with something better (if at all).

The other evening I added the support for the TAP files. Instead of trying to implement any fast loading, I opted for implementing the normal tape sampling first. I might have chosen otherwise if I was adding TAP files only, but as I was aiming for TZX support as well, I wanted to make sure I get the tape sampling right first. The TAP file format is trivial, so the only interesting part here was to design the state machine for the pulse generator which I could use for both TAP and TZX files. I briefly overviewed the TZX file format specs to get an idea what I will need. It was immediately clear that TZX will make it a bit more tricky to get the pulse edges right on the block boundaries, but I ignored that for now as I was focusing on the common denominator of the two formats. Fortunately the basic data block is pretty same in both cases, so in short time I had a simple pulse state machine up and running, and once I got the final pause edge right, I was able to load my first TAP file. I then tried a few more and it all worked fine, so except the problems caused by lack of any tape start/stop mechanism, the TAP files were usable and I was ready to move on towards TZX the next day.

The TZX is much more complex format than TAP file is. It can store information about pulses in about six different ways. I decided to implement them all except the CSW and GDB, which seemed just too complex and seldom used to be worth it for now. I also decided to implement any flow control blocks later, as none of the TZX files I looked at used them. I extended the pulse generator state machine with three new states which allowed me to generate pulses for all the TZX blocks. I also added three new lambda states which allowed me to adjust the pulse level as needed, so I could force or suppress edge at the beginning or end of a block as necessary. The TZX specs is very specific about this and it varies greatly with each block type, and it would be a pain to make this part of the pulse generating states themselves, as those are always better implemented in a uniform way. Once it was all done, I just tested that I didn't break the TAP file support and left the rest for the other day.

The fourth evening I finally added the support for the TZX files themselves. It basically meant to fill in the block processing stubs I have prepared before with code which handles each block type appropriately. I just had to be careful to make sure I get the edges at the block boundaries right, programming the pulse generator state sequence accordingly. I was pleased that I got it all right for the first time, and I was soon able to load any of the Speedlock and Alcatraz protected games I tried. It was about ready for another alpha release, but first I wanted to implement some logic which would start and stop the tape automatically as needed, to make the tapes of games with additional levels usable as well.

So the fifth evening I just polished the tape handling API, adding internal interface for tape deck control. Then I experimented with the heuristics which would allow starting the tape automatically when needed. Stopping is not such a problem, as TZX files contain explicit instructions where to stop, and in TAP files I simply decided to stop after each block. Finding the heuristics appeared a bit more tricky than I expected. Once again I appreciated my SDL testing framework, which allowed me to load the tape files in no time even without any loading acceleration techniques in use and to trace the Z80 emulator more conveniently than the NDS build ever could. I hoped I could use the state of the MIC output as part of the heuristics, as the MIC line should be off while loading, but it soon proved that some loaders do not follow this recommendation. Eventually I ended up with something which worked reasonably well on the TAP and TZX files I have tested. It is certainly not perfect, as it might be fooled by some tight key testing loop and start the tape when it should not, but I plan to add manual tape deck control one day anyway so it should not be a big deal.

Weaving the waves

With tape files working reasonably well, I finally shifted my focus towards the sound emulation. On Tuesday evening, I didn't feel like I wanted to do any programming, so I just lazily browsed the NDS documentation on this topic. The NDS sound hardware is really neat. It reminded me the Amiga hardware a lot - the latches, possibility to start all channels at once in perfect sync, it was all very refreshing after all those braindead sound APIs one has to deal with on other platforms. Even the sound bias is there to avoid abrupt sound clicks. It made me wonder how much the Factor 5 people actually affected the Nintendo's hardware design. The sound channels can't generate the IRQ on their own (which makes sense, having 16 of them), but thanks to sharing the same base clock speed they can be easily coupled with timers to work around that. In fact, such setup is even better, as it allows generating IRQ at arbitrary points in the sample, not just at the end. All in all, amazing piece of hardware.

Excited about the sound hardware features, I just tried a few quick tests in desmume and on real hardware. It seemed the desmume dealed with the sound fine, which was good, as I hoped I will be able to use it for testing as much as possible. The SDL framework won't be of much use in this case, as it can't test the NDS specific stuff I will need. Satisfied I went to sleep, looking forward to the promising evening of the next day. My wife was supposed to visit her friends, so I hoped to use the entire evening for intense programming.

On Wednesday evening, I started the actual work. The idea was to send any audio related I/O together with the timing information to ARM7 and do all the sound mixing there. This would offload all the audio work from ARM9, resulting in virtually no slowdown on the emulating side. I intended to use the NDS IPC FIFO for this, as it seemed to be perfectly suited for this type of communication. I started writing the ARM7 main loop and the FIFO processing, but I soon ran into annoying problems. The BIOS SWI call I wanted to use while waiting for FIFO data turned out to be buggy, so it seems impossible to use without the chance of missing the IRQ. And some of the documentation I had available got the FIFO empty/full bits wrong, so I had to test myself what was right and what wrong. Furthermore, it turned out that the FIFO bits don't work properly in desmume either, so I had to try everything on the real hardware (about time I get running apps over WiFi working, shuffling with the micro SD card back and forth is really starting to get me mad). All these silly issues combined slowed me down a lot, so when I finally got it working, it was quite late already and I was not really in the mood of starting to program the streaming sound system. In fact, I was pretty much upset and frustrated that I have wasted so much time on something I expected to have finished in no time. I just felt to bed and hoped the next day will be better.

On Thursday evening, I quickly wrote the streaming sound system which would suite my needs perfectly. With the NDS sound features it was a piece of cake. Four looped sound channels playing infinitely, coupled with timer IRQ handler for tracking active buffer and detecting buffer underrun. Really simple and neat code. The only problem was it didn't work. I wrote simple wave generator which was supposed to test it, and when I tried it in desmume I got scratchy noise instead of the expected tone. I tried it on a real hardware, but got similar noise, only somewhat faster, so I concluded that it will be fine if I get it working in desmume first (this was a cardinal mistake, as it turned out later). The funny thing was that the sound code was only about 80 lines of code, so it was really hard to do something wrong there. Of course the immediate suspects were the timers' frequencies, especially considering no documentation was really sure whether the ARM7 timers run at 33MHz or 16MHz. I tried both halving and doubling the frequencies, but to no avail. The next thing in the row were possible off-by-one errors when setting the timers. I tried all possible combinations which might make sense I could think of, but no luck either. I started to suspect the wave generator itself. I changed it many times, playing with both periods of the wave as well as the volume, but the generated sound didn't really come close to what it should be. The drop outs were still there, and the volume changed only occasionally. However I looked at it, it was all really weird. I only occasionally tried it on the real hardware, but it never worked either.

Finally, I started messing with the sound buffer counts and lengths themselves. When I have made the whole loop long enough, it became audible that the loop itself might be fine and there is just a drop out at the end. I modified the wave generator to produce slowly decaying volume envelope spanning several seconds and to stop the sound after four such waves so I could diagnose this better. I was shocked when I tried it in desmume. The sound didn't stop at all, and the desmume kept repeating just the leading part of the generated wave. It meant the sound support in desmume was not as great as I have thought and that it played tricks with me all night long. I cursed myself for being so stupid and quickly tried it on the real hardware. The drop outs at end of the loop were there, but at least the volume was decreasing as it should and the sound stopped as expected. It was already late, so I decided to postpone further tests until the other day. I pretty much wasted the evening, but this time I was only little upset because of my own stupid trust in the emulator. The sound streaming can be tricky to get right, so I could tolerate spending two evenings on this without being ashamed too much.

On Friday, I asked our sound guru at work for a peer review of the sound system code I have written. We concluded that it really should work as designed and the only cause of problems might be an incorrect hardware setup. When we sampled the wave (I wonder why I haven't thought of it the evening before), it turned out that it looks like if it missed every other slice worth the sample size I used. Now it was easy to deduce that the timer was really running twice as fast than it should so the wave generator was getting ahead of the sound DMA every time the sample looped. Argh... When I came home, I quickly halved the timer frequency and tried it on the real hardware. I was rewarded with a clear, slowly decaying tone, free of even the slightest sound click. Silly, if I have tried everything on the real hardware in the first place, I could have had it working the night before. Still it was nice to have the sound streaming working, as I expected it to be the most tricky part of the entire sound emulation. Now I could happily look forward to plugging in the speaker and AY emulators. Pity it was the Easter weekend ahead, which meant we were leaving to the country and it had to wait until Monday evening.

The 8bit orchestra

The rest was fairly simple. On Monday evening I took my old DeliAY sources and adapted them according to my current needs. This mostly meant changing the API so the emulators could use the buffers I provided externally instead of mixing into their own buffers. I also got rid of the stuff which converted T cycles to samples, as it was now done by the main loop for both AY and speaker at the same time. By end of the evening I had it all compile fine with the new API in place, ready for the next day.

On Tuesday evening I changed the ARM7 main loop to actually feed the AY and speaker emulators with the data received from ARM9. The AY tunes played fine right away, but the speaker tunes suffered from buffer underruns every now and then. It turned out that with the huge amounts of I/O data sent in such case the SWI wait misses the IRQ and stalls the FIFO just way too often, and the main loop itself becomes the bottle neck as well. Still it was nice to see that even without any optimizations the ARM7 was able to handle the sound emulation at about the right speed. Four channels sampled at 32kHz yields about 250 CPU cycles per sample, which is not exactly too many cycles to spare.

It was delightful to listen to some of my favorite tunes of the past. And Tetris 2 (in Tetris 2 mode, of course) was now a real blast with the 128k music. The 48k tunes I tried sounded fine as well, although the speaker could use some oversampling indeed, but overall I was quite satisfied with the results. I'll try to implement the oversampling once I am done with the obvious optimizations I planned for the next day.

Divide and conquer

On Wednesday, I changed few things to get rid of the obvious cycle eaters - division and modulo operators. The ARM7 has no instruction to compute these efficiently, so each such operation takes quite a lot of cycles. Fortunately I didn't really need most of them. I used them to make sure the AY emulator doesn't depend on the sampling frequency in DeliAY, but I could easily replace them with a simple loop as well. There was just one division left, the one converting Z80 cycles to samples in the main loop. I have decided to move it to ARM9 instead, which is faster, has cache, and, most importantly in this case, can use the hardware divider. The added benefit was that the ARM7 now didn't have to know about which frequency the Z80 CPU uses at all, it is just told how many samples to generate instead.

Regarding the SWI wait which occasionally stalled the FIFO, I have decided to remove it entirely. When ARM7 is fed with I/O data frequently enough, it spends most time waiting for empty sound buffer anyway, so the SWI wait for FIFO wouldn't help battery wise too much anyway.

With these changes, the sound emulation was fast enough to deal with any 48k tune I have tried without single buffer underrun. This meant the release was near, however there was one last thing which I wanted to do before the release - improve the video/audio synchronization. Before the sound was added, the Z80 emulator could either run as fast as possible, or to synchronize with vertical retrace at 60Hz. Once the sound was added, it was necessary to either synchronize everything to audio, or synchronize everything to real time in case the other modes were to be used.

So I have spent Thursday evening adding proper support for three synchronization modes - no synchronization which runs at full speed and audio is driven by the real time passed, audio synchronization when audio is generated according to the Z80 timing and the Z80 emulator is stalled as necessary waiting for space in the FIFO, and finally the video synchronization, when Z80 emulator waits for vertical retrace after every frame and audio is driven by real time. To measure the real time, I have setup the ARM9 timers to count the time passed in samples directly, so I could easily pass that to ARM7 as needed. With this in place, the no synchronization worked fine right away.

For proper audio synchronization, I just needed to make sure I keep the FIFO fairly full at all times, regardless whether the Z80 generates sound or not. Otherwise the Z80 emulator could run several frames ahead when there was no sound output and then stall whenever some sound started, causing nasty visual hiccups in the shoot-em-ups I have tried. Keeping FIFO full was simply accomplished by splitting the single frame emulation to several time slices, and pinging ARM7 between them. The result was quite satisfying, in many cases an untrained eye could not perhaps even recognize it was not synchronized to video frames.

As for video synchronization, it turned out to be pretty problematic. When enabled it worked fine for a while, but eventually waiting for the vertical retrace caused too much time to pass before the ARM7 was fed with more data, and once it was fed, it spent to much time generating the samples, so if ARM9 tried to send too much data during that period, it was stalled for too long once again, and here we go into the feared evil feedback loop. However, while the speedup of 20% caused by running 50Hz emulation at 60Hz frame rate was not that much noticeable when there was no sound, with the sound enabled it was surprisingly annoying, so I have decided to disable this mode in the release in the end. After all, the audio synchronization is good enough, and I'll eventually try to convince the DS to display the frames at 50Hz by playing tricks with the VCOUNT register.

Nitro goes nitro

Once the sound worked fairly well, there were just two missing features left I wanted to do before concluding the initial development phase and moving on to the user interface: improve the tape file loading times and add support for save states. As usual, I started with the former, as it was clearly the more tricky of the two.

Right from the start, I have passed on the possibility to implement loading of the entire data block at once. Although this is fairly simple to do for the standard loader contained in ROM, it would be of relatively little use as most programs use their own loaders anyway. And I simply didn't want to waste my time by reverse engineering every single loader out there - life is just too short and precious to be wasted like that. Instead, I have chosen to start with the most essential low level part of every loader - the edge sampling routine.

The edge sampling routine of the standard ROM loader looks like this:

EDGE2   CALL EDGE1
        RET  NC
EDGE1   LD   A,22
DELAY   DEC  A
        JR   NZ,DELAY
        AND  A
SAMPLE  INC  B
        RET  Z
        LD   A,#7F
        IN   A,(#FE)
        RRA
        RET  NC
        XOR  C
        AND  #20
        JR   Z,SAMPLE
        LD   A,C
        CPL
        LD   C,A
        AND  #07
        OR   #08
        OUT  (#FE),A
        SCF
        RET

It starts with two entry points for sampling two or one pulse at a time, respectively, then there is a fixed delay when the tape is not sampled (to ignore the possible little oscillations), followed by the sampling loop itself. The B register is used for counting the iterations throughout the sampling loop, and the C register is used for holding the expected input level in bit 6 and current border color in bits 0-2. Once the input level changes, the loop stops, the C register is toggled accordingly to match the new level and border color is updated to produce the distinctive loading stripes. The loop may also end when space is pressed or the B counter overflows, indicating timeout - either of these outcomes is reported via status flags. And all this in just 34 bytes. Nice example of neat Z80 coding indeed.

Perhaps thanks to the fact that most programmers are lazy by nature, many loaders use either the very same or just slightly modified version of the above code. Obviously most people didn't want to do their own T cycle counting and simply used the proven sampling loop, tweaking just little bits of it. The most common changes include use of different or no color stripes, or ignoring the space keypress. In case of turbo loaders, the initial delay may be different or specified by the caller or missing entirely. Some people even bothered to use different registers instead of B and C. Still, the resemblance to the original code makes this routine quite easy to spot just by quickly browsing through the memory disassembly.

The first step to achieve the faster loading times was naturally to speed up the sampling loop itself. The emulator obviously knows how long the current pulse is going to last, so if it could gather enough information about the sampling loop, like how long it takes and what register is used as the counter, it could easily and instantly adjust the counter and tape position appropriately during the emulation of the IN instruction. This was what I focused on on Monday evening - experimenting with what were the options of finding out that necessary information about the loader. I started with simple pattern matching testing few bytes before and after the IN instruction. I taught the patterns to recognize anything what looked like the ROM code, and then tested few dozens of various other loaders to see what the hit ratio is. I was once again glad I had the SDL framework to work with. It soon turned out that I had to add a new pattern for about every second loader I tried, as the loaders, albeit all the same in the essence, differed in annoying little details. But on the bright side, they all really looked reasonably similar, so I was positive that I will be able to find some solution in the end.

Anyway, once the emulator was able to recognize reasonable amount of loaders, I quickly implemented the tape warping code itself, as I was curious about the speed gain due this change alone. As I expected, it was about 2 or 3 times faster, which was about as fast as any standard turbo loader when loading zeros. Such loaders spend most time in the initial delay loop and just few loops in the sampling loop, especially when loading zero bits, whose duration is usually half of the duration of one bits. This also meant that there was less gain in case of TZX files, many of which use turbo loaders already. Still, I didn't give up my hope as this was just the start.

The other day, I have consolidated the code a bit. I have rewritten the recognizing code to detect sequence of specific instructions ending with the JR Z,SAMPLE instruction instead of the inflexible pattern matching. The address of the loop start gave me the instruction which I could use for finding out what register is the counter, as well as if it is incremented or decremented. I also combined this recognizing code with the code I have written before which tested when the tape should start. It was already measuring the duration of the loop, so I didn't have to deduce it from the loader instructions themselves. The new code worked pretty well, recognizing much more loaders without any additional hassle.

Once I was done with that, I started to experiment with how I could get rid of the delay loop before the sampling loop, which now became the major bottleneck. I wrote a testing code which tried to guess the addresses of the two entry points, as the delay loop is often located right after them. I didn't intend to replace the edge sampling routine as a whole, but I hoped I might use that address to trap the delay loop and skip it entirely. The hit rate was not exactly exciting, and also it turned out that many of the loaders, especially those which display counters, actually have their delay loop located entirely out of their edge sampling routines. This however lead to an idea which I planned to explore tomorrow.

The next day, I modified the Z80 emulator to simply keep track of the last DEC A instruction emulated. This made this address readily available inside the trapped IN instruction. Now it was simple to test if that instruction was indeed part of the delay loop and trap it if it was. Once that instruction was executed the next time, everything was adjusted as if the loop has passed in no time. The results were pretty nice, both techniques combined decreased the loading times about ten times. Still not as fast as I would have liked, but already something I could get by with, especially considering the save states will be supported soon, too.

On Friday evening, I implemented one last acceleration technique. The loader usually keeps gathering the loaded bits in one register and processes them once the entire byte is loaded. So if it is known what register is used and the underlying tape file block allows this, it is possible to inject the first 7 bits into the register directly without the loader finding any difference. I have implemented this only for the ROM loader for now, as it would be a real shame if it was loading only ten times faster, however it could be fairly easily adapted to much broader range of loaders without the need to reimplement them all in their entirety. Although not as fast as loading the entire block instantly, this once again lead to huge speed gain. I also liked the fact that I could still see the loading screen and the loading progress, something what instant loading lacks.

Once this was done, all that remained was to polish any of the remaining rough edges. One loader used DJNZ based delay loop instead of the DEC A loop, so I have changed the Z80 emulator to natively detect such loop and skip it, while still counting the T cycles properly and honoring the IRQ line signal as necessary. This positively affected the ROM loader itself as well, as it uses such delay loop during the sensing of the pilot tone. Another thing was to make sure the loader traps are uninstalled at appropriate moments, to prevent some other code from being trapped by mistake. In particular I didn't like the trapping of the DEC A instruction too much, so instead of trapping it I changed the JR NZ instruction in a similar way to what I did to DJNZ to find out if it was part of the DEC A loop and skip it as well. This had an added benefit of speeding up all such delay loops, no matter where they were located. Now the only trapped instruction was the sampling IN itself, which I have decided to simply uninstall when another IN was executed. Not really foolproof, but fair enough.

So, with all the changes I have done in place, I was able to load 48k games in about 10 seconds on average and many 128k games in less than half a minute. Although not instant, that was something I could bear with. And knowing that once the save states are implemented, one will usually have to load the tape file only once, I was fairly satisfied with the outcome.

Total recall

Now the save states were the only feature missing from what I could finally dare to call a decent emulator core. Well, the save states and the support for Z80 snapshots, but for me that was basically the same thing, as I planned to base the save states on Z80 snapshots anyway.

So, on Saturday evening, I have implemented the loading of Z80 snapshots. Nothing too interesting to write about. The Z80 format is fairly simple, although like most snapshot formats, it suffers from what I would call an fwrite() syndrome. That is, it is obvious that the format started as a simple dump of the internal structures of the Z80 emulator, and not as a carefully designed flexible format not tailored towards single implementation. And similar to the SNA format, when the support for 128k snapshots was added, the author decided to build the new version on top of the old one, even if it was of little use as the format itself was concerned. Well, looks like it's common that many people designing file formats simply don't really know when to draw a thick line and start again with something which makes sense. Nevertheless, I implemented support for all three versions including decompression in short time with little problems. The only part which I left out for now was the use of user defined keys, which I will add once there is some support for this in the emulator itself, and loading of the T cycle counter, which I postponed until the other day, just to make sure it will match my saving implementation exactly.

On Sunday evening, I had a tight schedule ahead of me. I wanted to make the release on Monday, to contribute to Spectrum's 25th anniversary, and I still had to implement Z80 snapshot saving, build the support for save states on top of it and plug it all into the emulator, using only its currently very limited UI options. That was quite a few things to do and not so much time left, so I had to proceed efficiently, avoiding any mistakes.

The first thing was naturally to implement the saving of Z80 snapshots. I quickly wrote the code which created an uncompressed v3 Z80 snapshot in a memory buffer. All I had to do was to make sure it can access the state of the Z80 processor, the memory banks, and the state of all important I/O ports, including AY registers. I also had to add an extra code to the Z80 emulator which would allow bringing it to a state which could be serialized safely. The Z80 format has no support for cases when interrupts are enabled, but are temporarily suppressed, like after an EI instruction, so it was necessary to make sure the Z80 is not in a state like that.

The interesting part was making sure the T cycle counter is stored and restored properly. The Z80 format is particularly awkward in this respect. As if storing the counter as a simple 24 bit number would be too much hassle, the Z80 author once again decided to store it in the way which suited his implementation at that time, but made it inconvenient for anyone else. Well, never mind. I quickly wrote a code which converted my internal counter to and from the required form. To make sure I got it right, and in fact the rest as well, I decided to plug the save/restore pair directly into the main loop of the emulator. When invoked once per frame, it seemed to work fine, but when I tried to invoke it multiple times per frame, it turned out the interrupt never happened, pointing out a bug in the T cycle counter storing. Once I fixed that, everything worked as expected, so I went for an extra stress test. I was in a hurry, but I didn't really want to rush the release, so it was important I get everything right, and especially the save states themselves. I have tightened the loop, allowing it to execute just one T cycle at a time, effectively invoking the save/restore pair after every single instruction. I have praised my SDL framework once again, as it wouldn't be possible otherwise. It really brought the emulator to its knees, but I was really satisfied, seeing it was still working fine, albeit crawling as a snail.

The rest was fairly simple. I have added an array holding the buffers for the ten save slots I wanted to use, and quickly modified API between the input processing routine and the main loop so it could pass back the commands to save or load to given slot. It worked great in the SDL, so I decided to give it a try on the real hardware as well. I was a bit worried if I have not exceeded the memory limit yet, as there was now quite a few statically allocated buffers already. I modified the NDS input routine to generate the saving and loading commands as well. I hesitated for a while, though, deciding if I really should use the L key as the modifier for choosing between the save and load. I didn't want to loose the ability of using it as the Caps Shift, so I made it to work as the modifier only only when kempston was disabled, and retained the ability to use it as both the modifier and the Caps Shift when kempston was enabled. It was not really optimal, but the alternatives were even worse, so I decided it was the way to go, at least until the proper UI is added. I also had to modify the keyboard layout to make space for the slot "buttons". I must have had been really tired already, as it took me several tries to get the fairly simple equations mapping the touch pen position to respective keys or slot buttons right. Fortunately I could test it out in desmume, so I was soon ready to make a build for the real hardware.

When preparing the build, I have got an idea that I might finally get the proper application icon attached. I have played with this before, so I just quickly prepared proper bitmap and copied both builds to my SD cards. Interestingly enough, the build without the icon worked fine, including the save states, which was really pleasing, but the other build behaved as if I immediately canceled the file requester. Puzzled, I decided to look at it later, as I still had to make the save states persistent.

Fortunately, making the save states persistent was a piece of cake, thanks to the use of the libfat library. All I had to do was to write the memory buffer to the card after it was saved to memory, and to try to read it into the memory in case there was no valid save state in that slot yet. I decided to use fixed file names for now, as it was the simplest way. Ten slots is fairly enough and people could always use other means of renaming the files anyway. I quickly tested it on the real hardware, and I was really surprised how fast it worked. For some reason I was expecting the saves to take much longer, so it was a nice surprise indeed.

With all this done, I felt to bed fairly exhausted. I have implemented everything I planned for the first phase of the development, and I was ready for the release the other day, just in time for the Spectrum anniversary. After five weeks since the first release, I guess this was not a bad achievement after all.

The show must go on

Quite some time has passed since I was last working on ZXDS. And all that just because I got too much involved in discussing the TZX file format over at the World of Spectrum forum. It all started with few simple (even if somewhat heretic) questions about the need for the new GDB and CSW blocks and ended up with designing the new PZX format and then writing the accompanying toolset to go along with it. Sigh. I should know better than that, but it was just too hard to resist.

Writing the format was simple, I did that overnight, as I had pretty good idea what I wanted ever since I started ranting about the TZX. I then let the design cool down for about a month, to make sure I do not rush anything, occasionally improving few subtle details as time passed. I also spent one evening during that period adding the PZX support to ZXDS itself, just as a proof of concept. It was really nice to work once again according to a format specs of my liking, and it was indeed no surprise that everything went smoothly as expected.

Once I was sure the format is not going to change anymore, I started working on the tools themselves, which put quite a burden on my already ever-so-busy schedule. It took almost two month to complete, working on it few hours on evenings every now and then. Still, I was quite happy with the result, as the tools proved quite handy for experimenting with tape files and tape file loading support in general. In particular the tools for converting the PZX to and from text format turned out to be a brilliant idea for development purposes, as it is just so easy to process and generate the textual data with few lines of code. I even used them in ways I didn't intend originally, like creating WAV file from debugging output of my speaker routine.

And then my family and I left for a holiday, and voilà, four months since last ZXDS release have passed like water in the river. Now it was really about time to come up with another release. After not seeing the ZXDS code for such a long time I wanted to start up with something fairly simple, which was fine, as I was planning a little cleanup release or two before starting the work on proper GUI anyway. I intended to do the per-line screen sampling and the 50Hz VCOUNT stuff first, but I noticed that people most wanted an improved file requester, so I decided to go with that one instead.

For some reason, I always find coding of a directory scanning routine boring. To make it at least somewhat interesting in this case, I decided that I will make it work with no limit on the number of files in a directory. This way it became a little neat memory-bound sorting exercise to keep me entertained, and in the end it saved quite some chunk of memory for some other use as well. Now talk about benefits.

With this new directory scanning code in place, I decided it would be a shame not to change the displaying code to a proper full-screen display. Besides, I too was tired of seeing just one file at a time. So I did that, it was mostly a matter of making sure the standard console printing code doesn't scroll the screen when it should not. The controls otherwise remained the same. And while it would be really nice to have a touch screen controls, it will have to wait until I have the corresponding GUI infrastructure for that in place, as that would be a waste of time at this point otherwise.

The ways of contention

I was working quite intensively on ZXDS over the past few weeks. Well, that means one or two evening sessions a week, which is about as much as I can afford at this time. Still better than nothing, I guess.

There were just few little things I wanted to improve before finally moving on to the GUI: implement per-line screen sampling, fix reading from unmapped ports, and sync the LCD display to 50Hz. Well, obviously things don't always turn out as expected.

I started with per-line screen sampling. Some games like Green Beret or my own Star Dragon which take the ray position into account were flickering badly, so I wanted to improve this by sampling each line at about the same time the real machine does, not only an entire screen once per frame. To do that, I simply changed the main loop code to emulate the top border area first, then each line separately, then the bottom border. I copied the content of the corresponding line to a separate buffer after each line emulated, and then used this captured data instead of the current content of the screen memory I used previously for creating the screen image in the frame buffer.

The results were quite nice. The Green Beret was working fine, and Star Dragon was much better, however few sprites were still flickering near the bottom of the frame. This was undoubtedly because Star Dragon waits until the ray leaves the top border and enters the screen area by checking for screen data to appear on the bus by reading values from an unmapped port. This behavior is called floating bus among the initiates, and I haven't implemented it properly yet, apart from the simple quick hack to make the test pass at all. As a quick test I extended the relevant code to return 0xFF while in the top or bottom border area, and tried it again.

This time, even Star Dragon was fine, so I went on and tried few more games to see how it works in general. Even with the increased amount of data transfered the overall speed still seemed fast enough to reach the 100% mark, which was nice, as I was a bit afraid that this might become a problem. However, the more games I have tried, the more disappointed I was. The Uridium title seemed quite fine, but the hall of fame was not as good as I have hoped for. Of course, with no contention support in place I didn't expect multicolor effects to work correctly, but I thought that at least some of the simple line based ones might look acceptable. Well, they didn't.

But the worst came when I have tried Arkanoid and my own Qang. I have noticed before that these two games ran twice as fast than they should at some points, but I have attributed that to the missing floating bus emulation. Without the top border delay, these games would have about 25% more cycles per frame to do their work, so it was easily possible that what was intended to take somewhat more than one frame managed to get done in one frame, resulting in the observed speedup. But now when I have added the top border delay, regardless of the fact that the rest of the floating bus behavior was not properly emulated, it could mean only one thing. The games ran too fast simply because the code was not contended as it would be on a real machine. This was pretty bitter discovery indeed.

Disgruntled, I have spent the next evening adding the proper floating bus emulation, returning the correct screen data at the appropriate times. This seemed quite pointless given the circumstances, but it was mostly to get myself some time before making the difficult decision I had to make. I either had to accept the fact that the emulation will remain imperfect up to the point of some games being unplayable, or I had to step forth and try to implement the contention in its entire complexity. Sigh. Neither choice was exactly appealing if you ask me.

Memory and I/O contention. The holy grail of Spectrum emulation. The catch is that while the ULA is reading the data from screen memory, it can temporarily halt the CPU if it wants to access the same memory or do the I/O at the same time, and emulating this is neither trivial nor cheap. It took many years before the first emulator achieved such degree of accuracy and even since then it remained the domain of but a few PC emulators. The sole idea of being able to achieve that on a 66MHz handheld seemed pretty much off point. The Spectrum CPU runs at 3.5MHz, which gives 19 CPU cycles per 1 cycle emulated, or 76 cycles per 4 cycle instruction. That doesn't look like much, considering the amount of work which needs to be done for each instruction, and that's not taking into account that there is still some screen decoding and audio processing which needs to get done as well.

On the other hand, the idea that I might be able to pull it off was thrilling indeed. Of course, if I was about to give it a try, I wouldn't be satisfied with per-line screen sampling anymore, I would go for T state accurate sampling straight away. This means the screen bytes are sampled at exactly the same T states the ULA on a real machine reads them from the memory before displaying them. This of course adds even more to the contention complexity, but it just wouldn't make sense to have one without the other. Besides, there is nothing like rising to a challenge, is it?

Do you like it multicolor?

The chances of success were pretty slim, but I was sure that once I started working on the GUI, I would not dare to cut this deep into the emulation core ever again. So it was more like now or never type of question. Of course, the fact that it would take lot of work before I could hope for a first test to see if it can be done or not didn't help matters much either. In case it would not turn out well, I would have wasted lot of time, not to mention the devastating impact it would certainly have on my morale, but the glimpse of the prospects in case it worked on the other hand... Well, I just had to give it a go. Besides, I knew that I would never forgive myself for not trying.

Once I have made my mind, there was a lot of work to do. This was a pivotal change indeed. With the new goal set, I had to make sure that everything is accurate to the last T cycle spent, otherwise the whole work would be in vain. I started gathering all available information I could find on the net, and slowly prepared the Z80 core for the upcoming changes at the same time. I already knew how all these things work, but now I needed to get the actual timing info. It was important for me to get it all right for the first time, as I was sure I would not have enough strength to rewrite this over and over again, so I wanted to make sure I do my initial research well and do not forget to account for anything.

Regarding the Z80 core, the first change was obvious. Instead of adjusting the T counter once per instruction, I had to rewrite this in terms of M1, M2, M3, I/O, and extra wait cycles as per the Z80 data sheets, as each of these cycles may become contended depending on what value it places on the bus. I spent one evening adding the appropriate macros to each of the emulated instructions, and the next evening I changed these macros to actually do the contention, based on a naive but straightforward contention delay routine. I wanted to have some simple reference implementation first so I could use it for verifying that I got all the contention timings right, and I didn't care about speed yet as I was now using my SDL implementation exclusively, which is fast enough even if I hook in dumb pieces of code like this.

As for the accurate timing info, this turned to be quite a problem though. I have found several sources of information which listed when exactly does the contention start, how long it lasts, when is the screen sampled and when do the sampled data appear on the bus. However the problem was that the timing info differed among the sources, and more strangely, some pieces of information seemed to contradict the others. For example, it said the first pixel was to be displayed at 14336 T cycles after interrupt, however the related screen data were supposed to appear on the bus 2 T cycles later, which obviously doesn't make sense at all. Or the contention was supposed to start only 1 T cycle before that, but that would not give ULA enough time to fetch both the pixel and attribute data it needs to display the output.

The more info I sought, the more unanswered questions I had, and in the end I just had to start a thread at World of Spectrum forums which would answer them all once and for all. It took more than two weeks of serious head scratching, writing various Z80 tests and staring at the Z80 data sheets, but in the end it was worth it. Thanks to the help of few smart fellows we were eventually able to really find the answers for all our questions. The revelation came to me at the point when I have realized that not all the T states when the ULA is reading the screen memory have to be contended, as the ULA and CPU are competing for the data bus, not the address bus, and that the ULA halts the CPU whenever the MREQ line is high, not when it is about to go low. Since then, it all become crystal clear.

Now I could finally fill in the gaps by using the accurate timing values. The contention itself was working more or less correctly even before, and adjusting it to whatever offset was trivial, but in order to implement the accurate screen sampling, I needed to know how exactly does the ULA interact with the machine cycles involved, M3 in particular. The I/O cycle was important too, as it is possible to switch the entire screen at once by using port #7FFD, so I had to account for that as well. Equipped with the necessary knowledge, I eagerly started moving forward on Monday evening.

I have duplicated the main Z80 emulator loop, creating two identical copies. The idea was that it was necessary to emulate the contention only during the screen area, so why waste precious cycles while emulating the border area. This way I had to deal with the contention only for 60% of the time, which I hoped might give me the edge I needed to make it work at all. For the last time I have checked that the contention timings were correct by using all the tests I have collected during the past two weeks, and then I systematically started to change each loop towards its final appearance.

The loop for the border area was simple, this just meant replacing the T counter macros with variants which simply adjusted the counter, doing no contention at all. For the screen area loop, I have removed the interrupt handling code and changed the outer loop to iterate over all 192 lines. Now I needed to combine it with the screen sampling loop. Of course, sampling the screen between the instructions would not do, so I had to incorporate it directly into each instruction which might modify the screen. To do that, I started from the innermost loop which would copy the appropriate amount of screen bytes depending on the current value of the T counter, and put it directly inside of the macro responsible for timing of the M3 cycle. Then I made sure that this innermost loop gets all the pointers it needs for it to work set up properly. At first I thought I will have to do some corrections because of the left and right border area between the lines, but to my surprise the whole code turned out to be pretty simple in the end.

I quickly tested the result with Shock megademo, and it worked perfectly. Lines in Echology were a little off at the right end of the screen, though, but once I have remembered I forgot to include the screen sampling loop in the I/O cycle and fixed that, it worked as well. I tried few more demos and games and everything worked as it should, too. That was great news indeed.

Speedwise, the emulating loop must have certainly slowed a lot, but on the other hand the amount of the data transferred in comparison to per-line sampling remained the same, so the net effect should be about the same as well. Now all that I needed to do was to replace the contention macros with a table based variants which would have a chance to run fast enough so I could finally give it a try on the DS. But I left that for the next day, as it was quite enough of work and excitement for one evening already.

The next evening I quickly wrote the code to initialize the contention tables I needed and adjusted the timing macros to use them directly. I used the appropriate section macros to put the tables directly to DTCM, partly to make the access as fast as possible, but more importantly, to make sure that accessing the tables doesn't flush any other data from the cache. The access to main memory is quite slow, so writing the code in a cache friendly way seems critical, and I was sure I will need that if I was to succeed.

Finally, after several weeks of work, the point when I was ready to try it on my DS came. As it was working in the SDL, I was not afraid that it won't work on the DS, the question of course was how fast it will work. I prepared the build and tried it in desmume first, just to make sure it really works. The 128k requester came up after while, which was good, so I copied it to my SD card and ran it on a real machine. First nothing happened, the screen remained pitch black, then after very very long time, the familiar red stripes started to appear at the bottom of the screen in a horrifying slow motion. My heart sank low. So this was it. All my hopes shattered to dust. There was no chance I could ever speed up something this slow.

Absent mindedly, I started tapping the keys on the touch screen to see if I can get some response at all. Nothing seemed to happen for as long as I have waited. I tapped few of the save slots, too, just to make the screen do anything. The Zynaps title screen started to appear, chunks of graphics at a time. Suddenly the 48k music started playing. This got me thinking again. My first thought was like, hmm, that's interesting, accessing the screen memory must be causing some terrible cache misses which make everything this slow, while the tight 48k replay routine perhaps fits in the cache and runs quite fine. I tried tapping few more save slots out of curiosity if I can get similar behavior in other cases as well, when it finally stroke me.

There was no way how the 48k music could sound fine while everything was running this slow. Well, at least not unless the sound itself was involved in this. I quickly checked the code and that was really it. When I have changed the main loop to implement the per-line screen sampling, I have removed the code which pinged ARM7 several times per frame, and left just one ping at the end of the frame. I thought it was enough, which was true originally, but I haven't realized that once I have changed the code to use the hardware divider, the division became signed, resulting in integer overflow unless that code was called at least twice per frame. Aargh! The overflow caused the code to request thousands of samples to be played after each frame instead of the usual few hundreds, eventually stalling everything until the ARM7 coped with that. Those were the delays I was seeing and that was the reason why the 48k music sounded fine, not some cache related stuff.

I quickly commented out the relevant code to get rid of any audio synchronization issues for now, and feverishly prepared a new build. This time the 128k requester came up in no time, and although now I had no sound I could use to judge the actual speed, it seemed fast enough as far as I could tell. Of course, I instantly loaded the Shock megademo, and when I saw the famous "I made it, I made it" along with the full screen multicolor stripes, I was almost about to burst out with tears of joy.

Harmony of dissonance

From now on the rest was fairly simple in comparison. I have reenabled the audio code, making sure it was now called multiple times per frame as required. Quick test showed that the emulator speed was really fine in most cases, however while playing some 48k tunes it dropped below 100%, resulting in audio dropouts. This was not much of a surprise, though, even before I have noticed that the previous release could not keep up with some tunes like Trapdoor, so I was intending to improve the speaker processing anyway.

To do that, I changed the code to compute the output volume on the ARM9 side, sending only the completed samples to ARM7, rather than leaving the volume processing to ARM7 entirely. This also allowed me to compute the output volume exactly by synthesizing it over the entire sample period, rather than sampling it at discrete points, and still reduce the overhead to at most one division per sample rather than per one OUT instruction processed. This resulted in the speaker output sounding much better, and although still subject to aliasing issues, it was basically as good as it can get without the use of some fancy low pass filters.

The speed improved considerably as well, however it still was not enough for some tunes. I tried to change few things here and there but nothing helped much, so I have decided to do one more change in the screen sampling code which could save a lot. While I did some profiling before, I have noticed that the screen decoding loop was clearly memory bound, so it didn't look like a good idea to copy the screen data to a separate buffer and decode them afterwards after all. Originally I wanted to keep the screen sampling loop as simple as possible, but now it turned out that it might be better if it did the screen decoding at the same time as well. So that's what I did, combining the two loops into one, thus saving entire buffer worth of reads and writes. Indeed, the speed was now perfect, dealing fine with whatever I have thrown at it.

The release was getting near, but there was one more thing I wanted to do first. The screen emulation was now topnotch, but I was sure that a tearing display wouldn't do it any justice, so I wanted to finally force the LCD refresh rate to match the frame rate of the emulated Spectrum. Now this was indeed just an icing on the top of a cake, but I knew myself that once you have seen something properly synchronized to the vertical interrupt, you'll always be able to tell the difference ever since.

Of course, doing that was not entirely trivial, as there were several things I needed to synchronize together in perfect accord, each of which was running at different frequency of its own. The sound at about 32.7kHz, the Spectrum at about 50Hz, and the LCD display at about 60Hz. The DS hardware fortunately allowed to do that, even if normally it would be quite impossible. The key feature in this case was of course the single clock which drives every DS component. Thanks to this, it was possible to base everything on one common frequency and use that for everything else. Naturally, the fraction of the audio frequency was chosen as the master frequency, because the audio must be kept running at all times with no gaps in between, as the human ear would be able to perceive even the slightest clicks otherwise.

Having this common master frequency, it was now simple to use it for converting T cycles passed to amount of samples to generate, as well as to use it to set up a timer which would count the 50Hz which would drive the LCD, and still keep everything in perfect sync. The refresh rate of the LCD itself can't be set directly, though, however the DS features a writable VCOUNT register which can be used to delay the start of the next frame as needed. It is normally intended to be used to synchronize display of machines participating in multiplayer games, but it was trivial to abuse it for holding the retrace after each frame displayed until the interrupt handler driven by the 50Hz timer allowed it to go, effectively slowing the LCD refresh rate to 50Hz as well.

Of course, this alone would not be of much use if the emulator kept drawing into the frame buffer as it was being displayed, so I had to add support for drawing into back buffer instead and swap the buffers only after the front buffer was displayed in its entirety. This somewhat complicated matters further, as the emulator loop could be now stalled not only by waiting for FIFO used to pass the audio commands to ARM7, but also while waiting for the back buffer to become available again. To be able to deal with stalls like this, I went for triple buffering in the end. Double buffering is fine for eliminating tearing, but it requires at least three buffers if one needs to run a bit ahead, and that's exactly what was needed in this case.

Now I was finally able to enjoy the perfectly smooth display output. It always surprises me how big difference it makes. The eye certainly is sensitive to frequency changes in time domain as well. I ran several demos and was really glad to see how well it all worked. The only disappointment was the scrolling background in Uridium. Although the scrolling itself is perfectly smooth, the LCD can't keep with rapidly inverting pixel colors, which makes the chequered pattern flicker badly at some speeds. Well, seems like nothing ever comes without a grain of salt. Nevertheless, I was pretty satisfied with this release, and my hopes were that even if there were no changes in the UI everybody seems to be longing for, there might be some people out there which would find this release as exciting as I did.

Widget, his sprite, and their painter

After taking about two weeks rest, I have finally moved onto what I have been looking forward from the very beginning - the user interface. So far, I haven't used much of the DS specific features, so I was hoping I will be finally able to find out what it has to offer in the 2D graphics department. To start with, I spent few evenings studying the available docs about the sprites and tiles and so on. I also wrote few simple tests to make sure I understand how they are organized in the video memory.

Once I was sure I know how it all works, I started the actual work. I didn't want to get caught in the trap of the feature creep which often comes with the bottom-up design, so I just drafted the skeleton of the simple UI toolkit I was going to use, and then I started refactoring the main loop to what it shall look like. The first obvious change was the input processing. Instead of the hardcoded touch screen save slot and keyboard matrix I have written code dealing with touching generic widgets, and the hardcoded button mapping was replaced with customizable key and joystick array. I included support for sticky keys as well, if only for shift and caps for now, and implemented the simulation of the Spectrum keyboard matrix effect, which is responsible for generating extra keypresses when certain combinations of more than two keys are held.

The next evening I implemented the brand new touch screen keyboard, built upon the widget API, including switching between multiple layouts and simulating keypress of two keys with single widget, which was going to be needed for those extra keys of the Spectrum Plus style keyboard. Once this was done, I filled in the widget stubs with actual implementations, as I now had pretty good idea of what exactly I'll need. I also improved the widget input processing itself as well, as while most widgets shall be activated only after being released, those of the touch screen keyboard must respond immediatelly.

Two weeks later, when I came back from a business trip, I briefly experimented with copying the screen content to both displays using DMA because of one idea I got while I was away, and then I returned to the UI code again. To be able to actually display some graphics, I wrote simple code to clear, draw, and copy rectangles, as well as a simple printing routine featuring word wrapping and proportional printing. Instead of using the frame buffer directly, though, I decided to let it all operate on an offscreen paint buffer. This approach has several advantages, as one doesn't have to deal with memory bus width and alignment issues, accessing it features all benefits of memory caching, and it can be drawn to without worrying about flickering issues. But most of all, one can use single set of functions to prepare its content, and then easily copy it to the VRAM frame buffer, show its part within a requester window, convert it to tiles to be used as sprites, or just about anything else as he desires.

That's actually what I wrote the next evening - routine to convert given paint buffer rectangle to a set of tiles, returning a single value which could be directly used to initialize a sprite. The initialized sprite uses about 48 bits, but cramming the necessary info into 32 bits was desirable for ease of use. Also, the sprite sizes supported by the hardware are limited to certain fixed dimensions, so I included the option of having to use multiple sprites to cover the rectangle. I already knew I'll need something like that to display the space bar of the Spectrum Plus keyboard anyway. Once this was done, I wrote the sprite management routines themselves. The allocation itself was simple, the only somewhat interesting part was their upload to hardware. The catch is that the hardware sprites may be uploaded only during vertical blank, but it isn't much of a problem, as one needs to maintain a copy anyway in order to be able to implement atomic changes. Of course that's what I wanted, so I used simple locking counter as it allows nesting. This way I can lock the sprites, change everything I want, and unlock it again. Without this, the multisprite stuff would be impossible to do without risking partial screen updates.

Having all this in place, I was ready for a simple proof of concept test. The next evening I have added simple code which painted the layout of the keyboard into the paint buffer using simple rectangles, and then created tile sets for each of the pressed keys. The simple keyboard graphics would be replaced with the actual keyboard graphics one day, but the rest was about to stay. I also added in the code to create corresponding painter for each keypress. Painter is an extra abstract object introduced to simplify things. It is used mostly to treat group of multiple sprites as one object, and can also take care of adjusting its visualisation depending on the desired widget state. Without it, working with rectangles which require multiple hardware sprites would be a pain. To simulate a real screen as it would look like in the future, I also added few simple menu icons to the top. I quickly prepared an actual build and tried it out. I was pleased that the tile and sprite stuff was working fine. The only thing which needed improving was the widget handling code, which incorrectly treated sliding away from a pressed menu widget as if it was released. Once this was fixed, I knew I was ready to start building the actual user interface on top of all this.

The art and perils of UI programming

Coming a bit late...

From Russia with love

Coming a bit late, too...

Christmas marathon

Coming too...

Games galore

With the Christmas release out of the door, I had to calm down a bit. I was all eager to implement the TR-DOS paging right away, however I have been neglecting my family for way too long, so I decided to put it on hold for now and relax. Having finally delivered the UI, I thought I could sit back and enjoy the emulator myself for a while, too. Until now, I have spent way more time developing it than using it, so it was time for a change. I also thought that it might be nice opportunity to try out the games and make screenshots for the page I was planning, to make sure the time was not entirely wasted.

The games page was an idea I got some time ago when I have realized that many users out there don't really have an idea which games are worth trying (actually I feel the same thing myself whenever I want to try out an emulator of a platform I am not familiar with). I first wanted to just point them to some of the existing lists, but I soon found out that most of them are full of stuff that newcomers would hardly found acceptable nowadays, so I decided to create a list of my own.

I naturally wanted the page to have game screenshots, so I have spent one January evening adding support for both raw and bitmap screenshots to ZXDS. I also noticed before that quite a few TZX files actually used the for/next loops, so I threw that one in, too. Then I downloaded lot of the finest games, added some more from my tape backups, and started playing.

I originally expected the page to have just a couple dozen entries, but I soon realized that there are way too many good games to be left out, even if I tried to select only the best of the best. The list quickly grew to few hundred games. It also became obvious that there is no way I could responsibly order them by ranking, as many were just too different to compare, so I decided to group most of them by their genre instead. I also had one extra group for a selection of assorted favorites to start the page with, and one dedicated to conversions. Luckily the games I had problems categorizing felt in either of these extra groups. Ordering all the other groups was a bit tricky, too, as I intended to write little something about each game and I wanted the text to flow naturally from one group to another, but I eventually came with an order I was satisfied with.

While working on the list, I played through the games, making notes, taking screenshots, having good time, eventually adding new entries. Time went by quickly, one month passing another. I ended up taking several thousands screenshots in the end, from which I had chosen about few hundred to put on the page. I then started writing the page text itself, and it went much slower than I expected. More and more time passed and it still was not completed. I did not like that I did not touch ZXDS for such a long time now at all, but there was no way back, I had to push on. Eventually some more time later the text was written, and after tuning the page's look a bit, working around IE's broken support for floats in the process, I finally made it public. Hooray! I did not expect it would take so much time when I have started, but I didn't regret it in the end. The page really lists some of the best Spectrum games and I hope that it will help bring joy to anybody interested out there.

A/V output

Even before I was done with the games page, I promised to myself that I will fill in the missing diary entries before I move on and start working on ZXDS again. But time went on and I still couldn't find enough spare time to write down everything I wanted about the Christmas release. Holding up to my promise, time inevitably passed by and little progress was made in the end.

This has finally changed when one of the WoS users reported missing sound effects in Steel Eagle. Unlike the tinny sound issue I have noticed in some tunes before, this one looked like a serious bug, so I decided to ditch the diary update for now and get some work done instead.

Boy, it felt so nice to do some coding again. I quickly found and fixed the bug. As I expected, it was caused by ARM9 not updating the envelope register when it didn't change, and thus not restarting the envelope in such case. Interesting, I think that at some point I have made the very same trivial mistake in each AY emulator I have worked on before.

Once it was fixed, I didn't want to stop there. Once I started messing with the audio code, I thought it might be the right time to fix the other audio issues as well. So I finally devoted some time to getting the correct AY volume tables, and I also went through the formulae describing the computation of final output volume and adjusted the master volume appropriately to make sure the output is not clamped.

The next day, I addressed the tinny sound issue. I took Chase H.Q. 2, where it manifested itself quite notably, and had a look at the AY register values to make sure I am on about the right thing. As expected, the tinny sound was indeed caused by the frequency register of one tone channel set to zero, which yields a square wave of about 218kHz. Normally you won't hear such a thing, but when sampled at 32kHz, the result is an annoying tinny sound caused by effect known as aliasing. To work around that, I decided to lock the tone generator output at constant value for frequencies higher than the sampling rate, as the ARM7 is not powerful enough to solve this properly by oversampling and applying some quality low pass filter. The naive solution would be to generate silence instead, but that would not allow the output to be further modulated by volume envelopes or other volume changes, which is a must for recent tunes. Interestingly, the subtle change in the code involved also speeded up the overall sound emulation, making some 48k tunes like Zanthrax which experienced occasional hiccups before sound flawlessly. That's what I call 2-in-1 improvement.

Now it was time to add few new audio features as well. I have modified the audio setup code to support configurable volume and panning values, then built the most commonly used ACB and ABC stereo modes on top of that. Of course, I kept the monaural output as well, as it is how the 128k sounded when the output went through the TV. Regarding the stereo separation, I have decided to make it somewhat configurable as well, and added two modes, one which is as wide as possible, which is most appropriate for the builtin speakers, and the other which uses about 50% crossover, which is more easy on the ears when using the headphones. I also addressed the relative volume of AY channels and the Spectrum beeper. Some docs say the beeper is about as loud as one AY channel, but that makes the AY sound too loud. So I have made it sound as two channels by default, and added the option of making it as loud as one or all channels as well. That should cover most common AY setups. To pass the selected parameters to ARM7 I have used the IPC interrupt, as I didn't want to slow down the main ARM7 loop at all.

Now all I needed was to add the appropriate UI icons and button shortcuts to control all this. And the relevant config section. And the help texts. The maintenance cost of adding silly little feature like this increased considerably since the UI was added, so I was quite glad I have delayed it for so long. To minimize it further, I have decided to throw in few video features first as well, to complement the changes in the audio department. The grayscale palette was the natural choice. It added to the nostalgic feel, too, as it was quite common to use Spectrum with a B&W TV back then. The next thing added was a special mode for displaying the Spectrum screen without any attributes, using just the ink and paper colors. This is something you can't achieve on a normal Spectrum, but I have found it quite interesting during the early ZXDS development, before the color display was implemented. Unlike the grayscale display, which uses just a modified palette, this one required different set of the screen decoding tables, but as it allows display of hidden loading screen messages and as it may shed some light on inner working of some games and demos, I thought it might be nice to have it in as well.

Later, when adding all the icons for the new features, I ran out of the tile memory available. The icons themselves are 24×24 pixels large, but have to be displayed as 32×32 sprites, wasting quite a lot of tile memory for the transparent area. For now, I have obtained some of that memory back by overlapping the transparent area of subsequent icons, but I will have to keep an eye on this in the future.

Flippity flop

Coming soon...

Action replay

Coming soon...

Green greens

On Wednesday I have decided it's about time to tackle the issue of power consumption. So I finally changed the active waiting loops to proper IRQ-driven idle waits. I briefly experimented with it long time ago when those loops were written, but never got it working correctly, so I have postponed it until later, to make sure it doesn't interfere with other time critical components. As I expected, in the end it was about six lines of code added to about three loops, but it took several hours and quite some head scratching to get it right. Not because I would not know how the interrupts work - the DS uses the same elegant scheme I am used to ever since the Amiga days. Neither because I would not know how to use the involved SWI calls properly - there is a good reason why those functions enable the master interrupt, making it possible to avoid missing one, something what everyone seems to carelessly neglect. The problem was that putting the CPU to sleep and waking it up takes some time, so incorporating such thing into a time sensitive loop such as the ARM7 sound FIFO loop was not entirely trivial. And the fact that the SWI calls involved are bugged in some way and the documentation is not exactly revealing much in this regard didn't help much either. But now it seems to be all working fine, as I have verified by listening to all the tunes from Melody Music I-VI and Music Player I-II myself.

The next evening I have added the code to turn off the backlight of the bottom screen when not in use. Having the IPC infrastructure in place, this was a piece of cake, just a matter of telling ARM7 to toggle one bit in the power management register. Ironically, the logic controlling when to turn it off and on was more work than that. For turning it off, I have decided that a simple timeout, optionally configurable in the config file, is good enough. For turning it back on and keeping it on, I have used both the touch of the bottom screen as well as any activity displayed on the bottom screen. This required placing few hooks here and there which keep tracking of any significant activity, as well as making sure the first touch once it is off doesn't activate any of the visible widgets. Not a big problem, though, and once I added that config variable as well, it all worked like a charm.

To be continued...

Let's inflate!

Coming soon...

Poking fun

Coming soon...

Load down less wires

One week before Christmas it happened for the first time. Now it is all just a matter of time. Stay tuned...

A tapestry of time

After I have enabled the RTC support to keep track of the real time properly, I was quite disconcerted when I have found out that the mktime() call brought in tons of time zone stuff which increased the size of the ARM7 code by quite a few useless kilobytes. I added a mental note that I might want to replace it with my own mktime() one day. Later, when I was adding time parsing support to the HTTP client, it became clear that it would be wise to use my own version of gmtime() call as well, as the RFC explicitly disallows use of time zones other than GMT, and I didn't want another bunch of useless time zone and locale code linked in.

The straightforward mktime() implementation is well known and fairly simple. It looks like this:

time_t mktime( struct tm * tm )
{
    // Make everything relative to March.

    int month = ( tm->tm_mon - 1 ) ;
    int year = ( 1900 + tm->tm_year ) ;

    if ( month <= 0 ) {
        month += 12 ;
        year -= 1 ;
    }

    // Now compute the number of days since 1970, taking leap years
    // into account.

    const uint days = (
        tm->tm_mday               // The number of days in the month (+1).
        + ( 367 * month / 12 )    // Plus appropriate amount of days for each month (+30).
        + ( year * 365 )          // Plus appropriate amount of days for each year.
        + ( year / 4 )            // Plus leap day for every 4th year...
        - ( year / 100 )          // except every 100th year...
        + ( year / 400 )          // unless it is the 400th year.
        - 719499                  // Make it all relative to 1.1.1970.
    ) ;

    // Now just convert it all to seconds while including hours, minutes and seconds.

    const uint hours = ( ( days * 24 ) + tm->tm_hour ) ;
    const uint minutes = ( ( hours * 60 ) + tm->tm_min ) ;
    const uint seconds = ( ( minutes * 60 ) + tm->tm_sec ) ;

    return ( (time_t) seconds ) ;
}

From what I have gathered, the algorithm was first published by Gauss many years ago. When I have seen it for the first time, I was really surprised that it was possible to convert the month number into the number of days passed so easily. Sure, the computation of number of leap days is neat as well, but it is this part which is really interesting:

days = ( 367 * month / 12 )

I don't know about you, but to me the number of days in a month during the year never seemed quite logical. I always have to use the "knuckles rule" to figure out how many days each month has, and then there is the nasty February with its own leap day rules. When you look at the sequence, it looks like there is a pattern which repeats after 7 months, between June and August:

31 28 31 30 31 30 31 31 30 31 30 31
Gauss, however, was able to look beyond that and figured out that everything is much simpler when everything starts in March:
31 30 31 30 31 31 30 31 30 31 31 28

This way, the pattern repeats each 5 months, but most importantly, the problematic February comes last, so it doesn't affect how many days passed till begin of each month. This allowed Gauss to use simple rounding trick to convert the month into the number of days passed and deal with the leap days efficiently. Brilliant.

When I have figured this out many years ago, I was wondering how the opposite conversion works. I was really surprised to find out that any gmtime() implementation I came across used month table for the conversion, and a while loop for computing the year number. Given Gauss was able to do the conversion so easily, I was sure it must be possible to do it the other way as well. Indeed, after few tests I came up with values which convert properly biased number of days passed back to the month number:

month = ( days * 10 / 306 )

As for directly computing the year from the number of days passed, one would expect it would be tricky due to the leap day rules, but in the end it turns out it's quite simple as well. Having cycle of k days with 1 leap day every four cycles, one can use a rounding trick again and compute the cycle index simply like this:

cycle = (days * 4 + 3) / (k * 4 + 1) 

From this, it's simple to compute the day number within its cycle:

days -= ( cycle * k + cycle / 4 )

Alternatively, we might pretend each cycle has (k+1) days by removing the irregularities by adding one for each leap day which didn't actually happen:

days += ( cycle - cycle / 4 )

Now, how does this apply to the conversion of number of days to years, when leap day happens every fourth year, unless the year number is divisible by 100 but not by 400?

Well, the answer is simple. All it takes is to realize that regardless of the leap day rules the number of days in a century is fixed, with an extra leap day occuring every fourth century. This makes it easy to compute the number of the century and use it to remove this centurial irregularity. Once that is done, the yearly cycle with a leap day every fourth year becomes regular, and the same rules may be used to compute the number of the year. Applying all this, the gmtime_r() implementation may look like this:

struct tm * gmtime_r( const time_t * time, struct tm * tm )
{
    const uint seconds = (uint) *time ;

    tm->tm_sec = ( seconds % 60 ) ;
    tm->tm_min = ( ( seconds / 60 ) % 60 ) ;
    tm->tm_hour = ( ( seconds / 3600 ) % 24 ) ;
    
#define DAYS(y) ((y)*365+(y)/4-(y)/100+(y)/400)

    uint day = ( seconds / 86400 ) ; 		// Days since 1.1.1970 (0..X).
    day += ( DAYS(1970) - 59 ) ;                // Days since 1.3.0000 (0..X).
    uint wday = ( day + 3 ) % 7 ;               // Day of the week (0..6).
    uint cent = ( day * 4 + 3 ) / DAYS(400) ;   // Centuries since 1.3.0000 (0..X).
    day += ( cent - cent / 4 ) ;                // Remove the centurial irregularity.
    uint year = ( day * 4 + 3 ) / DAYS(4) ;     // Years since 1.3.0000 (0..X).
    day -= ( year * 365 + year / 4 ) ;          // Days since 1.3.YYYY (0..365).
    day += 31 ;                                 // Days since 1.3.YYYY + 31 (31..396).
    uint month = ( day * 10 / 306 ) ;           // Months since 1.3.YYYY + 1 (1..12).
    day -= ( month * 367 / 12 ) ;               // Days since 1.MM.YYYY + 1 (1..31).
    if ( ( month += 2 ) > 12 ) {                // Make everything relative to January.
        month -= 12 ;
        year++ ;
    }
    
    tm->tm_mday = day ;
    tm->tm_wday = wday ;
    tm->tm_yday = -1 ;                          // Not computed, as I don't need it.
    tm->tm_mon = month - 1 ;
    tm->tm_year = year - 1900 ;
    tm->tm_isdst = -1 ;
    
    return tm ;
}

Note that it doesn't compute the day of the year (which would be trivial to add, though) and doesn't take care of leap seconds and daylight saving time and whatever else, but I couldn't care less about these things in context of ZXDS development. Now I would be really surprised if no one else came up with all this before, but as I have never seen it anywhere, here it is. Maybe some of you will find it useful, or at least marginally interesting.

Mr. Fixman

Coming sooner or later...

For a friend I didn't know

... and I have come to fully realize one thing: No matter how much wisdom you have acquired, how many amazing things you have witnessed, or what precious trinkets you have gathered, once you are gone, all that matters are the good deeds you have done for the others.

Hopefully coming sooner before it is too late...

Save me, please!

Coming sooner or later...

Now shake your hands...

Coming sooner or later...

Mr. Fixman strikes again

Coming sooner or later...

Little bit of iconomy

Coming sooner or later...

Birthday present

Coming sooner or later...

Plus? Plus. Plus!

Coming sooner or later...

One more Plus

Coming eventually, too.

New horizon ahead

Too busy to write something now...

Rolling rolling rolling...

And here we go again... Will I ever find time to write all these down?

Let's Zzoom!

Coming eventually, too.

Memories of the past

Coming eventually, too.