NES emulation discussion Brad Taylor The reason you are reading this is because you have written an NES emulator, and it runs extremly slow on your PC, or you are thinking about writing an emulator, and you want to make an informed decision as to how to go about doing it. As you might have guessed, emulating the NES is not an easy task. Besides implementing a 6502 and graphics engine, there's alot of small things to manage as well (not to mention the dilemma the dozens of different bankswitching schemes for the games create). This article discusses efficient emulation implementation of the NES's CPU (6502) & PPU. +-------------+ |CPU emulation| +-------------+ 6502 Opcodes for a particular game must be processed, and a unique task carried out for each one (or at least, the documented ones). This section discusses efficient implementation of a 6502 engine for the NES. Basic optimizations ------------------- - don't bother implementing dynamic recompiliation into your 6502 core. The NES's 6502 only runs at 1.79 MHz, and with 6502 instructions averaging 3.5 to 4 clock cycles, the equivelant inlined x86 (assembly-written) code can emulate the behaviour of most 6502 instructions in 15 clock cycles or less on a 486. Implementing a dynamic recompilation engine may be more complicated than writing the 6502 core itself, and will only eliminate the "opcode fetch & branch" overhead part of emulated instructions (which will only save you roughly about 5 clock cycles max. on a 486). - interpret the 6502 opcodes using a jump-table based decoder. If possible, instead of using a jump table, organize the branch targets in memory so that the 6502 opcode can be used as the actual branch target address (after a shift, or other ALU operation). This removes the overhead of a memory look-up. - Don't execute the "addressing mode", "ALU/operation", and "fetch next opcode & execute" parts of a 6502 instruction seperately, as this increases the number of program control transfers. Instead, define a seperate routine for each opcode to be emulated (there are 151 defined 6502 opcodes), and inline all code to be executed for each opcode emulation routine. This eliminates the overhead of 2 or more branches, and the possibility of a branch target misprediction. While inlining code will use more memory, remember how cheap memory is nowadays. - Implement a clock cycle counter into your 6502 engine, which will be maintained by every 6502 instruction executed. This counter will mainly be used by the PPU to figure out how timed writes will effect how the output image will be rendered. However, if used also as a terminal counter, when the count expires, program control can be transferred to a different part of the 6502 engine to emulate the behaviour of an event that would normally occur on the NES, on that exact clock cycle (i.e., a timed IRQ event). - As 6502 instructions usually require the P (processor status, or flags) register to be updated after an ALU operation, the x86's ALU instructions updates it's flags register in a similar mannar. Therefore, after emulating the ALU behaviour of a 6502 instruction with an x86 one, use instructions like "LAHF" or "SETcc" to acquire the status of sign, zero, carry, and overflow condition codes. Furthermore, have your emulator store the 6502 flags in the format that they're stored in on the x86 CPU. This way, the flags do not have to be formatted, thus saving time. The only time the flags will have to be converted to/from the 6502 order, is when 6502 instructions PHP, PLP, BRK #xx, RTS, and hardware interrupts are executed. Since these happen much less often than more common arithmetic and logical instructions, it's more efficient to handle the flags in this way. - use the x86's registers to store all the 6502's internal registers in (including a cycle counter). There is enough room to leave 3 general purpose registers left for temporary use. Advanced optimizations ---------------------- - dealing with paged/bankswitched memory. Because the 6502's memory map is nonlinear, when the 6502 engine needs to read/write memory from/to a part of the NES's memory map, use part of the target address to index into a table of pointers. This table of pointers (hereon memory map translation table) will represent the architecture of the NES memory map (there will have to be 2 seperate tables of this type for handling reads & writes from/to NES memory). Each entry in the table represents a linear page (or bank) in the NES's memory map. Generally, this table will have to have a page bank granularity as small as 2K (i.e., 32 entries), to emulate the mirrioring behaviour of the NES's internal RAM properly. Reserve a single bit in each pointer entry to indicate if the target page pointer points directly to a base memory address (in the x86 memory map), or an x86 subroutine (used for emulating hardware behaviour when access to that address occurs). If the entry is a subroutine, then program control should be transferred to the pointer's address. Otherwise, no branch should occur, and the code that stores/loads the data via the pointer base (the original address will be used as the offset into the page) should be inlined within the opcode's code body. Also remember that 6502 opcodes that work directly with stack or zero-page memory can avoid doing a look-up into the memory map translation table, since these areas are hardwired to the NES's internal 2K RAM. - fetching instructions. Because the 6502 is always fetching instructions, and in a sequential mannar, it is more efficient to maintain the 6502's PC as a 32-bit IPC register (Indexed PC) which points to the physical address at which the instructions reside. This avoids having to test the PC's address (with the memory map pointer table) every time an instruction byte needs to be fetched from memory. - handling program control transfering When the IPC must be modified via an absolute program control xfer instruction, using the new 16-bit PC address, the memory map page table is consulted to load the new base address for IPC (this base address is also saved for later). The PC value is then added to IPC to offset it into the page. When a PC value must be saved to the stack, the saved base address (mentioned previously) is subtracted from IPC to calculate the current PC value. Note that support for executing 6502 instructions residing in a hardware register mapped part of memory will requre additional code. - detecting when PC exceeds a page/bank boundary. Because of the way IPC is implemented, when IPC crosses a page boundary, instead of the next virtual page being executed, the next physical page is executed. This will result in the CPU taking behaviour it never intended to take, and may result in compromising the stability of the game being emulated. This problem cannot be overcome, but it may be trapped. By placing 3 of the same special opcodes at the end of these banks in physical memory, the page boundary crossing condition can be detected. Now, this doesn't give much hope towards running games that may potentially do this, but if the game did this anyways, it would probably hang the game without much explanation. By trapping this occurence, you'll know exactly what happened. +-------------+ |PPU emulation| +-------------+ Emulating the NES's PPU is not very difficult. What makes accurate emulation of it difficult however, is the trickery various NES games use to achieve special video effects (like split screen scrolling) not supported natively by it. Really, all these "tricks" are simply accomplished by writing to the appropriate PPU (or related) registers at the right time during the rendering of a frame (picture). On a hardware level, the CPU & PPU in the NES run simultaniously. This is why a game can be coded to make a write out to a PPU register at a certain time during a frame, and the (on-screen) effect can be seen almost immediately after (i.e., in a few clock cycles). But basically, this really means that timing determines where an effect occurs/appears on the screen. So, the first instinct one has for writing a NES emulator is to execute both the CPU & PPU engines alternately on every (NES) clock cycle. The results of this will give very accurate emulation, BUT- doing this will also be VERY processor intense (since on a hardware level, the PPU does much more work per clock cycle). As a result, emulators coded like this turn out to be the slowest ones. Efficient PPU emulation ----------------------- In the simplest context, all the PPU does is draw a picture on the screen. On the hardware level, counters are being incremented, and index & pattern table data is being fetched, via these counters. What I'm getting at here is that the PPU's operation is fixed, and therefore predictable. By executing the PPU engine once, only when a frame is to be rendered, a great amount of efficiency can be achieved, since the emulation of the PPU can be executed together as alot of repeated instructions. However, this has the caveat that games that use almost any special effects (mid-frame PPU writes) will not work. Accurate & efficient PPU emulation ---------------------------------- The only time the operation of the PPU changes is when a game makes a mid-frame write out to a PPU register (or a bank switching register). Since these writes change how the PPU draws the graphics from that point on, they need to be compensated for. By implementing a clock cycle counter in the CPU core, it is possible for emulated hardware to know exactly when a read/write is occuring. Therefore, when a write to a PPU register occurs, the PPU engine can then determine if the write is going to change the way the picture is going to be rendered, and at the exact clock cycle (which really translates into a screen pixel). For example, say the CPU engine is executing instructions. Then, on clock cycle 13000 (relative to the last VINT), a write to the PPU's scroll registers are made (which causes a split-screen effect). Now, first the PPU translates 13000 CC's into X/Y coordinates (in this case, on-screen scanline 93, roughly pixel #126 (the equations to do these calculations will be revealed later)). All pixels before this point will now be rendered to a buffer, using the data in the PPU registers prior to the write. Now the screen area before the write occured has been rendered accurately, and the screen will progressively continue to be updated in this fashion as more mid-frame writes occur. If no more occur, when the CPU hits a predetermined clock cycle number (like 29781- the default #of CPUCC's in one frame on a real NES), or when your PPU emulation engine wants to (possibly in sync with the VGA's refresh rate (in this case, CPU can execute as many CC's as your PC will allow in a frame)), the CPUCC will reset to zero, and the rest of the video frame in the buffer will be rendered. If the rendering is directly to a video buffer, use a double frame buffer to avoid the flicker the burst rendering will cause. Knowing when to update the screen --------------------------------- The following list describes PPU status registers/bits that if a game modified/changed mid-frame, would change the way the rest of the frame is rendered. O = update objects, P = update playfield. O object enable bit O left column objects clipping O 8/16 scanline objects O active object pattern table O pattern table bankswitch (which effects active object pattern table) PO color emphasis bits PO black & white/color select P playfield enable bit P left column playfield clipping P scroll registers P X/Y name table selection P name table bankswitch (hypothetical) P active playfield pattern table P pattern table bankswitch (which effects active playfield pattern table) Note that any PPU mapped memory (which means name, pattern, color & palette tables) can only be changed while objects & the playfield are disabled. Since the screen is blank during this time, these writes do not effect how the screen is rendered. However, there are two exceptions to this. One is that if the transparency color is modified (since this effects the color of the "blanked" area). The other is if the EXRAM that MMC5 based games use is enabled (meaning, being used instead of the traditional color tables). Since it is ported through the PPU and CPU bus, the game may modify it at anytime mid-frame. If the EXRAM features is enabled, writes via the CPU bus will have to be intercepted, which will then update the playfield. Collision flag -------------- Games without hardware for scanline counting often poll this bit to find out when to make a write out to a PPU register which will result in a split screen, or a pattern table swap/bankswitch. The collision flag is set when the first non-xparent pixel of object 0 collides with a playfield pixel that is also non-xparent. Since the clock cycle at which the collision occurs is predictable, when a game requests the status of this flag for the first time, a routine part of the PPU engine can calculate at which clock cycle this flag will be set (calculations will be shown later). Subsequent requests after this would then only require the engine to compare the current CPU clock cycle, to the calculated collision clock cycle. Whenever a mid-frame change occurs (whether it effects the playfield, or objects), the clock cycle at which the collision flag will go off will have to be recalculated (unless it has already gone off). MMC3 IRQ timer -------------- The MMC3's IRQ timer relies on the toggling of one of the PPU's address lines. Basically, it's counting operation is more or less at a constant rate (meaning predictable). However, when the PPU bus is disabled (via disabling the playfield & objects, or during the V-blank period), the counter must quit counting. Manual toggling of PPU address bits during this time will have to be intercepted, and the IRQ timer advanced appropriately. CPUCC to X/Y coordinate equations --------------------------------- The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the CPU CC figure by 3, we get the total amount of pixels that have been rendered (including non-displayed ones) since the VINT. 341 pixels are rendered per scanline (although only 256 are displayed). Therefore, by dividing PPUCC by this, we get the # of completely rendered scanlines since the VINT. 21 blank scanlines are rendered before the first visible one is displayed. So, to get a scanline offset into the actual on-screen image, we simply subtract the amount of non-displayed scanlines. Note that if this yeilds a negative number, the PPU is still in the V-blank period. PPUCC = CPUCC * 3 Scanline = PPUCC div 341 - 21 PixelOfs = (PPUCC+16) mod 341 CPUcollisionCC = ((Y+21)*341+X)/3 Note that if the PixelOfs equation yeilds a number higher than 255, the PPU is in the H-blank period. Also, fetched pattern table bitmaps have to travel through internal shift registers before they appear on the video out of the PPU, and that's why you see a +16 there; this is the delay. +----------------------------+ |Tips for efficient rendering| +----------------------------+ There are 2 effective & efficient ways of rendering NES graphics. The only difference between the two is in the way the NES mid-screen palette changes are handled. This will in turn have a huge impact on rendering speeds. The first method is to program the VGA palette registers to reflect the direct values the PPU palette registers contain. The VGA palette would be divided into 64- 4 color palettes. When sequential horizontal pixels are to be drawn, a large (32-bit or more) pattern table data fetch can occur (pixels for pattern table tiles should be organized in memory so that the 2 bits for each horizontally sequential pixel are stored in 8-bit increments). This chunk of fetched pixel data can then be masked (so that other pixel data from the chunk is not used), an indexed "VGA palette select" value can be added to the value, and finally can then be written out to the virtual frame buffer in one store operation. The "VGA palette select" value is fetched via the VGA palette select table, which corresponds to the 8 classic PPU palettes (4*2 elements in the table; therefore, a tile's attribute data (either PF or OBJ) is used as the index into this table). This table indicates which 4-color group of 64 groups in the VGA palette to use for color selection for the group of pixels being written. The idea is that when a mid-frame palette change occurs (or at any time, for that matter), the affected PPU palette in this table is changed to point to where the new palette modifications will be made in the VGA's palette. The corresponding VGA palette entries will also have to be updated appropriately (generally, VGA palette updates will be made in a ring-buffer fashion. A pointer which keeps track of the first available 4 palette entries will be incremented when any entries in a 4-color PPU palette are changed). Basically, this method offers the fastest possible way to render NES graphics, since data is fetched from pattern table memory and written directly to the virtual frame buffer. The number of pixels processed simultaniously can be as high as 8 (with 64-bit integer instructions). However, the # of mid-screen PPU palette modifications possible is limited to 64 times (32 for PF and 32 for OBJs, since one of the bits in the pixel data will have to be used to distinguish between these types), but multiple consecutive modifications to a single 4-color PPU palette only count as one actual modification. The second method, which is the most straightforward, is to store the PPU's 52-color matrix as constant data in the VGA palette registers. Before a pixel can be drawn, pixel color is calculated (via pattern table & palette select data). The PPU palette registers are looked up in some way or another, and the contents of the palette register element is written to a virtual frame buffer as the pixel data. This technique is obviously insensitive to mid-screen palette changes, and is also simpler to implement than the previous method. However, only 1 pixel can be written at a time, since each pixel has to be looked up into the palette. As a result, this method is about 4 times slower when compared to the aforementioned one. The remainder of the section focus on the details of this particular implementation, since this is the only way to emulate mid-screen palette changes flawlessly. - draw the playfield tiles (including transparency pixels) to a buffer, then draw objects after. By doing it this way, individual tiles can be drawn w/out having to process object pixels at the same time (which makes color look-up and pixel packing easier). Since one tile can only have 4 unique colors, these colors can be loaded into a CPU register, and indexed by using a shift via CL, instead of having to re-read color values stored in memory. Remember- register work is always faster than memory (this is of course, assuming that the particular CPU's implementation of rotating via cl is an efficient one, otherwise it may actually be faster to do memory look-ups instead). - If rendering the image in the aforementioned way, change the pattern tables stored in memory before starting emulation so that the bitmap for any scanline of a tile is stored in an 8- 2-bit packed pixel format, instead of the 2- 8-bit planar method used by default. By doing this, it will allow the tile rendering routine to easily use the 2-bit number for indexing the 4 color palette associated with the particular tile. Of course, by changing the pattern tables, whenever pattern table memory is read or written to, the format of the data will have to be converted. Since this happens much less often (even in games that use CHR-RAM), it is very feasible. - If rendering the image in the aforementioned way, render to a temporary memory buffer before writing it to video memory. When background priority objects have to be drawn, the memory will have to be read, and reading video memory is EXTREMELY slow. Also, by writing to a temp buffer before writing to video memory, you avoid the overhead of making alot of small writes out to the video buffer, since pixels are usually rendered in single units at a time. Once the data is in the buffer, it can then be copied to the video mem using the largest data transfer unit possible, 32-bits (or 64-bits, if using MMX instructions MOVQ or MOVNTQ). Remember- video memory is slower than main memory, and the access overhead is the same regardless of the data size, so writes to it should be as few and as large as possible. - Use a linear frame buffer if possible for video memory. Besides not having to bankswitch, writing to it is 4 times faster than the legacy one at address 000A0000. As far as I understand, all PCI video cards (and definetely AGP) have linear frame buffers, even if the VESA BIOS on the older ones don't support it (but VESA 2.0 drivers can be downloaded & run with these cards). This will pretty much make it impossible for graphics to be displayed on an old ISA video card, but even if you were using the legacy frame buffer, the ISA video card would never be fast enough to handle the 3.7 MB/s (minimum) transfer rate required by these graphics routines. Writing a PPU graphics engine which would work full-speed on an ISA video card would be very limited in features & support, plus would have to use VGA hardware for scrolling, and use the programmable interval timer to generate interrupts to generate split-screen effects on the VGA's display). - Since the PPU only has the ability to produce 52 unique colors (not including color emphasis bits, which may allow for 8 times more colors), a single byte can be used for every pixel, with 2 bits left over. To draw background priority objects after the playfield has already been drawn, all non-transparent playfield pixels should use one of these 2 extra bits to indicate a non-transparent playfield pixel. Then the object's pixels would be drawn wherever there are xparent playfield pixels. However, to ensure that higher background priority objects are drawn over lower foreground objects, all object pixels drawn will set the bit to indicate a xparent playfield pixel. - Avoid branches. When drawing foreground priority objects overtop the playfield, transparent object pixels shouldn't be drawn. Instead of using a branch to avoid writing the pixel, write the pixel to a dummy address, which is calculated via arithmetic/logical instructions. - Inline code in small loops with a constant # of iterations, espically if the loop count is low, and is the most inner. This reduces overhead by avoiding a backwards branch, and the loop & index counters required. For example, when drawing tiles, it would be a good idea to inline the code to draw 8 horizontal pixels. - Use CPU registers in the inner most loops. Avoid using far (seg:ofs) pointers (x86), since loading segment registers is slow. - If implemented very well, the aforementioned method of PPU rendering can consume less than 20 million clock cycles/second on a 486 CPU (providing you've got a fast PCI video card). EOF