NES emulation discussion

Brad Taylor

The reason you are reading this is because you have written an NES emulator, 
and it runs extremly slow on your PC, or you are thinking about writing an 
emulator, and you want to make an informed decision as to how to go about 
doing it. As you might have guessed, emulating the NES is not an easy task. 
Besides implementing a 6502 and graphics engine, there's alot of small 
things to manage as well (not to mention the dilemma the dozens of different 
bankswitching schemes for the games create). This article discusses 
efficient emulation implementation of the NES's CPU (6502) & PPU.


+-------------+
|CPU emulation|
+-------------+
6502 Opcodes for a particular game must be processed, and a unique task 
carried out for each one (or at least, the documented ones). This section 
discusses efficient implementation of a 6502 engine for the NES.


Basic optimizations
-------------------
- don't bother implementing dynamic recompiliation into your 6502 core. The 
NES's 6502 only runs at 1.79 MHz, and with 6502 instructions averaging 3.5 
to 4 clock cycles, the equivelant inlined x86 (assembly-written) code can 
emulate the behaviour of most 6502 instructions in 15 clock cycles or less 
on a 486. Implementing a dynamic recompilation engine may be more 
complicated than writing the 6502 core itself, and will only eliminate the 
"opcode fetch & branch" overhead part of emulated instructions (which will 
only save you roughly about 5 clock cycles max. on a 486).

- interpret the 6502 opcodes using a jump-table based decoder. If possible, 
instead of using a jump table, organize the branch targets in memory so that 
the 6502 opcode can be used as the actual branch target address (after a 
shift, or other ALU operation). This removes the overhead of a memory 
look-up.

- Don't execute the "addressing mode", "ALU/operation", and "fetch next 
opcode & execute" parts of a 6502 instruction seperately, as this increases 
the number of program control transfers. Instead, define a seperate routine 
for each opcode to be emulated (there are 151 defined 6502 opcodes), and 
inline all code to be executed for each opcode emulation routine. This 
eliminates the overhead of 2 or more branches, and the possibility of a 
branch target misprediction. While inlining code will use more memory, 
remember how cheap memory is nowadays.

- Implement a clock cycle counter into your 6502 engine, which will be 
maintained by every 6502 instruction executed. This counter will mainly be 
used by the PPU to figure out how timed writes will effect how the output 
image will be rendered. However, if used also as a terminal counter, when 
the count expires, program control can be transferred to a different part of 
the 6502 engine to emulate the behaviour of an event that would normally 
occur on the NES, on that exact clock cycle (i.e., a timed IRQ event).

- As 6502 instructions usually require the P (processor status, or flags) 
register to be updated after an ALU operation, the x86's ALU instructions 
updates it's flags register in a similar mannar. Therefore, after emulating 
the ALU behaviour of a 6502 instruction with an x86 one, use instructions 
like "LAHF" or "SETcc" to acquire the status of sign, zero, carry, and 
overflow condition codes. Furthermore, have your emulator store the 6502 
flags in the format that they're stored in on the x86 CPU. This way, the 
flags do not have to be formatted, thus saving time. The only time the flags 
will have to be converted to/from the 6502 order, is when 6502 instructions 
PHP, PLP, BRK #xx, RTS, and hardware interrupts are executed. Since these 
happen much less often than more common arithmetic and logical instructions, 
it's more efficient to handle the flags in this way.

- use the x86's registers to store all the 6502's internal registers in 
(including a cycle counter). There is enough room to leave 3 general purpose 
registers left for temporary use.


Advanced optimizations
----------------------
- dealing with paged/bankswitched memory.

Because the 6502's memory map is nonlinear, when the 6502 engine needs to 
read/write memory from/to a part of the NES's memory map, use part of the 
target address to index into a table of pointers. This table of pointers 
(hereon memory map translation table) will represent the architecture of the 
NES memory map (there will have to be 2 seperate tables of this type for 
handling reads & writes from/to NES memory). Each entry in the table 
represents a linear page (or bank) in the NES's memory map. Generally, this 
table will have to have a page bank granularity as small as 2K (i.e., 32 
entries), to emulate the mirrioring behaviour of the NES's internal RAM 
properly. Reserve a single bit in each pointer entry to indicate if the 
target page pointer points directly to a base memory address (in the x86 
memory map), or an x86 subroutine (used for emulating hardware behaviour 
when access to that address occurs). If the entry is a subroutine, then 
program control should be transferred to the pointer's address. Otherwise, 
no branch should occur, and the code that stores/loads the data via the 
pointer base (the original address will be used as the offset into the page) 
should be inlined within the opcode's code body. Also remember that 6502 
opcodes that work directly with stack or zero-page memory can avoid doing a 
look-up into the memory map translation table, since these areas are 
hardwired to the NES's internal 2K RAM.


- fetching instructions.

Because the 6502 is always fetching instructions, and in a sequential 
mannar, it is more efficient to maintain the 6502's PC as a 32-bit IPC 
register (Indexed PC) which points to the physical address at which the 
instructions reside. This avoids having to test the PC's address (with the 
memory map pointer table) every time an instruction byte needs to be fetched 
from memory.


- handling program control transfering

When the IPC must be modified via an absolute program control xfer 
instruction, using the new 16-bit PC address, the memory map page table is 
consulted to load the new base address for IPC (this base address is also 
saved for later). The PC value is then added to IPC to offset it into the 
page. When a PC value must be saved to the stack, the saved base address 
(mentioned previously) is subtracted from IPC to calculate the current PC 
value.

Note that support for executing 6502 instructions residing in a hardware 
register mapped part of memory will requre additional code.


- detecting when PC exceeds a page/bank boundary.

Because of the way IPC is implemented, when IPC crosses a page boundary, 
instead of the next virtual page being executed, the next physical page is 
executed. This will result in the CPU taking behaviour it never intended to 
take, and may result in compromising the stability of the game being 
emulated. This problem cannot be overcome, but it may be trapped. By placing 
3 of the same special opcodes at the end of these banks in physical memory, 
the page boundary crossing condition can be detected. Now, this doesn't give 
much hope towards running games that may potentially do this, but if the 
game did this anyways, it would probably hang the game without much 
explanation. By trapping this occurence, you'll know exactly what happened.


+-------------+
|PPU emulation|
+-------------+
Emulating the NES's PPU is not very difficult. What makes accurate emulation 
of it difficult however, is the trickery various NES games use to achieve 
special video effects (like split screen scrolling) not supported natively 
by it.

Really, all these "tricks" are simply accomplished by writing to the 
appropriate PPU (or related) registers at the right time during the 
rendering of a frame (picture).

On a hardware level, the CPU & PPU in the NES run simultaniously. This is 
why a game can be coded to make a write out to a PPU register at a certain 
time during a frame, and the (on-screen) effect can be seen almost 
immediately after (i.e., in a few clock cycles). But basically, this really 
means that timing determines where an effect occurs/appears on the screen.

So, the first instinct one has for writing a NES emulator is to execute both 
the CPU & PPU engines alternately on every (NES) clock cycle. The results of 
this will give very accurate emulation, BUT- doing this will also be VERY 
processor intense (since on a hardware level, the PPU does much more work 
per clock cycle). As a result, emulators coded like this turn out to be the 
slowest ones.


Efficient PPU emulation
-----------------------
In the simplest context, all the PPU does is draw a picture on the screen. 
On the hardware level, counters are being incremented, and index & pattern 
table data is being fetched, via these counters. What I'm getting at here is 
that the PPU's operation is fixed, and therefore predictable.

By executing the PPU engine once, only when a frame is to be rendered, a 
great amount of efficiency can be achieved, since the emulation of the PPU 
can be executed together as alot of repeated instructions. However, this has 
the caveat that games that use almost any special effects (mid-frame PPU 
writes) will not work.


Accurate & efficient PPU emulation
----------------------------------
The only time the operation of the PPU changes is when a game makes a 
mid-frame write out to a PPU register (or a bank switching register). Since 
these writes change how the PPU draws the graphics from that point on, they 
need to be compensated for.

By implementing a clock cycle counter in the CPU core, it is possible for 
emulated hardware to know exactly when a read/write is occuring. Therefore, 
when a write to a PPU register occurs, the PPU engine can then determine if 
the write is going to change the way the picture is going to be rendered, 
and at the exact clock cycle (which really translates into a screen pixel).

For example, say the CPU engine is executing instructions. Then, on clock 
cycle 13000 (relative to the last VINT), a write to the PPU's scroll 
registers are made (which causes a split-screen effect). Now, first the PPU 
translates 13000 CC's into X/Y coordinates (in this case, on-screen scanline 
93, roughly pixel #126 (the equations to do these calculations will be 
revealed later)). All pixels before this point will now be rendered to a 
buffer, using the data in the PPU registers prior to the write. Now the 
screen area before the write occured has been rendered accurately, and the 
screen will progressively continue to be updated in this fashion as more 
mid-frame writes occur. If no more occur, when the CPU hits a predetermined 
clock cycle number (like 29781- the default #of CPUCC's in one frame on a 
real NES), or when your PPU emulation engine wants to (possibly in sync with 
the VGA's refresh rate (in this case, CPU can execute as many CC's as your 
PC will allow in a frame)), the CPUCC will reset to zero, and the rest of 
the video frame in the buffer will be rendered. If the rendering is directly 
to a video buffer, use a double frame buffer to avoid the flicker the burst 
rendering will cause.


Knowing when to update the screen
---------------------------------
The following list describes PPU status registers/bits that if a game 
modified/changed mid-frame, would change the way the rest of the frame is 
rendered. O = update objects, P = update playfield.

O	object enable bit
O	left column objects clipping
O	8/16 scanline objects
O	active object pattern table
O	pattern table bankswitch (which effects active object pattern table)

PO	color emphasis bits
PO	black & white/color select

P	playfield enable bit
P	left column playfield clipping
P	scroll registers
P	X/Y name table selection
P	name table bankswitch (hypothetical)
P	active playfield pattern table
P	pattern table bankswitch (which effects active playfield pattern table)

Note that any PPU mapped memory (which means name, pattern, color & palette 
tables) can only be changed while objects & the playfield are disabled. 
Since the screen is blank during this time, these writes do not effect how 
the screen is rendered. However, there are two exceptions to this. One is 
that if the transparency color is modified (since this effects the color of 
the "blanked" area). The other is if the EXRAM that MMC5 based games use is 
enabled (meaning, being used instead of the traditional color tables). Since 
it is ported through the PPU and CPU bus, the game may modify it at anytime 
mid-frame. If the EXRAM features is enabled, writes via the CPU bus will 
have to be intercepted, which will then update the playfield.


Collision flag
--------------
Games without hardware for scanline counting often poll this bit to find out 
when to make a write out to a PPU register which will result in a split 
screen, or a pattern table swap/bankswitch. The collision flag is set when 
the first non-xparent pixel of object 0 collides with a playfield pixel that 
is also non-xparent. Since the clock cycle at which the collision occurs is 
predictable, when a game requests the status of this flag for the first 
time, a routine part of the PPU engine can calculate at which clock cycle 
this flag will be set (calculations will be shown later). Subsequent 
requests after this would then only require the engine to compare the 
current CPU clock cycle, to the calculated collision clock cycle. Whenever a 
mid-frame change occurs (whether it effects the playfield, or objects), the 
clock cycle at which the collision flag will go off will have to be 
recalculated (unless it has already gone off).


MMC3 IRQ timer
--------------
The MMC3's IRQ timer relies on the toggling of one of the PPU's address 
lines. Basically, it's counting operation is more or less at a constant rate 
(meaning predictable). However, when the PPU bus is disabled (via disabling 
the playfield & objects, or during the V-blank period), the counter must 
quit counting. Manual toggling of PPU address bits during this time will 
have to be intercepted, and the IRQ timer advanced appropriately.


CPUCC to X/Y coordinate equations
---------------------------------
The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the CPU 
CC figure by 3, we get the total amount of pixels that have been rendered 
(including non-displayed ones) since the VINT.

341 pixels are rendered per scanline (although only 256 are displayed). 
Therefore, by dividing PPUCC by this, we get the # of completely rendered 
scanlines since the VINT.

21 blank scanlines are rendered before the first visible one is displayed. 
So, to get a scanline offset into the actual on-screen image, we simply 
subtract the amount of non-displayed scanlines. Note that if this yeilds a 
negative number, the PPU is still in the V-blank period.

PPUCC = CPUCC * 3
Scanline = PPUCC div 341 - 21
PixelOfs = (PPUCC+16) mod 341
CPUcollisionCC = ((Y+21)*341+X)/3

Note that if the PixelOfs equation yeilds a number higher than 255, the PPU 
is in the H-blank period. Also, fetched pattern table bitmaps have to travel 
through internal shift registers before they appear on the video out of the 
PPU, and that's why you see a +16 there; this is the delay.


+----------------------------+
|Tips for efficient rendering|
+----------------------------+
There are 2 effective & efficient ways of rendering NES graphics. The only 
difference between the two is in the way the NES mid-screen palette changes 
are handled. This will in turn have a huge impact on rendering speeds.

The first method is to program the VGA palette registers to reflect the 
direct values the PPU palette registers contain. The VGA palette would be 
divided into 64- 4 color palettes. When sequential horizontal pixels are to 
be drawn, a large (32-bit or more) pattern table data fetch can occur 
(pixels for pattern table tiles should be organized in memory so that the 2 
bits for each horizontally sequential pixel are stored in 8-bit increments). 
This chunk of fetched pixel data can then be masked (so that other pixel 
data from the chunk is not used), an indexed "VGA palette select" value can 
be added to the value, and finally can then be written out to the virtual 
frame buffer in one store operation. The "VGA palette select" value is 
fetched via the VGA palette select table, which corresponds to the 8 classic 
PPU palettes (4*2 elements in the table; therefore, a tile's attribute data 
(either PF or OBJ) is used as the index into this table). This table 
indicates which 4-color group of 64 groups in the VGA palette to use for 
color selection for the group of pixels being written. The idea is that when 
a mid-frame palette change occurs (or at any time, for that matter), the 
affected PPU palette in this table is changed to point to where the new 
palette modifications will be made in the VGA's palette. The corresponding 
VGA palette entries will also have to be updated appropriately (generally, 
VGA palette updates will be made in a ring-buffer fashion. A pointer which 
keeps track of the first available 4 palette entries will be incremented 
when any entries in a 4-color PPU palette are changed). Basically, this 
method offers the fastest possible way to render NES graphics, since data is 
fetched from pattern table memory and written directly to the virtual frame 
buffer. The number of pixels processed simultaniously can be as high as 8 
(with 64-bit integer instructions). However, the # of mid-screen PPU palette 
modifications possible is limited to 64 times (32 for PF and 32 for OBJs, 
since one of the bits in the pixel data will have to be used to distinguish 
between these types), but multiple consecutive modifications to a single 
4-color PPU palette only count as one actual modification.

The second method, which is the most straightforward, is to store the PPU's 
52-color matrix as constant data in the VGA palette registers. Before a 
pixel can be drawn, pixel color is calculated (via pattern table & palette 
select data). The PPU palette registers are looked up in some way or 
another, and the contents of the palette register element is written to a 
virtual frame buffer as the pixel data. This technique is obviously 
insensitive to mid-screen palette changes, and is also simpler to implement 
than the previous method. However, only 1 pixel can be written at a time, 
since each pixel has to be looked up into the palette. As a result, this 
method is about 4 times slower when compared to the aforementioned one. The 
remainder of the section focus on the details of this particular 
implementation, since this is the only way to emulate mid-screen palette 
changes flawlessly.

- draw the playfield tiles (including transparency pixels) to a buffer, then 
draw objects after. By doing it this way, individual tiles can be drawn 
w/out having to process object pixels at the same time (which makes color 
look-up and pixel packing easier). Since one tile can only have 4 unique 
colors, these colors can be loaded into a CPU register, and indexed by using 
a shift via CL, instead of having to re-read color values stored in memory. 
Remember- register work is always faster than memory (this is of course, 
assuming that the particular CPU's implementation of rotating via cl is an 
efficient one, otherwise it may actually be faster to do memory look-ups 
instead).

- If rendering the image in the aforementioned way, change the pattern 
tables stored in memory before starting emulation so that the bitmap for any 
scanline of a tile is stored in an 8- 2-bit packed pixel format, instead of 
the 2- 8-bit planar method used by default. By doing this, it will allow the 
tile rendering routine to easily use the 2-bit number for indexing the 4 
color palette associated with the particular tile. Of course, by changing 
the pattern tables, whenever pattern table memory is read or written to, the 
format of the data will have to be converted. Since this happens much less 
often (even in games that use CHR-RAM), it is very feasible.

- If rendering the image in the aforementioned way, render to a temporary 
memory buffer before writing it to video memory. When background priority 
objects have to be drawn, the memory will have to be read, and reading video 
memory is EXTREMELY slow. Also, by writing to a temp buffer before writing 
to video memory, you avoid the overhead of making alot of small writes out 
to the video buffer, since pixels are usually rendered in single units at a 
time. Once the data is in the buffer, it can then be copied to the video mem 
using the largest data transfer unit possible, 32-bits (or 64-bits, if using 
MMX instructions MOVQ or MOVNTQ). Remember- video memory is slower than main 
memory, and the access overhead is the same regardless of the data size, so 
writes to it should be as few and as large as possible.

- Use a linear frame buffer if possible for video memory. Besides not having 
to bankswitch, writing to it is 4 times faster than the legacy one at 
address 000A0000. As far as I understand, all PCI video cards (and 
definetely AGP) have linear frame buffers, even if the VESA BIOS on the 
older ones don't support it (but VESA 2.0 drivers can be downloaded & run 
with these cards). This will pretty much make it impossible for graphics to 
be displayed on an old ISA video card, but even if you were using the legacy 
frame buffer, the ISA video card would never be fast enough to handle the 
3.7 MB/s (minimum) transfer rate required by these graphics routines. 
Writing a PPU graphics engine which would work full-speed on an ISA video 
card would be very limited in features & support, plus would have to use VGA 
hardware for scrolling, and use the programmable interval timer to generate 
interrupts to generate split-screen effects on the VGA's display).

- Since the PPU only has the ability to produce 52 unique colors (not 
including color emphasis bits, which may allow for 8 times more colors), a 
single byte can be used for every pixel, with 2 bits left over. To draw 
background priority objects after the playfield has already been drawn, all 
non-transparent playfield pixels should use one of these 2 extra bits to 
indicate a non-transparent playfield pixel. Then the object's pixels would 
be drawn wherever there are xparent playfield pixels. However, to ensure 
that higher background priority objects are drawn over lower foreground 
objects, all object pixels drawn will set the bit to indicate a xparent 
playfield pixel.

- Avoid branches. When drawing foreground priority objects overtop the 
playfield, transparent object pixels shouldn't be drawn. Instead of using a 
branch to avoid writing the pixel, write the pixel to a dummy address, which 
is calculated via arithmetic/logical instructions.

- Inline code in small loops with a constant # of iterations, espically if 
the loop count is low, and is the most inner. This reduces overhead by 
avoiding a backwards branch, and the loop & index counters required. For 
example, when drawing tiles, it would be a good idea to inline the code to 
draw 8 horizontal pixels.

- Use CPU registers in the inner most loops. Avoid using far (seg:ofs) 
pointers (x86), since loading segment registers is slow.

- If implemented very well, the aforementioned method of PPU rendering can 
consume less than 20 million clock cycles/second on a 486 CPU (providing 
you've got a fast PCI video card).


EOF