
Ramblings of an NSF enthusiast (guide to writing an NSF player)

So this is a story all about how I tried to recommend an NES emulator as a 'confidence building' project way back when. Then the topic somewhat came up again recently, and I shifted gears to an 'NSF Player' (which is just a mini-NES emulator ... just the CPU and audio functionality -- NSF files are the music code extracted from the game).

For whatever reason... I really want someone here to do a project like this. I can't explain why. I have a weird fascination with it. It's like living vicariously through other board members. It's a little insane, I know, but at the same time I can't help it.

So I'd like to take a minute just sit right there, I'll tell you all about the basics of NES emulation.


The NES CPU has a 16-bit address bus. This means there are only 64K addresses. It operates on memory mapped I/O, so different addresses correspond to different areas of the system.

Here is a [very] simplified mapping (areas not mentioned are either "mirrors" or are open bus... I'll get into that later... maybe):

$0000 - $07FF   =   RAM
$4000 - $4017   =   APU registers (writes to here will generate audio)
$6000 - $7FFF   =   More RAM
$8000 - $FFFF   =   ROM  (ie:  the NSF file containing the code/data to play the music)

You can think of this like a giant array. The Game/NSF code will read/write different areas in this array. To emulate the system, you merely have to catch the special reads/writes and make them do stuff.
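Here's a rough sketch of that "giant array" idea in C++. The names (`Bus`, `apu_log`) are made up for illustration, and it only covers the simplified map above: no mirrors, no open bus, and the "APU" is just a log of which registers got written so you can see where the special case hooks in.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical bus for the simplified map above.
struct Bus {
    std::array<uint8_t, 0x10000> memory{};   // the "giant array"
    std::vector<uint16_t> apu_log;           // stand-in for a real APU: records register writes

    uint8_t read(uint16_t addr) const {
        return memory[addr];
    }

    void write(uint16_t addr, uint8_t value) {
        if (addr >= 0x4000 && addr <= 0x4017) {
            apu_log.push_back(addr);   // special write: a real player would update the APU here
        } else if (addr < 0x0800 || (addr >= 0x6000 && addr < 0x8000)) {
            memory[addr] = value;      // plain RAM
        }
        // anything else (including $8000-$FFFF, which is ROM) is ignored
    }
};
```

Routing every access through `read`/`write` instead of poking the array directly is what lets you "catch" the special addresses later.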


The NES has a slightly modified NMOS 6502. The main difference as far as emulation is concerned is that there's no decimal mode... so ADC and SBC behave the same whether or not the D flag is set. But "WHATEVER WTF DOES THAT MEAN I DON'T KNOW ANYTHING ABOUT ASSEMBLY USE TERMS I CAN UNDERSTAND YOU ASSHOLE".

Let's start with the registers. There's 3 'main' registers, they're called 'A', 'X', and 'Y'. Not very descriptive names but who cares, right? They basically just act as 8-bit variables. To emulate them... all you need to do is have an unsigned 8-bit var:

uint8_t A;  // <- you successfully emulated the 6502 'A' register!  congrats! 

(yes yes, A stands for "Accumulator" and X,Y are actually called "X-index" and "Y-index"... but I'm trying not to get too technical dammit)

Different instructions will do different things with the A,X,Y registers... and possibly with memory (that addressing space business I mentioned earlier).

But that's only HALF the fun! The other half is the "Addressing Modes" which determine where the instruction gets its information from. But before I get into those... let's look at an example instruction:

LDA #$06

This is the 'LDA' instruction. LDA stands for "Load A" and as you might guess, this assigns the literal value $06 to A. In C, this would be the same as:
A = 0x06;

The $ denotes hexadecimal notation. The # denotes "immediate" mode (that's an addressing mode). Without the # symbol... the instruction means something else:

LDA $06  ; <- no # symbol
// conceptual equivalent C code:
A = memory[ 0x06 ];

This is "Zero Page" mode. Unlike Immediate mode, it is not assigning 6 to A, but instead 6 is an address... and it will read from address $0006 and whatever value it reads will get put into A.

Of course... since addressing space is 16 bits, this means you can also have 16-bit addresses:

LDA $891A

This is "Absolute" mode and is completely identical to "Zero Page" mode except it takes a 16-bit address instead of an 8-bit address. Zero page is used because it's slightly faster and because it takes up less space in the program.


So this is fine and dandy and all... but reading/writing in absolute addresses doesn't give the programmer much flexibility. So this is where X and Y come in. They are used as "indexes" for instructions:

LDA $6000,X
// C equivalent
A = memory[0x6000 + X];

This is "Absolute,X" mode, and as you can see, it merely uses X to index memory... as if it were an array. The comma here denotes addition.

But even that is limiting... to really be turing complete you need to have something like those weird "pointer" things. This is done with "Indirect" addressing modes.

LDA ($10),Y
uint16_t temp_address = memory[0x10] | (memory[0x11] << 8);
A = memory[temp_address + Y];

This is "(Indirect),Y" mode. And yeah yeah I know... this is more complicated. But chillax, it's not so bad.

The (parenthesis) here denotes indirection. Indirection basically is just fetching a 16-bit pointer from the given address... then reading from THAT pointer using Y to index it.

Programs can use this to write a pointer to some kind of lookup table or data in the ROM and then use Y to index it. For example:

LDA #$A0    ; load low byte of address
STA $10     ; write it to memory[0x10]
LDA #$80    ; load high byte
STA $11     ; write it

; at this point... memory[0x0010] and memory[0x0011] form a 16 bit pointer, pointing to address $80A0

LDY #$02    ; put literal 2 in Y
LDA ($10),Y ; reads from address $80A2  (pointer address of $80A0 + contents of Y (2) = $80A2)

Indirection with X can also be done... but it's different... and stupider and much less useful:

LDA ($10,X)

Notice how the ",X" is INSIDE the parenthesis whereas the ",Y" was outside? That's because with X, the indexing is done before the indirection. So it ends up being like this:

uint16_t pointer = memory[0x10 + X] | (memory[0x11 + X] << 8);
A = memory[ pointer ];

Don't ask me to explain why anyone would find this useful. I still haven't figured it out.
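If it helps, the addressing modes so far can be sketched as little helper functions that each work out "which address does this instruction touch?". This is just one possible layout, assuming a flat `memory` array and globals named like the registers. One detail the conceptual snippets above skip: on a real 6502 the pointer fetch for both indirect modes stays inside zero page, so the address wraps from $FF back to $00 rather than spilling into $0100.

```cpp
#include <cstdint>

uint8_t memory[0x10000];   // flat 64K array, as in the examples above
uint8_t X, Y;

// Each helper returns the address the instruction should read from or write to.
uint16_t addr_zero_page(uint8_t operand)   { return operand; }
uint16_t addr_absolute(uint16_t operand)   { return operand; }
uint16_t addr_absolute_x(uint16_t operand) { return static_cast<uint16_t>(operand + X); }

// (Indirect),Y: fetch the pointer from zero page, then add Y.
uint16_t addr_indirect_y(uint8_t operand) {
    uint8_t lo = memory[operand];
    uint8_t hi = memory[static_cast<uint8_t>(operand + 1)];   // wraps within zero page
    return static_cast<uint16_t>((lo | (hi << 8)) + Y);
}

// (Indirect,X): add X first (wrapping within zero page), then fetch the pointer.
uint16_t addr_indirect_x(uint8_t operand) {
    uint8_t zp = static_cast<uint8_t>(operand + X);
    uint8_t lo = memory[zp];
    uint8_t hi = memory[static_cast<uint8_t>(zp + 1)];
    return static_cast<uint16_t>(lo | (hi << 8));
}
```

With helpers like these, `LDA ($10),Y` boils down to `A = memory[addr_indirect_y(0x10)];`.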


Remember back on the memory map... I mentioned that addresses $0000-07FF were RAM? Well one 'page' ($100 byte block) of that is "the stack" and is treated somewhat specially.

If you're unfamiliar with the concept of a stack... think of a stack of plates. You can put a plate on top (push) or you can take a plate off the top (pull). But you don't really have [easy] access to any plates other than the one on the top.

This concept of pushing/pulling values onto a stack is the same idea that is behind the 6502 stack. It's easily illustrated with 2 instructions: PHA (push A onto stack) and PLA (pull A off of stack):

LDA #$05
PHA       ;  pushes 5 onto the stack:  [bottom] $05  [top]
LDA #$03
PHA       ;  pushes 3 onto the stack:  [bottom] $05  $03  [top]

LDA #$00  ;  erase A.  A=0

PLA       ;  A=3  (value off top of stack):  [bottom] $05  [top]
PLA       ;  A=5  (value off top of stack):  [stack is empty]

To keep track of the stack... there's an 8-bit 'SP' register. This register is not directly manipulated by the code like A,X, and Y are. Instead... it's used implicitly in instructions like PHA.

The stack starts at address $01FF and grows down (so if the stack is completely full.. $01FF is the bottom of the stack, and $0100 is the top).

With that in mind...
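// PHA in C:
memory[ 0x0100 + SP ] = A;
--SP;   // stack grows down, so a push moves SP toward $00

// PLA in C:
++SP;   // a pull moves SP back up first...
A = memory[ 0x0100 + SP ];   // ...then reads the value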

On a somewhat unrelated note... in addition to A,X,Y, and SP... there's also a 'PC' register. The PC register is 16-bits and basically just tells the CPU which address it should be reading instructions from. Every time you execute an instruction, PC gets incremented to point to the next instruction.

So if PC=$8000 , this just means that the CPU will read the next instruction opcode from address $8000, then increment PC.

Like SP, the PC cannot be directly modified in the 6502 code like A,X,Y are. Instead, the program will "jump" to different areas in the code with the JMP instruction:

JMP $C000  ; jumps to address $C000
// equivalent C code:
PC = 0xC000;

JMP is basically like a goto. Except this is 6502 assembly, so it's not evil like goto in C++ is.

But JMP sucks for subroutines because you want to be able to "return" or jump back to whatever called this code. For that... there's JSR ("jump to subroutine") and RTS ("return from subroutine")

JSR functions exactly like JMP does... only it will push the PC onto the stack (in 2 bytes... since the PC is 16-bits).
RTS will just pull 2 bytes off the stack, and stick that value in the PC to jump back (basically undo-ing the JSR)
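A minimal sketch of JSR/RTS on top of push/pull helpers, assuming globals named after the registers and an SP that starts at $FF (that starting value is my assumption for the example). Each push writes then decrements SP, and each pull increments SP then reads. One wrinkle worth knowing: the 6502 actually pushes the address of the JSR's last byte rather than the address of the next instruction, and RTS adds 1 after pulling to compensate.

```cpp
#include <cstdint>

uint8_t  memory[0x10000];
uint8_t  SP = 0xFF;     // assumed starting value: empty stack points at $01FF
uint16_t PC = 0;

void push(uint8_t v) { memory[0x0100 + SP] = v; --SP; }    // write, then move down
uint8_t pull()       { ++SP; return memory[0x0100 + SP]; } // move up, then read

// JSR: PC is assumed to point at the first operand byte (just past the opcode).
void op_jsr() {
    uint16_t target = static_cast<uint16_t>(
        memory[PC] | (memory[static_cast<uint16_t>(PC + 1)] << 8));
    uint16_t ret = static_cast<uint16_t>(PC + 1);   // address of the JSR's last byte
    push(static_cast<uint8_t>(ret >> 8));           // high byte first
    push(static_cast<uint8_t>(ret & 0xFF));         // then low byte
    PC = target;
}

// RTS: pull low then high, and step past the JSR's last byte.
void op_rts() {
    uint16_t lo = pull();
    uint16_t hi = pull();
    PC = static_cast<uint16_t>((lo | (hi << 8)) + 1);
}
```

As long as your JSR and RTS agree with each other you could push any consistent value, but matching the hardware matters for the odd program that fiddles with return addresses on the stack.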


To be written. I need to take a game break.
I've been very interested in making an NES emulator since you suggested it previously, even created a repo for it, but haven't had time with school. Just took my last final for the semester so I'll have some time to work on this until next semester starts.
It's going to take some time to ingest all this information. Incredible post.
closed account (Dy7SLyTq)
im considering doing it. if i find a good enough guide or you finish this one. will you be willing to answer questions?
Great read. It's certainly an interesting project, though I'm equally as certain that I wouldn't have the time to fulfill it.

For anyone interested, though, I think playing with Assembly is a worthy and (when it works) fun endeavour. I've made a couple of 68000 games over the last couple of years and it's pretty astonishing how addicted you can get.

To support Disch's post, here are the 6502 opcodes: http://www.6502.org/tutorials/6502opcodes.html
I too would love to get into this subject. If someday I have the time I might revisit this post. Thanks for sharing!
Glad there's interest. =) And yes... you can always bug me with questions on this. I freaking love this stuff.


So with everything shown so far, you might be wondering how programs do conditional statements. Well that's an excellent question! But before I can answer it, let's take a look at yet another register (last one! I promise!)

The register is generally just called 'P' or 'ST'. It is [conceptually] 8 bits, but it is not treated as a whole like all the other regs are. Instead... it's treated as a series of flags... where each bit in the register represents a different "flag". The flag can be set (1) or clear (0) to denote the current state of the processor. Nearly all instructions will modify one or more flags in addition to their normal operation.

So what are the flags? There are 6 of them:

- N (bit 7: $80) is the "Negative" flag. It will be set if the previous operation resulted in a "negative" number. Though note, that all operations the 6502 performs are really unsigned... so nothing is ever really negative. In this context... "negative" just means "has the high bit set". That is... a result of $00-7F would be "positive" (N=0) and a result of $80-FF would be "negative" (N=1)

- V (bit 6: $40) is the "Overflow" flag. And is used to help facilitate signed computations. This is the most complicated flag, which I'll explain later. It is rarely used.

- D (bit 3: $08) is the "Decimal" flag. As I touched on previously... this flag does ABSOLUTELY NOTHING on the NES. So don't worry too much about it.

- I (bit 2: $04) is the "Interrupt Mask" flag. Interrupts (IRQs) are not necessary for emulating NSFs (and in fact.. NSFs must not support IRQs or many of them won't work)... so don't worry about this flag either.

- Z (bit 1: $02) is the "Zero" flag. It is similar to the N flag. It is set if the previous instruction resulted in a value of zero, and is clear otherwise.

- C (bit 0: $01) is the "Carry" flag. This is used mainly to perform multi-byte arithmetic. I'll explain in more detail later.

Of the above flags... N and Z are the most commonly used. Then C... followed distantly by V... and I and D are all but worthless in the NSF world.

Some guides also mention a 'B' flag to indicate BRK. They lie, don't believe them. There is no B flag.

So you might be wondering wtf do these flags have to do with anything. The reason I mention them is because there are "branch" instructions, which are effectively conditional jumps. They will only jump if the corresponding flag is set or clear.


BEQ $12   ; BEQ = "Branch if Equal to Zero"

BEQ will perform a jump only if the Z flag is set. This conditional jump is how the 6502 code performs loops and if statements.

Note that unlike JMP and JSR which take an absolute address... all branches take a RELATIVE address. This is the only place in all of 6502 where you actually use a signed number.

That is...
BEQ $12   ; Branch forward $12 bytes if Z=1

BNE $F0   ; Branch backwards $10 bytes if Z=0

(BNE = "Branch if Not Equal to Zero" -- which is the opposite of BEQ)

Equivalent C code:
// BEQ $12
if( Z )
    PC += 0x12;
// BNE $F0
if( !Z )
    PC += static_cast<signed char>( 0xF0 );  // <- make it signed so it'll be negative

There are branch operators for V, N, and C flags as well:

BCC = Branch if Carry clear (if C=0)
BCS = Branch if Carry set (if C=1)
BVC = Branch if V clear (if V=0)
BVS = (if V=1)
BPL = Branch if Plus (if N=0)
BMI = Branch if Minus (if N=1)
BNE = (if Z=0)
BEQ = (if Z=1)
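Since all eight branches do the same thing with a different condition, one helper can cover them. A sketch, assuming a flat `memory` array, a global `PC`, and that `PC` points at the operand byte when the helper runs (`branch_if` is a name I made up):

```cpp
#include <cstdint>

uint8_t  memory[0x10000];
uint16_t PC = 0;

// Shared helper for all eight branch opcodes.
void branch_if(bool condition) {
    int8_t offset = static_cast<int8_t>(memory[PC]);  // reinterpret as signed: $F0 -> -16
    ++PC;                                              // offset is relative to the NEXT instruction
    if (condition)
        PC = static_cast<uint16_t>(PC + offset);
}
```

Then BEQ is just `branch_if(Z)`, BNE is `branch_if(!Z)`, BCS is `branch_if(C)`, and so on down the list.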


As mentioned... nearly all instructions modify some flags in some way. Most of them are relatively straightforward. I'll leave a reference page to explain exactly what does what. A good reference page is here:


For example.... the simple instruction LDA, in addition to loading the A register, will also modify N and Z flags according to the value loaded:

LDA #$00  ; Z=1, N=0
LDA #$53  ; Z=0, N=0
LDA #$86  ; Z=0, N=1
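Because so many instructions update N and Z the same way, it's worth having one tiny helper for it. A sketch, with the flags kept as separate bools (`set_nz` and `lda` are hypothetical names):

```cpp
#include <cstdint>

uint8_t A = 0;
bool N_flag = false, Z_flag = false;

// Update N and Z from a result byte.
void set_nz(uint8_t value) {
    N_flag = (value & 0x80) != 0;   // "negative" = high bit set
    Z_flag = (value == 0);
}

void lda(uint8_t value) {
    A = value;
    set_nz(A);
}
```

LDX, LDY, PLA, the transfers, increments and so on can all end with the same `set_nz` call on whatever they produced.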

While N and Z are extremely straightforward... C and V might take a bit of explanation.

To explain C... let's take a look at a new instruction... "ADC". ADC stands for "Add to A with Carry". It effectively performs an addition operation. The 'carry' is added to the result (if C=1) to add an additional 1 to the sum. This is used for doing multibyte arithmetic.

Let's step back a minute and go back to elementary school arithmetic. To sum multi-digit numbers, you are taught to arrange them like so:

 + 19

You would then add the "ones" place.... and if it exceeded 10, you would add a little '1' above the tens place to indicate the carry:

  1    <- the carry
+ 19

The C flag on the 6502 serves that same function. Only instead of recording the carry between "digits" as the above example does... it records it between "bytes".

So for an example... let's say you have a 16-bit variable at address $10 (lo) and $11 (hi). And then let's say you wanted to add a 16-bit number $AABB to that. In 6502 you'd do it like so:

CLC       ; "Clear Carry", forces C=0
LDA $10   ; load low byte
ADC #$BB  ; add low byte of #$AABB to it
STA $10   ; write it back

LDA $11   ; load high byte
ADC #$AA  ; add high byte of #$AABB to it
STA $11   ; write it back

Any carry from the first ADC is stored in the C flag... then when the second ADC is performed... if C is set, it will add an additional 1 to the result of the addition.
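Here's the same two-step add sketched in C++, modelling only the carry (N, Z and V are left out to keep it short). `adc` and `add16` are illustrative names, and `add16` takes plain integers rather than reading $10/$11 so it's easy to try out:

```cpp
#include <cstdint>

uint8_t A = 0;
bool C_flag = false;

// ADC with only the carry behaviour modelled.
void adc(uint8_t operand) {
    unsigned sum = A + operand + (C_flag ? 1 : 0);
    C_flag = sum > 0xFF;                 // carry out of bit 7
    A = static_cast<uint8_t>(sum);
}

// Mirrors the CLC / ADC / ADC sequence above on a 16-bit value split into bytes.
uint16_t add16(uint16_t value, uint16_t addend) {
    C_flag = false;                                   // CLC
    A = static_cast<uint8_t>(value & 0xFF);
    adc(static_cast<uint8_t>(addend & 0xFF));         // low bytes
    uint8_t lo = A;
    A = static_cast<uint8_t>(value >> 8);
    adc(static_cast<uint8_t>(addend >> 8));           // high bytes, plus carry from the low add
    return static_cast<uint16_t>(lo | (A << 8));
}
```

For instance, adding $AABB to $1280: the low bytes give $80 + $BB = $13B, so A=$3B with C=1, and the high bytes then give $12 + $AA + 1 = $BD, for a total of $BD3B.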

The V flag is a little weirder. It's for programs that want to do signed arithmetic (ie: treat it as though A were a signed integer instead of an unsigned integer). The C flag doesn't work so well for this because it doesn't tell you when you have 'wrapped' around the signed boundary. For example:

$70 + $60 = $D0   ; <- C=0

; which, in signed decimal, would be....

112 + 96 = -48   ; <- wtf it wrapped!

Therefore the V flag will detect these kinds of signed overflow. The logic is pretty simple:

if Positive + Positive = Negative .... V=1
if Negative + Negative = Positive .... V=1
V=0 in all other instances

Remember in this context... "Negative" just means "high bit set" and "Positive" means "high bit clear".

ADC and SBC (addition and subtraction) are just about the only instructions to modify V. (BIT also modifies it... but in a completely unrelated way... nevermind that for now).

I and D flags are not modified by instructions, but can only be set/cleared explicitly by CLI/SEI/CLD/SED instructions.

That should be enough info to understand most/all of the 6502 workings. The rest can be pieced together from a reference page (cough http://www.obelisk.demon.co.uk/6502/reference.html bookmark it cough). It really isn't that complicated.

PROTIP: SBC functions identically to ADC, only the operand has all its bits flipped. That is...

ADC #$00
; is the same as
SBC #$FF

So don't let the "borrow" stuff in SBC confuse you... just call your ADC code and flip all the bits in the operand.
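Putting the V rule and the SBC trick together, a sketch might look like this. The bit-twiddling for V is one common way to express "both inputs have the same sign, and the result's sign differs"; it's my phrasing of the rule above, not the only way to write it. N and Z updates are omitted for brevity.

```cpp
#include <cstdint>

uint8_t A = 0;
bool C_flag = false, V_flag = false;

void adc(uint8_t operand) {
    unsigned sum = A + operand + (C_flag ? 1 : 0);
    uint8_t result = static_cast<uint8_t>(sum);
    C_flag = sum > 0xFF;
    // V: inputs share a sign bit, and the result's sign bit differs from it.
    V_flag = (~(A ^ operand) & (A ^ result) & 0x80) != 0;
    A = result;
}

void sbc(uint8_t operand) {
    adc(static_cast<uint8_t>(~operand));   // same adder, operand bits flipped
}
```

Trying the $70 + $60 case from above gives A=$D0 with C=0 and V=1. And with C set beforehand (the 6502's way of saying "no borrow"), `sbc(0x05)` on A=$10 leaves A=$0B.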


Now that the CPU is out of the way... let's look at big picture audio stuff! Wooooooo

For those of you who are unfamiliar with PCM... I explain the basics in this thread:

The biggest hurdle to overcome when emulating the audio is "downsampling". The NES will output one PCM sample every CPU cycle. Since it's PCM you'd think you can just dump that to a .wav file or something and play it... and normally you'd be right.. but the problem is the NES runs at 1789772.7272 cycles per second. In contrast... Most audio is played back at 44100 samples per second. So to emulate the audio you need to "scale down" how many samples are played back.

The easiest way to do this is a naive "nearest neighbor" approach. IE: capture and output one sample every 40.58 cycles (1789772.7272 / 44100 = 40.58). This will work but will create "aliasing" resulting in a very "tinny" and unpleasant sound. Nonetheless, it's not a bad way to start as it's pretty simple.

A significantly better way is to implement linear interpolation to average all the samples. So every X cycles you would sum all the output samples, then divide them by the number of samples to get the average. This will remove most of the aliasing.

An even BETTER way is to do a form of bandlimited synthesis. Which is a complex topic I'm not going to get into right now. (maybe later)

For the purposes of this guide... just remember that you need to scale back the output samples somehow.
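To make the averaging approach concrete, here's a rough sketch. It assumes you've already collected one output value per CPU cycle into a buffer (a real player would more likely accumulate on the fly), and `downsample` is a made-up name. It sums samples until the next output boundary is crossed, then emits the average:

```cpp
#include <cstddef>
#include <vector>

// Averages every ~40.58 input samples into one output sample.
// `cpu_samples` is assumed to hold one APU output value per CPU cycle.
std::vector<double> downsample(const std::vector<double>& cpu_samples,
                               double cpu_rate = 1789772.7272,
                               double out_rate = 44100.0) {
    const double cycles_per_sample = cpu_rate / out_rate;
    std::vector<double> out;
    double next_boundary = cycles_per_sample;  // cycle count at which the next output is due
    double sum = 0.0;
    std::size_t count = 0;

    for (std::size_t i = 0; i < cpu_samples.size(); ++i) {
        sum += cpu_samples[i];
        ++count;
        if (static_cast<double>(i + 1) >= next_boundary) {
            out.push_back(sum / static_cast<double>(count));
            sum = 0.0;
            count = 0;
            next_boundary += cycles_per_sample;
        }
    }
    return out;
}
```

Because 40.58 isn't a whole number, some outputs average 40 cycles and some 41; tracking the boundary as a double is what keeps the long-run rate at 44100. The "nearest neighbor" version is the same loop with the sum thrown away and only the latest sample kept.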

The NES has 5 audio channels:
2 'Pulse' channels
1 'Triangle' channel
1 'Noise' channel
1 'DMC'

The concept behind all of them is the same, but their functionality differs slightly.

A great detailed reference page to NES audio is here:


Bookmark it.


To generate a tone, each channel consists of 2 main parts:

- A period divider
- A tone generator (on the triangle channel, this is the Tri-Step generator)

The idea is... every CPU cycle, the period divider is "clocked", which counts down an internal counter. Once that counter wraps, the period divider clocks the tone generator... which alters the channel's output.

Lower values for the period divider mean the tone generator is clocked more frequently... which results in the waveform cycling faster... which results in a higher pitch tone.

Higher values for the period divider mean the tone generator is clocked less frequently, which results in a lower pitch tone.

So the period divider effectively controls the tone of the channel. This can be calculated with the below formula:

       CPU_CLOCK
Hz =  -----------
         P * X

      CPU_CLOCK
P =  -----------
       Hz * X

Hz = the output tone in Hz (ex: Middle C = 261.625565 hz)
CPU_CLOCK = the CPU clock rate (1789772.7272)
P = the period divider
X = the number of clocks required to fully cycle the tone generator (on the tri channel, this is 32)

The period divider is set via audio registers $400A and $400B for the triangle.

So if the NSF wants to play a middle-C tone... it would write $00D6 to the period divider regs (do the math from the above formulas if you want to see how I came up with that number):

LDA #$D6
STA $400A
LDA #$00
STA $400B   ; play middle C on the triangle!
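If you'd rather let the computer do the math, the two formulas translate directly. This sketch follows them exactly as written above (`period_for` and `hz_for` are hypothetical names); depending on how you model the divider's reload, the real register value may need an off-by-one tweak, so check the APU reference before trusting it to the last digit.

```cpp
#include <cmath>
#include <cstdint>

constexpr double CPU_CLOCK = 1789772.7272;

// P = CPU_CLOCK / (Hz * X), rounded to the nearest whole period.
uint16_t period_for(double hz, int steps) {
    return static_cast<uint16_t>(std::lround(CPU_CLOCK / (hz * steps)));
}

// Hz = CPU_CLOCK / (P * X)
double hz_for(uint16_t period, int steps) {
    return CPU_CLOCK / (static_cast<double>(period) * steps);
}
```

For middle C on the triangle, 1789772.7272 / (261.625565 * 32) comes out to about 213.78, which rounds to 214 = $D6. Going the other way, a period of $D6 gives roughly 261.36 Hz, so the note is very slightly flat, which is the best a whole-number period can do.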

So how does this actually generate the tone? What does the tone generator do? Well each channel does something different, which is how they all have their unique sound. The triangle channel's tone generator is the Tri-Step generator, which basically forms the below sequence:

0123456789ABCDEFFEDCBA9876543210... then repeats this pattern

The above sequence is what the channel outputs (ie: that's the generated PCM sample). That is... every sample, the current state of the Tri-step generator is output. It's that simple. CPU drives the period divider -> which drives the Tri-Step -> which outputs audio samples.

The tri-step can be easily emulated any number of ways. I personally tend to keep a counter and increment it each time it's clocked by the period divider. Bit 4 (0x10) of the counter is set when it's counting down:

// every time the tri-step is clocked
tristep = (tristep + 1) & 0x1F;

// when determining channel output
if( tristep & 0x10 )     output = tristep ^ 0x1F;
else                     output = tristep;
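And here's how the period divider and the tri-step might fit together in one struct. Names are made up, and the reload logic is an assumption chosen so the tri-step gets clocked once every `period` CPU cycles (matching the formula above); the length counter and linear counter that the real triangle also has are left out entirely.

```cpp
#include <cstdint>

// Hypothetical triangle channel: just the period divider and the tri-step.
struct Triangle {
    uint16_t period = 0;    // value written to $400A/$400B
    uint16_t counter = 0;   // period divider's internal countdown
    uint8_t  tristep = 0;   // 5-bit position in the 32-step sequence

    // Call once per CPU cycle.
    void clock() {
        if (counter > 1) {
            --counter;                                            // still counting down
        } else {
            counter = period;                                     // wrapped: reload...
            tristep = static_cast<uint8_t>((tristep + 1) & 0x1F); // ...and clock the tri-step
        }
    }

    // Current output sample, 0-15.
    uint8_t output() const {
        return static_cast<uint8_t>((tristep & 0x10) ? (tristep ^ 0x1F) : tristep);
    }
};
```

Call `clock()` once per emulated CPU cycle and feed `output()` into whatever downsampling you're using.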
closed account (Dy7SLyTq)
a) what flavor of assembly is this? or does it not matter and the point was to do it in c?

b) where can i learn to use the bitwise and bit shifting operators & | ^ >> << because it looks very important
what flavor of assembly is this? or does it not matter and the point was to do it in c?

6502 assembly. Although the point is to do it in C/C++. The goal is to simulate a 6502 CPU and execute 6502 machine code. So you wouldn't be writing assembly... you would be executing existing assembly.

b) where can i learn to use the bitwise and bit shifting operators & | ^ >> << because it looks very important

It is.

I explain most of it here:
All this is way over my head. This is definitely challenging because after reading all that, almost none of it makes a bit of sense to me. Think my head exploded by the time I was done reading.
It's the challenging tasks that will make you a better programmer.
Yeah, challenging tasks. After reading this, the original thread, NES Emu dev wiki links, and PMs from Disch; I still haven't the slightest clue where to begin. That is how challenging it is in my mind and makes any interest it poses to me completely disappear.

I get told to make this sort of thing for confidence building and in truth it does just the opposite. I look at the NSF Player and NES Emulator and just immediately start thinking "I can't do this."

To everyone else good luck with this project. I'm removing myself from the thread.
closed account (Dy7SLyTq)
@bhx: if it helps... dr. paul carters book on assembly is very helpful. it taught me intel assembly using nasm. it looks different from this assembly, but the core practices look the same. and doing stuff over my head is the best way to learn for me so you could probably get it too.

@ihutch: thanks for the link. it really helped
I still haven't the slightest clue where to begin

Start with the CPU.

The core concept is simple.

1) Give yourself 0x10000 bytes of "memory"
2) Keep track of the PC (a 16-bit variable)
3) Read a byte from memory from the current PC (ie: op = memory[PC];)
4) Increment PC to point to next byte in memory
5) 'op' (the byte read) is your opcode. Put that in a big switch to determine what instruction it is
6) Do whatever task that instruction does in 6502 .... but do it in C.
7) Repeat.

Just getting a functioning CPU will get your brain where it needs to go for this.

void emulate6502()
{
    // execute 100 commands:
    for(int i = 0; i < 100; ++i)
    {
        uint8_t op = read_from_memory( PC );   ++PC;   // fetch opcode, step past it

        switch( op )
        {
        case 0xA9:   // LDA immediate
            A = read_from_memory( PC );   ++PC;   // operand is the byte after the opcode
            N_flag = (A & 0x80) != 0;  // N set if high bit is set
            Z_flag = (A == 0);   // Z set if zero
            break;

        // ... one case per opcode ...
        }
    }
}

Of course... rather than inlining everything in the switch like I'm doing there, you might want to write functions to do stuff.

It's OK if you don't understand it, BHX. That's why you should ask questions. I love answering these questions... it's super fun for me.
Just an uninformed suggestion for anyone attempting this, either use something like #define LDA 0xA9 or map the instructions or something to make the opcodes more readable for yourself. But if you try to memorize the instruction set first you'll never get anything done.
closed account (Dy7SLyTq)
@disch: how do i give myself 0x10000 bytes of memory. i tried converting the value to decimal and got a number like 65 k or something like that. what do you mean by pc? what does it stand for?

@computergeek: makes sense but where would i find the other commands? i dont know the hex codes for the commands
DTSCode wrote:
@disch: how do i give myself 0x10000 bytes of memory. i tried converting the value to decimal and got a number like 65 k or something like that.
char memory[65536];//?
what do you mean by pc? what does it stand for?
Program counter, I believe it tracks the current instruction.
closed account (Dy7SLyTq)
why would i have it that size though if it its only going to execute 100 commands? and thank you. program counter makes sense
I'm pretty sure the 100 was an example and there will be much more than 100 commands.
closed account (Dy7SLyTq)
ok thanks that makes more sense