Simple CPU :) - updated 30/10/2019

Figure 1 : Simple CPU

Modern CPU's are complex beasts, highly optimised and tricky to understand. This makes it very difficult to see why one was constructed in the way it was. Part of the problem is the requirement for backwards compatibility i.e. a new processor has to be able to run code from the previous generations. This incremental development can result in a very confusing instruction-set and a 'cluttered' hardware architecture. Also, when you do start to dig down into the literature a lot of things are not always fully disclosed i.e. to protect those industrial secrets. However, at their hearts all processors are simple machines and in some respects have not changed that much since the 1940s i.e. they still use instructions and perform a Fetch-Decode-Execute cycle. Therefore, to prove that processors are actually very simple to build and understand i developed a simple 8bit CPU architecture that can be implemented in an FPGA. To be honest it wasn't really designed, it evolved, therefore, the hardware could be optimised quite a bit. However, the aim was to break the processor down into its fundamental building blocks i.e. Boolean logic gates. Then combining these to form more complex components e.g. adders, multiplexers, flip-flop, counters etc, which are at the heart of any computer. The basic block diagram of this computer is shown in figure 1, a very simple machine, made from registers, multiplexers and an adder. The operation of this machine and its components will be explained from the bottom up, logic gates to "Hello World", implemented in a Spartan 3 FPGA, its hardware defined in schematics. The source files for each version of this processor can be downloaded (links in each section) and simulated using the Xilinx ISE 14.7 ISim tool.

Back to Basics - simpleCPU_v1
Testing: "Hello World"
Testing: fully tested :)

NOTE, some of the circuit diagrams are quite large, therefore, you may not be able to see signal / component names if viewed from this page. However, if you click on an image you will get a high resolution image, but some are quite large, so may take a while to load. PDFs are also available for some images.

Back to Basics - simpleCPU_v1

Figure 2 : Logic gates, symbols (top), waveform (bottom)

At its lowest level every operation in the computer is performed by logic gates like those shown in figure 2, for more information on Boolean logic refer to: (Link) (Link). Using these basic building blocks we can construct the four fundamental components of any computer:

Logic: every block within the computer can be considered to be made from Boolean logic gates, however, this category refers to specific larger logic blocks e.g. adders, address decoders, instruction decoders etc.
Multiplexers: from one point of view a computer just moves information from one point to another. Controlling the path taken by this information are multiplexers: switching junctions, allowing information to be passed between functional blocks.
Registers: fast, short term memory. As part of the Fetch-Decode-Execute cycle a computer needs to remember its state, the instruction to be processed and any results generated.
Memory: this computer uses a classic Von Neumann architecture i.e. one memory, storing both the program (instructions) to be executed and the data to be processed in the same memory device.

Referring back to figure 1 you can see that this processor has three multiplexers (MUX) controlling the data and address buses. Note, in its simplest form a bus is just a group of wires carrying a binary numbers (Link ). Multiplexers are switches allowing the processor to select information from multiple sources (buses) and route it to a single destination. To select which data source should be used a multiplexer has one or more control lines as shown in figure 3. This 2:1 MUX has two data inputs (A,B), one output (Z) and an input (SEL), selecting which one of the two data inputs should be connected to its output, as defined by its truth table shown in the bottom of figure 3. When SEL=0 input A is connected to the output Z. When SEL=1 input B is connected to the output Z.

Figure 3 : 2:1 bit multiplexer, circuit diagram (top), truth table (bottom)

This multiplexer can select between two bits (binary digits), however, within the processor we need to select between 8 bit (byte) buses i.e. alternative data or address values. To achieved this the multiplexer is replicated eight times i.e. one per bit in the bus, as shown in figure 4. Note, each mux2_1_1 rectangle (symbol) contains the circuit shown in figure 3. To save space i have only shown the first three multiplexers, a full circuit diagram is available here: (Link ). The circuit symbol for this 8bit multiplexer is shown in figure 5, it's interface has three 8 bit buses (thick lines) and one signal (thin lines):

A(7:0) - input bus, eight bit values, bits labelled A7,A6,A5,A4,A3,A2,A1,A0
B(7:0) - input bus, eight bit values, bits labelled B7,B6,B5,B4,B3,B2,B1,B0
Z(7:0) - output bus, eight bit values, bits labelled Z7,Z6,Z5,Z4,Z3,Z2,Z1,Z0
SEL - input signal, Boolean, selects which input is connected to the output

Figure 4 : 2:1 byte multiplexer circuit diagram (first three stages only)

Figure 5 : 2:1 bye multiplexer, symbol (top), waveform (bottom)

Note, this approach of using hierarchical schematics is a good example of a fundamental design principle in computer science i.e. abstraction. Identifying key functionality, separating these out into distinct components, that can then be reused in the construction of other components or systems, simplifying schematic layout, construction and hides unnecessary details.

To state the obvious designing complex computing systems is a complex process, there is a large gap between user requirements and system realisation which is "difficult" (impossible) to perform in a single step. To simplify this design process we break the system down in to a number of functional blocks e.g. the top level block diagram shown in figure 1. We can then incrementally refine these components to a final implementation whilst ignoring irrelevant details.

In addition to 2:1 multiplexers shown in figure 1 the ALU also needs a 4:1 multiplexer (discussed later) i.e. a multiplexer that has four inputs and one output. This can be constructed from three 2:1 byte multiplexers, as shown in figure 6. Note, each mux2_1_8_v1 rectangle (symbol) contains the circuit shown in figure 3 i.e. 3 x 8 x 4 = 96 logic gates. The Xilinx schematics and symbols for these multiplexers can be downloaded here: (Link ). For more information on multiplexers refer to : (Link ).

A(7:0) - input bus, eight bit values, bits labelled A7,A6,A5,A4,A3,A2,A1,A0
B(7:0) - input bus, eight bit values, bits labelled B7,B6,B5,B4,B3,B2,B1,B0
C(7:0) - input bus, eight bit values, bits labelled C7,C6,C5,C4,C3,C2,C1,C0
D(7:0) - input bus, eight bit values, bits labelled D7,D6,D5,D4,D3,D2,D1,D0
Z(7:0) - output bus, eight bit values, bits labelled Z7,Z6,Z5,Z4,Z3,Z2,Z1,Z0
SEL0 - input signal, Boolean, first stage selection
SEL1 - input signal, Boolean, second stage selection

Figure 6 : 4:1 byte multiplexer, circuit diagram and truth table

All number crunching is performed in the Arithmetic and Logic Unit (ALU), implementing the central processing units (CPU) main arithmetic functions. To implement an ADD function (one of the core instructions of any computer), there are a number of different hardware solutions, each having advantages and disadvantages. However, to avoid an in depth discussions of their relative merits i'm going to stick with the basic half adder, as shown in figure 7. For more information on adders refer to: (Link ) (Link ). This circuit adds together two bits A and B, producing a Sum and Carry. Working through the truth table you can see the addition process 0+0=0, 0+1=1, 1+0=1. As this is a binary (base 2) machine, the maximum value any digit can store is 1, therefore, when A=B=1 the Sum output can not represent the result of 2, so a Carry is generated. This Carry would then be added to next digit in the number.

Figure 7 : Half adder, circuit diagram (top), truth table (bottom)

On its own a half adder is not that useful as we need to add together 8bit numbers, but it can be used as the building block for a full adder, as shown in figure 8. Note, each half_adder rectangle (symbol) contains the circuit shown in figure 7. This circuit can add together three bits: A,B and Cin. Another way to think about this circuit is that it counts the number of 1's:

0+0+0=00 : all zeros, result 00
0+0+1=01 : one 1, result 01
0+1+1=10 : two 1s, result 10, or decimal 2
1+1+1=11 : three 1s, result 11, or decimal 3

Figure 8 : Full adder, circuit diagram (top), truth table (middle), waveform (bottom)

Again at first glance the ability to add three bits together may not seem that useful, however, if we replicate this hardware eight times, connecting the Carry out (Cout) of the previous stage to the Carry in (Cin) of the next we can build a Ripple adder, as shown in figure 9. Note, again each rectangle (symbol) contains a sub-circuit, in this case full_adder shown in figure 8. To save space i have only shown the first three full adders, a full circuit diagram is available here: (Link ). An alternative way of considering this hardware is the pseudo code shown in figure 10. Each full adder is a modulus 2 adder, conceptually the addition process starts with the least significant digit (LSD) and 'ripples' through the hardware to the most significant digit (MSD) i.e. bits X0 and Y0 are added together to produce a Sum Z0 and a Carry C1, this Carry is then added to the next significant digits X1 and Y1 etc. This sequential behaviour does limit the hardware's performance. However, don't forget that the hardware associated with each digit's addition is all working in parallel e.g. best case performance: 987+12=999, no carries would be generated, the additions of 9+0, 8+1 and 7+2 would all be performed in parallel and complete in one unit of time. However, worst case performance: 999+1=1000, this would take four units of time owing to the carries having to ripple through the hardware from the LSD to the MSD. Note, its silly to say, but its important to remember that hardware is not software. When analysing hardware some elements may have a sequential behaviours, but ALL logic gates will be working in parallel (are active all the time).

Figure 9 : Ripple adder circuit (first three stages only)

Figure 10 : Ripple adder architecture

Figure 11 : Ripple adder symbol

The top level circuit symbol for the ripple adder is shown in figure 11 (containing the circuit shown in figure 9), this symbol's interface has three 8bit buses (thick lines) and two signals (thin lines):

A(7:0) - input bus, eight bit values, bits labelled A7,A6,A5,A4,A3,A2,A1,A0
B(7:0) - input bus, eight bit values, bits labelled B7,B6,B5,B4,B3,B2,B1,B0
Cin - input signal, Boolean, Carry input from previous calculations
Cout - output signal, Boolean, Carry out from ripple adder

The complete ALU circuit diagram is shown in figure 12. With a few small modifications the ripple adder can be used to implement a subtraction function. The subtraction hardware can be implemented using 2s complement (Link ) i.e. subtraction by the addition of negative numbers. To generate a negative number each bit is inverted and then 1 is added to the result, as shown in the example below.

    123     45 = 000101101   123 = 01111011      001111011 
  -  45    inv = 111010010                     + 111010011
  -----     +1 = 111010011                     -----------
     78                                          001001110  = 78
  -----                                        -----------
                                                11111   1

To invert each bit an array of XOR logic gates are used, bitwise_inv_v1, as shown in figure 13. This circuit has an 8bit input bus (A), each bit is XORed with the signal EN. XORing a bit with 0, returns the same value. XORing a bit with 1, returns the inverse of that value. To add 1 the Carry-In (Cin) signal to the ripple adder is set to 1, incrementing the final result. In the ALU this is controlled using signals S2 and S3. The ADD and SUB functions can be simulated using the Xilinx ISim software tools, as shown in figure 14. In this waveform diagram the calculations 123+45 and 123-45 are performed. Remember, when you use a 2's complemented representation in a calculation the final Carry-out is ignored. The Xilinx schematics, symbols and VHDL testbench for this ALU can be downloaded here: (Link ).

  A B Z
  0 0 0       A XOR 0 = A
  0 1 1       A XOR 1 = NOT A
  1 0 1
  1 1 0

Figure 12 : ALU circuit diagram

Figure 13 : Bitwise XOR

Figure 14 : ALU simulation, ADD and SUB

The Carry-in input can also be used to perform an increment function i.e. Z=A+0+1. This is quite a common requirement within a program e.g. incrementing a counter/variable, or the processor's program counter. To achieve this function the B input of the adder needs to be set to zero. To do this the replicate_v1 and bitwise_and_v1 circuits shown in figures 15 and 16 are used. The replicate_v1 component uses buffers (BUF) to drive the same signal onto each bit of its output bus Z. These signals are then ANDed with the data on the B input of the ALU. If they are ANDed with 1, the value on the B input of the adder is unaffected. If they are ANDed with 0, the value on the B input of the adder is set to zero. In the ALU this is controlled using signals S2 and S4. The INC function can be simulated using the Xilinx ISim software tools, as shown in figure 17. In this waveform diagram the calculations 123+1 and 45+1 are performed.

Figure 15 : replicate

Figure 16 : Bitwise AND

Figure 17 : ALU simulation, INC

The bitwise_and_v1 component is also used in the ALU to perform the bitwise AND function i.e. Z=A AND B, as shown in the example below. Note, at first glance this may not see that useful a function, but as we will see later its used a lot in programs to test the state of the internal registers etc.

  A B Z                           10101010
  0 0 0       A AND 0 = 0       & 11110000
  0 1 0       A AND 1 = A       ----------  
  1 0 0                           10100000
  1 1 1                         ----------

To select the ALU's function the final 4:1 multiplexer is used, control lines S0 and S1 selecting either the adder's output, bitwise AND output, input A or Input B. The full set of control signals are shown below.

  S4 S3 S2 S1 S0  Z            
  0  0  0  0  0   ADD (A+B)
  0  0  0  0  1   BITWISE AND (A&B)
  0  0  0  1  0   INPUT A
  0  0  0  1  1   INPUT B
  0  1  1  0  0   SUBTRACT (A-B)
  1  0  1  0  0   INCREMENT (A+1)
  1  0  0  0  0   INPUT A
  0  0  1  0  0   ADD (A+B)+1
  0  1  0  0  0   SUBTRACT (A-B)-1

Figure 18 : Arithmetic and Logic Unit (ALU) symbol

Computer's execute instructions using the Fetch-Decode-Execute cycle (Link), therefore, the processor must remember what phase it is in so that it can progress to the next. This temporary memory is implemented using flip-flops, each storing 1 bit of data, as defined by the state table shown in figure 19. A flip-flop has an input pin D and an output pin Q, the value on D is written to Q when there is a change from a logic 0 to a logic 1 on the CLK pin. Another way to think about the CLK pin is that it is the write, or update control signal i.e. the CLK line is pulsed to store a value. Owing to electronic reasons which i will quickly skip over all CLK lines must be connected to the same system clock i.e. a global square wave signal that determines the operating speed of the processor. This would mean that every flip-flop would update its output every clock cycle, which would not be very useful. Therefore, to control when different flip-flops update their outputs we use the clock enable input pin CE. If CE=0 the CLK pin is ignored. If CE=1 the CLK pin is used.

Figure 19 : D-type flip-flip

Within the processor temporary values are stored in registers, within a register each bit is stored in a flip-flop, as shown in figures 20, 21 and 22. To make larger registers multiple smaller registers are grouped together. Note, the rectangular components register_4 and register_8 contain the circuit diagrams shown in figure 20 and 21 respectively. The operation of the 8bit register can be simulated using the Xilinx ISim software tools, as shown in figure 23. In this waveform diagram the values 123 and 45 are stored in the register, using the CLK, CLR and CE lines to control when these values are updated. The Xilinx schematics, symbols and VHDL testbench for these registers can be downloaded here: (Link ).

Figure 20 : Four bit register, symbol (left), circuit (right)

Figure 21 : Eight bit register, symbol (left), circuit (right)

Figure 22 : Sixteen bit register, symbol (left), circuit (right)

Figure 23 : Eight bit register simulation

As shown in figure 1 this processor has three main registers:

Program Counter (PC) : 8 bit register used to store the address in memory of the current instruction being executed.
Instruction Register (IR) : 16 bit register, updated at the end of the fetch phase with the instruction to be processed (decoded and executed).
Accumulator (ACC) : 8 bit register, a general purpose data register, providing data (operand) to be processed by the ALU and used to store any result produced. Note, we can only store one 8 bit value at a time on the processor, other data values will need to be buffered in external memory.

Flip-flops are also used to generate the sequence of control signals needed to perform the functions defined by each instruction. These are contained within the decoder or control-logic block (figure 1), the circuit diagram symbol for this component is shown in figure 24. Inside this component are the instruction decoding logic and sequence generators needed to control the processor, as shown in figure 25. A high resolution diagram can be downloaded here: (Link ).

MUXA : output, ALU A input MUX control
MUXB : output, ALU B input MUX control
MUXC : output, address MUX control, selecting PC or IR
EN_DA : output, accumulator (ACC) register update control
EN_PC : output, program counter (PC) register update control
EN_IR : output, instruction register (IR) update control
RAM_WE : output, memory write enable control
ALU_S0 : output, ALU control line
ALU_S1 : output, ALU control line
ALU_S2 : output, ALU control line
ALU_S3 : output, ALU control line
ALU_S4 : output, ALU control line
IR : input bus, 8bits, high byte of instruction register, contains opcode
ZERO : input, driven by 8bit NOR gate connected to ALU output, if 1 indicates result is zero
CARRY : input, driven by carry out (Cout) of ALU
CLK : input, system clock
CE : input, clock enable, normally set to 1, if set to 0 processor will HALT
CLR : input, system reset, if pulsed high system will be reset

Figure 24 : Decoder symbol

Figure 25 : Decoder circuit diagram

This processor has a slightly modified Fetch-Decode-Execute cycle. To save an adder (reduce hardware) i added an additional phase, it now has a Fetch-Decode-Execute-Increment cycle, the final phase incrementing the PC to the address of the next instruction (when needed).

Note, normally the PC update would be performed in the Fetch/Decode phase, i moved this into a separate phase to highlight these types of "housekeeping" operations are needed in the Fetch-Decode-Execute cycle i.e. you need to maintain the machinery within the processor that allows you to crunch instructions and data.

To identify which phase the processor is in a sequence generator is used, as shown in figures 26 and 27. This is a simple ring counter (Link), using a one-hot (Link) encoded value to indicate the processor's state, as shown in figure 28. Initially the value 1000 is loaded into the counter (fetch code), on each clock pulse the one-hot bit is then moved along the flip-flop chain, looping back to the start after four clock cycles. To determine the processor's state you simply identify which bit position is set to a logic 1. One-hot encoding i.e. when a number only ever has one bit set, is very easy to decode in hardware, but its sparse encoding means you need a lot of bits to represent larger values as not all bit states are used. A four bit one-hot value can represent 4 states, or the value 0 to 3, using a binary encoding you could represent the value 0 to 15, but you would need to decode all bits to determine its value, where as using one-hot you only need to look at one bit.

1000 : Fetch
0100 : Decode
0010 : Execute
0001 : Increment

Figure 26 : Sequence generator symbol

Figure 27 : Sequence generator

Figure 28 : Sequence generator simulation

During the Fetch phase the current instruction pointed to by the program counter (PC) is loaded into the instruction register (IR). Then in the decode phase the high byte of the 16bit, fixed length instruction is decoded by the instruction decoder shown in figure 29. A high resolution diagram can be downloaded here: (Link ). To simplify construction this processor has a very limited instruction set:

Load ACC kk : 0000 XXXX KKKKKKKK

Add ACC kk : 0100 XXXX KKKKKKKK

And ACC kk : 0001 XXXX KKKKKKKK

Sub ACC kk : 0110 XXXX KKKKKKKK

Input ACC pp : 1010 XXXX PPPPPPPP

Output ACC pp : 1110 XXXX PPPPPPPP

Jump U aa : 1000 XXXX AAAAAAAA

Jump Z aa : 1001 00XX AAAAAAAA

Jump C aa : 1001 10XX AAAAAAAA

Jump NZ aa : 1001 01XX AAAAAAAA

Jump NC aa : 1001 11XX AAAAAAAA

In this instruction syntax X=Not used, K=Constant, A=Instruction Address, P=Data Address. The complexity of an instruction is also defined by its addressing mode i.e. not just how much number crunching it does, but how it fetches its operands (data). Again, to simplify the required hardware these instructions are limited to simple addressing modes:

Immediate : operand KK, a constant value, this is immediately available as it is stored in the instruction register i.e. part of the instruction read during the fetch phase.
Absolute : operand AA or PP, an address in memory. Again, the address is stored in the instruction register, either specifying where the data to be processed is stored i.e. an INPUT instruction, or where a result produced should be stored in memory i.e. an OUTPUT instruction. Also used to define the address in memory of the next instruction to be fetched i.e. JUMP.

The register transfer level (RTL) description of each of these instructions is:

Load ACC kk : ACC <- KK

Add ACC kk : ACC <- ACC + KK

And ACC kk : ACC <- ACC & KK

Sub ACC kk : ACC <- ACC - KK

Input ACC pp : ACC <- M[PP]

Output ACC pp : M[PP] <- ACC

Jump U aa : PC <- AA

Jump Z aa : IF Z=1 PC <- AA ELSE PC <- PC + 1

Jump C aa : IF C=1 PC <- AA ELSE PC <- PC + 1

Jump NZ aa : IF Z=0 PC <- AA ELSE PC <- PC + 1

Jump NC aa : IF C=0 PC <- AA ELSE PC <- PC + 1

Note, in RTL syntax "<-" reads as "updated with", M[X] is memory, addressed at location X. For more information on addressing modes refer to: (Link).

The top 4 to 6 bits of each instruction defines the opcode: an unique binary pattern that allows the CPU to identify what function needs to be performed, where the data (operands) are and where any result produced should be stored. Note, the top nibble (4bit) of each instruction is unique to that instruction. For those of you who know your machine code you will recognise these instructions are based on the original PicoBlaze machine code (Link), another nice small 8 bit processor that i have used in Lectures. The instruction decoder shown in figure 29 converts the unique opcode into a one-hot value, these are then used during the Decode and Execute phases to control the processor's hardware. To ensure these signals are not active during the Fetch and Increment phases they are ANDed (zeroed) with the result of the logical OR of the Decode and Execute signals from the sequence generator, as shown in figure 30.

Figure 29 : Instruction decoder

Note, the AND gates in figure 29 decode the opcode field i.e. each zero in the opcode nibble is passed through a NOT gate, such that when a specific opcode is decoded only one AND gate will have all of its inputs driven to a logic one i.e. there will never be two AND gates producing a logic one at the same time (a one-hot encoded output).

Figure 30 : Complete instruction decoder, circuit (top), waveform (bottom)

The processor supports unconditional and conditional JUMP instructions. The conditional JUMP instructions are based on the result of the last ALU operation i.e. the ZERO and CARRY bits for the ADD, SUB and AND instructions. These are stored in a 2bit register, as shown in figure 31. Note, the ZERO bit is generated by an 8bit NOR gate connected to the ALU's output bus i.e. a NOR only produce a 1 when all of its inputs are 0.

Figure 31 : Status register

The sequence_generator, instruction_decoder and status_register form the core elements of the processor's control unit (decoder block in figure 1). The signals from these units are used to generate the control signals for the system MUXs, ALU and REGs, as shown in figure 32. This table defines the state of each control signal, for each phase, to implement that instruction's function. This table also defines the control signals needed for the Fetch and Increment phases. The logic that drives these control signals (right hand side of figure 25) is derived from this table.

Figure 32 : Control signals

Most of this control logic is quite intuitive, you simply combine the one-hot output from the Decoder with the state bits from the Ring-counter to produce the logic 1's in each row of the table in figure 32. A slightly more complex bit is the Jump logic shown in figure 33. If the processor is in the Execute phase, the instruction decoder and status signals determine if the program counter (PC) should be updated i.e. should the jump address be loaded into the PC. If a JUMP instruction is taken, then the system does not need to increment the PC, as it already contains the address of the next instruction. Therefore, when the processor is in the Increment phase it checks to see if a jump has been taken, if it has been the PC is not enabled i.e. the result PC+1 is not stored in the program counter.

Figure 33 : Jump logic

Figure 34 : test code

To illustrate the operation of this computer, consider the simple program in figure 34. This program adds two 8 bit unsigned numbers together and tests for an overflow. If an overflow occurs the result is saturated i.e. set to its maximum value (255). The step by step analysis of the execution of this program is shown in figures 35 to 37, adding the data values 250 and 10. This will cause an overflow to occur, resulting in the final result being saturated to 255. If however the data values used are 250 and 5, no overflow will occur as shown in figure 37. Note, in each phase the active address and data paths are highlighted dark BLUE.

Figure 35 : steps 1 to 5

Step 1 : reset, clear line pulsed to reset all flip-flop, initialising all registers to their default values. First fetch phases initiated, ADDR MUX selects PC i.e. the value 0, memory address 0 read, first instruction (INPUT) stored in IR.
Step 2 : decode, opcode field decoded, instruction identified as an INPUT instruction, control-logic configures data paths to route absolute address (6) stored in IR to address bus using ADDR MUX. Memory read, data at address 6 accessed (250), data driven onto data-in bus, routed through DATA MUX to input of ACC.
Step 3 : execute, ACC updated with accessed data, at the end of this phase ACC = 250.
Step 4 : increment, instruction completed, processor needs to increment PC to the address of the next instruction. Control-logic configures ALU to perform increment function, PC routed to ALU, value incremented and stored back into PC.
Step 5 : fetch, instruction at address 1 read (ADD) and stored in IR.

Figure 36 : steps 6 to 11

Step 6: decode, opcode field decoded, instruction identified as an ADD instruction, control-logic configures data paths to route immediate data (10) stored in IR to ALU. Second operand stored in ACC also routed to ALU. Control-logic configures ALU to perform add function.
Step 7: execute, addition of two numbers performed i.e. 250+10=260. As this is an 8bit machine it can not store the 9bit result in the ACC i.e. an overflow has occurred (max value for 8bits is 255). Therefore, the carry flag is set and the value 4 stored in the ACC. Note, the reason 4 is stored is that a nine bit result is produced i.e. 260 = 256+4 = 100000100. The ninth bit is effectively the carry bit, leaving the lower 8 bits in the ACC (00000100). After the ACC is updated the PC is incremented to the next address (2).
Step 8: fetch, decode, execute. The conditional jump instruction JUMPNC is "executed", as a carry was generated the "jump if no carry" does not jump to address 4, therefore the PC is incremented to the address of the next instruction (3).
Step 9: fetch, decode. The next instruction is fetched and decoded, instruction identified as a LOAD instruction, control-logic configures data paths to route immediate data (255) stored in IR to ALU. Control-logic configures ALU to perform pass through function (data unaltered).
Step 10: execute, ACC updated with the value 255. Next, PC is incremented to the next address (4).
Step 11: fetch, decode. The next instruction is fetched and decoded, instruction identified as an OUTPUT instruction, control-logic configures data paths to route absolute address (7) through ADDR MUX to address bus. ACC data passed to memory using data-out bus. Memory control bus set to indicate a write operation.

Note, a common misunderstanding is the differences between the LOAD and INPUT instructions. For this processor LOAD does not read memory. To read the contents of an external memory location we would use the INPUT instruction. The LOAD instruction is an internal operation copying the immediate value stored in the IR to the ACC, in the above example the constant 255. I agree is this confusing if you have used other processors, but for this processor i decided to follow PicoBlaze syntax :)

Figure 37 : steps 12 to 17

Step 12: execute, data on data-out bus (255) written to address 7. At the end of the write operation PC incremented to the address of the next instruction (5).
Step 13: fetch, decode. instruction at address 5 read (JUMP) and stored in IR. Opcode field decoded, instruction identified as an unconditional JUMP instruction i.e. jump always performed, control-logic configures data paths to route absolute address (0) stored in IR to PC.
Step 14: The unconditional JUMP instruction is executed updating the PC to the address 0. As a jump instruction has occurred the increment phase is not performed as the PC already contains the address of the next instruction.
Step 15: repeat, control has returned to the start of the program, operations repeated.
Step 16: fetch, decode. Program modified to demonstrate what would happen if no carry was generated, the immediate value used by ADD instruction changed from 10 to 5. Add operation performed producing a result of 255. This fits within 8 bits (11111111), therefore, carry flag not set. Control-logic detects this flag is not set, configures ALU to pass the absolute address stored in IR unaltered, value routed to input of PC.
Step 17: execute, PC updated with jump address (4). As a jump instruction has occurred the increment phase is not performed as the PC already contains the address of the next instruction. LOAD instruction is jumped over i.e. bypassed.

The memory used in the system (figure 38) stores both instructions and data i.e. a Von Neumann architecture. This could be constructed from the built in memory components used in the FPGA, but they are a real pain to configure i.e. initialise with the instructions (machine code) and data values needed. Therefore, i decided to cheat and use a bit of VHDL. This is a Hardware Description Language (HDL) representation of what the memory should do, abstracting away from the low level logic gates its actually made from. This allows me to simply type in the binary values as shown in figure 39. Note, i've only shown the first few values. This description is then synthesised (converted) by the Xilinx tools into the required hardware components. The complete computer system is shown in figure 40. A high resolution diagram can be downloaded here: (Link ).

Figure 38 : RAM

Figure 39 : VHDL (asynchronous, later modified to be synchronous)

The test program shown in figure 38 (machine code, comments on the right in green), loads the value stored in memory location 10 and adds 10 to it (decimal 10, not binary 10). If this does not generate an overflow i.e. a value larger than 255, the result is written back to memory location 10. If it does generate an overflow i.e. 250+10, the value is saturated to the maximum value i.e. 255. Therefore, program counts from 0 to 250 and then stops, as shown in the simulation shown in figure 41, look at the bottom row, this shows the contents of memory location 10 as a hexadecimal value (base 16). Note, the Xilinx schematics, symbols and VHDL testbench for the complete system can be downloaded using the links below (in next section). Have a play, modify the data values or write your own program, then re-run the simulation (top_level_v1_tb).

Figure 40 : Complete system

Figure 41 : Test program simulation

An interesting design point that i do confess ive skipped over for the moment is how the computer's memory is organised. The simpleCPU uses a Von Neumann architecture, therefore, instructions and data are stored in the same memory. This causes a conflict regarding a memory location's width i.e. how many bits are stored in each location, as instructions are 16bits and data is 8bit. The number of addressable memory locations is fixed by the address bus, in this case 8bits (256 location). This gives the designer a few choices:

RAM 256 x 8bit : each memory location stores 8bits, therefore, an instruction has to be split over two memory locations. As a result the processor's fetch phase will now have to read two memory locations to retrieve the high and low byte of the instruction. This requires the PC to be incremented twice, therefore adding an additional phase to sequence generator in figure 26. Alternatively, additional hardware could be added to the system e.g. an incrementer, but this would required an extra input on the address multiplexer to drive PC+1 onto the address bus, all of this increases hardware complexity. The advantage of this approach is that the processor can store data to any memory address.
RAM 256 x 16bit : each memory location stores 16bits, therefore, an instruction can be fetched in a single memory transaction. However, as the ALU and ACC are 8bits there is now a problem of how an 8bit value is stored in a 16bit memory location. The simple solution is to store CPU data in the lower byte and pad the higher byte with 0x00. This means that memory locations used to store data will now waste 8bits, as the CPU can only read from and write to the low byte. For a system with only 256 memory locations, wasting 50% of your data memory is quite a big hit, but, to keep things simple this is the solution this version of simpleCPU uses i.e. the simplest / smallest hardware solution.
RAM 128 x 16bit : memory is now byte addressable, each memory location stores a 16bit values, however, the processor can also read/write to either the high or low byte. Therefore, the effective address bus is reduced to 7bits (128 x 16bit locations) as making the memory byte addressable means the processor still has to specify 256 x 8bit memory locations. Implementing this in hardware is a little tricky. The simplest solution is to have two 128 x 8bit memories connected in parallel, one storing the high bye, the other the low byte. Address bus lines A7 - A1 are used as the "address bus", address line A0 is used to select which memory should be accessed when performing byte read/writes e.g. A0=0: low byte, A0=1: high byte. With a little bit of glue logic this also simplifies byte writes as the unused memory device can simply be disabled i.e. when writing to the low byte the high byte must not be altered. Instruction fetches can still be performed in a single transaction, but instructions are now aligned to even addresses e.g. at addresses 0,2,4,6,8 ...

Testing: "Hello World"

A requirement when writing you first program on any new processor is to print "HELLO WORLD" to the screen. Normally a simple task, however, for a processor with no screen and only seven basic instructions a little more of a challenge. The simplest way to add a screen is to use a serial terminal (Link)(Link), then bit-bang out serial data packets (Link). However, you first need an output port, as shown in figure 42. This is simply a flip-flop who's CE pin is only enabled for a specific address, in this example address 0xFF (255). When data is written to this address (using an OUTPUT instruction), RAM is updated as normal, but, data bit 0 is also stored in this flip-flop, its Q output being connected to the transmit (TX) line of a serial bus.

Figure 42 : Serial Output port

Each character in the "HELLO WORLD" message string is stored in memory, locations 0xF0 to 0xFD, as an ASCII value (Link). To simplify the program they are actually stored as their inverted forms e.g. H = 0x48 (01001000), inverted = 0xB7 (10110111). The program reads each character's bits and outputs these on the serial port. The serial data packet format for the character 'K' (0x4B = 01001011) is shown in figure 43 (Link). Each bit is allocated a time slice on the serial port's line. The default speed for the serial port is 9600 bits per second i.e. each bit is valid for 104 us (1/9600), packets start with a start bit (1) and finish with a stop bit (0). These packets are received by a terminal program running on a remote computer and displayed on its screen. The serial packets and terminal display are shown in figures 43 - 46.

Figure 43 : Serial packet format (wiki link above)

Figure 44 : HELLO WORLD serial packet (captured on scope)

Figure 45 : First three characters "HEL" (captured on scope)

Figure 46 : Complete message displayed in terminal

The program to send the message string "HELLO WORLD" to the serial terminal is shown below. Hopefully most of the code is self explanatory :). The next character to be transmitted is read from memory location 0xE0, a bit mask is applied using a bitwise AND to select the desired bit. Based on this result a conditional jump then selects a 0 or a 1, which is stored to memory location 0xFF i.e. the output port (serial bit). This is repeated seven more times until all eight character bits have been outputted. The program below could of been substantially reduced if a SHIFT instruction had been included. The final twist is how the program scans through the message string. As this processor does not support indirect addressing modes, self modifying code (Link) is used to access the next character in the string i.e. instructions from 0x4F to 0x58. The program reads an INPUT instruction from memory, it then adds 1 to this instruction. As the INPUT instruction's address field is in the lower byte this changes the address that would be read to the address of the next character in the string. This modified instruction is then written back to memory and executed by the program. It goes without saying that self modifying coded i.e. a program that rewrites itself is not a good idea, however, its very useful in this case. Note, in early computer this was actually how the first subroutines i.e. functions, were implemented on EDSAC. The program then fetches the next character, storing it to memory location 0xE0, if this is a NULL (the value 0) the program finishes, otherwise the program jumps to memory location 0x02 and repeats the TX code. From a software structure point of view would of been nice if the processor had supported subroutines, perhaps for version 2.

    DATA_RAM_WORD'("1010000011110000"), --00   INPUT ACC 0xF0  - load first char
    DATA_RAM_WORD'("1110000011100000"), --01   OUTPUT ACC 0xE0 - update tx memory                      
    DATA_RAM_WORD'("0000000000000001"), --02   LOAD ACC 1      - Start Bit                       
    DATA_RAM_WORD'("1110000011111111"), --03   OUTPUT ACC 0xFF - Set serial port high          
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)
    DATA_RAM_WORD'("0100000000000001"), --05   ADD ACC 1       - Increment count    
    DATA_RAM_WORD'("1001110000000101"), --06   JUMP NC 0x05    - repeat 50 times       
    DATA_RAM_WORD'("1010000011100000"), --07   INPUT ACC 0xE0  - read character                     
    DATA_RAM_WORD'("0001000000000001"), --08   AND ACC 0X01    - mask bit0                   
    DATA_RAM_WORD'("1001000000001011"), --09   JUMP Z 0x0B     - if zero output              
    DATA_RAM_WORD'("0000000000000001"), --0A   LOAD ACC 1      - else set 1                        
    DATA_RAM_WORD'("1110000011111111"), --0B   OUTPUT ACC 0xFF - Set serial port bit                         
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                      
    DATA_RAM_WORD'("0100000000000001"), --0D   ADD ACC 1       - Increment count                         
    DATA_RAM_WORD'("1001110000001101"), --0E   JUMP NC 0x0D    - repeat 50 times                         
    DATA_RAM_WORD'("1010000011100000"), --0F   INPUT ACC 0xE0  - read character 

    DATA_RAM_WORD'("0001000000000010"), --10   AND ACC 0X02    - mask bit1 
    DATA_RAM_WORD'("1001000000010011"), --11   JUMP Z 0x13     - if zero output                           
    DATA_RAM_WORD'("0000000000000001"), --12   LOAD ACC 1      - else set 1                          
    DATA_RAM_WORD'("1110000011111111"), --13   OUTPUT ACC 0xFF - Set serial port bit                    
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                           
    DATA_RAM_WORD'("0100000000000001"), --15   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000010101"), --16   JUMP NC 0x15    - repeat 50 times                         
    DATA_RAM_WORD'("1010000011100000"), --17   INPUT ACC 0xE0  - read character                             
    DATA_RAM_WORD'("0001000000000100"), --18   AND ACC 0X04    - mask bit2                                   
    DATA_RAM_WORD'("1001000000011011"), --19   JUMP Z 0x1B     - if zero output                     
    DATA_RAM_WORD'("0000000000000001"), --1A   LOAD ACC 1      - else set 1                       
    DATA_RAM_WORD'("1110000011111111"), --1B   OUTPUT ACC 0xFF - Set serial port bit                        
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                         
    DATA_RAM_WORD'("0100000000000001"), --1D   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000011101"), --1E   JUMP NC 0x1D    - repeat 50 times                          
    DATA_RAM_WORD'("1010000011100000"), --1F   INPUT ACC 0xE0  - read character   

    DATA_RAM_WORD'("0001000000001000"), --20   AND ACC 0X08    - mask bit3
    DATA_RAM_WORD'("1001000000100011"), --21   JUMP Z 0x23     - if zero output                           
    DATA_RAM_WORD'("0000000000000001"), --22   LOAD ACC 1      - else set 1                          
    DATA_RAM_WORD'("1110000011111111"), --23   OUTPUT ACC 0xFF - Set serial port bit                    
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                           
    DATA_RAM_WORD'("0100000000000001"), --25   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000100101"), --26   JUMP NC 0x25    - repeat 50 times                         
    DATA_RAM_WORD'("1010000011100000"), --27   INPUT ACC 0xE0  - read character                             
    DATA_RAM_WORD'("0001000000010000"), --28   AND ACC 0X10    - mask bit4                                  
    DATA_RAM_WORD'("1001000000101011"), --29   JUMP Z 0x2B     - if zero output                     
    DATA_RAM_WORD'("0000000000000001"), --2A   LOAD ACC 1      - else set 1                       
    DATA_RAM_WORD'("1110000011111111"), --2B   OUTPUT ACC 0xFF - Set serial port bit                        
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                          
    DATA_RAM_WORD'("0100000000000001"), --2D   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000101101"), --2E   JUMP NC 0x2D    - repeat 50 times                          
    DATA_RAM_WORD'("1010000011100000"), --2F   INPUT ACC 0xE0  - read character      
  
    DATA_RAM_WORD'("0001000000100000"), --30   AND ACC 0X20    - mask bit5
    DATA_RAM_WORD'("1001000000110011"), --31   JUMP Z 0x33     - if zero output                           
    DATA_RAM_WORD'("0000000000000001"), --32   LOAD ACC 1      - else set 1                          
    DATA_RAM_WORD'("1110000011111111"), --33   OUTPUT ACC 0xFF - Set serial port bit                    
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                            
    DATA_RAM_WORD'("0100000000000001"), --35   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000110101"), --36   JUMP NC 0x35    - repeat 50 times                         
    DATA_RAM_WORD'("1010000011100000"), --37   INPUT ACC 0xE0  - read character                             
    DATA_RAM_WORD'("0001000001000000"), --38   AND ACC 0X40    - mask bit6                                  
    DATA_RAM_WORD'("1001000000111011"), --39   JUMP Z 0x3B     - if zero output                     
    DATA_RAM_WORD'("0000000000000001"), --3A   LOAD ACC 1      - else set 1                       
    DATA_RAM_WORD'("1110000011111111"), --3B   OUTPUT ACC 0xFF - Set serial port bit                        
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                         
    DATA_RAM_WORD'("0100000000000001"), --3D   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110000111101"), --3E   JUMP NC 0x3D    - repeat 50 times                          
    DATA_RAM_WORD'("1010000011100000"), --3F   INPUT ACC 0xE0  - read character   
	
    DATA_RAM_WORD'("0001000010000000"), --40   AND ACC 0X80    - mask bit7
    DATA_RAM_WORD'("1001000001000011"), --41   JUMP Z 0x43     - if zero output                           
    DATA_RAM_WORD'("0000000000000001"), --42   LOAD ACC 1      - else set 1                          
    DATA_RAM_WORD'("1110000011111111"), --43   OUTPUT ACC 0xFF - Set serial port bit                    
    DATA_RAM_WORD'("0000000011001110"), --04   LOAD ACC 0xCE   - wait loop (104us)                             
    DATA_RAM_WORD'("0100000000000001"), --45   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110001000101"), --46   JUMP NC 0x45    - repeat 50 times                              
    DATA_RAM_WORD'("0000000000000000"), --47   LOAD ACC 0      - stop bit                           
    DATA_RAM_WORD'("1110000011111111"), --48   OUTPUT ACC 0xFF - Set serial port bit                          
    DATA_RAM_WORD'("0100000000000001"), --49   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110001001001"), --4A   JUMP NC 0x49    - repeat 256 times                         
    DATA_RAM_WORD'("0000000000000000"), --4B   LOAD ACC 0      - stop bit                           
    DATA_RAM_WORD'("1110000011111111"), --4C   OUTPUT ACC 0xFF - Set serial port bit                          
    DATA_RAM_WORD'("0100000000000001"), --4D   ADD ACC 1       - Increment count                          
    DATA_RAM_WORD'("1001110001001101"), --4E   JUMP NC 0x4D    - repeat 256 times    
    DATA_RAM_WORD'("1010000001010010"), --4F   INPUT ACC 52    - read instruction
	 
    DATA_RAM_WORD'("0100000000000001"), --50   ADD ACC 1       - increment address field
    DATA_RAM_WORD'("1110000001010010"), --51   OUT ACC 52      - update instruction                  
    DATA_RAM_WORD'("1010000011110000"), --52   INPUT ACC 0xF0  - execute instruction                     
    DATA_RAM_WORD'("1110000011100000"), --53   OUTPUT ACC 0xE0 - update tx memory                         
    DATA_RAM_WORD'("0001000011111111"), --54   AND ACC 0xFF    - is char NULL?                         
    DATA_RAM_WORD'("1001010000000010"), --55   JUMP NZ, 2      - no, TX                 
    DATA_RAM_WORD'("0000000011110000"), --56   LOAD ACC 0xF0   - restore original instruction            
    DATA_RAM_WORD'("1110000001010010"), --57   OUT ACC 52                        
    DATA_RAM_WORD'("1000000001010110"), --58   JUMP 56         - yes, halt  

    DATA_RAM_WORD'("0000000010110111"), --F0  H      = 0x48 inv 10110111
    DATA_RAM_WORD'("0000000010111010"), --F1  E      = 0x45 inv 10111010                         
    DATA_RAM_WORD'("0000000010110011"), --F2  L      = 0x4C inv 10110011                         
    DATA_RAM_WORD'("0000000010110011"), --F3  L      = 0x4C inv 10110011                           
    DATA_RAM_WORD'("0000000010110000"), --F4  O      = 0x4F inv 10110000                         
    DATA_RAM_WORD'("0000000011011111"), --F5  'SP'   = 0x20 inv 11011111                        
    DATA_RAM_WORD'("0000000010101000"), --F6  W      = 0x57 inv 10101000                    
    DATA_RAM_WORD'("0000000010110000"), --F7  O      = 0x4F inv 10110000                       
    DATA_RAM_WORD'("0000000010101101"), --F8  R      = 0x52 inv 10101101                  
    DATA_RAM_WORD'("0000000010110011"), --F9  L      = 0x4C inv 10110011                            
    DATA_RAM_WORD'("0000000010111011"), --FA  D      = 0x44 inv 10111011                        
    DATA_RAM_WORD'("0000000011110010"), --FB  'CR'   = 0x0D inv 11110010                       
    DATA_RAM_WORD'("0000000011110101"), --FC  'LF'   = 0x0A inv 11110101                        
    DATA_RAM_WORD'("0000000000000000"), --FD  'NULL' = 0x00                          
    DATA_RAM_WORD'("0000000000000000"), --FE                            
    DATA_RAM_WORD'("0000000000000000")  --FF

Had to make a few small changes to the schematics and VHDL files to minimise the hardware size e.g. in its present form the memory is implemented by the software tools as a 256:1 16bit multiplexer, which takes up quite a bit of space. Adding a clock allows the memory to be mapped to a BlockRam i.e. the default RAM on the FPGA. The final project that prints "HELLO WORLD" can be downloaded here: (Link). NOTE: this version has 'errors' corrected version below.

What next for this computer, i want to see if i can get it to fit into a Xilinx 9572 CPLD, which we use for teaching hardware design, would need to use external RAM/ROM, however, the main limitation is that this programmable hardware only has a very small amount of hardware e.g. 72 flip-flops. Update, gave up on this idea mostly due to component sourcing issues for the RAM and EPROMs needed for the instruction and data memory.

Testing: fully tested :)

Originally this processor was a quick case study, used a to illustrate the process of realising a simple processor architecture from Boolean logic gates. Do confess i only tested the hardware enough to get the "Hello World" case study working, as a result i didn't check that all the instructions worked correctly :). Received an email asking for some more detail / explanations about the operation of this processor, i suspect because there were some typos / errors, so i thought i better do a "full" test. The instruction set for this machine is:

Load ACC kk : 0000 XXXX KKKKKKKK

Add ACC kk : 0100 XXXX KKKKKKKK

And ACC kk : 0001 XXXX KKKKKKKK

Sub ACC kk : 0110 XXXX KKKKKKKK

Input ACC pp : 1010 XXXX PPPPPPPP

Output ACC pp : 1110 XXXX PPPPPPPP

Jump U aa : 1000 XXXX AAAAAAAA

Jump Z aa : 1001 00XX AAAAAAAA

Jump C aa : 1001 10XX AAAAAAAA

Jump NZ aa : 1001 01XX AAAAAAAA

Jump NC aa : 1001 11XX AAAAAAAA

Therefore, the test code below should allow me to test "all" the instructions:

    DATA_RAM_WORD'("0000000000000001"), --00   LOAD ACC 0x01     -
    DATA_RAM_WORD'("0100000000000000"), --01   ADD ACC, 0x00     - test NZ NC ACC=1                    
    DATA_RAM_WORD'("0100000011111111"), --02   ADD ACC, 0xFF     - test Z  C  ACC=0    
    DATA_RAM_WORD'("0000000010101010"), --03   LOAD ACC 0xAA     -
    DATA_RAM_WORD'("0001000000001111"), --04   AND ACC, 0x0F     - test NZ NC ACC=0x0A                    
    DATA_RAM_WORD'("0001000000000000"), --05   AND ACC, 0x00     - test Z  NC ACC=0x00  
    DATA_RAM_WORD'("0000000000000001"), --06   LOAD ACC 0x01     -
    DATA_RAM_WORD'("0110000000000001"), --07   SUB ACC, 0x01     - test Z  NC ACC=0x00                    
    DATA_RAM_WORD'("0110000000000001"), --08   SUB ACC, 0x01     - test NZ C  ACC=0xFF
    DATA_RAM_WORD'("1110000011110000"), --09   OUTPUT ACC, 0xF0  - test M[0xF0] = 0xFF
    DATA_RAM_WORD'("0000000000000000"), --0A   LOAD ACC 0x00     - 
    DATA_RAM_WORD'("1010000011110000"), --0B   INPUT ACC, 0xF0   - test ACC = 0xFF 
    DATA_RAM_WORD'("0100000000000000"), --0C   ADD ACC, 0x00     - test NZ NC ACC=0x01
    DATA_RAM_WORD'("1001010000001111"), --0D   JUMP NZ, 0x0F     - skip trap if correct
    DATA_RAM_WORD'("1000000000001110"), --0E   JUMP 0x0E         - trap
    DATA_RAM_WORD'("0100000000000001"), --0F   ADD ACC, 0x01     - test Z NC ACC=0x00
    DATA_RAM_WORD'("1001000000010010"), --10   JUMP Z, 0x12      - skip trap if correct
    DATA_RAM_WORD'("1000000000010001"), --11   JUMP 0x11         - trap
    DATA_RAM_WORD'("0000000000000010"), --12   LOAD ACC 0x02 
    DATA_RAM_WORD'("0100000011111111"), --13   ADD ACC, 0xFF     - test NZ C ACC=0x01
    DATA_RAM_WORD'("1001100000010110"), --14   JUMP C, 0x16      - skip trap if correct
    DATA_RAM_WORD'("1000000000010101"), --15   JUMP 0x15         - trap
    DATA_RAM_WORD'("0110000000000001"), --16   SUB ACC, 0x01     - test Z  NC ACC=0x00  
    DATA_RAM_WORD'("1001100000011001"), --17   JUMP NC, 0x19     - skip trap if correct
    DATA_RAM_WORD'("1000000000011000"), --18   JUMP 0x15         - trap
    DATA_RAM_WORD'("1000000000000000"), --19   JUMP 0x00         - loop

This is not a full test of all functional units, with all possible data values, or all possible combinations of instructions, as this would take forever, but this code does test typical usage so should highlight any major issues. Testing start ok, LOAD, AND, ADD all seemed to work ok, then i came to the SUB instruction :(. I had missed off the ACC update, also messed up MUX_A and MUX_B. For ADD and AND it doesn't matter which way round the operands are passed to the ALU e.g. (ACC=InputA, Constant=InputB) or (ACC=InputB, Constant=InputA), both will produce the same result. Subtraction however, we need (ACC - Constant), performing the subtraction the other way round is not that helpful :(. This was "easy" to fixed i.e. switch over the data paths from the ACC and IR to the two multiplexers, but this does mean i have to update the decoder logic and more annoyingly redraw all the block diagrams in the lecture slides, one day :). Therefore, switched the data buses and updated the decoder, now the ACC has the correct value, but, another joy did not consider the carry flag. The ALU performs subtract by the addition of a negative number, therefore, if ACC=1 subtracting 1 will set the carry flag as this operation is implemented by adding 255. At present not sure how to handle this one, i feel there is a simple solution shouting at me, but, for the moment going for the tried and tested approach of ignoring it :). The new and improved SimpleCPU project files and this test code can be downloaded here: (Link).

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Contact email: mike@simplecpudesign.com

Back

Simple CPU :) - updated 30/10/2019

Table of Contents

Back to Basics - simpleCPU_v1

Testing: "Hello World"

Testing: fully tested :)