Simple CPU v1a1: FPGA - Updated 23/2/2024

A small increment, a small improvement, introducing the new SimpleCPUv1a1 processor. This version of the processor has been modified to include two new instructions: a conditional jump if carry set (JUMPC) and a rotate ACC one bit position right (ROTR). This processor was developed to illustrate how a processor is modified when new instructions are added to its instruction set: the things you need to consider and the things you need to do. These new instructions are then used in a shift-&-add multiplication algorithm.

Figure 1 : processor

The new and improved ISE project files for this computer can be downloaded here: (Link). The main modification made to this processor is to the ALU, as shown in figure 1. The ALU is now clocked (CLK). This signal is not needed by the functional components i.e. the rotate hardware, but rather by the carry flag (CFLAG). The carry flag, like the zero flag, is a status bit indicating the result of the last arithmetic or logical operation. To keep things simple the simpleCPU's zero flag is based on the current result stored in the ACC i.e. driven by an 8bit NOR gate. Unfortunately, we can't use this simple logic-based solution for the carry flag, as its value would change as the operands entering the ALU change when the next instruction is fetched. Therefore, we need some memory: a D-type flip-flop to remember the carry flag's state. The carry flag is set by the addition and rotate hardware i.e. if an add instruction generates a 9bit result or if the rotate instruction moves a logic-1 in the LSB position into the carry flag, as illustrated by the RTL below:

Assembly : ROTR		
	    
RTL : ACC   <- CFLAG || ACC(7:1)  
      CFLAG <- ACC(0)              

Example       Description	     
move 0x11     ACC <= 00010001  
              CFLAG <= 0                   
rotr          ACC <= 00001000   
              CFLAG <= 1             
rotr          ACC <= 10000100  
              CFLAG <= 0
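
The RTL above can also be checked with a small software model. The Python sketch below is only a behavioural illustration, not the hardware; the function name rotr and the variable names are made up for this example, and the starting values are those of the example table.

```python
def rotr(acc, cflag):
    """Rotate an 8-bit ACC one position right through the carry flag.

    Returns the pair (new_acc, new_cflag).
    """
    new_acc = ((cflag & 1) << 7) | ((acc & 0xFF) >> 1)  # ACC   <- CFLAG || ACC(7:1)
    new_cflag = acc & 1                                  # CFLAG <- ACC(0)
    return new_acc, new_cflag


# Replay the example: move 0x11 with the carry flag clear, then two rotr.
acc, cflag = 0x11, 0
first = rotr(acc, cflag)     # state after the first rotr
second = rotr(*first)        # state after the second rotr

print(format(first[0], "08b"), first[1])
print(format(second[0], "08b"), second[1])
```

Note that both assignments in the RTL read the old ACC value, which is why the model computes new_cflag from acc rather than from new_acc.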

The carry flag is used to hold the LSB of the ACC that is shifted out of the ACC when the rotate right instruction is performed. This bit is stored in the carry flag until the next arithmetic or logical instruction is performed. A block diagram of how the ACC, rotate right hardware and carry flag are connected is shown in figure 2.

Figure 2 : rotate right block diagram

The carry flag is implemented in the ALU using a D-type flip-flop, as shown in figure 3, storing the state of this flag (COUT) between instructions. The four functional hardware components within the ALU i.e. add/sub, bitwise-and, pass-through and rotate, could all update this flag. However, in general most processors only allow arithmetic and logical instructions to update this flag i.e. MOVE, LOAD/STORE, JUMP instructions do not update the carry flag. This allows a bit more flexibility in software solutions. Therefore, for the simpleCPU_v1a1 the two main inputs to the carry flag are the add/sub and rotate hardware components. The carry signals from these components are selected by a four-input bit-multiplexer.

Figure 3 : arithmetic and logic unit (ALU)

The clock enable pin (CE) of the carry flag's D-type flip-flop is used to control when this flag is updated. This pin is controlled by the ALU's CE input port, which is in turn connected to the ACC's CE line, as shown in figure 1. This signal is set to a logic '1' when the ACC is updated e.g. when an ADD or MOVE instruction is performed. However, to prevent the carry flag from being incorrectly updated when MOVE, LOAD/STORE, or JUMP instructions are executed, a small amount of additional decode logic is added to the ALU.
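
The effect of the clock enable and the extra decode logic can be sketched in software. The Python model below is an assumption-laden illustration: the class CarryFlag, the function carry_ce and the set UPDATES_CARRY are hypothetical names, and the exact list of instructions allowed to update the flag is assumed from the description above (add/sub and rotate), not taken from the schematic.

```python
class CarryFlag:
    """D-type flip-flop with clock enable, modelling the carry flag."""

    def __init__(self):
        self.q = 0  # stored flag state

    def clock(self, d, ce):
        """Apply one rising clock edge: capture d only when ce is 1."""
        if ce:
            self.q = d & 1
        return self.q


# Assumption: only the add/sub and rotate instructions may update the flag.
UPDATES_CARRY = {"add", "sub", "addm", "subm", "rotr"}


def carry_ce(mnemonic, acc_en):
    """Gate the ACC enable so that other instructions leave the flag alone."""
    return int(bool(acc_en) and mnemonic in UPDATES_CARRY)


flag = CarryFlag()
after_add = flag.clock(1, carry_ce("add", 1))    # add with carry out sets the flag
after_move = flag.clock(0, carry_ce("move", 1))  # move updates ACC, flag is held
after_rotr = flag.clock(0, carry_ce("rotr", 1))  # rotr shifts out a 0, flag cleared
```

The point of the sketch is the middle step: the ACC enable is active for the move, but the gated enable keeps the previously stored carry.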

The rotate right function is implemented using the hardware shown in figure 4. The required bit shift is simply implemented using "wires" i.e. input bit-6 is connected to output bit-5 etc. However, within ISE buffers are required, as you can't have a wire (signal) with two names i.e. a signal cannot be called both A(6) and Z(5). Input bit A(0) is "discarded" from the result and used to drive the COUT pin. The new Z(7) output bit is driven from the CIN pin, which is driven by the carry flag, as shown in figure 3.

Figure 4 : rotate right hardware

The rotate right instruction (ROTR) is a zero operand instruction i.e. no additional operands are needed, as its functionality is hard-coded, always rotating the ACC contents one bit position to the right. The instruction format used is shown in figure 5. In theory only the 4-bit opcode (0xC) is needed to represent this instruction in memory, "freeing" 12 bits of memory for other instructions. However, to simplify the processor's fetch phase i.e. to align an instruction within a single memory location, rather than spreading an instruction over two memory locations, the ROTR instruction uses the same 16bit fixed length instruction format as all other instructions. Therefore, the "operand" bit field is padded with 0's.

Figure 5 : ROTR instruction format

Within the processor this instruction is first processed by the decode-logic in the control_logic schematic shown in figure 6. The ROTR instruction has been assigned the opcode 0xC. Note, opcode 0xD has also been reserved for the rotate left (ROTL) instruction, but this has not been implemented. As the ROTR instruction does not need to process an operand stored in the IR or external memory, the only control signals that need to be updated are those associated with the ACC update (ACC_EN) and ALU function selection (ALU_CTL). When the ROTR is loaded into the IR the YC pin from the onehot_decoder_16 component will be set to a logic 1 (all other outputs will be set to a logic 0). This signal can then be combined with existing signals using OR gates ("join" function) to control the ACC_EN and ALU_CTL signals. To select the ROTR hardware component output in the ALU (shown in figure 3) the ALU_CTL signal needs to be set to 11X i.e. ALU_CTL2=1, ALU_CTL1=1, ALU_CTL0=Don't care. These signals control the four-input MUXs within the ALU, driving the ROTR component's output onto the ALU's output bus.
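
The decode step can be illustrated with a short Python sketch. It assumes the 4-bit opcode sits in the top four bits of the 16bit instruction word (bits 15 to 12), which is not stated explicitly above; the helper names encode, onehot_decoder_16 (borrowed from the schematic component) and alu_selects_rotr are written for this example only.

```python
ROTR_OPCODE = 0xC  # opcode assigned to ROTR in the text


def encode(opcode, operand=0):
    """Build a 16-bit instruction word.

    Assumed layout: opcode in bits 15..12, operand field in the low 12 bits.
    """
    return ((opcode & 0xF) << 12) | (operand & 0xFFF)


def onehot_decoder_16(opcode):
    """Return outputs Y0..YF as a list, with only the selected output at 1."""
    return [1 if i == (opcode & 0xF) else 0 for i in range(16)]


def alu_selects_rotr(alu_ctl):
    """True when the 3-bit ALU_CTL matches 11X (ALU_CTL0 is don't care)."""
    return (alu_ctl >> 1) & 0b11 == 0b11


ir = encode(ROTR_OPCODE)           # operand field padded with 0's
y = onehot_decoder_16(ir >> 12)    # decode the opcode held in the IR
yc = y[0xC]                        # the YC pin

print(hex(ir), yc, sum(y))
```

Under the assumed layout the ROTR instruction is stored as 0xc000, and exactly one decoder output (YC) is active.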

Figure 6 : control logic - original (top), new (bottom)

The second instruction added to the processor is a conditional jump instruction i.e. JUMPC. This instruction will jump to the absolute address specified in bit positions 7 to 0 if the carry flag is set. The bitfields used are shown in figure 7.

Figure 7 : JUMPC instruction format
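
The behaviour of JUMPC can be summarised with a small Python sketch. This is a behavioural illustration only: the function name jumpc is made up, and it assumes an 8bit program counter that simply moves to the next address when the jump is not taken.

```python
def jumpc(pc, ir, cflag):
    """Return the next program counter value for a JUMPC instruction.

    pc    : address of the JUMPC instruction (assumed 8 bits wide)
    ir    : 16-bit instruction word, absolute address in bits 7..0
    cflag : current state of the carry flag
    """
    target = ir & 0xFF               # absolute address field
    if cflag:
        return target                # carry set: jump
    return (pc + 1) & 0xFF           # carry clear: fall through (assumption)


taken = jumpc(0x10, 0x0042, 1)       # opcode bits left as 0 here, they are not modelled
not_taken = jumpc(0x10, 0x0042, 0)
```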

To test if this new hardware and the existing hardware works the following test code was used:

#############
# TEST CODE #
#############

# Instructions
# ------------
# move, add, sub, and
# load, store, addm, subm
# jump / jumpu, jumpz, jumpnz, jumpc
# rotr
# .data

start:
  move 0xFF
  add 1
  add 1

  move 0
  sub 1
  sub 1

  move 10
  store A
  move 0
  load A
  move 0
  addm A
  move 0
  subm A

  move 1
  rotr
  rotr
  rotr
  rotr
  rotr
  rotr
  rotr
  rotr
  rotr

  move 0
  jumpnz end
  move 1
  jumpz end
  move 0
  jumpz next
  move 0xFF

next:
  move 1
  jumpnz next1
  move 0xFF

next1:
  move 0xFF
  add 1
  jumpc next2
  move 0xFF

next2:
  move 0
  add 1
  rotr
  jumpc end
  move 0xFF
  
end:
    jump end

A:
  .data 0

The results from this code can be seen in the test code simulation shown in figure 7. If you examine the debug section you can identify when the ROTR instructions are executed i.e. the value 1 is moved into the ACC, which is then repeatedly rotated right, updating the ACC with the values 0x00, 0x80, 0x40, 0x20 ...

Figure 7 : test code simulation

Multiplication on the simpleCPU can be performed using repeated addition; however, this is very inefficient. A more balanced algorithm can be implemented using shift and add instructions, as shown in figures 8 and 9. For more information on this algorithm refer to: (Link).

Figure 8 : shift and add block diagram

Figure 9 : shift and add block flowchart

Note, the two rotate operations highlighted by the "ROR bit shifts" label are performed sequentially, not in parallel i.e. the first rotate operation shifts the carry flag that was updated by the ADD instruction into the Y variable, rotating out Y(0) into the carry flag. This bit is then rotated into the multiplier variable Z, rotating out Z(0) into the carry flag, which is then dumped i.e. we only look at the multiplier's LSB, and the multiplier is incrementally overwritten by the low byte of the multiplication result.

The flow chart to perform the multiplication algorithm is shown in figure 9. The multiplicand is loaded into variable X, the multiplier into variable Z. The LSB of the multiplier is tested; if "1" the multiplicand is added to the partial product variable Y. Next, this partial product and the multiplier are shifted to the right. The LSB of the partial product variable overwrites the multiplier's MSB as the multiplier is rotated to the right, ready to test the next bit position. This process is repeated until all bits within the multiplier have been tested. When complete, the high byte of the result will be stored in variable Y and the low byte of the result in variable Z i.e. the 8bit variables Y and Z are used to store the 16bit result. Therefore, multiplication can now be performed using the ROTR and ADD instructions. Processing time is now proportional to the number of bits within the multiplier, rather than the multiplier's value.
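
Before looking at the assembly, the flow chart can be traced with a short Python model. This is a sketch of the algorithm as described above, not the processor's code: the function names rotr and shift_add_multiply are made up, while the variable names x, y and z follow the X, Y and Z variables in the text.

```python
def rotr(value, cflag):
    """Rotate an 8-bit value right through the carry; returns (value, carry)."""
    return ((cflag & 1) << 7) | ((value & 0xFF) >> 1), value & 1


def shift_add_multiply(multiplicand, multiplier):
    """Multiply two 8-bit values, returning (high byte Y, low byte Z)."""
    x = multiplicand & 0xFF
    y = 0                        # partial product, becomes the high byte
    z = multiplier & 0xFF        # multiplier, overwritten by the low byte
    for _ in range(8):           # one pass per multiplier bit
        cflag = 0                # carry cleared at the start of each pass
        if z & 1:                # test multiplier LSB
            total = y + x        # 9-bit sum
            y, cflag = total & 0xFF, total >> 8
        y, cflag = rotr(y, cflag)   # carry into Y(7), Y(0) into carry
        z, cflag = rotr(z, cflag)   # Y(0) into Z(7), Z(0) is dumped
    return y, z


high, low = shift_add_multiply(255, 255)
print(hex(high), hex(low))
```

For 255 * 255 the model leaves 0xfe in the high byte and 0x1 in the low byte, and the loop always runs eight times regardless of the multiplier's value.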

The code to implement this multiplication algorithm is shown below:

#
# MAIN PROGRAM
#

start:
    move 8
    store CNT       # number of bits in multiplier

    move 255        # set multiplicand
    store X
    move 0          # zero initial partial product
    store Y
    move 255        # set multiplier
    store Z

loop:
    move 0          # clear carry flag
    add 0           # not needed, but just in case ADDER is updated to use CIN

    load Z          # test multiplier LSB
    and 1
    jumpz shiftY    # if zero shift

    load Y          # if one add multiplicand
    addm X
    rotr            # rotate Y    
    store Y    
    jump shiftZ
    
shiftY:         
    load Y          # rotate Y
    rotr
    store Y   
shiftZ:
    load Z          # rotate Z          
    rotr
    store Z

dec:
    load CNT        # decrement and test loop count
    sub 1
    store CNT
    jumpnz loop

finish:
    load Y          # read result
    load Z
    jump finish

CNT:
    .data 0
X: 
    .data 0
Y:
    .data 0
Z:
    .data 0

The results from this code can be seen in the simulation shown in figure 10. With a system clock speed of 10MHz the program takes approximately 44us to perform the calculation 255 * 255 = 65025 = 0xFE01, which is the result displayed in wave6, the last screen shot in the sequence below. Note, this value cannot be loaded into the ACC in one chunk as the ACC is only 8 bits wide; rather, the 16bit value is stored in the variables Y (high byte) and Z (low byte), which are held in external memory.

Figure 10 : test code simulation

Need to now write a good test program to illustrate the JUMPC instruction.

WORK IN PROGRESS

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Contact email: mike@simplecpudesign.com
