Man-Machine Interface

My Forth compiler is working!

I was able to compile this:

: star [char] * emit ;

which outputs a single * character when executed:

star * ok
star star star star **** ok

The compiler is written using a number of Forth primitive words coded in assembly language.  Now that the compiler is working, I can write the remainder of the Forth system in Forth itself.

For example to add a conditional if statement I would write this code in Forth:

: if ['] 0branch , here 0 , ; immediate
: then dup here swap - swap ! ; immediate

The if word compiles a 0branch word, and a zero branch offset and also pushes the location of the branch offset onto the stack.

The then word takes the branch offset address off the stack and calculates the actual offset from the previous if and writes that amount into memory at the branch offset address.

To test this I can use this:

: test-if 123 = if star then ;

The test-if word compares the number on the top of the stack with 123 and if equal it will call the word star defined above which outputs a single * character.  If the number is not equal to 123 then control passes to the code after the then word and nothing happens.

Executing the test shows this:

100 test-if  ok
123 test-if * ok

So it works.  Using Forth itself one is able to extend the compiler with new word to add control flow.  Likewise an else word can be added as well as looping constructs.

Comments

D&D

I Am A: True Neutral Human Wizard (5th Level)

Ability Scores:
Strength-9
Dexterity-9
Constitution-10
Intelligence-15
Wisdom-13
Charisma-11

Alignment:
True Neutral A true neutral character does what seems to be a good idea. He doesn’t feel strongly one way or the other when it comes to good vs. evil or law vs. chaos. Most true neutral characters exhibit a lack of conviction or bias rather than a commitment to neutrality. Such a character thinks of good as better than evil after all, he would rather have good neighbors and rulers than evil ones. Still, he’s not personally committed to upholding good in any abstract or universal way. Some true neutral characters, on the other hand, commit themselves philosophically to neutrality. They see good, evil, law, and chaos as prejudices and dangerous extremes. They advocate the middle way of neutrality as the best, most balanced road in the long run. True neutral is the best alignment you can be because it means you act naturally, without prejudice or compulsion. However, true neutral can be a dangerous alignment when it represents apathy, indifference, and a lack of conviction.

Race:
Humans are the most adaptable of the common races. Short generations and a penchant for migration and conquest have made them physically diverse as well. Humans are often unorthodox in their dress, sporting unusual hairstyles, fanciful clothes, tattoos, and the like.

Class:
Wizards are arcane spellcasters who depend on intensive study to create their magic. To wizards, magic is not a talent but a difficult, rewarding art. When they are prepared for battle, wizards can use their spells to devastating effect. When caught by surprise, they are vulnerable. The wizard’s strength is her spells, everything else is secondary. She learns new spells as she experiments and grows in experience, and she can also learn them from other wizards. In addition, over time a wizard learns to manipulate her spells so they go farther, work better, or are improved in some other way. A wizard can call a familiar- a small, magical, animal companion that serves her. With a high Intelligence, wizards are capable of casting very high levels of spells.

Find out What Kind of Dungeons and Dragons Character Would You Be?, courtesy of Easydamus (e-mail)

Comments

Computer Love

I’ve managed to speed up the 16 bit CPU I created in Logisim.

Originally the instruction fetch was four clock cycles followed by a one or two clock cycle instruction execution.  The instruction fetch does this:

  • Load the Instruction Register (IR) from memory addressed by the Program Counter Register (PC).
  • Increment PC by one.
  • Load the Operand Register (OP) from memory addressed by the PC.
  • Increment PC by one.

Each step is a single clock cycle.  On the fifth clock cycle instruction execution finally begins.  This fetching was too slow; the CPU spent more time fetching instructions from memory than executing them.

The standard way to speed up a CPU is to try to do as much as possible in a single clock cycle in parallel with other operations.  The new instruction fetch is this:

  • Load IR from memory addressed by PC.
  • Load OP from memory addressed by PC+1.
  • Increment PC by two.  Start instruction execution…

The last step is done in parallel with the first cycle of instruction execution, so effectively the instruction fetch is now two clock cycles.  Since the PC is now modified during instruction execution, branch instructions need an extra clock cycle to ensure the PC is valid before it’s used in branch target calculations.

To actually complete this change I moved from a hard-wiring the CPU control circuit to instead using microcode.  The control circuit is what controls the state changes of the CPU as it fetches and executes instructions.  It takes as input the instruction code, the current clock count and the fetch state and outputs values on 17 different control lines that controls the various parts of the CPU.  You can think of the control circuit input as a number, in my design it is a ten digit binary number, 5 bits for the instruction, 4 bits for the clock count and 1 bit for the fetch status.  The output of the control circuit is also a number, in my design it is a 25 digit binary number.  So the control circuit is really just a mapping from the 10 digit input number to the 25 digit output number.  This mapping can be stored in a ROM which makes it much easier to modify.

So the previous control circuit:

Becomes this:

The ROM in the center contains the mapping from the input values to the output values.

Microcode is not so popular these days in real CPU design, but it’s much easier for me to edit and debug which is why I use it.

I’ve been programming this CPU and I’m finding certain deficiencies in the instruction set.  There’s a need for an unsigned set if less than instruction.  The existing SLT and SLTI are signed.  I’ll probably replace SLTI (set if less than immediate) with a new SLTU (set if less than unsigned).  This will make double precision arithmetic easier (and faster).

Comments

SYS64738

Jack Tramiel died yesterday.  He was the founder of Commodore and was responsible for the C64 which was the first real computer I owned.

If it weren’t for Mr Tramiel the 80′s computer market would have looked quite different.  I might have ended up with a MicroBee or Dick Smith CAT!

Comments

The Joy Circuit

I built a 16 bit CPU in Logisim.

It has 15 general purpose registers, 30 machine instructions and a 12 operation Arithmetic Logic Unit (ALU).   The CPU can address up to 65536 16 bit words of RAM.

Here’s a picture of the CPU:

Essentially the above circuit connects the CPU CONTROL circuit to the ALU and REGISTER file.  It also provides input and output to RAM and an external CLOCK.

ALU

First thing created was the ALU.  This is the device within the CPU that performs arithmetic operations, such as addition and multiplication, as well as logic operations, such as shifting or Boolean operations.  My ALU pictured below does 12 operations: addition, subtraction, left shift, logical right shift, arithmetic right shift, and, or, nor, exclusive or, multiplication, division and comparison.

My ALU can perform 12 operations represented by a four bit code:

0000    addition
0001    subtraction
0010    shift left logical
0011    shift right logical
0100    shift right arithmetic
0101    logical and
0110    logical or
0111    logical nor (not or)
1000    logical exclusive or
1001    multiplication
1010    division

Register File

Next I created the register file.  A register file is type of very fast RAM internal to the CPU which is used as temporary storage for memory contents being operated upon.  For example if the CPU was to add two numbers, those numbers would first be loaded from RAM into two registers in the register file before being sent to the ALU to perform the addition.  Once the addition is complete the result would be sent back to a register before being written back to RAM.  The reason for this is that RAM is typically far slower in operation compared to the register file.

My register file contains 16 16 bit registers.  Register 0 is hard wired to zero (0000 0000 0000 0000).

Registers are addressed r0, r1, r2, … r15.  Although all registers are general purpose, by convention some are used for fixed purposes and can be address by alternative names:

r0  - zero  always zero
r1  - at    reserved for assembler use only
r2  - v0    values for function returns
r3  - v1
r4  - a0    function arguments
r5  - a1
r6  - a2
r7  - s0    saved temporaries
r8  - s1
r9  - s2
r10 - t0    temporaries
r11 - t1
r12 - t2
r13 - fp    frame pointer
r14 - sp    stack pointer
r15 - ra    return address

I stole these conventions from the MIPS architecture.

Here’s a picture of my register file.  It has one write port and two read ports which means it can write one register and read two registers in a single operation.

Machine Instruction Encoding

The purpose of the CPU is to execute machine instructions stored in RAM.  I needed to design an appropriate instruction set and encoding for the instructions.

Again I borrowed heavily from the MIPS architecture.  I decided on encoding each machine instruction using two 16 bit words.  There are two types of instructions called X and Y.

The X type encoding has five fields: INS8 DST4 SRC4 TGT4 ignored12

And the Y type encoding has four fields: INS8 DST4 SRC4 C16

The names above indicate the use of the field and the numbers above indicate the bit width of the fields.

Both X and Y type encoding have a INS (instruction) field which encodes the instruction to perform, i.e. addition, branch etc.  The instruction field is eight bits in size which allows for 256 possible machine instructions, although I only needed 30.

Also common to both types are the DST (destination) and SRC (source) fields which indicate registers in the register file.  The source and destination fields are each four bits in size and so can address up to 16 registers.  The source field addresses one of the register whose contents is to be used in an operation.  The destination field addresses a register where the result of an operation is to the written.

The X type encoding has a TGT (target) field, four bits in size which indicate a register to be used in an operation.  The remaining 12 bits in an X type encoding are ignored.

The Y type encoding has a C (constant) field 16 bits in size.  This field represents a 16 bit number which can be used as a memory address, branch displacement or a shift constant.

An example instruction is

add r4, r5, r6

which means add register 5 to register 6 and write the result into register 4.  This instruction uses an X type encoding.

The add instruction has an encoding of 0000 0000.  This encoding is commonly referred as an operation code or opcode.  The complete encoding of this instruction is:

0000 0000 | 0100 | 0101 | 0110 | xxxx xxxx xxxx

Where the first 8 bits are the opcode (0000 0000 for add), the next four bits are 4 for register 4, the next four bits are 5 for register 5 and the next four are 6 for register 6.  The last 12 bits are shown as X which means “don’t care” and can be either zero or one as they are ignored by the CPU.

Another instruction is:

lw r8, r12, 18fa

lw is shorthand for load word and this instruction will take the 16 bit value 18fa, add to it the contents of register 12 and use the resulting value as a memory address.  The contents of the computed memory address will be written to register 8.

The lw instruction uses a Y type encoding.  The lw instruction has a opcode of 0001 1001.   The complete encoding is:

0001 1001 | 1000 | 1100 | 0001 1000 1111 1010

Again the first 8 bits are the lw opcode, the next four are the destination register r8, the next is the source register r12 and the last 16 bits are the constant value 18fa.

There are 30 machine instruction, each having zero to three operands. The operands are either register addresses or a constant 16 bit integer.

This is the complete instruction set:

add d, s, t  - add s to t write result to d
addi d, s, C - add t to immediate C write result to d
sub d, s, t  - subtract t from s write result to d
mul s, t     - multiply s and t write result to hi and lo registers
div s, t     - divide s by t write result to hi and lo registers
sll d, s, C  - shift s left logically by constant C write result to d
srl d, s, C  - shift s right logically by constant C write result to d
sra d, s, C  - shift s right arithmetically by constant C write result to d
sllv d, s, t - shift s left logically by t write result to d
srlv d, s, t - shift s right logically by t write result to d
srav d, s, t - shift s right arithmetically by variable t write result to d
beq s, d, C  - branch, add constant C to PC if s equal to d
bne s, d, C  - branch, add constant C to PC if s not equal to d
slt d, s, t  - set d to 1 if s is less than t
slti d, s, C - set d to 1 if s is less than constant C
and d, s, t  - logically and s and t write the result to d
andi d, s, C - logically and s and constant C write the result to d
or d, s, t   - logically or s and t write the result to d
ori d, s, C  - logically or s and constant C write the result to d
xor d, s, t  - logically exclusive or s and t write the result to d
nor d, s, t  - logically nor s and t write the result to d
j C          - jump, write constant C to the PC
jr t         - jump register, write r to PC
jal d, C     - jump and link (subroutine call), write PC to d, write constant C to PC
jalr d, t    - jump and link register, write PC to d, write t to PC
lw d, s, C   - load word, write contents of memory at address s+C to d
sw d, s, C   - store word, write d to memory at address s+C
mfhi d       - write HI to d
mflo d       - write LO to d
halt         - halt cpu

In addition to the 30 machine instructions there are 16 pseudo instructions:

nop         - no operation
move d, s   - write s to d
clear d     - write zero to d
li d, C     - load immediate, write constant C to d
bgt x, y, C - branch to C if x > y
blt x, y, C - branch to C if x < y
bge x, y, C - branch to C if x >= y
ble x, y, C - branch to C if x <= y
mul d, s, t - write lower 16 bit result of s multiplied by t into d
div d, s, t - write result of s divided by t into d
push d      - push d onto the stack using register sp as the stack pointer
pop d       - pop d from the stack using register sp as the stack pointer
jal C       - jump and link to C using register ra implicitly
jalr t      - jump and link to contents of t using register ra implicitly
not d       - flip the bits of the contents d
neg d       - negate the contents of d

These pseudo instructions are substituted by one or more real instructions by the assembler during assembly.  For example neg d is replaced by:

              nor d, d, d
              addi d, d, 1

Which will flip the bits of d and add one resulting in the 2′s complement (negation) of the contents of d.

Some example code to sum numbers from 1 to 100 is:

    clear  t0             ; zero sum in t0
    li     t1, 0064       ; store 100 in t1 count
loop:
    add    t0, t0, t1     ; sum = sum + count
    sub    t1, t1, 0001   ; count = count - 1
    bne    t1, zero, loop ; jump back to add if t1 is not zero
    halt                  ; sum of 1 to 100 in t0

Control Unit

Next came the most difficult step: the control unit.

Essentially a CPU is a deterministic finite state machine and the control unit is what determines the next state of the CPU based upon its current state.

The current state of the CPU is governed by a clock which pulses many times a second.  Each time the clock pulses the CPU can do one operation.  Modern CPUs can actually do several things per clock pulse (or clock cycle), however I decided to keep thing simple.

To execute the instructions in RAM the CPU needs to fetch the instruction, decode the instruction and only then is it able to execute the instruction.  Instructions are addressed in RAM by the Program Counter (PC) register.  My instructions are two 16 bits words each, so we need two additional registers to store the fetched instruction.  I called these the Instruction Register (IR) and the Operand (OP).

The fetch operation requires four distinct steps:

fetch the word from memory at the address in the PC and store it in IR
add 1 to the PC register
fetch the word from memory at the address in the PC and store it in OP
add 1 to the PC register

These four steps take four clock cycles to complete.  The result is that the instruction to execute is in the IR and OP registers, ready for decoding and execution.

Each component in the CPU has one or more control lines.  These lines control the operation of the component.  For example in the register file above you can see a WRITE input.  This controls if a register will be written.  The control unit is in charge of determining the values of all the control lines based upon the current clock counter and instruction.

Here’s a picture of my control unit circuit:

My control unit has three inputs:

FETCH           0 indicates if the CPU is fetching an instruction
T            0000 clock cycles elapsed since previous instruction completion
INSTRUCTION 00000 opcode from the INS8 field from the IR register

There are 17 outputs which represent control lines to other parts of the CPU.

HALT            0        halt CPU
T RESET         0        reset clock counter
FETCH           0        fetching
PC SRC         00        selector for PC src
PC WRT         00        write PC reg
MAR SRC         0        selector for MAR src
MEM WRT         0        write RAM
IR WRT          0        write IR reg
OP WRT          0        write OP reg
REG SRC       000        selector for write REG src
REG WRT         0        write REG
ALU B SRC       0        selector for ALU B src
REG SRC2 SRC    0        selector for REG reg2 src
ALU OP       0000        ALU opcode
RESULT WRT      0        write RESULT reg
LO WRT          0        write LO reg
HI WRT          0        write HI reg

These 17 control lines coordinate everything that happen in the CPU during an instruction execution.

Here is an example for the instruction lw r2, r4, babe at RAM address 0000 0000 0000 000.

This instruction lw (load word) loads a 16 bit word from RAM into a register.  In this example it adds the 16 bit hexadecimal value babe (47806) to the contents of r4 and fetches the contents of RAM at that address and stores it in r2.  The full machine instruction encoding is: 0001 1001 0010 0100 1011 1010 1011 1110.

Let’s assume the PC contains 0000 0000 0000 0000, the T counter is 0000 and FETCH is 1.

Because FETCH is 1 the control unit will trigger control signals to perform an instruction fetch.  For my CPU the instruction fetch takes a total of 4 clock cycles.  The control circuit outputs 1 on IR WRT to trigger the IR reg to write.  It outputs 0 on MAR SRC to indicate that the PC reg is to be used as the RAM address src (MAR is the memory address register).  This results in the 16 bits (0001 1001 0010 0100) from address 0000 0000 0000 0000 being written to the IR register.  The control unit also outputs 1 on FETCH to ensure it stays in the fetch state.

When the clock again pulses the T counter counts to 0001.  Now the control unit input is FETCH 1, T 1, INSTRUCTION 11001.  The INSTRUCTION value is coming from the INS field of the IR register, but since the CPU is still fetching it’s ignored for now.  The control unit now outputs 01 on PC WRT which selects the value to write into the PC reg.  Value 01 selects PC+1 to be written to PC which results in the PC equal to 1.  The control unit also outputs a 1 on FETCH as the fetch operation is not yet complete.

The clock pulses again causing T to increment to 0010.  FETCH is still 1 so the control unit outputs OP WRT as 1 which causing the OP reg to be written with the contents of RAM.  0 is output on MAR SRC causing the PC reg to be used as the address src.  The result is the 16 bit contents of RAM (1011 1010 1011 1110) at the address contained in the PC reg (0000 0000 0000 0001) is stored in the OP reg.  1 is also output on FETCH to continue the fetch in the next clock cycle.

Again the clock pulses and T becomes 0011.  As in step 2 the PC WRT line is set to 01 which causes the PC reg to be written with the value of PC+1.  This results in the PC containing 0000 0000 0000 0010 which is the address of the next instruction in RAM.  The control unit also sets FETCH to 0 as the fetch operation is complete.

Now T is incremented to 0100 due the the next clock pulse.  The input to the control unit is now FETCH 0, T 0100 and INSTRUCTION 11001.  Because FETCH is 0 the control unit this time will enter the execution phase.  It outputs 1 to RESULT WRT which triggers the RESULT reg to be written.  The RESULT reg is connected to the output from the ALU, so this causes the result from the ALU operation to written to the RESULT reg.  The ALU OP line is set to 0000 which is the ALU opcode for addition.  The ALU B SRC is set to 1 which causing the B input of the ALU to come from the OP reg (1011 1010 1011 1110).  The A input of the ALU is always connected to the SRC1 OUT of the register file.  SRC1 REG of the register file is always connected to the SRC4 field of the IR reg which is 0100 (r4 from the instruction).  This results in the contents of r4 and the OP reg sent to the ALU which performs an addition and send to the result to the RESULT reg which is written.  FETCH is kept at 0 as the instruction is not yet complete.

Now we enter the 5th clock pulse since the beginning of this lw instruction.  T is incremented to 0101, FETCH is still 0 and INSTRUCTION is still 11001.  The control unit outputs MAR SRC as 1 which causes the RESULT reg to be used as the source for the RAM address.  REG SRC is set to 101 which causes the data read from RAM to be sent to the DATA IN input of the register file.  The DST REG input of the register file addresses the register to write and is always connected to the DST4 field of the IR reg.  The control unit outputs 1 to REG WRT which causes the register file to write it’s DATA IN input to the register specified by the DST REG input (r2 in the example).  The control unit also sets T RESET to 1 which triggers the T reg to reset to 0000 and sets FETCH to 1.  Both of these setup for the next instruction fetch as this instruction is now complete.

Computer

Once I had the CPU I needed a computer to test it in.  Here’s a picture of mine:

The computer is just the CPU with some RAM and various input/output (IO) devices.

I added a few memory mapped IO devices to the computer.  A memory mapped IO device is a device that appears to the computer as a series of memory addresses.  A program can read or write to those memory addresses to communicate with the IO device.  For example in my computer I added a simple output screen at address fffe and ffff.  To write a character to the screen a program writes it’s ASCII code the the memory address fffe.  There’s circuity to detect this address and instead of writing to RAM the data is sent to the screen instead.  Writing a 1 to ffff clears the screen.

I added a random number generator to address f002.  A program can get a random 16 bit value every time it reads from address f002.

Here’s a program to write random characters to the screen using the above IO devices.

loop:
      lw      t1, zero, 0f002 ; get random num
      andi    t2, t1, 1f      ; zero top 11 bits
      addi    t2, t2, 20      ; t2 between 20 and 3f (printable ASCII)
      sw      t2, zero, 0fffe ; write random character
      j       loop            ; do it again forever

I wrote an assembler to take this code and assemble it into the machine instructions that cane be loaded into RAM and executed in the CPU emulated by Logisim.

Next Steps

It took about six months to create this CPU, working for a few hours each weekend.

When designing the instruction set I wanted it to be easy to program, hence the large number of registers.  The four cycle instruction fetch is quite poor, so I’m going to try to remove that.  I’m considering a much simpler RISC design, with a single cycle fetch, decode and execute steps.  Perhaps even some form of pipe-lining or dual ALU cores, or perhaps a stack based architecture.

I also have plans of writing a Forth implementation for it and then perhaps an implementation of Lisp or C.  I’d need to write an emulator for the CPU in order to simplify testing as programs running on the CPU in Logisim are quite difficult to debug.

Comments

« Previous entries Next Page » Next Page »