Stored Program Computers and Modern CPUs

In the beginning, logic and control were almost completely distinct from data. In tabulating machines the data was on cards, and the programming was done with jumpers, stored as jumper frames. Then, as electronic computers were developed and data came to be stored electronically, it was realized that the logic and control could themselves be represented as data and stored in memory right alongside the data they operate on. Architecturally it might be worthwhile to treat each store differently, but in principle they can share one unified memory store. Regardless, code is data: we load binary machine code from the data store to ready it to run, and at some point the system is told to treat that data as code so it can execute on the hardware. Along the way many architectural variations have been tried, but architects have come to rely on the principle that the simpler and more consistent the architecture, the better.

The most popular microprocessors have had word sizes that are power-of-two multiples of 8 bits. Generation zero had 8-bit registers and 16-bit addresses, and sometimes offered extended 16-bit operations. The emergence of the 32-bit architecture was something of a watershed: a 32-bit byte address can access 4 GB directly, which is more than big enough for most problems. Currently the state of the art is 64-bit words and addresses, which covers more memory than we can build for the foreseeable future. You could put a 64-bit processor on an SoC, but mostly it would be overkill; better to leave all that chip real estate for other useful functions.
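
To make the address-space arithmetic concrete, here is a tiny C++ sketch (the numbers, not the code, are the point):

    #include <cstdint>
    #include <iostream>

    int main() {
        // Number of distinct byte addresses for common address widths.
        std::cout << "16-bit addresses: " << (1ULL << 16) << " bytes (64 KB)\n";
        std::cout << "32-bit addresses: " << (1ULL << 32) << " bytes (4 GB)\n";
        // A 64-bit address space (2^64 bytes, 16 exabytes) doesn't even fit
        // in a 64-bit counter plus one, which is rather the point.
        return 0;
    }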

A Word about Virtualization

In NAND to Tetris, a whole layer of their abstraction stack, called the "Virtual Machine", is the target of the high-level language. While the concept of the virtual machine really is key to how we can change a lot of the hardware details and still run pretty much the same software, the VM in their architecture stack is a specific concept in high-level language design that first became widespread with the Java language. The tools for that course are implemented in Java, so it is a natural fit, but I want to emphasize that the concept of virtualization is much more general.

The original virtual machines are the mathematical abstractions created by Church and Turing to model computation mathematically. It is worth studying their work in detail, but even more important are the general results about computation: first, that the two models are equivalent, expressing the same idea of computation; and second, that there are hard theoretical limits to computability, very much parallel to Gödel's incompleteness results for mathematics. The Church model is interesting in how it doesn't really reference any hardware expression, while the Turing model is practically a physical model of computation. The implication is that the Church model doesn't fully express the costs of computation. In the Turing model you can ask how much tape a run uses, how many steps it takes, how many states the controller has, and so on. You can only formulate these questions about Church computations after considering implementations in some depth. And yet the two models are shown to be equivalent.
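
To make the cost question concrete, here is a minimal C++ sketch of a Turing-machine-style run in which "how many steps" and "how much tape" are directly observable; the machine itself (a unary increment) is invented for illustration, not taken from Turing's paper.

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // Toy Turing machine: scan right over a unary number and append a '1'.
        // The machine is trivial; the point is that steps taken and tape cells
        // used are directly measurable costs of the run.
        const char BLANK = '_';
        std::map<long, char> tape;              // sparse tape; unwritten cells are blank
        std::string input = "111";              // unary 3
        for (long i = 0; i < (long)input.size(); ++i) tape[i] = input[i];

        long head = 0;
        bool halted = false;
        long steps = 0;

        while (!halted) {
            char sym = tape.count(head) ? tape[head] : BLANK;
            if (sym == '1') { ++head; }                      // keep scanning right
            else { tape[head] = '1'; halted = true; }        // write a '1' and halt
            ++steps;
        }

        std::cout << "steps: " << steps
                  << ", tape cells used: " << tape.size() << "\n";
        return 0;
    }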

When we move from these abstract models to real CPU implementations, the first thing to notice is that the Turing machine has a potentially infinite tape, but as a practical matter we implement CPUs almost universally with a fixed width, mostly 8, 16, 32 or 64 bits now that all the oddball architectures are long gone. That's fine, because anything the hardware leaves out can be done in software: if you really need infinite-precision arithmetic, or to address a potentially unbounded amount of memory, software can provide it. This hardware/software tradeoff is important in the NAND to Tetris presentation, and generally throughout the development of leading-edge technologies. This tradeoff is central to the RISC/CISC debate mentioned below.
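
As a minimal sketch of that tradeoff, here is arbitrary-precision addition built out of fixed-width words in C++; the 32-bit limb size and the representation are assumptions for illustration, not any particular bignum library's format.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A number is a vector of 32-bit "limbs", least significant limb first.
    // The hardware only adds fixed-width words; software chains the carries.
    std::vector<uint32_t> add(const std::vector<uint32_t>& a,
                              const std::vector<uint32_t>& b) {
        std::vector<uint32_t> sum;
        uint64_t carry = 0;
        for (size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
            uint64_t t = carry;
            if (i < a.size()) t += a[i];
            if (i < b.size()) t += b[i];
            sum.push_back(static_cast<uint32_t>(t));   // keep the low 32 bits
            carry = t >> 32;                           // overflow goes to the next limb
        }
        return sum;
    }

    int main() {
        // 2^32 + 2^32 = 2^33, which does not fit in a single 32-bit word.
        std::vector<uint32_t> x = {0, 1};   // the value 2^32
        std::vector<uint32_t> z = add(x, x);
        for (size_t i = z.size(); i-- > 0;) std::cout << z[i] << " ";
        std::cout << "\n";                  // prints "2 0": the limbs of 2^33
        return 0;
    }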

Wherever possible this material should refer to current and emerging best practices, and to complete this discussion of the VM layer there are at least two that are very important. Java and its JVM (Java Virtual Machine) have been around for a while, and there are many tools and toolchains to develop and deploy Java-based systems. In the Java model, compilers and tools are supposed to target the VM directly, but in practice the bytecodes are not always interpreted as-is: if the code is translated once from JVM bytecodes to native machine language, it will run faster as native code. This is just a case of moving as much work as you can to before you load and run a program. You can interpret the JVM codes "on the fly", but you can optimize away the repetitive work of reading and decoding bytecodes and looking up the code fragments needed to evaluate them with a caching scheme, or do all of this even earlier by compiling small blocks of JVM bytecode to native code ahead of time.
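
Here is a minimal C++ sketch of the "interpret on the fly" half of that tradeoff: a made-up four-instruction stack bytecode, decoded one opcode at a time. The repetitive fetch/decode/dispatch loop is exactly the work a JIT removes by translating hot bytecode sequences to native code once and reusing the result. The opcodes are invented for illustration and are not real JVM bytecodes.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Invented opcodes for a toy stack machine (not JVM bytecodes).
    enum Op : uint8_t { PUSH = 0, ADD = 1, PRINT = 2, HALT = 3 };

    int main() {
        // Program: push 2, push 40, add, print, halt.
        std::vector<uint8_t> code = {PUSH, 2, PUSH, 40, ADD, PRINT, HALT};
        std::vector<int64_t> stack;

        // The interpreter pays for fetch + decode + dispatch on every opcode,
        // every time the program runs; a JIT pays once per code fragment instead.
        for (size_t pc = 0; pc < code.size();) {
            switch (code[pc]) {
                case PUSH:  stack.push_back(code[pc + 1]); pc += 2; break;
                case ADD: {
                    int64_t b = stack.back(); stack.pop_back();
                    int64_t a = stack.back(); stack.pop_back();
                    stack.push_back(a + b); ++pc; break;
                }
                case PRINT: std::cout << stack.back() << "\n"; ++pc; break;
                case HALT:  pc = code.size(); break;
            }
        }
        return 0;
    }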

The other important technology, and one that we will focus a lot of attention on when we get to that level of the abstraction stack, is LLVM, the Low Level Virtual Machine. The idea is that much of the work to optimize code, as well as link editing and in general how the language toolchain is architected, is independent of both the high-level language being compiled and the low-level machine code being run on the machine. Part of the LLVM toolset is a set of C++ libraries for generating LLVM code, and a standard exercise with them is building a toy compiler. As an exercise you might try to implement the toy language from NAND to Tetris, Jack, on LLVM by building from such an example. Also expect golang to be an important language for our collaborative work. What makes LLVM so promising is how it is designed to interface with code generators rather than to be interpreted directly. That means it doesn't need low-level byte codes; its external representations exist to persist the tree-structured internal representations shared by the toolchain, and they are designed to store and reload those structures efficiently as they are passed between the toolchain libraries.
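
As a sketch of what those C++ libraries look like in use, here is IR generation for a trivial add function, the kind of fragment a Jack compiler might emit for an expression. This assumes a reasonably recent LLVM; the IRBuilder calls are from the LLVM C++ API, but treat the details (and version requirements such as Function::getArg) as things to verify against your installation.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main() {
        LLVMContext ctx;
        Module mod("jack_demo", ctx);
        IRBuilder<> builder(ctx);

        // Build IR for: int add(int a, int b) { return a + b; }
        Type *i32 = builder.getInt32Ty();
        FunctionType *fnTy = FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
        Function *fn = Function::Create(fnTy, Function::ExternalLinkage, "add", &mod);

        BasicBlock *entry = BasicBlock::Create(ctx, "entry", fn);
        builder.SetInsertPoint(entry);
        Value *sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
        builder.CreateRet(sum);

        verifyFunction(*fn);          // sanity-check the generated IR
        mod.print(outs(), nullptr);   // dump the textual IR to stdout
        return 0;
    }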

For virtual memory, CPU architectures include special hardware to make the process efficient, but its purpose is to give software a simple and consistent model of main memory. In small architectures with 16 or fewer bits of address space you rarely need a program to address more memory than the machine physically has, but even some extended 16-bit architectures like the 8086 and 80286 might use more memory than is physically available because of multi-tasking. Here each "process" gets its own memory space, and segment registers are used to move around the several 64K memory windows available to "user mode" processes. As you might imagine, this is a bit of a mess to manage, and when 32- and then 64-bit architectures became available, all of these complex memory models went away.
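
To see why that model was messy, here is the real-mode 8086 address calculation in a few lines of C++ (the 80286's protected mode replaced this with segment descriptors): a 16-bit segment register is shifted left four bits and added to a 16-bit offset, giving about 1 MB reachable through overlapping 64K windows.

    #include <cstdint>
    #include <cstdio>

    // Real-mode 8086 addressing: physical = segment * 16 + offset.
    // Each segment register exposes a 64 KB window; windows overlap every 16 bytes.
    uint32_t physical(uint16_t segment, uint16_t offset) {
        return (static_cast<uint32_t>(segment) << 4) + offset;
    }

    int main() {
        // Two different segment:offset pairs can name the same physical byte.
        std::printf("1234:0010 -> %05X\n", physical(0x1234, 0x0010));  // 12350
        std::printf("1235:0000 -> %05X\n", physical(0x1235, 0x0000));  // 12350
        return 0;
    }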

Now, in systems with virtual memory, and this includes modern smartphones too, there is a hardware- and software-supported virtual machine that provides 1) an ISA (instruction set architecture, a VM instruction set implemented in hardware) and 2) a memory model that supports large flat address spaces and leaves the details of main memory (RAM) and secondary storage (disk or flash) to the system. The ISA is the subset of the CPU's instruction set that is available to user programs and can be targeted by high- and low-level languages. The instructions and features needed to implement other virtualizations (memory, here) are privileged, available only to systems code invoked by hardware traps. One kind of trap is a call to a system function, another is servicing an interrupting I/O device, and for virtual memory there are memory traps raised in the middle of ISA instructions when a memory address is not present, or is written while marked read-only. Thus a new processor, or one from a different vendor, might need changes to system code because the non-virtualized system instructions and traps are handled differently, but user code runs unchanged.
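
Here is a toy C++ sketch of the memory-model half of this: split a virtual address into a page number and an offset, look the page up in a made-up single-level page table, and treat a missing or read-only page as the moment real hardware would trap into privileged system code. Real MMUs use multi-level tables and TLBs; the structure here is only illustrative.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Toy single-level page table with 4 KB pages (12 offset bits).
    constexpr uint64_t PAGE_BITS = 12;
    constexpr uint64_t PAGE_SIZE = 1ULL << PAGE_BITS;

    struct PageEntry { uint64_t frame; bool present; bool writable; };
    std::unordered_map<uint64_t, PageEntry> page_table;   // page number -> entry

    // Translate a virtual address; a failure here is where real hardware would
    // raise a page or protection fault and enter privileged trap-handling code.
    bool translate(uint64_t vaddr, bool is_write, uint64_t& paddr) {
        uint64_t page = vaddr >> PAGE_BITS;
        uint64_t offset = vaddr & (PAGE_SIZE - 1);
        auto it = page_table.find(page);
        if (it == page_table.end() || !it->second.present) return false;  // not present
        if (is_write && !it->second.writable) return false;               // read-only
        paddr = (it->second.frame << PAGE_BITS) | offset;
        return true;
    }

    int main() {
        page_table[0x12345] = {0x00042, true, false};   // mapped, read-only
        uint64_t pa;
        std::cout << std::hex;
        if (translate(0x12345678, false, pa)) std::cout << "read  -> " << pa << "\n";
        if (!translate(0x12345678, true, pa)) std::cout << "write -> trap (read-only)\n";
        if (!translate(0xdeadbeef000, false, pa)) std::cout << "read  -> trap (not present)\n";
        return 0;
    }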

The word virtual gets quite a workout in computer science because we are doing so much abstraction in so many layers, where having somewhat standardized abstractions helps us manage change. It allows us to work on compilers and optimizations far away from concerns about just where to put the hardware/software split for each feature. The hardware architects can focus on the areas that get them the most boost from the software they are actually asked to run, and not require that developers contort their processes to low-level hardware optimizations. Virtualization is good for separation of concerns, whether we name it that or not.

Programs in Memory

Even in the abstract Turing model, there is a clear split between control (the state machine that reads and writes the tape) and the data (the tape), but that is a bit misleading. The reality is that any interesting computation will have to put its recursive elements into memory (write them to tape), because the finite states of the control logic would otherwise limit the depth of those recursive elements. In other words, if you implemented Church's Lambda Calculus, you would have to put all the Lambda code into memory (on the tape), and the control logic would implement the language specification, which is finite.

It may have been inevitable that we would soon discover that codes in the data could be "run", or interpreted, by other software and hardware, but it is still an important insight that the same digital storage elements being used for data could be interpreted by a control mechanism, and thus machine language was born.

Processors I have Known and Loved

Around the time I came on the scene, the 8-bit microprocessors ruled: the 6502 in the Apple II and others, and several hobby machines with 8080 or Z80 processors; my first machine had a Z80, which was a superset of the 8080 architecture. There were many others, and the 6800 family from Motorola had some following, but Motorola only had a real hit with the 68000 family, whose later entries were among the first true 32-bit single-chip processors.

In the meantime, Intel had a big win when its 8088 chip was selected by IBM for the PC. I started at Victor Business Products at that time, where we had an 8088-based machine, the Victor 9000. This turned out to be the beginning of the PC clone era, in which Intel-based PCs carried forward some derivative of the IBM PC I/O bus as it morphed over the years, with the 32-bit 80386 marking an important milestone by enabling commodity PC hardware to run advanced multitasking operating systems. In the 1980s and 90s there were a number of competing 32- and 64-bit processors, many of them RISC processors, but in the end Intel's superior market power and deep pockets were able to consolidate the market for high-end microprocessors.

RISC requires a little more explanation. RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer) was an idea in processor architecture that you should design a simpler instruction set and invest the hardware resources in other acceleration techniques like pipelining, caching, and so on. The argument was that software, in this case mostly the compilers that emit machine-level instructions, could make up for the absence of complex instructions and addressing modes at the machine-instruction level. In the competition of the marketplace this idea pretty much proved out and gave these processors a slight competitive edge, but that edge was overwhelmed by other technological areas where Intel had an advantage: Intel could achieve similar acceleration with a few more gates, and it had those extra gates to spend because its process technology was a generation ahead of much of the competition.

So these days we mainly see two processor architectures. The first is the Intel architecture, from Intel and the ISA-compatible AMD; both of these companies offer many processor models, some for low power and laptops, others with as many cores as they can fit with current technology, and AMD has recently started putting two different kinds of core on some chips. I almost forgot to mention video accelerators, which are typically some sort of DSP (Digital Signal Processor). Several video card manufacturers still compete with their own processors, but generally these processors are streamlined for repeated computation, often with a lot of floating-point math, so they are given vector processing features and multiple floating-point units. AMD can also put video accelerators alongside its standard cores for greater integration at the system level.

The other important architecture is ARM, which is in pretty much all smartphones and in two project systems we will be interested in: the BeagleBone and the Raspberry Pi both use ARM processors. They are complete computer systems on small boards with extensive expansion capabilities; you can easily prototype custom hardware to interface with whatever you might imagine and program it from these small but powerful computer systems. For learning projects we can also find products that are more or less a complete Android phone, just the board, with extendable connections for customization. Therefore we can entertain projects that extend a custom smartphone.

Now a new family of processors is emerging and already becoming significant in the remaining processor market, the growing segment of embedded devices and more: a new RISC architecture, RISC-V. Because it is open source and flexible it may even take over the entrenched Intel and ARM spaces, and projects in the RISC-V space will be of great interest to us for the same reasons. It has a core set of instructions and a well-defined process for creating extensions in reserved parts of the instruction code space. You can go to the RISC-V websites for more detail, and to find projects to learn from and engage with.
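
As a taste of that well-defined instruction space, here is the R-type encoding of a base-ISA add instruction in C++. The field layout follows the RISC-V spec; major opcodes such as the "custom" opcodes are set aside for exactly the kind of extensions described above (treat the specific constants as something to check against the current spec).

    #include <cstdint>
    #include <cstdio>

    // RISC-V R-type layout (32 bits):
    // funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
    uint32_t r_type(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                    uint32_t funct3, uint32_t rd, uint32_t opcode) {
        return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
               (funct3 << 12) | (rd << 7) | opcode;
    }

    int main() {
        // add x1, x2, x3 : OP major opcode 0x33, funct3 = 0, funct7 = 0.
        std::printf("add x1, x2, x3 = 0x%08X\n", r_type(0, 3, 2, 0, 1, 0x33));
        // Vendor extensions live in reserved major opcodes, so custom
        // instructions never collide with the base ISA.
        return 0;
    }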