In the beginning, logic and control were almost completely distinct from data.  In [[https://en.wikipedia.org/wiki/Tabulating_machine|tabulating machines]], the data was on cards, and the programming was done by jumper wires plugged into [[https://en.wikipedia.org/wiki/Plugboard|plugboards]].  Then, as electronic computers were developed and data came to be stored electronically, it was realized that the logic and control could themselves be represented as data and stored in memory right alongside the data.  Architecturally it might be worthwhile to treat each store differently, but in principle they can live in one unified memory store.  Along the way many architectural variations have been tried, but architects have come to rely on the principle that the simpler and more consistent your architecture, the better.  The most popular microprocessors have had word sizes that are power-of-2 multiples of 8 bits.  Generation zero had 8 bit registers and 16 bit addresses, and sometimes extended 16 bit operations.  The emergence of the 32 bit architecture was something of a watershed: a 32 bit byte address can access 4 GB directly, which is more than big enough for most problems.  Currently the state of the art is 64 bit words and addresses, which covers more memory than we can build for the foreseeable future.  You could put a 64 bit processor on an SoC, but mostly it would be overkill; better to leave that chip real estate for other useful functions.
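To put rough numbers on those address-space sizes, here is a quick Python sketch (my own back-of-the-envelope check, not from any architecture manual) of how far an n-bit byte address reaches:

```python
# How far an n-bit byte address reaches: 2**bits distinct byte locations.
for bits in (16, 32, 64):
    print(f"{bits}-bit addresses: {2**bits:,} bytes")
# 16-bit:                     65,536 bytes (64 KiB)
# 32-bit:              4,294,967,296 bytes (4 GiB)
# 64-bit: 18,446,744,073,709,551,616 bytes (16 EiB)
```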
A Word about Virtualization
In NAND to Tetris, there is a whole layer of the abstraction stack called the "Virtual Machine," which is the target of the high-level language. While the concept of a virtual machine is key to how we can change a lot of the hardware details and still run pretty much the same software, the VM in their architecture stack is really a specific concept in high-level language design that first became widespread with the Java language. The tools for that course are implemented in Java, so it is a natural fit, but I want to emphasize that the concept of virtualization is much more general.
The original virtual machines are the mathematical abstractions created by [[https://en.wikipedia.org/wiki/Lambda_calculus|Church]] and [[https://en.wikipedia.org/wiki/Turing_machine|Turing]] to model computation mathematically. It is worth studying their work in detail, but even more important are the general results about computation: the two models are equivalent, expressing the same idea of computation, and there are hard theoretical limits to computability that closely parallel [[https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems|Gödel's work]] on the incompleteness of mathematics. The Church model is interesting in that it doesn't really reference hardware at all, while the Turing model is practically a physical model of computation. The implication is that the Church model doesn't fully express the costs of computation. In the Turing model you can ask how much tape a run uses, how many steps it takes, how many states the controller has, and so on. You can only formulate these questions about Church-style computations after considering an implementation in some depth. And yet the two models are shown to be equivalent.
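To make that concrete, here is a toy Turing-machine simulator in Python (my own minimal sketch, not Turing's formulation) that increments a binary number and reports exactly the kind of cost questions the Turing model lets you ask: how many steps it ran and how many tape cells it touched.

```python
# A toy Turing machine: (state, symbol) -> (new state, symbol to write, head move).
# This one increments a binary number whose least-significant bit is under the head.
RULES = {
    ("carry", "1"): ("carry", "0", -1),   # 1 + carry = 0, propagate carry left
    ("carry", "0"): ("halt",  "1",  0),   # 0 + carry = 1, done
    ("carry", "_"): ("halt",  "1",  0),   # ran off the left end: new high bit
}

def run(tape, state="carry"):
    cells = dict(enumerate(tape))         # sparse tape, "_" means blank
    head, steps = len(tape) - 1, 0
    while state != "halt":
        state, write, move = RULES[(state, cells.get(head, "_"))]
        cells[head] = write
        head += move
        steps += 1
    lo, hi = min(cells), max(cells)
    result = "".join(cells.get(i, "_") for i in range(lo, hi + 1))
    return result, steps, len(cells)      # output, steps taken, cells touched

print(run("1011"))    # ('1100', 3, 4): 11 + 1 = 12 in 3 steps, touching 4 cells
```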
When we move from these abstract models to real CPU implementations, the first thing to notice is that the Turing machine has a potentially infinite tape, but as a practical matter we implement CPUs almost universally with a fixed width, mostly 8, 16, 32 or 64 bits now that all the [[https://en.wikipedia.org/wiki/PDP-10|oddball]] [[https://en.wikipedia.org/wiki/CDC_6600|architectures]] are long gone. That's OK, because if you really need infinite-precision arithmetic, or to address a potentially unbounded amount of memory, you can do it in software. This hardware/software tradeoff is important in the NAND to Tetris presentation, and generally throughout the development of leading-edge technologies.
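As an example of that tradeoff, here is a rough Python sketch of how software stacks arbitrary precision on top of fixed-width hardware words; the 32-bit limb size and the function name are illustrative assumptions, but the carry-propagation idea is the same one bignum libraries (and Python's own integers) use underneath.

```python
# Arbitrary-precision addition built from fixed-width "limbs".
WORD = 32
MASK = (1 << WORD) - 1

def add_bignum(a, b):
    """Add two numbers stored as lists of 32-bit limbs, least significant first."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s & MASK)      # low 32 bits stay in this limb
        carry = s >> WORD         # overflow becomes the carry into the next limb
    if carry:
        out.append(carry)
    return out

# 2**32 - 1 plus 1 overflows a single 32-bit word but not the bignum:
print(add_bignum([MASK], [1]))    # [0, 1], i.e. 1 * 2**32 + 0
```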
In virtual memory, CPU architectures include special hardware to make the process efficient, but its purpose is to give software a simple and consistent model of main memory. In small architectures with 16 or fewer bits of address space you rarely need a program to address more memory than the machine physically has, but even some extended 16 bit architectures like the 8086 and 80286 might use more memory than is physically available because of multi-tasking. There each "process" gets its own memory space, and segment registers are used to move around the several 64K memory windows available to "user mode" processes. As you might imagine, this is a bit of a mess to manage, and when 32 and then 64 bit architectures became available, all of these complex memory models went away.
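As a reminder of what those 64K windows looked like, here is a small sketch of the 8086 real-mode address calculation (segment × 16 + offset, folded into a 20-bit physical address). The 80286's protected mode replaced this with descriptor tables, but the flavor of shuffling windows around is the same; the function name here is just for illustration.

```python
# 8086 real mode: a 16-bit segment and 16-bit offset form a 20-bit physical
# address, so each segment register opens a 64 KiB window into 1 MiB.
def real_mode_address(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF   # segment * 16 + offset, mod 1 MiB

print(hex(real_mode_address(0x1234, 0x0010)))    # 0x12350
print(hex(real_mode_address(0xFFFF, 0x0010)))    # wraps past 1 MiB back to 0x0
```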
Now in systems with virtual memory, and this includes modern smart phones too, there is a hardware- and software-supported virtual machine that provides 1) an ISA (instruction set architecture, a VM instruction set implemented in hardware) and 2) a memory model that supports large flat address spaces and leaves the details of main memory (RAM) and secondary storage (disk or flash) to the system. The ISA is the subset of the CPU's instruction set that is available to user programs and can be targeted by high and low level languages. The instructions and features needed to implement the other virtualizations (memory, here) are privileged, and only available to system code invoked by hardware traps. One kind of trap is a call to a system function; another services an interrupting I/O device; and for virtual memory there are memory traps raised in the middle of ISA instructions when a memory address is not present, or when a write hits a page marked read-only. Thus a new processor, or one from a different vendor, might need changes to system code because the non-virtualized system instructions and traps are handled differently, but user code runs unchanged.
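To make the memory-trap idea concrete, here is a minimal Python sketch of the translation the memory-management hardware performs on every user-mode access. The 4 KiB page size, the table layout, and all the names are illustrative assumptions, not any real CPU's page-table format; the point is only that a missing page or a forbidden write turns into a trap the operating system must handle.

```python
# Sketch of virtual-to-physical translation with faults standing in for traps.
PAGE_SIZE = 4096

class PageFault(Exception):
    """Stands in for the hardware trap that hands control to the OS."""

def translate(page_table, vaddr, write=False):
    vpn, off = divmod(vaddr, PAGE_SIZE)            # virtual page number + offset
    entry = page_table.get(vpn)
    if entry is None or not entry["present"]:
        raise PageFault(f"page {vpn} not present")  # OS would fetch it from disk
    if write and entry["read_only"]:
        raise PageFault(f"write to read-only page {vpn}")
    return entry["frame"] * PAGE_SIZE + off        # physical frame + same offset

table = {0: {"present": True, "read_only": False, "frame": 42}}
print(translate(table, 0x123))        # 42 * 4096 + 0x123
try:
    translate(table, 0x5000)          # page 5 is not mapped -> trap to the OS
except PageFault as e:
    print("trap:", e)
```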