/* BACKGRND.TXT - KLH10 Background Commentary
*/
/* $Id: backgrnd.txt,v 2.3 2001/11/10 21:24:21 klh Exp $
*/
/*  Copyright © 2001 Kenneth L. Harrenstien
**  All Rights Reserved
**
**  This file is part of the KLH10 Distribution.  Use, modification, and
**  re-distribution is permitted subject to the terms in the file
**  named "LICENSE", which contains the full text of the legal notices
**  and should always accompany this Distribution.
*/

FAIR WARNING: You don't need to read this file.  It has nothing of
interest to most users.  Hackers, however, may like it.

It is mostly a rambling commentary on the KLH10 code that attempts to
explain why things are the way they are, and speculate on how they
might be.  And like the code itself, this is an incomplete work in
progress.

As you read the source you will notice that some parts are rather
simplistic and specific, while others are too complex and generalized.
Many things are not fully implemented, and many others are
over-implemented.  Sometimes there are good reasons for this and other
times that's just how it ended up when I ran out of time during one
development phase or another.

A little history may help set the context:

HISTORY
=======

From an outside viewpoint there have been essentially two major KLH10
versions: the first being a "toy" KS10 emulator (V0.x), and the second
a commercial KL10B emulator (V1.x).  Internally, of course, many more
versions have existed as part of a long incremental development
process.

[XXX: Flesh out later]

Roots - PDP-10 arithmetic package for KCC cross-compiling.
First version - KS10 running ITS, then TOPS-20.  Synchronous device model.
DEC intervention - KL10B version on Alpha, commercial use.
    Extended addressing, KL devices.  Device subprocess model.
Merged version - KS and KL, shared devices.  Intended for Public ITS systems.
Source release - cleanups, multiple tape format support, etc.
PRIORITIES
==========

Once the decision was made to pursue a commercial KL10B emulator, KLH10
design and implementation were driven by four main priorities, in the
following order of importance:

    [1] Accuracy
    [2] Stability & reliability
    [3] I/O Performance
    [4] Portability

[1] Accuracy
------------

The most important goal was implementing an accurate emulation, which
had two facets.  Not only did the emulator have to run existing TOPS-10
and TOPS-20 monitor binaries as-is, it also had to emulate a KL10B
accurately enough for users to run their applications and get exactly
the same answers as before.  Otherwise, it would be worthless as a
product.

This is a matter of judgement since perfect emulation is impossible,
but there is a huge difference between the initial "toy" KLH10 that
first ran ITS and the current one.  The vast bulk of the work went into
making the emulator behave correctly, far more than into making it
"fast".

There were many problems.  Some were difficult simply because of their
complexity, but the hardest were the ones where the documentation was
wrong or ambiguous in describing how a real KL10B behaved.  Even access
to the microcode was not much help, since much of the KL logic was
actually implemented in hardware.  Some of the issues encountered:

    - Extended addressing idiosyncrasies, especially with PXCT
    - EXTEND instructions - many ambiguities
    - Floating point - hardware garbage
    - RH20 and DC10 emulation (more "Adventures in RH20-Land")
    - DTE 10-11 device emulation
    - NI20 emulation
    - Monitor race conditions

The DTE and NI20 are particularly complex since in reality those
devices were self-contained processors in their own right.

Of course, for practical reasons not everything could be emulated; see
the file "kldiff.txt" for details on the differences between the KN10
and a real KL10B.

[2] Stability & reliability
---------------------------

For this kind of product to be worth something, it has to work and keep
working.
The more problems I ran into while developing it, the more paranoid I
became about making possibly destabilizing changes.  In particular,
once it was deployed to clients I tried to make only those changes that
were absolutely necessary for one reason or another.

I did build a large battery of tests to help verify whether or not the
internal KN10 processor was accurately emulating a real KL10B, and to
exercise a variety of device operations.  These regression tests are
still very helpful, but they do not catch everything; they are good
only up to a point.

As a result, once code was proven to work, it tended to remain
relatively static from then on.  Only with the KLH10 source release and
its availability to many more potential testers did I feel safe in
finally starting many long-delayed cleanups.

What this means is that the V2.x sources you have are not the same as
the commercial V1.x versions.  They are intended to be better, but have
not undergone anything close to the same amount of testing and
verification.  At some point, of course, V2.x should be stable enough
to replace V1.x with minimal perturbation.

[3] I/O Performance
-------------------

The biggest problem with the initial toy KLH10 was its lack of
asynchronous I/O; any device I/O was always "instantaneous" in the
sense that all CPU operations blocked completely until the I/O was
finished.  This CPU blockage is okay for a single-user personal system,
but unacceptable for a true time-sharing system and completely
prohibitive when trying to support real magnetic tape drives.

The solution adopted was to implement devices as separate Unix child
processes (the DP filename prefix stands for "Device Process"),
allowing them all to run independently and asynchronously.

[XXX: flesh out a bit more]

[4] Portability
---------------

While always desirable on general principles, portability is a
somewhat idealistic goal -- unlike the other priorities, there is no
absolute requirement for it.
The main reason the emulator remains portable is that I wanted it that
way in preparation for the day when it could be distributed.

There is nothing in the KLH10 code that requires an integer type larger
than 32 bits, nor does it require any features not found in ANSI C.  In
fact, for a long time it did not rely on ANSI features at all.  When it
was first written, this was an inescapable requirement because 64-bit
types, not to mention ANSI compilers, were few and far between; even
GCC's "long long" had early bugs.  With the Alpha version 64-bit code
became possible, but nothing has ever relied on it.  This is why every
reference to a PDP-10 word or value is made by means of a macro: the
underlying representation can then change without touching the code
that uses it.  Even on platforms that support 64-bit types, the host
architecture sometimes performs better when the code is restricted to
32 bits.

From the viewpoint of OS (rather than CPU) portability, the code again
tries to rely on ANSI C library functions rather than Unix system
calls, but it is fair to say that most of the existing OS-dependent
support is Unix-oriented, especially for the real-time asynchronous
version.  Fortunately, with the advent of true operating systems for
both MacOS (X) and Windows (NT, 2K, XP), this should be less of a
problem in the future.

Non-issue: CPU speed
--------------------

It may seem odd to people who have not run a business or service, but
the speed of the emulated PDP-10 has never been a real issue.  In
practice clients have simply picked a hardware platform that provides
acceptable performance, and once their systems are up and working,
they prefer to see as little change as possible.  There is little
compulsion to upgrade to faster or more featureful versions unless
something is actually broken, which should only happen if the platform
OS is upgraded to yet another incompatible release.
MAJOR DESIGN ISSUES
===================

C vs Assembler
--------------

The notion of a PDP-10 emulator had been bouncing around for some time
before I decided to embark on the project, and performance was always
the number one issue; it was not clear whether it could be made to
execute PDP-10 code fast enough to be useful.  A typical workstation
was a 33MHz SPARC-2, and x86 PCs were still 386s.

Several people felt the best way to achieve this was to pick a host
architecture (SPARC or the forthcoming Alpha) and code in assembly
language.  As a longtime PDP-10 assembler fan, I found this approach
appealing as well, but eventually I concluded that the emulator would
be best written in C, for two major reasons:

    - Portability.  After having engineered a major porting effort from
      TOPS-20 to Unix while at SRI, I never wanted to be locked into a
      specific machine architecture again.  Oracle's extremely
      impressive porting system provided additional conviction that
      this was the right approach.

    - Implementation time.  It was much faster to write in C than in
      assembler, especially for RISC-based machines.

There was also a third, subjective reason -- at the time, Stu Grossman
was writing a simple prototype in C called "KX10".  While I was unable
to use anything from it, it did convince me that C was the right way to
go, and as far as I know Stu was the first to attempt it.

As it turned out, this decision also proved to be the right one in
terms of performance.  I had anticipated that keeping the code portable
would eventually lead to "free" performance gains as hardware improved
over the years, and that was in fact the case.  But one other factor
surprised me: C code was sometimes faster than assembly code!
It took a while to realize that for the new RISC-based machines with
pipelines and caches, new rules applied; often the C compiler provided
with a machine would know about tricks or scheduling rules that were
not only difficult for a human to code by hand, but could also change
from one machine to the next.  While doing timing tests on initial
versions of the emulator, I found performance so sensitive to tiny and
logically unrelated changes that I finally gave up worrying about it.
In retrospect all this may seem obvious, but it was not so at the time.

Instruction Execution Loop
--------------------------

The decision to use C posed some problems with respect to control flow.
The microcode of real machines is like a huge plate of extremely sticky
spaghetti where every "instruction" has a jump address that can go
anywhere else, plus a bunch of chopsticks stirring the mess up with
external interrupts and asynchronous device operations.

By contrast, C favors separately defined functions with a stack-based
calling convention, and its control flow strongly favors the normal
call-return mechanism.  It is possible to bypass the latter with
"longjmp" type invocations, but you can only jump to pre-existing
contexts and the mechanism can be very expensive.  For someone used to
the freedom of assembly language, this is very frustrating.

Without going into all possible ways of implementing a PDP-10 emulator
in C, there appeared to be two main choices:

    - Stuffing as much as possible inside a single huge function.
      Normally this would involve using a large switch() statement for
      instruction dispatch, plus assorted gotos to accommodate weird
      control flow issues.

    - Looping within a small function and dispatching to each
      instruction as a function call.
(Note that the PDP-10 has one characteristic that makes things a little
easier -- every instruction must always calculate an effective address
regardless of whether it is used or not, so this naturally leads to
building an instruction execution loop with the EA calculation as its
central focus.)

The KN10 code uses the latter approach for a number of reasons, some of
which no longer apply.  When it was first written, compilers still
existed that had problems with very large switch statements, and even
now it is still considered easier for them to optimize a function if it
does not contain "goto" statements.  Keeping instructions in separate
routines also made it easier to invoke them explicitly as needed (for
XCT, PXCT, etc.), examine their assembler output to see what was being
produced, and step through them with a debugger.

This is almost certainly not an optimal design for speed.

PDP-10 Word Representation
--------------------------

[XXX - some stuff about the tradeoffs, from word10.h]

FUTURE DIRECTIONS
=================

Since the odds of new commercial applications showing up are
essentially nil, whatever happens next will be entirely up to the
PDP-10 hacker community.  It will either evolve with their help or
quietly go away without it.

Recently a number of other PDP-10 emulators have finally appeared, most
or all of which target the KS10.  I have not yet looked at them (partly
to avoid contamination) and so have no basis for comparison, but in
general I think multiple KS implementations are a good thing, and as
software tools improve, writing them should become easier.  The KS is
a fairly clean yet respectable machine that would make an excellent
semester project, and it was fun to do.  The more people who can get
their hands dirty, the better.

The KL10, by contrast, was much harder, and definitely not fun.  I hope
that with the KLH10 release no one else will have to go through that
nightmare, or at least will be able to start from a far better place.
FUTURE KLH10 IMPROVEMENTS
=========================

The wish-list stack has always been a mile high.  Here's a bunch of
stuff off the top of my head, in no particular order.

  - Multi-processing extensions:
      - Decouple FE from KN10.
      - Allow other user processes to examine KN10 state and manipulate
        devices (especially virtual magtapes) without requiring console
        access.
      - Re-investigate threading (very carefully; very unportable).
  - The UI sucks:
      - Improve present UI, add editing.
      - Allow input scripts to feed OS console at startup.
  - Add GUI interfaces:
      - 340 display (Spacewar lives!)
      - FE/Console (KA10 panel with blinkenlights!)
      - Magtape UI
      - CPU and device visualizations
  - Accuracy improvements:
      - Get rid of the last low-bit FDV divergence, even though that may
        reduce mathematical accuracy.
      - Implement address break (optional - very slow).
  - Raw performance:
      - Develop better performance metrics (TenStones? DECmarks?).
      - Speed up floating point (DFDV is far and away the slowest
        instruction).
      - Alternative word representations.
      - EA calculation and memory mapping (most of the CPU is burned on
        this, I believe).
      - Characterize workload to determine true bottlenecks.
      - Alternative instruction loops:
          - Hybrid switch & dispatch
          - Hand-coded assembler (x86 primarily)
      - Locality improvements; must match to host cache setup.
      - Autoconf mechanism to run tests on host and automatically select
        optimal build configuration.
  - Network support:
      - Add AN10 device, couple with IP tunnels like "tun".
      - Determine OS mods to allow self-connection for systems that
        currently can't do it (Linux, FreeBSD, Solaris).  Get mods
        incorporated in standard releases.
  - Magtape support:
      - Finalize tape naming conventions.
      - Non-console access (cf UI above).
      - Finish in-memory support.
      - Add update capability.
  - Emulator extensions:
      - Finish KA10, KI10 versions.  TENEX pager.  WAITS?
      - Emulate SC-40 and/or XKL (more "physical" memory).
      - Additional devices (many!)
          - DH11 (KS10), or extend DTE (KL10)
          - AN10 (IMP interface)
          - DX20, CI, ...
      - Emulate NCP with IMP interface for older OSes.
  - PDP-10 software:
      - Public ITS support.
      - Further KN10-conditionalized OS mods.
      - Resurrect older KA/KI OSes.
      - GCC on KL (Lars).
  - Distribution:
      - Add HTML versions of doc.
      - Better build mechanism.
      - Permission for Public ITS distribution (legal hoops).
      - Packaged ITS filesystems.
      - Packaged DEC OS filesystems.
      - Packaged NIC/Stanford filesystem.