P6 Rocks! – Jamel Tayeb

Today, I decided to give you a break from my Apple 2 pimping post series. Instead, I will share my interactions with a great processor over the years. I am talking about the excellent Pentium Pro processor. Of course, there is some news associated with this post; otherwise, why would I waste our time, right? But let’s start at the beginning. In 1995, Intel Corporation launched the Pentium Pro and its P6 microarchitecture (discontinued in 1998). I fell for this CPU because it is, by design, ready for Symmetric Multiprocessing (SMP) as-is. Out of the tray, an OEM can build a 4-way SMP computer using the PPro without the need for extra glue logic or other bus agents. This means that the same chip can be the heart of a cheap SMP computer as well as a massive supercomputer. In 1997, I was working on my Diplôme d’Etudes Approfondies (DEA) in computer science in an AI laboratory.

You can consider the French DEA degree as the first year of doctoral studies, designed to be an introduction to research in a specific field. I had also familiarized myself with multithreaded programming a few years earlier using Windows NT 3.1 and Solaris, and I wanted to use an SMP system for my research. As a student working most of the time during the day – you have to pay the bills somehow – I wanted the comfort of working from my rental room during the weekend and late at night after catching the last train. Luxurious! A few lines above, I wrote, “heart of a cheap SMP computer.” Everything is relative, and to me, cheap was anything but affordable. I broke the piggy bank, took a payday advance – I remember my bank telling me, “we would not bet a kopeck on you” – called in all the favors I could, and bought the parts to build my dual-processor workstation. I picked an Intel PR-440FX motherboard, two Pentium Pro processors of the same stepping (@150 MHz, 256KB L2 cache), EDO memory, and an OpenGL graphics card.

For the missing parts, I cannibalized my previous computers (mainly an i486DX-2 66 based box). Et voilà! This workstation served me very well. In particular, I remember working on a parallel 3D modeler in a uniformly sized voxel space. With my buddy Vincent, we came up with a 2D discrete segment drawing approach beating all significant algorithms of the late ’90s. I then applied the methodology to fundamental 3D graphic primitives. I recall visiting Digital Equipment Corporation’s offices in Paris to run benchmarks of my code on an 8-way Alpha 21064A-based server – they named it Behemoth. When I joined Intel a few years later, I donated my workstation’s guts to a Spanish journalist friend who desperately wanted to have one. I like the idea that this computer served many others for many more years.

And now to our post. I had to have one of these workstations in my collection, and as soon as I could, I grabbed a mobo + CPUs for cheap. Because I like multi-purpose projects, I am planning to use this workstation for another special project. But for now, let’s build a system with regular use in mind, and not just to store it in my hall of fame. One item I don’t miss from my original system, though, is the awful, ginormous, full-sized beige ATX tower. So, for my new dual-PPro system, I bought a Thermaltake Core G3 Black Slim case. The Core G3 Black Slim is the smallest new case on the market that can accommodate the big PR-440FX board. Once the board is installed, barely a few millimeters of clearance remain in the case! The cherry on the cake: a left-facing window so I can gaze at those two processors working in unison. Unfortunately, with such a small case come a few constraints.

First constraint: you must use an SFX-sized power supply. I picked the fully modular EVGA SuperNOVA 450 GM, 80+ @ 450W. And since the PR-440FX is an old board, I also needed a 24-to-20-pin power adapter. Second constraint: you cannot install your expansion cards vertically in the case as usual. Instead, you have to bring them somehow into a horizontal position. And there is room for only two of them. Although the case comes with a beautiful PCIe riser cable for modern GPUs, I had to use an aftermarket PCI riser cable for my PCI Diamond Stealth64 2001 series graphics card (S3 Trio64V+ chip). For my second project with this computer, I will use a 16-bit ISA adapter. Unfortunately, there is no full-length ISA riser cable available on the market today. It is an extra burden I will have to deal with, but I think I have a few options to explore.

Although the PR-440FX uses an Adaptec AIC-7880 controller to provide SCSI support, I went for an IDE to SD adapter for storage (SD SDHC MMC to 3.5″ 40-pin male adapter). This is a convenient solution, as we will see later. There are a few other good reasons to use this motherboard: it is rock-solid, cleanly designed, and feature-packed (10/100 Mbps LAN, SB-compatible audio, telephony, etc.). The cable management is a bit tricky because of the lack of space, but in the end, I am pleased with my setup. Note to future PPro system builders out there: the gold-top version of the chip has ~0.3g of gold that can be extracted – using a toxic process. Therefore, these chips are bought up in bulk, which in turn raises their price. Now, if the cost is not an issue, you could go after the Pentium II OverDrive or the 1MB cache black-top versions instead. These are faster chips.

To the OS now. Of course, I could run many versions of various operating systems on this workstation. I will likely settle for Windows NT 4.0 Workstation (or maybe 3.51 out of pure nostalgia). But for now, I installed the exact same software load I described here: MS-DOS 6.22, Windows for Workgroups 3.11, and a bunch of development tools and retro games. Yummy! Here’s a tip you may not know. It is straightforward to transfer a virtual hard drive image onto a physical drive.

For example, I used the Vmdk2Phys utility you can find here. It takes a VMDK image and writes it into a physical partition. Simple and straightforward. Used jointly with an IDE to SD adapter, it lets me boot almost any OS in record time. First, I build a system image using a VM on my PC; second, I transfer it to an SD card. Neat and efficient.
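To make the idea concrete, here is a minimal Python sketch of the "write an image onto a drive" step. It is not how Vmdk2Phys works internally. It assumes the image is already a raw, flat byte-for-byte copy of the disk (a sparse or split VMDK would need converting first), and the function name `write_image` and its parameters are my own, purely for illustration.

```python
import os


def write_image(image_path, target_path, chunk_size=1024 * 1024):
    """Copy a raw disk image byte-for-byte onto an existing target.

    Assumption: image_path is a flat/raw image, not a sparse VMDK.
    The target must already exist (a device node or a file) and is
    opened in place, so it is never truncated or recreated.
    Returns the number of bytes written.
    """
    written = 0
    with open(image_path, "rb") as src, open(target_path, "r+b") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of image reached
            dst.write(chunk)
            written += len(chunk)
        # Push buffered data to the device before returning, so the
        # card can be removed safely afterwards.
        dst.flush()
        os.fsync(dst.fileno())
    return written


# Hypothetical usage (paths are placeholders; writing to the wrong
# device destroys its contents, so double-check the target first):
# write_image("dos622.img", "/dev/sdX")
```

Writing to a real device usually requires administrator rights, and some operating systems insist on sector-aligned writes, which this sketch does not enforce.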

Before concluding today’s post, just a few notes on the fantastic P6 microarchitecture of the Pentium Pro processor (5.5 million transistors @ 0.6µ process). As mentioned in the introduction, this processor is 4-way SMP ready out of the box. It was the first Intel CPU to integrate its unified L2 cache in the processor package, which dramatically improved performance. To me, the most exciting innovation it brought was out-of-order, speculative execution. Along with the improved branch prediction and data flow analysis, OOO constitutes what is called Dynamic Execution. To allow higher clock rates, the designers extended the pipeline length (12 stages) and reduced the clocks spent on average in each stage by ~33%. This approach was pushed to its limits many years later with the Pentium 4 processor’s NetBurst microarchitecture.

With the PPro, independent instructions could be picked from a pool by the principal pipeline agents: fetch, execute, and retire. The concept is simple. Instead of executing the instructions in program order (in-order), the CPU looks for independent instructions – I should say micro-operations (µops), generated by the three parallel decoders from the machine language instructions – that it can execute simultaneously. Once completed, during the retirement stage, the results of the µops update the architectural state of the processor (think of it as your ISA registers). To facilitate the parallel execution of µops, the processor has a bank of 40 internal general-purpose registers. The search for independent instructions, or look-ahead of the instruction pointer (IP, pointing to the next instruction to execute), spans on average 20 to 30 instructions. If the CPU predicts branches correctly, programs benefit from a considerable speed-up. Of course, if a branch is mispredicted and the pipeline stalls, we pay a significant penalty (for flushing the pipe, re-filling it, etc.). Another cool feature of the PPro is its integrated FPU.

This brings us to the execution stage – where the various execution units are leveraged to express parallelism – during which the Reservation Station (RS) assigns µops from the decoupling pool to the five ports available, feeding six parallel Execution Units (EUs). Port 0 feeds into an INT or FP EU, port 1 into an INT or Jump EU, port 2 into the Address Generation Unit (AGU) for LOAD, and ports 3 & 4 into the STORE AGU. Although the RS can schedule up to five µops per clock, the EUs may be busy computing, in which case the µops are queued up. The magic consists of feeding this complex mechanism with a continuous stream of independent instructions! For this task, the compiler is our best ally, so we, as developers, can focus on tuning the memory accesses. It is a game that any programmer of a Very Long Instruction Word (VLIW) chip knows to be non-trivial.
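To illustrate why picking independent µops from a pool pays off, here is a toy Python model comparing in-order and out-of-order issue. It is a deliberately simplified sketch, not a model of the real P6: the `Uop` tuple, the `simulate` function, the latencies, and the issue width of 3 are all assumptions of mine. It also assumes each register is written only once (as if renaming had already happened) and that every source is either an input or produced by an earlier µop.

```python
from collections import namedtuple

# name: label, dest: register written, srcs: registers read,
# latency: cycles until the result is available (all hypothetical).
Uop = namedtuple("Uop", "name dest srcs latency")


def simulate(uops, width, out_of_order):
    """Return the cycle in which the last result becomes available."""
    # Registers produced by the program are unavailable until issued;
    # any other register is treated as an input, ready at cycle 0.
    ready_at = {u.dest: float("inf") for u in uops}
    pending = list(uops)  # not yet issued, kept in program order
    cycle = 0
    finish = 0
    while pending:
        issued = []
        for uop in pending:
            if len(issued) == width:
                break  # issue bandwidth exhausted for this cycle
            if all(ready_at.get(r, 0) <= cycle for r in uop.srcs):
                issued.append(uop)
            elif not out_of_order:
                break  # in-order: a stalled µop blocks all younger ones
        for uop in issued:
            pending.remove(uop)
            ready_at[uop.dest] = cycle + uop.latency
            finish = max(finish, ready_at[uop.dest])
        cycle += 1
    return finish


program = [
    Uop("load a", "r1", (), 3),
    Uop("add b", "r2", ("r1",), 1),
    Uop("load c", "r3", (), 3),
    Uop("add d", "r4", ("r3",), 1),
    Uop("add e", "r5", ("r2", "r4"), 1),
]

print("in-order    :", simulate(program, 3, out_of_order=False))  # 8
print("out-of-order:", simulate(program, 3, out_of_order=True))   # 5
```

In this toy program, the second load does not depend on the first add, so the out-of-order model starts it right away and overlaps the two load latencies, while the in-order model leaves it waiting behind the stalled add.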

There are many other significant and remarkable aspects to the PPro, but I cannot cover them here. I attached several articles from IT magazines of the era in English and French, as well as a plethora of documents and photos. I am sure you will enjoy them. Have a great weekend!

Note: Third-party trademarks are the property of their respective owners.


3 thoughts on “P6 Rocks!”

chris meredith says:

August 17, 2020 at 2:51 pm

I had a P6 dual system in the early 90’s. It showed its speed (somewhat) when compiling. Its real speed was in shutting windows down. System5 v4 didn’t care for it
