P6: The Next Step?

By Sebastian Rupley and John Clyman

The successor to the Pentium may not be what you expected. For top performance, you need 32-bit applications and a fully 32-bit operating system--even Windows 95 won't do.

If you're a serious power user, talk of a new generation of x86 processors probably elicits an almost Pavlovian drool of excitement. And Intel's P6, the next-generation successor to the Pentium, certainly looks like a technological masterpiece in the making.

But if you were expecting the P6 to double the performance of the Pentium, hold on for a disappointment. In a best-case scenario, we expect the first P6 systems will run typical business applications 40 to 60 percent faster than Pentium PCs shipping at the same time. That may not sound bad, but it's based on a big assumption: Namely, that you are running 32-bit software on a fully 32-bit operating system such as Microsoft Windows NT or OS/2.

With 32-bit applications under Microsoft Windows 95, we expect the P6 to outperform fast Pentiums by only about 20 to 30 percent. And Intel acknowledges that with 16-bit applications under Microsoft Windows 3.1, the P6 is likely to underperform the Pentium. The P6 may earn the dubious distinction of being the first x86 processor to run slower than its predecessors on existing code.

Given that the P6 contains nearly twice the number of transistors as the Pentium, it will be substantially more expensive to manufacture. So when the P6 is introduced this fall, Intel will be forced into a difficult positioning battle, both against the company's own existing products and against growing competition from other x86-processor manufacturers.

Intel's strategy is therefore to market the P6 not as a direct successor to the Pentium, but as a high-end--almost workstation-class--alternative. It's not unusual for new processors to be positioned for high-end desktops and servers because of their initial high cost and limited availability.
But this time, Intel had little alternative: The same design that is intended to accelerate the P6's performance running 32-bit code turned out to be a severe hindrance when running existing 16-bit code. (The vast bulk of applications shipping today are written in 16-bit code, although 32-bit applications are likely to become more commonplace following the shipment of Windows 95.)

The P6 does look especially promising for two markets: engineering and graphics workstations, where its relatively strong floating-point performance is a benefit, and servers.

A RISC-Y HERITAGE

In its quest for high performance, the P6 raids the RISC bag of tricks. The P6 is not only aggressively superscalar--capable of processing as many as three x86 instructions in a single clock cycle rather than the Pentium's two--but is also superpipelined, meaning that the P6's pipelines are deeper to permit higher clock speeds. Intel says the superpipelined design will allow the P6 to achieve clock speeds one-third faster than those of the Pentium on the same fabrication process. As it happens, superpipelining is also one source of the P6's performance problems with 16-bit code; see the sidebar "Why the P6 Prefers NT."

The P6 will initially run at clock speeds of 133 and 150 MHz, and 200-MHz-plus versions are likely. (By contrast, Intel promises 150-MHz Pentiums later this year and expects the processor to top out at 180 to 190 MHz.) The P6 communicates with the external world at various fractions of its internal clock speed, yielding system buses that run at anywhere from 50 to 66 MHz. PCI buses will run at half that speed.

Initially, the P6 CPU will be fabricated on the same 0.6-micron BiCMOS process used to build most Pentiums. By next year, Intel will move to a 0.35-micron process (currently being used for some 120-MHz and all 133-MHz Pentiums), which will allow for higher clock speeds.
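
The clock relationships described above come down to simple arithmetic, and a short sketch can make them concrete. The Python below is illustrative only. The core speeds (133, 150, and 200 MHz), the 50-to-66-MHz bus window, the half-speed PCI bus, and the one-third clock advantage all come from the text; the specific bus-to-core ratios and the names `ASSUMED_RATIOS`, `bus_options`, and `p6_clock_estimate` are assumptions made for the example, not Intel's published settings.

```python
from fractions import Fraction

# Assumed bus-to-core ratios. The article says only "various fractions";
# these particular values are illustrative, not Intel's published settings.
ASSUMED_RATIOS = (Fraction(1, 2), Fraction(2, 5), Fraction(1, 3), Fraction(1, 4))


def bus_options(core_mhz, ratios=ASSUMED_RATIOS, lo=50, hi=67):
    """Return (ratio, bus MHz, PCI MHz) for ratios that land in the bus window.

    The article gives the window as 50 to 66 MHz; hi=67 is an assumed
    allowance so that nominal "66 MHz" buses (66.5 or 66.67) still pass.
    """
    results = []
    for ratio in ratios:
        bus = core_mhz * ratio      # exact rational arithmetic
        if lo <= bus <= hi:
            pci = bus / 2           # article: PCI runs at half the bus speed
            results.append((ratio, float(bus), float(pci)))
    return results


def p6_clock_estimate(pentium_mhz):
    """Apply the article's claim: one-third faster on the same process."""
    return float(pentium_mhz * Fraction(4, 3))


if __name__ == "__main__":
    for core in (133, 150, 200):    # core speeds named in the article
        for ratio, bus, pci in bus_options(core):
            print(f"{core} MHz core x {ratio} -> bus {bus:.1f} MHz, PCI {pci:.1f} MHz")
```

Under these assumed ratios, each of the three core speeds has at least one setting that puts the system bus inside the quoted window, with the PCI bus at half that figure. The second function simply scales a Pentium clock by four-thirds, which is how a 150-MHz Pentium design point maps to a 200-MHz P6 on the same process.
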
The P6 complements its superpipelined design with an out-of-order execution engine that allows it to shuffle the order in which instructions are executed. By setting aside instructions that cannot be executed immediately and processing subsequent instructions that can, the P6 is able to avoid some of the conditions that would stall a processor such as the Pentium, whose pipelines operate in strict lockstep. Of course, the P6 ultimately ensures that all instructions are completed in a way that produces the correct results. (See the following story, "Inside the P6," for details.)

Internally, the P6 converts x86 instructions into RISC-like operations that Intel calls micro-ops; this helps simplify processing of the notoriously convoluted x86 instruction set. (AMD's forthcoming K5 processor and NexGen's Nx586 each use a similar technique.) The P6 employs register renaming to help enable out-of-order execution and to work around another classic x86 bottleneck: the limited number of architectural registers defined in the instruction set.

The P6 is also unique among mass-market processors in that it actually contains two chips in a single package: the CPU proper and either 256K or 512K of high-speed L2 cache memory. (See the sidebar "The P6 Bus: Optimized for Speed, Multiprocessing.") Intel refers to the P6 package as a dual-cavity design; other processor manufacturers sometimes refer to a similar implementation as a multichip module.

DESIGN IMPLICATIONS

Aside from its goal of improving performance, the P6's integrated L2 cache has enormous implications for system designs. In Pentium-class systems, the design of memory subsystems--which include the processor, L2 cache, and main memory--has become a major differentiating factor. High-end systems pair 120- or 133-MHz Pentium processors with 256K or 512K of pipelined synchronous-burst cache and EDO (extended data out) DRAM. Mainstream machines use less expensive asynchronous SRAM and standard fast-page mode DRAM.
Some manufacturers may leave out L2 cache altogether to achieve entry-level pricing.

In a P6 system, choosing a memory subsystem essentially becomes a matter of selecting a 256K or 512K version of the P6 processor; manufacturers no longer can make the same kind of price/performance trade-offs as in the past. Even the option of using EDO DRAM won't be available initially, because Intel's P6 PCIset--the core logic formerly known as Orion that ties the processor to memory and I/O buses--does not support EDO memory or its successor, burst EDO. Third-party core-logic manufacturers such as OPTi, SIS, and VLSI are unlikely to offer alternative chip sets until sometime in 1996.

Intel manufactures the L2-cache chip as well, citing the unavailability of suitable parts from external sources. Virtually all P6 systems that ship this year will use not just Intel processors, Intel cache, and Intel chip sets; they will contain Intel-designed motherboards as well. In fact, one-third to one-half of systems' value will likely come from Intel-supplied components.

First-tier system vendors, particularly Compaq, have taken pains in the past to differentiate their machines with custom-designed system architectures. But designing and testing custom architectures is resource-intensive and time-consuming; few vendors will be willing to sacrifice sales immediately following the P6's introduction. These time-to-market constraints will compel even the traditional holdouts to use Intel-supplied motherboards through the end of this year.

What can you expect to find inside a P6 machine? In rough terms, the components will be similar to those in a high-end Pentium system: lots of RAM, big hard disks, fast graphics cards. P6 systems will aggressively embrace PCI; in many cases they will employ PCI-bridge technology to chain together two PCI buses and provide six PCI slots rather than the three to four slots available in most systems today.
ISA slots will still be available for compatibility with older or slower peripherals.

How much will P6 machines cost? The processor's 5.5 million transistors in the CPU chip alone--nearly double the number in the Pentium--will make it large and therefore expensive to manufacture. The integrated L2 cache (which contains another 15.5 million transistors for the 256K version or 31 million transistors for the 512K version) is expensive as well. We expect volume pricing for the P6 to fall somewhere from $1,200 to $1,600 per chip in the initial versions. By contrast, 133-MHz Pentium processors sold for $935 apiece in lots of 1,000 prior to August 1 price cuts, and 256K of SRAM averaged around $100.

At a system level, that translates into prices of about $5,000 for systems with 16MB of RAM, 2GB hard disks, and 17-inch monitors. Stripped-down P6 systems may appear priced as low as $3,000, but given the applications that the P6 targets, we don't recommend skimping.

You might wonder whether Intel would consider moving the L2 cache outboard to allow lower-cost (and lower-performance) versions of the processor. Intel says that is unlikely, that moving the cache off-chip imposes too much of a performance penalty. And given the continuing speed improvements in the Pentium-processor family, it's not clear that the price/performance ratio of a P6 machine with external L2 cache would be compelling compared with the alternatives. Still, it could happen.

SERVERS: A BOLD STEP FORWARD

The P6 does represent a tremendous opportunity for servers, particularly the CPU-hungry application servers often used to run SQL databases and the like. That's not only because of the processor's raw computational ability, but also because of its high-performance bus.

Although multiprocessing Pentium servers are feasible, the P6 makes larger-scale multiprocessing designs much more straightforward. To begin with, the P6 supports glueless four-way symmetric multiprocessing.
Glueless means that no additional logic is required to support as many as four CPUs; they already contain all the circuitry required, and--in theory--building a multiprocessing P6 server is as simple as wiring together the pins on the multiple CPUs. The reality is more complex, of course, because properly balancing electrical loads and providing the copious airflow required to keep multiple P6s cool is no small task. But the bottom line is that you can expect to see a proliferation of multiprocessing servers based on the P6 around the end of the year.

From a performance perspective, the P6's approach to multiprocessing has a lot to recommend it. The processors' individual caches should be more efficient than the shared-cache design used in most dual-processing Pentium servers. And thanks to the P6's transaction-oriented bus, clustering of multiple server boxes is supported rather easily. Networks of multiprocessing P6 servers connected with a high-speed data link such as ATM or Fibre Channel could be constructed to provide high-end application-server platforms.

SOFTWARE IMPLICATIONS

The link crucial to the P6's success is actually not hardware but software. Even when the Pentium was first introduced, Intel implored software developers to recompile their applications--that is, take existing program code and retranslate it into machine language in ways that could exploit the superscalar capabilities of the Pentium without sacrificing backward compatibility. Few developers ever bothered to recompile. Still, that did little to slow the acceptance of the Pentium, which provided a substantial performance boost over the 486 nevertheless.

With the P6, the story is somewhat different: The bulk of today's existing 16-bit software simply won't run faster on the P6 than on the Pentium, according to Intel. And the needed software conversion is far more involved than simply passing it through a new compiler.
To exploit the P6 fully, current 16-bit applications and OSs require a significant rewrite.

What developers need to do is write 32-bit code rather than the 16-bit code that has dominated since the early 1980s. When introduced ten years ago, the 386 was the first x86 processor to support 32-bit code, which manipulates quantities four bytes at a time rather than two bytes at a time and overcomes many of the addressing limitations of early x86 processors. (The P6, by the way, is still a 32-bit processor, even though it communicates with the external world using a 64-bit data bus, as does the Pentium.) But demands for backward compatibility and the lack of a mainstream 32-bit operating system have delayed the development of 32-bit programs, except in specialized fields that demand every bit of performance from CPUs.

The impending release of Windows 95, which fully supports 32-bit applications although it contains substantial 16-bit code itself, is likely to provide the impetus that software developers need. Already, Microsoft's own best-selling Word for Windows and Excel applications have been ported to 32-bit code to run under Windows NT. Most P6 systems will ship with Windows 95 or Windows NT and a 32-bit office suite already loaded. Still, many users will have substantial investments in key 16-bit applications that they may be reluctant to upgrade.

The most likely users of P6 machines will not be running office-automation software so much as high-end CAD, graphics, scientific-modeling, financial, statistical, and neural-networking applications. And in these categories, the P6's superior floating-point performance should be a plus.

Aside from these application categories, the place Intel sees P6 truly advancing the state of the art in software is in multimedia. The P6 represents a crucial milestone in Intel's NSP (native signal processing) effort.
NSP is a reference platform that consists of chip sets, drivers, and a real-time OS kernel and is designed to help provide multimedia functionality on the CPU without dedicated multimedia hardware. NSP doesn't spell the elimination of dedicated multimedia subsystems, but the marriage of the P6, the P6 PCIset, and Intel's NSP APIs (application programming interfaces) could substantially raise the minimum standard for natively processed multimedia.

The P6 does contain circuitry in its floating-point unit that is designed to accelerate so-called multiply-accumulate sequences, which are frequently employed in signal-processing apps. Intel cites applications including speech recognition, MPEG video decoding, video and sound mixing, and telephony as chief candidates for native signal processing on a P6. These apps push the limits of Pentium CPUs without additional hardware.

THE VALIDATION CHALLENGE

One of the biggest challenges in developing a processor is simply ensuring that it works properly. And the pressure on Intel is particularly great given the fiasco that surrounded last year's discovery of a bug in the Pentium's floating-point unit. As a result, Intel has stepped up an already-intensive validation and debugging process. In addition to testing trillions of instruction sequences on hundreds of machines, the company has instituted a user-test program that seeds early P6 systems to users of esoteric applications. Among the testers are specialists such as Dr. Thomas Nicely, the mathematician who first discovered the Pentium's flaw.

Even given the unprecedented degree of compatibility testing that Intel is performing, a device as complex as the P6 CPU is bound to contain some minor latent flaws. Should a bug be discovered after the shipment of the processor, the company pledges to document it in full immediately.

MOVING THE MARKET?

Does the introduction of the P6 spell the end of the Pentium? Hardly.
We expect the P6 will initially receive a lukewarm reception because of its price premium and its affinity for still-scarce 32-bit software. Pentium-based PCs will clearly remain the best choice for most users--especially as Intel continues to churn out faster Pentium variants.

The P6 will gain toeholds in the server market and in workstation-class PCs, where its advantages are clearer and price sensitivity is lower. When Intel moves the P6 to a 0.35-micron process in 1996, pushing clock speeds to 166 or 180 MHz, the P6 may become more viable as a mainstream (albeit still high-end) processor. Acceptance of the P6 may take time, particularly because of its affinity for Windows NT, OS/2, and Unix, none of which have a dominant market share on the desktop.

Looking ahead a few years, we do expect that the P6 will ultimately displace the Pentium--just as the Pentium has surpassed the 486. The key will be the broad availability of 32-bit applications and mainstream 32-bit operating systems. The first 12 to 18 months of the P6's availability will be devoted to ramping up production and increasing clock speeds. Stay tuned for comparisons of P6 systems as they become available in the next few months.

Our Contributors: SEBASTIAN RUPLEY is a senior editor and JOHN CLYMAN is an associate editor at PC Magazine. NICK STAM is technical director for hardware at PC Magazine Labs.