--------------------------------------------------------------------------
                              A FAIR FIGHT
--------------------------------------------------------------------------

Over the last couple of months, I have heard a lot about the Falcon 030 being compared to other systems. I decided to write this not because I want to make one system look better than another, but because I want the Falcon to be on an equal playing field when others make their comparisons. The following information shows much of the hidden power inside the Falcon that is usually left out of those comparisons.

Even with these benchmarks, there are still advantages to designing a system one way or another. The following information does not try to argue that one type of design is better; it just tries to lay down the facts on benchmarking and performance. Although benchmarking is, in the end, not the most effective way to determine a system's speed, I thought the benchmarks AS IS deserved mentioning.

The Main System

Most of the comparisons I have heard revolve around the single CPU. Since most systems contain only one real processor, it is a natural instinct to compare only the CPUs and ignore everything else. The CPU included with the Falcon is not the fastest CPU by any means, but it is a competent CPU and runs at a respectable speed for general processing.

Before we begin benchmarking, I would like to show how a programmer might design a program. Most programmers write the 'bulk' of their program in 'C.' For that bulk they are not concerned with speed, and they usually write code that is not the fastest for the computer to run, but the easiest for them to understand while working on and fixing it. For this kind of code, optimizing for speed will not show any noticeable benefit. Where speed matters is in repetitive code: parts of the program that must run over and over in a loop of some sort benefit greatly from optimization.
To give an example, take raytracing. Most of a raytracing program just handles windowing, user input, and various other tasks. Where speed counts most is in the raytracing itself, which can take hours. Making that routine twice as fast can cut the waiting time nearly in half. If the raytracing program were 300k in size, maybe 1k of that code would affect raytracing time; the rest would be general maintenance and raytracing preparation.

Multiprocessing Systems

Computer designers know that most code does not NEED the speed, and some design their systems around that fact. That does not make such a system better; it is just a different way of doing things. The Falcon 030 is such a system, consisting of a CPU, a BLiTTER, and a DSP in the same housing. (The benchmarks below do not include DMA.) The CPU was designed for general tasks. The BLiTTER is a high-speed memory-moving processor that can move memory much faster than the CPU. The DSP is a high-speed number-crunching processor that can handle intensive math operations, like those in our raytracing illustration. Alone, each processor would not make an effective computer because of its limitations, but together they complement each other, each filling a hole in the performance of the others.

Multiplication Benchmark

Here is a multiplication comparison of 5 processors, 2 of which are in the Falcon (the 68030 and the 56001 DSP).

Processor   Speed (MHz)     Clocks per Multiply   Operand Size   Multiplies per Second
68000       8               74                    16x16=32       108,000 appx.
68030       16              28                    16x16=32       571,500 appx.
80486       66 (internal)   20 (average)          16x16=16       3,300,000 appx.
ATT3210     55              4                     unknown        13,750,000 appx.
56001DSP    32              2                     24x24=48       16,000,000 appx.

As you can see, the Falcon's CPU alone is not lightning fast by any means. However, with the DSP running at the same time, the potential for doing many math operations at once is staggering. Keep in mind that the DSP's multiplication figures are theoretical, and that the DSP can run several operations in parallel.
Depending on whether a programmer optimizes the code to take advantage of the DSP's parallel processing, the DSP could run even faster in comparison. In the DSP User Manual, Motorola states that a 24x24-bit multiply, a 56-bit addition, two data moves, and two address-pointer updates can all be executed in a single instruction cycle (2 clocks). I have also included the Mac 660AV, which also has a DSP chip, the ATT3210.

Binary Move Benchmarks

These benchmarks compare the number of bytes per second that can be moved from a source address to a destination address. There are two comparisons. The first is a flat memory move, such as moving a 200k chunk of memory to another place.

Processor       Bus Speed     Type of Memory                            Bytes per Second
68000 8mhz      8mhz          ST RAM-16bit                              2 Megs appx.
68030 16mhz     16mhz         ST RAM-16bit                              4 Megs appx.
BLiTTER 16mhz   16mhz (HOG)   ST RAM-16bit                              12 Megs appx.
80486 66DX2     33mhz         conventional-32bit                        24 Megs appx.
                              protected mode-32bit (using 32bit reg)    48 Megs appx.

The second is a screen move, such as moving a window from one place on the screen to another. Because memory is linear (one-dimensional), extra steps need to be taken for the window to move correctly in an X-Y coordinate system (two-dimensional). These are replace-mode blits; transparent-mode blits would produce different results.

Processor       Bus Speed     Type of Memory                            Bytes per Second
68000 8mhz      8mhz          ST RAM-16bit                              1 Meg appx.
68030 16mhz     16mhz         ST RAM-16bit                              2.6 Megs appx.
BLiTTER 16mhz   16mhz (HOG)   ST RAM-16bit                              12 Megs appx.
80486 66DX2     33mhz         conventional-32bit                        13 Megs appx.
                              protected mode-32bit (using 32bit reg)    26 Megs appx.

The Falcon's CPU again is not impressive, but the BLiTTER more than makes up for it when accessing RAM, especially for graphics bit-blits to the screen such as mouse pointers, windows, and other graphic displays. The 486 processor has a 33mhz bus speed and a 66mhz internal clock; this clock doubling is what the '2' in 'DX2' means.
When working with internal math, the 486-66DX2 is twice as fast as the 486-33DX, but when accessing RAM there is no difference between the two, since both use a 33mhz bus.

Bus Bandwidth

This is the last comparison I will show. Bus bandwidth is becoming a popular way to express the 'possible' speed limit of information traveling to and from the bus. Here is a comparison of a few systems' bandwidth. I have included the DSP's bus bandwidth, since the DSP has its own SRAM and is just about a computer by itself.

System         Bus Width   # of Data Buses   Bus Speed   Bus Bandwidth
Atari ST       16-bit      1                 8mhz        16 megs/sec.
Falcon 030     16-bit      1                 16mhz       32 megs/sec.
Jaguar         64-bit      1                 13.5mhz     108 megs/sec.
80486-66DX2    32-bit      1                 33mhz       132 megs/sec.
56001DSP       24-bit      4                 32mhz       384 megs/sec.

This is a theoretical speed limit: the maximum possible number of bytes per second that can move through the bus. Systems do not actually run at this limit; it is just what is possible. Some systems can come closer to it than others, and the Falcon is one that can operate close to its limit. I have not tested other systems or learned their hardware limits well enough to comment on them.

The Local-Bus Solution

The Falcon splits its bus bandwidth between CPU/BLiTTER/DSP/DMA access and the video display system. The Falcon's video is 32 bits wide and runs at 16mhz. The split is made every other clock, so 8,000,000 clocks per second go to the video and 8,000,000 go everywhere else. This gives the Falcon fast access to video. It is similar to the 486-66DX2 local-bus system, which also has 32-bit access, except at 33mhz rather than 16mhz.

Conclusion

These benchmarks were not meant in any way to be critical of one system over another, but to show the differences in design. They do not even begin to explain the differences in TRUE performance or speed; they are just an indication for these systems, and others. This was written only to place the Falcon on a more even playing field with other systems.
The true test comes when you compare the machines for your own needs or wants.