Comparing CPUs is complex, much more complex than comparing cars, for example. It's easy to understand why. A CPU can execute many hundreds of different instructions, and performance also depends on many other factors such as cache width, memory speed, how many instructions can execute in parallel, etc.

Beating a 68030 is pretty easy, as the 68030 is a relatively bad CPU. Now why is the 68030 bad? The 68030 can only execute 1 instruction at a time. Instructions on the 68030 do not take just 1 cycle but a minimum of 2, and EA calculation costs extra cycles on top. This means the 68030 has many instructions taking 4, 6, or even 10 or more clocks. Depending on the instruction mix, a 50 MHz 68030 can execute at most 25 million instructions per second, but as many instructions take more than 2 clocks, it realistically executes about 10 million instructions per second on average at 50 MHz. The 68030 has very small caches which are often nearly useless. The 68030 has no real branch prediction. The 68030 has no subroutine call acceleration. The 68030 cannot detect memory streams and cannot prefetch memory effectively.

Now the 68060 is a MUCH MUCH better CPU than the 68030. The 68060 is superscalar and can execute up to 2 instructions per clock. Most instructions take only 1 clock on the 68060. The caches of the 68060 are much better than the caches of the 68030. The 68060 has very good branch prediction. The 68060 is clearly the best 68K ever produced by Motorola.

But the 68060 also has some areas which can be improved.
a) The 68060 Icache can only deliver 4 bytes per cycle. While the 68060 is superscalar and could process 2 instructions per clock, this Icache bottleneck limits it very often, as 4 bytes are not enough to feed both pipes.
b) The Icache delivers only 2 bytes per clock for a jump or subroutine with an unfortunate alignment. This means unluckily aligned subroutines can be slower.
c) The Icache cannot prefetch very effectively. This means the performance of programs bigger than the Icache is much lower.
d) The Dcache cannot handle misaligned operations for free. This means data misalignment in memory or on the stack will slow the core down.
e) The Dcache cannot detect memory streams and cannot prefetch effectively. This means highly memory-intensive tasks, e.g. image manipulation, are slow.
f) The 68060 cannot accelerate subroutine returns.
g) The 68060 left out some useful instructions like 64-bit MUL and DIV. These need to be emulated in software.

Apollo is very similar to the 68060, but it addresses and improves all the areas which were not optimal on the 68060.
1) Apollo is superscalar and can execute several instructions per clock. Apollo's superscalarity is stronger than the 68060's, and it can execute more instruction combinations in parallel.
2) The execution time of all 68k instructions is very low, even lower than on the 68060. Most instructions need only 1 clock.
3) The Icache is very strong: it delivers 16 bytes per clock cycle. This is 4 times more than the 68060 Icache.
4) The Icache also delivers 16 bytes at any address, so alignment of subroutine labels is not needed for optimal speed. The Icache can also prefetch very effectively. Apollo can therefore execute huge programs from main memory even faster than the 68060 would execute small programs from its Icache.
5) The Dcache is very strong: per cycle it can read 8 bytes, in parallel write 8 bytes, and in parallel even prefetch 8 bytes. The Dcache detects memory streams and will prefetch automatically. This means performance of memory-intensive games or programs is best in class. Here Apollo even beats gigahertz-clocked PowerPC CPUs.
6) The Dcache supports misaligned reads and writes for free. This means Apollo has optimal performance even with misaligned data or stack.
7) Apollo accelerates subroutine returns.
8) Apollo supports in hardware the useful instructions which were dropped from the 68060. The 64-bit MUL, for example, takes only 2 clock cycles on Apollo, while a 68060 needs to emulate it in many, many cycles.
9) Many bitfield operations take only a single cycle on Apollo, while the 68060 often needed 10 or more. This makes using bitfields really sensible now.
10) Apollo can fuse often-used 68k instruction patterns of 2 instructions into 1. This improves the instructions per cycle. This way Apollo has fewer bubbles in code execution than the 68060 and can sometimes execute 4 instructions per clock.

The 68060 was a pretty good CPU. We are happy that we could improve it in so many aspects.
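For anyone who wants to play with the numbers, here is a rough back-of-the-envelope sketch of the throughput arithmetic used above. It is not a CPU model, just division. The function names are made up for illustration. The 5-cycle average for the 68030 is only what the "about 10 million at 50 MHz" figure implies, the 4-byte average instruction length is an assumption (68k instructions are at least 2 bytes long), and the issue width of 2 is kept the same for both Icache cases purely to isolate the effect of fetch bandwidth.

```python
def mips(clock_mhz, avg_cycles_per_instruction):
    """Millions of instructions per second for a given average cycle cost."""
    return clock_mhz / avg_cycles_per_instruction


def fetch_limited_ipc(fetch_bytes_per_cycle, avg_instruction_bytes, issue_width):
    """Instructions per cycle, capped by both Icache bandwidth and issue width."""
    return min(issue_width, fetch_bytes_per_cycle / avg_instruction_bytes)


# 68030 at 50 MHz: a minimum of 2 clocks per instruction gives the upper bound.
best_case_030 = mips(50, 2)
# Assumption: an average of 5 clocks per instruction, implied by ~10 million/s.
typical_030 = mips(50, 5)

# Assumption: 4-byte average instruction length, issue width of 2 in both cases.
ipc_4_byte_fetch = fetch_limited_ipc(4, 4, 2)    # 68060-style Icache, 4 bytes/cycle
ipc_16_byte_fetch = fetch_limited_ipc(16, 4, 2)  # Apollo-style Icache, 16 bytes/cycle

print(best_case_030, typical_030, ipc_4_byte_fetch, ipc_16_byte_fetch)
```

With these assumed values, the 4-byte fetch can only feed one instruction per cycle, while the 16-byte fetch stops being the limit and the issue width takes over. Change the average instruction length to see how quickly a narrow Icache starves the second pipe.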