
Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.



Performance and Benchmark Results!

Is Vampire Faster Than Classic PPC Cards?

Mallagan Bellator

Posts 393
19 Aug 2017 20:00


Manuel Jesus wrote:

I get 32fps with a gold 2.7 / 3 test core running X11 core

What resolution is that?


Will 'Akiko' G.

Posts 9
29 Jan 2018 18:29


Roman S. wrote:

  Re:FPU performance
  The integer performance sucks on 603e because it has only one integer unit (or so I've heard) and doesn't get the full benefit of its superscalar design until you start alternating between integers and floating point instructions.

True to some degree. The 603e is actually a quite modern design for its age. The CPU has a fully pipelined IU and FPU, even working in an out-of-order manner (6 opcodes in flight), with the ability to dispatch/execute 5 instructions in parallel and retire 3 instructions at once.

Daniel Sevo wrote:

  No, this is a misconception.
  The PowerPC 604 can _NOT_ do 6 instructions per clock.
 
  The maximum number of instructions the PowerPC 604 can issue and retire is 4 per cycle.

This is not entirely true. It is true that the 604e can issue/retire 4 instructions at once, but it can actually dispatch/execute 7 instructions in parallel. It is quite similar to current CPUs like Ryzen, which can dispatch/execute 10 instructions and issue/retire 6 instructions in parallel. The 604e was a quite powerful design with 3 IUs and 1 FPU.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
29 Jan 2018 18:43


Will 'Akiko' G. wrote:

   
Daniel Sevo wrote:

    No, this is a misconception.
    The PowerPC 604 can _NOT_ do 6 instructions per clock.
     
    The maximum number of instructions the PowerPC 604 can issue and retire is 4 per cycle.
   

   
  This is not entirely true. It is true that the 604e can issue/retire 4 instructions at once, but it can actually dispatch/execute 7 instructions in parallel.
 

 
You can describe a CPU with many numbers.
Some have more importance for performance, some less.
So don't get yourself confused with pointless numbers.
 
Let's look at one example:
It is a true fact that APOLLO 68080 can keep in flight / execute up to 32 integer instructions in parallel, and in addition 17 FPU instructions in parallel at the same time.
But what does this number tell you?

It tells you something about the complexity of APOLLO.
But it does not tell you much about the performance.

The meaningful value for real performance
is the number of instructions that can be DECODED, ISSUED and RETIRED per clock.


Roman S.

Posts 149
29 Jan 2018 20:47


Well, I believe mainly in benchmarks - like AIBB (not SysInfo), timed LHA compression, timed MP3 decode, etc. Yes, they are biased by the compiler's ability to support a particular CPU - but this reflects my real-world usage too :)

68k and PPC have different instruction sets. And the "amount of instructions that can be DECODED, ISSUED and RETIRED per clock" is never constant - it depends on which concrete instructions you execute. If a CPU is able to handle up to 64 NOPs in one cycle - well, congratulations, but how does that influence my computer usage?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
29 Jan 2018 21:06


Roman S. wrote:

  Well, I believe mainly in benchmarks - like AIBB (not SysInfo), timed LHA compression, timed MP3 decode, etc. Yes, they are biased by the compiler's ability to support a particular CPU
 

 
It's also very biased by the compiler flags.
If you look at AMINET for example - there are many versions of LAME for MP3.
 
There are HUGE differences between the versions on AMINET.
Some are compiled WITHOUT ANY optimizations and really run at 50% of the speed of versions with optimizations enabled.
 
So these benchmark results can also be extremely misleading.


Mallagan Bellator

Posts 393
30 Jan 2018 00:53


Personally I believe in running a number of select games that are somewhat demanding in different ways, then recording the actual performance of the games.
Say, the fps of Doom and Quake at certain resolutions, and other games like stunt car, and so on... Frontier... then run the same versions on the different systems and see what comes out on top.

That, and as mentioned before, raytracers.


Mr Niding

Posts 459
30 Jan 2018 05:30


With differences in benchmark results, even between different versions of the same program, adding the version numbers of said program to the benchmarks will offset this issue.
It will at least show performance data over several scenarios.
 
When you watch reviews of hardware, they usually throw in 10-20 or more different tests to offset any issues where game X uses the CPU heavily over the GPU, giving odd results vs a GPU-heavy game.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
30 Jan 2018 06:52


Mr Niding wrote:

With differences in benchmark results, even between different versions of the same program, adding the version numbers of said program to the benchmarks will offset this issue.

Let's look at a real-world example.


C code:
a++;


ASM instructions created when compiled with -O2:
addq.l #1,D0


ASM instructions created when compiled with -O0:
move.l val_a,D0
addq.l #1,D0
move.l D0,val_a

As you can clearly see, the number of instructions for the same C code can be 3 times higher if optimizations are turned off.
The binaries on AMINET normally do NOT state in the readme which compile options were used.

Let's say, for example, that you want to compare the performance of a PowerPC CPU with a 68K CPU, and for this you take LAME binaries from AMINET, one executable for PPC and one for 68K.
And you do not know how they were compiled.

Tell me how can you compare the results?
Tell me how much value can these comparisons then have?


Roman S.

Posts 149
30 Jan 2018 07:58


Gunnar von Boehn wrote:

Tell me how can you compare the results?
Tell me how much value can these comparisons then have?

This test will tell you which executable you should use on your hardware :) (if they are both stable and function correctly, of course). Comparing CPUs can be hard... it is always more or less biased.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
30 Jan 2018 11:51


Roman S. wrote:

If a CPU is able to handle up to 64 NOPs in one cycle - well, congratulations, but how does that influence my computer usage?

 
You are correct.
 
Also, some CPU values are more "marketing" numbers than real-world ones, and often misleading. It's common that "peak" values are theoretical
and not reached in real life. The 604 PPC and G4 PPC are very good examples of this problem - in theory they reach much higher values than in real life - the reason is insufficient ICache performance.
For many work loops the ICache and branch overhead reduce the real performance to much below the theoretical maximum.
The 68060 suffers from a similar problem. In theory it can execute 2 instructions per cycle, but in real life the ICache often cannot provide them.
 
 
A relatively good overview is given by MINIBENCH.

The test is multi-platform and allows comparing 68K / x86 / PPC and ARM for different workloads.
The profiles clearly show the limits and strengths of the various CPUs.

Let's look at an example: the PowerPC 5200B.
In theory, 2 instructions per cycle.
But we can clearly see that this is not always reached.

Now let's compare with the PowerPC 750.
Also 2 instructions per cycle, but you can clearly see that this is reached much more often.




Will 'Akiko' G.

Posts 9
31 Jan 2018 10:04


Gunnar von Boehn wrote:

  Let's look at one example:
  It is a true fact that APOLLO 68080 can keep in flight / execute up to 32 integer instructions in parallel, and in addition 17 FPU instructions in parallel at the same time.
  But what does this number tell you?

Uhm, 4 integer pipelines with 8 stages each? One simple FPU pipeline (8 stages; add, sub) and one complex FPU pipeline (9 stages; mul, div)? ;-) Is an FPGA that different from a common CPU? Don't get me wrong here, this is no criticism. I just like the tech talk and the details.

Gunnar von Boehn wrote:

  Also, some CPU values are more "marketing" numbers than real-world ones, and often misleading. It's common that "peak" values are theoretical
  and not reached in real life. The 604 PPC and G4 PPC are very good examples of this problem - in theory they reach much higher values than in real life - the reason is insufficient ICache performance.

The PowerPC 970 (aka G5) was a really bad apple here. The AltiVec unit is capable of processing about 20 GiB/s @ 2.5 GHz, but in the end the maximum achievable RAM bandwidth was about 6.4 GiB/s.

Gunnar von Boehn wrote:

  For many work loops the ICache and branch overhead reduce the real performance to much below the theoretical maximum.
  The 68060 suffers from a similar problem. In theory it can execute 2 instructions per cycle, but in real life the ICache often cannot provide them.

The biggest issue here is that most developers have no clue about cache organization; take the 060 here. It has 8+8 KiB I/DCaches which are 4-way associative; this means your maximum data structure size should not be 8 KiB, it should be 2 KiB, because the cache is sliced into 4 parts. The next thing is false cache sharing, or cacheline poisoning. If interested, I can provide some simple C++11 examples showing how badly performance can be affected by this even on very modern CPUs. There is also quite an impact from misaligned data on some CPUs.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
31 Jan 2018 10:37


Will 'Akiko' G. wrote:

Uhm, 4 integer pipelines with 8 stages each? One simple FPU (8 stages, add, sub) and one complex FPU (9 stages, mul, div) pipeline? ;-) Is a FPGA that different from a common CPU? Don't get me wrong here, this is no critics. I just like the tech-talk and the details.

An FPGA can be anything. You decide what its structure is.
You can design a simple 1-stage CPU, or a modern pipelined superscalar high-end CPU.

APOLLO has 2 EA UNITs and 2 ALUs.
And APOLLO can sometimes/often merge two 68K instructions into one operation.

APOLLOs FPU Hard-Units are
* ADD/SUB/CMP
* MUL
* DIV
* SQRT
Each UNIT is fully pipelined and can do 1 FPU Operation per cycle.

Will 'Akiko' G. wrote:

 
Gunnar von Boehn wrote:

  For many work loops the ICache and branch overhead reduce the real performance to much below the theoretical maximum.
  The 68060 suffers from a similar problem. In theory it can execute 2 instructions per cycle, but in real life the ICache often cannot provide them.
 

 
  The biggest issue here is that most developers have no clue about cache organization; take the 060 here. It has 8+8 KiB I/DCaches which are 4-way associative; this means your maximum data structure size should not be 8 KiB, it should be 2 KiB, because the cache is sliced into 4 parts. The next thing is false cache sharing, or cacheline poisoning. If interested, I can provide some simple C++11 examples showing how badly performance can be affected by this even on very modern CPUs. There is also quite an impact from misaligned data on some CPUs.

I think the biggest weakness of the 68060 is that the ICache can provide only 4 bytes of instructions per clock cycle.
The 68060 has 2 ALUs and could in theory execute 2 instructions per clock. As 68K instructions can be 2, 4, 6, 8 bytes or longer, it's obvious that the 2 ALUs are starved most of the time by the 4-byte ICache bandwidth.

Motorola was aware of this and planned a 68060B with 8-byte ICache bandwidth to fix this.




Will 'Akiko' G.

Posts 9
31 Jan 2018 12:22


Gunnar von Boehn wrote:

APOLLO has 2 EA UNITs and 2 ALUs.
And APOLLO can sometimes/often merge two 68K into one operation.

Is it one complex + one simple EA unit, or are both the same? Is the core also doing the double-word prefetch of the 68000? (Which can drive people nuts who like to write dense self-modifying code.)

Gunnar von Boehn wrote:

  APOLLOs FPU Hard-Units are
  * ADD/SUB/CMP
  * MUL
  * DIV
  * SQRT
  Each UNIT is fully pipelined and can do 1 FPU Operation per cycle.

Can all these units run in parallel, or only some specific ones like ADD+MUL (to get the famous FMA done)? How do you deal with the exception model of this FPU? I mean, I remember the reason Motorola did not pipeline the FPU was the floating-point exception model of the m68k; they had trouble because it got quite complex. Is one op per cycle true for single, double and extended precision?

Gunnar von Boehn wrote:

I think the biggest weakness of the 68060 is that the ICache can provide only 4 bytes of instructions per clock cycle.
The 68060 has 2 ALUs and could in theory execute 2 instructions per clock. As 68K instructions can be 2, 4, 6, 8 bytes or longer, it's obvious that the 2 ALUs are starved most of the time by the 4-byte ICache bandwidth.

Hmm, some of the 68060 parts run at 3x the external clock; do you know which ones? So I guess that is not true for the cache.


Thierry Atheist

Posts 644
31 Jan 2018 13:38


Will 'Akiko' G. wrote:
The biggest issue here is that most developers have no clue about cache organization; take the 060 here. It has 8+8 KiB I/DCaches which are 4-way associative; this means your maximum data structure size should not be 8 KiB, it should be 2 KiB, because the cache is sliced into 4 parts. The next thing is false cache sharing, or cacheline poisoning. If interested, I can provide some simple C++11 examples showing how badly performance can be affected by this even on very modern CPUs. There is also quite an impact from misaligned data on some CPUs.

Hi Will 'Akiko' G.,

Unless they have changed their mind about it, the Vampire 4 is supposed to have 32K instruction and 32K data caches.


Vojin Vidanovic

Posts 770
31 Jan 2018 15:57


Gunnar von Boehn wrote:
 
  According to these reports the Vampire outclasses all existing AMIGA upgrade cards including 603e and 604e PPC CPU cards.

Not enough to outclass all 68k cards? :-)

By design the 603/604e were early PPC experiments,
and while they gained ground, they weren't any real revolution.

In Amigaland, they performed a bit better driving MorphOS or AmigaOS 4.x alone than being a co-processor to the m68k. As I remember, those who purchased 060+603 had a better experience than 040+604, even with a gfx card.

It's more important that the Vampire is a continuation of the M68K architecture,
and it should be compared to the 060 as its last representative.

Other than that, I have no doubt certain AMMX/080/ApolloFPU-optimized software can outclass a 100 MHz 603 (like the mentioned video playback) in purely CPU mode.

Even the nicest AmigaOS video player, Emotion, does not perform so well in software mode and uses the assistance of a Radeon 9xxx or Radeon HD for HD video playback.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
31 Jan 2018 18:34


Thierry Atheist wrote:

  Unless they have changed their mind about it, the Vampire 4 is supposed to have 32K instruction and 32K data caches.

The VAMP-2 has 32 KB DCache already.
The VAMP-4 has 64 KB DCache.


Vojin Vidanovic

Posts 770
31 Jan 2018 19:15


Gunnar von Boehn wrote:

  The VAMP-2 has 32 KB DCache already.
  The VAMP-4 has 64 KB DCache.

A great improvement over the up to 8 KB DCache (4-way associative)
of the 060, and the mere bytes of the 030.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
31 Jan 2018 20:30


Vojin Vidanovic wrote:

  Other then that, I have no doubt certain AMMX/080/ApolloFPU optimized software can outclass an 100Mhz 603 (like mentioned video playback) in purely CPU mode.
 

 
The comparison was not to 100 MHz PowerPC chips but to PowerPC chips @ 240 MHz or 300 MHz.
 
In JPEG speed and VIDEO playback benchmarks, the VAMP-2 runs circles around the AMIGA Phase5 PowerPC cards.
 
 


Vojin Vidanovic

Posts 770
31 Jan 2018 20:51


Gunnar von Boehn wrote:
 
  The comparison was not to 100 MHz PowerPC chips but to PowerPC chips @ 240 MHz or 300 MHz.
 
  In JPEG speed and VIDEO playback benchmarks, the VAMP-2 runs circles around the AMIGA Phase5 PowerPC cards.

Kudos for those overall results from a CPU running below 100 MHz; people tend to forget that.

I suppose the overall good design of the CPU helps, but the MMX/AltiVec type of instructions also aids a lot.

Sadly, I never had enough time to experiment on x86 chips, and on the X1000 side, AltiVec-optimized OS4 software and AltiVec-optimized Linux PPC software are rare.


Thierry Atheist

Posts 644
31 Jan 2018 21:20


Gunnar von Boehn wrote:
Thierry Atheist wrote:
Unless they have changed their mind about it, the Vampire 4 is supposed to have 32K instruction and 32K data caches.

The VAMP-2 has 32 KB DCache already.
The VAMP-4 has 64 KB DCache.

Hi Gunnar von Boehn,

Sorry. Never seen that mentioned before.

Well then.

An assembler and C compiler that are 68080-aware...
AOS3.x recoded for *strictly* 68080 use....

With enhanced S-AGA capabilities taken into account........

I have NO WORDS to express myself as to how _______ _________ such a system would operate!

NO computer/operating system pairing is better than that!!!!

(I find multi-user, memory protection and various other "features" of so called "modern os's" (highly) undesirable.)
