Information about the Apollo CPU and FPU.

Comparing 68060 FPU and 68080 FPU

Gunnar von Boehn
(Apollo Team Member)
Posts 3350
16 Jan 2018 17:19



The 68080 FPU supports the same instructions as the 68060 FPU.
But the 68080 FPU is a new, modern design.

The major difference between the 68060 and 68080 FPUs is that the 68080 FPU can start a new FPU instruction even while a previous one is still running.

Let's make a comparison:

68060 FPU
An FDIV instruction needs 35 clock cycles.
After launching an FDIV instruction, the FPU will
not accept any new FPU instruction until the previous one has finished.

68080 FPU
A new FDIV instruction, or any other FPU instruction, can be issued every clock cycle.

With code that utilizes this, the 68080 can execute 35 times more FDIV FLOPS than the 68060 at the same clock rate.

Another example:
FMUL.L D0,Fp0  - 5 cycles on the 68060

The 68080 can launch this operation every cycle.
So with code using this, the 68080 FPU will reach 5 times the FMUL FLOPS of the 68060.
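
As a rough illustration of the arithmetic above, here is a minimal Python sketch of the two issue models. It is a simplification, not a cycle-accurate model of either chip: the function names are made up for this example, every instruction is assumed to be independent of the others, and the same latency is used for both FPUs.

```python
def blocking_cycles(n_ops, latency):
    """FPU that accepts no new instruction until the previous one has finished."""
    return n_ops * latency


def pipelined_cycles(n_ops, latency):
    """FPU that accepts one new independent instruction every cycle."""
    if n_ops == 0:
        return 0
    # The first result appears after `latency` cycles, then one more per cycle.
    return latency + (n_ops - 1)


def speedup(n_ops, latency):
    """Ratio of blocking to pipelined cycle counts for the same work."""
    return blocking_cycles(n_ops, latency) / pipelined_cycles(n_ops, latency)


if __name__ == "__main__":
    for name, latency in (("FDIV", 35), ("FMUL", 5)):
        print(name, round(speedup(1000, latency), 2))
```

For a long run of independent instructions the ratio approaches the instruction latency, which is where the 35x (FDIV) and 5x (FMUL) figures come from; for short runs it is lower.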




Marko Oette

Posts 6
16 Jan 2018 18:52


Does it require changes to the ASM/C code to benefit from the new FPU, or does the old code work and benefit from the speedups?


Gunnar von Boehn
(Apollo Team Member)
Posts 3350
16 Jan 2018 19:01


Marko Oette wrote:

Does it require changes to the ASM/C code to benefit from the new FPU, or does the old code work and benefit from the speedups?

Let's look at an example.

This code performs a typical FPU operation.
It is written sequentially.

FMUL Fp0,Fp1
FADD Fp1,Fp7
FMUL Fp0,Fp2
FADD Fp2,Fp7
FMUL Fp0,Fp3
FADD Fp3,Fp7
FMUL Fp0,Fp4
FADD Fp4,Fp7

This code performs the same FPU operation.
It uses the same instructions, but their order allows it to run much faster.

FMUL Fp0,Fp1
FMUL Fp0,Fp2
FMUL Fp0,Fp3
FMUL Fp0,Fp4

FADD Fp1,Fp7
FADD Fp2,Fp7
FADD Fp3,Fp7
FADD Fp4,Fp7
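
To make the difference between the two listings concrete, here is a small Python sketch of an in-order FPU that issues at most one instruction per cycle and stalls until an instruction's operands are ready. Everything in it is an assumption made for illustration: the latencies are placeholders (the FMUL value is simply borrowed from the 68060 figure quoted earlier, the FADD value is invented), and the real 68080 pipeline may behave differently.

```python
# Placeholder latencies in cycles. These are NOT measured 68080 figures.
LATENCY = {"FMUL": 5, "FADD": 3}

SEQUENTIAL = [
    "FMUL Fp0,Fp1", "FADD Fp1,Fp7",
    "FMUL Fp0,Fp2", "FADD Fp2,Fp7",
    "FMUL Fp0,Fp3", "FADD Fp3,Fp7",
    "FMUL Fp0,Fp4", "FADD Fp4,Fp7",
]

REORDERED = [
    "FMUL Fp0,Fp1", "FMUL Fp0,Fp2", "FMUL Fp0,Fp3", "FMUL Fp0,Fp4",
    "FADD Fp1,Fp7", "FADD Fp2,Fp7", "FADD Fp3,Fp7", "FADD Fp4,Fp7",
]


def parse(line):
    """Split 'FMUL Fp0,Fp1' into (op, src, dst)."""
    op, operands = line.split()
    src, dst = operands.split(",")
    return op, src, dst


def total_cycles(program, latency=LATENCY):
    """Cycle at which the last result is available, for in-order single issue."""
    ready = {}       # register -> cycle at which its value becomes available
    next_issue = 0   # earliest cycle at which the next instruction may issue
    finish = 0
    for line in program:
        op, src, dst = parse(line)
        # Two-operand form: dst = dst <op> src, so both registers are read.
        issue = max(next_issue, ready.get(src, 0), ready.get(dst, 0))
        ready[dst] = issue + latency[op]
        finish = max(finish, ready[dst])
        next_issue = issue + 1
    return finish


if __name__ == "__main__":
    print("sequential:", total_cycles(SEQUENTIAL))
    print("reordered: ", total_cycles(REORDERED))
```

With these placeholder latencies the model gives 26 cycles for the sequential order and 17 for the reordered one. The real gain depends on the actual latencies, and in this model the chain of FADDs into Fp7 still runs one after another.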




Samuel Devulder

Posts 116
16 Jan 2018 21:50


At the moment, compilers only generate "sequential"-style code. Their instruction schedulers will have to be modified to emit code that runs faster on the 080. Until then, one has to write asm code by hand to get the most out of the FPU pipeline. Demo coders who still code in asm will love this! :)


Gunnar von Boehn
(Apollo Team Member)
Posts 3350
16 Jan 2018 21:58


Samuel Devulder wrote:

  At the moment, compilers only generate "sequential"-style code. Their instruction schedulers will have to be modified to emit code that runs faster on the 080.
 

 
All modern CPUs behave like the 68080:
all modern x86, all modern PPC, all modern ARM.
And for them, every current compiler knows how to emit non-sequential code. It only needs to be enabled for us.
 


Sean Sk

Posts 113
16 Jan 2018 23:25


Does FEMU in conjunction with Gold 2.7 (hybrid FPU) for V2 cards function in this same manner?


Samuel Devulder

Posts 116
16 Jan 2018 23:34


Yes, we'll need to have up-to-date compilers. Right now, neither GCC6 for 68k nor VBCC can do this. VBCC is possibly easier to tune since it has a dedicated scheduler in the form of an external executable (vsc). However, as it is a portable, 100% generic scheduler, I wonder how it can be made aware of the non-sequential preferences of the 68080. It'll need some work, I think.
 
Sources: 
    * vsc.h: EXTERNAL LINK 
   
    * vsc.c: EXTERNAL LINK
   
As for gcc6, let's hope Beppo will add a 68080 machine description file & scheduling definitions. It is a task for a gcc expert because, IIRC, the syntax isn't intuitive at all ( EXTERNAL LINK , EXTERNAL LINK ).
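
To give an idea of what such a scheduler has to do, here is a toy greedy list scheduler in Python. It is only a sketch and does not reflect how vsc or GCC are actually implemented: the latencies are invented placeholders, the helper names are hypothetical, and it handles only the two-operand `OP src,dst` register form used in the listings earlier in this thread.

```python
# Placeholder latencies in cycles. These are NOT measured 68080 figures.
LATENCY = {"FMUL": 5, "FADD": 3}

SEQUENTIAL = [
    "FMUL Fp0,Fp1", "FADD Fp1,Fp7",
    "FMUL Fp0,Fp2", "FADD Fp2,Fp7",
    "FMUL Fp0,Fp3", "FADD Fp3,Fp7",
    "FMUL Fp0,Fp4", "FADD Fp4,Fp7",
]


def parse(line):
    """Split 'FMUL Fp0,Fp1' into (op, src, dst)."""
    op, operands = line.split()
    src, dst = operands.split(",")
    return op, src, dst


def depends_on(later, earlier):
    """True if `later` must stay behind `earlier` (RAW, WAR or WAW hazard)."""
    _, l_src, l_dst = later
    _, e_src, e_dst = earlier
    # Two-operand form: dst = dst <op> src, so an instruction reads src and dst.
    return e_dst in (l_src, l_dst) or l_dst in (e_src, e_dst)


def schedule(program, latency=LATENCY):
    """Reorder `program`, always picking the instruction that can issue soonest."""
    instrs = [parse(line) for line in program]
    preds = [
        {i for i in range(j) if depends_on(instrs[j], instrs[i])}
        for j in range(len(instrs))
    ]
    done_at = {}   # instruction index -> cycle at which its result is ready
    order = []     # instruction indices in their new order
    cycle = 0      # earliest cycle at which the next instruction may issue

    while len(order) < len(instrs):
        placed = set(order)
        ready = [j for j in range(len(instrs))
                 if j not in placed and preds[j] <= placed]

        def earliest(j):
            # Conservative: wait until every predecessor has completed.
            return max([cycle] + [done_at[i] for i in preds[j]])

        # Ties are broken by original program order.
        best = min(ready, key=lambda j: (earliest(j), j))
        issue = earliest(best)
        done_at[best] = issue + latency[instrs[best][0]]
        order.append(best)
        cycle = issue + 1

    return [program[j] for j in order]


if __name__ == "__main__":
    print("\n".join(schedule(SEQUENTIAL)))
```

Fed the sequential listing from earlier in the thread, this toy scheduler moves the four FMULs to the front and the four FADDs behind them, which is the hand-reordered version shown there.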
 


Marko Oette

Posts 6
17 Jan 2018 05:10


Thx for clarifying this.


Thierry Atheist

Posts 618
17 Jan 2018 09:53


Does this mean that 2, 3 or even 4 of these can be put into the Vampire 4, because it has so many extra LEs!?


Nixus Minimax

Posts 236
17 Jan 2018 10:12


Gunnar von Boehn wrote:
Let's make a comparison:
 
  68060 FPU
  An FDIV instruction needs 35 clock cycles.
  After launching an FDIV instruction, the FPU will
  not accept any new FPU instruction until the previous one has finished.
 
  68080 FPU
  A new FDIV instruction, or any other FPU instruction, can be issued every clock cycle.
 
  With code that utilizes this, the 68080 can execute 35 times more FDIV FLOPS than the 68060 at the same clock rate.

Wouldn't 35 times more FDIV FLOPS require more FP registers?



Gunnar von Boehn
(Apollo Team Member)
Posts 3350
17 Jan 2018 10:22


Nixus Minimax wrote:

Gunnar von Boehn wrote:
Let's make a comparison:
 
  68060 FPU
  An FDIV instruction needs 35 clock cycles.
  After launching an FDIV instruction, the FPU will
  not accept any new FPU instruction until the previous one has finished.
 
  68080 FPU
  A new FDIV instruction, or any other FPU instruction, can be issued every clock cycle.
 
  With code that utilizes this, the 68080 can execute 35 times more FDIV FLOPS than the 68060 at the same clock rate.

 
  Wouldn't 35 times more FDIV FLOPS require more FP registers?
 

APOLLO has more FPU registers.
APOLLO has 32 FPU registers.

Also, our FDIV on the 68080 has a lower latency than on the 68060.
It's 10 cycles, not 35 as on the 68060.
This means 10 regs are already enough to reach 35 times the speed of the 68060.
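
The register argument can be checked with a short Python sketch. It assumes that every FDIV in flight needs its own destination register, that the divisions are independent of each other, and that the 68060 completes one FDIV per 35 cycles; the function name and parameters are made up for this example.

```python
def fdiv_speedup(registers, latency_080=10, cycles_060=35):
    """Estimated FDIV throughput of the 68080 relative to the 68060.

    Assumes one independent FDIV per available destination register can be
    in flight at a time, with at most one issued per cycle.
    """
    in_flight = min(registers, latency_080)
    # 68080: in_flight results every latency_080 cycles.
    # 68060: one result every cycles_060 cycles.
    return in_flight * cycles_060 / latency_080


if __name__ == "__main__":
    for regs in (1, 8, 10, 32):
        print(regs, "registers ->", fdiv_speedup(regs))
```

Under these assumptions 10 registers are enough to keep the divider busy every cycle, and going beyond 10 does not raise FDIV throughput further. With only the classic 8 FPU registers the same formula gives 28x instead of 35x.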



Henryk Richter

Posts 69
17 Jan 2018 13:59


sean sk wrote:

Does FEMU in conjunction with Gold 2.7 (hybrid FPU) for V2 cards function in this same manner?

FEMU has been replaced by a better (read: faster, more convenient) solution.

In Gold 2.7, an MC68882 compatible set of FPU instructions will be available at boot time. It won't be necessary to load 68040.library or femu.


Sean Sk

Posts 113
17 Jan 2018 15:45


Henryk Richter wrote:

  FEMU has been replaced by a better (read: faster, more convenient) solution.
 
  In Gold 2.7, an MC68882 compatible set of FPU instructions will be available at boot time. It won't be necessary to load 68040.library or femu.
 

 
  Sounds great! Thank you for the information. Really looking forward to this!
 


Gunnar von Boehn
(Apollo Team Member)
Posts 3350
17 Jan 2018 16:14


Thierry Atheist wrote:

does this mean that, 2, 3 or even 4 of these can be put into the Vampire 4 because it has so many extra LE's!!???

The point is something else.

The 68060 FPU was OK.
The 68080 FPU is much better - you can see this now in all benchmarks.
The 68080 is much faster than the 68060 - even with non-optimally scheduled code.

But current software does not even use 50% of the 68080 FPU's potential.
With a slight improvement in code layout,
meaning a small improvement in the order of instructions,
there is huge speed-up potential.

And this comes without needing bigger or more expensive FPGAs.


Mallagan Bellator

Posts 292
18 Jan 2018 02:13


Henryk Richter wrote:

sean sk wrote:

  Does FEMU in conjunction with Gold 2.7 (hybrid FPU) for V2 cards function in this same manner?
 

  FEMU has been replaced by a better (read: faster, more convenient) solution.
 
  In Gold 2.7, an MC68882 compatible set of FPU instructions will be available at boot time. It won't be necessary to load 68040.library or femu.

Does this mean the team actually managed to fit the FPU into the FPGA of the V2 cards? If so, was something else removed, or were the instructions somehow "compressed"?


Mallagan Bellator

Posts 292
18 Jan 2018 02:16


Gunnar von Boehn wrote:

Thierry Atheist wrote:

  Does this mean that 2, 3 or even 4 of these can be put into the Vampire 4, because it has so many extra LEs!?
 

 
  The point is something else.
 
  The 68060 FPU was OK.
  The 68080 FPU is much better - you can see this now in all benchmarks.
  The 68080 is much faster than the 68060 - even with non-optimally scheduled code.
 
  But current software does not even use 50% of the 68080 FPU's potential.
  With a slight improvement in code layout,
  meaning a small improvement in the order of instructions,
  there is huge speed-up potential.
 
  And this comes without needing bigger or more expensive FPGAs.

It would be totally sweet if someone optimized Quake to utilize this!


Andrew Copland

Posts 77
18 Jan 2018 10:31


I don't suppose that there's any limited reordering of dependent instructions that could be done here to speed up the existing FPU code?


Gunnar von Boehn
(Apollo Team Member)
Posts 3350
18 Jan 2018 11:37


Andrew Copland wrote:

I don't suppose that there's any limited reordering of dependent instructions that could be done here to speed up the existing FPU code?

 
Future applications should lay out their code in a better, modern way.
Future applications can also use BANK to access all 32 FPU registers, and to benefit from full 3-operand FPU operations in 1 instruction.

Maybe we should hold a little coding challenge or coding competition,
for people to take part and demonstrate who can create the most powerful FPU code?


Peter Heginbotham

Posts 123
18 Jan 2018 12:55


Gunnar von Boehn wrote:

Andrew Copland wrote:

  I don't suppose that there's any limited reordering of dependent instructions that could be done here to speed up the existing FPU code?
 

   
  Future applications should lay out their code in a better, modern way.
  Future applications can also use BANK to access all 32 FPU registers, and to benefit from full 3-operand FPU operations in 1 instruction.
 
  Maybe we should hold a little coding challenge or coding competition,
  for people to take part and demonstrate who can create the most powerful FPU code?

For me, the effort should be spent in updating the various compilers and programming tools to generate optimized code for the new FPU functionality. Maybe some community funded bounty?


Vojin Vidanovic

Posts 770
18 Jan 2018 14:53


Peter Heginbotham wrote:

  For me, the effort should be spent in updating the various compilers and programming tools to generate optimized code for the new FPU functionality. Maybe some community funded bounty?

Don't forget some CPU-related "NOT by-hand assembly fine-tuning" for the 080 overall. Generally, tools that would optimize existing 68k code for 080/MMX/2.7 mixed FPU/V4 full FPU would be most beneficial.
