APOLLO CPU Knowledge Forum

Performance and Benchmark Results!

300Mhz Target


Don Adan

Posts 38
03 Sep 2018 19:11

I'm not a hardware expert, but can't the 32-bit mulu.l d0,d1 be handled differently from the 64-bit mulu.l d0,d1:d2? It looks like the 32-bit mulu.l needs 2 cycles on the 68060 and 3 cycles on the 68080.

Philippe Flype
(Apollo Team Member)
Posts 299
03 Sep 2018 20:36

I'm also not a hardware expert, but it could indeed be handled differently. That said, it would take many precious cells in the FPGA.

Some details:


MULU.W Dn,Dm               38.2  <== 2 cycles
MULU.L Dn,Dm               25.3  <== 3 cycles
MULU.L Dn,Dr:Dq            25.3  <== 3 cycles 
DIV.W  Dn,Dm                2.2 
DIV.L  Dn,Dm                2.2 
DIV.L  Dn,Dr:Dq             2.2

Nixus Minimax

Posts 416
04 Sep 2018 08:08

Don Adan wrote:

It looks like 32 bit mulu.l needs 2 cycles for 68060 and 3 cycles for 68080.

OK, that explains a lot as the 060 seems to be very strong at 32 bit muls. I believe that two cycles for a mulu.l was very good at the time the 060 was released, I seem to remember that the PPCs of the era took 6 cycles for a mul. I guess Motorola chose to cram a lot of transistors into the 32 bit multiplier and drop hardware support for the 64 bit mul to compensate for it.

Don Adan

Posts 38
04 Sep 2018 12:17

Philippe Flype wrote:

I'm also not a hardware expert, but it could indeed be handled differently. That said, it would take many precious cells in the FPGA.

Some details:


  MULU.W Dn,Dm               38.2  <== 2 cycles
  MULU.L Dn,Dm               25.3  <== 3 cycles
  MULU.L Dn,Dr:Dq            25.3  <== 3 cycles 
  DIV.W  Dn,Dm                2.2 
  DIV.L  Dn,Dm                2.2 
  DIV.L  Dn,Dr:Dq             2.2

It seems that div is the slowest 68060/68080 instruction, at about 39 cycles, I think. It would be interesting to know whether other CPUs have a faster div implementation. The Pentium I has div timing similar to the 68060.

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
04 Sep 2018 14:12

Don Adan wrote:

Seems that div is the slowest 68060/68080 instruction

If you run statistics over instruction usage, you will see that some instructions are used hundreds of times more often than others.
How many cycles a very rarely used instruction needs has very little influence on the performance of the CPU.

Don Adan

Posts 38
04 Sep 2018 16:27

Gunnar von Boehn wrote:

Don Adan wrote:

Seems that div is the slowest 68060/68080 instruction

For the total performance of the CPU it's very important
that the most used instructions are as fast as possible,
and that data access, like cache or memory access, is fast.

Not all instructions are used equally often.
E.g. some instructions are used seldom or even only very rarely.

If you run statistics over instruction usage, you will see that some instructions are used hundreds of times more often than others.
How many cycles a very rarely used instruction needs has very little influence on the performance of the CPU.

Good coders avoid div instructions not because they are useless, but because they are too slow for critical routines. Faster div instructions would be used much more often. Because div instructions are slow, most coders use tables instead. Still, I often see code like this in some old programs:
divu.w #2,D0

Compilers use div instructions very often, so compiled code would run much faster with a fast div.
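
A minimal sketch of why `divu.w #2,D0` is a wasted divide: the quotient can be had with a one-bit right shift (`lsr.l #1,d0` on 68k). The Python below models the architectural behaviour of DIVU.W (32-bit dividend, 16-bit divisor, remainder in the upper word, quotient in the lower word, register unchanged on overflow). The function names are illustrative only and not taken from the thread.

```python
def divu_w(dividend: int, divisor: int) -> int:
    """Model of 68k DIVU.W <ea>,Dn: 32-bit / 16-bit unsigned divide.

    Returns the new 32-bit register value: remainder in the upper word,
    quotient in the lower word. If the quotient does not fit in 16 bits
    the instruction signals overflow and leaves the register unchanged.
    """
    dividend &= 0xFFFFFFFF
    divisor &= 0xFFFF
    if divisor == 0:
        raise ZeroDivisionError("68k takes a divide-by-zero exception here")
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        return dividend  # overflow: destination is not modified
    return (remainder << 16) | quotient


def quotient_by_shift(d0: int) -> int:
    """Equivalent of lsr.l #1,d0: unsigned divide by two via a shift."""
    return (d0 & 0xFFFFFFFF) >> 1


if __name__ == "__main__":
    for value in (0, 1, 7, 1000, 0x1FFFF):
        print(value, divu_w(value, 2) & 0xFFFF, quotient_by_shift(value))
```

The two are not drop-in identical: the shift leaves no remainder in the upper word and cannot overflow, so the replacement is only safe when the surrounding code uses just the quotient.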

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
04 Sep 2018 17:12

Don Adan wrote:

so compiled code would run much faster with a fast div.

Div could be made faster, e.g. even double speed.
But in real life the benefit would not be that big.

The reason is simple: it's a matter of fact
that some operations are needed very often in programs.
Used very often are: ADD, SUB, CMP, AND, OR, MOVE.
Compared to them, operations like DIV or MOD are rarely needed.

Let's say, for example, that your average program does 1000 ADDs and 10 DIVs. Even if we make the DIV twice as fast,
it still has only a small impact on the program speed.

BTW, a clever coder might find out how to do a DIV on APOLLO in 3 cycles.


Philippe Flype
(Apollo Team Member)
Posts 299
04 Sep 2018 17:17

040 @ 25 MHz -- added.
EXTERNAL LINK

Also 040 MUL / DIV details:

MULU.W Dn,Dm                2.2
MULU.L Dn,Dm                1.1
MULU.L Dn,Dr:Dq             1.2
DIV.W  Dn,Dm                0.9
DIV.L  Dn,Dm                0.5
DIV.L  Dn,Dr:Dq             0.9


Philippe Flype
(Apollo Team Member)
Posts 299
04 Sep 2018 21:07

Better layout: EXTERNAL LINK

Don Adan

Posts 38
04 Sep 2018 22:32

Gunnar von Boehn wrote:

Don Adan wrote:

so compiled code would run much faster with a fast div.

Div could be made faster, e.g. even double speed.
But in real life the benefit would not be that big.

The reason is simple: it's a matter of fact
that some operations are needed very often in programs.
Used very often are: ADD, SUB, CMP, AND, OR, MOVE.
Compared to them, operations like DIV or MOD are rarely needed.

Let's say, for example, that your average program does 1000 ADDs and 10 DIVs. Even if we make the DIV twice as fast,
it still has only a small impact on the program speed.

BTW, a clever coder might find out how to do a DIV on APOLLO in 3 cycles.

1000 ADDs can be done in 500-1000 cycles, while 10 DIVs take about 400 cycles. So DIVs that are 50% faster would give a real speedup.
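
The disagreement here is an Amdahl's-law question, and the numbers quoted in this post can be plugged in directly. The sketch below assumes "50% faster" means the DIV cycle count is halved (matching the "twice as fast" wording in the quoted post); the function names are illustrative only.

```python
def total_cycles(n_add: int, cycles_per_add: float,
                 n_div: int, cycles_per_div: float) -> float:
    """Cycle count for a simple mix of ADDs and DIVs with no overlap."""
    return n_add * cycles_per_add + n_div * cycles_per_div


def overall_speedup(n_add: int, cycles_per_add: float,
                    n_div: int, cycles_per_div: float,
                    div_speedup: float) -> float:
    """Whole-program speedup when only the DIV gets div_speedup times faster."""
    before = total_cycles(n_add, cycles_per_add, n_div, cycles_per_div)
    after = total_cycles(n_add, cycles_per_add, n_div, cycles_per_div / div_speedup)
    return before / after


if __name__ == "__main__":
    # 1000 ADDs in 1000 cycles, 10 DIVs in 400 cycles, DIV made twice as fast.
    print(round(overall_speedup(1000, 1.0, 10, 40.0, 2.0), 3))
    # 1000 ADDs in 500 cycles (two per cycle), same DIV figures.
    print(round(overall_speedup(1000, 0.5, 10, 40.0, 2.0), 3))
```

Under these figures the program runs 1400 cycles versus 1200, or 900 versus 700, so halving the DIV time yields roughly a 17% to 29% overall gain. Whether that counts as small or as a real speedup is exactly the point being argued.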

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
04 Sep 2018 23:17

Don Adan wrote:

10 DIVs will be done in about 400 cycles.

I bet that you, as a good coder, already see a way to do 1 DIV in ~14 cycles on APOLLO.


Don Adan

Posts 38
04 Sep 2018 23:18

Btw, by a 3-cycle DIV do you mean code pipelining? If yes, this is almost impossible on the 68080, because Apollo is too fast: 40 to 80 instructions must be executed after the DIV before accessing its output registers.


Don Adan

Posts 38
04 Sep 2018 23:20

Yes, 10 DIVs in 200 cycles is possible with pipelining.

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
04 Sep 2018 23:26

Don Adan wrote:

Yes, 10 DIVs in 200 cycles is possible with pipelining.

10 DIVs = ~150 cycles without pipelining
10 DIVs = ~30 cycles with pipelining

I just saw that Flype already explained how to do it.

Don, what do you think about it?

Nixus Minimax

Posts 416
05 Sep 2018 09:08

Don Adan wrote:

Btw, by a 3-cycle DIV do you mean code pipelining? If yes, this is almost impossible on the 68080, because Apollo is too fast: 40 to 80 instructions must be executed after the DIV before accessing its output registers.

I don't think this is impossible or even difficult. If you need just the occasional DIV, you can often unroll your loop by two and split the register set virtually in two parts. You do the DIV in one register set half for the preceding data set while doing the remaining work with the other half for the next data set.
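
A rough cycle model of the unroll-and-overlap idea described above, as a sketch only. It assumes the divide can run in the background while independent instructions issue (the thread states this for the FPU divide; for the integer divide it is an assumption), and that each data set consists of one DIV plus some independent work. The latency and work figures used in the example are hypothetical parameters, not measurements.

```python
def serial_cycles(n_sets: int, div_latency: int, other_work: int) -> int:
    """Every data set waits for its own DIV before doing its other work."""
    return n_sets * (div_latency + other_work)


def overlapped_cycles(n_sets: int, div_latency: int, other_work: int) -> int:
    """DIV for one data set runs while the other work of its neighbour runs.

    One stage runs alone at the start and one at the end; in between,
    each step costs whichever of the two overlapped stages is longer.
    """
    if n_sets <= 0:
        return 0
    steady_state = (n_sets - 1) * max(div_latency, other_work)
    return other_work + steady_state + div_latency


if __name__ == "__main__":
    # Hypothetical figures: 40-cycle DIV, 30 cycles of independent work.
    print(serial_cycles(10, 40, 30), overlapped_cycles(10, 40, 30))
```

The model also shows the limit raised earlier in the thread: when the independent work per data set is much shorter than the divide latency, the loop still runs at the speed of the divide.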

Don Adan

Posts 38
05 Sep 2018 12:13

Gunnar von Boehn wrote:

Don Adan wrote:

Yes, 10 DIVs in 200 cycles is possible with pipelining.

10 DIVs = ~150 cycles without pipelining
10 DIVs = ~30 cycles with pipelining

I just saw that Flype already explained how to do it.

Don, what do you think about it?

Flype uses fdiv, which is much faster than div on the 68080. I'm not a hardware expert, but it seems to me that div could run at the same speed as fdiv. Maybe it is possible to convert div to fdiv internally inside the Apollo core and use the fdiv implementation for div too? Having both a fast div and a fast fdiv would be better still, because two divisions could then be done at the same time: fdiv and div can work in parallel, if I remember right.

Don Adan

Posts 38
05 Sep 2018 12:21

Nixus Minimax wrote:

Don Adan wrote:

Apollo is too fast for pipelining div. In highly optimised routines all or almost all CPU registers are in use, so unrolling the loop has no effect: the CPU will still be waiting when the output registers of the div are needed. Good coders try to write short code, and in short code, covering about 40 cycles with other work is, for me, almost impossible on the Apollo Core.

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
05 Sep 2018 12:41

Don,

The idiv is not 40 cycles, but 35 cycles for a 64-bit divide!
The floating point div is actually faster and finishes in 12 cycles.
Btw, FDIV can also be used for many integer divs.
This means you can speed up your code somewhat this way.
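
A minimal sketch of doing an integer divide through the floating point divider. It assumes IEEE double precision (53-bit mantissa), which is what a Python float provides; the thread does not state the precision of the 68080 FPU, so treat that as an assumption. With 32-bit unsigned operands both values convert exactly, and the rounded quotient can never reach the next integer, so truncating it gives the same result as an integer divide. The function name is illustrative only.

```python
def udiv32_via_float(a: int, b: int) -> int:
    """Unsigned 32-bit divide done as a floating point divide plus truncation."""
    if not (0 <= a <= 0xFFFFFFFF and 0 < b <= 0xFFFFFFFF):
        raise ValueError("operands must be 32-bit unsigned and b non-zero")
    # Both operands are exactly representable in a double.
    quotient = float(a) / float(b)
    # Truncate toward zero, as an integer divide would.
    return int(quotient)


if __name__ == "__main__":
    for a, b in ((100, 7), (0xFFFFFFFF, 3), (0xFFFFFFFE, 0xFFFFFFFF)):
        print(a, b, udiv32_via_float(a, b), a // b)
```

This does not carry over to single precision, whose 24-bit mantissa cannot hold every 32-bit integer exactly, and it yields no remainder unless one is computed separately.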

Don Adan wrote:

Apollo is too fast for pipelining div

Let's make a real-world example:

Let's look at the QUAKE render loop.
We all know Quake.

QUAKE renders a row in an inner loop.
Every 8 pixels the projection needs to be corrected.
For this, DIV is used.
This means you have the time it takes to render 8 pixels for the DIV.
Quake could do the DIV in the FPU, let it run in parallel while drawing the current 8 pixels - and then get the DIV result nearly for free.

This is how a coder could do it.
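
The scheme described above can be sketched as follows: a texture coordinate is recovered with a true divide only every 8 pixels and linearly interpolated in between. This is an illustrative sketch, not Quake's actual source; the function and parameter names are hypothetical, only one texture coordinate is shown, and Python executes everything sequentially, so the comment marks where real code would let the FPU divide run in the background.

```python
def span_exact(u_over_z0, inv_z0, du_over_z, dinv_z, width):
    """Reference renderer: one divide per pixel."""
    return [(u_over_z0 + i * du_over_z) / (inv_z0 + i * dinv_z)
            for i in range(width)]


def span_subdivided(u_over_z0, inv_z0, du_over_z, dinv_z, width, step=8):
    """One perspective divide per `step` pixels, linear steps in between."""
    out = []
    u_left = u_over_z0 / inv_z0  # divide for the first pixel of the span
    x = 0
    while x < width:
        run = min(step, width - x)
        x_right = x + run
        # Real code would start this divide on the FPU here and only
        # read the result after the `run` pixels below have been drawn.
        u_right = (u_over_z0 + x_right * du_over_z) / (inv_z0 + x_right * dinv_z)
        # For a full run of 8 in fixed-point code this is a shift, not a divide.
        du = (u_right - u_left) / run
        for i in range(run):
            out.append(u_left + i * du)
        u_left = u_right
        x = x_right
    return out


if __name__ == "__main__":
    exact = span_exact(0.0, 1.0, 1.0, 0.01, 32)
    approx = span_subdivided(0.0, 1.0, 1.0, 0.01, 32)
    print(max(abs(e - a) for e, a in zip(exact, approx)))
```

The approximation is exact at every 8-pixel boundary and drifts slightly in between, which is the trade that buys 8 pixels' worth of time to hide the divide.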

Don Adan

Posts 38
05 Sep 2018 19:07

Gunnar von Boehn wrote:

Don Adan wrote:

Apollo is too fast for pipelining div

Let's make a real-world example:

Let's look at the QUAKE render loop.
We all know Quake.

QUAKE renders a row in an inner loop.
Every 8 pixels the projection needs to be corrected.
For this, DIV is used.
This means you have the time it takes to render 8 pixels for the DIV.
Quake could do the DIV in the FPU, let it run in parallel while drawing the current 8 pixels - and then get the DIV result nearly for free.

This is how a coder could do it.

Yes, but Quake code is compiled, not assembled. Compilers are not smart enough. And I still think it would be better if idiv matched fdiv for speed: 12 cycles is much easier to pipeline, and all existing Amiga 68k code would run faster. For me fdiv performance is good or even very good, but idiv is only average. Maybe the idiv instruction can be microcoded as fdiv, or something similar can be done? I don't know whether idiv and fdiv use the same resources in the FPGA, but if not, then perhaps some FPGA space could be freed too.

Gunnar von Boehn
(Apollo Team Member)
Posts 6222
05 Sep 2018 19:32

Don Adan wrote:

Yes, but Quake code is compiled, not assembled.

Actually the Quake trick that I explained to you
is exactly what is done in Quake on PC.
This FPU-DIV trick was one of the cool new tricks/inventions of Quake.

The FDIV on 68K was relatively slow, so this trick did not work that well on Amiga.
But as you saw, the 68080 can do FDIV much faster than all previous 68k CPUs.
Even the 68060 needed ~50 cycles for FDIV.
APOLLO changes this significantly and can reach a peak throughput of one FDIV per clock.

APOLLO opens some new options for coders.
You can clearly see this in several areas in Minibench.

BTW do you have a Vampire too?
