
Information about the Apollo CPU and FPU.

Apollo Gets "Out of Order" Execution Support

Gunnar von Boehn
(Apollo Team Member)
Posts 6214
18 Aug 2018 04:50


John Heritage wrote:

Is the OOO engine enabled for the entirety of the 4-wide Integer core? (I believe 68080 was said to be able to execute up to 4 instructions at once). Any influence on maximum clock speed? Power consumption?

To prevent misunderstandings:
OOO does not influence the peak speed.

Apollo can issue at most 4 instructions per cycle.
Running at 100 MHz, this would be a peak of 400 MIPS.
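
The peak figure is simply issue width times clock. A minimal sketch of that arithmetic (the function name `peak_mips` is only an illustrative choice, not something from the Apollo tooling):

```python
def peak_mips(issue_width: int, clock_mhz: float) -> float:
    """Peak million instructions per second = instructions per cycle x MHz."""
    return issue_width * clock_mhz


# 4 instructions per cycle at 100 MHz, the figures quoted in the post
print(peak_mips(4, 100))
```

This is an upper bound only; it says nothing about how often all four issue slots can actually be filled.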

But it is like a car with a top speed of 150 miles per hour: not every street allows you to drive that fast.

The APOLLO pipeline is actually designed to perform very well even without OOO, in contrast to PPC for example.
OOO can accelerate code which suffers from latency stalls.

The biggest gain from OOO on Apollo is for FPU instructions, as many of them have higher latency, e.g. 6 cycles.
Routines with a high percentage of cache misses also gain.

In total, OOO might give an average boost of around 10%.



Gunnar von Boehn
(Apollo Team Member)
Posts 6214
18 Aug 2018 04:54


Daniel Sevo wrote:

This is pretty cool stuff. Just when you think performance has peaked.. However.. it also made me wonder if there is a "final specification" for the Apollo Core?

Our goal is to create the "ultimate" 68K,
one which can play in the same ballpark as the modern chips of the big companies.

We cherry-picked the best tricks of the big ones and added them to APOLLO. Clock for clock, APOLLO already performs very well.
Clock for clock, Apollo already smokes many PPC and all ARM chips.
Frankly, Haswell is a great core - we will not beat it.

We have a roadmap with more improvements coming in the near future.


Captain Zalo

Posts 71
18 Aug 2018 09:05


Have you checked the cost of designing and manufacturing an 080 ASIC at a 28nm fab in China in a low volume (say 5000 CPUs)? It would be fun to see the Apollo Core operating at 10-15x the current speed.


Salteadorneo Salteador

Posts 20
18 Aug 2018 12:29


Congratulations.
Do you want to add an SPU-style unit?
And add more cores? I guess this would be for an ASIC version.


John Heritage

Posts 111
18 Aug 2018 14:00


Thanks Gunnar - appreciate the OOO detail. I know as cores get wider, maintaining ILP requires OOO, but it sounds like an in-order 4-wide Apollo Core is still efficient. Still, OOO is a lot of work for the extra performance, you have my full respect!

If these were ASICs, an OOO 68080 sounds like it should be called 68090 or something :)


Nixus Minimax

Posts 416
18 Aug 2018 17:05


John Heritage wrote:
If these were ASICs,  an OOO 68080 sounds like it should be called 68090 or something :)

The odd-numbered 68k processors are just minor updates to the preceding even ones (000 and 010, 020 and 030). I think OOO would be significant enough to justify a whole generation step and thus we should proceed from the 68080 ("oh-eighty") to the 680A0 ("oh-A-ty")... :)



Daniel Sevo

Posts 299
18 Aug 2018 19:14


Captain Zalo wrote:

Have you checked the cost of designing and manufacturing an 080 ASIC at a 28nm fab in China in a low volume (say 5000 CPUs)? It would be fun to see the Apollo Core operating at 10-15x the current speed.

An ASIC is probably in the half-million euro/dollar and upwards ballpark.
Even given the success of several recent Kickstarters, raising half a million or even as much as 1 million euro that way would be tough, relying fully on the retro-community. (You'd probably need some manufacturer on board that plans to use it for something else.)




Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
19 Aug 2018 11:26


Gunnar von Boehn wrote:

  To prevent misunderstandings.
  OOO does not influence the max peak speed.
 
  In total OOO might give an average boost of around 10%
 

 
  So all V2s and V4s will get +10% FPU performance and lower latency on many operations from GOLD3 onward? That is a nice way to go, on top of the GOLD3 features!

Salteadorneo salteador wrote:

Congratulations.
  Do you want to add an SPU-style unit?
  And add more cores? This I guess it would be for an ASIC version.

Our current FPGAs seem to fit one core only. To go multicore, we don't just need "space"; the core also needs to be 100% done.

Besides, the lack of a multi-core scheduler in AmigaOS (or MacOS Classic, Atari ...) is far more problematic. An AROS backport of x64 SMP and Gallium may be the high tide, once it is on the to-do list. And that would enable what you desire.

Until then, it's far better to push the features and clock speed ahead.


M Rickan

Posts 177
21 Aug 2018 01:18


Daniel Sevo wrote:

ASIC is probably in the half million Euro  / dollar and upwards ballpark.
Even given the success of several recent kickstarters, half or even as much as 1 million euro required would be tough to do that way, relying fully on the retro-community.

Funding a full system would definitely be a more compelling offering in terms of crowdfunding. Look at the AtariVCS campaign as an example.

The other question is whether or not Gunnar and team have an interest in pursuing this option or licensing the IP for this purpose.


Gunnar von Boehn
(Apollo Team Member)
Posts 6214
21 Aug 2018 06:21


Regarding the questions about ASIC.
 
Apollo 68080 is clearly the most advanced 68K CPU.
Those advanced features are the reason Apollo is the fastest 68k CPU.

The Motorola 68060 is the 2nd fastest 68K CPU.
If you compare both Apollo 68080 and the M68060 you can clearly see the difference those improvements make.

 
Doing an ASIC needs a lot of preparation time and a big investment.
And then an ASIC is a one time shot.
 
Right now we use FPGAs for the Vampire cards.
The FPGA allows "upgrading" the core version.
So far the Vampire users have regularly received new CPU versions, each one improved. We continuously develop the core, and Apollo continuously gets faster and more powerful.

We have a good list of development ideas for the future.
The new Out of Order feature gives a big speed boost for some applications - we saw benefits of up to +40% in some cases.
 

 
 


Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
21 Aug 2018 10:00


Gunnar von Boehn wrote:

  We have a good list of development ideas for the future.
  The new Out Of Order features gives a big speed boost for some applications - we saw benefit of up to +40% for some cases

Fully understood. +10-40% FPU performance and lower latency for V2 and V4, and we will see what V6+ brings :-)



Gregthe Canuck

Posts 274
21 Aug 2018 10:05



I really like seeing these 10% improvements. That is on top of the faster memory controller, the faster FPGA in V4, etc... They multiply. :)


Martin Soerensen

Posts 232
21 Aug 2018 14:08


If you had put all your effort into making an ASIC to be released today, then it probably would have neither a useful FPU nor SAGA/PAMELA, since you would have had to start this work a long time ago when Apollo had far fewer features than it does today. I'd probably prefer a slower FPGA-based version which makes it possible to apply regular updates and add new features rather than having a top-notch CPU clock frequency.
 
Many state of the art professional products also use FPGAs rather than ASICs simply because the design would be outdated before an ASIC could be ready and having the ability to fix bugs and add new features with a simple firmware flash is a very powerful feature.
 
Would it be feasible to put only the 080 core in an ASIC and then add SAGA/PAMELA/FPU/etc. in peripheral (FPGA) chips? I think that the instruction set is quite mature by now, but I am not sure how much SAGA etc. is interlocked to the CPU internals.


Vojin Vidanovic
(Needs Verification)
Posts 1916/ 1
28 Aug 2018 09:54


Martin Soerensen wrote:

  Would it be feasible to put only the 080 core in an ASIC and then add SAGA/PAMELA/FPU/etc. in peripheral (FPGA) chips? I think that the instruction set is quite mature by now, but I am not sure how much SAGA etc. is interlocked to the CPU internals.

That is likely to be half of real asic/motherboard costs, just because of ...

Technically it is possible I suppose, but reasonable ...

let it develop fully for a year or two before that.

Back to topic: OoO execution seems to be a feature of P6+ class CPUs, but it is also connected to this stupid Spectre leak. Will we try to avoid that fuss without losing performance? I highly doubt anyone is going to use a hardware hack on Vampires :-)



Chastanier Cclecle

Posts 19
29 Aug 2018 20:36


Does the out-of-order feature mean the Apollo core also supports register renaming? It could really improve superscalar performance (especially for hand-written, non-optimized asm code) if it is not already done, but it would cost some (a lot of) additional FPGA resources. But with the V4 there will be plenty of space :)


Gunnar von Boehn
(Apollo Team Member)
Posts 6214
06 Sep 2018 07:51


Chastanier Cclecle wrote:

  Does the out-of-order feature mean the Apollo core also supports register renaming? It could really improve superscalar performance (especially for hand-written, non-optimized asm code) if it is not already done, but it would cost some (a lot of) additional FPGA resources. But with the V4 there will be plenty of space :)
 

 
 
In general one can say that "Out of Order" is much less important for 68K than e.g. for PPC.
The reason is easy to understand:
 
A 68k instruction can do a lot of work - often a 68k can do with one instruction the amount of work for which a PPC needs 3 instructions.
 
Example:

  ADD.L D1,(A0)+

 
This instruction loads the data from the memory location A0 points to, adds the value of D1 to it, saves the result back into memory, and post-increments A0.

The PPC needs 3-4 instructions for this.
Example of how the PPC would do it (using 68K syntax for clarity):
 

  LOAD  (a0),R2
  ADD  D1,R2
  STORE R2,(A0)
  UPDATE A0+

 
The one instruction needs 1 cycle on the 68080.
The PPC equivalent needs 3-4 instructions which depend on each other sequentially - the timing will typically look like this:
 

LOAD  (a0),R2    -- 1 clock
                  -- 3 clock load usage bubble
ADD  D1,R2      -- 1 clock
STORE R2,(A0)    -- 1 clock
UPDATE A0+      -- 1 clock

 
We see that the dependencies require the instructions to be executed one after another, and the DCache access creates a load-usage bubble.
This makes it take a total of 7 clocks.
The PPC needs OoO to fill the gaps.

Without OoO, PPC will under-perform badly.
OoO allows the PPC to fill the bubble with some other code.
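
The 7-clock total can be checked with a very small in-order timing model. This is only a sketch using the per-instruction clocks from the listing above; the tuple layout and the function names `in_order_clocks` and `bubble_clocks` are hypothetical, and a real pipeline is of course far more complex.

```python
# Each entry: (mnemonic, issue clocks, stall clocks before the next dependent
# instruction can start). The numbers are taken from the listing in the post.
SEQUENCE = [
    ("LOAD   (A0),R2", 1, 3),  # followed by the 3-clock load-usage bubble
    ("ADD    D1,R2", 1, 0),
    ("STORE  R2,(A0)", 1, 0),
    ("UPDATE A0+", 1, 0),
]


def in_order_clocks(seq):
    """Total clocks when every instruction waits for the previous one."""
    return sum(issue + stall for _, issue, stall in seq)


def bubble_clocks(seq):
    """Clocks spent idle, i.e. the slots OoO could try to fill."""
    return sum(stall for _, _, stall in seq)


total = in_order_clocks(SEQUENCE)
bubbles = bubble_clocks(SEQUENCE)
print(total, bubbles)  # prints "7 3"
```

In this simplified model 3 of the 7 clocks are idle, which is the gap an out-of-order core would try to fill with independent work.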

And if you run such an operation in a LOOP, the decoder will put these operations into the execution pipe several times - like this:
 
 


LOAD  (a0),R2    -- 1 clock
                  -- 3 clock load usage bubble
ADD  D1,R2      -- 1 clock
STORE R2,(A0)    -- 1 clock
UPDATE A0+      -- 1 clock
 
LOAD  (a0),R2    -- 1 clock
                  -- 3 clock load usage bubble
ADD  D1,R2      -- 1 clock
STORE R2,(A0)    -- 1 clock
UPDATE A0+      -- 1 clock

 
The R2, which is really a TMP variable here, is now used 2 times in the execution pipe. In reality these 2 usages of the temp variable R2 do not depend on each other. But as the same name R2 is used both times, the CPU can NOT utilize OoO here to re-order and cannot speed up the code. Only by renaming them to two different variables, e.g. T2 / T3, can the core use OoO fully and reorder the operations to avoid some of the bubbles.
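
The renaming idea can be sketched in a few lines. This is a toy model, not how Apollo or any PPC implements it: every register write gets a fresh name (`P0`, `P1`, ... are made-up physical register names, and `rename` is a hypothetical helper), while reads use the most recent name.

```python
from itertools import count


def rename(program):
    """Give every register write a fresh physical name (P0, P1, ...).

    `program` is a list of (op, sources, destination-or-None) tuples.
    Reads use the most recent mapping; unmapped registers keep their name.
    """
    fresh = count()
    mapping = {}
    renamed = []
    for op, sources, dest in program:
        new_sources = tuple(mapping.get(reg, reg) for reg in sources)
        new_dest = None
        if dest is not None:
            new_dest = f"P{next(fresh)}"
            mapping[dest] = new_dest
        renamed.append((op, new_sources, new_dest))
    return renamed


# One loop iteration from the post, written as (op, sources, destination).
ITERATION = [
    ("LOAD", ("A0",), "R2"),
    ("ADD", ("D1", "R2"), "R2"),
    ("STORE", ("R2", "A0"), None),
    ("UPDATE", ("A0",), "A0"),
]

renamed = rename(ITERATION * 2)  # two iterations back to back
for op, sources, dest in renamed:
    print(op, sources, "->", dest)
```

After renaming, the two LOADs write different registers, so the second iteration no longer collides with the first one over the name R2. The dependency on A0 through UPDATE is a real one and stays in place.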
 
As you see, OoO and register renaming are very important for the PPC to even get "acceptable" performance.

The 68k on the other hand does not have these problems, by design.
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6214
06 Sep 2018 08:20


Gunnar von Boehn wrote:

In general one can say that "Out of Order" is much less important for 68K than e.g. for PPC.

 
For INTEGER operations the above is 100% true,
as the 68080 does integer operations very fast and supports FREE memory access as part of them.

For FLOATING point operations the situation is slightly different.
FLOATING point operations have by design some latency.
For them OoO does help a lot to reach the best performance.
 
If you take a look at the recent MiniBench Scores you can clearly see this.
Look how much faster FDIV can be on 68080 compared to 68060.

68060 @ 50 MHz
FDIV.S  (A0),FPn  ==  0.6

Vampire x11
FDIV.S  (A0),FPn  ==  69.1

= more than 100 times faster than 68060 @ 50 MHz
= speed of a 68060 @ over 5 Gigahertz


On APOLLO you can reach the FDIV performance of a 68060 running at a clock of several Gigahertz.
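
The two "=" lines follow from the scores above with a little arithmetic. A minimal sketch, assuming (as the comparison implicitly does) that a 68060's FDIV score would scale linearly with its clock; the variable names are illustrative only:

```python
score_68060 = 0.6      # FDIV.S score, 68060 @ 50 MHz (from the post)
score_vampire = 69.1   # FDIV.S score, Vampire x11 (from the post)
clock_68060_mhz = 50

# How many times faster the Vampire result is
speedup = score_vampire / score_68060

# Clock a 68060 would need for the same score, assuming linear scaling
equivalent_clock_ghz = clock_68060_mhz * speedup / 1000

print(round(speedup, 1), round(equivalent_clock_ghz, 2))
```

This gives a factor of roughly 115 and an equivalent clock of a bit under 6 GHz, consistent with "more than 100 times" and "over 5 Gigahertz".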
 
 


Mallagan Bellator

Posts 393
08 Sep 2018 10:08


This is some great stuff
