Information about the Apollo CPU and FPU.

Microarchitecture Details
Krystian Baclawski | Posts 5 | 29 May 2016 12:12
Could you share with us a diagram of the microarchitecture, drawn in a fashion similar to the ones found in PowerPC processor manuals? I hope you could also give us some details on it and answer the following questions:

* Is the Apollo Core a fairly standard out-of-order machine?
* How are the caches organized (block size, set associativity, pseudo-LRU)?
* How do you organize the fetch unit and instruction decoding?
* Do you use a hybrid branch predictor, or just a local one with 2-bit saturating counters?
* How do you deal with instructions that perform more than one memory access, like MOVEM and MOVE (Ax)+,(Ay)+?
* How do you recover from mispredicted branches and interrupts?

Kind regards, Cahir / Ghostown & Whelpz
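As context for the predictor question above: a local 2-bit saturating counter is a tiny state machine in which two consecutive mispredictions are needed to flip the predicted direction. A minimal sketch (illustrative only, not any specific core's implementation):

```python
# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
# Two mispredictions in a row are needed to change the prediction.

class SatCounter:
    def __init__(self, state=1):          # start weakly not-taken
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = SatCounter()
c.update(True); c.update(True)            # branch observed taken twice
assert c.predict() is True
c.update(False)                           # a single not-taken does not flip it
assert c.predict() is True
```

This hysteresis is why a 2-bit counter handles loop-closing branches well: the single not-taken at loop exit does not disturb the next run of the loop.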
---

Gunnar von Boehn (Apollo Team Member) | Posts 6254 | 29 May 2016 12:47
| Krystian Baclawski wrote:
| Could you share with us a diagram of the microarchitecture, drawn in a fashion similar to the ones found in PowerPC processor manuals? I hope you could also give us some details on it and answer the following questions:
|
Please consult the Motorola manual for the 68060 CPU. Apollo is an improvement over the 68060 but has a lot in common with it.
Krystian Baclawski wrote:
| * Is the Apollo Core a fairly standard out-of-order machine?
|
Apollo is in-order, with a vertical pipeline: EA calculation, DCache access, and the ALU are unrolled in the pipeline behind each other. The pipeline design of Apollo is very similar to the Motorola 68060, Intel Pentium, and VIA Centaur. It is super-scalar, with a peak of 4 instructions per clock.

Krystian Baclawski wrote:
| * How are the caches organized (block size, set associativity, pseudo-LRU)?
|
Replacement is random. Associativity (1-way or 2-way) is selectable at core build time.

Krystian Baclawski wrote:
| * How do you organize the fetch unit and instruction decoding?
|
Not sure what level of detail you are asking for.

Krystian Baclawski wrote:
| * Do you use a hybrid predictor or just a local one with 2-bit saturating counters?
|
2-bit saturating counters plus a 512-entry BTC.

Krystian Baclawski wrote:
| * How do you deal with instructions that perform more than one memory access, like MOVEM and MOVE (Ax)+,(Ay)+?
|
They get cracked into several operations.

Krystian Baclawski wrote:
| * How do you recover from mispredicted branches and interrupts?
|
Mispredicted instructions get flushed from the pipeline and the core continues at the correct instruction.
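The recovery step described above can be sketched in miniature (pipeline representation and instruction names invented here): on a detected mispredict, everything younger than the branch is flushed, and fetch restarts at the resolved target.

```python
# Toy recovery model: flush all instructions younger than the mispredicted
# branch, then redirect fetch to the correct target address.

def recover(pipeline, branch_pos, correct_target):
    """pipeline: oldest-first list of (pc, op) entries.
    Returns the surviving entries and the new fetch pc."""
    kept = pipeline[: branch_pos + 1]      # the branch and older survive
    return kept, correct_target

pipe = [(0x10, "cmp"), (0x12, "beq"), (0x30, "wrong"), (0x32, "path")]
kept, next_pc = recover(pipe, branch_pos=1, correct_target=0x14)
assert kept == [(0x10, "cmp"), (0x12, "beq")]
assert next_pc == 0x14
```

With a short in-order pipeline, the cost of such a flush is bounded by the pipeline depth, which keeps the mispredict penalty small.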
---

Krystian Baclawski | Posts 5 | 29 May 2016 13:17
| Gunnar von Boehn wrote:
| It is super-scalar, with a peak of 4 instructions per clock. |
So you can issue up to 4 instructions each cycle, provided there are no RAW/WAR/WAW dependencies? What is the pipeline depth then? With a 4-issue-wide CPU I implicitly assumed it had to be OoO. On the other hand, wouldn't hazard detection / bypassing be much simpler, or better structured, if you decided to go the OoO way?

Gunnar von Boehn wrote:
| Replacement is random. Associativity (1-way or 2-way) is selectable at core build time. |
Isn't that too modest? Is there much difficulty in implementing 4-way true LRU (the trick with 5 extra bits per set)? Would there be a significant performance improvement according to your benchmarks?

Gunnar von Boehn wrote:
| 2-bit saturating counters plus a 512-entry BTC |
What miss ratio do you see in your benchmarks? Have you considered adding a two-level adaptive predictor (Yeh and Patt)?
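For reference, a two-level adaptive predictor of the kind asked about here keeps a branch-history register whose value selects one of several 2-bit counters, so repeating outcome patterns can be learned. A minimal per-branch sketch with invented table sizes:

```python
# Two-level adaptive predictor sketch: level 1 is a shift register of recent
# outcomes; level 2 is a pattern history table (PHT) of 2-bit counters
# indexed by that history. Sizes here are illustrative only.

class TwoLevel:
    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.history = 0                              # level 1: outcome history
        self.pht = [1] * (1 << hist_bits)             # level 2: 2-bit counters

    def predict(self):
        return self.pht[self.history] >= 2            # True = predict taken

    def update(self, taken):
        i = self.history
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        mask = (1 << self.hist_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

p = TwoLevel()
for _ in range(20):                  # alternating pattern: T, N, T, N, ...
    p.update(True)
    p.update(False)
assert p.predict() is True           # it has learned the alternation
```

A plain 2-bit counter would hover around 50% on this alternating branch, which is exactly the kind of case the second level fixes.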
---

Gunnar von Boehn (Apollo Team Member) | Posts 6254 | 29 May 2016 15:11
| Krystian Baclawski wrote:
|
Gunnar von Boehn wrote:
| It is super-scalar, with a peak of 4 instructions per clock. |
So you can issue up to 4 instructions each cycle, provided there are no RAW/WAR/WAW dependencies?
|
4 instructions is the peak; 2 per cycle is more common.

Krystian Baclawski wrote:
| What is the pipeline depth then?
|
6 stages.

Krystian Baclawski wrote:
| Wouldn't hazard detection / bypassing be much simpler or better structured if you decided to go the OoO way?
|
A typical CISC instruction like this one: ADDI.L #$123456,(40,A7) can be tracked and executed as 1 operation on cores implementing their pipelines like Apollo, the 68060, or the Pentium. This is optimal. On OoO cores, the above single instruction would need to be split into 3 internal operations, so the number of tracked operations in flight, and the cost of tracking them, increases. As with everything in life, it's a trade-off.
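The trade-off can be made concrete. Micro-op names below are invented for illustration, not Apollo's internal encoding: an OoO core would track this read-modify-write instruction as three internal operations, while a 68060/Apollo-style pipeline tracks one entry that flows through the EA-calc, DCache, and ALU stages in turn.

```python
# Cracking ADDI.L #imm,(disp,An) into RISC-style micro-ops, as an OoO core
# would: a load, the add itself, and a store back to the same address.

def crack_addi_mem(imm, disp, areg):
    """ADDI.L #imm,(disp,areg) expressed as three micro-ops."""
    return [
        ("load32",  "tmp", areg, disp),        # tmp <- mem[areg + disp]
        ("add32",   "tmp", "tmp", imm),        # tmp <- tmp + imm
        ("store32", "tmp", areg, disp),        # mem[areg + disp] <- tmp
    ]

uops = crack_addi_mem(0x123456, 40, "A7")
assert len(uops) == 3                          # 3 tracked ops instead of 1
assert uops[1][0] == "add32"
```

Every instruction of this shape therefore occupies three scheduler/ROB slots on such a design, which is the tracking cost the post above refers to.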
---

Krystian Baclawski | Posts 5 | 29 May 2016 15:44
| Gunnar von Boehn wrote:
| A typical CISC instruction like this one: ADDI.L #$123456,(40,A7) can be tracked and executed as 1 operation on cores implementing their pipelines like Apollo, the 68060, or the Pentium. This is optimal. On OoO cores, the above single instruction would need to be split into 3 internal operations, so the number of tracked operations in flight, and the cost of tracking them, increases.
|
True. But with OoO you are very likely not to stall the next couple of instructions in case of a cache miss. Instruction tracking, thanks to the reorder buffer, doesn't seem to be much more unpleasant than instruction issue / forwarding / hazard detection in a superscalar pipeline. Am I wrong? Though I recognize the difficulty of implementing OoO, I perceive it as being better structured. Basically, it doesn't look like a bunch of workarounds thrown together. I hope at some point you'll be brave enough to reach out for the microarchitecture design that every modern high-performance CPU benefits from.
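A minimal sketch of the reorder buffer being argued for (illustrative only): entries are allocated in program order, may complete out of order (e.g. when a cache miss delays one instruction), but retire strictly in order so architectural state stays precise.

```python
from collections import deque

# Toy reorder buffer: in-order allocate, out-of-order complete,
# strictly in-order retire.

class ROB:
    def __init__(self):
        self.entries = deque()            # [tag, done] in program order

    def allocate(self, tag):
        self.entries.append([tag, False])

    def complete(self, tag):
        for e in self.entries:
            if e[0] == tag:
                e[1] = True

    def retire(self):
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired

rob = ROB()
for t in ("load", "add", "store"):
    rob.allocate(t)
rob.complete("add")                       # finishes early (load still missing)
assert rob.retire() == []                 # cannot retire past the stalled load
rob.complete("load")
assert rob.retire() == ["load", "add"]    # in-order retirement resumes
```

The younger "add" makes progress during the load's miss; the cost is that every in-flight operation needs a slot and completion tracking.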
---

Samuel Crow | Posts 424 | 29 May 2016 21:02
| Krystian Baclawski wrote:
|
Gunnar von Boehn wrote:
| A typical CISC instruction like this one: ADDI.L #$123456,(40,A7) can be tracked and executed as 1 operation on cores implementing their pipelines like Apollo, the 68060, or the Pentium. This is optimal. On OoO cores, the above single instruction would need to be split into 3 internal operations, so the number of tracked operations in flight, and the cost of tracking them, increases. |
True. But with OoO you are very likely not to stall the next couple of instructions in case of a cache miss. Instruction tracking, thanks to the reorder buffer, doesn't seem to be much more unpleasant than instruction issue / forwarding / hazard detection in a superscalar pipeline. Am I wrong? Though I recognize the difficulty of implementing OoO, I perceive it as being better structured. Basically, it doesn't look like a bunch of workarounds thrown together. I hope at some point you'll be brave enough to reach out for the microarchitecture design that every modern high-performance CPU benefits from.
|
On an accelerator card for AmigaOS 3.1 we also need not worry about the overhead of an MMU, so the instruction pipeline can be kept fairly short to minimize vulnerability to such stalls. Am I right?
---

John Heritage | Posts 112 | 28 Jun 2016 21:57
Gunnar - just curious: will the eventual FPU be similar -- 6 pipeline stages and up to 4 instructions per clock peak? Also, if you care to share, what are the cache sizes? Just curious :) Thanks!
---

Gunnar von Boehn (Apollo Team Member) | Posts 6254 | 03 Jul 2016 11:18
| John Heritage wrote:
| Gunnar - just curious: will the eventual FPU be similar -- 6 pipeline stages and up to 4 instructions per clock peak? Also, if you care to share, what are the cache sizes?
|
Apollo 68080 FPU in a nutshell:

* Fully pipelined
* Operand source can be: register / #immediate / memory
* Free type conversion per operation
* Opcode compatible with the 68K FPU
* 32 internal FPU registers: 8 registers accessible via the old instructions, 32 registers via the new encodings
* Supports 2- and 3-operand operations
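"Fully pipelined" is the headline item here: a pipelined unit can start a new independent FP operation every cycle, so throughput approaches one result per clock regardless of the operation's latency. A back-of-the-envelope sketch (the latency number is invented for illustration, not an Apollo figure):

```python
# Cycle counts for n independent FP operations through a unit with a given
# latency: a fully pipelined unit overlaps operations; a non-pipelined unit
# must finish each one before starting the next.

def cycles_pipelined(n, latency):
    return latency + (n - 1)        # fill the pipe once, then 1 result/cycle

def cycles_unpipelined(n, latency):
    return n * latency              # strictly serial execution

# e.g. 1000 independent multiplies with a hypothetical 4-cycle latency:
assert cycles_pipelined(1000, 4) == 1003
assert cycles_unpipelined(1000, 4) == 4000
```

This roughly 4x gap (at the assumed latency) is why pipelining alone, before any clock-speed gains, is such a large win over a non-pipelined FPU.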
---

John Heritage | Posts 112 | 03 Jul 2016 21:56
This should be a massive improvement over the 68060 FPU then. Being fully pipelined, plus the higher clock speeds, will alone make a big difference. Sounds great!
---

Thierry Atheist | Posts 644 | 04 Jul 2016 02:59
CPUs can have a data cache and an instruction cache of, for example, 4 kilobytes, 8 kilobytes, etc. Does this also work for FPUs, can it be done, or is that where a SIMD unit comes into play?
---

Gunnar von Boehn (Apollo Team Member) | Posts 6254 | 04 Jul 2016 05:52
| Thierry Atheist wrote:
| CPUs can have a data cache and an instruction cache of, for example, 4 kilobytes, 8 kilobytes, etc. Does this also work for FPUs, can it be done, or is that where a SIMD unit comes into play?
|
Apollo/68080 of course also has independent caches: separate instruction and data caches, which provide a number of improvements over the caches of other 68k CPUs.
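A toy model tying together the cache details given earlier in the thread (random replacement, associativity fixed at build time). All sizes below are placeholders, not the real 68080 parameters:

```python
import random

# Toy set-associative cache with random replacement. Associativity (1 or 2
# ways) is chosen at construction time, mirroring the "selectable at core
# build" behavior described above. Line/set sizes here are invented.

class Cache:
    def __init__(self, size=4096, line=16, ways=2, seed=0):
        self.line, self.ways = line, ways
        self.sets = size // (line * ways)
        self.tags = [[None] * ways for _ in range(self.sets)]
        self.rng = random.Random(seed)

    def access(self, addr):
        index = (addr // self.line) % self.sets
        tag = addr // (self.line * self.sets)
        if tag in self.tags[index]:
            return True                         # hit
        victim = self.rng.randrange(self.ways)  # random replacement
        self.tags[index][victim] = tag
        return False                            # miss; line is filled

c = Cache()
assert c.access(0x1000) is False   # cold miss
assert c.access(0x1004) is True    # same 16-byte line: hit
```

Random replacement needs no per-set history bits at all, which is part of its appeal in FPGA implementations compared with true or pseudo LRU.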