

AMMX Question

Gunnar von Boehn
(Apollo Team Member)
Posts 4542
03 Nov 2016 23:16


The other day someone was wondering how useful AMMX is.

Maybe it will help to give some examples.

AMMX provides 64-bit operations!

And AMMX does not do just one operation per instruction: it can do many in parallel.

Let's look at an easy-to-understand example.
Say you want to create virtual audio channels, like OctaMED does.

To do that you need to merge two 8-bit samples.
Typical 68k code could look like this (assuming unsigned samples, and that the upper bytes of D0 and D1 were cleared before the loop):
A0 = pointer to sample 1
A1 = pointer to sample 2
A2 = pointer to mixed result


  move.b (a0)+,D0    -- load 1st byte from Sample 1
  move.b (a1)+,D1    -- load 1st byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

As you can see, the normal 68k code uses 6 instructions to mix 1 byte of audio.
This means that to mix 8 bytes we need 48 instructions:


  move.b (a0)+,D0    -- load 1st byte from Sample 1
  move.b (a1)+,D1    -- load 1st byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 2nd byte from Sample 1
  move.b (a1)+,D1    -- load 2nd byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 3rd byte from Sample 1
  move.b (a1)+,D1    -- load 3rd byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 4th byte from Sample 1
  move.b (a1)+,D1    -- load 4th byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 5th byte from Sample 1
  move.b (a1)+,D1    -- load 5th byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 6th byte from Sample 1
  move.b (a1)+,D1    -- load 6th byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 7th byte from Sample 1
  move.b (a1)+,D1    -- load 7th byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

  move.b (a0)+,D0    -- load 8th byte from Sample 1
  move.b (a1)+,D1    -- load 8th byte from Sample 2
  add.w  D1,D0      -- add them together using WORD
  addq.w #1,D0      -- Add 1 for rounding!
  lsr.w  #1,D0      -- Shift right to have the Averaged result
  move.b D0,(A2)+    -- Save result

That is a lot of instructions!
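
For readers who do not read 68k assembly, the six-instruction sequence can be modelled in a few lines of Python. This is only an illustrative sketch: the function names `mix_byte` and `mix_scalar` are made up here, and the model assumes unsigned 8-bit samples.

```python
def mix_byte(s1: int, s2: int) -> int:
    """Rounded average of two unsigned 8-bit samples, as in the 68k sequence."""
    total = s1 + s2              # add.w  D1,D0  (the 9-bit sum fits in a word)
    total += 1                   # addq.w #1,D0  (rounding)
    return (total >> 1) & 0xFF   # lsr.w  #1,D0, then move.b stores the low byte


def mix_scalar(sample1: bytes, sample2: bytes) -> bytes:
    """Mix two equally long sample buffers one byte at a time."""
    return bytes(mix_byte(a, b) for a, b in zip(sample1, sample2))
```

Each output byte costs one pass through `mix_byte`, which is the same one-byte-at-a-time pattern as the unrolled listing.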

You wonder how AMMX can speed this up?


  LOAD  (A0)+,B0
  PAVGN  (A1)+,B0,B0
  STORE  B0,(A2)+

As you can see, the code is MUCH shorter.
And it is also much faster.
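
Conceptually, the three AMMX instructions treat a 64-bit register as eight independent byte lanes. The Python sketch below models that idea. It assumes PAVGN computes the same unsigned rounded average per byte as the 68k code in this post (inferred from the post, not from an instruction reference), and the function name `pavg_bytes` is hypothetical.

```python
def pavg_bytes(a: int, b: int) -> int:
    """Model a packed average: eight unsigned byte lanes in one 64-bit value."""
    result = 0
    for lane in range(8):
        shift = lane * 8
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Rounded average per lane; nothing carries over into the next lane.
        result |= ((x + y + 1) >> 1) << shift
    return result


# One "LOAD / PAVGN / STORE" step on 8 bytes of each sample.
s1 = int.from_bytes(bytes([0, 10, 20, 30, 40, 50, 60, 255]), "big")
s2 = int.from_bytes(bytes([1, 20, 20, 31, 0, 50, 61, 255]), "big")
mixed = pavg_bytes(s1, s2).to_bytes(8, "big")
print(list(mixed))   # [1, 15, 20, 31, 20, 50, 61, 255]
```

The Python loop over lanes is only there for the model; the point of the hardware instruction is that all eight lanes are handled at once.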

The normal 68k code needs 48 instructions.
A 68030 would need over 100 clock cycles for this code.
A 68040 would need around 48 cycles.
And a 68060 would, depending on how you write the code, need between 48 and, in the best case, 24 cycles.

Written in AMMX, the 68080 needs only 3 cycles for it.

As you can see, the speedup is dramatic:
8 times faster than the fastest 68060 code.




Philippe Flype
(Apollo Team Member)
Posts 277
04 Nov 2016 10:12


Hi Gunnar, thank you for the explanation. It helps; at least it helps me by giving some new hints.

One question, if I may ask, about "8 times faster than fastest 68060 code": one could then ask why we only reach a factor of 2 with RiVA.

Can you explain this, please?


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
04 Nov 2016 19:34


Philippe Flype wrote:

One question, if I may ask, about "8 times faster than fastest 68060 code": one could then ask why we only reach a factor of 2 with RiVA.

The idea behind MMX or AMMX is not to rewrite whole programs completely.
This would be a huge amount of work and would take too much time.
RiVA, for example, is a big program with thousands of lines of code.

Instead of rewriting thousands of lines of code, the most effective solution is to pick the main work loop, which consumes maybe 40% of the CPU time.

Then you focus on rewriting this function, which is maybe only 50 instructions long, in AMMX.
If you, for example, speed up this work loop by a factor of 4 with AMMX,
then this will increase the speed of the whole program by roughly 40 to 50%.

If you use AMMX like this,
the amount of work is small, but you still get a very nice benefit.
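
The estimate follows from Amdahl's law. Here is a small sketch using the figures from this post (40% of the run time in the work loop, made 4 times faster); the function name `overall_speedup` is made up for illustration.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: speedup of the whole program when only a part is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


print(round(overall_speedup(0.40, 4.0), 2))   # 1.43: work loop made 4x faster
print(round(overall_speedup(0.40, 1e9), 2))   # 1.67: limit if the loop cost almost nothing
```

With these numbers the whole program runs about 1.43 times as fast. Even an infinitely fast work loop could not push it beyond about 1.67 times, because the other 60% of the run time is unchanged. That is why a large speedup of one routine shows up as a much smaller factor for the program as a whole.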


John William

Posts 460
04 Nov 2016 20:24


Gunnar von Boehn wrote:

  Philippe Flype wrote:

    One question, if I may ask, about "8 times faster than fastest 68060 code": one could then ask why we only reach a factor of 2 with RiVA.

  The idea behind MMX or AMMX is not to rewrite whole programs completely.
  This would be a huge amount of work and would take too much time.
  RiVA, for example, is a big program with thousands of lines of code.

  Instead of rewriting thousands of lines of code, the most effective solution is to pick the main work loop, which consumes maybe 40% of the CPU time.

  Then you focus on rewriting this function, which is maybe only 50 instructions long, in AMMX.
  If you, for example, speed up this work loop by a factor of 4 with AMMX,
  then this will increase the speed of the whole program by roughly 40 to 50%.

  If you use AMMX like this,
  the amount of work is small, but you still get a very nice benefit.

But what if you rewrote the entire program using AMMX? Wouldn't then the benefit be 100%?



Thierry Atheist

Posts 618
04 Nov 2016 22:22


DSP compared to AMMX.

Does an AMMX instruction set do what a dedicated DSP circuit set can do, or are they somewhat to completely different?

Would it be of any value to add DSP capability to the Apollo Core, or would it just clutter things up with a dubious increase in productivity?


Thierry Atheist

Posts 618
04 Nov 2016 22:27


John William wrote:
But what if you rewrote the entire program using AMMX? Wouldn't then the benefit be 100%?

The immediate benefit of the small rewrite would show right away. As you worked on the other segments, slight increases would be shown over time.

I think that someone may be willing to, over the course of a year, work on the lesser segments to really make this new AMIGA shine. (Maybe even add some 64 bit sections.)


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
05 Nov 2016 07:28


Thierry Atheist wrote:

  DSP compared to AMMX.
 
  Does an AMMX instruction set do what a dedicated DSP circuit set can do, or are they somewhat to completely different?

 
A DSP is typically optimized for data manipulation and has some special features for this.

a) DSPs often have 2 memory buses and can therefore manipulate memory very efficiently.
Apollo has 2 memory buses.

b) DSPs often have more registers or some internal scratchpad to allow calculations on variables without memory access.
Apollo has many registers and big caches.

c) DSPs often have special instructions that allow branch-free calculation of some arithmetic operations, for example ABSOLUTE.
Apollo has special instructions for this.

d) DSPs are often tuned to do MULTIPLICATION very fast.
Apollo can do 4 multiplications per cycle.

For reference, the 68040 needs 16 cycles for 1 MUL.
This means the Apollo 68080 is 64 times more efficient at multiplication, per clock cycle, than the 68040.

As you can see, the 68080 provides all the features which make a DSP strong. :-)
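
The factor of 64 is simply the ratio of the two per-cycle throughputs quoted in this post. A tiny sketch (the variable names are illustrative only):

```python
from fractions import Fraction

apollo_mul_per_cycle = Fraction(4, 1)    # "4 multiplications per cycle" (from the post)
m68040_mul_per_cycle = Fraction(1, 16)   # "16 cycles for 1 MUL" (from the post)

ratio = apollo_mul_per_cycle / m68040_mul_per_cycle
print(ratio)   # 64
```

This compares work per clock cycle only; it says nothing about clock speeds.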


OneSTone O2o

Posts 159
06 Nov 2016 12:27


Your example for traditional 68k code is only the inner loop, right? Otherwise it fails to increment A0, A1, D0 (and to check for the end of the sample). Does AMMX also have an optimisation to increment several addresses at once by adding 1?


Thierry Atheist

Posts 618
06 Nov 2016 16:10


Hi Gunnar,

Thank you for that information....

This is truly a fantastic and absolutely worthwhile project.

Who cares about the other stuff out there, it will always be available. However, this is valuable in every other way, except for raw speed.


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
06 Nov 2016 17:04


oneSTone o2o wrote:

Your example for traditional 68k code is only the inner loop, right?

Yes, this was only the code inside the loop.

oneSTone o2o wrote:

Does AMMX also have an optimisation to increment several addresses at once by adding 1?

The (An)+ and -(An) EA modes are of course supported too.
With AMMX they step the pointer by 8 bytes.
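
In other words, each pass of an AMMX loop consumes 8 bytes per buffer, and the (An)+ mode moves each pointer on by 8. A rough Python model of such a loop follows. The name `mix_in_chunks` is hypothetical, and the buffers are assumed to be equal in length and a multiple of 8 bytes.

```python
def mix_in_chunks(sample1: bytes, sample2: bytes) -> bytes:
    """Mix two buffers 8 bytes per iteration, mimicking an (An)+ stride of 8."""
    assert len(sample1) == len(sample2) and len(sample1) % 8 == 0
    out = bytearray()
    offset = 0                                   # plays the role of A0/A1/A2
    while offset < len(sample1):
        chunk1 = sample1[offset:offset + 8]      # LOAD  (A0)+
        chunk2 = sample2[offset:offset + 8]      # PAVGN (A1)+
        out += bytes((a + b + 1) >> 1 for a, b in zip(chunk1, chunk2))  # STORE (A2)+
        offset += 8                              # every pointer steps by 8 bytes
    return bytes(out)
```

A real loop would also need a counter or end-of-buffer check, which this post leaves out.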


OneSTone O2o

Posts 159
06 Nov 2016 18:19


And three pointers at once +1 ?


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
06 Nov 2016 18:52


oneSTone o2o wrote:

And three pointers at once +1 ?

Sorry I'm not sure I understand the question.
Or the use case..


Markus (mfro)

Posts 91
14 Nov 2016 15:36


Gunnar von Boehn wrote:

  You wonder how AMMX can speed this up?

    LOAD  (A0)+,B0
    PAVGN  (A1)+,B0,B0
    STORE  B0,(A2)+

Apparently, there is an additional set of "B" registers?

How do you handle them on context switches?


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
15 Nov 2016 05:29


Markus (mfro) wrote:

  Apparently, there is an additional set of "B" registers?
 
  How do you handle them on context switches?

Hi Markus,

Apollo provides both:
the full 68040 instruction set with all 680x0 EA modes,

and also upgrades such as
- new special-purpose registers (including a rich set of performance counters)
- 64-bit wide registers (all Dn registers are 64 bits wide)
- extra general-purpose registers.

It is logical that if the OS does not save the 64-bit registers on a context switch, then your applications cannot cleanly use them.
But saving them is no problem, and you can also identify tasks running in 32-bit mode and 64-bit mode. So the OS can decide to save only the smaller register set for old applications.


Markus (mfro)

Posts 91
15 Nov 2016 09:38


Hi Gunnar,
Gunnar von Boehn wrote:
 
... It is logical that if the OS does not save the 64-bit registers on a context switch, then your applications cannot cleanly use them...

So you need a modified OS if you allow applications to use AMMX (important to know for the ST, as it has non-preemptive _and_ preemptive multitasking at the same time).

Gunnar von Boehn wrote:
... you can also identify tasks running in 32-bit mode and 64-bit mode...

Out of curiosity: how would you do that?


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
15 Nov 2016 21:54


Markus (mfro) wrote:

  So you need a modified OS if you allow applications to use AMMX (important to know for the ST, as it has non-preemptive _and_ preemptive multitasking at the same time).

Well, "yes" and "no".
If only one application uses the new registers, there is obviously no need to save them.
This means that if you play a video with a video player, for example,
it will work just fine, as long as you only play one video at a time.
Of course the clean way is to add support for this to the OS,
which typically means pushing these registers to the stack.

BTW, AMMX can also work on Dn registers.
We used the extra registers in this example, but you can also use the normal Dn.

Markus (mfro) wrote:

  Gunnar von Boehn wrote:
  ... you can also identify tasks running in 32-bit mode and 64-bit mode...

  Out of curiosity: how would you do that?

Bit 11 in SR turns on Apollo mode.
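
A scheduler could test that bit when deciding how much state to save. The Python sketch below only illustrates the bit test. The constant names, the function names and the two register-set labels are invented for this example and are not taken from any Apollo documentation.

```python
APOLLO_MODE_BIT = 11                     # SR bit named in this post
APOLLO_MODE_MASK = 1 << APOLLO_MODE_BIT  # 0x0800


def uses_apollo_mode(sr: int) -> bool:
    """Return True if bit 11 of a saved 16-bit status register value is set."""
    return (sr & APOLLO_MODE_MASK) != 0


def registers_to_save(sr: int) -> str:
    """Pick a (hypothetical) register-set label for a context switch."""
    return "extended 64-bit set" if uses_apollo_mode(sr) else "classic 32-bit set"


print(registers_to_save(0x2000))   # supervisor bit only -> classic 32-bit set
print(registers_to_save(0x0800))   # bit 11 set          -> extended 64-bit set
```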



Markus (mfro)

Posts 91
16 Nov 2016 07:24


Thank you, Gunnar.

Bit 11 in SR turns on Apollo-mode.

That means if you leave that bit untouched, you basically have a (fast) 68060 that would cough on your instruction set extensions and take the illegal instruction trap instead? Smart.

Might need some investigation on what existing legacy software does with this (Motorola reserved) bit, however...


Chain Q

Posts 19
05 Dec 2016 20:23


As you can see, the normal 68k code uses 6 instructions to mix 1 byte of audio. This means that to mix 8 bytes we need 48 instructions.

Just for the record, the fun, and the love of 68k, I can make your example 36 instructions (for 8 samples, w/o any Apollo opcodes). Can you? ;-)


Gunnar von Boehn
(Apollo Team Member)
Posts 4542
06 Dec 2016 05:16


Chain Q wrote:

 
  As you can see, the normal 68k code uses 6 instructions to mix 1 byte of audio. This means that to mix 8 bytes we need 48 instructions.

  Just for the record, the fun, and the love of 68k, I can make your example 36 instructions (for 8 samples, w/o any Apollo opcodes). Can you? ;-)

:-)
Of course you can do a poor man's SIMD using 32-bit ADDs and masking with ANDs. We also posted such a code example long ago.

The fake SIMD code that we wrote was a lot harder to read than the naive solution above. Also, if you consider that the 68060, for example, has the limitation that it can fetch at most 4 bytes of instructions per cycle from the Icache, then the more complex code, using masks and needing more registers to be saved, was still much slower than the AMMX solution.

You are welcome to post your code as an example, along with its cycle count on the 68060, for education and comparison.




Henryk Richter
(Apollo Team Member)
Posts 115
06 Dec 2016 06:26


Chain Q wrote:

 
  As you can see, the normal 68k code uses 6 instructions to mix 1 byte of audio. This means that to mix 8 bytes we need 48 instructions.

  Just for the record, the fun, and the love of 68k, I can make your example 36 instructions (for 8 samples, w/o any Apollo opcodes). Can you? ;-)

Sure, one can do it with 4.5 instructions per sample, but why would one want to?

  MOT_8x1_HALFVER macro
          ; d6 and d0 hold byte masks set up outside the macro
          ; (presumably $FEFEFEFE and $01010101; not shown in the original post)
          move.l  (a1),d1         ; P00 P01 P02 P03
          move.l  (a3),d2         ; P10 P11 P12 P13
          move.l  d1,d3           ; P00 P01 P02 P03
          or.l    d2,d3           ; P00|P10 P01|P11 P02|P12 P03|P13
                                  ;-> meaning: we need to add "1" whenever any of the operands has its LSB set
          and.l   d6,d1           ; upper 7 bits P00 P01 P02 P03
          and.l   d6,d2           ; upper 7 bits P10 P11 P12 P13
          lsr.l   #1,d1           ; >>1
          lsr.l   #1,d2           ; >>1
          and.l   d0,d3           ; keep the 1
          add.l   d1,d2           ; (P00>>1)+(P10>>1) .. .. ..
          move.l  4(a1),d1        ; P04 P05 P06 P07
          add.l   d3,d2           ; (P00+P10+1)>>1 .. .. ..
          move.l  4(a3),d7        ; P14 P15 P16 P17
          move.l  d2,(a2)+        ; store
          move.l  d1,d3           ; P04 P05 P06 P07
          or.l    d7,d3           ; P04|P14 P05|P15 P06|P16 P07|P17
                                  ;-> meaning: we need to add "1" whenever any of the operands has its LSB set
          and.l   d6,d1           ; upper 7 bits P04 P05 P06 P07
          and.l   d6,d7           ; upper 7 bits P14 P15 P16 P17
          lsr.l   #1,d1           ; >>1
          lsr.l   #1,d7           ; >>1
          and.l   d0,d3           ; keep the 1
          add.l   d1,d7           ; (P04>>1)+(P14>>1) .. .. ..
          add.l   d3,d7           ; (P04+P14+1)>>1 .. .. ..
          move.l  d7,(a2)+        ; store
          endm

I prefer it with fewer instructions (24 in total = 3 per sample). The posted example needs 14 cycles, assuming the data is in cache. Compared to 3 cycles for the same functionality in AMMX, I'm happy that I can use the latter (but not for audio mixing).
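
To see why the mask trick in the macro gives the same result as the byte-by-byte version, here is a small Python model of one 32-bit half of it, checked against a plain per-byte reference. The mask values are assumptions (the contents of d6 and d0 are not shown in the post), and the function names are made up for this sketch.

```python
HI7 = 0xFEFEFEFE   # keeps the upper 7 bits of each byte (assumed content of d6)
LSB = 0x01010101   # keeps bit 0 of each byte (assumed content of d0)


def swar_avg32(a: int, b: int) -> int:
    """Rounded byte-wise average of two 32-bit words without cross-byte carries."""
    round_bits = (a | b) & LSB   # or.l, then and.l d0: add 1 if either LSB is set
    half_a = (a & HI7) >> 1      # and.l d6, then lsr.l #1
    half_b = (b & HI7) >> 1
    return (half_a + half_b + round_bits) & 0xFFFFFFFF


def bytewise_avg32(a: int, b: int) -> int:
    """Reference: (x + y + 1) >> 1 on each of the four bytes."""
    xs, ys = a.to_bytes(4, "big"), b.to_bytes(4, "big")
    return int.from_bytes(bytes((x + y + 1) >> 1 for x, y in zip(xs, ys)), "big")


for a, b in [(0x00FF10FE, 0x01FF11FF), (0x80808080, 0x7F7F7F7F), (0xFFFFFFFF, 0)]:
    assert swar_avg32(a, b) == bytewise_avg32(a, b)
```

Halving each byte before the add keeps every partial sum at 255 or below, so no carry can spill into the neighbouring byte. The OR of the two low bits restores the rounding that the halving dropped.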
