

Optimal 68K Memcopy Routine

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
25 Jun 2017 23:02


Today we looked into the 68K AROS routines.
We were asked to look at the memcopy routines and help improve them.
 
I had a bet running that Flype would easily write a much shorter and much faster routine. And indeed, Flype quickly came up with this routine. I think we can call it an optimal memcopy.
 
A0 = SRC (source address)
A1 = DST (destination address)
D0 = LEN (length in bytes)
 

LOOP:
      LOAD  (A0)+,D1      ; load 8 bytes
      STOREC D0,D1,(A1)+    ; store 8 bytes but never more than count in D0
      SUBQ.L  #8,D0
      BHI.B  LOOP
   
      RTS
 

 
 
As you can see, this routine is very short and easy to read.
And it performs well - allowing the Apollo core
to reach over 600 MB/sec memcopy on the standalone.
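 
For completeness, a minimal calling sketch following the A0/A1/D0 convention above (the buffer labels and the MemCopy entry name are illustrative, not from the AROS code):
 
      LEA     SrcBuf,A0       ; A0 = source address
      LEA     DstBuf,A1       ; A1 = destination address
      MOVE.L  #BufLen,D0      ; D0 = length in bytes
      BSR     MemCopy         ; runs the copy loop shown above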
 
Well done, Flype!
 
It's satisfying to see that the APOLLO 68080 not only outclasses all other 68K systems but also easily beats a GigaHertz PowerPC system by a factor of two!


Wawa T

Posts 695
25 Jun 2017 23:17


I don't know 68K mnemonics, so I cannot judge whether this is general 68K asm, but I assume it's Apollo-specific. You obviously know that AROS recognizes Apollo among other 68K variants and enables appropriate optimizations, so you could join the dev team, gain write access to the repo and commit this if you are certain.
 
  Edit: errm, obviously if this is Apollo-specific, GCC won't know what to do with it... :( It probably demands a patch to the compiler first.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
25 Jun 2017 23:26


wawa t wrote:

  Edit: errm, obviously if this is Apollo-specific, GCC won't know what to do with it... :( It probably demands a patch to the compiler first.

 
Patching the GNU assembler is easy.
We added support for the APOLLO instructions to it already.
 
But you do not even need this.
For the new instructions you can simply write the opcode words in hex with dc.w -
this is very easy.
Lemme do this for you real quick.
 

LOOP:
          dc.w    $FE18,$0101    ;      LOAD    (A0)+,D1
          dc.w    $FE19,$101f    ;      STOREC  D0,D1,(A1)+
          SUBQ.L  #8,D0
          BHI    LOOP
          RTS
 

See, one can write this in seconds. :)
Now you can use it!


Ian Parsons

Posts 230
26 Jun 2017 01:11


Using machine code in source isn't very nice though. Hopefully all the assemblers that are still maintained will soon support the new 68080 mnemonics.

It would be a shame if some of the popular closed-source assemblers/disassemblers/monitors/debuggers couldn't be updated.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jun 2017 04:57


Ian Parsons wrote:

Using machine code in source isn't very nice though. Hopefully all the assemblers that are still maintained will soon support the new 68080 mnemonics.

 
We have defined "macros" for assemblers like DEVPAC.
You can include them - then the assembler can use the instructions with mnemonics.
I think the support in the tools will come and is coming now. :-)
VASM has already added support for APOLLO's new instructions,
and the new C compiler is also adding support for APOLLO features.
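 
For illustration, a minimal sketch of what such an include could look like, assuming DEVPAC-style MACRO/ENDM syntax and reusing the dc.w encodings posted above (the macro names are my own):
 
LOADQ_A0P_D1    MACRO
                dc.w    $FE18,$0101    ; LOAD    (A0)+,D1
                ENDM
 
STOREC_D0_D1    MACRO
                dc.w    $FE19,$101f    ; STOREC  D0,D1,(A1)+
                ENDM
 
With an include like this, the loop from the first post can keep readable names until native mnemonic support is everywhere.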
 
 
I have to say I like this routine very much.
It's short, very readable and very fast.
 
These 4 lines are not only much shorter than the existing 200-line routine in AROS, they are also much faster.
 
The result is awesome for a 68K machine.
APOLLO is not only over 10 times faster than the fastest 68060,
APOLLO is also 3 times faster than an AMIGA-ONE PowerPC G4 system.
 


Henryk Richter
(Apollo Team Member)
Posts 128/ 1
26 Jun 2017 07:18


This memcopy approach requires the upcoming Gold3 version of the Apollo Core. VASM is already able to handle these new instructions: EXTERNAL LINK


Dusko Kovacic

Posts 21
26 Jun 2017 07:37


Gunnar von Boehn wrote:

  These 4 lines are not only much shorter than the existing 200-line routine in AROS, they are also much faster.
 
  The result is awesome for a 68K machine.
  APOLLO is not only over 10 times faster than the fastest 68060,
  APOLLO is also 3 times faster than an AMIGA-ONE PowerPC G4 system.
 

Hi Gunnar,

does it mean that with it AROS will now run fluidly on the
Vampire? Or does it need proper GPU acceleration first?

thnx




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jun 2017 08:10


Dusko Kovacic wrote:

Hi Gunnar,
 
does it mean that with it AROS will now run fluidly on the
Vampire? Or does it need proper GPU acceleration first?
 
thnx

While the new memcopy is better than the current one in AROS,
the Apollo 68080 was also very fast with the old one - faster than any other AMIGA.

The bottleneck of AROS is not memcopy but the not yet existing GFX subroutines.
But this is being worked on as well - once it is done, AROS should become much faster.



Wawa T

Posts 695
26 Jun 2017 09:15


Gunnar von Boehn wrote:

  Patching the GNU assembler is easy.
  We added support for the APOLLO instructions to it already
 

  The thing is, especially with AROS, to keep it in a form that may get accepted upstream some day, rather than locally hacking the support in. In this case I wonder whether the Apollo core has a chance of one day being acknowledged as a GCC sub-platform. This is not a problem with VASM, as it is essentially a one-person project with some Amiga-affine contributors.
 
  So, for the time being, perhaps writing the hex words in asm is the cleanest solution; it just needs to be kept commented, just in case.
 
 


E Penguin

Posts 46
26 Jun 2017 09:18


Vampire + AROS is a very exciting combination. Much more so than PPC + AmigaOS 4.x.


Marcus Sackrow

Posts 37
26 Jun 2017 19:23



 

  LOOP:
      LOAD  (A0)+,D1      ; load 8 bytes
      STOREC D0,D1,(A1)+    ; store 8 bytes but never more than count in D0
      SUBQ.L  #8,D0
      BHI.B  LOOP
   
      RTS
 

   
  How is this possible? 8 bytes to D1, which can only store 4 bytes? What happens to the other 4 bytes?
Shouldn't there be a ".Q" to mark that it copies 8 bytes?

If you internally use a 64-bit register... what does a
PUSH D1 do? Push 4 or 8 bytes onto the stack?
If it pushes 4 bytes (for compatibility it has to do that),
then what happens if you do:


LOAD  (A0)+,D1
PUSH  D1
LOAD  (A1),D1
POP    D1
STOREC #8,D1,(A1)+

Why can't you write directly:


LOOP:
      STOREC D0,(A0)+,(A1)+
      SUBQ.L  #8,D0
      BHI.B  LOOP
      RTS

It would fulfill the very nice basic Motorola principle of an "orthogonal instruction set" that every 68k had and every future 68k should have.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jun 2017 19:35


Marcus Sackrow wrote:

  If you internally use a 64-bit register... what does a
  PUSH D1 do? Push 4 or 8 bytes onto the stack?

Actually, there is no PUSH mnemonic on 68K.
Are you mixing this up with iNTEL?
 
 


Marcus Sackrow

Posts 37
26 Jun 2017 19:42


Gunnar von Boehn wrote:

 
Marcus Sackrow wrote:

    If you internally use a 64-bit register... what does a
    PUSH D1 do? Push 4 or 8 bytes onto the stack?
 

 
  This is 68k not iNTEL.
  There is no PUSH mnemonic on 68K.
 

 
  Ah yes... sorry, it has been a long time since I wrote 68k assembler.
 

  move.l D1, -(A7)
  ; or is there an:
  move.q D1, -(A7) ;??
 

 
  The question remains... is D1 64 bits wide?
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jun 2017 19:52


Marcus Sackrow wrote:

  ; or is there an:
  move.q D1, -(A7) ;??

 
Our mnemonic for
MOVE.Q
is LOAD/STORE depending on the direction
   
 
Marcus Sackrow wrote:

  The question remains... is D1 64 bits wide?

Yes, APOLLO is a 64-bit architecture.
All 128 registers are 64 bits wide.


Marcus Sackrow

Posts 37
26 Jun 2017 19:57


Gunnar von Boehn wrote:

  Our mnemonic for
  MOVE.Q
  is LOAD/STORE depending on the direction

Is there a special reason for that?
"move.q"
would be much more 68k-like than the RISC-borrowed LOAD/STORE mechanism... and if there is one thing I really like about 68k assembler (and really hate about x86/ARM), it's the orthogonal instruction set.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
26 Jun 2017 20:13


Marcus Sackrow wrote:

 
Gunnar von Boehn wrote:

    Our mnemonic for
    MOVE.Q
    is LOAD/STORE depending on the direction
 

 
Is there a special reason for that?

 
Yes, of course - there are several good technical reasons.
 
 
Please study the encoding scheme of the 68K.
You will see that MOVE (ea),(ea) needs a huge amount of encoding space. It consumes 3/16 of the total encoding space.
This is a lot.
 
Performance-wise, MOVE (ea),(ea) has no advantage over 2 ops.
Both need the same time to execute.
So it only makes sense because it sometimes saves some instruction length.
 
 
STOREC can be used in many combinations.
Mem-to-mem is, in real life, a rare case.
Defining it as a 2-memory-EA encoding plus a register count would have increased the length of this instruction.
And it therefore would have increased the instruction length for ALL its use cases -
also for the Reg,EA use cases, which would otherwise be one word shorter.
 
This means that, for the average program's instruction size,
it is more efficient to encode this as 2 instructions.
As both encodings would take the same number of cycles, choosing the shorter Reg-mem encoding has only advantages.
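 
To make that concrete with plain classic 68K mnemonics (my own sketch, not Apollo-specific code):
 
      MOVE.L  (A0)+,(A1)+     ; one memory-to-memory move instruction
 
      MOVE.L  (A0)+,D1        ; the same copy staged through a register,
      MOVE.L  D1,(A1)+        ; using one scratch register instead
 
Per the point above, both forms take the same time to execute; the mem-to-mem form only wins a little on code size where it applies.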
 
If you as a programmer prefer to "type" just one instruction,
then simply define a "macro" for it.
Then you have everything you want. :-)
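 
A minimal sketch of such a macro, assuming DEVPAC-style \1/\2/\3 parameters (the STOREC_MM name and the choice of D1 as scratch are my own):
 
STOREC_MM  MACRO                     ; \1 = count reg, \2 = source Areg, \3 = dest Areg
           LOAD    (\2)+,D1          ; stage 8 bytes in a scratch register
           STOREC  \1,D1,(\3)+       ; store, clamped to the remaining count in \1
           ENDM
 
           STOREC_MM  D0,A0,A1       ; reads like a single mem-to-mem instruction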
 


Nixus Minimax

Posts 416
26 Jun 2017 20:49


Marcus Sackrow wrote:
  if there is one thing I really like about 68k assembler (and really hate about x86/ARM), it's the orthogonal instruction set.
 

 
  You hate ARM for being less orthogonal than 68k? IMHO ARM is far more orthogonal than 68k! Just look at the various oddities in 68k, like EOR, or instructions that allow An as a source register versus those that don't. Well, way off-topic here I guess...
 
  Regarding a double-mem MOVE: that may seem like a good thing as it saves a scratch register, but you could put other stuff between the load part and the store part, which can execute happily while waiting for the RAM. You can't do that as easily when putting both moves into a single instruction.
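 
  A small sketch of that scheduling freedom in plain 68K (illustrative only; the interleaved instruction is arbitrary):
 
      MOVE.L  (A0)+,D1        ; load part - may still be waiting on RAM
      ADDQ.L  #1,D2           ; unrelated work overlaps the memory latency
      MOVE.L  D1,(A1)+        ; store part uses the data once it has arrived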


Marcus Sackrow

Posts 37
26 Jun 2017 21:00


Thanks for the explanation.
So in principle 68k here moves away from its CISC principle towards the more modern RISC approach - in principle what Motorola did by moving away from 68k.

The "MOVE (ea),(ea)" - or STOREC Mem,Mem, i.e. (Ax),(Ay) - also has the big advantage that it saves a register ;) when programming (so it is rather difficult to replace with a macro). With the register count of a 68k that is maybe not as painful as on x86, but it is still a problem; worst case it means you first need to save some registers onto the stack, and away goes the big advantage.
Your copy routine shows that it's not THAT seldom. ;)

But I don't want to argue about it, I see/understand the point.

How are these 128 64-bit registers managed inside AmigaOS? I guess not at all - on a task switch only the 16 32-bit registers are saved and restored. So currently you can't use them in a normal, current AmigaOS? Am I right?

Here an optimized Vampire-enabled AROS could give a big performance boost when using the huge number of registers ;)


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jun 2017 05:20


Marcus Sackrow wrote:

Thanks for the explanation.
So in principle 68k here moves away from its CISC principle towards the more modern RISC approach

No. CISC is "defined" by having a machine which is able to use memory as a source in its instructions.
This means a machine which supports (EA).

How many operands your instruction encoding offers, and how many of those allow a memory EA, is implementation dependent.

There are several variants possible,
each variant having different advantages and disadvantages at the same time.

You get the smallest possible number of instructions
if your instructions take 3 operands and all 3 operands allow using a memory EA all the time.

A machine providing such an encoding was the VAX.
The VAX could do
ADD (mem1),(mem2),(mem3)

The VAX needed few instructions, but its drawback was that each of its instructions was huge in encoding.

 
MOTOROLA chose a different encoding approach.
MOTOROLA optimized the encoding of the 68K for a much smaller code size. Therefore MOTOROLA "researched" code options and chose an encoding which generally offers this form:

ADD  (mem),Reg or
ADD  Reg,(mem)

This design choice of MOTOROLA reduced the instruction length.
For the most common use case, when you work with registers, this greatly improves code density.
Of course, if you want to operate on (mem1),(mem2),(mem3), then with the MOTOROLA encoding you now need several instructions.
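 
For example, the VAX-style three-memory-operand ADD above becomes something like this with plain 68K encodings (illustrative sketch using absolute addresses):
 
      MOVE.L  mem1,D0         ; fetch the first operand into a register
      ADD.L   mem2,D0         ; add the second operand straight from memory
      MOVE.L  D0,mem3         ; write the result back to memory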

APOLLO 68080 is a real CISC machine.

A typical AMMX encoding looks like this:
OPP (mem),Reg,Reg

This means AMMX allows using memory as an EA and uses 3-operand encoding, which allows more work to be done with fewer instructions.



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Jun 2017 05:39


Marcus Sackrow wrote:

worst case it means you first need to save some registers onto the stack, and away goes the big advantage
Your copy routine shows that it's not THAT seldom. ;)

 
No, the memcopy routine does not need to save a register on the stack. The 68K programming ABI defines D0/D1/A0/A1 as scratch registers for routines. Using these 4 registers in a subroutine is defined by MOTOROLA as the way of doing it.
Therefore this memcopy does NOT need to save/restore any registers on the stack.
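 
As an illustrative sketch of the difference (label names are my own): a routine that stays inside the scratch registers needs no save/restore, while one that also touches a callee-saved register such as D2 must preserve it:
 
Copy:      LOAD    (A0)+,D1          ; uses only D0/D1/A0/A1 - nothing to preserve
           STOREC  D0,D1,(A1)+
           SUBQ.L  #8,D0
           BHI.B   Copy
           RTS
 
Other:     MOVE.L  D2,-(A7)          ; preserve callee-saved D2 on the stack
           MOVEQ   #0,D2             ; (placeholder work that clobbers D2)
           MOVE.L  (A7)+,D2          ; restore D2 before returning
           RTS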
 
 
Marcus, if you want to understand what AMMX is, then you should not focus on this micro memcopy example.

APOLLO can do a 64-bit MOVE (An)+,(Am)+ without needing a scratch register. But the instruction which we used in the memcopy was not a simple MOVE operation. We used a more complex instruction including an extra counter.
 
AMMX is not limited to doing simple memcopy.
AMMX is designed to process a lot of data.
Processing means doing many multiplications or many additions in a short time. In other words, doing real workloads.
 
Real workloads like decoding a JPEG, decoding an MP3 or decoding a MOVIE will not be done with a single (mem)-to-(mem) move instruction. Such algorithms always need many instructions and will use many work registers to hold temporary results.
 
Therefore AMMX is designed to do these jobs most efficiently.
 
