Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
Information about the Apollo CPU and FPU.

GCC Improvement for 68080page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 11:38


Lets talk about how C compiler could be improved for 68080

There are several areas where a C compiler could create faster code.

Lets look at them in detail:

a) instruction scheduling : not use 060-hacks
Lets say the coder wants to do stuff like this:
AND.L #$000000003,D0
AND.L #$000000030,D1

These 2 instructions are 12 byte long.
68060 has a weakness with long instructions.
A compiler might try to work around this 060 weakness and create this code:
MOVEQ #$03,D7
AND.L D7,D0
MOVEQ #$30,D7
AND.L D7,D1

The compiler used an extra TMP register now, and added 2 more instructions. This "workaround" for 060 is not needed for 68080.
The 12byte long code with 2 instruction was optimal already.
The 12byte code would execute on the 68080 in 1 cycle total.

b) instruction scheduling : use super scalarity
The 68080 can execute 2 instructions per cycle, tell the compiler to schedule the code to take advantage of this.

c) FPU instruction scheduling

The 68080 FPU is fully pipelined.
FADD and FMUL have a latency.
But each cycle a new FPU instruction can be executed (pipelined)

With good scheduled FPU code the 68080 can reach several times the FPU performance of an 68060.

GCC can create good code for this.
But we need to provide GCC the scheduling cost information file.
This seems to me like a very low hanging fruit.




Stefan "Bebbo" Franke

Posts 139
24 Jun 2019 11:55


Gunnar von Boehn wrote:

  Lets talk about how C compiler could be improved for 68080
 
  There are several areas where a C compiler could create faster code.
 
  Lets look at them in detail:
 
  a) instruction scheduling : not use 060-hacks
  Lets say the coder wants to do stuff like this:
  AND.L #$000000003,D0
  AND.L #$000000030,D1
  ...
 

 
  this is just work and can be done
 
 
Gunnar von Boehn wrote:
 
  b) instruction scheduling : use super scalarity
  The 68080 can execute 2 instructions per cycle, tell the compiler to schedule the code to take advantage of this.
 

 
  I already added some scheduling to gcc 6.5.0.b as post processing.
 
  aren't it 4 instructions per cycle?
Gunnar von Boehn wrote:
 
  c) FPU instruction scheduling
 
  The 68080 FPU is fully pipelined.
  FADD and FMUL have a latency.
  But each cycle a new FPU instruction can be executed (pipelined)
 
  With good scheduled FPU code the 68080 can reach several times the FPU performance of an 68060.
 
  GCC can create good code for this.
  But we need to provide GCC the scheduling cost information file.
  This seems to me like a very low hanging fruit.
 

 
  providing the scheduling information is a start^^
 
  And to utilize the new 68080 instructions, a working m68k-amigaos-as with 68080 support is mandatory.


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 12:38


Hallo Stefan,
 
Stefan "Bebbo" Franke wrote:

  I already added some scheduling to gcc 6.5.0.b as post processing.
  aren't it 4 instructions per cycle?

Up to 4 - yes but only under some circumstances.
 
68080 looks internally like this:
a) 3 operant machine with
b) TWO Ea units and TWO 3-operant ALUs, plus AMMX unit, plus FPU
 
 
Good scheduled code should be
  - super scalar for 2 ALUs.
  - will happily use longer instructions.
    (each ALU can eat 8byte instruction per cycle=16byte total)
  - make use of 3 operant features
    D2=D0+D1
    D5=D4+D3
 
    write as
    move.l D0,D2
    add.l  D1,D2
    move.l D3,D5
    add.l  D4,D5
 
Always bundle the MOVE.L with the next instruction. The core will be able to execute all 4 instructions together in 1 cycle.
 
 
FPU scheduling is relative easy.
The Core will keep track of dependancies.

FADD,FSUB,FCMP and FMUL have a latency of 6.
This means up to 6 FADD can be executed in parallel in the core.
FDIV has a latency of 10.
This means up to 10 FDIV can be executed in parallel in the core.
Please schedule instructions accordingly.
 
 
 
 
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 12:44


some 68080 tuning tips.
 
  A) 68080 can jump to any address without instruction fetch penalty.
  While aligning branch targets (labels) made sense on 68060.
  This makes no sense anymore on 68080.
 
  B) 68080 can do 64bit memory operations
  For example:
    MOVE.L (a0)+,(a1)+
    MOVE.L (a0)+,(a1)+
  Will be executed together as one 64bit MOVE.
 
  C) Misaligned memory access cost NOTHING extra for READs
 
  D) free Branch
  The code can automatically convert some branches to conditional code.
 
  small saturation example
  addq.b #1,D0
  bne.s nosat      -- overflow
  move.b #$FF,D0
  nosat
 
  The BNE and the MOVE.B will be internally converted to a Conditional MOVE. This will execute single cycle and never misspredict.
 
  This feature works only if there is a single instruction under the branch.
 

All the above tricks can be used and the code will stay compatible will older 68K CPUs too.
This means we can improve code performance without breaking compatibility.


Stefan "Bebbo" Franke

Posts 139
24 Jun 2019 13:08


Hallo Gunnar,

Gunnar von Boehn wrote:

Hallo Stefan,
 
 
Stefan "Bebbo" Franke wrote:

  I already added some scheduling to gcc 6.5.0.b as post processing.
  aren't it 4 instructions per cycle?
 

  Up to 4 - yes but only under some circumstances.
 
  68080 looks internally like this:
  a) 3 operant machine with
  b) TWO Ea units and TWO 3-operant ALUs, plus AMMX unit, plus FPU
 
   
  Good scheduled code should be
  - super scalar for 2 ALUs.
  - will happily use longer instructions.
    (each ALU can eat 8byte instruction per cycle=16byte total)
  - make use of 3 operant features
    D2=D0+D1
    D5=D4+D3
 
    write as
    move.l D0,D2
    add.l  D1,D2
    move.l D3,D5
    add.l  D4,D5
 
  Always bundle the MOVE.L with the next instruction. The core will be able to execute all 4 instructions together in 1 cycle.
 
 
  FPU scheduling is relative easy.
  The Core will keep track of dependancies.
 
  FADD,FSUB,FCMP and FMUL have a latency of 6.
  This means up to 6 FADD can be executed in parallel in the core.
  FDIV has a latency of 10.
  This means up to 10 FDIV can be executed in parallel in the core.
  Please schedule instructions accordingly.

a) so gcc could treat
    D2=D0+D1
    D5=D4+D3
as
    lea d0(d1),d2
    lea d4(d3),d5

which is resolved during output as move.l/add.l

isn't there an explicte new insn for this?

b) the use of moveq can be trained off

c) scheduling... I'll have a look.

Best approach: open a ticket for each topic at EXTERNAL LINK and pray that you won't get a "WON'T FIX" ^^


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 14:14


Stefan "Bebbo" Franke wrote:

  a) so gcc could treat
      D2=D0+D1
      D5=D4+D3
  as
      lea d0(d1),d2
      lea d4(d3),d5

 
ADD was just an example. The same works for other operations like SUB/AND/OR/EOR/ ...
But also for SHIFT
E.g
  B=A>>2
MOVE.l D0,D1
LSR #1,D1

Stefan "Bebbo" Franke wrote:

  which is resolved during output as move.l/add.l
  isn't there an explicte new insn for this?

68080 does support new instruction.
some simple ones like

ADDiW.L #$1234,D0
A 16bit immediate to LONG addition (4 byte encoding)

To special purpose ones like

MOVEX.L (ea),Dn
Move with free endian conversion

or
PERM #3467,D0:D1
A Byte Select instruction combining the content of 2 Registers new

New instruction of course break backward compatibility.
Therefore I first spoke about backward friendly options first.

One tuning note.
As you are certainly aware, Motorola did choose a vertical pipeline design for the 68060 CPU.
This means EA units are ABOVE ALU units in the pipeline design.
This design choice has 2 effects:
a) "free" EA calculation for instruction
b) ALU to EA bubble

Benefit:
MOVE.l 12(A0),D0    -- EA calculation and DCache access is free

Contra:
lsl.l  #1,D0
move.l (A0,D0),D2    -- Index usage
This code is will cause slowdown on 060.
As the Index register was used in the ALU to close before hand.
Motorola explains this very good in their manuals.
That compiler need to schedule such INDEX manipulation several instructions before using them.

Is GCC aware of this?



Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 15:03


Bebbo

here are some examples of FUSINGs:

MOVE.L (an)+,(am)+
MOVE.L (an)+,(am)+

MOVE.L -(an),-(am)
MOVE.L -(an),-(am)

MOVE.L (an)+,Dn
MOVE.L (an)+,Dm

MOVE.L Dn,(an)+
MOVE.L Dm,(an)+

MOVE.B (ea),Dn
extB.l Dn

MOVE.W (ea),Dn
ext.l  Dn

MOVE.L Dn,Dm
ADDi.L  #,Dm

MOVE.L Dn,Dm
SUBi.L  #,Dm

MOVE.L Dn,Dm
ANDi.W  #,Dm

MOVE.L Dn,Dm
ADD.L  Dx,Dm

MOVE.L Dn,Dm
SUB.L  Dx,Dm

MOVE.L Dn,Dm
AND.L  Dx,Dm

MOVE.L Dn,Dm
OR.L  Dx,Dm

MOVE.L Dn,Dm
SHIFT.L #,Dm

MOVEq #,Dn
MOVE.B  (ea),Dn    - MVZ

MOVEq #,Dn
MOVE.W  (ea),Dn    - MVZ

MOVEq #,Dn
OR.L  Dx,Dn   

MOVEq  #,Dn
AND.L  Dx,Dn   

SUBQ.L #1,Dn
BNE.s  LOOP

Above you see examples of FUSING of 2 instructions which execute together in single cycle in 1 ALU.

Both ALU can execute such bundles.


Stefan "Bebbo" Franke

Posts 139
24 Jun 2019 15:34


Gunnar von Boehn wrote:

  Contra:
  lsl.l  #1,D0
  move.l (A0,D0),D2    -- Index usage
  This code is will cause slowdown on 060.
  As the Index register was used in the ALU to close before hand.
  Motorola explains this very good in their manuals.
  That compiler need to schedule such INDEX manipulation several instructions before using them.
 
  Is GCC aware of this?
 

I'd say: yes.

and about "fusing": a complete list would be useful

Fused assembler instructions must be reflected by single insns in gcc to avoid splitting.




Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 15:59


Stefan "Bebbo" Franke wrote:

and about "fusing": a complete list would be useful
 
Fused assembler instructions must be reflected by single insns in gcc to avoid splitting.

OK, I will prepare this list for you.

Of course this can go both ways.
If you see a certain combination which is very common or very useful to fuse then please tell us and we can add a FUSING.

One more tuning comment-.

BITFIELD are useful but used to be often slow.
E.g. on 68060 a BFEXTU could need 9 cycle
On 68080 such BITFIELD often only is single cycle.
On 68060 a possible tuning was to replace a bitfields with a few normal instructions doing the same operation.
Please NOT do this cheat on 68080.
On 68080 a single cycle Bitfield is unbeatable.

One more question:
68080 has a number of new CPU debug registers.
You can read them out with MOVEC.

Using this registers you can for example
exactly read out of many CPU cycles a certain routine needed,
you can read out how good the branch-prediction was,
you can read out your Dcache Hit/Miss ratio.
You can read out how good Super-Scalar your routine is,
and you can read out if your routine suffered from ALU-2-EA stalls.

Could GCC benefit from this?
Could this be used in "profiling" options?


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 16:22


Bebbo, one more question
 
 
68080 can do per cycle
1 memory READ    OR
1 memory WRITE  OR
1 memory READ/WRITE to SAME ADDRESS

This makes CISC instruction very strong.
 
For best performance such memory access should be used for "real work" and counters / temporary variables should be kept in registers to keep the memory bus available for pushing data.
 
To make this easier 68080 has more registers.
Old 68000 did has 8 Address and 8 data registers.
On 68080 you can use 16 ADDRESS and 32 DATA registers.
 
To use those extra registers each instruction can have a PREFIX.
This technique is similar to the x64 extension which AMD did invent.
 
Would adding support for this to GCC be complex ?
 
 
On the other hand DOUBLE INDIRECT EA modes should be avoided.
For example this code construct is sometimes used by compilers "create" more Pointer registers.
MOVE.L (4[A0]),D0
 
Such operation needs 2 memory access and creates an ALU-2-EA bubble.
Performance wise this should be avoided.
Can GCC be told to avoid this or to account 4 cycles for this EA?

 


Stefan "Bebbo" Franke

Posts 139
24 Jun 2019 17:36


Gunnar von Boehn wrote:

...
Please NOT do this cheat on 68080.
On 68080 a single cycle Bitfield is unbeatable.
...

I think, the current insn costs are 100% correct - and I have to fix 68060 costs - but for 68080 all insns costs are set to 1 cycle atm.

Gunnar von Boehn wrote:

Bebbo, one more question
   
   
  68080 can do per cycle
  1 memory READ    OR
  1 memory WRITE  OR
  1 memory READ/WRITE to SAME ADDRESS
 
  This makes CISC instruction very strong.
   
  For best performance such memory access should be used for "real work" and counters / temporary variables should be kept in registers to keep the memory bus available for pushing data.

gcc is cost based - sometimes you can't be 100% accurate but it's fair. So gcc selects registers if useful, e.g. if a parameter is used more often a temp register is used.

Gunnar von Boehn wrote:

  To make this easier 68080 has more registers.
  Old 68000 did has 8 Address and 8 data registers.
  On 68080 you can use 16 ADDRESS and 32 DATA registers.

  To use those extra registers each instruction can have a PREFIX.
  This technique is similar to the x64 extension which AMD did invent.
   
  Would adding support for this to GCC be complex ?

if I find the time for this, I'll try to implement it.

Gunnar von Boehn wrote:

   
 
  On the other hand DOUBLE INDIRECT EA modes should be avoided.
  For example this code construct is sometimes used by compilers "create" more Pointer registers.
  MOVE.L (4[A0]),D0
 
  Such operation needs 2 memory access and creates an ALU-2-EA bubble.
  Performance wise this should be avoided.
  Can GCC be told to avoid this or to account 4 cycles for this EA? 

It was a pita to implement those double indirect instructions, but this can be done via costs or TUNE_68080 flag.

With -Os I would keep these insns.

And I wonder that

    MOVE.L (4[A0]),D0

is slower than

    MOVE.L 4[A0],A1
    MOVE.L (A1),D0

it should result into the same speed - at least...

BTW:

__regargs void foo(int x) {
  return x & 11;
}

yields now

    and.l #11,d0
    rts

if -m68080 is specified. This should be live quite soon.


Mike Kopack

Posts 268
24 Jun 2019 17:46


Really awesome to see you two discussing this and helping each other since realistically you are both the key people for this to happen. I’ll certainly be sending some $$ bebbo’s way and will be purchasing a V4SA as soon as I can get me hands on one.

Thanks for coming together on this guys! It’s only going to help the Vampire community!


Tango One

Posts 102
24 Jun 2019 18:28


One of my wet dreams is to programing in assambly.


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 18:42


Stefan "Bebbo" Franke wrote:

 
Gunnar von Boehn wrote:

     
   
    On the other hand DOUBLE INDIRECT EA modes should be avoided.
    For example this code construct is sometimes used by compilers "create" more Pointer registers.
    MOVE.L (4[A0]),D0
   
    Such operation needs 2 memory access and creates an ALU-2-EA bubble.
    Performance wise this should be avoided.
    Can GCC be told to avoid this or to account 4 cycles for this EA? 
 

 
  It was a pita to implement those double indirect instructions, but this can be done via costs or TUNE_68080 flag.
 
  With -Os I would keep these insns.
 

  Well, the double-indirect encoding needs an extra extension word.
  So it makes longer instructions.
  Using the new A8-15 regs, or using 2 instructions will in average not be bigger.
 
 
 
 
Stefan "Bebbo" Franke wrote:

 
  And I wonder that
 
      MOVE.L (4[A0]),D0
 
  is slower than
 
      MOVE.L 4[A0],A1
      MOVE.L (A1),D0
 
  it should result into the same speed - at least...
 

 
  This is easy to explain:
 
 

  MOVE.L (4[A0]),D0
  gets executed as
  Cycle    Pipe  1              Pipe2
  1        MOVE.L 4(A0),Atemp  nop
  2        nop                  nop
  3        nop                  nop
  4        move.l (atemp),D0    nop
 

 
  4 Clock cycle total
 
  Now your other code:
 

  Cycle    Pipe  1              Pipe2
  1        MOVE.L 4(A0),A1      free
  2        free                free
  3        free                free
  4        move.l (A1),D0      free
 

It also takes 4 cycle. And it has the same instruction size!
It looks nearly the same but its not.

But you have 6 instructions slots free!
 
This means with good scheduling a compiler can put there
up to 6 instructions in parallel.

Hint: Avoid the ALU-2-EA bubble.

On 68060 the picture is very similar.

 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 19:09


Bebbo,

Maybe we can also talk about instruction pairing.
Pairing should be a very important topic for many super scalar CPUs.
68060, Pentium, PPC, Arm

Lets make some 68k examples

MOVEM.L (A0)+,D0:D1
Will load 2 register from memory and do vituelly the same work as:
MOVE.L (A0)+,D0
MOVE.L (A0)+,D1

Both have the same instruction size.

But there is a significant difference:
The multicycle MOVEM will block the 2nd pipe over its execution.
While doing the same work with 2 individual moves will allow the compiler to put other instructions in the 2nd pipe.




Stefan "Bebbo" Franke

Posts 139
24 Jun 2019 19:48


so double indirect insns are void at all - guess I disable these for all CPUs /sigh

Gunnar von Boehn wrote:

Bebbo,
 
  Maybe we can also talk about instruction pairing.
  Pairing should be a very important topic for many super scalar CPUs.
  68060, Pentium, PPC, Arm
 
  Lets make some 68k examples
 
  MOVEM.L (A0)+,D0:D1
  Will load 2 register from memory and do vituelly the same work as:
  MOVE.L (A0)+,D0
  MOVE.L (A0)+,D1
 
  Both have the same instruction size.
 
  But there is a significant difference:
  The multicycle MOVEM will block the 2nd pipe over its execution.
  While doing the same work with 2 individual moves will allow the compiler to put other instructions in the 2nd pipe.

right now movem is only used if 3 registers or more are pushed/popped

but since this happens in prologue/epilogue it's separated from the remaining code => only a late peephole optimizer might enhance this - maybe.



Gunnar von Boehn
(Apollo Team Member)
Posts 6197
24 Jun 2019 20:34


Bebbo,
   
    another GCC question..
   
    I recall that GCC used to create often unneeded TST instructions.
   
    Lets look at an example:
   
   

    void strncopy(char *dst,const char *src,short n)
    {
      char c;
      for (n;n>=0;n--) {
        c=*src++;
        *dst++=c;
        if( c==0x0 ) break;
      }
    }
   
   

   
    What code do we expect to create here?
   
   
    I would assume the compiles creates something like this:
   
   

      move.l 4(SP),D0
      BLT    END
      move.l 8(SP),A0
      move.l 12(SP),A1
    LOOP:
      move.b (A1)+,(A0)+
      dbeq  D0,Loop
    END:
      rts
   

 
  I expect a loop of 2 Instructions and a program total of 7.
 
Why is it the created code with recent GCC versions bigger?



Stefan "Bebbo" Franke

Posts 139
25 Jun 2019 06:39


Gunnar von Boehn wrote:

Bebbo,
     
      another GCC question..
     
      I recall that GCC used to create often unneeded TST instructions.
     
      Lets look at an example:
     
     

      void strncopy(char *dst,const char *src,short n)
      {
        char c;
        for (n;n>=0;n--) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
        }
      }
     
     

     
      What code do we expect to create here?
     
     
      I would assume the compiles creates something like this:
     
     

        move.l 4(SP),D0
        BLT    END
        move.l 8(SP),A0
        move.l 12(SP),A1
      LOOP:
        move.b (A1)+,(A0)+
        dbeq  D0,Loop
      END:
        rts
     

   
  I expect a loop of 2 Instructions and a program total of 7.
   
  Why is it the created code with recent GCC versions bigger?
 

the main reason seems to be the introduced abstract syntax - called gimple - plus the optimizers working with that syntax.

it 'optimizes' the initial code

{
  char c;

    char c;
  n;
  goto <D.1382>;
  <D.1381>:;
  c = *src++ ;
  *dst++  = c;
  if (c == 0)
    {
      goto <D.1380>;
    }
  n-- ;
  <D.1382>:;
  if (n >= 0) goto <D.1381>; else goto <D.1380>;
  <D.1380>:;
}
 
to (depends on the -O flags, this is -O1)

strncopy (char * dst, const char * src, short int n)
{
  unsigned int ivtmp.17;
  unsigned int ivtmp.15;
  char c;
  void * _29;
  void * _30;
  unsigned short _31;
  unsigned short _32;
  short int _33;

  <bb 2>:
  if (n_8(D) >= 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 3>:
  src_21 = src_7(D) + 1;
  c_22 = *src_7(D);
  dst_23 = dst_6(D) + 1;
  *dst_6(D) = c_22;
  if (c_22 == 0)
    goto <bb 7>;
  else
    goto <bb 4>;

  <bb 4>:
  ivtmp.15_3 = (unsigned int) src_21;
  ivtmp.17_18 = (unsigned int) dst_23;
  goto <bb 6>;

  <bb 5>:
  _29 = (void *) ivtmp.15_1;
  c_11 = MEM[base: _29, offset: 0B];
  ivtmp.15_2 = ivtmp.15_1 + 1;
  _30 = (void *) ivtmp.17_20;
  MEM[base: _30, offset: 0B] = c_11;
  ivtmp.17_19 = ivtmp.17_20 + 1;
  if (c_11 == 0)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 6>:
  # n_25 = PHI <n_8(D)(4), _33(5)>
  # ivtmp.15_1 = PHI <ivtmp.15_3(4), ivtmp.15_2(5)>
  # ivtmp.17_20 = PHI <ivtmp.17_18(4), ivtmp.17_19(5)>
  _31 = (unsigned short) n_25;
  _32 = _31 + 65535;
  _33 = (short int) _32;
  if (_33 != -1(OVF))
    goto <bb 5>;
  else
    goto <bb 7>;

  <bb 7>:
  return;
}

which is anything but optimized...

and gcc6.5.0b is already a tad better than the original gcc^^ and there is room for further improvements.

The compiler explorer at EXTERNAL LINK provides some 68k compilers to compare asm code.




Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 07:49


Hi Bebbo,

here is some Fusing Overview

CLICK HERE


Thellier Alain

Posts 141
25 Jun 2019 07:53


char c;
        for (n;n>=0;n--) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
        }

writing a loop with C this way generate a better ASM

        char c;
        while(n) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
          n--;
        }

posts 367page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19