APOLLO CPU Knowledge Forum

Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 142
11 Jul 2019 15:08

Samuel Devulder wrote:

That's nice :) I suppose __retfp0 has no impact when the returned value is an int.

It only affects float/double.

Samuel Devulder wrote:

Is there a cmd-line switch or pragma to add an implicit __retfp0 to every function (except maybe for functions in math.h)?

-mretfp0 is on the way. Functions having __stdargs aren't touched.

... needs tons of updated headers, as -mregparm. Is someone bored?

Grom 68k

Posts 61
11 Jul 2019 15:39

Stefan "Bebbo" Franke wrote:

... needs tons of updated headers, as -mregparm. Is someone bored?

If you think modifications can be scripted, I can try to make a python script.

Do you have an example ?

Stefan "Bebbo" Franke

Posts 142
11 Jul 2019 15:56

Grom 68k wrote:

Stefan "Bebbo" Franke wrote:

... needs tons of updated headers, as -mregparm. Is someone bored?

If you think modifications can be scripted, I can try to make a python script.

Do you have an example ?

for all headers in
sys-include and sub folders
libnix/include and sub folders
do

add __stdargs to functions and function types

it should look like the ones in

ndk-include/clib/alib_protos.h

(guess there are more files which could use that patch)

clib2/include seems complete
libnix/include is partially done
sys-include has only a few functions with __stdargs so far
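
As a rough illustration of the kind of change described above, the sketch below shows a header fragment after adding `__stdargs` to functions and function types. The names (`my_add`, `my_callback_t`, `my_apply`, `my_double`) are hypothetical and not taken from any of the listed headers. The position of `__stdargs` in each declaration is an assumption, and the fallback `#define` exists only so the fragment also builds with a compiler that does not know `__stdargs`. The prototypes in ndk-include/clib/alib_protos.h remain the reference for the exact form.

```c
/* Sketch only: hypothetical prototypes, not copied from any real header. */
#ifndef __stdargs
#define __stdargs /* assumption: empty fallback for compilers without it */
#endif

/* Before the patch the declarations would read:
 *   int my_add(int a, int b);
 *   typedef int (*my_callback_t)(int value);
 *   int my_apply(my_callback_t cb, int value);
 */

/* After the patch: functions carry __stdargs ... */
__stdargs int my_add(int a, int b);

/* ... and so do function types. */
typedef __stdargs int (*my_callback_t)(int value);

__stdargs int my_apply(my_callback_t cb, int value);

/* Minimal definitions so the fragment can be compiled and called. */
__stdargs int my_add(int a, int b) { return a + b; }
__stdargs int my_double(int value) { return value * 2; }
__stdargs int my_apply(my_callback_t cb, int value) { return cb(value); }
```

A script that patches the headers would have to produce this shape for every prototype and every function typedef it finds.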


Grom 68k

Posts 61
11 Jul 2019 18:14

Where is source ?

Niclas A
(Apollo Team Member)
Posts 219
11 Jul 2019 18:22

Grom 68k wrote:

Where is source ?

EXTERNAL LINK

Stefan "Bebbo" Franke

Posts 142
11 Jul 2019 19:02

Niclas A wrote:

Grom 68k wrote:

Where is source ?

EXTERNAL LINK

not exactly - he's looking for the headers which are in several repos.

Easiest: grab the built version, and mail me the fixed headers and I'll put them live

Niclas A
(Apollo Team Member)
Posts 219
11 Jul 2019 19:37

Stefan "Bebbo" Franke wrote:

Niclas A wrote:

Grom 68k wrote:

Where is source ?

EXTERNAL LINK

not exactly - he's looking for the headers which are in several repos.

Easiest: grab the built version, and mail me the fixed headers and I'll put them live

Oh sorry. My bad.

Samuel Devulder

Posts 248
11 Jul 2019 21:51

Stefan "Bebbo" Franke wrote:

Samuel Devulder wrote:

I should try recompiling Quake with your latest gcc6.5b when it's available.

It's available now.

Is it at this link: EXTERNAL LINK?

Is there a plain ZIP file containing just the content of the setup? My anti-virus doesn't like the setup, and I usually prefer a plain ZIP so I can easily move the installation folder whenever needed. Most importantly, on my W10 machine the setup.exe produces this error: EXTERNAL LINK.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 06:36

Stefan "Bebbo" Franke wrote:

gcc is aware of 68040/60:
/* fmovecr must be emulated on the 68040 and 68060, so it
shouldn't be used at all on those chips. */
and for the 68080 all FP constants can be used directly now.

Hello Bebbo,

I see this code now:


           fdmul.x fp3,fp1
           fdmul.d #0x3fe0000000000000,fp3
           fdadd.x fp3,fp1
           fdadd.x fp2,fp1
           fdmove.x fp0,fp2
           fdadd.d #0x3ff0000000000000,fp2
           fdmul.x fp0,fp2
           fdmul.x fp0,fp0
           fdsub.d #0x3ff0000000000000,fp0
           fddiv.x fp0,fp2

Would it be possible for GCC to use
.SINGLE instead of .DOUBLE for numbers which fit in single?

This would save program space and also be faster for the Moto68060.
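
A minimal sketch of what "fits in single" can mean for the constants in the listing above: a double constant can be encoded as a single without loss if converting it to `float` and back gives the same value. The helper names `double_from_bits` and `fits_in_single` are hypothetical, the code assumes IEEE 754 doubles on the host, and it is not a description of how GCC or the assembler actually decide this.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern (as printed in the listing) as a double.
 * Assumes the host uses IEEE 754 binary64 for double. */
double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 if d survives a round trip through float unchanged,
 * i.e. the constant could be emitted as .s instead of .d without loss.
 * NaN compares unequal to itself, so it is reported as not fitting. */
int fits_in_single(double d)
{
    volatile float f = (float)d;   /* volatile forces rounding to single */
    return (double)f == d;
}
```

The two immediates in the listing, 0x3fe0000000000000 and 0x3ff0000000000000, are 0.5 and 1.0, and both pass this check. A value such as 0.1 does not.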

Stefan "Bebbo" Franke

Posts 142
12 Jul 2019 07:14

Gunnar von Boehn wrote:

Would it be possible for GCC to use the
.SINGLE instead .DOUBLE for numbers which fit in single?

This would save program space and also be faster for the Moto68060.

that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 08:13

Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

Would it be possible for GCC to use the
.SINGLE instead .DOUBLE for numbers which fit in single?

This would save program space and also be faster for the Moto68060.

that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.

great! many thanks!

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 08:29

Gunnar von Boehn wrote:

compiled with -O2 -m68080


    .L2:
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            subq.l #1,d0
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            tst.l d0
            jne .L2
            unlk a5
            rts

Yes you are absolutely correct.
GCC 6.5b does include the unneeded TST instruction

Bebbo, GCC sometimes makes such code but not always.
Bebbo, do you see a way to solve this?
If this is easy to fix it would be great as saving the extra TST will often be good.

Stefan "Bebbo" Franke

Posts 142
12 Jul 2019 08:34

Gunnar von Boehn wrote:

compiled with -O2 -m68080


     .L2:
             fdmove.d (a0)+,fp1
             fdmul.x fp0,fp1
             fmove.d fp1,(a1)+
             subq.l #1,d0
             fdmove.d (a0)+,fp1
             fdmul.x fp0,fp1
             fmove.d fp1,(a1)+
             fdmove.d (a0)+,fp1
             fdmul.x fp0,fp1
             fmove.d fp1,(a1)+
             fdmove.d (a0)+,fp1
             fdmul.x fp0,fp1
             fmove.d fp1,(a1)+
             tst.l d0
             jne .L2
             unlk a5
             rts

Yes you are absolutely correct.
GCC 6.5b does include the unneeded TST instruction

Bebbo, GCC sometimes makes such code but not always.
Bebbo, do you see a way to solve this?
If this is easy to fix it would be great as saving the extra TST will often be good.

what's the source and the used options?

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 08:41


#include <string.h>
void Scale3(double scalar, double* b, double* c)
{
      size_t j;
      double t1;
      double t2;
      double t3;
      double t4;
       
      for (j=1000; j; j--){
          t1 = scalar* *c++;
          *b++ = t1;
          t2 = scalar* *c++;
          *b++ = t2;
          t3 = scalar* *c++;
          *b++ = t3;
          t4 = scalar* *c++;
          *b++ = t4;
      }
}

The code above,
with these options: "-Os -m68080"

Stefan "Bebbo" Franke

Posts 142
12 Jul 2019 08:54

Gunnar von Boehn wrote:


   #include <string.h>
   void Scale3(double scalar, double* b, double* c)
   {
        size_t j;
        double t1;
        double t2;
        double t3;
        double t4;
         
        for (j=1000; j; j--){
            t1 = scalar* *c++;
            *b++ = t1;
            t2 = scalar* *c++;
            *b++ = t2;
            t3 = scalar* *c++;
            *b++ = t3;
            t4 = scalar* *c++;
            *b++ = t4;
        }
   }

This above code.
With these options "-Os -m68080"

dängs :-)

the m68080 automaton is not yet correct, but the scheduler inserts the only executable insn.

Samuel Devulder

Posts 248
12 Jul 2019 10:28

Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:


void Scale0(double scalar, double* b, double* c)
{
     size_t j;
     for (j=4; j; j--){
         *b++ = scalar* *c++;
     }
}

I get:

_Scale0:
         fdmove.d (4,sp),fp0
         move.l (16,sp),a1
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
         move.l (12,sp),a0
; 4 wait-cycles
         fmove.d fp1,(a0)+
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
; 5 wait-cycles
         fmove.d fp1,(a0)+
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
; 5 wait-cycles
         fmove.d fp1,(a0)+
         fdmul.d (a1),fp0
; 5 wait-cycles: total 19 wait-cycles
         fmove.d fp0,(a0)
         rts

In this asm there are lots of pipeline stalls because the result of each fdmul is used by the very next instruction. Why wasn't the flow reorganized to do the multiplications on 4 independent FPU registers (cf. loop unrolling) and write the results at the end of the function, like this:

_Scale0:
         fdmove.d (4,sp),fp0
         move.l (12,sp),a0
         move.l (16,sp),a1
         fmovem fp3/fp2,-(sp)
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
         fdmove.d (a1)+,fp2
         fdmul.x fp0,fp2
         fdmove.d (a1)+,fp3
         fdmul.x fp0,fp3
         fdmul.d (a1)+,fp0
; 0 wait-cycle
         fmove.d fp1,(a0)+
; 1 wait-cycle
         fmove.d fp2,(a0)+
; 1 wait-cycle
         fmove.d fp3,(a0)+
         fmovem (sp)+,fp3/fp2
; 0 wait-cycle: total 2 wait-cycles 
         fmove.d fp0,(a0)+
         rts

Notice:
1) if the loop were 1 iteration longer, resulting in an extra fp5 being used, then there wouldn't be any wait-cycle :)
2) even with this 4-fold loop, the remaining 1-cycle waits can be optimized away by splitting the fmovem restore into several fmove.x placed at these points, like this:

_Scale0:
         fdmove.d (4,sp),fp0
         move.l (12,sp),a0
         move.l (16,sp),a1
         fmovem fp3/fp2,-(sp)
         fdmove.d (a1)+,fp2
         fdmul.x fp0,fp2
         fdmove.d (a1)+,fp3
         fdmul.x fp0,fp3
         fdmove.d (a1)+,fp1
         fdmul.x fp0,fp1
         fdmul.d (a1)+,fp0
; 0 wait-cycle
         fmove.d fp2,(a0)+
         fmove.x (sp)+,fp2      ; restore fp2 (first half of the fmovem)
; 0 wait-cycle
         fmove.d fp3,(a0)+
         fmove.x (sp)+,fp3      ; restore fp3 (second half of the fmovem)
; 0 wait-cycle
         fmove.d fp1,(a0)+
; 0 wait-cycle: total 0 wait-cycles
         fmove.d fp0,(a0)+
         rts
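
For comparison, the same grouping can be written at the C level. This is only a sketch: `Scale0_grouped` is a hypothetical name, the `restrict` qualifiers state the assumption that `b` and `c` do not overlap, and nothing here guarantees that GCC will emit the schedule shown in the hand-written asm.

```c
/* Hypothetical variant of Scale0: all loads and multiplies first,
 * all stores last. Assumes b and c never overlap (restrict). */
void Scale0_grouped(double scalar, double *restrict b, const double *restrict c)
{
    /* Four independent products: no result is consumed by the
     * instruction that immediately follows its multiply. */
    double t1 = scalar * c[0];
    double t2 = scalar * c[1];
    double t3 = scalar * c[2];
    double t4 = scalar * c[3];

    /* Stores are grouped at the end, as in the asm above. */
    b[0] = t1;
    b[1] = t2;
    b[2] = t3;
    b[3] = t4;
}
```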


Stefan "Bebbo" Franke

Posts 142
12 Jul 2019 10:32

The insn in front of a compare is now fused with the compare if it's related to it. Live at ~12:35, if all tests pass.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 13:07

Samuel Devulder wrote:

Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:


  void Scale0(double scalar, double* b, double* c)
  {
      size_t j;
      for (j=4; j; j--){
          *b++ = scalar* *c++;
      }
  }

Unrolling will make more sense for more iterations.
How about we look at a loop of "640" ?


#include <string.h>
void Scale8(double scalar, double* b, double* c)
{
      size_t j;
      for (j=640; j; j--){
          *b++ = scalar * *c++;
      }
}

_Scale8:
         move.l a2,-(sp)
         move.l #640,d0
         fdmove.d (8,sp),fp0
         move.l (20,sp),a0
         move.l (16,sp),a1
.L2:
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         move.l a1,a2
         fmove.d fp1,(a2)+
         subq.l #8,d0
         lea (64,a1),a1
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(a2)+
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(a2)
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(-40,a1)
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(-32,a1)
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(-24,a1)
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(-16,a1)
         fdmove.d (a0)+,fp1
         fdmul.x fp0,fp1
         fmove.d fp1,(-8,a1)
         tst.l d0
         jne .L2
         move.l (sp)+,a2
         rts

-O3 -m68080 -fomit-frame-pointer -funroll-loops

Bebbo, can you explain why this code is generated?

I would have expected the (An)+ EA mode.
Instead GCC uses the (16,An) EA mode and makes it very complicated with LEA and an extra A2.

Why does GCC make such code?

Stefan "Bebbo" Franke

Posts 142
12 Jul 2019 13:13

Samuel Devulder wrote:

Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:


   void Scale0(double scalar, double* b, double* c)
   {
       size_t j;
       for (j=4; j; j--){
           *b++ = scalar* *c++;
       }
   }

Everything is fine, because the pointers may alias.
Use


  void Scale0(double scalar, double* restrict b, double* restrict c)
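
A small sketch of why aliasing matters here. Without `restrict`, a caller may legally pass overlapping buffers, and then every store changes the value read by the next load, so the compiler has to keep the load/multiply/store order. `Scale0_restrict` is a hypothetical name for the variant with the suggested signature.

```c
#include <stddef.h>

/* Original form: b and c may alias, so stores and loads cannot be reordered. */
void Scale0(double scalar, double *b, double *c)
{
    size_t j;
    for (j = 4; j; j--) {
        *b++ = scalar * *c++;
    }
}

/* With restrict the caller promises that b and c do not overlap,
 * which allows the compiler to reorder loads and stores. */
void Scale0_restrict(double scalar, double *restrict b, double *restrict c)
{
    size_t j;
    for (j = 4; j; j--) {
        *b++ = scalar * *c++;
    }
}
```

Calling `Scale0(2.0, buf + 1, buf)` on a buffer of five 1.0 values produces 1, 2, 4, 8, 16, because each result feeds the next iteration. The same call with `Scale0_restrict` would break the promise made by `restrict` and is undefined.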


Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 13:39

Bebbo did you see my question above?
