

GCC Improvement for 68080

Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 15:08


Samuel Devulder wrote:

That's nice :) I suppose __retfp0 has no impact when the returned value is an int.

It only affects float/double.
Samuel Devulder wrote:

  Is there a command-line switch or pragma to add an implicit __retfp0 to every function (except maybe the functions in math.h)?

-mretfp0 is on the way. Functions having __stdargs aren't touched.

... needs tons of updated headers, as -mregparm. Is someone bored?



Grom 68k

Posts 61
11 Jul 2019 15:39


Stefan "Bebbo" Franke wrote:

  ... needs tons of updated headers, as -mregparm. Is someone bored?

If you think modifications can be scripted, I can try to make a python script.

Do you have an example ?


Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 15:56


Grom 68k wrote:

 
Stefan "Bebbo" Franke wrote:

    ... needs tons of updated headers, as -mregparm. Is someone bored?
 

 
  If you think modifications can be scripted, I can try to make a python script.
 
  Do you have an example ?
 

 
  for all headers in
    sys-include and sub folders
    libnix/include and sub folders
    do
 
    add __stdargs to functions and function types
 
  it should look like the ones in
 
    ndk-include/clib/alib_protos.h
 
  (guess there are more files which could use that patch)

clib2/include seems complete
libnix/include is partially done
sys-include has only a few functions with __stdargs so far
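
To illustrate the kind of header change described above, here is a minimal sketch of a prototype and a function type carrying `__stdargs`. The names `my_add` and `my_binop_t` are hypothetical, and the fallback `#define` is only there so the fragment also builds with a compiler that does not know `__stdargs`. The exact placement of the marker in a function type is an assumption; the real headers should follow the pattern in ndk-include/clib/alib_protos.h.

```c
/* Fallback so this sketch also compiles with compilers that do not
   provide __stdargs; with the Amiga GCC the toolchain defines it. */
#ifndef __stdargs
#define __stdargs
#endif

/* Hypothetical prototype carrying the marker, as a patched header would. */
__stdargs int my_add(int a, int b);

/* Hypothetical function type with the same marker (placement assumed). */
typedef __stdargs int (*my_binop_t)(int, int);

/* Definition matching the prototype above. */
__stdargs int my_add(int a, int b)
{
    return a + b;
}
```

According to the post above, functions marked this way are not touched by -mretfp0, which is why the headers need the annotation before the switch can be used safely.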


Grom 68k

Posts 61
11 Jul 2019 18:14


Where is source ?


Niclas A
(Apollo Team Member)
Posts 219
11 Jul 2019 18:22


Grom 68k wrote:

Where is source ?

EXTERNAL LINK 


Stefan "Bebbo" Franke

Posts 139
11 Jul 2019 19:02


Niclas A wrote:

Grom 68k wrote:

  Where is source ?
 

 
  EXTERNAL LINK 

not exactly - he's looking for the headers which are in several repos.

Easiest: grab the built version, and mail me the fixed headers and I'll put them live




Niclas A
(Apollo Team Member)
Posts 219
11 Jul 2019 19:37


Stefan "Bebbo" Franke wrote:

 
Niclas A wrote:

 
Grom 68k wrote:

    Where is source ?
   

   
    EXTERNAL LINK   
 

   
  not exactly - he's looking for the headers which are in several repos.
 
  Easiest: grab the built version, and mail me the fixed headers and I'll put them live
 
 

  Oh sorry. My bad.
 


Samuel Devulder

Posts 248
11 Jul 2019 21:51


Stefan "Bebbo" Franke wrote:

   
Samuel Devulder wrote:

    I should try recompiling Quake with your latest gcc6.5b when it becomes available.
     

     
      It's available now.
   

    Is it at this link: EXTERNAL LINK ?
   
    Is there also a plain ZIP file with just the content of the setup? My anti-virus doesn't like the setup, and I usually prefer a plain zip so I can easily move the installation folder whenever needed. Most importantly, on my W10 machine the setup.exe produces this error: EXTERNAL LINK.
   


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 06:36


Stefan "Bebbo" Franke wrote:

   
    gcc is aware of 68040/60:
      /* fmovecr must be emulated on the 68040 and 68060, so it 
    shouldn't be used at all on those chips.  */
    and for the 68080 all FP constants can be used directly now.
 

 
 
Hello Bebbo,
 
I see this code now:
 

          fdmul.x fp3,fp1
          fdmul.d #0x3fe0000000000000,fp3
          fdadd.x fp3,fp1
          fdadd.x fp2,fp1
          fdmove.x fp0,fp2
          fdadd.d #0x3ff0000000000000,fp2
          fdmul.x fp0,fp2
          fdmul.x fp0,fp0
          fdsub.d #0x3ff0000000000000,fp0
          fddiv.x fp0,fp2

 
 
Would it be possible for GCC to use the
.SINGLE instead .DOUBLE for numbers which fit in single?
 
This would save program space and also be faster for the Moto68060.
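
The test "fits in single" can be sketched as a round trip through `float`. This is only an illustration in C of the condition being asked for, not how GCC or the assembler implements it; the helper name `fits_in_single` is hypothetical. The two constants in the listing above, 0x3fe0000000000000 (0.5) and 0x3ff0000000000000 (1.0), both pass the check.

```c
#include <stdbool.h>

/* Returns true when d survives a round trip through float unchanged,
   i.e. the constant could be encoded as a single without losing bits.
   Assumes IEEE 754 float/double; NaN compares unequal and so yields false. */
static bool fits_in_single(double d)
{
    volatile float f = (float)d;   /* volatile: force a real rounding to float */
    return (double)f == d;
}
```

A value such as 0.1 fails the check, because its double representation has more significant bits than a single can hold.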
 
 


Stefan "Bebbo" Franke

Posts 139
12 Jul 2019 07:14


Gunnar von Boehn wrote:

 
  Would it be possible for GCC to use the
  .SINGLE instead .DOUBLE for numbers which fit in single?
   
  This would save program space and also be faster for the Moto68060.

that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 08:13


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

   
  Would it be possible for GCC to use the
  .SINGLE instead .DOUBLE for numbers which fit in single?
   
  This would save program space and also be faster for the Moto68060.
 

 
  that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.

great! many thanks!


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 08:29


Gunnar von Boehn wrote:

    compiled with -O2 -m68080
   

    .L2:
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            subq.l #1,d0
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            tst.l d0
            jne .L2

            unlk a5
            rts
   

   
  Yes you are absolutely correct.
  GCC 6.5b does include the unneeded TST instruction
 
 

 
 
Bebbo, GCC sometimes makes such code but not always.
Bebbo, do you see a way to solve this?
If this is easy to fix it would be great as saving the extra TST will often be good.


Stefan "Bebbo" Franke

Posts 139
12 Jul 2019 08:34


Gunnar von Boehn wrote:

Gunnar von Boehn wrote:

    compiled with -O2 -m68080
   

    .L2:
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            subq.l #1,d0
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            fdmove.d (a0)+,fp1
            fdmul.x fp0,fp1
            fmove.d fp1,(a1)+
            tst.l d0
            jne .L2

            unlk a5
            rts
   

   
    Yes you are absolutely correct.
    GCC 6.5b does include the unneeded TST instruction
   
 

 
 
  Bebbo, GCC sometimes makes such code but not always.
  Bebbo, do you see a way to solve this?
  If this is easy to fix it would be great as saving the extra TST will often be good.

what's the source and the used options?



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 08:41



#include <string.h>
void Scale3(double scalar, double* b, double* c)
{
      size_t j;
      double t1;
      double t2;
      double t3;
      double t4;
     
      for (j=1000; j; j--){
          t1 = scalar* *c++;
          *b++ = t1;
          t2 = scalar* *c++;
          *b++ = t2;
          t3 = scalar* *c++;
          *b++ = t3;
          t4 = scalar* *c++;
          *b++ = t4;
      }
}

The code above,
with these options: "-Os -m68080"


Stefan "Bebbo" Franke

Posts 139
12 Jul 2019 08:54


Gunnar von Boehn wrote:

 

  #include <string.h>
  void Scale3(double scalar, double* b, double* c)
  {
        size_t j;
        double t1;
        double t2;
        double t3;
        double t4;
       
        for (j=1000; j; j--){
            t1 = scalar* *c++;
            *b++ = t1;
            t2 = scalar* *c++;
            *b++ = t2;
            t3 = scalar* *c++;
            *b++ = t3;
            t4 = scalar* *c++;
            *b++ = t4;
        }
  }
 

 
  This above code.
  With these options "-Os -m68080"
 

 
 
  dängs :-)
 
the m68080 automaton is not yet correct, but the scheduler inserts the only executable insn.


Samuel Devulder

Posts 248
12 Jul 2019 10:28


Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:

#include <stddef.h>  /* for size_t */

void Scale0(double scalar, double* b, double* c)
{
    size_t j;
    for (j=4; j; j--){
        *b++ = scalar* *c++;
    }
}

I get:
_Scale0:
        fdmove.d (4,sp),fp0
        move.l (16,sp),a1
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        move.l (12,sp),a0
; 4 wait-cycles
        fmove.d fp1,(a0)+
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
; 5 wait-cycles
        fmove.d fp1,(a0)+
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
; 5 wait-cycles
        fmove.d fp1,(a0)+
        fdmul.d (a1),fp0
; 5 wait-cycles: total 19 wait-cycles
        fmove.d fp0,(a0)
        rts

In this asm there are lots of pipeline stalls because the result of fdmul is used in the very next operation. Why wasn't the flow reorganized to use multiplication on 4 independent fpu-regs (cf. loop unrolling), and write the result in the end of the function like this:
_Scale0:
        fdmove.d (4,sp),fp0
        move.l (12,sp),a0
        move.l (16,sp),a1
        fmovem fp3/fp2,-(sp)
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        fdmove.d (a1)+,fp2
        fdmul.x fp0,fp2
        fdmove.d (a1)+,fp3
        fdmul.x fp0,fp3
        fdmul.d (a1)+,fp0
; 0 wait-cycle
        fmove.d fp1,(a0)+
; 1 wait-cycle
        fmove.d fp2,(a0)+
; 1 wait-cycle
        fmove.d fp3,(a0)+
        fmovem (sp)+,fp3/fp2
; 0 wait-cycle: total 2 wait-cycles
        fmove.d fp0,(a0)+
        rts
     

Notice:
1) if the loop were one iteration longer, so that an extra fp5 were used, there wouldn't be any wait-cycle :)
2) even with this 4-fold loop, the remaining 1 wait-cycle stalls can be removed by splitting the restoring fmovem into several fmove.x instructions placed at those points, like this:
_Scale0:
        fdmove.d (4,sp),fp0
        move.l (12,sp),a0
        move.l (16,sp),a1
        fmovem fp3/fp2,-(sp)
        fdmove.d (a1)+,fp2
        fdmul.x fp0,fp2
        fdmove.d (a1)+,fp3
        fdmul.x fp0,fp3
        fdmove.d (a1)+,fp1
        fdmul.x fp0,fp1
        fdmul.d (a1)+,fp0
; 0 wait-cycle
        fmove.d fp2,(a0)+
        fmove.x (sp)+,fp2
; 0 wait-cycle
        fmove.d fp3,(a0)+
        fmove.x (sp)+,fp3
; 0 wait-cycle
        fmove.d fp1,(a0)+
; 0 wait-cycle: total 0 wait-cycles
        fmove.d fp0,(a0)+
        rts
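
The same idea can be written at the C level as a rough sketch: compute four independent products first and store them afterwards, so no store has to wait for the multiply issued just before it. The function name `scale_unrolled4` and the length parameter `n` are hypothetical, and the sketch assumes that `n` is a multiple of 4 and that `b` and `c` do not overlap. Whether the compiler then emits the register allocation shown above is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical variant of Scale0: n must be a multiple of 4 and
   b and c must not overlap (hence restrict). */
void scale_unrolled4(double scalar, double *restrict b,
                     const double *restrict c, size_t n)
{
    size_t j;
    for (j = 0; j < n; j += 4) {
        /* four independent products: none depends on another's result */
        double t0 = scalar * c[j];
        double t1 = scalar * c[j + 1];
        double t2 = scalar * c[j + 2];
        double t3 = scalar * c[j + 3];
        /* stores come last, after all four multiplies have been started */
        b[j]     = t0;
        b[j + 1] = t1;
        b[j + 2] = t2;
        b[j + 3] = t3;
    }
}
```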

 


Stefan "Bebbo" Franke

Posts 139
12 Jul 2019 10:32


The insn in front of a compare is now fused with the compare if the two are related.

live at ~12:35 - if all tests pass


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 13:07


Samuel Devulder wrote:

Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:
 

  void Scale0(double scalar, double* b, double* c)
  {
      size_t j;
      for (j=4; j; j--){
          *b++ = scalar* *c++;
      }
  }
 

Unrolling will make more sense for more iterations.
How about we look at a loop of "640" ?


#include <string.h>

void Scale8(double scalar, double* b, double* c)
{
      size_t j;
      for (j=640; j; j--){
          *b++ = scalar * *c++;
      }
}

_Scale8:
        move.l a2,-(sp)
        move.l #640,d0
        fdmove.d (8,sp),fp0
        move.l (20,sp),a0
        move.l (16,sp),a1
.L2:
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        move.l a1,a2
        fmove.d fp1,(a2)+
        subq.l #8,d0
        lea (64,a1),a1
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(a2)+
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(a2)
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(-40,a1)
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(-32,a1)
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(-24,a1)
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(-16,a1)
        fdmove.d (a0)+,fp1
        fdmul.x fp0,fp1
        fmove.d fp1,(-8,a1)
        tst.l d0
        jne .L2
        move.l (sp)+,a2
        rts

-O3 -m68080 -fomit-frame-pointer -funroll-loops

Bebbo, can you explain why this code is generated?

I would have expected the post-increment EA mode (An)+.
Instead GCC uses the (16,An) EA mode and makes things very complicated with LEA and the extra register A2.

Why does GCC generate such code?


Stefan "Bebbo" Franke

Posts 139
12 Jul 2019 13:13


Samuel Devulder wrote:

  Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:
 

  void Scale0(double scalar, double* b, double* c)
  {
      size_t j;
      for (j=4; j; j--){
          *b++ = scalar* *c++;
      }
  }
 

 

 
  Everything is fine, because the pointers may alias.
  Use
 

  void Scale0(double scalar, double* restrict b, double* restrict c)
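
A small sketch of why possible aliasing forces the compiler to keep every store before the next load. The function `scale_may_alias` is a hypothetical copy of Scale0 with a length parameter added for the demonstration; it is not taken from the thread.

```c
#include <stddef.h>

/* Same loop body as Scale0, without restrict: b and c may overlap,
   so each store to *b may change a value that is loaded later via *c. */
void scale_may_alias(double scalar, double *b, double *c, size_t n)
{
    size_t j;
    for (j = n; j; j--) {
        *b++ = scalar * *c++;
    }
}
```

With `double buf[5] = {1, 0, 0, 0, 0}` and the call `scale_may_alias(2.0, buf + 1, buf, 4)`, each store feeds the next load and the buffer ends up as 1, 2, 4, 8, 16. Doing all four loads before the stores would give a different result, so the reordering is only legal once `restrict` promises that the two arrays do not overlap.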
 

 
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Jul 2019 13:39


Bebbo, did you see my question above?
