
Information about the Apollo CPU and FPU.

GCC Improvement for 68080

Samuel Devulder

Posts 248
30 Jul 2019 17:08


$ man pow
EXTERNAL LINK 
EXTERNAL LINK  :)

   
This gives: EXTERNAL LINK 
   
Actually we can see it calls libm.a/pow. But I think a __stdargs is missing from the declaration of the pow function in math.h, since functions in libm.a should use the default ABI (e.g. stack-based) even if -mregparm is present.

Also notice that "jsr XXX;rts" should be optimized to "jra XXX". This doesn't happen here, but it does when the "-mregparm" option is removed. (Still that famous post-stack-setup optimization pass that seems to be missing.)


Stefan "Bebbo" Franke

Posts 139
30 Jul 2019 17:17


Samuel Devulder wrote:

 
$ man pow
  EXTERNAL LINK   
  EXTERNAL LINK    :)

   
  This gives: EXTERNAL LINK   
   
  Actually we can see it calls libm.a/pow. But I think a __stdargs is missing from the declaration of the pow function in math.h, since functions in libm.a should use the default ABI (e.g. stack-based) even if -mregparm is present.
 

 
 
  Grom68k already provided many headers - I 'only' have to put em in...

EDIT: math.h has __stdargs now


Samuel Devulder

Posts 248
30 Jul 2019 21:27


Cool B) Works fine EXTERNAL LINK


Grom 68k

Posts 61
30 Jul 2019 22:42


Samuel Devulder wrote:

      Cool B) Works fine EXTERNAL LINK       

     
      Is __retfp0 active now on pow ? It seems not.
     
      EXTERNAL LINK   
  Edit: -mregparm is active on pow this morning.
 
  -mregparm could be useful too in complex.h :).
 
  EXTERNAL LINK 


Stefan "Bebbo" Franke

Posts 139
31 Jul 2019 07:08


Grom 68k wrote:

Samuel Devulder wrote:

        Cool B) Works fine EXTERNAL LINK       

       
        Is __retfp0 active now on pow ? It seems not.
       
        EXTERNAL LINK   
    Edit: -mregparm is active on pow this morning.
 
  -mregparm could be useful too in complex.h :).
 
  EXTERNAL LINK   

stdlib functions can't use register parameters or fp0 to return something. /shrug?

complex returns more than one fp-register -> a pointer is used
this is not covered by __retfp0 - as the name states: fp0 is ONE register.



Grom 68k

Posts 61
31 Jul 2019 15:28


Stefan "Bebbo" Franke wrote:

 
Grom 68k wrote:

 
Samuel Devulder wrote:

          Cool B) Works fine EXTERNAL LINK         

         
          Is __retfp0 active now on pow ? It seems not.
         
          EXTERNAL LINK     
      Edit: -mregparm is active on pow this morning.
   
    -mregparm could be useful too in complex.h :).
   
    EXTERNAL LINK     
 

 
  stdlib functions can't use register parameters or fp0 to return something. /shrug?
 
  complex returns more than one fp-register -> a pointer is used
  this is not covered by __retfp0 - as the name states: fp0 is ONE register.
 
 

 
  That's why I am trying to use math-68881.h to remove the builtin.
 
  Otherwise, there is something I don't understand: is __stdargs really working? pow now uses fp0 and fp1 as input.
 
  EXTERNAL LINK 
 
  I am only talking about -mregparm for complex.h, since it works on pow. Complex data is converted to 2 fp registers.
 


Stefan "Bebbo" Franke

Posts 139
31 Jul 2019 15:40


Grom 68k wrote:

  Else, i don't understand something, pow use now fp0 and fp1 as input.
 
  EXTERNAL LINK 
 
  I only speak about -mregparm for the complex.h since it works on pow. Complex data is converted as 2 fp registers.

It depends on which headers are live on cex... the old ones without __stdargs or the new ones with __stdargs :-)


Samuel Devulder

Posts 248
31 Jul 2019 17:44


Grom 68k wrote:

pow use now fp0 and fp1 as input.

hmm no it doesn't

            fmove.d fp1,-(sp)
            fmove.d fp0,-(sp)
            jsr _pow
(at the moment on EXTERNAL LINK ) Maybe the header has changed since.


Grom 68k

Posts 61
31 Jul 2019 22:16


Yes, cex improves quickly.

I tried integer division again. It's better.

EXTERNAL LINK
Which is better, move d0,d0 or tst d0? Is tst smaller?

Same question for jpl versus bpl.

Edit: It's not better with /4. It's only better with /2! Why?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Aug 2019 06:50


Gunnar von Boehn wrote:

  How about using a formula like this for the cost?
 
 
  a) 4 per clock cycle
  b) +1 per instruction word
  c) +2 for using memory
 

 
  I think this will create much more balanced code.
  What do you think?

Bebbo, what do you think about this balanced cost proposal?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Aug 2019 06:55


More ideas to make better code:
 
 

  int f(int* ptr)
  {
      int tmp;
      tmp= *ptr &2;
     
      return tmp;
  }
 

 
GCC right now creates this, which is not optimal
 

          move.l (a0),d0        #2      4 4    +d0                                                                       
          and.l #2,d0            #3      0 4    *d0                                               
          rts                    #5      -1 0
 

 

  GCC should produce this:
 
  MOVEQ #2,D0
  AND.L (A0),D0
 

 
We should include the instruction length in the COST model!

I propose we include the memory access and the instruction length in the cost, so that the cost of the current code becomes this:
 


          move.l (a0),d0        #2      4 7    +d0                                                                       
          and.l #2,d0            #3      0 7    *d0                                               
          rts                    #5      -1 0
 

The cost of the better code should be this:
 


          moveq #2,d0            #2      4 5    +d0                                                                       
          and.l (a0),d0          #3      0 7    *d0                                               
          rts                    #5      -1 0
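
As a rough check, the cost annotations in the two listings above agree with the formula proposed earlier in the thread (4 per clock cycle, +1 per instruction word, +2 for using memory). The sketch below is only an illustration: `insn_cost` is a hypothetical helper, not a function from GCC, and the cycle and word counts passed to it are assumptions read off the listings (1 cycle each; `and.l #2,d0` taken as 3 words, the others as 1 word).

```c
/* Hypothetical helper, not part of GCC: cost of one instruction under
 * the formula quoted in this thread:
 *   4 per clock cycle, +1 per instruction word, +2 if memory is used. */
int insn_cost(int cycles, int words, int uses_memory)
{
    return 4 * cycles + words + (uses_memory ? 2 : 0);
}

/* Assumed inputs, read off the listings above:
 *   move.l (a0),d0 : 1 cycle, 1 word,  memory    -> 7
 *   and.l  #2,d0   : 1 cycle, 3 words, no memory -> 7
 *   moveq  #2,d0   : 1 cycle, 1 word,  no memory -> 5
 *   and.l  (a0),d0 : 1 cycle, 1 word,  memory    -> 7
 */
int cost_current(void)  { return insn_cost(1, 1, 1) + insn_cost(1, 3, 0); }
int cost_proposed(void) { return insn_cost(1, 1, 0) + insn_cost(1, 1, 1); }
```

With these inputs the current sequence sums to 14 and the MOVEQ/AND sequence to 12, so the shorter sequence would be the cheaper one under this model.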
 



Stefan "Bebbo" Franke

Posts 139
01 Aug 2019 17:06


Gunnar von Boehn wrote:

More ideas to make better code:
 
 

  int f(int* ptr)
  {
      int tmp;
      tmp= *ptr &2;
     
      return tmp;
  }
 

 
  GCC right now creates this, which is not optimal
 

          move.l (a0),d0        #2      4 4    +d0                                                                       
          and.l #2,d0            #3      0 4    *d0                                               
          rts                    #5      -1 0
 

 
 
  GCC should produce this:
 
  MOVEQ #2,D0
  AND.L (A0),D0
 

 
  We should include the instruction length in the COST model!
 
 
  I would propose we include the Memory access and the length in the cost so that the cost should be this:
 

          move.l (a0),d0        #2      4 7    +d0                                                                       
          and.l #2,d0            #3      0 7    *d0                                               
          rts                    #5      -1 0
 

 
 
  The cost should be this:
 

          moveq #2,d0            #2      4 5    +d0                                                                       
          and.l (a0),d0          #3      0 7    *d0                                               
          rts                    #5      -1 0
 

I expect a gain of less than 0.1% ...




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
01 Aug 2019 20:30


Stefan "Bebbo" Franke wrote:

  I expect a gain of less than 0.1% ...

This is very easy to count. :-)

                     Clocks  Bytes
  move.l (a0),d0        1      2
  and.l  #2,d0          1      6

versus

  moveq  #2,d0          1      2
  and.l  (a0),d0        0      2  (fused!)

                 

Clocks 2 => 1
Bytes  8 => 4

Twice as fast and half the size.
I would call this a great improvement.

This tuning is possible for very many operations
not only for AND but also ADD/SUB/OR/EOR/...




Stefan "Bebbo" Franke

Posts 139
01 Aug 2019 20:33


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  I expect a gain of less than 0.1% ...
 

 
  This is very easy to count. :-)
 
 
                   Clocks  Byte  
    move.l (a0),d0        1      2                                                                       
    and.l  #2,d0          1      6                                   
 
  versus
 
  moveq  #2,D0          1      2
  and.l  (a0),D0        0      2  (fused!)
 
                 
 
  Clocks 2 => 1
  Bytes  8 => 4
 
  Twice as fast and half the size.
  I would call this a great improvement.
 
  This tuning is possible for very many operations
  not only for AND but also ADD/SUB/OR/EOR/...
 
 

not for SUB ...
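
One possible reading of this objection (an assumption about what was meant, not something stated in the thread): loading the immediate first swaps the operand order, which is harmless for a commutative operation such as AND but changes the result for SUB. The C sketch below models both orders; the function names are made up for illustration.

```c
/* Original order: load the memory value, then apply the immediate.
 *   move.l (a0),d0 ; and.l #imm,d0   ->  d0 = mem & imm */
int and_orig(int mem, int imm) { int d0 = mem; d0 &= imm; return d0; }

/* Swapped order: load the immediate, then apply the memory operand.
 *   moveq #imm,d0 ; and.l (a0),d0    ->  d0 = imm & mem */
int and_swapped(int mem, int imm) { int d0 = imm; d0 &= mem; return d0; }

/*   move.l (a0),d0 ; sub.l #imm,d0   ->  d0 = mem - imm */
int sub_orig(int mem, int imm) { int d0 = mem; d0 -= imm; return d0; }

/*   moveq #imm,d0 ; sub.l (a0),d0    ->  d0 = imm - mem (sign flipped) */
int sub_swapped(int mem, int imm) { int d0 = imm; d0 -= mem; return d0; }
```

For example, with mem = 7 and imm = 2 both AND variants give 2, while the SUB variants give 5 and -5.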



Grom 68k

Posts 61
01 Aug 2019 22:14


Gunnar von Boehn wrote:

  For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic.
 
  As the CPU can do 1 memory operation per cycle plus another register operation, instructions should be scheduled accordingly.
  Instead of this
 

  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,D1
  ADDq.l #1,D2
  ADDq.l #1,D3
 

 
  Do this
 

  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  ADDq.l #1,D2
  ADDq.l #1,(a0)+
  ADDq.l #1,D3
 

 
  Such scheduling is also important for AMMX and FPU code.

Hi,

To limit memory usage, I think a memory pipeline can be added.


(define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")

;; simple insns with 1 cycle
(define_insn_reservation "simple" 1 (eq_attr "type" "alu_l")
"i_pipelines, i_ports, i_memory")

Super scalar requirements reduce memory usage.
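
To make the effect of the two orderings quoted above concrete, here is a toy issue model in C. It is a sketch under stated assumptions, not a description of the real 68080 pipeline: instructions issue in order, at most two per cycle, at most one memory operation per cycle, and there are no other dependencies. `issue_cycles` is a hypothetical name.

```c
#include <stddef.h>

/* Toy in-order dual-issue model (an assumption for illustration only).
 * ops is a string where 'M' is an instruction that accesses memory and
 * 'R' is a register-only instruction. Returns the number of cycles. */
int issue_cycles(const char *ops)
{
    int cycles = 0;
    size_t i = 0;

    while (ops[i] != '\0') {
        cycles++;
        /* Pair with the next instruction unless both need the memory port. */
        if (ops[i + 1] != '\0' && !(ops[i] == 'M' && ops[i + 1] == 'M'))
            i += 2;
        else
            i += 1;
    }
    return cycles;
}
```

Under this model the grouped sequence "MMMMRRRR" needs 6 cycles, while the interleaved sequence "MRMRMRMR" needs 4.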



Stefan "Bebbo" Franke

Posts 139
01 Aug 2019 22:40


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

 
Stefan "Bebbo" Franke wrote:

    I expect a gain of less than 0.1% ...
 

 
  This is very easy to count. :-)
 
 
                   Clocks  Byte  
    move.l (a0),d0        1      2                                                                       
    and.l  #2,d0          1      6                                   
 
  versus
 
    moveq  #2,D0          1      2
    and.l  (a0),D0        0      2  (fused!)
 
                 
 
  Clocks 2 => 1
  Bytes  8 => 4
 
  Twice as fast and half the size.
  I would call this a great improvement.
 
  This tuning is possible for very many operations
  not only for AND but also ADD/SUB/OR/EOR/...
 
 
 

 
  not for SUB ...
 

and not for EOR ...



Stefan "Bebbo" Franke

Posts 139
01 Aug 2019 22:45


Grom 68k wrote:

Gunnar von Boehn wrote:

  For the modern CPUs 060/080 we have 2 pipes, so "grouping" of instructions becomes an important topic.
 
  As the CPU can do 1 memory operation per cycle plus another register operation, instructions should be scheduled accordingly.
  Instead of this
 

  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,D1
  ADDq.l #1,D2
  ADDq.l #1,D3
 

 
  Do this
 

  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  ADDq.l #1,D2
  ADDq.l #1,(a0)+
  ADDq.l #1,D3
 

   
  Such scheduling is also important for AMMX and FPU code.
 

 
  Hi,
 
  To limit memory usage, I think a memory pipeline can be added.
 
 

  (define_reservation "i_pipelines" "(i0_pipeline | i1_pipeline)")
 
 
  ;; simple insns with 1 cycle
  (define_insn_reservation "simple" 1 (eq_attr "type" "alu_l")
  "i_pipelines, i_ports, i_memory")
 

 
  Super scalar requirements reduce memory usage.
 

Allowing 2 (or more) insns per cycle takes more effort, since this


  fmul.x fp0,fp1
  ADDq.l #1,(a0)+
  ADDq.l #1,D0
  ADDq.l #1,(a0)+
  ADDq.l #1,D1
  ADDq.l #1,(a0)+
  fmul.x fp1,fp2

would result in a stall, since the five insns in between take only 3 cycles...




Grom 68k

Posts 61
02 Aug 2019 01:23


:(
 
  I thought pipelines would be easier than fusing.
 
  https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#Processor-pipeline-description
 
 
  I found a method for fusing. Does it have the same problem with the insns/cycles count?
 
  EXTERNAL LINK


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
02 Aug 2019 06:32


Stefan "Bebbo" Franke wrote:

  and not for EOR ...
 

Why should it not?


MOVEQ #$F,D0
EOR.L (A0),D0

This works just fine.

The 68K offers a rich selection of instructions.
For different immediate ranges the 68K provides us tuned instructions.
 
Example:


                     Bytes
SUBQ.L #$1,A0          2
SUBA.W #$111,A0        4
SUBA.L #$222222,A0     6

Using the tuned instruction makes programs smaller and increases the Icache hit rate, so it both saves size and increases speed.

BTW 68080 offers this too:


                     Bytes
ADDQ.L  #$1,D0         2
ADDIW.L #$111,D0       4
ADDI.L  #$222222,D0    6
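
The SUBQ/SUBA selection in the first table above can be sketched as a small C function. This is only an illustration: `suba_imm_bytes` is a hypothetical name, and the ranges are the usual 68K ones (1 to 8 for the quick form, a sign-extended 16-bit immediate for the word form).

```c
/* Hypothetical helper: encoded size in bytes of "subtract immediate from
 * an address register" when the shortest of the three forms is chosen. */
int suba_imm_bytes(long imm)
{
    if (imm >= 1 && imm <= 8)
        return 2;   /* SUBQ.L #imm,An: immediate is encoded in the opcode word */
    if (imm >= -32768 && imm <= 32767)
        return 4;   /* SUBA.W #imm,An: opcode word + 16-bit immediate */
    return 6;       /* SUBA.L #imm,An: opcode word + 32-bit immediate */
}
```

For the three immediates in the table, $1, $111 and $222222, this returns 2, 4 and 6.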



Grom 68k

Posts 61
02 Aug 2019 06:59


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  and not for EOR ...
 
 

  Why should it not?

It's simply not in the PDF list for now.
