
GCC Improvement for 68080

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
02 Aug 2019 08:02


Grom 68k wrote:

Gunnar von Boehn wrote:

 
Stefan "Bebbo" Franke wrote:

    and not for EOR ...
   
 

  Why should it not?
 

 
  It's simply not in the PDF list for now.

Are you talking about the FUSING of the 68080?
Yes, the 68080 has the advantage of fusing,
which makes this code even faster.

But even without the fusing the MOVEQ version has a clear advantage.

The 68K offers instructions to make programs dense and small.
Using them as well as possible can give us a clear advantage over others.

For the 68020 and 68030, writing "dense" code is very important for performance as their Icaches are super tiny.
On the 68020 and 68030 every BYTE of instruction costs 1 clock cycle.
So 8 versus 4 bytes saves 4 bytes and also makes the code 4 cycles faster.

One might think that the bigger Icache of the 68060 makes this tuning redundant.
In fact, such tuning is crucial for the 68060 too.
The reason is that the 68060 is limited by the Icache-to-decoder fetch bandwidth. This means that on the 68060, too, saving these 4 bytes results in clearly faster program code.

Our new 68080 has both a much bigger cache and more fetch bandwidth.
Nevertheless I think we should aim to compile code in such a way that it's tuned well and runs better on all 68K cores.

I think that adding the BYTE length to the COST model of GCC will help a lot to make better code for all 68K models.
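
To illustrate what a byte-length term in a cost model could look like, here is a minimal C sketch. It is not GCC code: the function name `sub_imm_areg_bytes` is made up for this example, and it only covers subtracting a constant from an address register, using the standard encoded sizes of SUBQ.L, SUBA.W and SUBA.L.

```c
#include <stdint.h>

/* Hypothetical helper (not part of GCC): encoded size in bytes of the
   shortest 68K instruction that subtracts the constant imm from an
   address register. */
static int sub_imm_areg_bytes(int32_t imm)
{
    if (imm >= 1 && imm <= 8)
        return 2;   /* SUBQ.L #imm,An */
    if (imm >= INT16_MIN && imm <= INT16_MAX)
        return 4;   /* SUBA.W #imm,An (16-bit source is sign-extended) */
    return 6;       /* SUBA.L #imm,An */
}
```

A cost function could add such a size to the cycle estimate, so that of two sequences with equal cycle counts the shorter one wins.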


Stefan "Bebbo" Franke

Posts 139
02 Aug 2019 08:04


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  and not for EOR ...
 
 

  Why should it not?
 
 
  MOVEQ #$F,D0
  EOR.L (A0),D0
 

  This works just fine.

operands mismatch -- statement `eor.l (a0),d0' ignored

Gunnar von Boehn wrote:

  The 68K offers a rich selection of instructions.
  For different immediate ranges the 68K provides us tuned instructions.
 
  Example:
 

                        Bytes
  SUBQ.L #$1,A0         2
  SUBA.W #$111,A0       4
  SUBA.L #$222222,A0    6
 

 
  Using the tuned instructions makes programs smaller and increases the Icache hit rate, so size is saved and speed is increased.
 
  BTW 68080 offers this too:
 

                        Bytes
  ADDQ.L  #$1,D0        2
  ADDIW.L #$111,D0      4
  ADDI.L  #$222222,D0   6
 

and for sub:

  a - b != b - a




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
02 Aug 2019 08:25


Hi Stefan

Your CEX website does not work atm. Can you fix this?

What code is created today for this?
int A,B;

A = B ^ 0xf;   /* EOR */

Does GCC use MOVEQ here?

Obviously SUB has a direction, same as DIV. :-)
MUL could always use the MOVEQ.
MOVEQ can also be used to zero-extend an operand.

unsigned char a;
unsigned long b;
b = a;
=>
moveq  #0,D1
move.b D0,D1

Regarding SUB:
B= 100-A;
What code does GCC create here?
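
As a small illustration of the MOVEQ uses mentioned above, here is a C model of the register data flow. The helper names are invented for this sketch. It assumes only the documented behaviour that MOVEQ sign-extends its 8-bit immediate to 32 bits and that MOVE.B replaces just the low byte of the destination register. The SUB sequence shown is one possible encoding, not necessarily what GCC emits.

```c
#include <stdint.h>

/* MOVEQ #imm8,Dn: the 8-bit immediate is sign-extended to 32 bits. */
static uint32_t moveq(int8_t imm8)
{
    return (uint32_t)(int32_t)imm8;
}

/* MOVE.B src,Dn: only the low byte of Dn changes. */
static uint32_t move_b(uint32_t dn, uint8_t src)
{
    return (dn & 0xFFFFFF00u) | src;
}

/* b = a; for an unsigned char a:  moveq #0,D1 ; move.b D0,D1 */
static uint32_t zero_extend_byte(uint8_t a)
{
    return move_b(moveq(0), a);
}

/* B = 100 - A; as one possible sequence:  moveq #100,D1 ; sub.l D0,D1 */
static uint32_t hundred_minus(uint32_t a)
{
    return moveq(100) - a;
}
```

Because the constant sits in the destination register, the subtraction keeps the right direction without needing a NEG afterwards.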



Stefan "Bebbo" Franke

Posts 139
02 Aug 2019 09:40


Gunnar von Boehn wrote:

Hi Stefan
 
  Your CEX website does not work atm. Can you fix this?

 
You need to subscribe to a better service level^^

Gunnar von Boehn wrote:
 
  What code is created today for this?
  int A,B;
 
  A = B ^ 0xf;   /* EOR */
 
  Does GCC use MOVEQ here?
 
  Obviously SUB has a direction, same as DIV. :-)
  MUL could always use the MOVEQ.
  MOVEQ can also be used to zero-extend an operand.
 
  unsigned char a;
  unsigned long b;
  b = a;
  =>
  moveq  #0,D1
  move.b D0,D1
 
 
  Regarding SUB:
  B= 100-A;
  What code does GCC create here?
 

EXTERNAL LINK 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
02 Aug 2019 09:55


Stefan "Bebbo" Franke wrote:

  EXTERNAL LINK 

Excellent!
And I see you found a nice solution also for doing SUB with MOVEQ.
The sign bit in MOVEQ is pretty handy. :-D

Did you do the MOVEQ with peephole optimization or with general COST understanding?


Samuel Devulder

Posts 248
02 Aug 2019 22:30


Samuel Devulder wrote:

On my v2 the exe works at least ~30secs without issues (it is the time it takes for "timedemo demo1"), but yeah, there are still a couple of issues with gcc650b. It is a work in progress. (..)

For information, I confirm the crash of quake.gcc-6.5.0b.080 after a couple of minutes (when diving into the water in demo2 when you let the game idle from startup). I also noticed that the "-usepub" option doesn't work.

I spent a long time analyzing a disassembled version and found an issue present in my exe. It affects >= 68040 code, but I'm unsure whether it makes Quake crash when diving. However, it does bad things to memory, which possibly results in a later crash of the exe. The issue has been reported to Bebbo ( EXTERNAL LINK ).



Grom 68k

Posts 61
03 Aug 2019 07:26


Don Adan wrote:

Samuel Devulder wrote:

 
Grom 68k wrote:
  With -mtune=68080, it works well. :)

  Doesn't seem so: EXTERNAL LINK
  You probably meant: -mtune=68030 (680-thirty) EXTERNAL LINK
 
 
lsl works well now. Can moveq be removed?

  Moveq is mandatory since lsl #n is limited to n<=8 AFAIK.
 

 
  Not exactly. Two lsl.l can be used too. But I don't know what is better for the 68080. Advantage: no trash register is necessary.

Hi,

The latest optimisations produce 2 lsl instead of moveq + lsl :)

Regards


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
03 Aug 2019 07:38



  Hi Bebbo,
  Hi Sam,
 
 
  A:
    move.l (4,sp),d0
    move.l ([_tab],d0.l*4),a0
 

 
  is not equivalent to
 
 
  B:
    move.l (4,sp),a0
    add.l a0,a0                  <== here
    move.l ([_tab,a0],a0.l),a0  <= and there!
 

 
  _tab is a Pointer to a Pointer.
  The code wants to access the Nth element of the final pointer.
 
  Code B will dereference the pointer wrongly.
  The correct code dereferences the pointer first and only then adds the scaled index.
 
  Let's put numbers to this to make it clear.
  Let's say parameter X = $10
  Let's say the address of _tab = $100
  Let's say the content of _tab = $400
 
  The correct code will READ_MEMORY at ADDR $100,
  retrieve the value $400, then ADD 4*$10 = $40 to get $440,
  and READ memory at addr $440.
 
  The wrong code will READ MEMORY at $100+2*$10 = $120
  It will read some trash value from there and ADD to it $20
  And then do memory access to a random location.

Or in other words

*ptr + 4

Is not the same as

(*(ptr+2)) + 2

 
The double-indirect memory modes of the 68K are complex.
I had the impression before that GCC makes mistakes with them.
Is there a compile flag to globally disable them?
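
The numbers above can be replayed in a small C simulation. This is only a sketch: the toy memory, the helper names, and the "trash" values placed at $120 and $320 are assumptions made for the example. Only X = $10, _tab = $100 and the content $400 come from the post.

```c
#include <stdint.h>

#define MEM_WORDS 0x200   /* toy memory: 0x200 longwords = 0x800 bytes */

/* Read a 32-bit longword from the toy memory at a byte address. */
static uint32_t read_long(const uint32_t *mem, uint32_t addr)
{
    return mem[addr / 4];
}

/* Fill the toy memory with the values from the post plus assumed trash. */
static void setup_example(uint32_t *mem)
{
    mem[0x100 / 4] = 0x400;   /* content of _tab (from the post) */
    mem[0x440 / 4] = 0xCAFE;  /* element X of the table (assumed value) */
    mem[0x120 / 4] = 0x300;   /* trash that code B picks up (assumed) */
    mem[0x320 / 4] = 0xDEAD;  /* what code B finally reads (assumed) */
}

/* Code A:  move.l ([_tab],d0.l*4),a0 */
static uint32_t code_a(const uint32_t *mem, uint32_t tab, uint32_t x)
{
    uint32_t base = read_long(mem, tab);   /* fetch the pointer at _tab */
    return read_long(mem, base + 4 * x);   /* then index with x*4 */
}

/* Code B:  add.l a0,a0 ; move.l ([_tab,a0],a0.l),a0 */
static uint32_t code_b(const uint32_t *mem, uint32_t tab, uint32_t x)
{
    uint32_t a0 = x + x;                        /* add.l a0,a0 */
    uint32_t base = read_long(mem, tab + a0);   /* reads $120, not $100 */
    return read_long(mem, base + a0);           /* trash + $20 */
}
```

With `tab = 0x100` and `x = 0x10`, `code_a` reads the longword at $440, while `code_b` first reads $120 and ends up at an unrelated address.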


Samuel Devulder

Posts 248
03 Aug 2019 11:18


Yeah, absolutely. I think it is the index-multiplier optimization which is broken (D0.l*4). GCC tries to mimic this with A0 in an optimized way (though this doesn't gain any cycle) but is wrong in thinking that ([XXX,a0],a0) is the same as ([XXX],A0*2):
   

        add.l a0,a0                  ; A0*2, okay
        move.l ([_tab,a0],a0.l),a0 
                        \__/ GCC thinks this makes A0*2 too !
   

So not only is the index*4 optimisation wrong, but why doesn't it simply use the "*4" multiplier on A0.L in the first place? I think this <EA> is possible: ([XXX],A0.l*4).
 
Since this code appears when compiling for >=68040 but not for 68030, I think this is probably related to a special treatment of index multiplication because of its cost on 68040 or 68060.


Stefan "Bebbo" Franke

Posts 139
03 Aug 2019 15:55


Samuel Devulder wrote:

  Yeah, absolutely. I think it is the index-multiplier optimization which is broken (D0.l*4). GCC tries to mimic this with A0 in an optimized way (though this doesn't gain any cycle) but is wrong in thinking that ([XXX,a0],a0) is the same as ([XXX],A0*2):
     

          add.l a0,a0                  ; A0*2, okay
          move.l ([_tab,a0],a0.l),a0 
                          \__/ GCC thinks this makes A0*2 too !
     

  So not only is the index*4 optimisation wrong, but why doesn't it simply use the "*4" multiplier on A0.L in the first place? I think this <EA> is possible: ([XXX],A0.l*4).
   
  Since this code appears when compiling for >=68040 but not for 68030, I think this is probably related to a special treatment of index multiplication because of its cost on 68040 or 68060.
 

 
  I stopped questioning why gcc does this or that...
  ... it's always the costs or my messing around
 
  ... and I fixed this one to print the correct asm:
 

    move.l ([a0,a0.l],_tab),a0
 

 
EDIT: LOL - still wrong^^


Samuel Devulder

Posts 248
03 Aug 2019 16:24


Yes, still wrong: [X,Y] isn't (X+Y) but *(X+Y). Like Gunnar said: double indirection is evil. The good code would be "move.l ([_tab],A0.l*4),A0" or, if A-regs don't support the multiplier, the 68030 version with D0 as index-reg.


Stefan "Bebbo" Franke

Posts 139
03 Aug 2019 20:19


Question:

Is there a case where double indirect is useful?


Samuel Devulder

Posts 248
03 Aug 2019 21:19


When you lack an address register, maybe? Otherwise, an explicit double indirection like this

        move.l (4,sp),d0          ; getting index
        move.l _tab,a0            ; 1st indirection
        move.l (a0,d0.l*4),a0     ; 2nd indirection

works fine. It might also allow other instructions to be scheduled between the indirection steps, or let the 1st indirection be factored out, and so be faster in the end.
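
At the C level, the two-step form corresponds roughly to the sketch below. The names `tab`, `lookup` and `count_non_null` are hypothetical, with `tab` standing in for `_tab`. Note that the `*4` scale matches 4-byte pointers on the 68K; on a host with 8-byte pointers the compiler scales by 8 instead.

```c
#include <stddef.h>

/* Hypothetical global standing in for _tab: a pointer to a table of pointers. */
static void **tab;

/* Single lookup: two explicit indirections. */
static void *lookup(long index)
{
    void **table = tab;    /* 1st indirection:  move.l _tab,a0 */
    return table[index];   /* 2nd indirection:  move.l (a0,d0.l*4),a0 */
}

/* In a loop the 1st indirection can be done once, outside the loop. */
static size_t count_non_null(long n)
{
    void **table = tab;    /* 1st indirection, factored out */
    size_t count = 0;
    for (long i = 0; i < n; i++) {
        if (table[i] != NULL)   /* only the 2nd indirection remains per step */
            count++;
    }
    return count;
}
```

A single memory-indirect instruction would redo the first load on every iteration, which is the factoring opportunity mentioned above.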


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
04 Aug 2019 05:07


Samuel Devulder wrote:

  if A-regs don't support the multiplier,
 

 
Yes, both D-Reg and A-reg support the Index-multiplier.
And the Index-multiplier for them is free in this instruction!
The multiplier costs no extra cycle and it costs no extra byte.
That GCC did not use the free Index-Multiplier but added extra instructions instead is really "GAGA".




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
04 Aug 2019 05:35


Stefan "Bebbo" Franke wrote:

Question:
 
Is there a case where double indirect is useful?

 
"Useful" might depend on viewpoint.
 
"Needed" is much easier to answer: It's clearly not needed.
 
You can always get the same result using 2 instructions.
Double indirect is slow. It's not faster than using 2 instructions.
Double indirect increases instruction length.
It's often the same size as using 2 instructions.
Double indirect uses an "internal" TMP-Address register, so you do not
need an architectural register for this.
 
Double indirect blocks Super-Scalar execution, so in the bigger picture it's not good for Super-Scalar CPUs like the 68060 and 68080.
Performance-wise the 68060 and 68080 work better without it.
 
I think as long as GCC makes errors on Double-Indirect, it's a big risk.

If we look at performance or tuning options, then we could talk about other options:
Maybe -fomit-frame-pointer should be the default option.
The LINK instructions needed for the frame pointer logic increase code size, cost extra cycles and cause extra memory traffic.

I think GCC is not only used by "experts" but also by hobby people who take some source from the internet and compile it.
Those hobby users should ideally get a good executable with the default options.




Stefan "Bebbo" Franke

Posts 139
04 Aug 2019 07:36


Gunnar von Boehn wrote:

Samuel Devulder wrote:

  if A-regs don't support the multiplier,
 

 
  Yes, both D-Reg and A-reg support the Index-multiplier.
  And the Index-multiplier for them is free in this instruction!
  The multiplier costs no extra cycle and it costs no extra byte.
  That GCC did not use the free Index-Multiplier but added extra instructions instead is really "GAGA".

It only depends which optimizer passes are enabled at a distinct level.

Depending on the CPU costs, the *4 is converted into add,add or lsl #2 (the 68040 shifts slower). Then you need an optimizer pass which converts and combines it into an address index with multiplier.

Next there are complaints that -O0 does not optimize either...
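
The equivalence behind these variants can be written down in C. This is only an illustration of the arithmetic, not of GCC's passes, and the function names are made up for the example.

```c
#include <stdint.h>

/* x*4 as two additions:  add.l d0,d0 ; add.l d0,d0 */
static uint32_t times4_add(uint32_t x)
{
    x += x;
    x += x;
    return x;
}

/* x*4 as a shift:  lsl.l #2,d0 */
static uint32_t times4_shift(uint32_t x)
{
    return x << 2;
}

/* Effective address of an indexed access (a0,d0.l*4): the scaling is
   part of the addressing mode, so no separate instruction is needed. */
static uint32_t indexed_address(uint32_t base, uint32_t index)
{
    return base + index * 4;
}
```

All three forms compute the same offset, so a later pass can fold either explicit form back into the scaled index mode when the result is only used as an address.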



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
04 Aug 2019 08:08


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

 
Samuel Devulder wrote:

    if A-regs don't support the multiplier,
   

   
  Yes, both D-Reg and A-reg support the Index-multiplier.
  And the Index-multiplier for them is free in this instruction!
  The multiplier costs no extra cycle and it costs no extra byte.
  That GCC did not use the free Index-Multiplier but added extra instructions instead is really "GAGA".
 

 
  It only depends which optimizer passes are enabled at a distinct level.

Using INDEX EA mode allows a free MUL.
If the compiler does not use this but does the MUL "by hand" with extra instructions, then this shows a missing understanding of the 68K ISA.

Stefan "Bebbo" Franke wrote:

  Next there are complaints that -O0 does not optimize either...

Actually my proposal was based on experience and very serious.

Take a look at AMINET, there you will find many programs compiled by "hobby" porters.
Many programs are simple recompiles like LAME or others.
If you look at their "compile quality" then you will see that very many of these "AMIGA PORTS" are plain compiles without -O, without -fomit-frame-pointer, even without "strip".

Yes, this means that many of the Amiga porters lack understanding of GCC and that the created programs uploaded to AMINET could be twice as fast if people better understood how to use compile parameters.

Now I wonder what is easier: make the default parameters "more optimal" or educate all Amiga hobby porters?




Stefan "Bebbo" Franke

Posts 139
04 Aug 2019 08:26


Gunnar von Boehn wrote:

 
Stefan "Bebbo" Franke wrote:

  Question:
   
  Is there a case where double indirect is useful?
 

   
  "Useful" might depend on viewpoint.
   
  "Needed" is much easier to answer: It's clearly not needed.
   
  You can always get the same result using 2 instructions.
  Double indirect is slow. It's not faster than using 2 instructions.
  Double indirect increases instruction length.
  It's often the same size as using 2 instructions.
  Double indirect uses an "internal" TMP-Address register, so you do not
  need an architectural register for this.
  [/cost]

  so it saves a register and is not slower
 
 
Gunnar von Boehn wrote:
 
  Double indirect blocks Super-Scalar execution, so in the bigger picture it's not good for Super-Scalar CPUs like the 68060 and 68080.
  Performance-wise the 68060 and 68080 work better without it.
 

 
  If I read "Super" I expect something better...
  ... luckily this is handled via costs. Simply make double indirect more expensive
 
 
Gunnar von Boehn wrote:
 
  I think as long as GCC makes errors on Double-Indirect, it's a big risk.
 

 
  Using my branch is a risk. So you better go back using the stock version or you do your own branch.
 
 
 
Gunnar von Boehn wrote:

  If we look at performance or tuning options then we could talk about other options:
  Maybe -fomit-frame-pointer should be the default option.
  The LINK instructions needed for the frame pointer logic increase code size, cost extra cycles and cause extra memory traffic.
 
  I think GCC is not only used by "experts" but also by hobby people who take some source from the internet and compile it.
  Those hobby users should ideally get a good executable with the default options.
 

 
  different people have different expectations...
 
  ... maybe hobby people want to debug their code... then the frame pointer is mandatory.
 
  ... and a super processor should be able to implement link/unlink more efficiently. On modern *86 CPUs the frame pointer version is the faster one. And sometimes this is also true for 68k.
 
  And you can manage all of this in your own fork.
 


Markus B

Posts 209
04 Aug 2019 08:48


Oh, how did this discussion get out of hand?


Mr Niding

Posts 459
04 Aug 2019 09:28


Ya, that was odd. I was reading the thread, and refreshed, and then "removed user".
