Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
Information about the Apollo CPU and FPU.

GCC Improvement for 68080page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 08:33


Salut Thellier!
 
  In which regard is the created code better?
 
  Maybe I should explain something:
 
  CPU have several options to create a LOOP.
 
  Option a) Using a counter variable.
  The counter counts down, and the CPU loops until it reaches the end.
 
 
  Simple Example code:
 
 

    while(n) {
      *dst++=0x00;
      n--;
    }
 

 
 
  Options how this can be done:
 
 

  LOOP:
    CLR.B (A0)+
    DBRA  D0,LOOP
 

 
  or
 

  LOOP:
    CLR.B (A0)+
    SUBQ.L #1,D0
    BNE    LOOP
 

 
  another option is to LOOP until a certain End Address is reached
 

    move.l A0,A1
    add.l D0,A1
  LOOP:
    CLR.B (A0)+
    CMPA.L A1,A0
    BNE    LOOP
 

 
 
  Those loops look identical but they are NOT!
 
  Disadvantage number 1:
  Register Dependences
 
  Loop A) and B) use independent operations.
  The CLR and the DBRA use not the same register.
  There are no dependencies.
  This means a Super Scalar CPU can execute them in parallel.
  This is GOOD.
 
 
  The 3rd Loop code (C) creates a dependency between the CLR and the CMPA. This means they can NOT be executed in parallel.
  This means LOOP code (c) is by design twice as slow!
 
 
  Disadvantage number 2 - branch prediction.
  A CPU has to predict at Branch/Loop instruction
  and has to decide predict whether to LOOP or not.
  The end of the counting down LOOP instructions is "foreseeable"
  A CPU can "SNOOP" the condition of the counter one in advance
  and can do such LOOP without misprediction!
  This can NOT be done for the LOOP C version.
  This means Loop C will get a misprediction penalty of severly clocks.
 
  Many good CPUs like IBM POWER or AC68080 are able to predict the end of a counting down loop correctly and do such Loops without any mis-prediction at all.
 
 
  To sum this up. Loop construct (c) is for long loop twice as slow,
  and for short loops even more than twice slower.
 
 
  GCC 2.9 was creating Loop (A) code.
  New GCC versions seem to prefer Loop (C) code.
 
  The new GCC version seem to be blind to the negative register dependency they created in Loop(c)
  and seems to be unaware that CPUs can predict the end of a count down loop - which they kill with using version (c)
 
 
  Maybe Bebbo has an idea how to fix this?


Stefan "Bebbo" Franke

Posts 139
25 Jun 2019 09:34


Gunnar von Boehn wrote:
 
  Maybe Bebbo has an idea how to fix this?

the current m68k-amigaos-gcc recognizes this as builtin and calls memset.


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 09:50


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:
 
    Maybe Bebbo has an idea how to fix this?
 

 
  the current m68k-amigaos-gcc recognizes this as builtin and calls memset.

:-)
Haha nice joke

Of course this just an easy to read example which people can follow.

The problem we talk about is what LOOP constructs GCC tries to use.
Not just for memset() but in general for all code.

GCC 2.9 used often DBRA, which was a good choice.

New GCC versions seem to often use "CMPA BNE".
CMPA has the two disadvantages we explained earlier.

I dont know why GCC changed this.
Do you have an explain for this?

Or more important can we teach GCC to better use do the countdown code like "SUBQ.L #1,D0 BNE" or DBRA?




Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 10:06


For better understanding, lets look at real code:
 
 

  char strCHECK(char *src,short n)
      {
        char c =0x0;
        while(n) {
            c+=*src++;
            n--;
        }
      return c;
      }
 

 
 
 
  GCC 6.5b using -Os
 

  _strCHECK:
          clr.b d1
          add.w a0,d0
  .L2:
          cmp.w a0,d0
          jne .L3
          move.b d1,d0
          rts
  .L3:
          add.b (a0)+,d1
          jra .L2
 

 
 
  GCC 6.5b -O2
 

  _strCHECK:
          move.w d0,d1
          jeq .L1
          clr.b d0
  .L3:
          add.b (a0)+,d0
          subq.w #1,d1
          jne .L3
  .L1:
          rts
 

 
  The code created with Os is a lot slower and its bigger. :-(
 
 
  If I change the counter from SHORT to INT then GCC 6.5b
  will also for the -O2 compile choose the CMPA LOOP.
 
  I see result like this:
 
 
 

  -Os
  _strCHECK:
          clr.l d1
          move.l d0,a1
          move.b d1,d0
  .L2:
          cmp.l d1,a1
          jne .L3
          rts
  .L3:
          add.b (a0,d1.l),d0
          addq.l #1,d1
          jra .L2
 
   
 
  or
 
 

  -O2
  _strCHECK:
          tst.l d0
          jeq .L1
          move.l a0,d1
          add.l d0,d1
          clr.b d0
  .L3:
          add.b (a0)+,d0
          cmp.l a0,d1
          jne .L3
  .L1:
          rts
 

 
 

Bebbo if you look at this you will see the 2 issues
a) the END ADDRESS Calculation makes the code first of all longer!
As this is needed:
          move.l a0,d1
          add.l d0,d1

And because of the Register dependencies the LOOP will be slower.

That GCC used (A0,D1) EA-mode was also not clever.
This mode is longer and slower than (A0)+ on most 68K.
Its pretty slow on 060.
And it add in addition a ALU-2-EA bubble on 060+

Bebbo do you understand why GCC makes these wrong decisions now?



Stefan "Bebbo" Franke

Posts 139
25 Jun 2019 10:31


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

 
Gunnar von Boehn wrote:
 
    Maybe Bebbo has an idea how to fix this?
 

 
  the current m68k-amigaos-gcc recognizes this as builtin and calls memset.
 

 
  :-)
  Haha nice joke
 
 
  Of course this just an easy to read example which people can follow.
 
  The problem we talk about is what LOOP constructs GCC tries to use.
  Not just for memset() but in general for all code.
 
  GCC 2.9 used often DBRA, which was a good choice.
 
  New GCC versions seem to often use "CMPA BNE".
  CMPA has the two disadvantages we explained earlier.
 
  I dont know why GCC changed this.
  Do you have an explain for this?
 
  Or more important can we teach GCC to better use do the countdown code like "SUBQ.L #1,D0 BNE" or DBRA?

ok - the main cause seems to be -fiv-opts, which is on by default:

  "Optimizes induction variables in loops."

using: m68k-amigaos-gcc -O3  a.c -S -fno-builtin -fno-ivopts

yields:

_foo:
        link.w a5,#0
        move.l (8,a5),a0
        move.l (12,a5),d0
        jeq .L6
.L3:
        clr.b (a0)+
        subq.l #1,d0
        jne .L3
.L6:
        unlk a5
        rts

with my local version.

So the cost for using an address registers needs to be higher.



Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 10:34


Bebbo,
 
Great that you found it!
This loop is so much better now.

If you can generally fix this, I can see that some routines will literally run TWICE as fast now.


Stefan "Bebbo" Franke

Posts 139
25 Jun 2019 11:28


Gunnar von Boehn wrote:

      Lets look at an example:
     
     

      void strncopy(char *dst,const char *src,short n)
      {
        char c;
        for (n;n>=0;n--) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
        }
      }
     
     

BTW: that strncpy is buggy and will copy one byte to much


Thellier Alain

Posts 141
25 Jun 2019 12:31


>In which regard is the created code better?
I am not an expert but see bellow : it seems there are unnecessary instructions with a for()

Compiled with egcs 2.90.27
gcc -Wa,-adhln -g -Wall -m68020-40 -m68881 -noixemul -O3 -c test.c > test-mixed
to generate mixed C & ASM

  285:test.c    //    /*==================================================================*/
  286:test.c    //        void strncopy1(char *dst,const char *src,short n)
  287:test.c    //        {
  1345                .stabd 68,0,287
  1346 0470 4855        pea a5@
  1347 0472 2A4F        movel sp,a5
  1348 0474 226D 0008  movel a5@(8),a1
  1349 0478 206D 000C  movel a5@(12),a0
  1350 047c 202D 0010  movel a5@(16),d0
  1351 0480 3200        movew d0,d1
  288:test.c    //          char c;
  1352                .stabd 68,0,288
  1353                LBB6:
  289:test.c    //          for (n;n>=0;n--) {
  1354                .stabd 68,0,289
  1355 0482 6D0A        jlt L124
  1356                .even
  1357                L126:
  290:test.c    //            c=*src++;
  1358                .stabd 68,0,290
  1359 0484 1018        moveb a0@+,d0
  291:test.c    //            *dst++=c;
  1360                .stabd 68,0,291
  1361 0486 12C0        moveb d0,a1@+
  292:test.c    //            if( c==0x0 ) break;
  1362                .stabd 68,0,292
  1363 0488 6704        jeq L124
  1364                .stabd 68,0,289
  1365 048a 5341        subqw #1,d1
  1366 048c 6AF6        jpl L126
  1367                L124:
  293:test.c    //          }
  294:test.c    //        }
  1368                .stabd 68,0,294
  1369                LBE6:
  1370 048e 4E5D        unlk a5
  1371 0490 4E75        rts
  1382                Lscope3:
  1384                .even
  1385                .globl _strncopy2
  1386                _strncopy2:
  295:test.c    //    /*==================================================================*/
  296:test.c    //        void strncopy2(char *dst,const char *src,short n)
  297:test.c    //        {
  1387                .stabd 68,0,297
  1388 0492 4855        pea a5@
  1389 0494 2A4F        movel sp,a5
  1390 0496 226D 0008  movel a5@(8),a1
  1391 049a 206D 000C  movel a5@(12),a0
  1392 049e 202D 0010  movel a5@(16),d0
  1393 04a2 3200        movew d0,d1
  298:test.c    //          char c;
  1394                .stabd 68,0,298
  1395                LBB7:
  299:test.c    //          while(n) {
  1396                .stabd 68,0,299
  1397 04a4 670E        jeq L131
  1398                .even
  1399                L132:
  300:test.c    //            c=*src++;
  1400                .stabd 68,0,300
  1401 04a6 1018        moveb a0@+,d0
  301:test.c    //            *dst++=c;
  1402                .stabd 68,0,301
  1403 04a8 12C0        moveb d0,a1@+
  302:test.c    //      n--; 
  1404                .stabd 68,0,302
  1405 04aa 5341        subqw #1,d1
  303:test.c    //            if( c==0x0 ) break;
  1406                .stabd 68,0,303
  1407 04ac 4A00        tstb d0
  1408 04ae 6704        jeq L131
  304:test.c    //          }
  1409                .stabd 68,0,304
  1410 04b0 4A41        tstw d1
  1411 04b2 66F2        jne L132
  1412                L131:
  305:test.c    //        } 
  1413                .stabd 68,0,305
  1414                LBE7:
  1415 04b4 4E5D        unlk a5
  1416 04b6 4E75        rts
  1427                Lscope4:


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
25 Jun 2019 13:20


thellier alain wrote:

>In which regard is the created code better?
  I am not an expert but see bellow : it seems there are unnecessary instructions with a for()

 


// FOR() CODE
FOR:
  moveb a0@+,d0
  moveb d0,a1@+
  jeq END
  subqw #1,d1
  jpl FOR
END:
= 5 instructions


// WHILE(n)
WHILE:
    moveb a0@+,d0
    moveb d0,a1@+
    subqw #1,d1
    tstb d0
    jeq END
    tstw d1
    jne WHILE
END:
= 7 instructions

Actually in your example the WHILE LOOP is longer. :-(
7 versus 5 instructions
While our "goal" was to have 2 instructions in the LOOP

LOOP
  MOVE.B (A0)+,(A1)+
  dbeq  LOOP

Maybe more important than GCC 2.9 will be a modern GCC like GGC 6.0
If we can improve 68K code generation on GCC 6, then this would help AMIGA and ATARI a lot.



Thellier Alain

Posts 141
26 Jun 2019 14:45


>Actually in your example the WHILE LOOP is longer. :-(

Houpps you are absolutely right I was too tired the day I posted that :-/



Michael R

Posts 281
26 Jun 2019 18:12


thellier alain wrote:

        char c;
        for (n;n>=0;n--) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
        }
 
  writing a loop with C this way generate a better ASM
 
        char c;
        while(n) {
          c=*src++;
          *dst++=c;
          if( c==0x0 ) break;
          n--;
        }

Thellier Alain,

It's great that gcc is being optimized for Apollo 68080.

Also, I have a question about some sample code you wrote called DT2HAM. May I revise parts of it and re-release it?


Thellier Alain

Posts 141
27 Jun 2019 09:09


Hello Michael

Feel free to modify it
EXTERNAL LINK  but just release the updated sources too
If you broke compatibility with non 68080 machines then release it as a new package

DatatypeToHam was designed for picture quality so it eats lots of memory (not a problem as I used it in WinUAE)
There are several options: if you have a good colors vision (like me) you can see the differences else let default

Especially this part eats memory
//allocate a float buffer to store 'errors'
  RGBfs= (float *)malloc(HamHigh*HamLarge*3*sizeof(float));

It should better use some kinda "sliding buffer" but not easy to implement ... anyway not a problem on a Vampire

Alain




Michael R

Posts 281
27 Jun 2019 11:07


thellier alain wrote:

Hello Michael
 
  Feel free to modify it
  EXTERNAL LINK  but just release the updated sources too
  If you broke compatibility with non 68080 machines then release it as a new package
 
  DatatypeToHam was designed for picture quality so it eats lots of memory (not a problem as I used it in WinUAE)
  There are several options: if you have a good colors vision (like me) you can see the differences else let default
 
  Especially this part eats memory
  //allocate a float buffer to store 'errors'
  RGBfs= (float *)malloc(HamHigh*HamLarge*3*sizeof(float));
 
  It should better use some kinda "sliding buffer" but not easy to implement ... anyway not a problem on a Vampire
 
 
  Alain
 
 
 

Ok. Thank you. I haven't compiled it yet for the 68080. I should do that and test it on my A500 equipped with a Vampire V2. I use AmiDevCpp but I'll try making it with GCC.

I made some modifications to your original code in my test program. I compiled it for AROS and it worked. I needed some HAM conversion code for one of my graphics programs for Icaros. I wrote all the ILBM save functions including SaveHAMPic but I needed HAM conversion code. I would have needed to write everything till I found your program.

I had some difficulty with the French comments. The main issues were the ILBM header information and it doesn't seem to add padding bytes to the end of scanlines. I'll release the source as well and give credit to you as the original author. I'll compile it for Amiga as well to test on 68080.

I saw that you put your name on the Floyd Steinberg function. Is the contact information for you correct. My graphics program is ShowPicture. You can find me as "miker" on Aros-Exec.

I'm working on another grahics program converting RayStorm 3D Modeler m68k to work on Icaros Desktop. It would be nice to save rendered images to HAM6 or HAM8 for that as well. Thank you.

I'll be using GCC 4.8.3 but if newer versions of GCC are being optimized for 68080 I'll try that too.




Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 06:12


Ok, I made some changes which are now live:
 
 
  EXTERNAL LINK 


Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 08:37


Gunnar von Boehn wrote:

    Patching GNU ASM is easy.
    We added support for APOLLO instructions to it already.

And I'm still looking for that GNU ASM patch...


Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 08:41


thellier alain wrote:

  >Actually in your example the WHILE LOOP is longer. :-(
   
    Houpps you are absolutely right I was too tired the day I posted that :-/
   
 

 
  the while loop is stoll longer with my recent changes, now it's
 
    for:  3.000045 instructions
  while:  4      instructions
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
29 Jun 2019 08:47


Stefan "Bebbo" Franke wrote:

Ok, I made some changes which are now live:

Great work Bebbo!

I think this improves significantly loop code generation for 68K family.

Bebbo, regarding the GCC FPU code generation.
Will teaching GCC to use little bit more parallelism on FPU instruction take you long?
 
Can you make a short C code example showing the effect/difference?

"Vielen Dank" in advance


Gunnar von Boehn
(Apollo Team Member)
Posts 6197
29 Jun 2019 09:44


Stefan "Bebbo" Franke wrote:

Gunnar von Boehn wrote:

    Patching GNU ASM is easy.
    We added support for APOLLO instructions to it already.
 

 
  And I'm still looking for that GNU ASM patch...

Bebbo, the batch is not complete.
Its only some of the normal instructions like: "TOUCH" or "ADDIW.L" or "HINT".
More work would be needed to enable e.g the more register.
 




Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 09:53


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

  Ok, I made some changes which are now live:
 
 

 
  Great work Bebbo!
 
  I think this improves significantly loop code generation for 68K family.
 
 
  Bebbo, regarding the GCC FPU code generation.
  Will teaching GCC to use little bit more parallelism on FPU instruction take you long?
 
  Can you make a short C code example showing the effect/difference?
 
  "Vielen Dank" in advance

First I need more information about all of this.

Which cpu - of the ones implemented in gcc - is the closest match and a good starting point?

Some cold fire? one of i386? mips? sparc? ...?


Stefan "Bebbo" Franke

Posts 139
29 Jun 2019 09:56


Gunnar von Boehn wrote:

Stefan "Bebbo" Franke wrote:

 
Gunnar von Boehn wrote:

      Patching GNU ASM is easy.
      We added support for APOLLO instructions to it already.
 

 
  And I'm still looking for that GNU ASM patch...
 

 
  Bebbo, the batch is not complete.
  Its only some of the normal instructions like: "TOUCH" or "ADDIW.L" or "HINT".
  More work would be needed to enable e.g the more register.

A start is a start - is a start - is a ...

Why not fork my binutils-gdb and provide a pull request? That way I can tell gcc to use those insns too and maybe extend objdump to disassemble these insns correctly. Or mail me a patch or the patched files or ...

:-)

posts 367page  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19