Information about the Apollo CPU and FPU. |
|
---|
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 15:08
| Samuel Devulder wrote:
| That's nice :) I suppose __retfp0 has no impact when the returned value is an int.
|
It only affects float/double.
Samuel Devulder wrote:
| Is there a cmd-line switch or pragma to add an implicit __retfp0 to every function (except maybe for functions in math.h)?
|
-mretfp0 is on the way. Functions having __stdargs aren't touched... so it needs tons of updated headers, as with -mregparm. Is someone bored?
| |
| | Grom 68k
Posts 61 11 Jul 2019 15:39
| Stefan "Bebbo" Franke wrote:
| ... needs tons of updated headers, as -mregparm. Is someone bored?
|
If you think modifications can be scripted, I can try to make a python script. Do you have an example ?
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 15:56
| Grom 68k wrote:
| Stefan "Bebbo" Franke wrote:
| ... needs tons of updated headers, as -mregparm. Is someone bored? |
If you think modifications can be scripted, I can try to make a python script. Do you have an example ? |
For all headers in sys-include and its sub folders, and in libnix/include and its sub folders: add __stdargs to functions and function types. The result should look like the declarations in ndk-include/clib/alib_protos.h (I guess there are more files which could use that patch).
clib2/include seems complete, libnix/include is partially done, and sys-include has only a few functions with __stdargs so far.
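As a hedged illustration of the header change described here, a minimal C sketch of what a patched prototype and function type could look like. The names `demo_add`, `demo_apply` and `demo_callback_t` are hypothetical, the placement of `__stdargs` in front of the return type is an assumption (ndk-include/clib/alib_protos.h is the reference for the real convention), and the empty fallback `#define` exists only so the sketch also compiles on toolchains that do not know `__stdargs`.

```c
/* Fallback for compilers that do not provide __stdargs. Assumption: the
   Amiga toolchain already defines it, so this branch is skipped there. */
#ifndef __stdargs
#define __stdargs
#endif

/* Before the patch (hypothetical header line):
 *     int demo_add(int a, int b);
 * After the patch, the prototype carries __stdargs: */
__stdargs int demo_add(int a, int b);

/* Function types get the same marker (hypothetical callback type). */
typedef __stdargs int (*demo_callback_t)(int a, int b);

/* Definition, so the sketch links and can be called. */
__stdargs int demo_add(int a, int b)
{
    return a + b;
}

/* Calling through the function type behaves like a direct call. */
int demo_apply(demo_callback_t fn, int a, int b)
{
    return fn(a, b);
}
```

A script that patches the headers would have to produce both forms: plain prototypes and function-pointer types.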
| |
| | Grom 68k
Posts 61 11 Jul 2019 18:14
| Where is the source?
| |
| | Niclas A (Apollo Team Member) Posts 219 11 Jul 2019 18:22
| Grom 68k wrote:
| Where is the source?
|
EXTERNAL LINK
| |
| | Stefan "Bebbo" Franke
Posts 142 11 Jul 2019 19:02
| not exactly - he's looking for the headers which are in several repos. Easiest: grab the built version, and mail me the fixed headers and I'll put them live
| |
| | Niclas A (Apollo Team Member) Posts 219 11 Jul 2019 19:37
| Stefan "Bebbo" Franke wrote:
| not exactly - he's looking for the headers which are in several repos. Easiest: grab the built version, and mail me the fixed headers and I'll put them live |
Oh sorry. My bad.
| |
| | Samuel Devulder
Posts 248 11 Jul 2019 21:51
| Stefan "Bebbo" Franke wrote:
| Samuel Devulder wrote:
| I should try recompiling quake with your latest gcc6.5b when it's available. |
It's available now. |
Is it at this link: EXTERNAL LINK ? Is there a plain ZIP file containing just the content of the setup? (My anti-virus doesn't like the setup, and I usually prefer a plain zip so that I can easily move the installation folder when needed. Most important: on my W10 machine the setup.exe produces this error: EXTERNAL LINK ).
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 06:36
| Stefan "Bebbo" Franke wrote:
| gcc is aware of 68040/60: /* fmovecr must be emulated on the 68040 and 68060, so it shouldn't be used at all on those chips. */ and for the 68080 all FP constants can be used directly now. |
Hello Bebbo, I see this code now:
    fdmul.x fp3,fp1
    fdmul.d #0x3fe0000000000000,fp3
    fdadd.x fp3,fp1
    fdadd.x fp2,fp1
    fdmove.x fp0,fp2
    fdadd.d #0x3ff0000000000000,fp2
    fdmul.x fp0,fp2
    fdmul.x fp0,fp0
    fdsub.d #0x3ff0000000000000,fp0
    fddiv.x fp0,fp2
Would it be possible for GCC to use .SINGLE instead of .DOUBLE for numbers which fit in single? This would save program space and also be faster for the Moto68060.
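A minimal C sketch of the "fits in single" test, assuming IEEE 754 `float` and `double`. The helper names `bits_to_double` and `fits_in_single` are hypothetical and only illustrate the check; they are not taken from GCC or the assembler. Under that assumption, the two constants in the listing decode to 0.5 and 1.0, and both survive a round trip through single precision.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern (as printed in the listing) as a double. */
double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 when the finite value x survives a round trip through float
   unchanged, i.e. a single-precision immediate would carry the same value. */
int fits_in_single(double x)
{
    volatile float f;              /* volatile: force rounding to float */

    if (x != x)                    /* NaN never compares equal */
        return 0;
    if (x > FLT_MAX || x < -FLT_MAX)
        return 0;                  /* outside the float range */
    f = (float)x;
    return (double)f == x;
}
```

For example, `fits_in_single(bits_to_double(0x3fe0000000000000ULL))` is true, while `fits_in_single(0.1)` is false because 0.1 needs more mantissa bits than a float has.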
| |
| | Stefan "Bebbo" Franke
Posts 142 12 Jul 2019 07:14
| Gunnar von Boehn wrote:
| Would it be possible for GCC to use .SINGLE instead of .DOUBLE for numbers which fit in single? This would save program space and also be faster for the Moto68060.
|
that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 08:13
| Stefan "Bebbo" Franke wrote:
|
Gunnar von Boehn wrote:
| Would it be possible for GCC to use .SINGLE instead of .DOUBLE for numbers which fit in single? This would save program space and also be faster for the Moto68060. |
that's imho the assembler's job, to replace single insns with cheaper ones, but maybe I can do that in gcc too.
|
great! many thanks!
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 08:29
| Gunnar von Boehn wrote:
| compiled with -O2 -m68080
.L2:
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    subq.l #1,d0
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    tst.l d0
    jne .L2
    unlk a5
    rts
Yes you are absolutely correct. GCC 6.5b does include the unneeded TST instruction |
Bebbo, GCC sometimes makes such code but not always. Bebbo, do you see a way to solve this? If this is easy to fix it would be great as saving the extra TST will often be good.
| |
| | Stefan "Bebbo" Franke
Posts 142 12 Jul 2019 08:34
| Gunnar von Boehn wrote:
|
Gunnar von Boehn wrote:
| compiled with -O2 -m68080
.L2:
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    subq.l #1,d0
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a1)+
    tst.l d0
    jne .L2
    unlk a5
    rts
Yes you are absolutely correct. GCC 6.5b does include the unneeded TST instruction |
Bebbo, GCC sometimes makes such code but not always. Bebbo, do you see a way to solve this? If this is easy to fix it would be great as saving the extra TST will often be good.
|
what's the source and the used options?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 08:41
| #include <string.h>
void Scale3(double scalar, double* b, double* c) {
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        *b++ = t1;
        t2 = scalar* *c++;
        *b++ = t2;
        t3 = scalar* *c++;
        *b++ = t3;
        t4 = scalar* *c++;
        *b++ = t4;
    }
}
The code above, with these options: "-Os -m68080"
| |
| | Stefan "Bebbo" Franke
Posts 142 12 Jul 2019 08:54
| Gunnar von Boehn wrote:
| #include <string.h>
void Scale3(double scalar, double* b, double* c) {
    size_t j;
    double t1;
    double t2;
    double t3;
    double t4;
    for (j=1000; j; j--){
        t1 = scalar* *c++;
        *b++ = t1;
        t2 = scalar* *c++;
        *b++ = t2;
        t3 = scalar* *c++;
        *b++ = t3;
        t4 = scalar* *c++;
        *b++ = t4;
    }
}
The code above, with these options: "-Os -m68080" |
Dang :-) The m68080 automaton is not yet correct, but the scheduler inserts the only executable insn.
| |
| | Samuel Devulder
Posts 248 12 Jul 2019 10:28
| Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:
void Scale0(double scalar, double* b, double* c) {
    size_t j;
    for (j=4; j; j--){
        *b++ = scalar* *c++;
    }
}
I get:
_Scale0:
    fdmove.d (4,sp),fp0
    move.l (16,sp),a1
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1
    move.l (12,sp),a0 ; 4 wait-cycles
    fmove.d fp1,(a0)+
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1 ; 5 wait-cycles
    fmove.d fp1,(a0)+
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1 ; 5 wait-cycles
    fmove.d fp1,(a0)+
    fdmul.d (a1),fp0 ; 5 wait-cycles: total 19 wait-cycles
    fmove.d fp0,(a0)
    rts
In this asm there are lots of pipeline stalls because the result of fdmul is used in the very next operation. Why wasn't the flow reorganized to use multiplication on 4 independent fpu-regs (cf. loop unrolling), and write the result in the end of the function like this:
_Scale0:
    fdmove.d (4,sp),fp0
    move.l (12,sp),a0
    move.l (16,sp),a1
    fmovem fp3/fp2,-(sp)
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1
    fdmove.d (a1)+,fp2
    fdmul.x fp0,fp2
    fdmove.d (a1)+,fp3
    fdmul.x fp0,fp3
    fdmul.x (a1)+,fp0 ; 0 wait-cycle
    fmove.d fp1,(a0)+ ; 1 wait-cycle
    fmove.d fp2,(a0)+ ; 1 wait-cycle
    fmove.d fp3,(a0)+
    fmovem (sp)+,fp3/fp2 ; 0 wait-cycle: total 2 wait-cycles
    fmove.d fp0,(a0)+
    rts
Notice:
1) if the loop was 1 cycle bigger, resulting in an extra fp5 being used, then there wouldn't be any wait-cycle :)
2) even with this 4-fold loop, the 1 wait-cycle can be even further optimized by spreading the fmovem into several fmove.x placed in these points like this:
_Scale0:
    fdmove.d (4,sp),fp0
    move.l (12,sp),a0
    move.l (16,sp),a1
    fmovem fp3/fp2,-(sp)
    fdmove.d (a1)+,fp2
    fdmul.x fp0,fp2
    fdmove.d (a1)+,fp3
    fdmul.x fp0,fp3
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1
    fdmul.x (a1)+,fp0 ; 0 wait-cycle
    fmove.d fp2,(a0)+
    fmove.x fp2,(sp)+ ; 0 wait-cycle
    fmove.d fp3,(a0)+
    fmove.x fp3,(sp)+ ; 0 wait-cycle
    fmove.d fp1,(a0)+ ; 0 wait-cycle: total 0 wait-cycles
    fmove.d fp0,(a0)+
    rts
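The reordering proposed in this post can be sketched at the C level: four independent products first, then the stores grouped at the end. This is only an illustration under the assumption that `b` and `c` do not overlap; the name `Scale0_grouped` is hypothetical, and nothing here guarantees that a given compiler emits the hand-written assembly above.

```c
/* Four-element case only, mirroring the Scale0 loop with j = 4.
   Assumes b and c do not overlap; if they did, delaying the stores
   would change the result. */
void Scale0_grouped(double scalar, double *b, const double *c)
{
    /* Four independent products: none depends on another one's result,
       so a multiply never has to wait for the previous store. */
    double t1 = scalar * c[0];
    double t2 = scalar * c[1];
    double t3 = scalar * c[2];
    double t4 = scalar * c[3];

    /* Stores grouped at the end, after all multiplies were started. */
    b[0] = t1;
    b[1] = t2;
    b[2] = t3;
    b[3] = t4;
}
```

Written this way, each product has three other operations between its multiply and its store, which is the distance the wait-cycle comments above are counting.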
| |
| | Stefan "Bebbo" Franke
Posts 142 12 Jul 2019 10:32
| The insn in front of a compare is now fused with the compare if the two are related. Live at ~12:35, if all tests pass.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 13:07
| Samuel Devulder wrote:
| Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:
void Scale0(double scalar, double* b, double* c) {
    size_t j;
    for (j=4; j; j--){
        *b++ = scalar* *c++;
    }
}
|
Unrolling will make more sense for more iterations. How about we look at a loop of "640"?
#include <string.h>
void Scale8(double scalar, double* b, double* c) {
    size_t j;
    for (j=640; j; j--){
        *b++ = scalar * *c++;
    }
}
_Scale8:
    move.l a2,-(sp)
    move.l #640,d0
    fdmove.d (8,sp),fp0
    move.l (20,sp),a0
    move.l (16,sp),a1
.L2:
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    move.l a1,a2
    fmove.d fp1,(a2)+
    subq.l #8,d0
    lea (64,a1),a1
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a2)+
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(a2)
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(-40,a1)
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(-32,a1)
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(-24,a1)
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(-16,a1)
    fdmove.d (a0)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp1,(-8,a1)
    tst.l d0
    jne .L2
    move.l (sp)+,a2
    rts
Compiled with -O3 -m68080 -fomit-frame-pointer -funroll-loops. Bebbo, can you explain why this code is generated? I would have assumed EA mode (An)+. Instead GCC uses the (16,An) EA mode and makes it very complicated with LEA and A2. Why does GCC make such code?
| |
| | Stefan "Bebbo" Franke
Posts 142 12 Jul 2019 13:13
| Samuel Devulder wrote:
| Playing around with "-O3 -m68080 -fomit-frame-pointer -funroll-loops" and that code:
void Scale0(double scalar, double* b, double* c) {
    size_t j;
    for (j=4; j; j--){
        *b++ = scalar* *c++;
    }
}
|
Everything is fine, because the pointers may alias. Use
void Scale0(double scalar, double* restrict b, double* restrict c)
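A small C sketch of why aliasing matters here. With the original signature the caller may pass overlapping buffers, so each store to `*b` can feed the next load from `*c` and the compiler has to keep that order. The second function uses the `restrict` signature suggested above; its name `Scale0_restrict` is hypothetical and only chosen so that both variants fit in one file.

```c
#include <stddef.h>

/* Original signature: b and c may overlap, so every store to *b has to be
   visible to the load from *c that follows it. */
void Scale0(double scalar, double *b, double *c)
{
    size_t j;
    for (j = 4; j; j--) {
        *b++ = scalar * *c++;
    }
}

/* restrict version: the caller promises that b and c do not overlap,
   which allows the compiler to reorder the loads and stores. */
void Scale0_restrict(double scalar, double *restrict b, double *restrict c)
{
    size_t j;
    for (j = 4; j; j--) {
        *b++ = scalar * *c++;
    }
}
```

With `double v[5] = {1, 2, 3, 4, 5};` the overlapping call `Scale0(2.0, v + 1, v)` leaves `v` as {1, 2, 4, 8, 16}, because every result is read back by the next iteration. Doing all loads before the stores would give {1, 2, 4, 6, 8} instead, so that reordering is only legal once the overlap is ruled out. Calling `Scale0_restrict` with overlapping buffers is undefined behavior.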
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6254 12 Jul 2019 13:39
| Bebbo did you see my question above?
| |
|
|
|