Information about the Apollo CPU and FPU.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 11:38

Let's talk about how a C compiler could be improved for the 68080. There are several areas where a C compiler could create faster code. Let's look at them in detail:

a) Instruction scheduling: don't use 060 hacks

Say the coder wants to do stuff like this:

    AND.L #$00000003,D0
    AND.L #$00000030,D1

These 2 instructions are 12 bytes long. The 68060 has a weakness with long instructions, so a compiler might try to work around this 060 weakness and create this code:

    MOVEQ #$03,D7
    AND.L D7,D0
    MOVEQ #$30,D7
    AND.L D7,D1

The compiler now used an extra temp register and added 2 more instructions. This workaround for the 060 is not needed on the 68080: the 12-byte, 2-instruction version was already optimal, and it executes on the 68080 in 1 cycle total.

b) Instruction scheduling: use superscalarity

The 68080 can execute 2 instructions per cycle; tell the compiler to schedule the code to take advantage of this.

c) FPU instruction scheduling

The 68080 FPU is fully pipelined. FADD and FMUL have a latency, but each cycle a new FPU instruction can be issued (pipelined). With well-scheduled FPU code the 68080 can reach several times the FPU performance of a 68060. GCC can create good code for this, but we need to provide GCC with the scheduling cost information file. This seems to me like very low-hanging fruit.
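In C, the AND pair above corresponds to two independent expressions with no shared operands, which is exactly what lets a superscalar core issue both in the same cycle. A minimal sketch (function names are illustrative, not from the thread):

```c
#include <assert.h>
#include <stdint.h>

/* Two independent masking operations, the C-level shape of the
 * AND.L #$00000003,D0 / AND.L #$00000030,D1 pair: because they share
 * no registers, both can be dual-issued on a 2-wide core. */
uint32_t mask_low(uint32_t d0) { return d0 & 0x03u; }
uint32_t mask_mid(uint32_t d1) { return d1 & 0x30u; }
```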
Stefan "Bebbo" Franke
Posts 142, 24 Jun 2019 11:55
Gunnar von Boehn wrote:
> Let's talk about how a C compiler could be improved for the 68080. There are several areas where a C compiler could create faster code. Let's look at them in detail: a) instruction scheduling: don't use 060 hacks. Say the coder wants to do stuff like this: AND.L #$00000003,D0 / AND.L #$00000030,D1 ...

This is just work and can be done.

Gunnar von Boehn wrote:
> b) instruction scheduling: use superscalarity. The 68080 can execute 2 instructions per cycle; tell the compiler to schedule the code to take advantage of this.

I already added some scheduling to gcc 6.5.0.b as post-processing. Isn't it 4 instructions per cycle?

Gunnar von Boehn wrote:
> c) FPU instruction scheduling. The 68080 FPU is fully pipelined. FADD and FMUL have a latency, but each cycle a new FPU instruction can be issued. With well-scheduled FPU code the 68080 can reach several times the FPU performance of a 68060. GCC can create good code for this, but we need to provide GCC with the scheduling cost information file. This seems to me like very low-hanging fruit.

Providing the scheduling information is a start ^^ And to utilize the new 68080 instructions, a working m68k-amigaos-as with 68080 support is mandatory.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 12:38

Hallo Stefan,
Stefan "Bebbo" Franke wrote:
> I already added some scheduling to gcc 6.5.0.b as post-processing. Isn't it 4 instructions per cycle?

Up to 4, yes, but only under some circumstances. The 68080 looks internally like this: a) a 3-operand machine with b) TWO EA units and TWO 3-operand ALUs, plus the AMMX unit, plus the FPU.

Well-scheduled code should:
- be superscalar for the 2 ALUs,
- happily use longer instructions (each ALU can eat an 8-byte instruction per cycle, 16 bytes total),
- make use of the 3-operand features.

For D2=D0+D1 and D5=D4+D3, write:

    move.l D0,D2
    add.l  D1,D2
    move.l D3,D5
    add.l  D4,D5

Always bundle the MOVE.L with the next instruction. The core will be able to execute all 4 instructions together in 1 cycle.

FPU scheduling is relatively easy; the core will keep track of dependencies. FADD, FSUB, FCMP and FMUL have a latency of 6, which means up to 6 FADDs can be in flight in parallel in the core. FDIV has a latency of 10, so up to 10 FDIVs can be in flight in parallel in the core. Please schedule instructions accordingly.
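At the C level, exploiting a pipelined FPU mostly means giving the compiler independent dependency chains. A sketch under the latencies stated above (6 accumulators matching the FADD latency of 6; a scheduling-aware compiler can also do this transformation itself once it has correct latency information):

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with 6 independent accumulators so a new FADD can issue
 * every cycle instead of each add waiting out the 6-cycle latency of
 * the previous one. */
double sum6(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0;
    size_t i = 0;
    for (; i + 6 <= n; i += 6) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
        s4 += a[i + 4]; s5 += a[i + 5];
    }
    for (; i < n; i++)           /* remainder elements */
        s0 += a[i];
    return ((s0 + s1) + (s2 + s3)) + (s4 + s5);
}
```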
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 12:44

Some 68080 tuning tips:

A) The 68080 can jump to any address without an instruction fetch penalty. While aligning branch targets (labels) made sense on the 68060, it makes no sense anymore on the 68080.

B) The 68080 can do 64-bit memory operations. For example:

    MOVE.L (A0)+,(A1)+
    MOVE.L (A0)+,(A1)+

will be executed together as one 64-bit MOVE.

C) Misaligned memory accesses cost NOTHING extra for READs.

D) Free branches. The core can automatically convert some branches to conditional code. A small saturation example:

        addq.b #1,D0
        bne.s  nosat
        move.b #$FF,D0    ; overflow
    nosat:

The BNE and the MOVE.B will internally be converted to a conditional MOVE. This executes in a single cycle and never mispredicts. This feature works only if there is a single instruction under the branch.

All the above tricks can be used while the code stays compatible with older 68K CPUs too. This means we can improve code performance without breaking compatibility.
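The saturation idiom above looks like this in C (a sketch: the branch form is what maps onto the ADDQ/BNE pair, and with a single statement under the branch it satisfies the stated condition for the branch-to-conditional-move conversion):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 8-bit increment.  Wrap-around is detected the same way
 * the ADDQ.B/BNE pair detects it (the increment producing zero), and
 * the value is then pinned at 0xFF.  The if-body is one assignment,
 * i.e. "a single instruction under the branch". */
uint8_t sat_inc8(uint8_t v)
{
    uint8_t r = (uint8_t)(v + 1);
    if (r == 0)          /* addq.b wrapped: saturate */
        r = 0xFF;
    return r;
}
```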
Stefan "Bebbo" Franke
Posts 142, 24 Jun 2019 13:08
Hallo Gunnar,

Gunnar von Boehn wrote:
> Up to 4, yes, but only under some circumstances. The 68080 looks internally like this: a) a 3-operand machine with b) TWO EA units and TWO 3-operand ALUs, plus the AMMX unit, plus the FPU. Well-scheduled code should be superscalar for the 2 ALUs, happily use longer instructions (each ALU can eat an 8-byte instruction per cycle, 16 bytes total), and make use of the 3-operand features. For D2=D0+D1 and D5=D4+D3, write: move.l D0,D2 / add.l D1,D2 / move.l D3,D5 / add.l D4,D5. Always bundle the MOVE.L with the next instruction. The core will be able to execute all 4 instructions together in 1 cycle. FPU scheduling is relatively easy; the core will keep track of dependencies. FADD, FSUB, FCMP and FMUL have a latency of 6, so up to 6 FADDs can be in flight in parallel. FDIV has a latency of 10, so up to 10 FDIVs can be in flight in parallel. Please schedule instructions accordingly.

a) So gcc could treat D2=D0+D1, D5=D4+D3 as

    lea d0(d1),d2
    lea d4(d3),d5

which is resolved during output as move.l/add.l. Isn't there an explicit new insn for this?

b) The use of moveq can be trained off.

c) Scheduling... I'll have a look.

Best approach: open a ticket for each topic at EXTERNAL LINK and pray that you won't get a "WON'T FIX" ^^
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 14:14

Stefan "Bebbo" Franke wrote:
> a) So gcc could treat D2=D0+D1, D5=D4+D3 as lea d0(d1),d2 / lea d4(d3),d5

ADD was just an example. The same works for other operations like SUB/AND/OR/EOR/..., but also for SHIFT. E.g. B=A>>1:

    MOVE.L D0,D1
    LSR.L  #1,D1

Stefan "Bebbo" Franke wrote:
> which is resolved during output as move.l/add.l. Isn't there an explicit new insn for this?

The 68080 does support new instructions, from simple ones like

    ADDIW.L #$1234,D0    ; 16-bit immediate to LONG addition (4-byte encoding)

to special-purpose ones like

    MOVEX.L (ea),Dn      ; move with free endian conversion

or

    PERM #3467,D0:D1     ; byte-select instruction combining the contents of 2 registers

New instructions of course break backward compatibility; therefore I spoke about the backward-friendly options first.

One tuning note. As you are certainly aware, Motorola chose a vertical pipeline design for the 68060 CPU. This means the EA units sit ABOVE the ALU units in the pipeline. This design choice has 2 effects: a) "free" EA calculation for instructions, and b) an ALU-to-EA bubble.

Benefit:

    MOVE.L 12(A0),D0    ; EA calculation and DCache access are free

Contra:

    lsl.l  #1,D0
    move.l (A0,D0),D2   ; index usage

This code will cause a slowdown on the 060, as the index register was modified in the ALU too shortly beforehand. Motorola explains this very well in their manuals: compilers need to schedule such index manipulations several instructions before using them. Is GCC aware of this?
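For illustration, the endian conversion that MOVEX.L performs in a single load looks like this in portable C (a sketch; on targets without such an instruction the compiler emits the shift/OR sequence, and GCC can often recognize this pattern as a byte swap):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value and swap its byte order, i.e. what one
 * MOVEX.L (ea),Dn would do on the 68080. */
uint32_t load_swapped32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);          /* the plain MOVE.L part */
    return (v >> 24)                  /* the byte-swap part */
         | ((v >> 8)  & 0x0000FF00u)
         | ((v << 8)  & 0x00FF0000u)
         | (v << 24);
}
```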
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 15:03
Bebbo, here are some examples of FUSINGs:

    MOVE.L (an)+,(am)+    MOVE.L (an)+,(am)+
    MOVE.L -(an),-(am)    MOVE.L -(an),-(am)
    MOVE.L (an)+,Dn       MOVE.L (an)+,Dm
    MOVE.L Dn,(an)+       MOVE.L Dm,(an)+

    MOVE.B (ea),Dn        EXTB.L Dn
    MOVE.W (ea),Dn        EXT.L  Dn

    MOVE.L Dn,Dm          ADDI.L #,Dm
    MOVE.L Dn,Dm          SUBI.L #,Dm
    MOVE.L Dn,Dm          ANDI.W #,Dm
    MOVE.L Dn,Dm          ADD.L  Dx,Dm
    MOVE.L Dn,Dm          SUB.L  Dx,Dm
    MOVE.L Dn,Dm          AND.L  Dx,Dm
    MOVE.L Dn,Dm          OR.L   Dx,Dm
    MOVE.L Dn,Dm          SHIFT.L #,Dm

    MOVEQ #,Dn            MOVE.B (ea),Dn    ; MVZ
    MOVEQ #,Dn            MOVE.W (ea),Dn    ; MVZ
    MOVEQ #,Dn            OR.L  Dx,Dn
    MOVEQ #,Dn            AND.L Dx,Dn

    SUBQ.L #1,Dn          BNE.s LOOP

Above you see examples of fusing: 2 instructions which execute together in a single cycle in 1 ALU. Both ALUs can execute such bundles.
Stefan "Bebbo" Franke
Posts 142, 24 Jun 2019 15:34

Gunnar von Boehn wrote:
> Contra: lsl.l #1,D0 / move.l (A0,D0),D2 (index usage). This code will cause a slowdown on the 060, as the index register was modified in the ALU too shortly beforehand. Motorola explains this very well in their manuals: compilers need to schedule such index manipulations several instructions before using them. Is GCC aware of this?

I'd say: yes.

And about fusing: a complete list would be useful. Fused assembler instructions must be reflected by single insns in gcc to avoid splitting.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 15:59

Stefan "Bebbo" Franke wrote:
> And about fusing: a complete list would be useful. Fused assembler instructions must be reflected by single insns in gcc to avoid splitting.

OK, I will prepare this list for you. Of course this can go both ways: if you see a certain combination which is very common or very useful to fuse, then please tell us and we can add a FUSING.

One more tuning comment. BITFIELD instructions are useful, but they used to be slow. E.g. on the 68060 a BFEXTU could need 9 cycles; on the 68080 such a bitfield operation is often single-cycle. On the 68060 a possible tuning was to replace a bitfield with a few normal instructions doing the same operation. Please do NOT do this cheat on the 68080: there, a single-cycle bitfield is unbeatable.

One more question: the 68080 has a number of new CPU debug registers, which you can read out with MOVEC. Using these registers you can, for example, read out exactly how many CPU cycles a certain routine needed, how good the branch prediction was, your DCache hit/miss ratio, how superscalar your routine ran, and whether your routine suffered from ALU-to-EA stalls. Could GCC benefit from this? Could this be used in "profiling" options?
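In C, the operation behind BFEXTU is an unsigned bitfield extract. A portable sketch (bit positions counted from bit 0 here, which simplifies the 68k bitfield addressing rules):

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `off` (LSB = bit 0), zero-extended.
 * A 68080-aware compiler can emit this as a single BFEXTU; the "cheat"
 * discussed above would instead emit the shift and mask as separate
 * instructions, which no longer pays off on the 68080. */
uint32_t bf_extu(uint32_t v, unsigned off, unsigned width)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (v >> off) & mask;
}
```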
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 16:22
Bebbo, one more question. The 68080 can do, per cycle: 1 memory READ, OR 1 memory WRITE, OR 1 memory READ/WRITE to the SAME ADDRESS. This makes CISC instructions very strong. For best performance such memory accesses should be used for "real work"; counters and temporary variables should be kept in registers to keep the memory bus available for pushing data.

To make this easier the 68080 has more registers. The old 68000 had 8 address and 8 data registers; on the 68080 you can use 16 ADDRESS and 32 DATA registers. To use those extra registers, each instruction can have a PREFIX. This technique is similar to the x64 extension which AMD invented. Would adding support for this to GCC be complex?

On the other hand, DOUBLE INDIRECT EA modes should be avoided. For example, this code construct is sometimes used by compilers to "create" more pointer registers:

    MOVE.L (4[A0]),D0

Such an operation needs 2 memory accesses and creates an ALU-to-EA bubble. Performance-wise this should be avoided. Can GCC be told to avoid this, or to account 4 cycles for this EA?
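In C, a double-indirect EA corresponds to dereferencing a pointer that was itself just loaded from memory. A sketch of the pattern and the register-friendly rewrite (the struct layout and names are illustrative):

```c
#include <assert.h>

struct node { int pad; int *payload; };   /* illustrative layout */

/* One-shot double indirection: load the pointer stored in the struct,
 * then load through it -- the MOVE.L (4[A0]),D0 pattern. */
int read_once(const struct node *n)
{
    return *n->payload;
}

/* When the inner pointer is used repeatedly, hoist it into a local
 * (i.e. a register) so each access is a plain single-indirect EA. */
int read_sum(const struct node *n, int count)
{
    const int *p = n->payload;    /* one pointer load, kept in a register */
    int s = 0;
    for (int i = 0; i < count; i++)
        s += p[i];
    return s;
}
```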
Stefan "Bebbo" Franke
Posts 142, 24 Jun 2019 17:36

Gunnar von Boehn wrote:
> ... Please do NOT do this cheat on the 68080: there, a single-cycle bitfield is unbeatable. ...

I think the current insn costs are 100% correct - though I have to fix the 68060 costs - but for the 68080 all insn costs are set to 1 cycle atm.

Gunnar von Boehn wrote:
> Bebbo, one more question. The 68080 can do, per cycle: 1 memory READ, OR 1 memory WRITE, OR 1 memory READ/WRITE to the SAME ADDRESS. This makes CISC instructions very strong. For best performance such memory accesses should be used for "real work"; counters and temporary variables should be kept in registers to keep the memory bus available for pushing data.

gcc is cost-based - sometimes you can't be 100% accurate, but it's fair. So gcc selects registers if useful, e.g. if a parameter is used more often, a temp register is used.

Gunnar von Boehn wrote:
> To make this easier the 68080 has more registers. The old 68000 had 8 address and 8 data registers; on the 68080 you can use 16 ADDRESS and 32 DATA registers. To use those extra registers, each instruction can have a PREFIX. This technique is similar to the x64 extension which AMD invented. Would adding support for this to GCC be complex?

If I find the time for this, I'll try to implement it.

Gunnar von Boehn wrote:
> On the other hand, DOUBLE INDIRECT EA modes should be avoided. For example, this code construct is sometimes used by compilers to "create" more pointer registers: MOVE.L (4[A0]),D0. Such an operation needs 2 memory accesses and creates an ALU-to-EA bubble. Performance-wise this should be avoided. Can GCC be told to avoid this, or to account 4 cycles for this EA?

It was a pita to implement those double-indirect instructions, but this can be done via costs or a TUNE_68080 flag. With -Os I would keep these insns. And I wonder that MOVE.L (4[A0]),D0 is slower than

    MOVE.L 4[A0],A1
    MOVE.L (A1),D0

it should result in the same speed - at least...

BTW:

    __regargs int foo(int x) { return x & 11; }

now yields

    and.l #11,d0
    rts

if -m68080 is specified. This should be live quite soon.
Mike Kopack
Posts 268, 24 Jun 2019 17:46
Really awesome to see you two discussing this and helping each other, since realistically you are both the key people for this to happen. I'll certainly be sending some $$ Bebbo's way, and will be purchasing a V4SA as soon as I can get my hands on one. Thanks for coming together on this, guys! It's only going to help the Vampire community!
Tango One
Posts 102, 24 Jun 2019 18:28
One of my dreams is to program in assembly.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 18:42

Stefan "Bebbo" Franke wrote:
> Gunnar von Boehn wrote:
> > On the other hand, DOUBLE INDIRECT EA modes should be avoided. For example, this code construct is sometimes used by compilers to "create" more pointer registers: MOVE.L (4[A0]),D0. Such an operation needs 2 memory accesses and creates an ALU-to-EA bubble. Performance-wise this should be avoided. Can GCC be told to avoid this, or to account 4 cycles for this EA?
>
> It was a pita to implement those double-indirect instructions, but this can be done via costs or a TUNE_68080 flag. With -Os I would keep these insns.

Well, the double-indirect encoding needs an extra extension word, so it makes instructions longer. Using the new A8-A15 registers, or using 2 instructions, will on average not be bigger.

Stefan "Bebbo" Franke wrote:
> And I wonder that MOVE.L (4[A0]),D0 is slower than MOVE.L 4[A0],A1 / MOVE.L (A1),D0 - it should result in the same speed, at least...

This is easy to explain. MOVE.L (4[A0]),D0 gets executed as:

    Cycle   Pipe 1                Pipe 2
    1       MOVE.L 4(A0),Atemp    nop
    2       nop                   nop
    3       nop                   nop
    4       MOVE.L (Atemp),D0     nop

4 clock cycles total. Now your other code:

    Cycle   Pipe 1                Pipe 2
    1       MOVE.L 4(A0),A1       free
    2       free                  free
    3       free                  free
    4       MOVE.L (A1),D0        free

It also takes 4 cycles, and it has the same instruction size. It looks nearly the same, but it is not: you have 6 instruction slots free! This means that with good scheduling a compiler can put up to 6 instructions there in parallel. Hint: avoid the ALU-to-EA bubble. On the 68060 the picture is very similar.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 19:09
Bebbo, maybe we can also talk about instruction pairing. Pairing should be a very important topic for many superscalar CPUs: 68060, Pentium, PPC, ARM. Let's make a 68k example:

    MOVEM.L (A0)+,D0-D1

will load 2 registers from memory and do virtually the same work as:

    MOVE.L (A0)+,D0
    MOVE.L (A0)+,D1

Both have the same instruction size, but there is a significant difference: the multi-cycle MOVEM will block the 2nd pipe during its execution, while doing the same work with 2 individual MOVEs allows the compiler to put other instructions in the 2nd pipe.
Stefan "Bebbo" Franke
Posts 142, 24 Jun 2019 19:48

So double-indirect insns are of no use at all - guess I'll disable these for all CPUs /sigh

Gunnar von Boehn wrote:
> Bebbo, maybe we can also talk about instruction pairing. ... The multi-cycle MOVEM will block the 2nd pipe during its execution, while doing the same work with 2 individual MOVEs allows the compiler to put other instructions in the 2nd pipe.

Right now movem is only used if 3 or more registers are pushed/popped. But since this happens in the prologue/epilogue, it's separated from the remaining code => only a late peephole optimizer might enhance this - maybe.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 24 Jun 2019 20:34

Bebbo, another GCC question. I recall that GCC used to create often-unneeded TST instructions. Let's look at an example:

    void strncopy(char *dst, const char *src, short n)
    {
        char c;
        for (; n >= 0; n--) {
            c = *src++;
            *dst++ = c;
            if (c == 0x0)
                break;
        }
    }
What code do we expect to be created here? I would assume the compiler creates something like this:

        move.l  4(SP),D0
        blt     END
        move.l  8(SP),A0
        move.l  12(SP),A1
    LOOP:
        move.b  (A1)+,(A0)+
        dbeq    D0,LOOP
    END:
        rts

I expect a loop of 2 instructions and a program total of 7. Why is the code created by recent GCC versions bigger?
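For reference, a runnable version of the routine under discussion, with the semantics as written above: with n >= 0 as the condition it copies at most n+1 bytes, stopping once the terminating NUL has been copied.

```c
#include <assert.h>
#include <string.h>

/* Copy up to n+1 bytes from src to dst, stopping early once the
 * terminating NUL byte has been copied (same behaviour as the
 * strncopy() in the post above). */
void strncopy(char *dst, const char *src, short n)
{
    char c;
    for (; n >= 0; n--) {
        c = *src++;
        *dst++ = c;
        if (c == 0x0)
            break;
    }
}
```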
Stefan "Bebbo" Franke
Posts 142, 25 Jun 2019 06:39

Gunnar von Boehn wrote:
> Bebbo, another GCC question. I recall that GCC used to create often-unneeded TST instructions. Let's look at an example: void strncopy(char *dst, const char *src, short n) { char c; for (; n >= 0; n--) { c = *src++; *dst++ = c; if (c == 0x0) break; } }
> What code do we expect to be created here? I would assume the compiler creates something like this: move.l 4(SP),D0 / blt END / move.l 8(SP),A0 / move.l 12(SP),A1 / LOOP: move.b (A1)+,(A0)+ / dbeq D0,LOOP / END: rts. I expect a loop of 2 instructions and a program total of 7. Why is the code created by recent GCC versions bigger?

The main reason seems to be the introduced abstract syntax - called GIMPLE - plus the optimizers working with that syntax. It 'optimizes' the initial code

    {
      char c;
      n;
      goto <D.1382>;
      <D.1381>:;
      c = *src++;
      *dst++ = c;
      if (c == 0) { goto <D.1380>; }
      n--;
      <D.1382>:;
      if (n >= 0) goto <D.1381>; else goto <D.1380>;
      <D.1380>:;
    }

to (depending on the -O flags; this is -O1):

    strncopy (char * dst, const char * src, short int n)
    {
      unsigned int ivtmp.17;
      unsigned int ivtmp.15;
      char c;
      void * _29;
      void * _30;
      unsigned short _31;
      unsigned short _32;
      short int _33;

      <bb 2>:
      if (n_8(D) >= 0) goto <bb 3>; else goto <bb 7>;

      <bb 3>:
      src_21 = src_7(D) + 1;
      c_22 = *src_7(D);
      dst_23 = dst_6(D) + 1;
      *dst_6(D) = c_22;
      if (c_22 == 0) goto <bb 7>; else goto <bb 4>;

      <bb 4>:
      ivtmp.15_3 = (unsigned int) src_21;
      ivtmp.17_18 = (unsigned int) dst_23;
      goto <bb 6>;

      <bb 5>:
      _29 = (void *) ivtmp.15_1;
      c_11 = MEM[base: _29, offset: 0B];
      ivtmp.15_2 = ivtmp.15_1 + 1;
      _30 = (void *) ivtmp.17_20;
      MEM[base: _30, offset: 0B] = c_11;
      ivtmp.17_19 = ivtmp.17_20 + 1;
      if (c_11 == 0) goto <bb 7>; else goto <bb 6>;

      <bb 6>:
      # n_25 = PHI <n_8(D)(4), _33(5)>
      # ivtmp.15_1 = PHI <ivtmp.15_3(4), ivtmp.15_2(5)>
      # ivtmp.17_20 = PHI <ivtmp.17_18(4), ivtmp.17_19(5)>
      _31 = (unsigned short) n_25;
      _32 = _31 + 65535;
      _33 = (short int) _32;
      if (_33 != -1(OVF)) goto <bb 5>; else goto <bb 7>;

      <bb 7>:
      return;
    }

which is anything but optimized... and gcc 6.5.0b is already a tad better than the original gcc ^^ and there is room for further improvements. The compiler explorer at EXTERNAL LINK provides some 68k compilers to compare asm code.
Gunnar von Boehn (Apollo Team Member)
Posts 6254, 25 Jun 2019 07:49

Hi Bebbo, here is a fusing overview: CLICK HERE
Thellier Alain
Posts 143, 25 Jun 2019 07:53

Quote:
> char c; for (; n >= 0; n--) { c = *src++; *dst++ = c; if (c == 0x0) break; }

Writing the loop in C this way generates better asm:

    char c;
    while (n) {
        c = *src++;
        *dst++ = c;
        if (c == 0x0)
            break;
        n--;
    }
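One caveat worth checking with these two loop shapes (an observation added here, not from the thread): they are not equivalent at the boundaries. The for (; n >= 0; n--) form makes up to n+1 copies, while the while (n) form makes at most n and would treat a negative n as a very large count. A side-by-side harness, with counting helpers added purely for illustration:

```c
#include <assert.h>

/* The two loop shapes from the discussion, instrumented to return the
 * number of bytes copied so their boundary behaviour can be compared.
 * Negative n is deliberately not exercised: while (n) would treat it
 * as a huge count. */
static int copy_for(char *dst, const char *src, short n)
{
    int copied = 0;
    char c;
    for (; n >= 0; n--) {
        c = *src++; *dst++ = c; copied++;
        if (c == 0x0) break;
    }
    return copied;
}

static int copy_while(char *dst, const char *src, short n)
{
    int copied = 0;
    char c;
    while (n) {
        c = *src++; *dst++ = c; copied++;
        if (c == 0x0) break;
        n--;
    }
    return copied;
}
```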