Information about the Apollo CPU and FPU. |
|
---|
| | Stefan "Bebbo" Franke
Posts 142 17 Jul 2019 13:54
| Grom 68k wrote:
|
Stefan "Bebbo" Franke wrote:
| Grom 68k wrote:
| ... Can gcc tell the difference between subq.l #4,d0 and subq.l #1,d0? If yes, subq.l #4,d0 could be moved upward. ... |
since the sub is fused to the jne, there is no gain. |
Unlike other values, #1 is specified for subq/bne fusing.
Gunnar von Boehn wrote:
| MOVEq #,Dn AND.L Dx,Dn
| SUBQ.L #1,Dn BNE.s LOOP |
|
_Scale1: | unrolled 4 times
    fdmove.x fp1,fp0
    fmovem fp2/fp3/fp4,-(sp)
    moveq #20,d0
.L2:
    fdmove.d (a1)+,fp4
    fdmul.x fp0,fp4
    fdmove.d (a1)+,fp3
    fdmul.x fp0,fp3
    fdmove.d (a1)+,fp2
    fdmul.x fp0,fp2
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp4,(a0)+
    fmove.d fp3,(a0)+
    subq.l #4,d0
    fmove.d fp2,(a0)+
    fmove.d fp1,(a0)+
    tst.l d0
    jne .L2
    fmove.d fp0,-(sp)
    move.l (sp)+,d0
    move.l (sp)+,d1
    fmovem (sp)+,fp4/fp3/fp2
    rts
| |
| | Grom 68k
Posts 61 17 Jul 2019 14:09
| Stefan "Bebbo" Franke wrote:
_Scale1: | unrolled 4 times
    fdmove.x fp1,fp0
    fmovem fp2/fp3/fp4,-(sp)
    moveq #20,d0
.L2:
    fdmove.d (a1)+,fp4
    fdmul.x fp0,fp4
    fdmove.d (a1)+,fp3
    fdmul.x fp0,fp3
    fdmove.d (a1)+,fp2
    fdmul.x fp0,fp2
    fdmove.d (a1)+,fp1
    fdmul.x fp0,fp1
    fmove.d fp4,(a0)+
    fmove.d fp3,(a0)+
    subq.l #4,d0
    fmove.d fp2,(a0)+
    fmove.d fp1,(a0)+
    tst.l d0
    jne .L2
    fmove.d fp0,-(sp)
    move.l (sp)+,d0
    move.l (sp)+,d1
    fmovem (sp)+,fp4/fp3/fp2
    rts
|
tst.l is useless: fmove doesn't modify the zero flag. You can also remove the unused Scalar0 (= remove the first fdmove.x fp1,fp0). How does gcc choose the unroll count (4, 5, 6)? If you can simplify the step to 1, the subq can be fused, and it would then be better to unroll 5 or 6 times.
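A minimal C sketch of what "a step of 1" could look like at source level: the loop counts 4-element blocks instead of elements, so the counter decrements by 1 per iteration. The function name scale_unrolled4 and the requirement that n be a multiple of 4 are assumptions made for illustration, not taken from the posted code; whether gcc then emits a fused subq.l #1 / bne pair depends on the compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical stand-in for the _Scale1 loop: multiply n doubles by a
 * scalar, four elements per iteration. n is assumed to be a multiple
 * of 4 (the posted loop handles 20 elements in 5 iterations). */
void scale_unrolled4(double *dst, const double *src, double scalar, size_t n)
{
    size_t blocks = n / 4;      /* the counter steps by 1, not by 4 */

    while (blocks != 0) {
        /* four independent multiplies, then four stores */
        double a = src[0] * scalar;
        double b = src[1] * scalar;
        double c = src[2] * scalar;
        double d = src[3] * scalar;

        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;

        src += 4;
        dst += 4;
        blocks--;               /* candidate for subq.l #1 + bne */
    }
}
```

Calling scale_unrolled4(out, in, 2.0, 20) corresponds to the moveq #20,d0 case above, with the counter starting at 5 instead of 20.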
| |
| | Stefan "Bebbo" Franke
Posts 142 17 Jul 2019 14:29
| Grom 68k wrote:
| tst.l is useless: fmove doesn't modify the zero flag. You can also remove the unused Scalar0 (= remove the first fdmove.x fp1,fp0).
|
I know - not done yet.
Grom 68k wrote:
| How does gcc choose the unroll count (4, 5, 6)? If you can simplify the step to 1, the subq can be fused, and it would then be better to unroll 5 or 6 times.
|
I limited it by some formula, otherwise gcc tends to use stack variables for unrolling.
| |
| | Grom 68k
Posts 61 18 Jul 2019 08:08
| I simplified my example to transform a vector3 with a matrix4x4: EXTERNAL LINK
EDIT: This one is better to unroll: EXTERNAL LINK
| Bx | = | Ux Vx Wx Tx || Ax |
| By |   | Uy Vy Wy Ty || Ay |
| Bz |   | Uz Vz Wz Tz || Az |
| 1. |   | 0. 0. 0. 1. || 1. |
EDIT2: The same with const and -O3: EXTERNAL LINK
Can gcc remove the useless a5?
    move.l (a1)+,(-8,a5)
    move.l (a1)+,(-4,a5)
    move.l (a1)+,(-16,a5)
    move.l (a1)+,(-12,a5)
    move.l (a1)+,(-24,a5)
    move.l (a1)+,(-20,a5)
    move.l (a1)+,(-32,a5)
    move.l (a1)+,(-28,a5)
    move.l (a1)+,(-40,a5)
    move.l (a1)+,(-36,a5)
    ...
And this is the rainflow: EXTERNAL LINK. It is used to calculate the fatigue damage of metallic parts.
| |
| | Thellier Alain
Posts 143 18 Jul 2019 12:27
| Hello, I know you are only tuning the compiler, but could you gather those optimized sources into an ASM source? I mean, having 68080-optimized versions of CrossProduct, DotProduct, MultiplyMatrices4x4, MultiplyMatrices3x3, TransformVec3Matrices4x4, TransformVec3Matrices3x3, Distance3, etc. could be useful for lots of future 3D programs :-) Thanks
| |
| | Samuel Devulder
Posts 248 18 Jul 2019 12:59
| That's what I did for the Monkey demo in CoffinOS. But beware of library versions. They tend to work against the full potential of optimizing compilers. For instance, here is an excerpt of a profiling session I did some time ago with a library version of DotProduct:

Test date: Sun Oct 21 20:47:51 2018
Execution profile for: sam/quake.gcc-3.2.2.030
Time units: Percentual
Sort order: Overall time
Profiling mode: Separate
Used commandline: -safe -usemode 0
All symbols shown

_DotProduct       7979523  0.000  13.965  0.000  434.592
R_ClipEdge        3525197  0.000   7.550  0.000    0.000
@R_RenderFace      891605  0.000   4.741  0.000    0.000
@R_EmitEdge       1525151  0.000   3.433  0.000    0.000
@D_DrawSpansXP4    252339  0.000   3.401  0.000    0.000

As you can see, the #1 most costly function is DotProduct. This is not because it isn't optimized (the library version is as optimized as possible), but because it is used almost everywhere in the code. When it is used via a library, that is, not inlined, the compiler cannot really optimize it along with the other fpu computations. Such a function has too few fpu-ops for a major speed boost. Serializing/deserializing the vectors into/from memory to call the library is a complete waste of time, for instance.

Actually such a library function should always be inlined and optimized globally along with the other instructions of the C function. The same goes for primitive-like operations (CrossProduct, etc.) that are often used to make decisions in the code (i.e. their result is combined with other computations and used in an if() statement).
| |
| | Stefan "Bebbo" Franke
Posts 142 18 Jul 2019 19:51
| Grom 68k wrote:
| EDIT2: The same with const and -O3 EXTERNAL LINK Can gcc remove useless a5 ?
    move.l (a1)+,(-8,a5)
    move.l (a1)+,(-4,a5)
    move.l (a1)+,(-16,a5)
    move.l (a1)+,(-12,a5)
    move.l (a1)+,(-24,a5)
    move.l (a1)+,(-20,a5)
    move.l (a1)+,(-32,a5)
    move.l (a1)+,(-28,a5)
    move.l (a1)+,(-40,a5)
    move.l (a1)+,(-36,a5)
    ...
|
there might be an option - but I'll have a look.
| |
| | Samuel Devulder
Posts 248 18 Jul 2019 22:06
| Grom 68k wrote:
| Can gcc remove useless a5 ?
|
If you want to remove the frame pointer, just add -fomit-frame-pointer: EXTERNAL LINK

Now, if your question is "why the hell does gcc make a local copy of const double transformMatrix[4][4]", then I have no clue. It is very odd. I see no obvious reason for this local copy (even playing with the restrict keyword doesn't help EXTERNAL LINK ).

[EDIT] I think I kind of "understand" what's going on. Replace the number of loops (900) by a smaller value (say 3). Then you'll see gcc preload transformMatrix into fpu regs. If you increase the number of loops a little bit, you'll see gcc use more and more fpu regs, up to a point (say 5) where there aren't enough fpu regs for the preload, and then it seems that gcc uses the local stack as extra "free" regs. This is very, very odd. Of course using memory as a source is as fast as using an fpu reg, but then why use a local stack-based copy? Copying adds many cycles. It is killing the speed.
| |
| | Thellier Alain
Posts 143 19 Jul 2019 09:04
| @Samuel I never talked about a library. I just meant some functions (in a .h) with parameters in registers that you can use inlined in a C source.
| |
| | Samuel Devulder
Posts 248 19 Jul 2019 12:49
| Library or inline asm are the same, optimization-wise. Both appear as an "atomic" function call, and the compiler cannot optimize them as much as fully inlined C code. So it is better to use "static inline" with plain C code in include.h, to give the compiler a better chance to schedule the instructions over the whole of the function.
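A minimal sketch of that "static inline in a header" approach, written in plain C. The type vec3 and the names dot3 and in_front_of_plane are hypothetical and only illustrate the pattern; they are not taken from Quake or from any existing header.

```c
/* Sketch of a header-only vector helper (hypothetical names). */
typedef struct {
    float x, y, z;
} vec3;

/* Plain C, visible to the compiler at every call site, so it can be
 * inlined and scheduled together with the caller's other FPU work. */
static inline float dot3(const vec3 *a, const vec3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}

/* Typical use: the dot product feeds directly into a decision.
 * Returns 1 if point p lies on the positive side of the plane. */
static inline int in_front_of_plane(const vec3 *p, const vec3 *normal,
                                    float dist)
{
    return dot3(p, normal) - dist > 0.0f;
}
```

Because dot3 is ordinary C rather than an opaque call or an asm block, the compiler is free to keep the operands in FPU registers and interleave the multiplies with surrounding code.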
| |
| | Stefan "Bebbo" Franke
Posts 142 21 Jul 2019 07:49
| Samuel Devulder wrote:
| There are strange things occurring with this matrix*vector routine. EXTERNAL LINK There are lots of "fadd #0,freg" as pointed out by Grom68k, ...
|
I undid my changes to remove these zeros. You have to use -ffast-math. Why?

X + 0 and X - 0 both give X when X is NaN, infinite, or nonzero and finite. The problematic cases are when X is zero, and its mode has signed zeros. In the case of rounding towards -infinity, X - 0 is not the same as X because 0 - 0 is -0. In other rounding modes, X + 0 is not the same as X because -0 + 0 is 0.

Thus you can't omit the fadd #0,fpx unless the user forces it via -ffast-math.
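The "-0 + 0 is 0" case can be checked with a few lines of C. This is only a sketch: the helper name add_zero is made up, the volatile qualifiers are there to keep the compiler from folding the addition at compile time, and it assumes the default round-to-nearest mode and a build without -ffast-math.

```c
#include <math.h>   /* signbit */

/* Adds +0.0 to x at run time. With IEEE 754 signed zeros and the
 * default rounding mode, this returns x for every input except -0.0,
 * for which the result is +0.0. */
double add_zero(double x)
{
    volatile double v = x;
    volatile double zero = 0.0;
    return v + zero;
}
```

signbit(add_zero(-0.0)) is 0 although signbit(-0.0) is nonzero, which is why the compiler may not treat the addition as a no-op by default.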
| |
| | Samuel Devulder
Posts 248 21 Jul 2019 10:28
| So it is signed zeros that are causing troubles. Damn non-mathematical concept ;) Anyway, using your latest version and "-ffast-math" gives a great result concerning wait-cycle. I now only count 2 of them remaining in the very end of the calculation (that's really not very much) EXTERNAL LINK

_multiplyMatrix:
    subq.l #8,sp
    fmovem fp2/fp3/fp4/fp5/fp6/fp7,-(sp)
    fdmove.d (a0)+,fp6
    fdmove.x fp6,fp1
    fdmul.d (a1)+,fp6
    fdmove.d (a0)+,fp7
    fdmove.x fp7,fp0
    fdmove.x fp0,fp3
    fdmove.x fp1,fp2
    fdmul.d (a1)+,fp7
    fdmove.d (24,a1),fp5
    fdmove.d (16,a1),fp4
    fdmul.x fp0,fp5
    fdmul.d (88,a1),fp0
    fdmul.x fp1,fp4
    fdmul.d (56,a1),fp3
    fdmul.d (48,a1),fp2
    fdmul.d (80,a1),fp1
    fdadd.x fp6,fp7
    fdmove.d (a1)+,fp6
    fmove.d fp0,(72,sp)
    fdadd.x fp4,fp5
    fdmove.d (a0)+,fp0
    fdmove.d (24,a1),fp4
    fdadd.x fp3,fp2
    fdmul.x fp0,fp6
    fdmul.x fp0,fp4
    fdmove.x fp0,fp3
    fdmul.d (88,a1),fp0
    fdmul.d (56,a1),fp3
    fdadd.d (72,sp),fp1
    fdadd.x fp7,fp6
    fdmove.d (a1)+,fp7
    fdadd.x fp5,fp4
    fmove.d fp0,(72,sp)
    fdmove.d (a0),fp0
    fdmul.x fp0,fp7
    fdmove.d (24,a1),fp5
    fdadd.x fp3,fp2
    fdmul.x fp0,fp5
    fdmove.x fp0,fp3
    fdmul.d (56,a1),fp3
    fdadd.d (72,sp),fp1
    fdmul.d (88,a1),fp0
    fdadd.x fp6,fp7
    fdadd.x fp5,fp4
    move.l d0,a0
    fdadd.x fp3,fp2
    fdadd.x fp0,fp1
    ; 1 wait-cycle (fp7)
    fmove.d fp7,(a0)+
    fmove.d fp4,(a0)+
    ; 1 wait-cycle (fp2)
    fmove.d fp2,(a0)+
    fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
    fmove.d fp1,(a0)
    addq.l #8,sp
    rts

/me happy with the result :)
| |
| | Grom 68k
Posts 61 21 Jul 2019 11:08
| Samuel Devulder wrote:
| So it is signed zeros that are causing troubles. Damn non-mathematical concept ;) Anyway, using your latest version and "-ffast-math" gives a great result concerning wait-cycle. I now only count 2 of them remaining in the very end of the calculation (that's really not very much) EXTERNAL LINK
_multiplyMatrix:
    ...
    fdadd.x fp0,fp1
    ; 1 wait-cycle (fp7)
    fmove.d fp7,(a0)+
    fmove.d fp4,(a0)+
    ; 1 wait-cycle (fp2)
    fmove.d fp2,(a0)+
    fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
    fmove.d fp1,(a0)
    addq.l #8,sp
    rts
/me happy with the result :)
|
Hi, is it possible to reserve fp0 and fp1 for the last two fmoves? Example:
    fmove.d fp4,(a0)+
    fmovem (sp)+,fp7/fp6/fp5/fp4/fp3/fp2
    fmove.d fp1,(a0)+
    fmove.d fp0,(a0)
| |
| | Grom 68k
Posts 61 21 Jul 2019 21:24
| Philippe Flype wrote:
| Since the 080 has a precise cycle counter, I can output the real results for each of them. Those are REGS to REGS operations, with the exception of FMOVE R/W and FMOVEM R/W.

+------------+--------+-----+
| FPU instr  | Single | OoO |
+------------+--------+-----+
| FABS       |      1 |   1 |
| FADD       |      6 |   1 |
| FCMP       |      6 |   1 |
| FDABS      |      1 |   1 |
| FDADD      |      6 |   1 |
| FDDIV      |      9 |   2 |
| FDIV       |      9 |   2 |
| FDMOVE     |      1 |   1 |
| FDMUL      |      6 |   1 |
| FDNEG      |      1 |   1 |
| FDSQRT     |     21 |  12 |
| FDSUB      |      6 |   1 |
| FINTRZ     |      2 |   1 |
| FMOVERm    |      1 |   1 |
| FMOVEWm    |      1 |   1 |
| FMOVERi    |      1 |   1 |
| FMOVEWi    |      1 |   1 |
| FMOVECR    |      1 |   1 |
| FMOVECTRL  |      4 |   4 |
| FMOVEMR    |      8 |   8 |
| FMOVEMW    |     25 |  25 |
| FMUL       |      6 |   1 |
| FNEG       |      1 |   1 |
| FSABS      |      1 |   1 |
| FSADD      |      6 |   1 |
| FSDIV      |      9 |   2 |
| FSGLDIV    |      9 |   2 |
| FSGLMUL    |      6 |   1 |
| FSMOVE     |      1 |   1 |
| FSMUL      |      6 |   1 |
| FSNEG      |      1 |   1 |
| FSQRT      |     21 |  12 |
| FSSQRT     |     21 |  12 |
| FSSUB      |      6 |   1 |
| FSUB       |      6 |   1 |
| FTST       |      1 |   1 |
| FSEQ       |      1 |   1 |
| FSCC       |      1 |   1 |
| FNOP       |      1 |   1 |
+------------+--------+-----+
| FPSP instr | Single | OoO |
+------------+--------+-----+
| FACOS      |    121 | 121 |
| FASIN      |    121 | 121 |
| FATAN      |    198 | 198 |
| FATANH     |    153 | 153 |
| FCOS       |    209 | 209 |
| FCOSH      |    264 | 264 |
| FETOX      |    220 | 220 |
| FETOXM1    |    231 | 231 |
| FGETEXP    |     88 |  88 |
| FGETMAN    |     88 |  88 |
| FINT       |     99 |  99 |
| FLOG10     |    231 | 231 |
| FLOG2      |    242 | 242 |
| FLOGN      |    220 | 220 |
| FLOGN1P    |    220 | 220 |
| FMOD       |    121 | 121 |
| FREM       |    121 | 121 |
| FSCALE     |     99 |  99 |
| FSIN       |    238 | 238 |
| FSINCOS    |    264 | 264 |
| FSINH      |    286 | 286 |
| FTAN       |    198 | 198 |
| FTANH      |    275 | 275 |
| FTENTOX    |    231 | 231 |
| FTWOTOX    |    231 | 231 |
+------------+--------+-----+
Source code provided : EXTERNAL LINK |
In the gcc commits, I found fdiv defined with a latency of 10 instead of 9. I also understand that fdiv is not fully pipelined. Where is the cycle that cannot be used?

;; all insns with latency 10
(define_insn_reservation "m68080_fpu_10" 10
  (and (eq_attr "cpu" "m68080")
       (eq_attr "type" "fdiv"))
  "f0_pipeline, f1_pipeline, f2_pipeline, f3_pipeline, f4_pipeline, f5_pipeline, f6_pipeline, f7_pipeline, f8_pipeline, f9_pipeline")
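A small C sketch of how the two table columns could be read. It rests on an assumption that the thread does not confirm: that "Single" is the latency of one isolated instruction and "OoO" is the issue interval between back-to-back independent instructions of the same kind. The struct fpu_timing and both function names are made up for illustration.

```c
/* Assumed reading of the cycle table (not confirmed in the thread):
 * latency  = "Single" column, cycles until the result is available
 * interval = "OoO" column, cycles between two independent issues */
typedef struct {
    int latency;
    int interval;
} fpu_timing;

/* n operations where each one needs the previous result:
 * nothing overlaps, so the latencies simply add up. */
int cycles_dependent(fpu_timing t, int n)
{
    return n * t.latency;
}

/* n independent operations issued back to back: a new one starts
 * every `interval` cycles and the last one still needs its latency. */
int cycles_independent(fpu_timing t, int n)
{
    return n > 0 ? (n - 1) * t.interval + t.latency : 0;
}
```

Under that reading, four independent FMULs (6/1) would take 9 cycles against 24 for a dependent chain, and four independent FDIVs (9/2) would take 15; the interval of 2 is what "not fully pipelined" would mean here.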
| |
| | Grom 68k
Posts 61 23 Jul 2019 12:36
| Philippe Flype wrote:
| Since the 080 have a precise cycle counter, i can output the real results of each of them. Those are REGS to REGS operations, in exception of FMOVE R/W, FMOVEM R/W.
+------------+--------+-----+
| FPU instr  | Single | OoO |
+------------+--------+-----+
| FDIV       |      9 |   2 |
| FMUL       |      6 |   1 |
| FSQRT      |     21 |  12 |
+------------+--------+-----+
Source code provided : EXTERNAL LINK |
Hi, to help scheduling, could the FPU pipelines be defined like this?

;; all insns with latency 6
(define_insn_reservation "m68080_fpu_6" 6
  (and (eq_attr "cpu" "m68080")
       (eq_attr "type" "fmul,falu,fcmp,ftst"))
  "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_end")

;; all insns with latency 10
(define_insn_reservation "m68080_fpu_10" 10
  (and (eq_attr "cpu" "m68080")
       (eq_attr "type" "fdiv"))
  "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr4, f0_pipeline_instr5, f0_pipeline_instr6, f0_pipeline_end")

;; all insns with latency 21
(define_insn_reservation "m68080_fpu_21" 21
  (and (eq_attr "cpu" "m68080")
       (eq_attr "type" "fsqrt"))
  "f0_pipeline_start1, f0_pipeline_start2, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_instr3, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr1, f0_pipeline_instr2, f0_pipeline_end")
Regards
| |
| | Stefan "Bebbo" Franke
Posts 142 23 Jul 2019 16:08
The latency for fdiv can be changed to 9, no problem. Your version will result in some stalls, since one state can only be used by one insn at a time: e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 and an fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one has to wait.
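The "one state, one insn" rule can be sketched in C as a conflict check between two reservation sequences. This is a deliberately simplified two-instruction model, not how gcc's scheduler automaton is implemented; the enum values mirror the state names from Grom's proposal, and conflicts and min_issue_distance are hypothetical helper names.

```c
#include <stddef.h>

/* Pipeline states named after the proposed reservations. */
enum unit {
    U_START1, U_START2,
    U_INSTR1, U_INSTR2, U_INSTR3, U_INSTR4, U_INSTR5, U_INSTR6,
    U_END
};

/* a[i] / b[i] is the state an insn occupies in its i-th cycle.
 * Returns 1 if issuing b `delay` cycles after a makes both insns
 * claim the same state in the same cycle. */
static int conflicts(const enum unit *a, size_t na,
                     const enum unit *b, size_t nb, size_t delay)
{
    for (size_t i = 0; i < nb; i++) {
        size_t cycle = delay + i;   /* cycle counted from a's issue */
        if (cycle < na && a[cycle] == b[i])
            return 1;
    }
    return 0;
}

/* Smallest delay (at least 1) at which b can follow a without a
 * conflict. Terminates because no overlap is left once delay >= na. */
size_t min_issue_distance(const enum unit *a, size_t na,
                          const enum unit *b, size_t nb)
{
    size_t delay = 1;
    while (conflicts(a, na, b, nb, delay))
        delay++;
    return delay;
}
```

Fed with the proposed fdiv sequence (instr1 held for two cycles) followed by the fmul sequence, this model yields a distance of 2, and 1 for fmul after fmul. An insn that revisits instr1 several times, as the proposed fsqrt does, produces more such collisions.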
| |
| | Grom 68k
Posts 61 24 Jul 2019 15:18
| Stefan "Bebbo" Franke wrote:
| The latency for fdiv can be changed to 9, no problem. Your version will result in some stalls, since one state can only be used by one insn at a time: e.g. if fsqrt switches from f0_pipeline_instr3 to f0_pipeline_instr1 and an fmul wants to switch from f0_pipeline_start2 to f0_pipeline_instr1, one has to wait.
|
Hi, as fdiv and fsqrt seem not to be fully pipelined, they probably sometimes use the same pipe as fadd or fmul. It would be easy to fix, but we need an hour of the FPU core developer's time to describe the FPU pipelines.

If it is useful, do you know how to add the FPSP insns (complex FPU functions such as sin, cos...)? Does an FPSP instruction lock the entire FPU? How would this lock be written for the other insns (fadd, fmul...)? If you can write the first one, I can do the others after my holidays.

After that, I tried integer instructions: EXTERNAL LINK Mul should be modified for -m68080.
Gunnar von Boehn wrote:
| Stefan "Bebbo" Franke wrote:
| - what is the latency of each insn? |
always 1
More expensive are:
MUL = 2
DIV = 32
MOVEM = 1 per Reg
MOVE16 = 4
CMPM = 2
JMP/JSR with calculated EA = 4, e.g. "JSR -40(A6)"
JMP/JSR absolute or PC-relative = 1
|
Is it mandatory to subtract 1 and then add 1 again when converting a short to an int? EXTERNAL LINK
I am impressed that gcc uses dbeq, but it leaves a jne. EXTERNAL LINK
Regards
| |
| | Samuel Devulder
Posts 248 24 Jul 2019 16:14
| Grom 68k wrote:
| As fdiv and fsqrt seem to be not fully pipelined,
|
Is this true? As far as I can test, simple fpu ops can run concurrently with fdiv.

Concerning complex fpu functions like fsin/fcos/ftan/fexp etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast but probably flushes the pipeline. Better consider fsin/fcos and friends as not pipelined at all.

Concerning integer multiplication, it is even worse if you mul by 16 instead of 11. It produces 4 additions in a row, accounting for 4 cycles, whereas a single LSL #4 is only one cycle! Am I wrong in estimating the cycles for LSL?
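For reference, the three forms under discussion compute the same value for unsigned operands, as this small C sketch shows. The function names are made up; which instruction sequence gcc picks for each of them depends on its cost tables for the selected CPU.

```c
#include <stdint.h>

/* Multiply by 16 written as a multiplication. */
uint32_t times16_mul(uint32_t x)
{
    return x * 16u;
}

/* The same value as a single shift (one LSL #4 on the 68k). */
uint32_t times16_shift(uint32_t x)
{
    return x << 4;
}

/* The same value as four dependent doublings, which is the shape of
 * the add sequence reported above. */
uint32_t times16_adds(uint32_t x)
{
    x += x;   /* 2x  */
    x += x;   /* 4x  */
    x += x;   /* 8x  */
    x += x;   /* 16x */
    return x;
}
```

All three wrap around modulo 2^32 in the same way, so for unsigned int there is no semantic reason to prefer the add chain over the shift.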
| |
| | Grom 68k
Posts 61 24 Jul 2019 16:52
| Samuel Devulder wrote:
| Grom 68k wrote:
| As fdiv and fsqrt seem to be not fully pipelined, |
Where does that come from? As far as I can test, simple fpu ops can run concurrently with fdiv. Concerning complex fpu functions like fsin/cos/tan etc., you can ignore the pipeline. They are more or less emulated and take plenty of operations (see the fpsp lib), plus the interrupt mechanism, which is quite fast but probably flushes the pipeline. Better consider fsin/fcos and friends as not pipelined at all. Concerning integer operations, if you mul by 16 instead of 11, I'm surprised by the produced ASM. It is a series of 4 additions in a row accounting for 4 cycles, whereas an LSL #4 is only one cycle! Am I wrong in estimating the cycles for LSL? |
At the very least, the fdiv definition must be modified. It is defined as fully pipelined in gcc, and that is not the case, as the FPU cycle counter shows.
Philippe Flype wrote:
| Since the 080 have a precise cycle counter, i can output the real results of each of them. Those are REGS to REGS operations, in exception of FMOVE R/W, FMOVEM R/W.
+------------+--------+-----+
| FPU instr  | Single | OoO |
+------------+--------+-----+
| FDIV       |      9 |   2 |
| FMUL       |      6 |   1 |
| FSQRT      |     21 |  12 |
+------------+--------+-----+
Source code provided : EXTERNAL LINK |
For the FPSP, that was my question: how can the pipeline be blocked for the other fpu instructions?

For integer, I think the same as you, but there is probably a reason, such as differences in the flags (zero, negative, overflow...) or something else.
EDIT: I tried with unsigned int, it's the same.
EDIT2: Worse, <<4 is replaced by 4 adds :( EXTERNAL LINK
EDIT3: There is no reason EXTERNAL LINK

Apart from that, did you try the 1/sqrt(x) function from Quake with gcc? Would 3-operand instructions make it faster? Could creating an FPSP instruction help?
| |
| | Samuel Devulder
Posts 248 24 Jul 2019 17:24
| I have an asm implementation for 1/sqrt(x) which I use in quake. It has a lot of wait-states. It would be better to let the compiler inline and schedule the corresponding C code amongst the other fpu calculations in the caller.

* float Q_rsqrt( float number )
* {
*   long i;
*   float x2, y;
*   const float threehalfs = 1.5F;
*
*   x2 = number * 0.5F;
*   y = number;
*   i = * ( long * ) &y; // evil floating point bit level hacking
*   i = 0x5f3759df - ( i >> 1 ); // what the fuck?
*   y = * ( float * ) &i;
*   y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
*// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
*
*   return y;
*}

* fp0/d0 = 1/fsqrt(fp0) (fp1,d1 preserved)
    xdef _Q_rsqrt
    xdef @Q_rsqrt
    cnop 0,4
_Q_rsqrt
    ifnd REGPARM
    fmove.s 4(sp),fp0
    endc
@Q_rsqrt
    fmove.s fp0,d0
    fmul.s #-0.5,fp0
    lsr.l #1,d0
    neg.l d0
    add.l #$5f3759df,d0
    fmul.s d0,fp0
    fmul.s d0,fp0
    fadd.s #1.5,fp0
    fmul.s d0,fp0
    ifd __GNUC__
    fmove.s fp0,d0
    endc
    rts
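For the "let the compiler schedule the C code" route, here is a sketch of the same routine in portable C. It differs from the commented source above in two ways that are choices of this sketch, not of the original: the bit reinterpretation uses memcpy on a uint32_t instead of pointer casts (which avoids aliasing problems with optimizing compilers), and the function is named q_rsqrt. It assumes a 32-bit IEEE 754 float and a positive, finite argument.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(number) with the magic-constant initial guess
 * followed by one Newton-Raphson step. */
float q_rsqrt(float number)
{
    const float x2 = number * 0.5f;
    float y = number;
    uint32_t i;

    memcpy(&i, &y, sizeof i);        /* float bits -> integer */
    i = 0x5f3759dfu - (i >> 1);      /* initial guess         */
    memcpy(&y, &i, sizeof y);        /* integer -> float bits */

    y = y * (1.5f - x2 * y * y);     /* one refinement step   */
    return y;
}
```

Declared static inline in a header, this gives the compiler the chance to interleave the three dependent multiplies with other FPU work in the caller, which a separately assembled routine cannot offer.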
| |