Overview Features Coding ApolloOS Performance Forum Downloads Products Order Contact

Welcome to the Apollo Forum

This forum is for people interested in the APOLLO CPU.
Please read the forum usage manual.
Please visit our Apollo-Discord Server for support.



All TopicsNewsPerformanceGamesDemosApolloVampireAROSWorkbenchATARIReleases
The team will post updates and news about our project here

68k FPU Coding Challenge - Win a Prize!page  1 2 

Gunnar von Boehn
(Apollo Team Member)
Posts 6207
09 Sep 2018 22:16


We would like to invite you to participate on a little coding challenge.
 
All 3D games need some math routine behind.
You all know the famous game ELITE.
In the intro of the game the star-ships do this nice rotation.
 
If your computer has an FPU than a common way to calculate such rotation is to use a rotation matrix on an array of vertexes.
 
Here is the challenge:
Lets assume we have an array of vertex of type SINGLE with (X,Y,Z) per element. This means 12 byte per Vertex.
Lets assume you have an array containing 1000 Vertex =(12KB)

Lets assume we have a rotation matrix of type SINGLE
 
We use type SINGLE in this challenge because this is very often used for 3D games.

Please write an FPU routine which processes this array
and applies the rotation calculation on it
and stores the rotated result in a new array.
 

Your routine gets called with these parameters:

  A0= PTR to array of source vertex
  A1= PTR where you shall store the rotated values
  A2= PTR to the rotation matrix to use
  D0= 1000  Number of elements to process.
 
 
APOLLO 68080 x12 will max reach 85 MegaFlops with optimal code.
 
The coder who writes a routine coming the closest to this score will win a price of a Vampire-Team-T-shirt.
For the measurement we assume that the src array is hot in Dcache.
 
We look forward to see your routines.
Good Luck!

Please ask if you have questions.


Thellier Alain

Posts 141
10 Sep 2018 09:15


Nice idea Gunnar
 
  To give a starting base here it is the code that give gcc 2.90.27
  with options -m68040 -m68881 -O3
  (edit: use wanted registers)
 
//  /*==================================================================*/
//  #include <string.h>
//  #include <stdarg.h>
//  #include <math.h>
//  #include <proto/exec.h>
//  /*==================================================================================*/
//  typedef struct _Vertex3D
//  {
//  float x,y,z;
//  } Vertex3D;
//  /*=================================================================*/
//  void CopyTransformV(Vertex3D* Vs,Vertex3D* V2s,float *Ms,LONG Vnbs)
//  {
  pea a5@
  movel sp,a5
  fmovem #0x1c,sp@-
  movel a0,sp@-
//  register Vertex3D* V =Vs;
LBB2:
  movel a5@(8),a0
//  register Vertex3D* V2 =V2s;
  movel a5@(12),a1
//  register float*  M =Ms;  /* put stack values to given registers */
  movel a5@(16),a2
//  register LONG  Vnb =Vnbs;
  movel a5@(20),d0
//  register float  x;
//  register float  y;
//  register float  z;
// 
//  while(Vnb)
  jeq L75
.even
L76:
//  {
//  x=V[0].x;y=V[0].y;z=V[0].z;
  fsmoves a0@,fp2
  fsmoves a0@(4),fp4
  fsmoves a0@(8),fp3
//  V2[0].x= M[0]*x + M[4]*y+ M[8] *z;
  fsmoves a2@,fp0
  fsmulx fp2,fp0
  fsmoves a2@(16),fp1
  fsmulx fp4,fp1
  fsaddx fp0,fp1
  fsmoves a2@(32),fp0
  fsmulx fp3,fp0
  fsaddx fp0,fp1
  fmoves fp1,a1@
//  V2[0].y= M[1]*x + M[5]*y+ M[9] *z;
  fsmoves a2@(4),fp0
  fsmulx fp2,fp0
  fsmoves a2@(20),fp1
  fsmulx fp4,fp1
  fsaddx fp0,fp1
  fsmoves a2@(36),fp0
  fsmulx fp3,fp0
  fsaddx fp0,fp1
  fmoves fp1,a1@(4)
//  V2[0].z= M[2]*x + M[6]*y+ M[10]*z;
  fsmuls a2@(8),fp2
  fsmuls a2@(24),fp4
  fsaddx fp2,fp4
  fsmuls a2@(40),fp3
  fsaddx fp3,fp4
  fmoves fp4,a1@(8)
// 
//  V++;
  addw #12,a0
//  V2++;
  addw #12,a1
//  Vnb--;
  subql #1,d0
//  }
  jne L76
L75:
//  }
LBE2:
  movel sp@+,a0
  fmovem sp@+,#0x38
  unlk a5
  rts


Thellier Alain

Posts 141
10 Sep 2018 10:06


First ideas:

* Process 2 vertex per loop

* Remove the addw #12,a0 and  addw #12,a1 and use an index register
in d1 for accessing vertices

* No more use subql #1,d0 but cmp d1,d2 with d2 containing 12*1000 (12*Vnb)

* Store some matrices values in fp5 fp6 fp7 (68040)

* If enough fp registers then store whole matrix in fp0 fp16 (68080)

* feed x y z with a movem

BTW a non straightforward solution would be to analyze the matrix for simples  x rotation or y rotation or z rotation.
Then have 3 dedicated functions for those 3 cases

For example in quake3 of course the scene use full xyz transformation matrix but there are some "3d icons" that only y rotate. Depending of the game simple rotations maybe more important in proportion




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Sep 2018 10:42



thellier alain wrote:

First ideas:

 
Hi Alain,
thanks for the ideas.
 
Let me give some more details about 68080.
1) the FPU can issue one FPU instruction per cycle.
The CPU can in addition execute another integer instruction each cycle. This means be interleaving FPU/CPU instruction you can do some CPU instructions in parallel for "free".
This means interleaved PTR increments can be done for free.

Another free option is the EA mode (an)+

2) dbra Dn,loop  takes only 0.5 cycle in 2nd pipe.
This means a DBRA in the last instruction can be free.

The same is true for "SUBQ.l #1,Dn, bne" combination.
This combo is identified as LOOP and can be executed for free in 2nd pipe.

As you correctly pointed out the 68080 has more Register.
3) APOLLO has 8 normal FPU register + 24 Extended FPU register.
This means FPU code can use up to 32 FPU regs.
Plus 8 Data regs as Inputs.

4) APOLLO allows to use 3 Opp per cycle.
This means using an extra instruction extension word
something like this can be done in 1 cycle.
FADD (a0),Fp0,Fp1


Thellier Alain

Posts 141
10 Sep 2018 14:08


If we have 32 fpu registers then we can have the full matrice in registers once for all AND movem 5 vertex (15 fpu registers) at one time from/to memory
I will remain one fpu register for the calculation

>free option is the EA mode (an)+
I allways thought (an)+ (an)+ (an)+ was worse than doing an(0) an(1) an(2) etc as the + change the adresse register so dont use the pipeline
I suppose 68080 change the rules :-)




Samuel Devulder

Posts 248
10 Sep 2018 14:43


The M being constant through-out the loop, I'd put them in some regs. Here is what I'd get with pseudo-asm
 

    ; a0 = input
    ; a1 = output
    ; a2 = M
    ; d7 = counter
   
    ; using extra regs e0-e8, if not allowed use n(a2)
        fmove.s (a2)+,e0
        fmove.s (a2)+,e1
        fmove.s (a2)+,e2
   
        fmove.s (a2)+,e3
        fmove.s (a2)+,e4
        fmove.s (a2)+,e5
   
        fmove.s (a2)+,e6
        fmove.s (a2)+,e7
        fmove.s (a2)+,e8
   
        move.l  (a0)+,d0-d2 ; 3 cycles
   
    loop:
    ; 3 arg op, or use macro for plain 68882
        fmul3  d0,e0,fp0  ; 1
        fmul3  d1,e1,fp1  ; 1
        fmul3  d2,e2,fp2  ; 1
   
        fmul3  d0,e3,fp3  ; 1
        fmul3  d1,e4,fp4  ; 1
        fmul3  d2,e5,fp5  ; 1
   
        fadd    fp0,fp1    ; 1
   
        fmul3  d2,e6,fp6  ; 1
        fmul3  d2,e7,fp7  ; 1
        fmul3  d2,e8,fp0  ; 1
   
        fadd    fp3,fp4    ; 1
        fadd    fp1,fp2    ; 1
   
        ; 2 cycles bubble
        fadd    fp6,fp7    ; 1
   
        ; 3 cycles bubble -->  move.l  (a0)+,d0-d2 ; 3 cycles
        fadd    fp4,fp5    ; 1
   
        ; 1 cycles bubble
        fadd    fp7,fp0    ; 1
   
        fmove.s fp2,(a1)+  ; 1
   
        ; 3 cycles bubble
        fmove.s fp5,(a1)+  ; 1
   
        ; 3 cycles bubble
        fmove.s fp0,(a1)+  ; 1
   
        dbra    d7,loop    ; 0
   
        ; total = 21 + 9 bubbles = 30 cycles/element

Bubbles in the bottom of the loop can be eliminated by putting there the beginning (fmul3 parts) of the next iteration, but is a pain to unroll loop in asm. I need time to write this correctly. But you got the idea.


Renee Cousins
(Apollo Team Member)
Posts 142
10 Sep 2018 18:11


Your move.l (movem.l?) needs to be inside the loop. If you use more registers and unroll the loop somewhat you should be able to intermix the fmuls of the next vertex with the faad/fmove of the prior.


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Sep 2018 18:25


Nice ideas so far.
Lets continue this, and lets see if you can score better than an 68060 @ 400 MHz!


Thellier Alain

Posts 141
10 Sep 2018 19:07


@samuel
  You are using d0 d1 d2 for floats operations : is it legal ?
  Why do you store the new x y z values only at end if not using movem ?

 
  @gunnar
  Of course we can interleave int/float operations for better performance but here we do floats only... especially if accesses to adresses are not indexed (else we will have several int index++ )
 
 
 
 
 
 
 


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
10 Sep 2018 19:31


thellier alain wrote:

@samuel
  You are using d0 d1 d2 for floats operations : is it legal ?

Yes Dn is legal for the 1st operant.

68K FPU has normally 2 Operants

FADD.S 1st,2nd

1st can be any of those
  memory
  data register
  Fpu register 
  Extended Register  (68080 only)



Samuel Devulder

Posts 248
10 Sep 2018 20:00


Renee Cousins wrote:

  Your move.l (movem.l?) needs to be inside the loop.
 

They are! ( in the pseudo-code after the right arrow "--->" in the comment about the "3 cycles bubble").
 
I'll get rid of as many bubbles as possible by rotating the loop logic a little and clean the source code. Let's see if I can get rid of all the bubbles. (To optimize, I typically write asm by hand, then add comments about where are the fpu bubbles and try to reorg the code baby-steps by baby-steps until I'm happy with the solution.)
 
@Alain: yes using Dn reg as float source (.s extension) is perfectly legal. They are handled as "vey fast" memory-access in my thinking. As you can see I read them with movem.l to actually read them as "is", eg. as floating point values.
 
I do not write the result with a fmovem because I'm not sure if fmovem can store floats (I tend to think only .x (eg 80bits) values are stored.) Anyway, I think movem and signle moves costs the same for 32-bits values. And also having the writes spread in different places in the code allow filling the bubbles and gain cycles in the end.



Samuel Devulder

Posts 248
10 Sep 2018 20:59


How about this? I got rid of the 3 cycles bubbles in the end of the loop in the previous code. The code also contains an "equivalent" of 68080 3-way ops for 68882 (slower of course)

        ; a0 = input
        ; a1 = output
        ; a2 = M
        ; d7 = counter
           
          ifne ac68080
         
        ; code for apollo accelerator 68080
            fmove.s (a2)+,e0
            fmove.s (a2)+,e1
            fmove.s (a2)+,e2
       
            fmove.s (a2)+,e3
            fmove.s (a2)+,e4
            fmove.s (a2)+,e5
       
            fmove.s (a2)+,e6
            fmove.s (a2)+,e7
            fmove.s (a2)+,e8
          else
       
        ; mc68882 version
        e0 equ 0
        e1 equ 4
        e2 equ 8
        e3 equ 12
        e4 equ 16
        e5 equ 20
        e6 equ 24
        e7 equ 32
        e8 equ 36
       
        ; emulation of 3 arg op for mc68882
        fmul3 macro
              fmove.s \1,\3
              fmul.s  \2(a2),\3
          endm
         
        ; end of 68882 specific code
          endc
       
            move.l  (a0)+,d0-d2 ; 3 cycles
       
            fmul3  d0,e0,fp0  ; 1
            fmul3  d1,e1,fp1  ; 1
            fmul3  d2,e2,fp2  ; 1
       
            fmul3  d0,e3,fp3  ; 1
            fmul3  d1,e4,fp4  ; 1
           
        loop:
            fmul3  d2,e5,fp5  ; 1
            fmul3  d2,e6,fp6  ; 1
       
        ; 3 cycles bubble when looping back (fp0)
            fadd    fp0,fp1    ; 1 ----+ <----------+
                                        |            |
            fmul3  d2,e7,fp7  ; 1    |            |
            fmul3  d2,e8,fp0  ; 1    |            |
                                        |            |
            fadd    fp3,fp4    ; 1    |            |
                                        |            |
        ; 2 cycles bubble (fp1)        |            |
            fadd    fp1,fp2    ; 1 <---+            |
            fadd    fp6,fp7    ; 1                  |
                                                    |
            movem.l (a0)+,d0-d2 ; 3 (ex 3 bubbles)  |
            fadd    fp4,fp5    ; 1                  |
                                                    |
            fmove.s fp2,(a1)+  ; 1                  |
            fadd    fp7,fp0    ; 1                  |
                                                    |
            fmul3  d1,e1,fp1  ; 1 (ex 3 bubbles)  |
            fmul3  d2,e2,fp2  ; 1 (ex 3 bubbles)  |
            fmul3  d0,e3,fp3  ; 1 (ex 3 bubbles)  |
            fmove.s fp5,(a1)+  ; 1                  |
                                                    |
            fmul3  d1,e4,fp4  ; 1 (ex 1 bubble)    |
            fmove.s fp0,(a1)+  ; 1                  |
            fmul3  d0,e0,fp0  ; 1 -----------------+
       
            dbra    d7,loop    ; 0
           
        total : 21 + 5 bubbles = 26 cycles / loop
       
Yet, there is still 5 cycles in bubbles, mainly due to fp0 upon loopback :( I cannot see a way to do something useful in the ned opf the loop to fill that 3 cycle loop. :(


Thellier Alain

Posts 141
11 Sep 2018 09:27


@all
I just noted that I posted the wrong C/Asm function in my first post: I suppose what Gunnar wants is 4x4 matrix not a 3x3
Anyway the problem is still more or less the same




Samuel Devulder

Posts 248
11 Sep 2018 15:13


Gunnar spoke of 12 bytes vectors (X,Y,Z) so for me the matrix is also 3x3.

(and there are some typos in my latest code.. some "d2" shoulde be replaced with "d0" and "d1".. well anyway it is just a sketch of code, not a real one.)


Renee Cousins
(Apollo Team Member)
Posts 142
11 Sep 2018 16:44


thellier alain wrote:
  I just noted that I posted the wrong C/Asm function in my first post: I suppose what Gunnar wants is 4x4 matrix not a 3x3. Anyway the problem is still more or less the same

The advantage of 4x4 is that the two common (object and camera) transform matrices can be combined into one. So instead of two 3x3 transforms you have one 4x4.



Thellier Alain

Posts 141
12 Sep 2018 08:23


@Samuel
You still have unused registers so you should process 2 (4?) vertices per loop : interlacing vertex1 and vertex2 computations should enhance the bubble problem

Perhaps using movem for reading vertices was not such a good idea as it impose to wait 3 reads before doing anything

Also d0 d1 d2 store x y z but are never used as destination for a computation : I mean when x is last used as input it can serve as output for a computation = economizing regs will allow to process more vertices per loop




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Sep 2018 08:28


Alain raises some very good points here.
 
a) Yes, processing 2 rows in parallel in the Loop will give more independent work to do - which allows you to fill bubbles.
This is a good idea.
 
b) Using three simple MOVE instructions can be faster than one MOVEM,  as the simple MOVE instructions could be done in parallel "for free" in a free 2nd pipe slot.
 



Thellier Alain

Posts 141
12 Sep 2018 13:21


IMHO processing 3 vertices per loop would be an even better idea as those vertices certainly represents triangles so certainly the vertices count will be a multiple of 3
 
>Using three simple MOVE instructions can be faster than one MOVEM
Yes but I dont know if it will be the same for 3*3 move if processing 3 vertices per loop



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Sep 2018 13:31


thellier alain wrote:

  >Using three simple MOVE instructions can be faster than one MOVEM
  Yes but I dont know if it will be the same for 3*3 move if processing 3 vertices per loop
 

The instruction in the CPU can looks like this


1pipe  2pipe
FMUL    free
FMUL    free
FMUL    free
FADD    free

In the example the 2nd pipe does nothing.
"Finger in nose" as the french say.
For example you can for free do a MOVE instruction each cycle in the 2nd pipe.

MOVEM will use the 1st pipe for several clock.
This would cost extra.



Thellier Alain

Posts 141
12 Sep 2018 15:31


Ok I see so interleaving the MOVEs is the better solution as we dont have much others int instructions to fill the 2nd pipe

I am reposting the problem as C/asm if using a 4x4 matrice used as  a 3x4 matrice (I mean rotation + translation)

//  /*==================================================================*/
//  #include <string.h>
//  #include <stdarg.h>
//  #include <math.h>
//  #include <proto/exec.h>
//  /*==================================================================================*/
//  typedef struct _Vertex3D
//  {
//  float x,y,z;
//  } Vertex3D;
//  /*=================================================================*/
//  void CopyTransformV(Vertex3D* Vs,Vertex3D* V2s,float *Ms,LONG Vnbs)
//  {
  pea a5@
  movel sp,a5
  fmovem #0x1c,sp@-
  movel a0,sp@-
//  register Vertex3D* V =Vs; /* put stack values to wanted registers */
LBB2:
  movel a5@(8),a0
//  register Vertex3D* V2 =V2s;
  movel a5@(12),a1
//  register float*  M =Ms;
  movel a5@(16),a2
//  register LONG  Vnb =Vnbs;
  movel a5@(20),d0
//  register float  x;
//  register float  y;
//  register float  z;
// 
//  while(Vnb)
  jeq L75
.even
L76:
//  {
//  x=V[0].x;y=V[0].y;z=V[0].z;
  fsmoves a0@,fp2
  fsmoves a0@(4),fp3
  fsmoves a0@(8),fp4
// 
//  V2[0].x= M[0]*x + M[4]*y+ M[8] *z+ M[12];
  fsmoves a2@,fp0
  fsmulx fp2,fp0
  fsmoves a2@(16),fp1
  fsmulx fp3,fp1
  fsaddx fp0,fp1
  fsmoves a2@(32),fp0
  fsmulx fp4,fp0
  fsaddx fp1,fp0
  fsadds a2@(48),fp0
  fmoves fp0,a1@
//  V2[0].y= M[1]*x + M[5]*y+ M[9] *z+ M[13];
  fsmoves a2@(4),fp0
  fsmulx fp2,fp0
  fsmoves a2@(20),fp1
  fsmulx fp3,fp1
  fsaddx fp0,fp1
  fsmoves a2@(36),fp0
  fsmulx fp4,fp0
  fsaddx fp1,fp0
  fsadds a2@(52),fp0
  fmoves fp0,a1@(4)
//  V2[0].z= M[2]*x + M[6]*y+ M[10]*z+ M[14];
  fsmuls a2@(8),fp2
  fsmuls a2@(24),fp3
  fsaddx fp2,fp3
  fsmuls a2@(40),fp4
  fsaddx fp3,fp4
  fsadds a2@(56),fp4
  fmoves fp4,a1@(8)
// 
//  V++;
  addw #12,a0
//  V2++;
  addw #12,a1
//  Vnb--;
  subql #1,d0
//  }
  jne L76
L75:
//  }
LBE2:
  movel sp@+,a0
  fmovem sp@+,#0x38
  unlk a5
  rts

posts 26page  1 2