
Welcome to the Apollo Forum

Performance and Benchmark Results!

AMMX DEMO: Real Time Texture Manipulation

Manuel Jesus

Posts 155
17 Feb 2018 18:45


One of the keys to powerful 2D graphics is fast manipulation of 24-bit textures at resolutions higher than Amiga LOWRES. Shown here is a 640x480 24-bit texture manipulated at 90 fps. This is done using optimized AMMX code and thus achieves results much faster than an 060.
  The AMMX instructions also have the benefit of taking little space on the FPGA.
 
  EXTERNAL LINK


Thellier Alain

Posts 141
20 Feb 2018 09:50


This is interesting.
So a future project could be to use those instructions in a "Vampire optimized" Wazp3D or PatchCompositeTags...
That would allow basic compositing & 3D support on the Vampire in the future...

>640X480 24 bit texture manipulated at 90 fps
As drawing full 3D pixels will use 4-5 times more instructions per pixel, we can hope for a playable Quake at 640x480 at around 20 fps one day... or games like MACE




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
20 Feb 2018 10:06


thellier alain wrote:

This is interesting

So a future project can be to use those instructions in a "Vampire optimized" Wazp3D or PatchCompositeTags...

Yes indeed.

If you would like to tackle this, then I can offer
to write the low-level ASM functions for you.


Thellier Alain

Posts 141
20 Feb 2018 14:20


>I can offer you to write the low levels ASM functions.
Nice, thanks.

But I don't have a Vampire yet. I will buy a standalone as soon as one is available.

Anyway, I would be intellectually interested in how you would optimize a 3D rendering function.

I have already written about the Cow3D Vampire version with low FPU usage (I mean the FPU is not used for drawing)
EXTERNAL LINK  EXTERNAL LINK 
In this program the most crucial part is in the FillPoly function,
and especially in the loop that fills a horizontal segment,
which starts with
  NLOOP(dx)

and ends with
  LineDone:

This is my horrible low-level C that looks like ASM ;-)

/*==================================================================*/
#define ASRGBA64(ptr) (*((double*)ptr))
#define ASRGBA32(ptr) (*(( ULONG*)ptr))
#define ASRGB16(ptr)  (*(( UWORD*)ptr))

#define MUL8(src,dst,c,sa,da) dst[c]=( ((ULONG) ( ((UWORD)src[c])*sa + ((UWORD)dst[c])*da ))>>8)
#define MULCOL(c,sa,da) MUL8(T0.B.RGBA,D0.B.RGBA,c,sa,da)
#define MULRGB(sa,da)  if(sa==255) {D0.L.RGBA32=T0.L.RGBA32;} else { MULCOL(0,sa,da); MULCOL(1,sa,da); MULCOL(2,sa,da); }

#define MOD8(src,dst,c) dst[c]=( ((ULONG) ( ((UWORD)src[c])*((UWORD)dst[c])  ))>>8)
#define MODCOL(c)  MOD8(C0.B.RGBA,T0.B.RGBA,c)
#define MODRGBA  { MODCOL(0); MODCOL(1); MODCOL(2); MODCOL(3); }

#define SRC_A (T0.B.RGBA[3])
/*==================================================================*/
void FillPoly(struct Context3D *Ctx,struct FixPoint3D *Edge1,struct FixPoint3D *Edge2)
{
register LONG x,dx;
register LONG z,dz;
 
register LONG u,du;
register LONG v,dv;
register LONG w,dw;
 
register LONG r,dr;
register LONG g,dg;
register LONG b,db;
register LONG a,da;
 
register ULONG m,n;
register LONG y;

register ULONG *Tex32; 
register ULONG *Dst32; 
register ULONG *Dst32X; 
register LONG *Zbuf32; 
register LONG *Zbuf32X; 
register ULONG sline;
register ULONG dline;
register union Rgba3D T0;  /* texel */
register union Rgba3D D0;  /* destination pixel from screen */
register union Rgba3D C0;  /* current gouraud */

//..............|..................................
// .............|..#@..............................
// ..Out........|.#..@..........Clipped Triangle...
// ..of screen..|#....@.........to fill............
// .............x#.....@........between edges......
// ............x.#......@..........................
// ...........x..#.......@.........................
// ..........x...#........@........................
// .........x....#.Edge1...@.Edge2.................
// ..........x...#..........@...........In Screen..
// ............ x#...........@.....................
// .............|.#...........@....................
// .............|....#.........@...................
// .........Clip|......#........@..................
// -------------|---------x------x-------Clip------
// .............|............x....x................
// .............|...............x..x...Out.........
// .............|..................xx..of screen...
//..............|..................................

 
  FUNC
  y=Ctx->PolyY;
  Edge1=&Edge1[y];
  Edge2=&Edge2[y];

/*  Init Texture & Screen pointers  */
  Tex32=(ULONG*)Ctx->T->pixels;
  sline=Ctx->T->w;
 
  Dst32=(ULONG*)Ctx->pixels;
  dline=Ctx->w;

  Zbuf32=(LONG*)Ctx->zbuffer;

/* for each line of the poly height, fill one horizontal segment */
  MLOOP(Ctx->PolyHigh)
  {
  x =(Edge1->x);    /* first-pixel x */
  dx=(Edge2->x - Edge1->x)+1;  /* horizontal segment size in pixels */

  if(dx < 1)
  goto LineDone;

/* get first-pixel value for each "channels" */
  z=(Edge1->z);
 
  u=(Edge1->u); 
  v=(Edge1->v); 
  w=(Edge1->w);
 
  r=(Edge1->r);
  g=(Edge1->g);
  b=(Edge1->b);
  a=(Edge1->a);

/* compute the linear step (delta) of each channel along x */
  dz=(((Edge2->z>>16) - (Edge1->z>>16))<<16)/dx;
 
  du=(((Edge2->u>>16) - (Edge1->u>>16))<<16)/dx;
  dv=(((Edge2->v>>16) - (Edge1->v>>16))<<16)/dx;
  dw=(((Edge2->w>>16) - (Edge1->w>>16))<<16)/dx;

  dr=(((Edge2->r>>16) - (Edge1->r>>16))<<16)/dx;
  dg=(((Edge2->g>>16) - (Edge1->g>>16))<<16)/dx;
  db=(((Edge2->b>>16) - (Edge1->b>>16))<<16)/dx;
  da=(((Edge2->a>>16) - (Edge1->a>>16))<<16)/dx;
 
  Dst32X =&Dst32 [y*dline + x];  /* Pointer on the start of the segment on screen */
  Zbuf32X=&Zbuf32[y*dline + x];  /* Pointer on the start of the segment on zbuffer*/

/* fill a horizontal segment */
  NLOOP(dx)
  {
    if(z < Zbuf32X[n])  /* do a W3D_Z_LESS test */
    {
    Zbuf32X[n]=z;
    T0.L.RGBA32=Tex32[ (v>>16)*sline + (u>>16)];      /* get texel */

    if(C.UseGouraud)
    {
    C0.L.RGBA32=((r>>16)<<24)+((g>>16)<<16)+((b>>16)<<8)+((a>>16));  /* get current gouraud color as RGBA32*/
    MODRGBA                /* modulate effect for Texture/Color */
    }
   
    if(SRC_A)  /* if texel alpha (else is invisible for alpha==0) */
    {
    D0.L.RGBA32=Dst32X[n];
    MULRGB(SRC_A,(255-SRC_A));  /* do alpha transparency for Texture/Screen */
    Dst32X[n]=D0.L.RGBA32;
    }
    }

/* next pixel: new values for each "channels" */
  z=z+dz;
   
  u=u+du;
  v=v+dv;
  w=w+dw;
   
  r=r+dr;
  g=g+dg;
  b=b+db;
  a=a+da;
  }
 
LineDone: 
  Edge1++;  /* next line of poly */
  Edge2++;
  y++;
  }

}

 



Gunnar von Boehn
(Apollo Team Member)
Posts 6207
20 Feb 2018 17:44


Hi Alain,
 
A few ideas for your routine:
 
register LONG x,dx;
You declare 30 variables as type register as they are important.
Of course previous 68K CPUs did not provide so many registers.
Apollo will help you here, as only the 68080 has enough registers for your usage.
 
 
if(z < Zbuf32X[n])  /* do a W3D_Z_LESS test */
I see that your Z Buffer is 32bit int.
Did you try a 16bit buffer too?
16bit could save space and increase speed.
 

T0.L.RGBA32=Tex32[ (v>>16)*sline + (u>>16)];      /* get texel */
Here the LEA3D instruction will be a perfect match,
as it does this calculation in 0.5 cycle:
[ (v>>16)*sline + (u>>16)]
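
For reference, the address calculation in question can be written as a small C helper. This is a minimal sketch assuming 16.16 fixed-point coordinates as in the FillPoly listing; `texel_offset` is a hypothetical name, and the code says nothing about how LEA3D is implemented in hardware.

```c
#include <stdint.h>

/* Hypothetical helper: index of the texel addressed by 16.16 fixed-point
   coordinates (u, v) in a texture that is `sline` texels wide. */
static uint32_t texel_offset(int32_t u, int32_t v, uint32_t sline)
{
    uint32_t ui = (uint32_t)(u >> 16);   /* integer part of u */
    uint32_t vi = (uint32_t)(v >> 16);   /* integer part of v */
    return vi * sline + ui;
}
```

The fractional 16 bits of u and v only matter for the per-pixel accumulation; they are dropped when the texel is looked up.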
 
 
if(C.UseGouraud)
Those checks inside the work loop will cost some time.
The code will run faster if you write 2 routines:
one with Gouraud and one without, deciding before the loop which routine to use.
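
The suggestion above could look like this in C. A minimal sketch, assuming a single 8-bit channel and one constant shade per span to keep it short; `span_flat`, `span_gouraud` and `fill_span` are hypothetical names, not Wazp3D or Cow3D functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Modulate one 8-bit channel by another, as the MOD8 macro does. */
static uint8_t mod8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint32_t)a * b) >> 8);
}

/* Variant 1: plain copy, no Gouraud test inside the loop. */
static void span_flat(uint8_t *dst, const uint8_t *tex, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = tex[i];
}

/* Variant 2: every texel is modulated by a shade value. */
static void span_gouraud(uint8_t *dst, const uint8_t *tex, size_t n,
                         uint8_t shade)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = mod8(tex[i], shade);
}

/* The state is tested once per span, not once per pixel. */
static void fill_span(uint8_t *dst, const uint8_t *tex, size_t n,
                      int use_gouraud, uint8_t shade)
{
    if (use_gouraud)
        span_gouraud(dst, tex, n, shade);
    else
        span_flat(dst, tex, n);
}
```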
 
 
Tex32
Your code uses a 32bit texture.
This looks very good but is wasteful in memory and slow.
APOLLO will help you here a lot.
APOLLO supports compressed 24bit textures in HW.
Those textures will look as good but use much less memory and are much faster.

du=(((Edge2->u>>16) - (Edge1->u>>16))<<16)/dx;
Your routine uses linear interpolation in texture space?
What do you think about adding perspective correction on the texture?



Samuel Devulder

Posts 248
20 Feb 2018 18:27


Gunnar von Boehn wrote:

        du=(((Edge2->u>>16) - (Edge1->u>>16))<<16)/dx;
        Your routine uses linear interpolation in texture space?
        What do you think about adding perspective correction on the texture?

AFAIK, this kind of "delta" optimization along a dx axis is frequent. More info on that page: EXTERNAL LINK (look at "Optimizing the edge function").
 
I'm currently studying this with my new pet (a colorful monkey) :)
      EXTERNAL LINK


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
20 Feb 2018 19:37


Samuel Devulder wrote:

AFAIK, this kind of "delta" optimization along a dx axis is frequent.

Yes, and it's OK for a few pixels.
Doing a delta run for 8 pixels or so is fine; after this one should correct the perspective again.



Thellier Alain

Posts 141
27 Feb 2018 14:16


A few ideas to your routine:

>You declare 30 variables as type register ... only 68080 has enough registers for your usage.
Yes

 

>I see that your Z Buffer is 32bit int.
>Did you try a 16bit buffer too?
Yes, I tried it too in Wazp3D, but I encountered some problems, mainly because some programs use Z values that are not in the 0.0 to 1.0 range,
so as I wanted Wazp3D to be as compatible as possible I finally chose float.
But I agree the Zbuffer can be enhanced.

>Here the LEA3D instruction will be perfect match
:-)
 
>if(C.UseGouraud)
>The code will run faster when you write 2 routines.
I agree, but this is problematic as there are lots of states that can be enabled: gouraud, zbuffer, blend modes, etc...
So having a function for each combination is impossible.

>Your code uses 32bit texture.
Yes, if the 68080 can quickly extract R8G8B8 from (say) R5G6B5 for solid textures.
But remember there are also textures with alpha values from 0 to 255.

>Your routine uses linear interpolation in texture space?
True for Cow3D, which doesn't have perspective
>What do you think about adding perspective correction on the texture?
Wazp3D computes a quadratic approximation of the texture coordinates to emulate perspective
  u +=du;
  v +=dv;
  du+=ddu;
  dv+=ddv;
It works well for segments that are not too long.
So, depending on the polygon width:
less than 10 pixels => use linear
less than 40 pixels => use quadratic
else cut the polygon vertically into two polygons that use quadratic
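
The two additions per channel shown above amount to forward differencing of a quadratic. The sketch below assumes u(x) = a*x*x + b*x + c with integer coefficients; `quad_init`, `quad_next` and `pick_mode` are hypothetical helpers, the thresholds in `pick_mode` are simply the ones quoted above, and how Wazp3D actually derives its coefficients is not shown here.

```c
#include <stdint.h>

/* Hypothetical state for one interpolated channel. */
struct quad_step {
    int32_t u;    /* current value */
    int32_t du;   /* first difference */
    int32_t ddu;  /* second difference (constant) */
};

/* Set up forward differences for u(x) = a*x*x + b*x + c at x = 0:
   u(0) = c, u(1) - u(0) = a + b, and the second difference is 2a. */
static struct quad_step quad_init(int32_t a, int32_t b, int32_t c)
{
    struct quad_step s = { c, a + b, 2 * a };
    return s;
}

/* Advance one pixel: the same two additions as in the post above. */
static void quad_next(struct quad_step *s)
{
    s->u  += s->du;
    s->du += s->ddu;
}

enum span_mode { SPAN_LINEAR, SPAN_QUADRATIC, SPAN_SPLIT };

/* Width rule as stated in the post. */
static enum span_mode pick_mode(int width)
{
    if (width < 10) return SPAN_LINEAR;
    if (width < 40) return SPAN_QUADRATIC;
    return SPAN_SPLIT;
}
```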

For the 68080 the simplest and most efficient approach is certainly true perspective mapping that uses a div per pixel.

 
 
 


Samuel Devulder

Posts 248
27 Feb 2018 15:03


In the Quake engine, a horizontal line of a polygon is called a span. In theory a division should be done for every pixel of that span, but John Carmack cheated by doing that division only on every 16th pixel and interpolating in between the divisions. The result is a little bit inexact, but the errors are hardly visible and the code runs much faster.
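
The subdivision scheme described above can be sketched in C as follows. Floats are used for readability; the function name `span_u` and its parameters are assumptions for illustration, and this is not the Quake source.

```c
#include <stddef.h>

#define SUBSPAN 16  /* pixels between exact divisions, as described above */

/* Fill out[0..n-1] with texture coordinate u along a span.
   uoz = u/z and ooz = 1/z at the first pixel; d_uoz and d_ooz are their
   per-pixel steps. Both are linear in screen space, u itself is not. */
static void span_u(float *out, size_t n,
                   float uoz, float d_uoz, float ooz, float d_ooz)
{
    float u_left = uoz / ooz;             /* exact u at the span start */
    size_t x = 0;
    while (x < n) {
        size_t len = n - x < SUBSPAN ? n - x : SUBSPAN;
        /* Step to the right end of the sub-span and do the costly
           perspective division only there. */
        uoz += d_uoz * (float)len;
        ooz += d_ooz * (float)len;
        float u_right = uoz / ooz;
        /* Dividing by len becomes a shift for full 16-pixel sub-spans
           in a fixed-point version. */
        float step = (u_right - u_left) / (float)len;
        float u = u_left;
        for (size_t i = 0; i < len; i++) {
            out[x + i] = u;               /* linear inside the sub-span */
            u += step;
        }
        u_left = u_right;
        x += len;
    }
}
```

Values at sub-span boundaries are exact; values in between are the linear approximation, which is where the small visible error comes from.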


Thellier Alain

Posts 141
27 Feb 2018 17:39


Yes, but for the 68080, which does div very fast, adding per-16-pixel linear delta management will certainly be slower, no?


Gunnar von Boehn
(Apollo Team Member)
Posts 6207
27 Feb 2018 19:00


thellier alain wrote:

Yes, but for the 68080, which does div very fast, adding per-16-pixel linear delta management will certainly be slower, no?

First of all, I'm sure your quadratic code will run fine.

Second, you are of course fully correct
that in theory many possible drawing options could be used.
I would assume that of these many options some are used more often. Maybe we could aim to offer a hand-tuned asm routine for the most used options.

What do you think?

Regarding how to write the code, we could discuss using INT or FLOAT or a mix. The APOLLO 68080 can do 2-4 INT instructions per cycle.
And FPU instructions can be done in parallel, with a max speed of 1 FPU instruction per clock in Gold2.7

I think using compressed textures will give a big advantage.
DXT compressed truecolor textures need 4 bits of storage per texel. This means 8 times less space compared to a raw 32bit texture.
Therefore the cache hit rate will improve a lot.

What do you think is the most often used raster variant?



Samuel Crow

Posts 424
28 Feb 2018 12:07


thellier alain wrote:

  >if(C.UseGouraud)
  >The code will run faster when you write 2 routines.
  I agree, but this is problematic as there are lots of states that can be enabled: gouraud, zbuffer, blend modes, etc...
  So having a function for each combination is impossible

That's actually how the Mac did it and with shader emulation besides!  In the early days of the LLVM compiler infrastructure one of their showcase JIT applications was the Mac software shader codes:

The first stage was to compile the shader codes into a VLIW intermediate representation instruction set.  It was done with maximum optimization because the shader program had to be able to run once per pixel if needed.

Then the second stage was to JIT from the intermediate representation to native code using a different subroutine for each combination of effects.  The optimization was turned off completely for this stage so that it would run fast.

If having a separate subroutine for each combination of effects seems unmaintainable, use macros.
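
A minimal sketch of the macro approach in C, assuming just two state bits and one 8-bit channel; `DEFINE_SPAN`, the `ST_*` flags and `draw_table` are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state bits; a real rasterizer has many more. */
#define ST_GOURAUD 1u
#define ST_BLEND   2u

/* One macro body expands into a routine per state combination. The
   conditions are compile-time constants, so a compiler can drop the
   dead branches from each expansion. */
#define DEFINE_SPAN(name, state)                                        \
static void name(uint8_t *dst, const uint8_t *tex, size_t n,           \
                 uint8_t shade, uint8_t alpha)                          \
{                                                                       \
    for (size_t i = 0; i < n; i++) {                                    \
        uint32_t c = tex[i];                                            \
        if ((state) & ST_GOURAUD)                                       \
            c = (c * shade) >> 8;                                       \
        if ((state) & ST_BLEND)                                         \
            c = (c * alpha + dst[i] * (255u - alpha)) >> 8;             \
        dst[i] = (uint8_t)c;                                            \
    }                                                                   \
}

DEFINE_SPAN(draw_plain,         0u)
DEFINE_SPAN(draw_gouraud,       ST_GOURAUD)
DEFINE_SPAN(draw_blend,         ST_BLEND)
DEFINE_SPAN(draw_gouraud_blend, ST_GOURAUD | ST_BLEND)

typedef void (*draw_fn)(uint8_t *, const uint8_t *, size_t, uint8_t, uint8_t);

/* Indexed by the state bits, looked up once per polygon. */
static const draw_fn draw_table[4] = {
    draw_plain, draw_gouraud, draw_blend, draw_gouraud_blend
};
```

The source stays in one place while the number of compiled routines doubles with each state bit, so this only scales to a handful of states.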


Thellier Alain

Posts 141
28 Feb 2018 14:47


>Maybe we could aim to write offer for the most used option a handtuned asm routine.
Certainly the previous routine plus perspective will be enough for most 3D rendering

>the intermediate representation to native code
Yes but this is not "having a separate subroutine" per states combination but "generating the subroutine for any given states combination"
I mean "already done" vs "compiled at run time"

I agree it is the most promising way in terms of speed,
buuut it will need to solve other problems like "do we need to generate a new subroutine, or do the states only represent one that already exists?"

Anyway, for the moment it is highly speculative: having the previous routine for the 68080 will be a first step




Thellier Alain

Posts 141
12 Mar 2018 09:18


Hello

I have begun to study the AMMX doc (about how to do a 3D draw-polygon routine)

If I understand well, it may look like this for each pixel in the segment loop

a0 is Frag
a1 is Delta
a2 is Tex32
a3 is Screen32
d0 is x

load a0@(24),E0    // fixed R fixed G (16.16)
load a0@(32),E1    // fixed B fixed A

movem.l a0@(12),E2  // fixed u fixed v
lea3d E2,E2
movel a2@(E2),E2  // tex

vperm #$0000159d,E0,E1,E1;  // keep r g b a in E1=color
pmula E1,E2,E2    // color=color*tex

movel a6@(0),d3    // tex
vperm #$00003333,E2,E0,E1;  // keep  color aaaa = alpha
clr.l E0
pmula d3,E1,E0    // color=color*tex

movel a0@(40),a3  // get Screen32 ptr for this Y (stored in frag once for all)
load a3@(d0),E2    // get pixel at Screen32[x]

psubb E1,#$FFFFFFFFFFFFFFFF,E1 // alpha=255-alpha
pmula E2,E1,E0

movel E0,a3@(d0)  // store pixel to Screen32[x]

There are certainly some typos, but the idea is here

After that we need to go to next pixel
  Frag->L.z += Delta.L.z;
   
  Frag->L.u += Delta.L.u;
  Frag->L.v += Delta.L.v;
  Frag->L.w += Delta.L.w;
   
  Frag->L.r += Delta.L.r;
  Frag->L.g += Delta.L.g;
  Frag->L.b += Delta.L.b;
  Frag->L.a += Delta.L.a;

Simple, but it still needs 8 instructions (if the deltas are in registers),
so is there a way to do a "vector" add on 32-bit values?
I mean there is PADD for 8 or 16 bit values but not 32 ...
OK, it is less interesting, but it would still halve the instruction count in this particular case
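
AMMX details aside, a 2x32-bit packed add can be emulated in portable C on a 64-bit word, which may help when prototyping the idea. This is only a sketch; `add2x32` and `pack2x32` are hypothetical helpers and not AMMX instructions.

```c
#include <stdint.h>

/* Add two pairs of 32-bit lanes packed in 64-bit words, without letting
   a carry out of the low lane reach the high lane. */
static uint64_t add2x32(uint64_t a, uint64_t b)
{
    const uint64_t HI = 0x8000000080000000ULL;  /* top bit of each lane */
    /* Add the low 31 bits of each lane; carries stay inside the lane. */
    uint64_t sum = (a & ~HI) + (b & ~HI);
    /* Restore the top bit of each lane: a XOR b XOR carry-in. */
    return sum ^ ((a ^ b) & HI);
}

/* Pack two 32-bit values, for example fixed-point u and v, in one word. */
static uint64_t pack2x32(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

With u/v, r/g and so on packed in pairs, the eight per-pixel additions become four calls. In plain C each call costs several operations, so the saving only materialises with a native instruction.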

Alain




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
12 Mar 2018 10:44


Hi Alain,
 
Some ideas:
 
Doing eight 32bit additions is not a problem; the Core can do two 32bit additions in parallel per cycle.

All variables should be loaded from memory into registers BEFORE we start the work loop.

Inside the work loop we will only do fast register accumulations.
The only memory operations in the work loop should be the texture fetch, (the Z-Buffer check if used), and the store to the screen buffer.

All variables like U/V/Z and their deltas should be held in registers.
You have enough regs for this.
You have 16 Pointers plus 32 Data-Regs available.

The ALPHA mul can be done a little simpler.
Maybe you put a pseudo placeholder there and I will fill it in for you?

Maybe you can write your draft code this way?
 
Merci
 


Samuel Devulder

Posts 248
12 Mar 2018 10:57


The asm is hard to follow (at least for me). I guess that in your notation Frag stands for the fractional part, and (u,v) stands for the texture coordinate, right? I guess that z is the distance used in the z-buffer. But what is w then? The alpha channel?

Quick remark: notice that in binary 255-x is just ~x (complement x bit by bit), so there is no need for parallel op to complement all the alpha values. But we can be even more efficient (read below).
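
The identity is easy to check in C. A small sketch with hypothetical helper names; it only covers the 8-bit case and four alphas packed in a 32-bit word.

```c
#include <stdint.h>

/* For an 8-bit alpha, 255 - a and the bitwise complement give the
   same value. */
static uint8_t inv_sub(uint8_t a) { return (uint8_t)(255 - a); }
static uint8_t inv_not(uint8_t a) { return (uint8_t)~a; }

/* The same holds for four alphas packed in one 32-bit word: a single
   complement inverts all four lanes at once. */
static uint32_t inv_not4(uint32_t aaaa) { return ~aaaa; }
```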

As far as I can guess (maybe I'm totally wrong) the asm code is trying to merge a texture value (32 bits) with the current pixel value (32 bits) via the texture's alpha channel. The underlying math is kind of alpha*texture + (1-alpha)*current_pixel[*] done for every RGB component. There is the very simple PIXMRG AMMX2 instruction that does this in a single operation!
___
[*] actually 1-alpha is 255-alpha.
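
Spelled out in C, the blend looks roughly like this. A minimal sketch using the same >>8 scaling as the MUL8 macro in the FillPoly listing; `blend8` and `blend_rgba32` are hypothetical names, the byte layout (R in the top byte, A in the bottom byte) is taken from how C0 is packed in that listing, and nothing here claims to match what PIXMRG computes bit for bit.

```c
#include <stdint.h>

/* Blend one 8-bit channel: alpha*src + (255-alpha)*dst, scaled by >>8. */
static uint8_t blend8(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t sa = alpha;
    uint32_t da = 255u - alpha;
    return (uint8_t)((src * sa + dst * da) >> 8);
}

/* Blend R, G and B of two RGBA32 pixels. The alpha comes from src;
   the result keeps dst's alpha byte. */
static uint32_t blend_rgba32(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src & 0xFFu);
    uint32_t out = dst & 0xFFu;
    for (int shift = 8; shift <= 24; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)blend8(s, d, a) << shift;
    }
    return out;
}
```

Because of the >>8 scaling, alpha 255 does not return src exactly, which is presumably why the listing copies the texel directly in that case and skips the blend when alpha is 0.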




Thellier Alain

Posts 141
12 Mar 2018 13:28


@Samuel
No no, Frag is the union struct that contains all the current values for x y z u v w r g b a as 16.16 fixed point
(Frag for fragment)

Delta is the same struct but with deltas

w is 1/z, used for perspective texture mapping; it is not used in Cow3D but will be needed for a Wazp3D implementation

>PIXMRG
Yes, I forgot this one :-)

@Gunnar
Anyway, if we can pack everything in (say) 20 instructions per pixel, the 8 "add delta value" instructions will remain the biggest part ... so a PADD32 will help
Even more if we process 2 pixels at a time (with 64-bit vector registers), so perhaps in 35 instructions (or less) but still with 2*8 adds :-(

For the moment it is only "brainstorming" ideas ...

Alain




Gunnar von Boehn
(Apollo Team Member)
Posts 6207
13 Mar 2018 21:56


thellier alain wrote:

@Gunnar
Anyway if we can pack all in (say) 20 instructions per pixel

   
There are of course different cases.
And the number of instructions will vary by the case we want to cover.
   
Simple case:
  A0 Texture
  A1 Screen
  A2 U.u
  A3 V.v
  A4 Tmp
  D2 DU.du
  D3 DV.dv
  D7 width of rasterline

Loop:
  lea3D (A2,A3),A4
  adda.l D2,A2
  -
  move.b (A0,A4.l),D0
  adda.l D3,A3
  -
  move.b (d0),(A1)+
  dbra  D7,Loop
 

   
As you see, the simple case is 6 instructions and needs 3 clocks.

How is the timing of this?
LEA3D needs to be done in the cycle before the TEXTURE fetch,
and the write to the screen buffer in the cycle after the texture read;
this means the LOOP is by design 3 clocks.
The U.u  V.v delta updates we can do for free in these cycles in the 2nd pipe.
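
For readers more at home in C, the simple case corresponds roughly to the loop below. This is a sketch under assumptions: 8-bit texels, 16.16 fixed-point U.u and V.v, and a hypothetical function name `span_linear`. It mirrors the data flow only and says nothing about cycle timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Linear texture walk over one raster line, 16.16 fixed point.
   tex: 8-bit texels, tex_w texels per row; dst: screen line. */
static void span_linear(uint8_t *dst, size_t width,
                        const uint8_t *tex, uint32_t tex_w,
                        int32_t u, int32_t v, int32_t du, int32_t dv)
{
    for (size_t x = 0; x < width; x++) {
        /* Address calculation, the step the asm does before the fetch. */
        uint32_t off = (uint32_t)(v >> 16) * tex_w + (uint32_t)(u >> 16);
        dst[x] = tex[off];  /* texture fetch, then store to screen */
        u += du;            /* the delta updates that the asm pairs */
        v += dv;            /* into the second pipe                 */
    }
}
```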

The example was a simple one doing linear interpolation.
As you explained, we get a better result by interpolating the deltas:


Loop:
  lea3D (A2,A3),A4
  adda.l D2,A2
  -
  move.b (A0,A4.l),D0
  adda.l D3,A3
  -
  add.l D4,D2
  add.l D5,D3
  -
  move.b (d0),(A1)+
  dbra  D7,Loop
 

   
Now the code is 4 cycles.
Running at 80 MHz this means up to 20 MegaTexel per second.


Thellier Alain

Posts 141
14 Mar 2018 09:37


Of course 4 cycles is nice,
but your code does only a small part of what is needed to fill a 3D polygon

Also IMHO lea3D could be implemented better ;-)
I mean it seems to do
  lea3D(u,v),offset
and it would be better as
  lea3D(uv,tex),ptr
as both u and v can be loaded/stored/added (if padd32 ever exists) in a single 64-bit vector register

Please consider the padd32 option

>There are of course different cases.
Yes, but you are still reasoning for 2D games that use a simple case. If we want Quake at 640*480 at 20 fps, then most of the drawing will use the WORST case = textured+perspective+modulate+blending+zbuffer, so doing all that in 20 instructions (or, I hope, fewer) will be difficult
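
The numbers in this exchange can be put side by side with a little arithmetic. The sketch below assumes the 80 MHz clock mentioned earlier in the thread and that every screen pixel is drawn exactly once (no overdraw, no time spent on anything but pixel filling); the helper names are hypothetical.

```c
#include <stdint.h>

/* Pixels per second for a full-screen redraw at a given frame rate. */
static uint32_t pixels_per_second(uint32_t w, uint32_t h, uint32_t fps)
{
    return w * h * fps;
}

/* Whole clock cycles available per pixel at a given clock rate. */
static uint32_t cycles_per_pixel(uint32_t clock_hz, uint32_t pixels_per_s)
{
    return clock_hz / pixels_per_s;
}
```

Under those assumptions, 640*480 at 20 fps is 6,144,000 pixels per second, which leaves about 13 cycles per pixel at 80 MHz, while the 4-cycle loop corresponds to 20 million texels per second. Overdraw and the rest of the game would reduce that budget further.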


