Performance and Benchmark Results!
68k Coding Challenge
| | Samuel Devulder
Posts 248 11 Feb 2018 11:59
| Yes, that's the "dark blue trick" that Gunnar explained earlier in the thread. E.g. replace the black palette(0) with the darkest blue (hence, not 0). The visual result is the same (dark blue looks like black), but the speed gain is worth the trick! About AMMX: there is the STOREM (store mask) instruction, which can conditionally write up to 8 bytes to memory. The condition is specified by a mask of the form $abcdefgh, where each of a, b, c, ... is 0 (no write) or 1 (write). Such a mask can be computed directly, without tests or branches, from the formula I gave earlier:
((((x & 0x77777777)+0x77777777)|x)>>3)&0x11111111 (x containing the 8 consecutive colors, that is the myword variable IIRC.)
    move.l (a0),d0        ; read 8 color indexes
    move.l d0,d1
    and.l  #$77777777,d1  ; build mask
    add.l  #$77777777,d1
    or.l   d0,d1
    lsr.l  #3,d1          ; bit 3 of each nibble -> bit 0
    and.l  #$11111111,d1  ; d1 = mask telling which color is 0 or not
Now with this mask we need to get the pixels quickly into an AMMX reg from d0. Something that does (possibly in parallel) extract a 4-bit index from d0, get the 16-bit palette value for this index and put it in the proper word of the AMMX reg.
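For readers who want to check the mask trick on a host machine, here is a minimal C sketch of the same computation. The function name `nonzero_nibble_mask` is made up for illustration. Note that this sketch shifts by 3, since that is what brings bit 3 of each nibble (the "non-zero" flag after the add and or) down to bit 0 of the same nibble.

```c
#include <stdint.h>

/* Hypothetical helper: x is assumed to hold eight 4-bit colour indexes.
   Returns a word whose nibble k is 1 when nibble k of x is non-zero,
   and 0 otherwise. */
uint32_t nonzero_nibble_mask(uint32_t x)
{
    /* (n & 7) + 7 sets bit 3 of a nibble when its low three bits are
       non-zero; the per-nibble maximum is 14, so nothing carries over
       into the neighbouring nibble. */
    uint32_t t = (x & 0x77777777u) + 0x77777777u;

    /* OR in the original so that an index with bit 3 already set
       (8..15) is flagged as well. */
    t |= x;

    /* Move bit 3 of every nibble to bit 0 and drop the other bits. */
    return (t >> 3) & 0x11111111u;
}
```

For example, the input 0x10203040 has non-zero indexes in every other nibble, so the result has a 1 in exactly those nibbles.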
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6263 12 Feb 2018 09:14
| This is nearly cheating. But here is the AMMX version of the sprite copy.
Loop:
    move.w (a2)+,D0             ; Loading the 16bit of 4bit data
    move.w (a2)+,D1             ; Apollo can do 2 joined reads per cycle
    move.w (a2)+,D2
    move.w (a2)+,D3
    TRANSLOi D0,E0:E1           ; Translate 4 pixel to 16bit values,
    TRANSLOi D1,E2:E3           ; does translate colors and create write MASK
    TRANSLOi D2,E4:E5
    TRANSLOi D3,E6:E7
    vstorem2 E1:E0,(A0)         ; Store with MASK (byte enables)
    vstorem2 E3:E2,$8(A0)       ; Writes 4 pixel per cycle
    vstorem2 E5:E4,$10(A0)
    add.w #modulo,A0
    vstorem2 E7:E6,$18-modulo(A0)
    dbra D7,Loop
As you see, the code is branch free. The code does not need to read back from the screen. The code is bubble free. The code does not need to do memory reads for color conversion. The loop now runs more than an order of magnitude faster than the C version. The above loop renders 16 pixels of sprite in 10 cycles. This is about 0.6 cycles per pixel (10/16 = 0.625). With unrolling we can get it down to 0.5 cycles. Using ASM AMMX over the C loop resulted in a ~20 times speedup.
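As a point of comparison, a plain scalar version of such a masked sprite copy might look like the C sketch below. This is not the C loop that was benchmarked in the thread, only an illustration of the work per pixel. The function name, the nibble order (leftmost pixel in the highest nibble) and the convention that index 0 is transparent are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference. src holds 4-bit colour indexes, four
   per 16-bit word (assumed: leftmost pixel in the highest nibble).
   palette maps an index to a 16-bit hicolor value. Index 0 is treated
   as transparent, so the destination pixel is left untouched. */
void blit_sprite_row(const uint16_t *src, size_t words,
                     const uint16_t palette[16], uint16_t *dst)
{
    for (size_t w = 0; w < words; ++w) {
        uint16_t packed = src[w];
        for (int p = 0; p < 4; ++p) {
            unsigned idx = (packed >> (12 - 4 * p)) & 0xFu;
            if (idx != 0) {
                /* One test, one table read and one write per pixel:
                   the per-pixel work that a masked store avoids. */
                dst[4 * w + p] = palette[idx];
            }
        }
    }
}
```

Calling it with `words = 4` covers the same 16 pixels as one iteration of the loop above.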
| |
| | Samuel Devulder
Posts 248 12 Feb 2018 11:03
| There is a lot of magic in that piece of asm code :) I doubt anybody other than you could have produced such marvelous code :) Btw, the AMMX doc should be updated: there is no trace of TRANSLOi and VSTOREM2 in the linked documentation. There is, however, documentation for STOREM and TRANSHI, so one can presume these are related. The magic part is the use of the transpose matrix operation on the E0-E7 registers. I presume E0-E7 contain the palette data, right? Apart from the fact that eight 4-byte regs make 32 bytes in total (the same size as the whole palette), this is very obscure. Could you give more details about how and why this works for us poor mortals, please? :)
| |
| | Thellier Alain
Posts 143 12 Feb 2018 13:30
| >this is very obscure
It seems that this code "picks" words in the matrices based on a list stored in a register. But the code is missing the part where the palette is stored in the registers (and which registers are used). BTW, could you also explain
    move.w (a2)+,D0
    move.w (a2)+,D1
    move.w (a2)+,D2
    move.w (a2)+,D3
I was thinking that modifying the address a2 between instructions would have an impact on parallelisation/pipelining, no? I always thought that
    move.w (a2),D0
    move.w 2(a2),D1
    move.w 4(a2),D2
    move.w 6(a2),D3
    do some stuff
    addq #8,a2
would be faster on modern processors, no? Also, what is the "normal" usage for TRANSHI / TRANSLO? I mean, matrices that contain 16-bit words can't be used for 3D, no? Thanks
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6263 12 Feb 2018 13:51
| thellier alain wrote:
| BTW, could you also explain
|     move.w (a2)+,D0
|     move.w (a2)+,D1
|     move.w (a2)+,D2
|     move.w (a2)+,D3
| I was thinking that modifying the address a2 between instructions would have an impact on parallelisation/pipelining, no?
|
APOLLO can "combine" 2 sequential memory accesses into 1 access. This means e.g. that 2 WORD reads or writes will be combined into one LONG access, so the ++ will increment by 4. In the same way, 2 LONG accesses can be combined into 1 QUAD access, which in/decrements by 8. No other CPU can do this.
thellier alain wrote:
| I always thought that
|     move.w (a2),D0
|     move.w 2(a2),D1
|     move.w 4(a2),D2
|     move.w 6(a2),D3
|     do some stuff
|     addq #8,a2
| would be faster on modern processors, no?
|
APOLLO has a special feature here that no other CPU has. :-)
thellier alain wrote:
| Also, what is the "normal" usage for TRANSHI / TRANSLO? I mean, matrices that contain 16-bit words can't be used for 3D, no?
|
Working with 16-bit integers, and with 4x4 or 8x8 matrices of them, is a very common case in JPEG / MPEG / XVID / H264 decoding. The 64-bit registers of APOLLO allow working on 4 values in parallel. Working "horizontally" on several values in parallel is the normal operation. The TRANSPOSE instruction is a specialty of AMMX and allows working "vertically" over 4 registers at the same time. This is very useful for VIDEO and GFX operations.
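To make the "horizontal" versus "vertical" distinction concrete, here is a small C sketch of a 4x4 transpose of 16-bit words, where each row stands in for the four words held in one 64-bit register. It only illustrates the general operation; it is not a description of the exact TRANSHI / TRANSLO semantics, and the function name is made up.

```c
#include <stdint.h>

/* Hypothetical illustration: each row m[r] plays the role of one
   64-bit register holding four 16-bit words. After the transpose,
   row r holds what used to be column r, so values that were spread
   "vertically" over four registers sit side by side in one. */
void transpose4x4_u16(uint16_t m[4][4])
{
    for (int r = 0; r < 4; ++r) {
        for (int c = r + 1; c < 4; ++c) {
            uint16_t tmp = m[r][c];
            m[r][c] = m[c][r];
            m[c][r] = tmp;
        }
    }
}
```

In a video decoder this is typically what lets the same 4-wide row operation be reused for the column pass of a 4x4 block.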
| |
| | Nixus Minimax
Posts 416 12 Feb 2018 14:29
| Gunnar von Boehn wrote:
| TRANSLOi D0,E0:E1 ; Translate 4 pixel to 16bit values, ; does translate colors and create write MASK
|
OK, I can understand that the TRANSLOi instruction can create the mask in E1 based on the value of the palette index in D0 and use that for a scattered write operation, but how is it possible to translate the values 1 through 15 of each pixel to hicolor? Where is the palette stored?
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6263 12 Feb 2018 14:42
| thellier alain wrote:
| I mean, matrices that contain 16-bit words can't be used for 3D, no?
|
I would say the common usages for TRANSPOSE are GFX and VIDEO decoding. That we use it here for palette conversion is more of a nice side effect. For 3D we have other sharp tools in our box. One pretty useful one for 3D coders might be LEA3D. Let's assume that you use 16bit.16bit fixed point for U-V coordinates, which is very common in many game engines. Then LEA3D would provide you with the following operation:
Addr := (16bit-Vhigh * modulo) + 16bit-Uhigh;
So it allows you to calculate the texture coordinate in a single cycle, and the result can be used "bubble free" in the next memory load to read a texture value. This instruction allows textures of arbitrary width; you are not limited to power-of-two widths. And its result can be used in the next cycle (bubble free), which you could never get on the 68060. Therefore SW texture-mapping performance is MUCH higher than on the 68060.
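Written out in C, the address formula above amounts to roughly the following. This is only a sketch of the arithmetic as stated in the post, not a definition of the LEA3D instruction; the function name and the choice of unsigned 32-bit 16.16 values are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: u_fixed and v_fixed are 16.16 fixed-point
   texture coordinates, modulo is the texture width in texels.
   Returns the texel offset (v_high * modulo) + u_high. */
uint32_t texel_offset(uint32_t u_fixed, uint32_t v_fixed, uint32_t modulo)
{
    uint32_t u_high = u_fixed >> 16;  /* integer part of U */
    uint32_t v_high = v_fixed >> 16;  /* integer part of V */
    return v_high * modulo + u_high;
}
```

Because the row stride is a plain multiply rather than a shift, a width such as 100 or 320 works just as well as 256.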
| |