



Performance and Benchmark Results!

Can Vampire Do 640x240x32bit ?

Vladimir Repcak

Posts 348
09 May 2020 13:01


Niclas A wrote:

Vladimir Repcak wrote:

 
Niclas A wrote:

  What would normally happen is you allow res equal or bigger and get large black areas if bigger. Remember creating lots of special resolutions back in the day when I got my voodoo 3 for my A1200. This to get full-screen in emulators. Otherwise you would get a stamp size picture in top left corner and lots of black 🙂
 

  Is that how it was ? That sounds pretty weird.
 
  So, in our case of 640x360, the resolution wouldn't fill whole screen ? Kinda like some modern LCDs do (they center the resolution and put black bars around) ?
 
  On an actual CRT monitor back in the day? Wow.
 

 
  If you did not have 640x360 when the screen requester showed you would maybe select 640x480 and get 120 lines of black at the bottom. If you choose 800x600 it would have 160 columns of black on the right and 240 lines of black at the bottom.
 
  Of course, if you choose a 640x360 res (that you created), it would be full screen and look great with no black on the sides.
 
  That's at least how a lot of emus back in the day handled it.
 

Interesting. Thanks for explanation.

I was just playing with the requester code, as I need to abstract away a few resolution constants, and under the emulator two new resolutions popped up (on top of 320x240 and 640x480):

320x256
640x400

Would those also pop on Vampire without any prior config ?

640x400 sounds like it just might push 30 fps (with only minor scene complexity adjustment), yet be significantly sharper.
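
The black-border figures quoted above (120 lines for 640x360 in a 640x480 mode, 160 columns and 240 lines in 800x600) are plain subtraction. A minimal C sketch, assuming the picture is shown unscaled in the top-left corner as described; `Border` and `black_border` are hypothetical names, not part of any Amiga API:

```c
#include <assert.h>

/* Unused (black) area when a picture is shown unscaled in the top-left
   corner of a screen mode that is equal or bigger. */
typedef struct {
    int right_columns;  /* black columns to the right of the picture */
    int bottom_lines;   /* black lines below the picture */
} Border;

static Border black_border(int mode_w, int mode_h, int pic_w, int pic_h)
{
    /* The mode must be at least as large as the picture. */
    assert(mode_w >= pic_w && mode_h >= pic_h);
    Border b = { mode_w - pic_w, mode_h - pic_h };
    return b;
}
```

For example, `black_border(800, 600, 640, 360)` yields 160 columns and 240 lines.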


Niclas A
(Apollo Team Member)
Posts 216
09 May 2020 19:50


Vladimir Repcak wrote:

  Interesting. Thanks for explanation.
 
  I was just playing with the requester code, as I need to abstract away a few resolution constants, and under the emulator two new resolutions popped up (on top of 320x240 and 640x480):
 
  320x256
  640x400
 
  Would those also pop on Vampire without any prior config ?
 
  640x400 sounds like it just might push 30 fps (with only minor scene complexity adjustment), yet be significantly sharper.

Did you do a clean Workbench + Picasso96 install in WinUAE?
If so, then the same would be available on the Vampire by default. (I can't remember what was available by default with Picasso96 anymore.)

But as Gunnar has said a couple of times, it's up to the user what resolutions are available. You configure your own in Picasso96 prefs. You can delete all the default ones and add your own if you like. But you could just throw up an error message saying you need XXXxYYY res to play the game.
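
The "throw up an error message" idea can be sketched in C as follows. This is only an illustration: the `Mode` list is a hypothetical stand-in for whatever the screen-mode database reports on the user's machine, and `has_mode` / `require_mode` are made-up helper names, not Picasso96 calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical description of one available screen mode. */
typedef struct { int width, height; } Mode;

/* Returns 1 if an exact width x height entry exists in the list, else 0. */
static int has_mode(const Mode *modes, size_t count, int width, int height)
{
    assert(modes != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (modes[i].width == width && modes[i].height == height)
            return 1;
    }
    return 0;
}

/* Returns 0 if the mode exists. Otherwise writes a user-facing message
   into msg and returns -1. */
static int require_mode(const Mode *modes, size_t count, int width, int height,
                        char *msg, size_t msg_size)
{
    if (has_mode(modes, count, width, height)) {
        if (msg_size > 0)
            msg[0] = '\0';
        return 0;
    }
    snprintf(msg, msg_size, "This game needs a %dx%d screen mode.", width, height);
    return -1;
}
```

With the four modes mentioned in this thread (320x240, 640x480, 320x256, 640x400), asking for 640x360 would produce the message, while 640x400 would pass.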




Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 07:35


Vladimir Repcak wrote:

    And, this is 24-bit pixel fill loop (with 4 ops per pixel):
   

    d0: Length of scanline - 1  (dbra count)
    d7: R of scanline color
    d6: G of scanline color
    d5: B of scanline color
   
      .dsLoop:
        move.b d7, (a0)+
        move.b d6, (a0)+
        move.b d5, (a0)+
      dbra d0,.dsLoop
   

   

   
Do you not interpolate the color?
I thought you need 24bit color for smooth interpolations?

 
If my understanding is right and your game interpolates the colors
then we need to benchmark the code including the interpolation.
 
Regarding the 24bit, MOVE.W + MOVE.B is less instructions than 3xMOVE.B
 
But I still wonder if the 2 layer approach would not be great for your game?
I have no clear idea how your game on the gaming layer will in the end look like. And how many colors you need for it. Can you share some more photos to help imagine it?

For sake of argument let us quickly look in benefit of 2 layer approach.

I could imagine a back layer with 256 colors,
or with YUV, which costs 16 bits per pixel but has DVD quality (like 24-bit).

If the game layer used 256 colors, you would have a 24-bit quality palette for flat shading.
256-color CLUT mode would allow you to CLR SCREEN very fast,
and for drawing a LINE you could use AMMX.

Let's look at example code:

Simple 32-bit row draw


LOOP
  MOVE.L D0,(A0)+          ; write one 32-bit pixel
  dbra  D7,LOOP            ; D7 = pixel count - 1

Drawing 8 pixels per iteration


LOOP
  STORECOUNT D0,D7,(A0)+   ; store up to 8 bytes from D0, byte count in D7
  subq.l #8,D7             ; 8 fewer bytes left to write
  bhi    LOOP              ; repeat while bytes remain

The trick of STORECOUNT is that it will COPY 0 to 8 bytes in 1 instruction.
If you flatshade and D0 holds the same value 8 times, then this will be a very fast flatshade.
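
To make the loop's behaviour concrete, here is a portable C model of it. It assumes, going only by the description above, that the counted store writes between 0 and 8 bytes and clamps larger counts to 8; `store_count` and `fill_span_8bpp` are hypothetical names, and the real AMMX instruction may differ in details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of a counted store: writes min(count, 8) bytes of the pattern,
   nothing when count <= 0. Returns the number of bytes written. */
static size_t store_count(uint8_t *dst, const uint8_t pattern[8], long count)
{
    size_t n = count <= 0 ? 0 : (count > 8 ? 8 : (size_t)count);
    memcpy(dst, pattern, n);
    return n;
}

/* Flat-shade 'length' 8-bit pixels, up to 8 pixels per iteration. */
static void fill_span_8bpp(uint8_t *dst, uint8_t color, long length)
{
    assert(dst != NULL);
    uint8_t pattern[8];
    memset(pattern, color, sizeof pattern);  /* same value 8 times */
    while (length > 0) {                     /* mirrors subq #8 / bhi */
        dst += store_count(dst, pattern, length);
        length -= 8;
    }
}
```

The final iteration writes only the leftover bytes, so no separate tail loop is needed for spans whose length is not a multiple of 8.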


Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 07:46


Vladimir Repcak wrote:

Might as well straight lock it to 20.
 
I just implemented a variable FPS lock - I can choose 60,30 or 20 fps.

A common programming technique is NOT to have a lock at all,
but to have the game run more "freely",
as this will generally increase your framerate and make the game smoother.

What do you think about this?

The common way of coding this is to run the movement math in the VBL IRQ, e.g. at 50 Hz, and in parallel let the game render using those coordinates, totally unsynced from the screen, using triple buffering.
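
The buffer rotation behind triple buffering can be sketched in C like this. It is a minimal single-threaded model with hypothetical names (`TripleBuffer`, `tb_frame_done`, `tb_vblank`); on real hardware `tb_vblank` would run from the VBL interrupt and the swaps would need to be made safe against that interrupt, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int displayed;  /* buffer currently shown on screen */
    int ready;      /* last completed frame, waiting for the next VBL */
    int drawing;    /* buffer the renderer is writing into */
    int fresh;      /* 1 if 'ready' holds a frame that was not shown yet */
} TripleBuffer;

static void tb_init(TripleBuffer *tb)
{
    assert(tb != NULL);
    tb->displayed = 0;
    tb->ready = 1;
    tb->drawing = 2;
    tb->fresh = 0;
}

/* Renderer finished a frame: publish it and keep drawing without waiting. */
static void tb_frame_done(TripleBuffer *tb)
{
    int t = tb->ready;
    tb->ready = tb->drawing;
    tb->drawing = t;
    tb->fresh = 1;
}

/* Once per vertical blank: show the newest finished frame, if there is one. */
static void tb_vblank(TripleBuffer *tb)
{
    if (!tb->fresh)
        return;
    int t = tb->displayed;
    tb->displayed = tb->ready;
    tb->ready = t;
    tb->fresh = 0;
}
```

The renderer never blocks on the display: if it finishes two frames between vertical blanks, the older one is simply overwritten and only the newest is shown.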




Vladimir Repcak

Posts 348
10 May 2020 09:15


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  Might as well straight lock it to 20.
 
  I just implemented a variable FPS lock - I can choose 60,30 or 20 fps.
 

 
  A common programming technique is NOT to have a lock at all,
  but to have the game run more "freely",
  as this will generally increase your framerate and make the game smoother.
 
  What do you think about this?
 
  The common way of coding this is to run the movement math in the VBL IRQ, e.g. at 50 Hz, and in parallel let the game render using those coordinates, totally unsynced from the screen, using triple buffering.
 
 

It's significantly more disturbing during gameplay to have an occasional framedrop from 60 to 30 fps, than simply having rock-solid 30 fps (same thing with 30->20 vs 20 lock) that never drops a frame.

Especially at higher speeds. The drop from 60 to 30 feels like you just hit a concrete wall for a bit.

As much as consoles screwed up gaming, they did, however, introduce a new and better concept to computer (regardless of platform) gaming - locked framerate, which wasn't anywhere near as popular before...

Even people who aren't so sensitive to framedrops (there's many of those) will still enjoy locked framerate, because it's not disturbing in any way.
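
The 60 / 30 / 20 steps follow from frames being held for a whole number of display refreshes. A small C sketch of that arithmetic, assuming a 60 Hz display and buffer swaps that wait for vertical blank (both assumptions; the function names are hypothetical):

```c
#include <assert.h>

/* Number of whole refresh intervals a frame occupies when the swap waits
   for vertical blank. Times are in microseconds. */
static int intervals_used(long render_us, long refresh_us)
{
    assert(render_us > 0 && refresh_us > 0);
    return (int)((render_us + refresh_us - 1) / refresh_us);  /* ceiling */
}

/* Frame rate obtained by holding every frame for 'divider' refreshes. */
static int locked_fps(int refresh_hz, int divider)
{
    assert(refresh_hz > 0 && divider > 0);
    return refresh_hz / divider;
}
```

With a 16667 us refresh interval, a frame that takes 17000 us already occupies two intervals, which is the sudden 60 to 30 drop described above; locking the divider at 2 keeps every frame at 30 fps as long as rendering stays under 33 ms.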


Vladimir Repcak

Posts 348
10 May 2020 09:20


Gunnar von Boehn wrote:

  Do you not interpolate the color?
  I thought you need 24bit color for smooth interpolations?

I have some interpolation per scanline, not within a scanline (it's a per-scanline cost, not per-pixel cost).

I need 24-bit for background, not foreground.

Gunnar von Boehn wrote:

  then we need to benchmark the code including the interpolation

It already is. It's included in the per-scanline traversal cost.


Vladimir Repcak

Posts 348
10 May 2020 09:26


Gunnar von Boehn wrote:

  Regarding the 24bit, MOVE.W + MOVE.B is less instructions than 3xMOVE.B

Yeah, but move.w would have to be word-aligned, otherwise we get a trap, no ?
 
And not only first pixel of scanline, but also last and all in between.

Plus, since there's 3 bytes, every other pixel would have to be readjusted - e.g. first pixel: {move.w + move.b}, second pixel: {move.b + move.w}, etc.

Now, if 68080 could ditch the memory alignment rules of 68000 processors, and we could write 16-bit word at an odd address without any performance penalty, then this could bring some speed-up.

I thought I read somewhere that this (non-aligned access) is a potential feature in future ?
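
The "every other pixel" point can be checked with a few lines of C: in a packed 24-bit scanline, pixel n starts at byte offset 3*n, so the start address alternates between even and odd. A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of pixel n in a packed 24-bit (3 bytes per pixel) scanline. */
static uint32_t pixel_offset_24bpp(uint32_t n)
{
    assert(n <= UINT32_MAX / 3u);  /* guard against overflow */
    return 3u * n;
}

/* 1 if a 16-bit store at the start of pixel n would be word aligned,
   given the byte address of pixel 0. */
static int starts_word_aligned(uint32_t base_addr, uint32_t n)
{
    return ((base_addr + pixel_offset_24bpp(n)) & 1u) == 0;
}
```

With an even base address, pixels 0, 2, 4, ... start on even addresses and pixels 1, 3, 5, ... on odd ones, which is why a CPU that traps on misaligned word access would need two different store sequences.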



Vladimir Repcak

Posts 348
10 May 2020 09:48


Gunnar von Boehn wrote:

 
  But I still wonder if the 2 layer approach would not be great for your game?
 
 
  For sake of argument let us quickly look in benefit of 2 layer approach.
 
  I could imagine a back layer with 256 colors,
  or with YUV, which costs 16 bits per pixel but has DVD quality (like 24-bit).

Yeah, I used the two-layer approach  on Jaguar - had a background layer that didn't consume any CPU cycles (drawn via ObjectProcessor).
And a foreground 3D layer that was 256 colors (out of 65,536).
Which is why I wanted exact same approach on Vampire as I was deeply familiar with all dependencies and features of this approach.

But two months ago, when I inquired about that approach I was told it's not possible on Vampire (would have to go back to that thread to read more details as to why, don't recall now, besides it doesn't matter, as I don't have time to go reimplementing such core features now).

So, you don't have to convince me about advantages of such approach. But since I already implemented the background copying and filling of framebuffer the alternative way, I can't afford risking another 1,2,3 (or perhaps 4) weeks in implementing, testing and troubleshooting the layered approach now.

The time for that approach has already passed.



Vladimir Repcak

Posts 348
10 May 2020 09:51


Gunnar von Boehn wrote:

  Drawing 8 pixel per iteration
 

  LOOP
    STORECOUNT D0,D7,(A0)+
    subq.l #8,D7
    bhi    LOOP
 

 
  The trick of STORECOUNT is that it will COPY 0 to 8 bytes in 1 instruction.
  If you flatshade and D0 holds the same value 8 times, then this will be a very fast flatshade.

Thanks. This is good. As soon as I get to troubleshooting networking and will have a way to deploy builds to my V4, I will try to give this 8-px grouping a try.

Right now, remote testing is working for me quite well, though.


Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 10:46


Vladimir Repcak wrote:

I need 24-bit for background, not foreground.

What do you think about YUV for background?


Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 10:50


Vladimir Repcak wrote:

 
Gunnar von Boehn wrote:

    Regarding the 24bit, MOVE.W + MOVE.B is less instructions than 3xMOVE.B

  Yeah, but move.w would have to be word-aligned, otherwise we get a trap, no ?
 

  68000 will trap on misaligned READ/WRITE
  68080 supports misaligned READ and WRITE in Hardware and will never trap.
 
 
 
Vladimir Repcak wrote:

  I thought I read somewhere that this (non-aligned access) is a potential feature in future ?
 

  Misaligned READ is free.
  Misaligned WRITE can be free too, depends on layout.
 


Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 10:52


With AMMX you can always do a
 
STORECOUNT Data,3,(a0)+
 
This means that with 1 instruction you can store 3 bytes.
The same way you could use "STORECOUNT 6" to save 2 pixels in 1 instruction.

You could also use "MOVE.L + MOVE.W" to store 2 pixels.


Vladimir Repcak

Posts 348
10 May 2020 17:38


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  I need 24-bit for background, not foreground.
 

  What do you think about YUV for background?

Hard to say without having done experiments with it. I'm sure there will be lots of scenarios where the difference to 24bit would be negligible. Without making direct comparison in my test scene, it's impossible.

I presume if we used this, there would be no need to copy every single pixel of the background, right ? Meaning, the HW would simply fetch, pixel by pixel each frame ?

That would give us 60% of frame time that is now killed by copying 24bit bitmap. And since flatshaded foreground could do well at 8 bits, we'd be back at 30 fps even at 640x480.

Is it something that already works right now, or do you expect it to be tested and fully enabled in a core in a few months ?
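
For a sense of scale, the amount of data touched per frame at each pixel depth is easy to compute. The C sketch below only does the size arithmetic (hypothetical helper names) and says nothing about actual Vampire memory speed:

```c
#include <assert.h>

/* Bytes in one full frame at the given size and depth. */
static unsigned long frame_bytes(unsigned width, unsigned height,
                                 unsigned bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);  /* whole bytes per pixel only */
    return (unsigned long)width * height * (bits_per_pixel / 8);
}

/* Bytes written per second if the whole frame is rewritten every frame. */
static unsigned long bytes_per_second(unsigned long frame, unsigned fps)
{
    return frame * fps;
}
```

At 640x480 a 24-bit frame is 921,600 bytes and an 8-bit frame is 307,200 bytes, so dropping the per-frame background copy and filling an 8-bit foreground removes most of the bytes the CPU has to move.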


Vladimir Repcak

Posts 348
10 May 2020 17:50


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

 
Gunnar von Boehn wrote:

    Regarding the 24bit, MOVE.W + MOVE.B is less instructions than 3xMOVE.B

    Yeah, but move.w would have to be word-aligned, otherwise we get a trap, no ?
   

  68000 will trap on misaligned READ/WRITE
  68080 supports misaligned READ and WRITE in Hardware and will never trap.
 
   
 
Vladimir Repcak wrote:

  I thought I read somewhere that this (non-aligned access) is a potential feature in future ?
 

  Misaligned READ is free.
  Misaligned WRITE can be free too, depends on layout.
 

So this already works ? Now, that's cool. I wouldn't have to write 3 pages of code to handle dozen alignment scenarios. Just move.w and move.b and dbra.

So, I just ran the numbers in excel and it appears to be identical to 32-bit:

CopyBitmap dropped from 59.6% to 44.7%
PixelWrite rose from 32.6% to 48.8%

So, those two stages are then 92.1% (32bit) vs 93.5% (24 bit)

Of course, the 24-bit number is just interpolation. I'm guessing that with 1 additional op per pixel, the total cost of that stage would rise by roughly 50% (3 ops / pixel vs 2 ops / pixel).

But, it would have to be written and benchmarked, to be really sure (bubbles and all).

An empty 3d scene would surely be faster with this, it just wouldn't be so much fun :)

Certainly, without having to do any benchmark, because there would be more work per pixel (3 ops instead of 2), there would be a threshold of number of scanline pixels, at which point any gains from 24bit would be negated by more per-pixel work.

That much can be inferred easily without any benchmark :)
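
The 24-bit estimates above look like simple scalings of the measured 32-bit shares: the copy stage by 3/4 (3 bytes per pixel instead of 4) and the pixel write stage by 3/2 (3 ops per pixel instead of 2). The 3/4 factor is an inference from the numbers, not something stated in the post. A small C check, using a hypothetical helper that works in tenths of a percent:

```c
#include <assert.h>

/* Scale a share given in tenths of a percent by num/den,
   rounding to the nearest tenth. */
static int scale_tenths(int tenths, int num, int den)
{
    assert(tenths >= 0 && num >= 0 && den > 0);
    return (tenths * num + den / 2) / den;
}
```

`scale_tenths(596, 3, 4)` gives 447, i.e. the 44.7% quoted for CopyBitmap. `scale_tenths(326, 3, 2)` gives 489, i.e. 48.9% rather than the quoted 48.8%, presumably because the post started from unrounded inputs.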


Vladimir Repcak

Posts 348
10 May 2020 17:58


BTW, I'm not sure how many people realize we're dealing with 32-bit 3D rendering on an 85 MHz CPU without external gfx accelerator. How cool is that :)

Something like that needed 3dfx on PC. And I don't even recall if the first voodoo chipsets even did 32 bit.

They did 16-bit, that much I remember, but I think you needed voodoo2/3 for 32-bit, right ? Probably could be looked up...

So, this "problem" of 32-bit rendering performance is pretty nice to have ;)



Gunnar von Boehn
(Apollo Team Member)
Posts 4839
10 May 2020 17:58


Vladimir Repcak wrote:

Certainly, without having to do any benchmark, because there would be more work per pixel (3 ops instead of 2),

   
When you use "STORECOUNT", it's only 1 instruction to write 3 bytes. But STORECOUNT could also write 6 bytes == 2 pixels.
If you write 2 pixels per loop, then of course it's even faster.
You could write 2 pixels per iteration, at 1 clock per 2 pixels?
This would then be only 0.5 cycles per pixel.
   
 

  LOOP2
      STORECOUNT Color,6,(a0)+
      dbra  d7,LOOP2
   
      if ODD write 1 more pixel
 

But the same optimization you can do on the 32bit pixel too.
You could STORE a 64bit value in 1 cycle.
This would halve the number of Loop iterations you need.
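
The two-pixels-per-iteration idea, including the "if ODD write 1 more pixel" step, can be modelled in portable C as below. Each 6-byte `memcpy` stands in for one counted store; the R, G, B byte order follows the `move.b` loop quoted earlier in the thread, and `fill_span_24bpp` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill 'pixels' packed 24-bit pixels with one colour: two pixels (6 bytes)
   per iteration, then one trailing pixel if the count is odd. */
static void fill_span_24bpp(uint8_t *dst, uint8_t r, uint8_t g, uint8_t b,
                            size_t pixels)
{
    assert(dst != NULL || pixels == 0);
    const uint8_t pair[6] = { r, g, b, r, g, b };
    for (size_t i = 0; i < pixels / 2; i++) {
        memcpy(dst, pair, 6);  /* stands in for one 6-byte counted store */
        dst += 6;
    }
    if (pixels & 1u)
        memcpy(dst, pair, 3);  /* odd count: write 1 more pixel */
}
```

The loop runs half as many iterations as a one-pixel-per-iteration version, which is the saving being discussed; whether it reaches 0.5 cycles per pixel on real hardware would still have to be measured.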


Vladimir Repcak

Posts 348
11 May 2020 05:22


Gunnar von Boehn wrote:

Vladimir Repcak wrote:

  Certainly, without having to do any benchmark, because there would be more work per pixel (3 ops instead of 2),

   
  When you use "STORECOUNT", it's only 1 instruction to write 3 bytes. But STORECOUNT could also write 6 bytes == 2 pixels.
  If you write 2 pixels per loop, then of course it's even faster.
  You could write 2 pixels per iteration, at 1 clock per 2 pixels?
  This would then be only 0.5 cycles per pixel.
   
   

    LOOP2
      STORECOUNT Color,6,(a0)+
      dbra  d7,LOOP2
   
      if ODD write 1 more pixel
   

 
 
  But the same optimization you can do on the 32bit pixel too.
  You could STORE a 64bit value in 1 cycle.
  This would halve the number of Loop iterations you need.

Halving the iteration count would definitely help.

I'm currently consuming between 33-50% of frame time in 640x480 on Pixel Fill.

If that number could be halved, then it's definitely worth the implementation cost.

It wouldn't push 640x480 into 30 fps territory, but it would help it stay at 20 fps without dipping into 15 fps.

One last benchmark I need to run is the full range of CPU spikes. I am hoping 50% of frame time (as a buffer zone) is enough, but I need data over a thousand frames for this.
I did notice, in the emulator, the differences are around 50% of frame time. And while the exact cycle costs are not comparable, the percentage of frame time, so far, has been mostly consistent with real HW.
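
Collecting spike data over a thousand frames only needs a tiny accumulator. A minimal C sketch, assuming the frame time is measured in microseconds by some timer the game already has (the timer itself is not shown, and `FrameStats` / `stats_add` are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long frames;       /* frames recorded so far */
    unsigned long over_budget;  /* frames that exceeded the budget */
    unsigned long worst_us;     /* longest frame seen, in microseconds */
} FrameStats;

/* Record one frame time against a per-frame budget. */
static void stats_add(FrameStats *s, unsigned long frame_us,
                      unsigned long budget_us)
{
    assert(s != NULL);
    s->frames++;
    if (frame_us > budget_us)
        s->over_budget++;
    if (frame_us > s->worst_us)
        s->worst_us = frame_us;
}
```

For a 20 fps lock the budget would be 50,000 us per frame; if `over_budget` is still zero after a thousand frames of a busy scene, the buffer zone is holding.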


Gunnar von Boehn
(Apollo Team Member)
Posts 4839
11 May 2020 06:44


Vladimir Repcak wrote:

I did notice, in the emulator, the differences are around 50% of frame time. And while the exact cycle costs are not comparable, the percentage of frame time, so far, has been mostly consistent with real HW.
 

In my experience benchmarks on emulation are not very meaningful for various reasons.

You have to keep in mind that some instructions are by design very slow in emulation, while some are fast. The difference might be a thousand percent, with no relation to any difference on real hardware.
In other words: two instructions might both take 1 cycle in real hardware, but in emulation one of them can be 10 times slower than the other.
Also keep in mind that a JIT compiler will first translate blocks of code, which is several hundred times slower than a later loop iteration. This is a huge difference from how real hardware behaves.
Emulation will not show you the performance gains of alignment or of FPU pipelining.

In short if you rely on emulation benchmarking you can not expect to get a good result.
 


Vladimir Repcak

Posts 348
11 May 2020 06:56


It could be a coincidence but the percentages map pretty well from real HW into emulator.

I don't intend to look more into it, but it's impossible not to make that observation while directly working with those numbers :)

It's probably just the actual speed of my particular CPU, though.



Markus B

Posts 195
11 May 2020 12:25


Now I'm really looking forward to some tech demo ...
