APOLLO CPU Knowledge Forum

Performance and Benchmark Results!

Why Does UAE Cheat In Benchmarks?


Samuel Devulder
Posts 248
12 Jul 2019 08:11

Anyhow, MIPS really stands for Meaningless Indicator of Processor Speed. This thread is all about this. It is time to use a better indicator.

Gunnar von Boehn
(Apollo Team Member)
Posts 6254
12 Jul 2019 08:27

Samuel Devulder wrote:

It is time to use a better indicator.

The wish to measure CPU/computer speed is very understandable.
A CPU can execute many different instructions.
Some instructions are fast, like "ADD".
Some instructions are slow, like "DIV".
Some instructions run at different speeds depending on the circumstances.

This makes "measuring" the speed of a CPU not so easy.
Every program has a different mix of instructions,
and every different mix of instructions takes a different time to process.
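As a worked illustration of this point (the standard textbook relation, not anything SYSINFO documents), let w_i be the fraction of executed instructions in class i, CPI_i the average cycles per instruction of that class, N the number of instructions executed and f_clk the clock frequency:

```latex
\overline{\mathrm{CPI}} = \sum_i w_i\,\mathrm{CPI}_i \quad \Bigl(\sum_i w_i = 1\Bigr), \qquad
T = \frac{N\,\overline{\mathrm{CPI}}}{f_{\mathrm{clk}}}, \qquad
\mathrm{MIPS} = \frac{N}{T \cdot 10^{6}} = \frac{f_{\mathrm{clk}}}{\overline{\mathrm{CPI}} \cdot 10^{6}}
```

With made-up costs of 1 cycle for ADD and 20 cycles for DIV on a 100 MHz CPU, a mix of 99% ADD and 1% DIV gives an average CPI of 0.99 + 0.20 = 1.19, or about 84 MIPS, while 90% ADD and 10% DIV gives 0.9 + 2.0 = 2.9, or about 34 MIPS: the same CPU, a very different "speed".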

SYSINFO tries to measure the CPU speed and tries to find a compromise here.

SYSINFO uses a mixture of instructions.
This mix covers subroutine calls, mathematical operations, logical operations and so on.
The idea of using such a mix is very reasonable.
However, SYSINFO does not test "IF/THEN/ELSE" code, i.e. conditional branches.
If SYSINFO also included such instructions in its mix, I would actually call it very comprehensive.
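To show why leaving branches out of the mix matters, here is a small sketch that scores two CPUs with a weighted instruction mix. The CPU names, the per-class cycle costs and both mixes are invented for illustration only (they are not SYSINFO's real mix or measured data); the point is that a branch-free mix and a branch-heavy mix can rank the same two CPUs in opposite order.

```python
# Hypothetical cycles-per-instruction figures for two made-up CPUs (not measured data).
CPU_COSTS = {
    "cpu_a": {"alu": 1.0, "call": 4.0, "mul_div": 10.0, "branch": 6.0},
    "cpu_b": {"alu": 1.5, "call": 5.0, "mul_div": 12.0, "branch": 2.0},
}

# Hypothetical instruction mixes, as percentages of executed instructions.
NO_BRANCHES = {"alu": 70, "call": 20, "mul_div": 10}
WITH_BRANCHES = {"alu": 50, "call": 15, "mul_div": 5, "branch": 30}


def average_cpi(costs, mix):
    """Weighted average cycles per instruction for the given mix."""
    total = sum(mix.values())
    return sum(costs[cls] * share for cls, share in mix.items()) / total


def mips(clock_mhz, costs, mix):
    """MIPS = clock frequency in MHz divided by the average CPI."""
    return clock_mhz / average_cpi(costs, mix)


def faster_cpu(mix, clock_mhz=100.0):
    """Name of the CPU with the higher MIPS figure for this mix (same clock for both)."""
    return max(CPU_COSTS, key=lambda name: mips(clock_mhz, CPU_COSTS[name], mix))


if __name__ == "__main__":
    for label, mix in (("no branches", NO_BRANCHES), ("30% branches", WITH_BRANCHES)):
        scores = {name: round(mips(100.0, c, mix), 1) for name, c in CPU_COSTS.items()}
        print(label, scores, "->", faster_cpu(mix))
```

With these invented numbers, cpu_a wins the branch-free mix (average CPI 2.5 against 3.25), while cpu_b wins once 30% of the instructions are branches (2.7 against 3.4).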

Renee Cousins
(Apollo Team Member)
Posts 142
14 Jul 2019 04:03

Samuel Devulder wrote:

Anyhow, MIPS really stands for Meaningless Indicator of Processor Speed. This thread is all about this. It is time to use a better indicator.

LOL.

When I started working at my latest job, we were using the Freescale S08 processors. For fun, I compiled the Dhrystone benchmark for it and was disappointed with its score; the S08 seriously struggled with C code.

But the reality was that this was a FAST processor as long as you didn't need number-crunching power. It could bit-bang eight I2C channels and keep up with full-speed USB without breaking a sweat. Its small register set made interrupts fast, and it put the I/O registers in zero page, so you could toggle pins crazy fast.

So indeed, MIPS is just one way of measuring a CPU's abilities and it's a shallow one at that.


Mr Niding
Posts 459
14 Jul 2019 07:27

Today's benchmark channels on YouTube go through a large number of games and programs to show FPS and rendering times at different resolutions and color depths, compression and decompression times, etc. So that's what we should concern ourselves with.

Seiya Be

Posts 12
22 Jul 2019 12:03

Gunnar von Boehn wrote:

An interesting question is, why is this cheat feature added to UAE?

SysInfo 4.0 on Vampire V4 (first URL) and SysInfo 4.0 on WinUAE JIT (second URL):
https://imgbb.com/ (image)
https://imgbb.com/ (image)

Bernd Meyer

Posts 6
08 Sep 2019 08:20

Gunnar von Boehn wrote:

So rewriting these 8 instructions to 1 instruction will not work for normal programs. It will NOT make your computer run any program faster.

This trick only triggers so well in SYSINFO and SYSSPEED.

What you call a "trick" or "cheat" is, however, a perfectly valid real-world optimisation. The JIT compiles blocks of 68k code into equivalent x86 code, where "equivalent" means having the same overall effect, regardless of execution time. Basically, a block runs from one conditional or indirect jump to the next.
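A minimal sketch of that block boundary rule, assuming instructions are given as hypothetical (mnemonic, operand) string pairs. The set of block-ending instructions here (Bcc, DBcc, RTS/RTE/RTR, and JMP/JSR with a parenthesised operand as a crude "indirect" test) is an illustrative subset chosen for this example, not the list UAE actually uses; a real JIT also caps block length and handles many more cases.

```python
# Condition codes of the 68k Bcc family; BRA and BSR are unconditional and not listed.
CONDITIONS = {"hi", "ls", "cc", "cs", "ne", "eq", "vc", "vs",
              "pl", "mi", "ge", "lt", "gt", "le", "hs", "lo"}


def ends_block(mnemonic, operand):
    """True for a conditional or indirect jump (illustrative subset, not exhaustive)."""
    base = mnemonic.lower().split(".")[0]        # drop a size suffix such as ".s"
    if base.startswith("b") and base[1:] in CONDITIONS:
        return True                               # Bcc, e.g. bne.s
    if base.startswith("db"):
        return True                               # DBcc/DBRA loop instructions
    if base in {"rts", "rte", "rtr"}:
        return True                               # target is popped from the stack
    if base in {"jmp", "jsr"} and "(" in operand:
        return True                               # crude test for a computed target
    return False


def split_blocks(instructions):
    """Group (mnemonic, operand) pairs into blocks, each ending at a block-ending jump."""
    blocks, current = [], []
    for mnemonic, operand in instructions:
        current.append((mnemonic, operand))
        if ends_block(mnemonic, operand):
            blocks.append(current)
            current = []
    if current:                                   # trailing straight-line code
        blocks.append(current)
    return blocks


# The Protracker snippet discussed in this post, as (mnemonic, operand) pairs.
DSELOOP7 = [
    ("addq.l", "#1,a1"), ("cmp.b", "#0,(a1)"), ("bne.s", "dseloop7"),
    ("move.b", "#'.',(a1)+"), ("move.b", "#'T',(a1)+"), ("move.b", "#'R',(a1)+"),
    ("move.b", "#'K',(a1)+"), ("clr.b", "(a1)+"), ("move.l", "(sp)+,a1"),
]
```

Calling `split_blocks(DSELOOP7)` yields two blocks: the three-instruction loop ending at `bne.s`, and the six instructions after it.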

Now, the JIT in WinUAE does do some optimising, mostly related to the x86 register allocation (or rather, the attempt to avoid allocating registers as much as possible, given how few there are).

Optimisation A keeps track of the state of each 68k register as the compiler steps through the 68k instructions. Each 68k register at any time can either be stored in RAM, or stored in a particular x86 register, or have a known value and not be stored anywhere. Each register may also have a known offset from the stored value. And each register may be stored in its location as big or little endian (not sure whether that one actually is in UAE, or only in Amithlon). Of course, at some point, all of those lazily deferred adjustments need to be made, but delaying them until the end of a block, or until the actual value is needed, can really save time.
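A toy model of that bookkeeping may help. Everything below is hypothetical (the `ToyCompiler` class, the `state_a1`-style labels for the in-memory 68k register file, the three-register pool); it emits Intel-syntax text instead of machine code, ignores byte order and the emulator's memory mapping, and never spills. It only shows the idea: `(An)+` and `ADDQ` change a compile-time offset, memory operands fold that offset in, and a register that is overwritten simply drops its pending offset unapplied.

```python
from dataclasses import dataclass


@dataclass
class RegState:
    where: str = "mem"   # "mem" (emulated CPU-state struct), "host" (x86 register) or "const"
    host: str = ""       # x86 register name when where == "host"
    value: int = 0       # known value when where == "const"
    offset: int = 0      # pending adjustment that no code has applied yet


class ToyCompiler:
    """Hypothetical model of optimisation A; endianness and memory mapping are ignored."""

    def __init__(self):
        self.regs = {}
        self.code = []
        self.free_host = ["eax", "ecx", "edx"]   # no spilling in this toy

    def state(self, reg):
        return self.regs.setdefault(reg, RegState())

    def set_const(self, reg, value):
        # e.g. MOVEA.L #value,An: remember the value, emit nothing
        self.regs[reg] = RegState(where="const", value=value)

    def add_offset(self, reg, n):
        # e.g. ADDQ #n,An or the post-increment of (An)+: emit nothing
        self.state(reg).offset += n

    def address(self, reg):
        # x86 memory operand for (An) with the pending offset folded in
        s = self.state(reg)
        if s.where == "const":
            return f"[{s.value + s.offset:#x}]"
        if s.where == "mem":
            s.host = self.free_host.pop(0)
            s.where = "host"
            self.code.append(f"mov {s.host}, [state_{reg}]")
        return f"[{s.host}+{s.offset}]" if s.offset else f"[{s.host}]"

    def load(self, reg, src):
        # The register receives a brand-new value: its pending offset is dropped unapplied
        s = self.state(reg)
        if s.where != "host":
            s.host = self.free_host.pop(0)
            s.where = "host"
        self.code.append(f"mov {s.host}, {src}")
        s.offset = 0

    def flush(self):
        # End of block: apply every deferred adjustment and write registers back
        for reg, s in self.regs.items():
            if s.where == "host":
                if s.offset:
                    self.code.append(f"lea {s.host}, [{s.host}+{s.offset}]")
                self.code.append(f"mov [state_{reg}], {s.host}")
            elif s.where == "const":
                self.code.append(f"mov dword [state_{reg}], {s.value + s.offset:#x}")
            elif s.offset:
                self.code.append(f"add dword [state_{reg}], {s.offset}")
            s.offset = 0


def compile_tail():
    """Instructions (4) to (9) of the Protracker snippet discussed in this post."""
    c = ToyCompiler()
    for ch in ".TRK":                        # MOVE.B #ch,(A1)+
        addr = c.address("a1")
        c.code.append(f"mov byte {addr}, '{ch}'")
        c.add_offset("a1", 1)
    addr = c.address("a1")                   # CLR.B (A1)+
    c.code.append(f"mov byte {addr}, 0")
    c.add_offset("a1", 1)
    src = f"dword {c.address('sp')}"         # MOVE.L (SP)+,A1
    c.add_offset("sp", 4)
    c.load("a1", src)                        # the +5 pending on A1 is simply discarded
    c.flush()
    return c.code
```

In the output of `compile_tail()`, A1 is loaded once, the five stores use `[eax]` to `[eax+4]`, and no instruction ever adds 5 to A1; only SP's +4 is realised at the end of the block.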

Optimisation B does some very simple dependency analysis, and thus allows the compiler to not generate code to calculate things which are of no consequence. This is particularly useful for flags, but of course, once the mechanism is in place, it makes sense to use it for the registers (and partial registers) as well.
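For the flags, that dependency analysis can be sketched as a classic backward liveness pass over a block. The `(text, writes, reads)` representation and the `live_out` parameter are assumptions made for this example; the flag effects in the tables follow the 68k programmer's reference (MOVE, CLR and CMP set N, Z, V, C and leave X; MOVEA and ADDQ to an address register leave all flags alone). A flag write only needs real x86 code if some later instruction may read that flag before it is overwritten.

```python
# The five flags of the 68k condition code register.
ALL_FLAGS = frozenset("XNZVC")


def needed_flags(block, live_out=ALL_FLAGS):
    """Backward liveness pass: the flags each instruction must actually compute.

    block is a list of (text, writes, reads), with writes/reads as sets of flag
    letters; live_out is what may be read after the block (all flags is the safe default).
    """
    live = set(live_out)
    needed = []
    for _text, writes, reads in reversed(block):
        needed.append(set(writes) & live)       # only writes that a later reader may see
        live = (live - set(writes)) | set(reads)
    needed.reverse()
    return needed


# Instructions (4) to (9) of the Protracker snippet. MOVE and CLR set N, Z, V, C
# (X unaffected); MOVE.L to an address register (MOVEA) leaves all flags alone.
TAIL = [("move.b #'.',(a1)+", set("NZVC"), set()),
        ("move.b #'T',(a1)+", set("NZVC"), set()),
        ("move.b #'R',(a1)+", set("NZVC"), set()),
        ("move.b #'K',(a1)+", set("NZVC"), set()),
        ("clr.b (a1)+", set("NZVC"), set()),
        ("move.l (sp)+,a1", set(), set())]

# The loop block (1) to (3). ADDQ to an address register sets no flags. Both
# successors overwrite N, Z, V, C before reading them, so only X is live out.
LOOP = [("addq.l #1,a1", set(), set()),
        ("cmp.b #0,(a1)", set("NZVC"), set()),
        ("bne.s dseloop7", set(), {"Z"})]
```

Even with the conservative default, `needed_flags(TAIL)` keeps flag work only for the CLR.B; the four MOVE.B instructions need none. For the loop, `needed_flags(LOOP, live_out={"X"})` reduces the CMP to computing Z alone, which is all the BNE reads.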

Optimisation C lets the JIT compiler speculatively keep compiling past an end-of-block instruction, along the path predicted at compile time. This means that the work deferred by (A) can be deferred even further.

Let me put some real-world code here, from Protracker (see https://16-bits.org/pt_src/tracker/PT4.0.s):


  dseloop7:
  (1)        addq.l  #1,a1
  (2)        cmp.b   #0,(a1)
  (3)        bne.s   dseloop7
  
  (4)        move.b  #'.',(a1)+
  (5)        move.b  #'T',(a1)+
  (6)        move.b  #'R',(a1)+
  (7)        move.b  #'K',(a1)+
  (8)        CLR.B   (A1)+
  (9)        MOVE.L  (SP)+,A1

Instructions (1) to (3) are one block. When the compiler gets to instruction (1), it doesn't emit any code; it simply increases the offset of A1 by 1 (optimisation A). Then for instruction (2), it allocates an x86 register to A1 and loads its value from the in-memory 68k state. It can then generate the x86 comparison instruction corresponding to (2), incorporating the known offset into the memory access.
Then it reaches instruction (3) and generates a conditional jump. Since the expected behaviour is for the 68k branch (BNE) to be taken, the x86 code branches away (to some yet-to-be-generated fixup code) on equality. It then continues to generate more code starting again at (1), due to optimisation (C). So the code for (1) to (3) ends up as (extremely simplified):


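  code_for_1:
      mov  eax,[address_of_A1_in_memory_state]   ; load A1 once; the +1 from (1) stays a pending offset
      cmp  byte [eax+1],0                        ; (2) with the offset folded into the address
      je   fixup1                                ; (3) BNE predicted taken: leave the trace on equality
      cmp  byte [eax+2],0                        ; next iteration of (1)-(2), offset now +2
      je   fixup2
      (...)
      cmp  byte [eax+15],0
      je   fixup15
      lea  eax,[eax+15]                          ; still no zero byte: realise A1 += 15
      mov  [address_of_A1_in_memory_state],eax
      RET(1)                                     ; continue at instruction (1)
  fixup1:
      lea  eax,[eax+1]                           ; realise the pending offset only on exit
      mov  [address_of_A1_in_memory_state],eax
      RET(4)                                     ; zero byte found: continue at instruction (4)
  fixup2:
      lea  eax,[eax+2]
      mov  [address_of_A1_in_memory_state],eax
      RET(4)
  (...)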

where "RET(x)" stands for the code which takes the known 68k PC (instruction (x)) and finds what x86 code to call for it.
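The shape of that trace is mechanical enough to generate. The sketch below is a hypothetical generator (the function name and the `unroll` limit are assumptions; 15 matches the listing) that combines optimisation C, compiling through the predicted-taken BNE once per iteration, with optimisation A, keeping the growing +k as an offset that is only materialised on an exit path. It emits Intel-syntax text, and `RET(x)` is the post's shorthand for "look up the x86 code for 68k instruction (x)", not a real instruction.

```python
def compile_scan_loop(unroll, state="address_of_A1_in_memory_state"):
    """Toy trace compiler for: dseloop7: addq.l #1,a1 / cmp.b #0,(a1) / bne.s dseloop7.

    The BNE is predicted taken, so compilation continues through it (optimisation C)
    while the +1 per iteration stays a pending offset on A1 (optimisation A).
    """
    body = ["code_for_1:", f"    mov  eax, [{state}]"]
    fixups = []
    for k in range(1, unroll + 1):
        body.append(f"    cmp  byte [eax+{k}], 0")    # (2) with A1's pending offset k folded in
        body.append(f"    je   fixup{k}")              # (3) leave the trace if the prediction fails
        fixups += [f"fixup{k}:",
                   f"    lea  eax, [eax+{k}]",         # materialise A1 only on this exit
                   f"    mov  [{state}], eax",
                   "    RET(4)"]                       # zero byte found: continue at (4)
    body += [f"    lea  eax, [eax+{unroll}]",
             f"    mov  [{state}], eax",
             "    RET(1)"]                             # trace budget used up: loop back to (1)
    return body + fixups


if __name__ == "__main__":
    print("\n".join(compile_scan_loop(15)))
```

A larger `unroll` defers A1's update over more iterations at the cost of more code; the fixups grow linearly with it.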

Similarly, the code generated for (4) to (8) does not increment the x86 register holding A1; the compiler simply increments the offset in its own state. Then, when translating instruction (9), that state gets overwritten without ever having been realised. And while instructions (4) to (8) each set the 68k flags ((9) is a MOVE.L to an address register, which leaves them alone), the flags produced by (4) to (7) are overwritten by the very next instruction without ever being looked at, so the compiler can avoid generating extra x86 code for them (optimisation (B)).

So, again, what you call a "cheat" is simply the result of some real-world optimisations which happen to be applicable to some remarkably bad benchmarking code.
