
Welcome to the Apollo Forum

Performance and Benchmark Results!

Pc Relative Vs Absolute Addressing Performance

Patrik Axelsson

Posts 3
12 Jun 2019 21:15


In recent efforts to reduce executable size for a couple of very space-sensitive projects with requirements like "fit inside an FFS block" or "fit inside two OFS blocks", I found that I could save some space by making vbcc not emit code requiring relocations for dos.library functions.

Most of the dos.library functions are defined to take addresses in data registers (probably because of its BCPL heritage). When passed the address of constant data, like a string, vbcc will by default emit code like "move.l #label,dX", which requires a relocation entry in the resulting executable.

For N such function calls, this costs 10+N*2 Bytes for kick2.0+ drel32, 16+N*4 Bytes for kick1.x compatible rel32, so a respectable number of bytes can be saved if relocations can be avoided.
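The two size formulas above can be tabulated with a short sketch. The function names are made up for illustration, and the formulas are taken as stated in the post for N >= 1; nothing here is measured.

```python
def drel32_bytes(n: int) -> int:
    """Relocation overhead for n relocated references, Kickstart 2.0+ drel32 (10 + N*2)."""
    return 10 + 2 * n


def rel32_bytes(n: int) -> int:
    """Relocation overhead for n relocated references, Kickstart 1.x compatible rel32 (16 + N*4)."""
    return 16 + 4 * n


# Print the overhead for a few call counts (assumes n >= 1, as in the post).
for n in (1, 4, 16):
    print(f"{n:2d} calls: drel32 {drel32_bytes(n):3d} bytes, rel32 {rel32_bytes(n):3d} bytes")
```

For a program with four such calls this gives 18 bytes with drel32 and 32 bytes with rel32, all of which disappears if no relocation is needed.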

I found that you could make vbcc instead emit "lea label(pc),aX move.l aX,dX" by redefining the inline function calls to use an address register and do a move to correct data register in the inline assembly before the jsr to the lvo. The code size for these two instructions is equal to the single instruction with the absolute address and as it is pc relative, it avoids generating a relocation, which is great.

Even if this method generated code needing a couple more cycles, it would not matter for dos.library functions, which are generally I/O bound. However, if considered as a general optimization for vbcc, it would be interesting to know how it performs.

Created a simplistic test for this:
http://megaburken.net/~patrik/addrtest.lha

It consists of two executables where one loops a great number of times over 32 "move.l #label,dX", while the other loops the same number of times over 32 "lea label(pc),aX move.l aX,dX".

The code is small enough to fit inside the cache of the 020/030, which have the smallest caches of the 680x0 family, and 32 repetitions make the loop handling around them insignificant.
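As a quick sanity check of the cache claim, the size of the repeated part can be compared with the 256-byte instruction cache of the 68020/68030. The constant and function names are illustrative, and the size of the loop handling itself is not stated in the post, so only the remaining headroom is shown.

```python
ICACHE_BYTES = 256          # on-chip instruction cache of the 68020 and 68030
REPETITIONS = 32            # repetitions per loop iteration, as described in the post
BYTES_PER_REPETITION = 6    # move.l #label,dX, or lea label(pc),aX + move.l aX,dX


def loop_body_bytes(repetitions: int = REPETITIONS,
                    bytes_each: int = BYTES_PER_REPETITION) -> int:
    """Size in bytes of the repeated instructions, without the loop handling."""
    return repetitions * bytes_each


body = loop_body_bytes()
print(f"repeated part: {body} bytes, headroom for loop handling: {ICACHE_BYTES - body} bytes")
```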

In the below results, addrtest is the pc-relative test and addrtest_reloc is the absolute test which has relocations in the executable. I used EXTERNAL LINK for measuring the duration.

Results from A500+ with 7.14MHz 68000, 8MB fastmem:
2.Ram Disk:> time addrtest
512.973114s
2.Ram Disk:> time addrtest_reloc
512.991696s

Results from A1200 with 14.18758MHz 68020, no fastmem (fastmem irrelevant, fits inside cache):
2.Ram Disk:> time addrtest
121.947783s
2.Ram Disk:> time addrtest_reloc
121.949324s

Results from A3000 with 25MHz 68030, 16MB fastmem:
10.Temp:addrtest> time addrtest
68.637883s
10.Temp:addrtest> time addrtest_reloc
68.644236s

Results from A4000 with 32.768MHz 68040 on WarpEngine, 32MB fastmem:
9.Ram Disk:> time addrtest
42.938625s
9.Ram Disk:> time addrtest_reloc
9.505656s

Results from A4000 with 50MHz 68060 on CSPPC, 128MB fastmem:
9.Ram Disk:> time addrtest
8.308892s
9.Ram Disk:> time addrtest_reloc
8.310202s

Results from A500 with Vampire (AGA Core, beta 3, not 100% sure, not my machine):
6.Ram Disk:addrtest> uhc:c/time addrtest
6.457900s
6.Ram Disk:addrtest> uhc:c/time addrtest_reloc
3.282050s

The Vampire and the 040 are the exceptions and get better performance with the absolute addressing solution; all other CPUs perform equally with both.
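To put numbers on this, the ratio of the two durations can be computed per machine from the timings listed above. This is only a small sketch; the dictionary layout and the `slowdown` helper are illustrative names, and the values are copied from the results in this post.

```python
# (pc-relative seconds, absolute seconds), copied from the results above
timings = {
    "68000 @ 7.14 MHz": (512.973114, 512.991696),
    "68020 @ 14.19 MHz": (121.947783, 121.949324),
    "68030 @ 25 MHz": (68.637883, 68.644236),
    "68040 @ 32.768 MHz": (42.938625, 9.505656),
    "68060 @ 50 MHz": (8.308892, 8.310202),
    "Vampire (AGA core)": (6.457900, 3.282050),
}


def slowdown(pc_relative: float, absolute: float) -> float:
    """How many times longer the pc-relative test ran than the absolute one."""
    return pc_relative / absolute


for cpu, (pcrel, absolute) in timings.items():
    print(f"{cpu:20s} {slowdown(pcrel, absolute):.3f}x")
```

A ratio close to 1.000 means both variants perform the same; only the 68040 and the Vampire deviate clearly from that.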

It would be very interesting if anyone has any insight into why the performance differs between these two solutions on the Vampire.


Gunnar von Boehn
(Apollo Team Member)
Posts 4169
13 Jun 2019 07:40


Hi Patrik,
 
  let's look together at the code
 
  Your routines look like this
 

  move.l  #hellostring,d2    -- 6 byte
  move.l  #hellostring,d2    -- 6 byte
  move.l  #hellostring,d2    -- 6 byte
  move.l  #hellostring,d2    -- 6 byte
  move.l  #hellostring,d2    -- 6 byte
 

 
 

  lea    hellostring(pc),a2  -- 4 byte
  move.l  a2,d2              -- 2 byte
  lea    hellostring(pc),a2  -- 4 byte
  move.l  a2,d2              -- 2 byte
  lea    hellostring(pc),a2  -- 4 byte
  move.l  a2,d2              -- 2 byte
  lea    hellostring(pc),a2  -- 4 byte
  move.l  a2,d2              -- 2 byte
  lea    hellostring(pc),a2  -- 4 byte
  move.l  a2,d2              -- 2 byte
 

 
 
We see that both routines have the same total code length.
But one routine has twice as many instructions.
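The comparison can be spelled out by counting instructions and bytes for the two listings above. This is just an illustrative sketch; the list layout and the `summarize` helper are not from the thread, and the byte sizes are the ones annotated in the listings.

```python
# Each entry is (instruction, size in bytes), five repetitions as in the listings above.
absolute = [("move.l #hellostring,d2", 6)] * 5
pc_relative = [("lea hellostring(pc),a2", 4), ("move.l a2,d2", 2)] * 5


def summarize(routine):
    """Return (instruction count, total size in bytes) for a routine."""
    return len(routine), sum(size for _, size in routine)


print("absolute:   ", summarize(absolute))
print("pc-relative:", summarize(pc_relative))
```

Both come to 30 bytes, but the pc-relative routine needs 10 instructions where the absolute one needs 5.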

Fewer instructions typically means faster.

But the 68000/68020/68030/68060 CPUs have a hardware limitation in loading instructions.
This means the execution speed on these CPUs depends a lot on the code length.
Therefore, even though one of your routines has fewer instructions, it does not run faster on those CPUs, as their main bottleneck is instruction loading.

The 68040 and 68080 have more power in loading instructions - for them the instruction size is not a limiting factor.




Nixus Minimax

Posts 341
13 Jun 2019 11:38


Patrik Axelsson wrote:

  Results from A500 with Vampire (AGA Core, beta 3, not 100% sure, not my machine):
  6.Ram Disk:addrtest> uhc:c/time addrtest
  6.457900s
  6.Ram Disk:addrtest> uhc:c/time addrtest_reloc
  3.282050s
 
  The Vampire and the 040 are the exceptions and get better performance with the absolute addressing solution; all other CPUs perform equally with both.
 
  It would be very interesting if anyone has any insight into why the performance differs between these two solutions on the Vampire.
 

 
  Easy to explain :)
 
  The 040 is better than the 060 at using long constants which is why the relocated binary executes faster on the 040 than the lea/move variant which requires the monoscalar 040 to execute two instructions sequentially.
 
  The 060 is slow on the long constant and faster on the short constant in the PC-relative instruction but needs to execute the move sequentially which nullifies the speed advantage of the PC-relative instruction. This is because the move.l depends on the lea to finish because the move needs the result of the lea.
 
  Now the Vampire could do both equally fast but by placing the move.l directly behind the lea you create suboptimal code. More precisely, the 080 has an EA-pipeline that operates independently from the ALU. The EA-pipeline makes sure that address calculations are finished once the effective addresses are needed during operand fetch in the ALU. However, this EA-pipeline cannot be placed so early that it will already be finished in the next instruction slot because, well, then it would be just part of a very long pipeline which has a lot of disadvantages. Thus, if the very next instruction in the ALU requires the result of the EA-pipeline running in parallel to the ALU, the ALU needs to wait for the EA-calculation to finish which creates a bubble. If, OTOH, the EA-calculation is placed a couple of instructions earlier in the instruction stream, the EA-calculation will completely disappear in the execution time and both variants will be equally fast.
 
You could e.g. execute a loop of
 
  lea label(pc),d0
  move.l d1,d3
  lea label(pc),d1
  move.l d0,d2
 
without any speed penalty over the relocated variant
 
  move.l #label,d0
  move.l #label,d1
 


Simo Koivukoski
(Apollo Team Member)
Posts 468
15 Jun 2019 16:30


Results from Standalone with 113MHz 68080
New Shell process 4
4.Ram Disk:> time addrtest
4.756417s
4.Ram Disk:> time addrtest_reloc
2.381140s
4.Ram Disk:> vcontrol hz
AC68080 @ 113 MHz (x16)
4.Ram Disk:>



Patrik Axelsson

Posts 3
15 Jun 2019 16:55


Gunnar and Nixus, thank you for your answers, very interesting.

Nixus Minimax wrote:
 
  You could e.g. execute a loop of
 
    lea label(pc),d0
    move.l d1,d3
    lea label(pc),d1
    move.l d0,d2
 
  without any speed penalty over the relocated variant
 
    move.l #label,d0
    move.l #label,d1

With this example in mind, I implemented a test which populates registers like vbcc would do for the dos.library/Open("readme", MODE_OLDFILE) call.

For the standard variant causing relocations, called argpopulate_reloc, the repeated part looks like this:
        move.l  d3,a6
        move.l  #filename,d1
        move.l  #$000003ed,d2

For the version without relocations, called argpopulate, the repeated part looks like this:
        move.l  d3,a6
        lea    filename(pc),a1
        move.l  #$000003ed,d2
        move.l  a1,d1

I also did a version called argpopulate_leafirst, where lea is done before a6 population to give the EA pipeline as much time as possible:
        lea    filename(pc),a1
        move.l  d3,a6
        move.l  #$000003ed,d2
        move.l  a1,d1

These parts are repeated 16 times and then looped around many times.

Results for the same A500 Vampire 2 with AGA Core, beta 3:
10.Ram Disk:addrtest> uhc:c/time argpopulate
6.459489s
10.Ram Disk:addrtest> uhc:c/time argpopulate_leafirst
6.459585s
10.Ram Disk:addrtest> uhc:c/time argpopulate_reloc
4.872216s

So the difference is a little smaller, but there are far fewer lea instructions, so it should reasonably be smaller. It appears the EA-pipelining is not happening, for reasons I don't understand.

Anyhow, this test is just dry-populating registers and is nowhere near a real world example, so I created another test where a small function is called after populating the registers. This function contains just 2 instructions and an rts.

Just showing the repeated part of the version without relocations now, to make this post less long:
        move.l  d3,a6
        lea    filename(pc),a1
        move.l  #$000003ed,d2
        move.l  a1,d1
        jsr    somefunction(pc)

The function looks like this:
somefunction:
        move.l  d1,d0
        add.l  d2,d0
        rts

The same 16 repetitions and the same variants on the same Vampire 2:
10.Ram Disk:addrtest> uhc:c/time funccall
11.234572s
10.Ram Disk:addrtest> uhc:c/time funccall_leafirst
11.326574s
10.Ram Disk:addrtest> uhc:c/time funccall_reloc
9.653664s

Yet again less difference - there is a lot more done apart from the lea now. Still not equal, so

As I said earlier, this is not my machine. The owner decided to switch to the GOLD 2.11 core and rerun the tests; these are the results for all tests:
6.Ram Disk:addrtest3> uhc:c/time addrtest
6.955444s
6.Ram Disk:addrtest3> uhc:c/time addrtest_reloc
3.476436s

Slightly slower than before.

6.Ram Disk:addrtest3> uhc:c/time argpopulate
3.588643s
6.Ram Disk:addrtest3> uhc:c/time argpopulate_leafirst
3.585960s
6.Ram Disk:addrtest3> uhc:c/time argpopulate_reloc
2.722472s

By these timings the relocation variant is still ~32% faster, about the same as the ~33% before.

6.Ram Disk:addrtest3> uhc:c/time funccall
7.073051s
6.Ram Disk:addrtest3> uhc:c/time funccall_leafirst
7.070668s
6.Ram Disk:addrtest3> uhc:c/time funccall_reloc
7.071081s

This is really interesting - no performance difference for this test on the GOLD 2.11 core, so apparently the EA-pipelining works for this combination of core and code.
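For reference, the relative differences for all three tests on both cores can be recomputed from the timings posted above. This is a small sketch; the `results` layout and the `extra_time_percent` helper are illustrative names, and "difference" is defined here as how much longer the pc-relative variant runs than the reloc variant.

```python
# (pc-relative seconds, reloc seconds), copied from the Vampire results in this thread
results = {
    ("AGA beta 3", "addrtest"): (6.457900, 3.282050),
    ("AGA beta 3", "argpopulate"): (6.459489, 4.872216),
    ("AGA beta 3", "funccall"): (11.234572, 9.653664),
    ("GOLD 2.11", "addrtest"): (6.955444, 3.476436),
    ("GOLD 2.11", "argpopulate"): (3.588643, 2.722472),
    ("GOLD 2.11", "funccall"): (7.073051, 7.071081),
}


def extra_time_percent(pc_relative: float, reloc: float) -> float:
    """Percentage by which the pc-relative run is longer than the reloc run."""
    return (pc_relative / reloc - 1.0) * 100.0


for (core, test), (pcrel, reloc) in results.items():
    print(f"{core:10s} {test:12s} {extra_time_percent(pcrel, reloc):6.1f}%")
```

The leafirst variants are left out because their timings are practically identical to the plain pc-relative ones in every run.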

Regarding the 030, 040 and 060:
The argpopulate#? and funccall#? tests show no difference between the non-reloc and reloc versions on the 030 and 060.

The 040 is like before - the reloc version is always fastest, but as much more normal work is done compared to the lea+move, the difference is dramatically reduced. For example, the 040 difference in addrtest#? was 350%, but in funccall#? it is only 26%.

Test executables and sources: http://megaburken.net/~patrik/addrtest3.lha


Nixus Minimax

Posts 341
17 Jun 2019 07:59


Patrik Axelsson wrote:
argpopulate_reloc:
          move.l  d3,a6
          move.l  #filename,d1
          move.l  #$000003ed,d2
 
argpopulate:
          move.l  d3,a6
          lea    filename(pc),a1
          move.l  #$000003ed,d2
          move.l  a1,d1
 
argpopulate_leafirst:
          lea    filename(pc),a1
          move.l  d3,a6
          move.l  #$000003ed,d2
          move.l  a1,d1

Results for the same A500 Vampire 2 with AGA Core, beta 3:
  10.Ram Disk:addrtest> uhc:c/time argpopulate
  6.459489s
  10.Ram Disk:addrtest> uhc:c/time argpopulate_leafirst
  6.459585s
  10.Ram Disk:addrtest> uhc:c/time argpopulate_reloc
  4.872216s

I think the 080 is so powerful in other regards that the instructions between the LEA and the MOVE.L are too few in your example:

1:        lea    filename(pc),a1 ; EA-pipe
2:        move.l  d3,a6          ; 1st ALU
3:        move.l  #$000003ed,d2  ; 2nd ALU (executed in the same cycle as 2)
4:        move.l  a1,d1          ; 1st ALU (only two cycles after 1)

I believe the EA-pipeline is three cycles ahead of the ALUs, so with superscalar code in between you can execute up to six instructions between the LEA and the MOVE.L.
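That estimate is just the pipeline lead multiplied by the issue width. A toy sketch, assuming the three-cycle lead believed above and the two ALUs from the annotated listing; the function name is made up here, and this is not a model of the real 68080 pipeline.

```python
def hideable_instruction_slots(ea_lead_cycles: int = 3, issue_width: int = 2) -> int:
    """Instruction slots between an lea and its first consumer under the stated assumptions."""
    return ea_lead_cycles * issue_width


slots = hideable_instruction_slots()
used_in_example = 2  # the two move.l instructions between the lea and move.l a1,d1 above
print(f"slots available: {slots}, used in argpopulate_leafirst: {used_in_example}")
```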



Gunnar von Boehn
(Apollo Team Member)
Posts 4169
17 Jun 2019 08:11


two quick comments:
 
1) I would advise that if you do performance measurements
  and reviews, you always measure speed on CORE V2.11, not on GOLD 3 ALPHA. The alpha core was not compiled for speed.
 
2) While measuring this parameter passing for DOS calls is interesting, it has very little or no real world impact.
Programs will not get any faster or slower by this.
 



Patrik Axelsson

Posts 3
18 Jun 2019 22:52


@Nixus:
That makes sense, thank you very much for explaining further.

@Gunnar:
I am trying to convince others that doing this to save executable size makes sense. When just looping over the instructions differing between the methods, the only CPUs showing a difference were the 040 and Vampire, which was the reason for asking here.

Now after understanding the reason for the Vampire difference better and creating a slightly more realistic test with a function call plus running the correct core, the only CPU with a difference between methods is the 040.

And yes, indeed, if a real function would be called, it would most likely be impossible to measure the difference even on the 040 as it only differs ~26% with this extremely short function.


Nixus Minimax

Posts 341
19 Jun 2019 08:14


Patrik Axelsson wrote:
Now after understanding the reason for the Vampire difference better and creating a slightly more realistic test with a function call plus running the correct core, the only CPU with a difference between methods is the 040.

You are asking the right questions but, as Gunnar said, just don't use the Gold3 alpha core for testing because the CPU in that is really old. The Gold3 alpha core was made a long time ago and I think it even misses a lot of CPU features. If you want to get reliable results, use the most recent core from the Gold 2.x series, which right now is 2.11.




Gunnar von Boehn
(Apollo Team Member)
Posts 4169
19 Jun 2019 10:19


If you look at performance then the "complete" answer could be:

1) A program needs time to load from disk

2) The OS will process and relocate the loaded program before execution. This means every entry in the relocation table will take a little time.

3) The parameter setup for a DOS call needs little time.
But we are talking here about a difference of 1-2 cycles.

4) Executing a DOS command will take time.
In the case of file I/O this can be millions of cycles.
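To give a feeling for points 3 and 4, the share of the parameter setup difference in one call can be estimated. This is only a rough sketch: the helper name is illustrative, and one million cycles is an assumed stand-in for "millions of cycles".

```python
def setup_share(extra_cycles: int, call_cycles: int) -> float:
    """Fraction of the total time of one call spent on the extra setup cycles."""
    return extra_cycles / (extra_cycles + call_cycles)


# 2 extra cycles of setup against an assumed 1,000,000 cycles of file I/O
share = setup_share(2, 1_000_000)
print(f"share of the setup difference: {share:.6%}")
```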


