Bernd Meyer
Posts 6 08 Sep 2019 08:20
| Gunnar von Boehn wrote:
| So rewriting these 8 instructions to 1 instruction will not work for normal programs. It will NOT make your computer run any program faster. This trick only trigger so well in SYSINFO and SYSSPEED. |
What you call a "trick" or "cheat" is, however, a perfectly valid real-world optimisation. The JIT compiles blocks of 68k code into equivalent x86 code, where "equivalent" means having the same overall effect, regardless of execution time. Basically, a block runs from one conditional or indirect jump to the next.

Now, the JIT in WinUAE does do some optimising, mostly related to x86 register allocation (or rather, the attempt to avoid allocating registers as much as possible, given how few there are).

Optimisation A keeps track of the state of each 68k register as the compiler steps through the 68k instructions. Each 68k register at any time can either be stored in RAM, or stored in a particular x86 register, or have a known value and not be stored anywhere. Each register also may have a known offset from the stored value. And each register may be stored in its location as big or little endian (not sure whether that one actually is in UAE, or only in Amithlon). Of course, at some point, all of those lazily-deferred adjustments need to be made, but delaying them until the end of a block, or until the actual value is needed, can really save time.

Optimisation B does some very simple dependency analysis, and thus allows the compiler to not generate code to calculate things which are of no consequence. This is particularly useful for flags, but of course, once the mechanism is in place, it makes sense to use it for the registers (and partial registers) as well.

Optimisation C lets the JIT compiler speculatively compile through end-of-block instructions, assuming the branch goes the way it was predicted at compile time. This means that the delayed stuff from (A) can be further delayed.

Let me put some real-world code here, from Protracker (see https://16-bits.org/pt_src/tracker/PT4.0.s):

dseloop7:
(1)  addq.l  #1,a1
(2)  cmp.b   #0,(a1)
(3)  bne.s   dseloop7
(4)  move.b  #'.',(a1)+
(5)  move.b  #'T',(a1)+
(6)  move.b  #'R',(a1)+
(7)  move.b  #'K',(a1)+
(8)  CLR.B   (A1)+
(9)  MOVE.L  (SP)+,A1
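
As an aside before walking through that listing: the bookkeeping behind optimisation A can be modelled in a few lines. This is only a sketch in Python, not UAE's actual code. The names RegState, Emitter, add_imm, address_of and flush are made up for illustration, the "allocator" hands out a single register, and the emitted strings are simplified x86 in Intel-style notation. Endianness tracking and known constant values are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegState:
    """Compile-time knowledge about one 68k register (hypothetical model)."""
    x86_reg: Optional[str] = None  # None: value only lives in the in-memory 68k state
    offset: int = 0                # adjustment owed to the stored value, not yet applied


class Emitter:
    """Toy code generator; every name here is illustrative, not UAE's."""

    def __init__(self) -> None:
        self.code: List[str] = []
        self.regs: Dict[str, RegState] = {}

    def state(self, name: str) -> RegState:
        return self.regs.setdefault(name, RegState())

    def add_imm(self, name: str, imm: int) -> None:
        # addq/subq or a post-increment: emit nothing, just remember the debt.
        self.state(name).offset += imm

    def address_of(self, name: str) -> str:
        # Load the register on first use, then fold the pending offset
        # into the addressing mode instead of adjusting the register.
        st = self.state(name)
        if st.x86_reg is None:
            st.x86_reg = "eax"  # toy allocator: one register is enough here
            self.code.append(f"mov {st.x86_reg},[{name}_in_memory]")
        return f"[{st.x86_reg}+{st.offset}]"

    def flush(self, name: str) -> None:
        # End of block: realise the deferred offset and write the value back.
        st = self.state(name)
        if st.offset == 0:
            return
        if st.x86_reg is None:
            self.code.append(f"add dword [{name}_in_memory],{st.offset}")
        else:
            self.code.append(f"lea {st.x86_reg},[{st.x86_reg}+{st.offset}]")
            self.code.append(f"mov [{name}_in_memory],{st.x86_reg}")
        st.offset = 0


e = Emitter()
e.add_imm("a1", 1)                    # (1) addq.l #1,a1: no code at all
addr = e.address_of("a1")             # (2) needs A1, so it gets loaded now
e.code.append(f"cmp byte {addr},0")   # offset rides along in the address
e.flush("a1")                         # block ends: pay the debt once
print("\n".join(e.code))
```

The addq never turns into an instruction of its own; its effect shows up only in the addressing mode of the compare and in the single lea when the block is flushed.
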
Lines 1 to 3 are one block. When the compiler gets to instruction (1), it doesn't emit any code; it simply increases the offset of A1 by 1 (optimisation A). Then for instruction (2), it will allocate an x86 register to A1 and load its value from the in-memory 68k state. It can then generate the x86 comparison instruction corresponding to (2), incorporating the known offset into the memory access. Then it reaches instruction (3), and generates a conditional jump. Given that the expected behaviour is for the 68k BNE to be taken, the x86 code will branch away (to some yet-to-be-generated fixup code) on equality. It will then continue to generate more code starting at (1), due to optimisation (C). So the code for (1) to (3) ends up as (extremely simplified)

code_for_1:
    mov eax,(address_of_A1_in_memory_state)
    cmp (eax,1),0
    je  fixup1
    cmp (eax,2),0
    je  fixup2
    (...)
    cmp (eax,15),0
    je  fixup15
    lea eax,(eax,15)
    mov (address_of_A1_in_memory_state),eax
    RET(1)
fixup1:
    lea eax,(eax,1)
    mov (address_of_A1_in_memory_state),eax
    RET(4)
fixup2:
    lea eax,(eax,2)
    mov (address_of_A1_in_memory_state),eax
    RET(4)
(...)

where "RET(x)" stands for the code which takes the known 68k PC (instruction (x)) and finds what x86 code to call for it.

Similarly, the code generated for (4) to (8) will not increment the x86 register holding A1, but simply increment the offset in the compiler state. And then, when translating instruction (9), that state gets overwritten without ever having been realised. And while each of instructions (4) to (8) sets the 68k flags (the MOVE.L to A1 in (9) is a MOVEA and leaves them alone), the ones generated by (4) to (7) are known to be immediately overwritten without ever being looked at, so the compiler can avoid generating extra x86 code for them (optimisation (B)).

So, again, what you call "cheat" is simply the result of some real-world optimisations which happen to be applicable to some remarkably bad benchmarking code.
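
For anyone who wants to see the flag side of this spelled out, here is a rough sketch of the kind of dependency analysis described above, written as a backward pass over one block. It is a Python model under stated assumptions, not the JIT's code: Insn and flags_needed are invented names, the condition codes are treated as a single unit (the real thing can track flags individually), and the flags are conservatively assumed to be live at the end of the block. It also assumes, as on the 68000, that a MOVE.L into an address register (MOVEA) does not touch the condition codes.

```python
from typing import List, NamedTuple


class Insn(NamedTuple):
    text: str
    sets_flags: bool
    reads_flags: bool


def flags_needed(block: List[Insn], live_at_end: bool = True) -> List[bool]:
    """For each instruction, report whether the flags it sets can be observed.

    Walks the block backwards: a flag-setting instruction kills the liveness
    of everything before it, a flag-reading instruction makes flags live.
    """
    needed = [False] * len(block)
    live = live_at_end  # whoever runs next might test the flags
    for i in range(len(block) - 1, -1, -1):
        insn = block[i]
        if insn.sets_flags:
            needed[i] = live
            live = False  # earlier results are overwritten here
        if insn.reads_flags:
            live = True   # an instruction reads its inputs before writing
    return needed


# Instructions (4) to (9) from the Protracker listing.
BLOCK = [
    Insn("move.b #'.',(a1)+", True, False),
    Insn("move.b #'T',(a1)+", True, False),
    Insn("move.b #'R',(a1)+", True, False),
    Insn("move.b #'K',(a1)+", True, False),
    Insn("clr.b (a1)+", True, False),
    Insn("move.l (sp)+,a1", False, False),  # MOVEA: condition codes untouched
]

for insn, need in zip(BLOCK, flags_needed(BLOCK)):
    print(f"{insn.text:20} flag code {'emitted' if need else 'skipped'}")
```

Under these assumptions the four MOVE.B instructions need no flag code at all; only the CLR.B keeps its flags, and even that goes away if the code following the block is known to overwrite them (pass live_at_end=False to model that case).
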