Information about the Apollo CPU and FPU.
Patents to Improve CPU Core?
| | Samuel Crow
Posts 424 13 Aug 2022 05:37
| I just happened to look up the opcode fusion patent on Google's patent search. It expired last month on the third of July. EXTERNAL LINK @Gunnar Do you have a deal with Intel for patent permissions? If so, will the macro-op to micro-op patent open up the possibility of a 68090? EXTERNAL LINK I realize that US patents may not have the same teeth in Europe but European patents do. I'd like to discuss possibilities briefly here regarding a few ideas to improve the 68080 core into something newer.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 13 Aug 2022 09:41
| Samuel Crow wrote:
| I'd like to discuss possibilities briefly here regarding a few ideas to improve the 68080 core into something newer.
|
Hi Sam, Sure, what idea do you have?
| |
| | Samuel Crow
Posts 424 13 Aug 2022 15:12
| Hi Gunnar! I've got an idea about how to implement an arbitrary memory-to-memory vector op collection using opcode fusion and partial loop unrolling, to hopefully bypass the ARMv9 memory-to-memory vector patent. It could also make an additional 5-stage pipeline practical. (That pipeline has two decoder stages for more powerful opcode fusions.) Of course, large 3-way superscalar cores need bigger FPGAs, so it may not be practical at current FPGA levels. If it isn't practical at all, we can stop here. If it may become a generational improvement, we can continue. The gist is being able to implement an iteration of the following as an addition to AMMX:
    LABEL:
        OP.W    (A1)+,(A2)+
        DBF     D1,LABEL
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 13 Aug 2022 15:54
| Samuel Crow wrote:
| LABEL: OP.W (A1)+,(A2)+ DBF D1,LABEL |
|
Let us start step by step. Can we start with the "problem" you want to solve? Please give very concrete examples, and please tell me where you see the current bottleneck.
| |
| | Samuel Crow
Posts 424 13 Aug 2022 19:20
| Ok. The blitter emulation running on the second thread uses AMMX to accelerate the horizontal word runs, so it can execute 4 16-bit operations per clock. The counter executes every other clock cycle, so the inner loop is 2 clock cycles per iteration. Next, there are the memory accesses to load and store the 64-bit sections, for an additional 2 instructions per loop. That's pretty good for a 2-way superscalar CPU.
My proposal is to break up the 4 16-bit ops into any number of 16-bit ops, as indicated by the OP.W pseudocode. Once that's done, the loop can be unrolled 2, 4 or even 6 times per iteration, checking and updating the counter at the level-zero micro-op level with every expansion of the macro-op, to make an arbitrary-width vector op. By using the 32-bit vector micro-op that every AMMX op does 2 of per clock, the flexibility to use more pipelines is exposed.
Of course there are side effects, like interrupts needing special consideration, but nothing that I see as a blocker if there's space for it (the big "if"). It's "just" a 2-opcode fusion followed by a loop-unroller.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 13 Aug 2022 20:58
| Samuel Crow wrote:
| Ok. The blitter emulation running on the second thread uses AMMX to accelerate the horizontal word runs so it can execute 4 16-bit operations per clock. |
Sorry, not sure I can follow you. There is no Blitter emulation on the 2nd thread on the 68080. The V4 systems have a 100% DMA-based Amiga-hardware Blitter. The Blitter works 100% like the Amiga Blitter. It uses the Amiga DMA channels and can also be programmed by the Copper. The Blitter can do Bobs, Lines and Fill exactly like the original, but it's significantly faster than the original Blitter, and it can also do 3D texture mapping in 16-bit color or truecolor, bilinear filtering, Z-buffer management and light/shading. But let's talk more about your idea.
Samuel Crow wrote:
| if there's space for it (the big "if"). It's "just" a 2-opcode fusion followed by a loop-unroller. |
I think it will help if you can give a clearer code example so I can understand what you mean. Can you help me with a simple example workload and write some 68K code for it? Thanks. Or I can make some example code for us to talk about. Let's look at simple code to copy a soft-sprite to the screen:
    lea         Spritedata,A0
    lea         Screen,A1
    move.w      #height-1,D1
    Yloop:
        move.w      #width/8-1,D0
    Xloop:
        load        (A0)+,D2
        storem.b    D2,(A1)+
        dbra        D0,Xloop
        add.l       #Modulo,A1
        dbra        D1,Yloop
        rts
Let's say we run in an 8-bit chunky screenmode, as often used by RTG games. This soft-sprite code will copy a sprite to the screen. The inner loop loads 8 pixels and conditionally stores each of them on screen. The inner loop needs 2 cycles to execute. This means 8 pixels are copied in 2 clocks.
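As a reference model of what this inner loop does, here is a minimal Python sketch. It is not AMMX code and not the actual storem.b definition; it assumes that a source byte of 0 means "transparent" and that Modulo is the screen width minus the sprite width. The names `masked_store8`, `blit_soft_sprite` and `screen_width` are hypothetical and only serve the illustration.

```python
def masked_store8(src, dst, src_off, dst_off):
    """Model of one load + conditional store: copy 8 pixels, skipping zeros.

    Assumption: a source byte of 0 is treated as transparent.
    """
    for i in range(8):
        pixel = src[src_off + i]
        if pixel != 0:
            dst[dst_off + i] = pixel


def blit_soft_sprite(sprite, screen, width, height, screen_width):
    """Copy a width x height 8-bit sprite to the top-left of the screen."""
    if width % 8 != 0:
        raise ValueError("width must be a multiple of 8 pixels")
    modulo = screen_width - width      # assumed meaning of Modulo
    src_off = 0
    dst_off = 0
    for _y in range(height):           # Yloop
        for _x in range(width // 8):   # Xloop: 8 pixels per iteration
            masked_store8(sprite, screen, src_off, dst_off)
            src_off += 8
            dst_off += 8
        dst_off += modulo              # skip to the start of the next screen row


# Usage: an 8x2 sprite with every second pixel transparent,
# drawn onto a 16-pixel-wide screen filled with colour 9.
sprite = bytes([1, 0, 2, 0, 3, 0, 4, 0] * 2)
screen = bytearray([9] * 32)
blit_soft_sprite(sprite, screen, width=8, height=2, screen_width=16)
```

Each call to `masked_store8` corresponds to one pass through the inner loop, which is the unit of work the post says takes 2 clocks.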
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 14 Aug 2022 07:22
| Let's look at the non-AMMX code for this job. The loop below does the same as the AMMX code. You can clearly see how much bigger the needed code is, and of course the AMMX code is many times faster. For jobs like this, AMMX is around 20 times faster than non-AMMX code.
    lea         Spritedata,A0
    lea         Screen,A1
    move.w      #height-1,D1
    Yloop:
        move.w      #width/8-1,D0
    Xloop:
        move.b      (A0)+,D2
        beq         .no0
        move.b      D2,(A1)
    .no0
        move.b      (A0)+,D2
        beq         .no1
        move.b      D2,1(A1)
    .no1
        move.b      (A0)+,D2
        beq         .no2
        move.b      D2,2(A1)
    .no2
        move.b      (A0)+,D2
        beq         .no3
        move.b      D2,3(A1)
    .no3
        move.b      (A0)+,D2
        beq         .no4
        move.b      D2,4(A1)
    .no4
        move.b      (A0)+,D2
        beq         .no5
        move.b      D2,5(A1)
    .no5
        move.b      (A0)+,D2
        beq         .no6
        move.b      D2,6(A1)
    .no6
        move.b      (A0)+,D2
        beq         .no7
        move.b      D2,7(A1)
    .no7
        addq.l      #8,A1
        dbra        D0,Xloop
        add.l       #Modulo,A1
        dbra        D1,Yloop
        rts
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 14 Aug 2022 11:17
| Let's compare the inner work loops (the Xloop bodies) of the two code snippets:
                        AMMX    non-AMMX
    instructions        3       26
    memory accesses     2       16
    branches            0       8
    clocks              2       17-64 (depending on branches)
The numbers show very clearly what a huge improvement the AMMX code gives. Even non-coders should clearly see the immense benefit just by looking at the numbers.
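The clock figures can be turned into a rough speedup range with a few lines of Python. This is only arithmetic on the numbers quoted in the post (2 clocks for the AMMX loop, 17 to 64 clocks for the non-AMMX loop, 8 pixels per iteration); the midpoint is an illustrative assumption, not a measured average.

```python
AMMX_CLOCKS = 2            # inner loop, from the comparison above
SCALAR_CLOCKS_BEST = 17    # non-AMMX loop, best case
SCALAR_CLOCKS_WORST = 64   # non-AMMX loop, worst case
PIXELS_PER_ITERATION = 8


def speedup(scalar_clocks, ammx_clocks=AMMX_CLOCKS):
    """How many times faster the AMMX loop is for one 8-pixel iteration."""
    return scalar_clocks / ammx_clocks


def pixels_per_clock(clocks, pixels=PIXELS_PER_ITERATION):
    """Throughput of a loop that handles `pixels` pixels in `clocks` clocks."""
    return pixels / clocks


best_case = speedup(SCALAR_CLOCKS_BEST)
worst_case = speedup(SCALAR_CLOCKS_WORST)
# Assumption: taking the midpoint of 17 and 64 clocks as a "typical" case.
midpoint = speedup((SCALAR_CLOCKS_BEST + SCALAR_CLOCKS_WORST) / 2)

print(f"speedup: {best_case}x to {worst_case}x, midpoint about {midpoint}x")
print(f"AMMX throughput: {pixels_per_clock(AMMX_CLOCKS)} pixels per clock")
```

Under that midpoint assumption the result lands close to the "around 20 times" figure mentioned earlier in the thread.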
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 14 Aug 2022 12:09
| AMMX is extremely powerful for games. You can use it for cookie-cut copies and also for alpha-blend copies. Amiga games never did alpha blending before, simply because it's very expensive and made the games too slow. This has completely changed with AMMX. You can now use alpha blending in every game and still be smooth and fast.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 14 Aug 2022 12:20
| Let's look at the AMMX code for 32-bit pixels and alpha blending:
    lea         sprite,A0
    lea         screen,A1
    move.w      #height-1,D1
    Yloop:
        move.w      #width-1,D0
    Xloop:
        LOAD        (A1),D2
        MULalpha    (A0)+,D2
        STORE       D2,(A1)+
        dbra        D0,Xloop
        adda.l      #Modulo,A1
        dbra        D1,Yloop
        rts
As you can see, making alpha-blended truecolor sprites on screen is super easy and super fast thanks to AMMX.
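For readers who have not written a blend loop before, here is a minimal Python sketch of conventional per-pixel alpha blending. The thread does not spell out the exact semantics of MULalpha, so this is an assumption: 8-bit channels, the sprite pixel carrying its own alpha, and the usual "source over destination" formula. The function names are hypothetical.

```python
def blend_channel(src, dst, alpha):
    """Blend one 8-bit channel: alpha=255 gives src, alpha=0 gives dst.

    The +127 rounds to the nearest integer instead of truncating.
    """
    return (src * alpha + dst * (255 - alpha) + 127) // 255


def blend_pixel(src_argb, dst_rgb):
    """Blend a sprite pixel (alpha, r, g, b) over a screen pixel (r, g, b)."""
    alpha, r, g, b = src_argb
    dr, dg, db = dst_rgb
    return (
        blend_channel(r, dr, alpha),
        blend_channel(g, dg, alpha),
        blend_channel(b, db, alpha),
    )


# Usage: a half-transparent white sprite pixel over a black screen pixel.
half_white_over_black = blend_pixel((128, 255, 255, 255), (0, 0, 0))
```

Written out per channel like this, the work for a single pixel is several multiplies, adds and a division, which is the kind of cost the AMMX instruction is said to fold into one operation.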
| |
| | Samuel Crow
Posts 424 14 Aug 2022 20:09
| Thanks for the AMMX examples. I was trying to figure out a way to make a variable-length memory-to-memory vector unit like what ARMv9 uses, but without using their patent. The basic idea was to merge the AMMX vector instructions with the innermost-loop DBRA so it would put less strain on the code cache. It would be based on the macro-expansion patent (the second link in my first post). CISC CPUs usually have fairly complex decoding, but by implementing the 3 or 4 instructions normally required in microcode and then fusing the DBRA, it can be implemented as a more general-purpose vector unit. ARMv9 uses the 128-bit Neon vector unit internally to implement its variable-length vector instructions, so that future, larger vector units can be used without having to invent new fixed-width instructions. Their vectors can be up to 2048 bits (256 bytes), which is not big enough for full-screen vector ops. The DBRA has an unsigned 16-bit counter, allowing full-width resolutions instead. It's an effort to make the 68080 future-proof, so that AMMX can be maintained on a future CPU that has either more parallel pipelines or wider ones.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 14 Aug 2022 21:35
| Samuel Crow wrote:
| Thanks for the AMMX examples. I was trying to figure out a way to make a variable-length memory-to-memory vector unit |
The VAX already had this. :-) As I understand it, this discussion is not about a real use case today. Right? I mean, there is no real-world problem here that you cannot solve with AMMX today, but you are thinking about a future option.
| |
| | Samuel Crow
Posts 424 14 Aug 2022 22:28
| OK! Here's my real-world example: My i7 is a 3rd-generation i7. It doesn't support AVX512 vector ops but does support SSEz, where z is whatever version of SSE it supports. When I tried to run the Bun JavaScript bundler, it didn't work because the Zig programming language it's written in doesn't support SSEz but currently requires AVX512. Once it was figured out what the problem was on older Intel chips, I had to download an Intel patcher for running new code on older CPUs. What a pain. I have to prefix all of the executable names with the patcher app name on the command line.
ARMv9 has a solution that alleviates the problem for all future generations of 64-bit ARM chips: it supports vector ops all the way up to 2048 bits but macro-expands them into 128-bit Neon instructions internally. That's better, but still not good enough for making a BOB that's bigger than 64 truecolor pixels across. It's 256 bytes maximum.
What I'm proposing is like ARMv9, but able to represent the number of longwords in the length of the vector. The reason this is necessary is that AMMX is stuck at 64 bits per vector op while processing 32-bit truecolor pixels. That's 2 pixels at a time. Nice, but not future-proof. What happens with the 68090 when AMMX2 comes out with 128-bit vector ops? Rewrite every single piece of AMMX2 code to be backward compatible with the 68080? Add a patcher app so that the new instructions are patched to the old AMMX instructions, like Intel did with AVX512?
The improvement is this: the AMMX instructions will be internally macro-expanded to the width of the BOB, thus eliminating the looping overhead of the DBRA opcode. It's folded away most of the time, but still there. Also, it allows future AMMX2 instructions to be macro-expanded but use fewer clocks.
    lea         sprite,A0
    lea         screen,A1
    move.w      #height-1,D1
    Yloop:
        move.w      #width-1,D0
    Xloop:
        LOAD        (A1),D2
        MULalpha    (A0)+,D2
        STORE       D2,(A1)+
        dbra        D0,Xloop
        adda.l      #Modulo,A1
        dbra        D1,Yloop
        rts
becomes
    lea         sprite,A0
    lea         screen,A1
    move.w      #height-1,D1
    Yloop:
        MULALPHA.width  (A0)+,(A1)+
        adda.l      #Modulo,A1
        dbra        D1,Yloop
        rts
and performs as fast on the 68080 as AMMX. On a 68090, if it comes out, it goes twice as fast while still being backwards compatible with the 68080. Future-proof is how I like it!
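The macro-expansion idea can be modelled in a few lines of Python. This is a software sketch of the proposal only, not of any existing 68080 hardware: one length-encoded vector operation is split into fixed-width micro-ops, with `lanes=2` standing in for today's 64-bit AMMX (two 32-bit pixels) and `lanes=4` for a hypothetical 128-bit successor. All names here are made up for the illustration.

```python
def expand(n_longwords, lanes):
    """Yield (start, count) micro-ops that together cover n_longwords.

    Each micro-op handles at most `lanes` longwords; the last one may be shorter.
    """
    start = 0
    while start < n_longwords:
        count = min(lanes, n_longwords - start)
        yield start, count
        start += count


def vector_op(op, src, dst, lanes):
    """Apply dst[i] = op(src[i], dst[i]) over the whole vector.

    Returns the number of micro-ops issued, which depends on `lanes`,
    while the result written to dst does not.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same length")
    micro_ops = 0
    for start, count in expand(len(src), lanes):
        for i in range(start, start + count):
            dst[i] = op(src[i], dst[i])
        micro_ops += 1
    return micro_ops


def add(a, b):
    return a + b


# Usage: the same 10-longword operation on a 2-lane and a 4-lane machine.
src = list(range(10))
dst_64bit = [100] * 10
dst_128bit = [100] * 10
ops_64bit = vector_op(add, src, dst_64bit, lanes=2)
ops_128bit = vector_op(add, src, dst_128bit, lanes=4)
```

The point of the sketch is that the program only states the vector length; how many micro-ops that becomes is left to the implementation, which is what would let the same code run unchanged on a wider core.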
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 15 Aug 2022 07:20
| Samuel Crow wrote:
| The improvement is this: The number of AMMX instructions will be internally macro-expanded to the width of the BOB, thus eliminating the looping overhead of the DBRA opcode.
|
A very good feature of the 68080 is that DBRA has no looping overhead.
    Xloop:
        load        (A0)+,D2
        storem.b    D2,(A1)+
        dbra        D0,Xloop
This loop takes 2 cycles. The DBRA is free.
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 15 Aug 2022 08:01
| Samuel Crow wrote:
| On 68090 if it comes out, it goes twice as fast while still being backwards compatible to the 68080. Future-proof is how I like it!
|
Size is not the most important thing, and bigger is not always better. To prevent misunderstandings: the 68080 is not 64-bit because we could not do 128-bit. We could have done 128-bit, but we decided that 64-bit is more advantageous. As you will know, in reality wider vectors also come with many disadvantages: higher routing resource needs, higher muxing costs (both resulting in a lower clock), and more alignment requirements. We evaluated both options, 64-bit AMMX and 128-bit AMMX, for the 68080 CPU, and after reviewing the evaluation we decided that 64-bit gives us a better-balanced CPU.
The current AMMX design offers a quantum leap in performance over traditional 68K instructions. This leap does not come just because AMMX is wider, but because the instructions are so much more powerful. The key advantage of the AMMX MULalpha is not that it is twice as wide as normal 68K code and can therefore do twice the work. The real advantage is that with normal/old 68K code you need around 20 instructions to do a single pixel, while AMMX can do it in 1 instruction.
AMMX is so strong because it offers very powerful instructions for game coding, much more powerful than what ALTIVEC, NEON or SSE offer. It's not the width that counts but the technique. ;-) Altivec, for example, is twice as wide as AMMX, but AMMX is much faster than ALTIVEC for many game applications because the instructions are smarter.
| |
| | Samuel Crow
Posts 424 15 Aug 2022 16:19
| Of course there is no need to go "future-proof" if there will be no future versions. How far do you plan to take this design? (Sorry about the loaded question but I feel it still must be asked.)
| |
| | Gunnar von Boehn (Apollo Team Member) Posts 6294 15 Aug 2022 20:24
| Samuel Crow wrote:
| Of course there is no need to go "future-proof" if there will be no future versions. How far do you plan to take this design? (Sorry about the loaded question but I feel it still must be asked.)
|
The goal is to go for an ASIC at some point. As you know, ASIC technology will allow a much higher clock, which means we will be in the range of 10-30 times faster. But the systems are already so powerful today that I cannot imagine what games you could do with 20 times the power. Have you seen any of the new games, like ApolloInvader or ApolloMenace? The games use truecolor, alpha blending, high resolutions, huge numbers of multicolored sprites on screen (sometimes over a hundred), huge fully animated end bosses, and multilayer scrolling with several fully animated playfields... The games are visually way above what you have seen on Amiga before. AMMX already gives the Apollo systems the power to make games far above what you have seen on Amiga.
| |
| | Kamelito Loveless
Posts 263 17 Aug 2022 15:23
| To me, the goal should be to have at least 10k unique active users willing to buy games and apps tailored to the V4.
| |
| | Nick Fellows
Posts 188 18 Aug 2022 08:17
| Interesting, what is the rationale behind that number?
| |
| | Kamelito Loveless
Posts 263 19 Aug 2022 20:48
| Nick Fellows wrote:
| Interesting, what is the rationale behind that number?
|
This is the number given by a well-known former Amiga game developer. He was speaking about the ZX Spectrum Next, but this applies to V4 devices too. He was willing to develop games for that platform if this number was met. For him, it is the minimum number needed to make money from his work.
| |