Rather than calling a SWI, an unused co-processor number is used. Cloe claims the Undefined Instruction hardware vector, and checks to see if the instruction is one of this co-processor's instructions, much in the same way that the floating-point emulator works.
In the example given in the Cloe v2 document, this would mean that all the SWIs would be translated into some co-processor instruction.
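The check made on the Undefined Instruction vector can be sketched in Python. This is only an illustration: the name `is_cloe_instruction` is hypothetical, and the set of co-processor numbers (8 and 9, the numbers used elsewhere in this document) is an assumption about what Cloe would claim.

```python
# Assumed co-processor numbers claimed by Cloe (hypothetical choice: 8 and 9).
CLOE_COPROCESSORS = {8, 9}


def is_cloe_instruction(word):
    """Return True if a 32-bit ARM word is a co-processor instruction
    addressed to one of the assumed Cloe co-processor numbers."""
    top = (word >> 24) & 0xF        # bits 27..24
    cp_num = (word >> 8) & 0xF      # bits 11..8 hold the co-processor number
    # MCR/MRC: bits 27..24 are 1110 and bit 4 is set.
    is_register_transfer = (top == 0b1110) and (((word >> 4) & 1) == 1)
    # LDC/STC: bits 27..25 are 110.
    is_data_transfer = (top >> 1) == 0b110
    return (is_register_transfer or is_data_transfer) and cp_num in CLOE_COPROCESSORS
```

Anything that fails this test would be passed on to the previous owner of the vector, as the floating-point emulator does.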
MOVxx pc,<rn>
MOVxxS pc,<rn>
BL <+/-64> (possibly?)
MOVS pc,lr; this would become B cloe_movs_pc_lr
This optimisation means that ADDNES r7,r3,pc could be encoded as MRCNE 8,0,4,7,3,0 (MRC rather than MCR, because the S flag sits in bit 20, which is the bit that distinguishes the two). For ALUxx{s} Ra,Rb,pc, the values in the co-processor instruction are taken from the following table:
| | 31..28 | 27..24 | 23..20 | 19..16 | 15..12 | 11..8 | 7..4 | 3..0 |
| MRC/MCR | Condx | 1 1 1 0 | 0 0 0 s | Ra | ALU | 1 0 0 0 | 0 0 0 1 | Rb |
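The table can be turned into a small encoder. The following is a sketch only: the function name `encode_alu_pc` and its argument order are hypothetical, while the condition and ALU opcode numbers are the standard ARM ones.

```python
# Standard ARM condition codes (bits 31..28).
CONDITIONS = {"EQ": 0, "NE": 1, "CS": 2, "CC": 3, "MI": 4, "PL": 5, "VS": 6,
              "VC": 7, "HI": 8, "LS": 9, "GE": 10, "LT": 11, "GT": 12,
              "LE": 13, "AL": 14}

# Standard ARM data-processing opcodes.
ALU_OPCODES = {"AND": 0, "EOR": 1, "SUB": 2, "RSB": 3, "ADD": 4, "ADC": 5,
               "SBC": 6, "RSC": 7, "TST": 8, "TEQ": 9, "CMP": 10, "CMN": 11,
               "ORR": 12, "MOV": 13, "BIC": 14, "MVN": 15}


def encode_alu_pc(cond, s, ra, alu, rb):
    """Encode ALUxx{s} Ra,Rb,pc as a co-processor 8 instruction word."""
    return ((CONDITIONS[cond] << 28)
            | (0b1110 << 24)             # bits 27..24: co-processor register transfer
            | ((1 if s else 0) << 20)    # bits 23..20: 0 0 0 s
            | (ra << 16)                 # bits 19..16: Ra
            | (ALU_OPCODES[alu] << 12)   # bits 15..12: ALU opcode
            | (0b1000 << 8)              # bits 11..8: co-processor number 8
            | (0b0001 << 4)              # bits 7..4: 0 0 0 1
            | rb)                        # bits 3..0: Rb


word = encode_alu_pc("NE", True, 7, "ADD", 3)   # ADDNES r7,r3,pc
print("&%08X" % word)
```

For ADDNES r7,r3,pc this gives &1E174813, with the S flag landing in bit 20.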
A further optimisation, which would allow for ALUxx{s} Ra,Rb,Rc,<shift>#<constant>, relies on splitting the instruction further:
<shift> is one of LSL (00), LSR (01), ASR (10) or ROR (11). In the table below, the first bit is shown as 'T' and the second as 'U'.
<constant> is between 0 and 31, and its five bits are shown as J, K, L, M and N, where J is the least significant bit and N the most significant.
The new table looks like:
| | 31..28 | 27..24 | 23..20 | 19..16 | 15..12 | 11..8 | 7..4 | 3..0 |
| MRC/MCR | Condx | 1 1 1 0 | N T U s | Ra | ALU | 1 0 0 M | L K J 1 | Rb |
As you can see, it now uses two co-processors - number 8 and number 9 - because bit M falls in the lowest bit of the co-processor number field. However, as there are 14 co-processors that are unused, this should not present a problem.
For example, the instruction BICGE r4,r3,pc,ASR#13 would become 2_1010 1110 0100 0100 1110 1001 1011 0011, or &AE44E9B3 (when disassembled, this becomes MCRGE CP9,2,R14,C4,C3,5).
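The worked example can be checked with a short encoder and decoder for the shifted form. This is a sketch under assumptions: the function names are hypothetical, `cond` and `alu` are passed as their 4-bit ARM numbers, and J is taken to be the least significant bit of the constant, which is the ordering the example above implies.

```python
SHIFTS = {"LSL": 0b00, "LSR": 0b01, "ASR": 0b10, "ROR": 0b11}


def encode_alu_pc_shift(cond, s, ra, alu, rb, shift, constant):
    """Encode ALUxx{s} Ra,Rb,pc,<shift>#<constant> using the second table."""
    if not 0 <= constant <= 31:
        raise ValueError("constant must be between 0 and 31")
    t, u = (SHIFTS[shift] >> 1) & 1, SHIFTS[shift] & 1
    j, k, l, m, n = ((constant >> i) & 1 for i in range(5))
    return ((cond << 28) | (0b1110 << 24)
            | (n << 23) | (t << 22) | (u << 21) | (int(bool(s)) << 20)  # N T U s
            | (ra << 16) | (alu << 12)
            | (0b100 << 9) | (m << 8)                                   # 1 0 0 M
            | (l << 7) | (k << 6) | (j << 5) | (1 << 4)                 # L K J 1
            | rb)


def decode_alu_pc_shift(word):
    """Recover the fields packed by encode_alu_pc_shift."""
    constant = ((((word >> 23) & 1) << 4)      # N
                | (((word >> 8) & 1) << 3)     # M
                | ((word >> 5) & 0b111))       # L K J
    return {"cond": (word >> 28) & 0xF, "s": (word >> 20) & 1,
            "ra": (word >> 16) & 0xF, "alu": (word >> 12) & 0xF,
            "rb": word & 0xF, "shift": (word >> 21) & 0b11,
            "constant": constant}


# BICGE r4,r3,pc,ASR#13: GE is condition 10, BIC is opcode 14.
word = encode_alu_pc_shift(0xA, False, 4, 0xE, 3, "ASR", 13)
print("&%08X" % word)
```

Because the decoder only needs shifts and masks, this is the kind of "minimal decode" the emulator would perform on the Undefined Instruction vector.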
For some LDMFD r13!,{...} instructions, the LDC/STC instructions could be used, and the mapping between the two is:
| | 31..28 | 27..24 | 23..20 | 19..16 | 15..12 | 11..8 | 7..4 | 3..0 |
| LDM/STM | Condx | 1 0 0 P | U S W L | Rn | Register bit-field (bits 15..0) | | | |
| LDC/STC | Condx | 1 1 0 P | U N W L | Rn | CRd | 1 0 0 0 | Offset (bits 7..0) | |
Since APCS normally doesn't store r0-r3 on the stack, bits 0..7 would encode registers 4-11, and bits 12..15 would encode registers 12-15, as they would do in the LDM/STM instruction.
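The register mapping described above can be sketched as follows. The function names are hypothetical, co-processor number 8 is taken from the table, and the sketch assumes the standard ARM encodings in which LDM/STM carry 1 0 0 in bits 27..25 and LDC/STC carry 1 1 0.

```python
def ldm_to_ldc(word):
    """Map an LDM/STM word onto the LDC/STC form, or return None if the
    instruction cannot be represented (for example, r0-r3 in the list)."""
    if (word >> 25) & 0b111 != 0b100:
        return None                        # not an LDM/STM
    reglist = word & 0xFFFF
    if reglist & 0x000F:
        return None                        # r0-r3 have nowhere to go
    offset = (reglist >> 4) & 0xFF         # r4-r11 move down to bits 0..7
    crd = (reglist >> 12) & 0xF            # r12-r15 stay in bits 12..15
    return ((word & 0xFFFF0000)            # Condx, P, U, S (as N), W, L, Rn
            | (1 << 26)                    # 1 0 0 becomes 1 1 0
            | (crd << 12)
            | (0b1000 << 8)                # co-processor number 8
            | offset)


def ldc_to_reglist(word):
    """Rebuild the LDM/STM register bit-field from the LDC/STC form."""
    return ((word & 0xFF) << 4) | (word & 0xF000)


# LDMFD r13!,{r4-r6,pc} is &E8BD8070.
print("&%08X" % ldm_to_ldc(0xE8BD8070))
```

Returning None for lists that include r0-r3 is one way to express the "some" in "some LDMFD r13!,{...}": those instructions would have to be handled by another route.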
Another advantage to these optimisations is that the original instruction also does not have to be stored in an array.
CMP Rn,#<constant>
ADDCC pc,pc,Rn,LSL#2
B <some address>
B <some other address>
; ...
This kind of code is used when you wish to create a jump table: Rn on entry is the number of the routine to be called. It is used in SWI decode tables, typically with R11 as Rn. Cloe would add each of the addresses in the table, up to <constant>/2 instructions. Note that this may cause problems if the programmer has placed some data in the middle of the table and is not expecting that particular value to be called...
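A converter could recognise this idiom and collect the branch destinations roughly as follows. This is a sketch, not Cloe's method: the function names are hypothetical, the number of entries to follow is left to the caller as `max_entries`, and stopping at the first word that is not a branch is just one conservative way of coping with data in the table.

```python
def branch_target(address, word):
    """Return the destination of a B/BL instruction at `address`, else None."""
    if (word >> 25) & 0b111 != 0b101:
        return None
    offset = word & 0x00FFFFFF
    if offset & 0x00800000:                 # sign-extend the 24-bit offset
        offset -= 0x01000000
    return (address + 8 + (offset << 2)) & 0xFFFFFFFF


def jump_table_targets(words, base, max_entries):
    """Collect destinations from a CMP / ADDCC pc,pc,Rn,LSL#2 jump table.

    words[0] must be the CMP, located at address `base`.
    Returns [] if the idiom is not recognised.
    """
    if len(words) < 3:
        return []
    cmp_word, add_word = words[0], words[1]
    if (cmp_word & 0x0FF0F000) != 0x03500000:   # CMP Rn,#<constant>
        return []
    rn = (cmp_word >> 16) & 0xF
    if add_word != (0x308FF100 | rn):           # ADDCC pc,pc,Rn,LSL#2
        return []
    targets = []
    # pc reads as the ADDCC's address + 8, so entry 0 is words[3];
    # words[2] is the fall-through taken for out-of-range values.
    for index in range(3, min(len(words), 3 + max_entries)):
        target = branch_target(base + 4 * index, words[index])
        if target is None:                      # not a branch: stop here
            break
        targets.append(target)
    return targets


# CMP r11,#2 / ADDCC pc,pc,r11,LSL#2 / B &9000 / B &8100 / B &8200 at &8000
table = [0xE35B0002, 0x308FF10B, 0xEA0003FC, 0xEA00003B, 0xEA00007A]
print([hex(t) for t in jump_table_targets(table, 0x8000, 2)])
```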
Using the same tests as before:
| Number of instructions (total): | 20391 | (100%) |
| Number of direct execution: | 18047 | 88.5% |
| Number of no-decode: | 716 | 3.5% |
| Number of minimal decode: | 489 | 2.4% |
| Number of full decode: | 1139 | 5.6% |
This means that the execution speed would be about 89% of the speed of the processor. As this is a 'global' convert (i.e. it does not follow branches etc.), the actual figure would be higher - possibly around 92%. This is because a global convert also converts all the strings in the program, and all the data.
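The percentages in the table follow directly from the counts, as this short check shows. Treating the directly executed share as the speed estimate is the simplification used in the text above; the variable names are only illustrative.

```python
# Instruction counts taken from the table above.
counts = {
    "direct execution": 18047,
    "no-decode": 716,
    "minimal decode": 489,
    "full decode": 1139,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print("%-17s %5.1f%%" % (name, share))
print("total instructions:", total)
```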
In order to get proper results, the convertor would have to be improved so that it actually follows code. The emulator would also have to be written, which would show the actual degradation in performance.