Note that where this document mentions Cloe, it refers to this new version. Where it mentions Cloe 1, it refers to the earlier idea.
&8000), a Relocatable module's initialisation
routine, or a vector entry point.
The process takes two stages - the conversion, and the execution.
MOV r0,r2
ADDEQ r7,r8,r0
ADD r0,pc,#32 ; ADR instruction!
LDMNVFD r13!,{r4-r7,pc}^ ; NV - could translate into MOV r0,r0
STMFD r13!,{r4-r7,lr}
B &9038 BEQ &1145c
BL &20148 BLNE &13004
MOVS pc,lr
LDMFD r13!,{r4-r7,r11,pc}^
LDR pc,[r4,#0]
MOV r2,pc ADD r0,r1,pc TEQP pc,#1<<28
SWI XOS_Exit
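Many of the distinctions in the instructions listed above come down to whether an instruction can load a new value into the program counter. As a minimal sketch, the Python below decodes a 32-bit ARM instruction word using the standard ARM encodings and reports that one property. It does not assign Cloe's class numbers; the function names and the returned strings are assumptions made for illustration only.

```python
def field(word, low, width):
    """Extract `width` bits of `word`, starting at bit `low`."""
    return (word >> low) & ((1 << width) - 1)

def pc_effect(word):
    """Describe how a 32-bit ARM instruction word can affect the PC.

    Returns "branch", "swi", "write_pc" or "none". These labels are
    illustrative only and are not Cloe's own classes.
    """
    if field(word, 25, 3) == 0b101:            # B / BL
        return "branch"
    if field(word, 24, 4) == 0b1111:           # SWI
        return "swi"
    if field(word, 25, 3) == 0b100:            # LDM / STM
        loads = field(word, 20, 1) == 1
        return "write_pc" if loads and field(word, 15, 1) else "none"
    if field(word, 26, 2) == 0b01:             # LDR / STR
        loads = field(word, 20, 1) == 1
        return "write_pc" if loads and field(word, 12, 4) == 15 else "none"
    if field(word, 26, 2) == 0b00:             # data processing, MUL, SWP
        if field(word, 25, 1) == 0 and field(word, 4, 4) == 0b1001:
            return "none"                      # MUL / MLA / SWP
        if 8 <= field(word, 21, 4) <= 11:
            return "none"                      # TST/TEQ/CMP/CMN write no register
        return "write_pc" if field(word, 12, 4) == 15 else "none"
    return "none"                              # coprocessor instructions

# Hand-assembled encodings of some of the instructions listed above.
samples = {
    "MOV r0,r2": 0xE1A00002,
    "ADD r0,pc,#32": 0xE28F0020,
    "STMFD r13!,{r4-r7,lr}": 0xE92D40F0,
    "LDMFD r13!,{r4-r7,pc}^": 0xE8FD80F0,
    "LDR pc,[r4,#0]": 0xE594F000,
    "MOVS pc,lr": 0xE1B0F00E,
    "SWI XOS_Exit": 0xEF020011,
}
for text, word in samples.items():
    print(f"{text:24} {pc_effect(word)}")
```

Note that ADD r0,pc,#32 only reads the program counter, so this check reports "none" for it. Instructions that read pc would need a separate test.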
For example, take the following code:
code
BL init
BL main
BL close
SWI XOS_Exit
main
MOV r1,#'A'
MOV r9,#7
loop
STMFD r13!,{lr}
TST r9,#1
BLEQ print_lowercase
BLNE print_uppercase
SUBS r9,r9,#1
ADD r1,r1,#1
BGT loop
LDMFD r13!,{pc}^
init
SWI &20100+22
SWI &20100+0 ; Go to mode 0
MOVS pc,lr
close
SWI &20004 ; Wait for a key
MOVS pc,lr
print_lowercase
ORR r0,r1,#32
SWI XOS_Write0
MOVS pc,lr
print_uppercase
BIC r0,r0,#32
SWI XOS_Write0
MOVS pc,lr
For this code, Cloe will convert it into:
code
SWI Cloe_BL ; Class 3 - emulates 'init', and then continues
SWI Cloe_BL ; Class 3 - emulates 'main', and then continues
SWI Cloe_BL ; Class 3 - emulates 'close', and then continues
SWI XOS_Exit ; Class 6 - ignored
main
MOV r1,#'A' ; Ignored
MOV r9,#7 ; Ignored
loop
STMFD r13!,{lr} ; Ignored
TST r9,#1 ; Ignored
SWIEQ Cloe_BL ; Class 2 - emulates 'print_lowercase' and continues
SWINE Cloe_BL ; Class 2 - emulates 'print_uppercase' and continues
SUBS r9,r9,#1 ; Ignored
ADD r1,r1,#1 ; Ignored
BGT loop ; Class 2 - already emulated 'loop', so continues
SWI Cloe_PullStack ; Class 4 - destination unknown
init
SWI &20100+22 ; Ignored
SWI &20100+0 ; Ignored
SWI Cloe_ALU ; Class 4 - destination unknown
close
SWI &20004 ; Ignored
SWI Cloe_ALU ; Class 4 - destination unknown
print_lowercase
ORR r0,r1,#32 ; Ignored
SWI XOS_Write0 ; Ignored
SWI Cloe_ALU ; Class 4 - destination unknown
print_uppercase
BIC r0,r0,#32 ; Ignored
SWI XOS_Write0 ; Ignored
SWI Cloe_ALU ; Class 4 - destination unknown
When execution reaches one of the class 4 instructions, Cloe emulates it and works out where the program counter would continue. Using this address, it starts the conversion process again. Once that conversion is done, it starts executing the code at that address.
This allows the following to occur:
; ...
ADR r0,other_routine
MOV lr,pc
MOV pc,r0
LDMFD r13!,{pc}^
; ...
other_routine
SWI &20120
MOVS pc,lr
; ...
When it reaches the MOV pc,r0 (which is a class 4
instruction), it is able to establish that r0 points to
other_routine (which it has yet to emulate), so Cloe starts its
conversion at other_routine, and continues until it reaches the
end. In this example, this is at the MOVS pc,lr instruction. When
it has done this, it starts executing at other_routine.
Of course, if Cloe has already converted other_routine, then it
does not need to do it again, and so it will just execute at
other_routine.
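The cycle described above can be sketched in Python as a toy model. The instruction kinds, the dictionary layout and the addresses are assumptions made for illustration and are not Cloe's real data structures. The sketch only shows the order in which code gets converted: follow known destinations, stop at an instruction whose destination is unknown, and skip anything already converted.

```python
PLAIN, BRANCH, CALL, UNKNOWN, EXIT = "plain", "branch", "call", "unknown", "exit"

def convert(program, entry, converted):
    """Convert code reachable from `entry`, returning the new addresses in order.

    `program` maps address -> (kind, target, conditional).
    `converted` is the set of addresses already converted; it is updated.
    """
    newly = []
    pending = [entry]                       # stack of addresses still to follow
    while pending:
        addr = pending.pop()
        while addr in program and addr not in converted:
            converted.add(addr)
            newly.append(addr)
            kind, target, conditional = program[addr]
            if kind in (BRANCH, CALL):
                if kind == CALL or conditional:
                    pending.append(addr + 4)  # return to the fall-through later
                addr = target                 # follow the known destination first
            elif kind in (UNKNOWN, EXIT) and not conditional:
                break                         # destination only known at run time
            else:
                addr += 4                     # ARM instructions are 4 bytes long
    return newly

# The second example above, at made-up addresses.
example = {
    0x8000: (PLAIN, None, False),    # ADR r0,other_routine
    0x8004: (PLAIN, None, False),    # MOV lr,pc
    0x8008: (UNKNOWN, None, False),  # MOV pc,r0
    0x800C: (UNKNOWN, None, False),  # LDMFD r13!,{pc}^
    0x8020: (PLAIN, None, False),    # other_routine: SWI &20120
    0x8024: (UNKNOWN, None, False),  # MOVS pc,lr
}
done = set()
first = convert(example, 0x8000, done)   # stops at MOV pc,r0
second = convert(example, 0x8020, done)  # r0 was found to point here at run time
third = convert(example, 0x8020, done)   # already converted, nothing to do
```

Here `first` covers the code up to and including MOV pc,r0, `second` covers other_routine, and `third` is empty because other_routine has already been converted.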
In order to find out if it has already emulated the instruction, it will require 1 bit per ARM instruction, or 1 bit per 32 bits (4 bytes) of memory for the program. This evaluates to 1K per 32K program. It is rare to find a program that is larger than 600K - even that would require only 19K. This memory is required throughout the life of the program, but only needs to be paged in when the code is being emulated.
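As a rough illustration of this bookkeeping, the sketch below keeps one flag bit per 4-byte ARM instruction in a byte array. The class name, method names and the &8000 base address are assumptions chosen for the example, not part of Cloe.

```python
def bitmap_size_bytes(program_bytes):
    """Bytes needed for one flag bit per 4-byte ARM instruction."""
    instructions = (program_bytes + 3) // 4
    return (instructions + 7) // 8

class ConvertedMap:
    """One bit per instruction, recording whether it has been handled."""

    def __init__(self, base, program_bytes):
        self.base = base                      # load address of the program
        self.bits = bytearray(bitmap_size_bytes(program_bytes))

    def _locate(self, addr):
        index = (addr - self.base) // 4       # instruction number
        return index >> 3, 1 << (index & 7)   # byte offset, bit mask

    def mark(self, addr):
        byte, mask = self._locate(addr)
        self.bits[byte] |= mask

    def is_marked(self, addr):
        byte, mask = self._locate(addr)
        return bool(self.bits[byte] & mask)

flags = ConvertedMap(0x8000, 32 * 1024)       # a 32K program loaded at &8000
flags.mark(0x8004)
print(len(flags.bits))                        # 1024 bytes, i.e. 1K per 32K
print(bitmap_size_bytes(600 * 1024))          # 19200 bytes, i.e. 18.75K
```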
Cloe would also require some stack in order to perform the conditional conversion. This stack is only required during the conversion phase. The stack requirement would depend on the complexity of the program and the number of procedures it uses.
In addition, Cloe needs to store the original instructions and their addresses in some form of memory. This memory would also be required throughout the life of the program.
In terms of processor overhead, the two main overheads are when Cloe is performing the conversion, and when it is actually emulating instructions.
Initial tests show that the conversion code can be very fast - up to around 16MBytes/second, or 4MIPS. This means that a 200K program would be converted in 0.06 seconds, under ideal conditions.
The emulation would be considerably slower, possibly at around 1MIPS.
Furthermore, a typical 'C' program of 83K requires 2300 emulated instructions, so around 10% of its instructions need to be emulated.
Note that these tests were only performed on a bulk system (i.e. no code-following was employed), so with code-following the number of emulated instructions should be much lower. Also, a 200MHz StrongARM processor was used.
Using these figures, a 500MHz processor would convert a 200K program in 0.02 seconds, and would run at an average speed of 452MIPS, or 90.4% of real processor speed!
I shall have some more up-to-date timings when I get the chance to write the system, and emulator, properly.