Old 06-04-2012, 05:17 AM   #26
liteon

Quote:
Originally Posted by Justin View Post
I think you can load a 32 bit constant with 3 non-branching instructions, actually.

1) load 2nd following instruction
2) xor top 8 bits of value with constant
3) instruction with top 8 bits set so that the bottom 24 bits are ignored

I need to check the details of this, but it should work i'd think in 3 cycles and 12 bytes:

Update: something like (if setting 0 as a constant, we'd use different (hopefully shorter) instructions):

ldr r5, [pc, #4] // load contents of "whatever instruction" I probably got the syntax wrong...
eors r5, high_8_bits_of_value shl #24 // this will always have the zero flag clear, provided we're not setting 0
.word low_24_bits_of_value // note that this encodes as if zeroflag-set: some data-operation, one of and, eor, sub, rsb, add, adc, sbc or rsc
This is a nice hack - using the s suffix and encoding 24 bits of data in the instruction word - and it can certainly work with this scheme. I think there has to be another instruction for setting the smaller portion (the low 8 bits) of the desired 32-bit value though, because we cannot use the barrel shifter and set an immediate in the same instruction, for example:

Code:
    /* load .word into r5 */
  ldr r5, [pc, #4]
    /* setting the lower bits into an extra register is required :[ */
  mov r6, #0xee
    /* left shift by 8 will discard the beq opcode (0x0a) and we can then concat
       with the lower bits, while clearing the z bit in cpsr (s suffix) */
  orrs r5, r6, r5, lsl #8
    /* never taken, since z is clear: beq <some_24bit_word> =
       = .word 0x0a000000 | 0x00xxxxxx */
  .word 0x0abbccdd
    /* r5 now holds 0xbbccddee */
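For context, here is a rough sketch in C of how a JIT emitter could produce this exact 4-word sequence for an arbitrary non-zero value. This is only my illustration (the function name, the r5/r6 register choice and the buffer handling are assumptions, not code from the actual generator):

Code:
#include <stdint.h>
#include <string.h>

/* Sketch only: emit the ldr/mov/orrs + disguised data word scheme above,
   loading the non-zero value v into r5 and clobbering r6.  Assumes ARM
   mode, little-endian, condition AL, and a writable/executable buffer.
   v must not be 0, otherwise the orrs sets Z and the beq would be taken. */
static unsigned char *emit_load_const_pcrel(unsigned char *out, uint32_t v)
{
  uint32_t insns[4];
  insns[0] = 0xE59F5004;                          /* ldr  r5, [pc, #4]        ; r5 = data word */
  insns[1] = 0xE3A06000 | (v & 0xff);             /* mov  r6, #low8(v)                         */
  insns[2] = 0xE1965405;                          /* orrs r5, r6, r5, lsl #8  ; clears Z       */
  insns[3] = 0x0A000000 | ((v >> 8) & 0xffffff);  /* beq-encoded data word: 0x0a | bits 31..8  */
  memcpy(out, insns, sizeof(insns));
  return out + sizeof(insns);
}

The data word is simply 0x0a000000 OR'd with bits 31..8 of the value, exactly as in the .word above.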
But as a comparison, I think it might be faster to form the constant with 4 naive instructions:
Code:
  mov r0, #0x000000ee
  orr r0, r0, #0x0000dd00
  orr r0, r0, #0x00cc0000
  orr r0, r0, #0xbb000000
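As a companion sketch, the naive form maps onto a small emitter that writes one mov plus up to three orrs, skipping zero bytes. Again, this is only an illustration in C with a hypothetical function name and hand-encoded opcodes:

Code:
#include <stdint.h>
#include <string.h>

/* Sketch only: build an arbitrary 32-bit constant in register rd with the
   naive mov/orr sequence, one 8-bit chunk at a time.  Each chunk is encoded
   as an ARM rotated immediate (imm8 ror 2*rot).  ARM mode, little-endian. */
static unsigned char *emit_load_const_naive(unsigned char *out, int rd, uint32_t v)
{
  int i, started = 0;
  if (!v)  /* special case: a single mov rd, #0 */
  {
    uint32_t insn = 0xE3A00000 | (rd << 12);
    memcpy(out, &insn, 4);
    return out + 4;
  }
  for (i = 0; i < 4; i++)
  {
    uint32_t b   = (v >> (8 * i)) & 0xff;
    uint32_t rot = i ? (uint32_t)(16 - 4 * i) : 0;  /* imm8 ror (2*rot) == b << (8*i) */
    uint32_t insn;
    if (!b) continue;                               /* nothing to set in this byte */
    if (!started) insn = 0xE3A00000 | (rd << 12) | (rot << 8) | b;               /* mov rd, #imm     */
    else          insn = 0xE3800000 | (rd << 16) | (rd << 12) | (rot << 8) | b;  /* orr rd, rd, #imm */
    memcpy(out, &insn, 4);
    out += 4;
    started = 1;
  }
  return out;
}

For 0xbbccddee this yields exactly the four instructions above; values with zero bytes, or values that already fit a single rotated immediate, degenerate to fewer instructions.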
While the naive version may suffer from a lack of pipelining potential (each orr depends on the previous result), the ldr/beq version has two potential stalls: one at the ldr and one at the beq.

I've been reading more on how ldr works at the CPU/MPU level, and it depends on a lot of factors. It will normally take 2 cycles, though it can still take one cycle if it can be pipelined, i.e. when no operation involving the loaded register follows immediately afterwards. This is somewhat difficult to achieve if constant loading is generated by a macro.

Using ldr rx, [pc, #n] can be a 2-3 cycle operation, and if a stall occurs it will be caused by the "fetch unit" (fetch-decode stage). If cache performance is a consideration here, it would be interesting to compare the benefits (if any) of using a global pool (with its address stored in a register, e.g. GLUE_CALL_CODE) against PC-relative offsetting, with regard to caching, mapping, TLB, fetch timing, etc.

But in general, we can still provide a local pool per section, placed past the return branch (the load will theoretically still take 2-3 cycles):
Code:
.section_name
ldr r1, [pc, #some_offset]
add r1, r0, #1
...
# return
mov pc, lr
# pool here
.word 0xff00ff00
.word 0xff0fff00
.word 0xfff0ff00
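If we go that route, the emitter mainly needs to remember where each pc-relative load was written and back-patch its 12-bit offset once the pool address is known. A minimal sketch in C, assuming the load was emitted as ldr rd, [pc, #0] (the helper name is hypothetical):

Code:
#include <stdint.h>
#include <stddef.h>

/* Sketch only: back-patch a previously emitted "ldr rd, [pc, #0]"
   (0xE59F0000 | rd<<12) once the address of its literal-pool slot is
   known.  The pool is assumed to follow "mov pc, lr" and to lie within
   the 4KB positive range of the 12-bit offset field. */
static int patch_pcrel_ldr(uint32_t *ldr_insn, const uint32_t *pool_slot)
{
  /* in ARM mode, pc reads as the address of the ldr plus 8 */
  ptrdiff_t off = (const unsigned char *)pool_slot
                - ((const unsigned char *)ldr_insn + 8);
  if (off < 0 || off > 4095) return 0;   /* out of range for [pc, #imm12] with U=1 */
  *ldr_insn |= (uint32_t)off;            /* fill in bits 11..0 */
  return 1;
}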
On the other hand, simpler constants can obviously be formed by using the barrel shifter with 1-3 instructions (which guarantees 1-3 cycles and opens things up to out-of-order execution and pipelining):

Code:
# 0x7fffffff
mvn r0, #0
mov r0, r0, lsr #1

# 0x3ff00000
mov r0, #0x00f00000
orr r0, r0, #0x3f000000
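The deciding factor for the 1-3 instruction case is whether each piece is representable as an ARM rotated immediate, i.e. an 8-bit value rotated right by an even amount. A small C check (sketch, hypothetical helper name) that an emitter could run before choosing between a single mov/mvn and the longer sequences:

Code:
#include <stdint.h>

/* Sketch only: test whether v is encodable as an ARM data-processing
   immediate, i.e. an 8-bit value rotated right by an even amount.
   On success fills the 4-bit rotate field and the 8-bit immediate. */
static int arm_imm_encode(uint32_t v, uint32_t *rot4, uint32_t *imm8)
{
  uint32_t r;
  for (r = 0; r < 32; r += 2)
  {
    /* rotating v left by r undoes a rotate-right by r */
    uint32_t rotated = r ? ((v << r) | (v >> (32 - r))) : v;
    if (rotated < 256) { *rot4 = r / 2; *imm8 = rotated; return 1; }
  }
  return 0;
}

If neither the value nor its bitwise complement passes this test, 2-3 barrel-shifter instructions or one of the pool schemes above are needed.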
If performance really becomes a greater concern later on, constant loading could become section-specific in an attempt to speed up execution: for example, loading a constant partially, performing some other operation, and then finishing the constant load, which may obfuscate the code a bit.
