Old 03-21-2012, 01:14 AM   #1
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default ARM port of EEL2

following this thread:
http://forum.cockos.com/showthread.php?t=98337

i did some work on giving this a quick go...

here are the initial results:
https://github.com/neolit123/wdl/commits/eel2-arm

the glue code took most of my time, as i first had to understand what (the hell?) is going on in there. there are some major problems when function calls are made, for example calling something from libc inside the "virtual machine". my current solution, which basically passes an address table around in assembly, may provoke some facepalm-like gestures in certain developers.

mind that this is a soft-float port to ARM, which will run slowly, but on pretty much everything. a VFP version could possibly be branched out in the same build, while FPA and FPE do not make much sense to implement in my opinion, since support for them is minimal (afaik).

only some basic operators and functions are implemented at this point, but the semantics are in place.

test:

Code:
ret = 3.1415926535897932384626433832795;
ret = (sqr(ret - 3.0) / 2 + 1.5)*ret;
ret = (sqr(ret - 3.0) / 2 + 1.5)*ret;
ret = (sqr(ret - 3.0) / 2 + 1.5)*ret;

// goes something like
ret = 3.14159265358979323
ret = 4.7438810584205937
ret = 14.3291800878725635
ret = 941.0712054248509730
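The expected values above can be cross-checked on any machine with hardware doubles. The following is a minimal C sketch, not part of the port: `sqr`, `eel_test_step` and `eel_test_run` are names made up here, and `sqr(x)` is assumed to mean `x*x` as in EEL.

```c
#include <stdio.h>

/* assumed to match EEL's sqr(): x squared */
static double sqr(double x) { return x * x; }

/* one pass of the test expression: ret = (sqr(ret - 3.0) / 2 + 1.5) * ret */
static double eel_test_step(double ret)
{
  return (sqr(ret - 3.0) / 2 + 1.5) * ret;
}

/* run n passes, printing the value after each one, and return the last value */
static double eel_test_run(double ret, int n)
{
  for (int i = 0; i < n; i++) {
    ret = eel_test_step(ret);
    printf("ret = %.16f\n", ret);
  }
  return ret;
}
```

Calling `eel_test_run(3.1415926535897932, 3)` should print three values close to the last three lines listed above; small differences in the trailing digits are possible between a soft-float and a hardware implementation.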
--
Old 04-07-2012, 04:03 PM   #2
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default soft-float performance

this is a bit of a side note:

i was curious about how software floating point performs in comparison to hardware, so i had to run some tests. since i don't have a real ARM device and only run in a simulator or a VM, the only adequate way to get at least somewhat accurate measurements in my case was to see what happens when x86 handles optimized software floating point and draw some conclusions from that.

instead of looking for a build of the GNU soft-float library i wrote a quick version of floating point addition that takes into consideration everything the FPU might do, such as checking for NaN and infinity, and rounding to nearest as the default rounding mode. i've used some compensation trickery in the actual measurement code to cancel out small deviations caused by compiler optimizations, pipelining or out-of-order execution (if that is even possible). this is greatly simplified on a single core x86 with the TSC if you can get the OS into a passive mode.

the results are:
no test code - ~0 cycles
x87 FADD - ~24 cycles
SOFT-FADD - ~140 cycles
SOFT-FADD with -O3 - ~40 cycles

GCC -O3 does a great job optimizing the function into something that might be considered "difficult to follow" x86 assembly (not that x86 normally is), but the performance is excellent. on these numbers the unoptimized software version is roughly 6 times slower than x87 FADD, and the -O3 version less than 2 times slower. while these numbers will be completely different on ARM CPUs (and overall the code will be much slower), i cannot confirm the claim that hardware floating point arithmetic is thousands of times faster than software, which i took from various small articles and more explicit hardware documentation. i would speculate a 10-30 times faster execution for VFP's FADD over an unoptimized software version on ARM.

if someone is interested i can post the test code.
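For readers who want something concrete, here is a sketch of the kind of soft-float addition described in this post. It is not liteon's test code: it assumes single precision (the post does not say which width was measured), works on raw IEEE-754 bit patterns, rounds to nearest with ties to even, and does not raise exception flags. All function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Add two IEEE-754 binary32 values given as raw bit patterns. */
static uint32_t soft_fadd(uint32_t a, uint32_t b)
{
  /* make a the operand with the larger magnitude */
  if ((b & 0x7fffffffu) > (a & 0x7fffffffu)) { uint32_t t = a; a = b; b = t; }

  uint32_t sign = a & 0x80000000u;
  uint32_t ea = (a >> 23) & 0xff, eb = (b >> 23) & 0xff;
  uint32_t ma = a & 0x7fffffu, mb = b & 0x7fffffu;

  if (ea == 0xff) {                        /* a is NaN or infinity */
    if (ma) return a | 0x400000u;          /* NaN in, quiet NaN out */
    if (eb == 0xff && ((a ^ b) >> 31)) return 0x7fc00000u; /* inf - inf */
    return a;
  }

  /* subnormals and zero use exponent 1 without the hidden bit */
  if (ea) ma |= 0x800000u; else ea = 1;
  if (eb) mb |= 0x800000u; else eb = 1;
  ma <<= 3; mb <<= 3;                      /* room for guard, round, sticky */

  /* align b to a's exponent, folding shifted-out bits into the sticky bit */
  uint32_t d = ea - eb;
  if (d >= 27) mb = (mb != 0);
  else if (d) mb = (mb >> d) | ((mb & ((1u << d) - 1)) != 0);

  uint32_t m;
  if ((a ^ b) >> 31) {                     /* signs differ: subtract */
    m = ma - mb;
    if (m == 0) return 0;                  /* exact cancellation gives +0 */
    while (m < (1u << 26) && ea > 1) { m <<= 1; ea--; }
  } else {                                 /* signs match: add */
    m = ma + mb;
    if (m & (1u << 27)) { m = (m >> 1) | (m & 1); ea++; }
  }

  /* round to nearest, ties to even, using the three extra bits */
  uint32_t low = m & 7;
  m >>= 3;
  if (low > 4 || (low == 4 && (m & 1))) m++;
  if (m & 0x1000000u) { m >>= 1; ea++; }   /* rounding carried out */

  if (!(m & 0x800000u)) ea = 0;            /* no hidden bit: subnormal or zero */
  if (ea >= 0xff) return sign | 0x7f800000u;  /* overflow to infinity */
  return sign | (ea << 23) | (m & 0x7fffffu);
}

/* convenience wrapper so the routine can be compared against the FPU */
static float soft_fadd_f(float x, float y)
{
  uint32_t a, b, r;
  float out;
  memcpy(&a, &x, sizeof a);
  memcpy(&b, &y, sizeof b);
  r = soft_fadd(a, b);
  memcpy(&out, &r, sizeof out);
  return out;
}
```

Comparing `soft_fadd_f(x, y)` against `x + y` over many inputs is one way to validate a routine like this before timing it; NaN results should be compared with an is-NaN check rather than bit for bit, since the payload and sign of a NaN result differ between FPUs.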

p.s.
i was able to fry something on my MB/AGP port, so currently my graphics card only runs in VGA mode, but i guess i will slowly continue the ARM port once i have a better platform to work on (unfortunately this affects my job-work as well). to my surprise, watching a low-res "modern" video in a native player and low-res flash (e.g. youtube) works ok even without hardware acceleration and high AGP transfer rates.

--
Old 04-11-2012, 01:49 PM   #3
IXix
Human being with feelings
 
Join Date: Jan 2007
Location: mcr:uk
Posts: 2,509
Default

Go on!
Old 05-08-2012, 03:03 PM   #4
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

here is an initial merge into the refactored eel2. compiles, but has to be adapted/tested later on:
http://github.com/neolit123/wdl/comm...3f57d229e22ddb

the github diff does not have ignore-* i believe.

--
Old 05-23-2012, 12:08 PM   #5
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Very cool! I'm about to push some new EEL changes online, including a bytecode interpreted mode (that is portable)... Now I'm tempted to go find a Raspberry Pi to help port the native ARM version (with FPU I hope?). Sorry if all of our EEL changes cause merge hell :/

Last edited by Justin; 05-23-2012 at 12:15 PM.
Old 05-24-2012, 12:51 AM   #6
Tale
Human being with feelings
Join Date: Jul 2008
Location: Holland
Posts: 1,691
Default

We have a Raspberry Pi at work, but I haven't had a chance to play with it yet.
Old 05-24-2012, 05:13 AM   #7
Banned
Human being with feelings
Join Date: Mar 2008
Location: Unwired (probably in the proximity of Amsterdam)
Posts: 3,743
Default

Fwiw, perhaps you could do some ARM development on a jailbroken iPhone/iPad as well; the jailbreak toolchain includes GCC, GDB etc.
__________________
˙lɐd 'ʎɐʍ ƃuoɹʍ ǝɥʇ ǝɔıʌǝp ʇɐɥʇ ƃuıploɥ ǝɹ,noʎ
Old 05-24-2012, 07:09 AM   #8
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Quote:
Originally Posted by Banned View Post
Fwiw, perhaps you could do some ARM development on a jailbroken iPhone/iPad as well; the jailbreak toolchain includes GCC, GDB etc.
ah nice, and I imagine you can mark pages as executable when jailbroken, too eh?
Old 05-24-2012, 09:31 PM   #9
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

Quote:
Originally Posted by Justin View Post
Very cool! I'm about to push some new EEL changes online, including a bytecode interpreted mode (that is portable)... Now I'm tempted to go find a Raspberry Pi to help port the native ARM version (with FPU I hope?). Sorry if all of our EEL changes cause merge hell :/
no problem,
there isn't much of a trouble merging, really...

the CPU in the Raspberry Pi is a bit outdated - ARM1176JZF-S, but it has a VFP unit and is good enough for development. i wanted to get soft-float support in, because unlike the x87, which will probably be there for quite some time, ARM might decide to deprecate the VFP unit at some point to save die space (and thus force use of the newer NEON SIMD only or come up with something else). there are a lot of ARM CPUs that have different floating point logic and are simply not compatible (VFP, NEON, FPA, FPE).

if we neglect that, the VFP control word has a field that puts the co-processor into scalar mode which is suitable for EEL2, i think.
https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf
page 885.

the register exchange (CPU-COP) and overall the instruction sets are pretty straightforward.

i wouldn't consider working on a mobile device, unless it's possible to attach a real monitor, mouse and a keyboard to it. also, i don't think serious programmers can be convinced that Android or iOS are better than something like Debian for development.

for the sake of running on a mobile device i did run a previous build of EEL2 on an Android phone, but then the build broke at some point. :\

--
Old 05-25-2012, 06:19 AM   #10
Banned
Human being with feelings
Join Date: Mar 2008
Location: Unwired (probably in the proximity of Amsterdam)
Posts: 3,743
Default

Quote:
Originally Posted by Justin View Post
ah nice, and I imagine you can mark pages as executable when jailbroken, too eh?
Check out ldid (Cydia), a tool that Jay Freeman (aka saurik) wrote; since Apple started their code signing requirements this is very useful to bypass it to allow an iPhone to execute binaries.

Rpetrich made an OS X port as well, so you can also add this step to a desktop build workflow before moving stuff onto a device. This way you can script all the required commands (e.g. make, chmod +x, ldid -S, scp) into a 1-click building/testing cycle.
__________________
˙lɐd 'ʎɐʍ ƃuoɹʍ ǝɥʇ ǝɔıʌǝp ʇɐɥʇ ƃuıploɥ ǝɹ,noʎ
Old 05-30-2012, 05:17 PM   #11
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

[double post]

Last edited by liteon; 05-30-2012 at 09:54 PM.
Old 05-30-2012, 05:21 PM   #12
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

https://github.com/neolit123/wdl/commits/

Code:
init
alloc
reg
compile
pass(1), ret=0.000000
pass(2), ret=0.000000
pass(3), ret=0.000000
pass(4), ret=0.000000
pass(5), ret=0.000000
pass(6), ret=0.000000
pass(7), ret=0.000000
pass(8), ret=0.000000
pass(9), ret=0.000000
pass(10), ret=0.000000
the glue code needs some more work, but at least it compiles/runs now.

there are some slight differences to x86, ppc, since in all places i directly modify the pc/link instead of branching ("b"). this should be technically slower, but gives a 32bit jump. the reason was that bx was giving me some strange results (thumb mode) and on the other hand gas translated "bl" to something similar, if i recall.

--
Old 05-30-2012, 08:41 PM   #13
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Quote:
Originally Posted by liteon View Post
the glue code needs some more work, but at least it compiles/runs now.

there are some slight differences to x86, ppc, since in all places i directly modify the pc/link instead of branching ("b"). this should be technically slower, but gives a 32bit jump. the reason was that bx was giving me some strange results (thumb mode) and on the other hand gas translated "bl" to something similar, if i recall.

--

Very cool! I'm learning a lot reading this...

Unfortunately I think we'll need to do some more tweaks to the code calling the glue, to support storing the offset elsewhere (in a data block, perhaps), because this code:

Quote:
static const unsigned int GLUE_JMP_IF_P1_Z[]=
{
0x051ff004, // ldreq pc, [pc, #-4]
0x0, // offset goes here
};
...will try to execute the offset as an instruction (assuming the jump is not made), which would almost always be bad...
Old 05-30-2012, 10:25 PM   #14
dub3000
Human being with feelings
Join Date: Mar 2008
Location: Sydney, Australia
Posts: 3,775
Default

cool stuff!
Old 05-30-2012, 10:26 PM   #15
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

[double post again...]

Last edited by liteon; 05-30-2012 at 10:41 PM.
Old 05-30-2012, 10:32 PM   #16
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

Quote:
Originally Posted by Justin View Post
Very cool! I'm learning a lot reading this...

Unfortunately I think we'll need to do some more tweaks to the code calling the glue, to support storing the offset elsewhere (in a data block, perhaps), because this code:

...will try to execute the offset as an instruction (assuming the jump is not made), which would almost always be bad...
you are correct,
forgot about that - oops.

here is what can be done:

Code:
/* gas test program */

  .global  main
  .type  main, %function
main:
  stmfd  sp!, {lr}

  mov r0, #1
  /* check/set the zero flag */
  cmp r0, #0
  /* we call our conditional instruction
  offset #0 would mean the instruction at the pc (or in this case .word) */
  ldrne  pc, [pc, #0]
  /* but if we reach this point we simply update the pc
  which basically goes to ldmfd... or skips the .word */
  add pc, pc, #0
  .word 0xcafecafe

  ldmfd  sp!, {pc}
Code:
static const unsigned int GLUE_JMP_IF_P1_Z[]=
{
  0x059ff000,   // ldreq  pc, [pc, #0]
  0xe28ff000,   // add pc, pc, #0
  0x0           // offset goes here
};
static const unsigned int GLUE_JMP_IF_P1_NZ[]=
{
  0x159ff000,  // ldrne  pc, [pc, #0]
  0xe28ff000,  // add pc, pc, #0
  0x0          // offset goes here
};
https://github.com/neolit123/wdl/com...76905008557823

i'm hoping that it will be possible to write at GLUE_JMP_IF_P1_NZ[2], for example.
[edit] and also that the instruction after the "offset goes here" word will be callable ?
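To make the patching idea concrete, here is a hedged C sketch of how a code generator might use the three-word template. `emit_jmp_if_nz` is a hypothetical helper, not something in nseel-compiler.c. It copies the template into the output buffer and writes the jump target into index 2 of the copy, since the `static const` array itself should not be modified. Note that `ldr pc, [...]` loads the stored word as an absolute address; whether the compiler would pass an absolute address or an offset here is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* template from the post above: ldrne pc,[pc,#0] / add pc,pc,#0 / target word */
static const uint32_t GLUE_JMP_IF_P1_NZ[] = { 0x159ff000, 0xe28ff000, 0x0 };

/* hypothetical helper: copy the template to 'out' and fill in the target.
   returns the number of 32-bit words written */
static size_t emit_jmp_if_nz(uint32_t *out, uint32_t target_address)
{
  memcpy(out, GLUE_JMP_IF_P1_NZ, sizeof(GLUE_JMP_IF_P1_NZ));
  out[2] = target_address;   /* ldrne loads this word into pc */
  return sizeof(GLUE_JMP_IF_P1_NZ) / sizeof(GLUE_JMP_IF_P1_NZ[0]);
}
```

When the condition fails, `add pc, pc, #0` sets pc to its own address plus 8, which is the word right after the data word, so the next instruction should be emitted at `out[3]`.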

--

Last edited by liteon; 05-30-2012 at 10:57 PM.
Old 05-31-2012, 02:22 AM   #17
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Actually (duh!), those jump instructions can be the 26 bit relative versions -- the addresses passed are relative anyway (but they are in bytes rather than dwords, which may need some tweaking). GLUE_MAX_JMPSIZE should be defined to the ~16 million max... I will update the calling code to use a GLUE_JMP_SET_OFFSET(instruction_end_buffer,offset) rather than having it directly replace the address using the GLUE_JMP_TYPE / GLUE_JMP_OFFSET / GLUE_JMP_OFFSET_MASK values (since the latter requires the address to fit in its own int or short).
Old 05-31-2012, 04:34 AM   #18
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

i didn't check previously how nseel-compiler.c writes the immediates into the bytes. this will certainly reduce the jump overhead (the 32mb limit though...).

the encoding looks similar to ppc (sign extension etc).
here is the "b" encoding info:

http://simplemachines.it/doc/arm_inst.pdf
(page 17, actually its 24bits..)

so for example, if we want to branch to the offset where the b instruction itself is (e.g. at 1000), this would mean:

by specs:
encoded = (target_offset - pc) >> 2
e.g. (1000 - 1008) >> 2 = -8 >> 2 = -2 = 0xfffffffe(32bit) = 0xfffffe(24bit)

note: pc is current offset + 8 only on ARM mode, was + 4 on THUMB mode i think.

0xea - this is the non-conditional b opcode (11101010)
0xfffffe - immediate
0xeafffffe - result
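The same arithmetic as a small C sketch; `arm_encode_b` is a made-up name, and unsigned arithmetic is used so the shift does not depend on how the compiler treats negative values:

```c
#include <stdint.h>

/* cond: 0xe = always, 0x0 = eq, 0x1 = ne. addresses are in bytes. */
static uint32_t arm_encode_b(uint32_t cond, uint32_t insn_addr, uint32_t target_addr)
{
  uint32_t rel = target_addr - (insn_addr + 8);   /* pc reads as insn + 8 in ARM mode */
  uint32_t imm24 = (rel >> 2) & 0x00ffffffu;      /* word offset, two's complement in 24 bits */
  return (cond << 28) | 0x0a000000u | imm24;
}
```

`arm_encode_b(0xe, 1000, 1000)` gives 0xeafffffe, matching the worked example. The sketch does not check that the target is within the reachable range of about +/-32 MB.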

Code:
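// stored as whole words: a byte array starting with 0xea would put the
// opcode byte at the wrong end of the instruction on little-endian ARM
static const unsigned int GLUE_JMP_NC[] =
{
  0xea000000   // b <imm24>
};

static const unsigned int GLUE_JMP_IF_P1_Z[] =
{
  0x0a000000   // beq <imm24>
};

static const unsigned int GLUE_JMP_IF_P1_NZ[] =
{
  0x1a000000   // bne <imm24>
};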

// (edit: 24 bit and needs cmp for the cond.)
--

Last edited by liteon; 06-01-2012 at 02:19 PM.
Old 05-31-2012, 03:16 PM   #19
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

OK, pushed an update, now the glue*.h implement:

GLUE_JMP_SET_OFFSET(endptrofinstruction, offset_in_bytes_from_end_of_instruction)


PPC implements it as:

#define GLUE_JMP_SET_OFFSET(endOfInstruction,offset) (((short *)(endOfInstruction))[-1] = ((offset) + 4) & 0xFFFC)

(since PPC jumps are relative to the start of the jump instruction)

whereas x86/x86-64 implement it as:

#define GLUE_JMP_SET_OFFSET(endOfInstruction,offset) (((int *)(endOfInstruction))[-1] = (offset))

(on x86 the jump is relative to the next instruction)

If it makes it easier to read, you could implement this as 'static void GLUE_JMP_SET_OFFSET(void *, int)' too...
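For comparison, an ARM version could look roughly like the function below. This is only a sketch under the assumptions discussed in this thread (ARM mode, 4-byte branch instruction, 24-bit word offset relative to the instruction address plus 8), written as a function rather than a macro; `arm_jmp_set_offset` is a hypothetical name, not what the glue header ends up using.

```c
#include <stdint.h>

/* offset is in bytes, measured from the end of the branch instruction */
static void arm_jmp_set_offset(void *end_of_instruction, int offset)
{
  uint32_t *insn = (uint32_t *)end_of_instruction - 1;
  uint32_t rel = (uint32_t)offset - 4u;   /* pc is insn + 8, i.e. end + 4 */
  /* keep condition and opcode (top 8 bits), replace the 24-bit word offset */
  *insn = (*insn & 0xff000000u) | ((rel >> 2) & 0x00ffffffu);
}
```

With this convention an offset of -4 (the branch's own address) encodes as 0xfffffe, and an offset of 4 encodes as 0.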
Old 05-31-2012, 04:04 PM   #20
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

I'd imagine that Thumb mode shouldn't even be considered, since RAM use isn't a concern. Also I'd be curious whether loading constants via PC-relative addressing and the associated branch is worthwhile; probably it would make more sense to either a) encode as 4 instructions (ugh), or B) make each codehandle have a table of pointers to load from (provided the count is small enough to be addressable). The latter is something I've considered doing for PPC, too, but it doesn't quite seem worth it as PPC can do constant 32 bit loads in 2 instructions...
Old 06-01-2012, 04:53 PM   #21
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

Quote:
Originally Posted by Justin View Post
OK, pushed an update, now the glue*.h implement:

GLUE_JMP_SET_OFFSET(endptrofinstruction, offset_in_bytes_from_end_of_instruction)


PPC implements it as:

#define GLUE_JMP_SET_OFFSET(endOfInstruction,offset) (((short *)(endOfInstruction))[-1] = ((offset) + 4) & 0xFFFC)

(since PPC jumps are relative to the start of the jump instruction)

whereas x86/x86-64 implement it as:

#define GLUE_JMP_SET_OFFSET(endOfInstruction,offset) (((int *)(endOfInstruction))[-1] = (offset))

(on x86 the jump is relative to the next instruction)

If it makes it easier to read, you could implement this as 'static void GLUE_JMP_SET_OFFSET(void *, int)' too...
thanks for the changes and clarification,

just merged and updated...
but i might have messed the jump offset macro, since it confuses me a little.
best would be to actually get to the point of testing it, i guess.

--
Old 06-01-2012, 04:56 PM   #22
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

Quote:
Originally Posted by Justin View Post
I'd imagine that Thumb mode shouldn't even be considered, since RAM use isn't a concern. Also I'd be curious whether loading constants via PC-relative addressing and the associated branch is worthwhile; probably it would make more sense to either a) encode as 4 instructions (ugh), or B) make each codehandle have a table of pointers to load from (provided the count is small enough to be addressable). The latter is something I've considered doing for PPC, too, but it doesn't quite seem worth it as PPC can do constant 32 bit loads in 2 instructions...
yep, no thumb mode. the current scheme will also not work with it very well, since the port depends on 4byte offsets (and is using r8). the mode switching in itself is a bit confusing, complemented by the cpu model naming scheme that arm uses.

as far as i know the pc method of loading is the safest and the only way to load a full 32bit value.
there is also mvn (move + not), which can do for example:
ldr r0, =0xffffff00
could be:
mvn r0, #255
but will not work for 0xfffffe00.
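The rule behind this is that an ARM data-processing immediate is an 8-bit value rotated right by an even number of bits. A small C sketch of the check follows; the function names are made up for illustration:

```c
#include <stdint.h>

/* rotate left; n is 0..31 */
static uint32_t rol32(uint32_t v, unsigned n)
{
  return n ? (v << n) | (v >> (32 - n)) : v;
}

/* 1 if v can be written as an 8-bit value rotated right by an even amount */
static int arm_is_imm(uint32_t v)
{
  for (unsigned rot = 0; rot < 32; rot += 2)
    if ((rol32(v, rot) & ~0xffu) == 0) return 1;   /* undoing the rotation leaves 8 bits */
  return 0;
}

/* 1 = a single mov works, 2 = a single mvn works, 0 = neither */
static int arm_single_insn_load(uint32_t v)
{
  if (arm_is_imm(v)) return 1;
  if (arm_is_imm(~v)) return 2;
  return 0;
}
```

`arm_single_insn_load(0xffffff00)` reports mvn, while 0xfffffe00 fails both tests because its complement 0x1ff needs 9 bits.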

gcc seems to use it quite a lot, even for smaller values. this is a dump of the end of the <main> branch:
Code:
   188f8:	0002a87c 	andeq	sl, r2, ip, ror r8
   188fc:	0002a884 	andeq	sl, r2, r4, lsl #17
   18900:	0002a89c 	muleq	r2, ip, r8
   18904:	0002a8a4 	andeq	sl, r2, r4, lsr #17
   18908:	0002a8b8 	streqh	sl, [r2], -r8
   1890c:	0002a8bc 	streqh	sl, [r2], -ip
   ...
the second method you propose is something i've considered as well. there is already an address table dumped into a pool in GLUE_CALL_CODE (but it really should be in c, i think, and passed as a __asm parameter like you do with "consttab"). the table itself is passed to the nseel_asm_... methods to provide some function pointers, because i wasn't able to get the correct addresses of such in any other way. 256 values would be hardly reachable at this point, i think.

with this, loading a full double would take 2 instructions (or ~4 cycles (edit)) instead of 4.

--

Last edited by liteon; 06-03-2012 at 06:29 AM.
Old 06-02-2012, 10:19 AM   #23
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

I think you can load a 32 bit constant with 3 non-branching instructions, actually.

1) load 2nd following instruction
2) xor top 8 bits of value with constant
3) instruction with top 8 bits set so that the bottom 24 bits are ignored

I need to check the details of this, but it should work i'd think in 3 cycles and 12 bytes:

Update: something like (if setting 0 as a constant, we'd use different (hopefully shorter) instructions):

ldr r5, [pc, #4] // load contents of "whatever instruction" I probably got the syntax wrong...
eors r5, high_8_bits_of_value shl #24 // this will always have the zero flag clear, provided we're not setting 0
.word low_24_bits_of_value // note that this encodes as if zeroflag-set: some data-operation, one of and, eor, sub, rsb, add, adc, sbc or rsc

Last edited by Justin; 06-03-2012 at 02:51 PM.
Old 06-02-2012, 10:55 AM   #24
rutmang
Human being with feelings
Join Date: May 2007
Location: Dearborn, MI
Posts: 55
Default

Good lord! I just now found this thread. Don't know all the ingredients you all are talking about, but it smells like things are cooking!

As for some cheap hardware (may still be on par with the Raspberry Pi) for development, we got one of these for $50 altogether. A little more than the Pi, but already housed with a touchscreen, etc.

Only 256MB ram, but seems pretty easy to hack away with. And you can plug in a USB keyboard and mouse.

CPU is InfoTM iMAPx210 1GHz. I think the CPU manual is here if it may help at all.

I can't code, but would like to help if this info is of any use.
__________________
"'Dangerous Business' is as good as 'Bridge over Troubled Water" any day of the week."

=== Check out my stuff ===
http://mangmade.blogspot.com/
======= Thanks! ========
Old 06-03-2012, 02:35 PM   #25
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Just a note to say I more fully sorted out my ideal set of 3 instructions, which include no branches and such, for loading the 32 bit value...
Old 06-04-2012, 05:17 AM   #26
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

Quote:
Originally Posted by Justin View Post
I think you can load a 32 bit constant with 3 non-branching instructions, actually.

1) load 2nd following instruction
2) xor top 8 bits of value with constant
3) instruction with top 8 bits set so that the bottom 24 bits are ignored

I need to check the details of this, but it should work i'd think in 3 cycles and 12 bytes:

Update: something like (if setting 0 as a constant, we'd use different (hopefully shorter) instructions):

ldr r5, [pc, #4] // load contents of "whatever instruction" I probably got the syntax wrong...
eors r5, high_8_bits_of_value shl #24 // this will always have the zero flag clear, provided we're not setting 0
.word low_24_bits_of_value // note that this encodes as if zeroflag-set: some data-operation, one of and, eor, sub, rsb, add, adc, sbc or rsc
this is a nice hack - using the s suffix and encoding 24 bits - and it can certainly work with this scheme. i think there has to be another instruction for setting the smaller portion (8 bits) of the desired 32 bits though, because we cannot use the barrel shifter and set an immediate in the same instruction, for example:

Code:
    /* load .word into r5 */
  ldr r5, [pc, #4]
    /* setting the lower bits into an extra register is required :[ */
  mov r6, #0xee
    /* left shift by 8 will discard the beq opcode (0x0a) and we can then concat
    with the lower bits, while clearing the z bit in cpsr (s suffix) */
  orrs r5, r6, r5, lsl #8
    /* will not be called: beq <some_24bit_word> =
       = .word 0x0a000000 | 0x00xxxxxx */
  .word 0x0abbccdd
    /* r5 now holds 0xbbccddee */
but as a comparison i think it might be faster to form the constant with 4 naive instructions:
Code:
  mov r0, #0x000000ee
  orr r0, r0, #0x0000dd00
  orr r0, r0, #0x00cc0000
  orr r0, r0, #0xbb000000
while the naive version may suffer from lack of potential pipeline optimization, the previous version has two potential stalls: one at ldr and one at beq.
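As an illustration of the naive four-instruction form, the C sketch below generates the `mov`/`orr` encodings for an arbitrary 32-bit constant and decodes them again to confirm the value. It targets r0 only, always emits four instructions even when some bytes are zero, and the function names are invented for this example.

```c
#include <stdint.h>

/* rotate right; n is 0..31 */
static uint32_t ror32(uint32_t v, unsigned n)
{
  return n ? (v >> n) | (v << (32 - n)) : v;
}

/* emit: mov r0,#byte0 ; orr r0,r0,#byte1<<8 ; orr ...<<16 ; orr ...<<24 */
static void arm_emit_const_r0(uint32_t value, uint32_t out[4])
{
  for (unsigned k = 0; k < 4; k++) {
    uint32_t imm8 = (value >> (8 * k)) & 0xff;
    uint32_t rot = (16 - 4 * k) & 15;   /* imm8 ROR (2*rot) equals imm8 << (8*k) */
    uint32_t op = (k == 0) ? 0xe3a00000u   /* mov r0, #imm */
                           : 0xe3800000u;  /* orr r0, r0, #imm */
    out[k] = op | (rot << 8) | imm8;
  }
}

/* decode the immediates again and apply them the way the cpu would */
static uint32_t arm_simulate_const_r0(const uint32_t code[4])
{
  uint32_t r0 = 0;
  for (unsigned k = 0; k < 4; k++) {
    uint32_t imm = ror32(code[k] & 0xff, 2 * ((code[k] >> 8) & 15));
    r0 = (k == 0) ? imm : (r0 | imm);
  }
  return r0;
}
```

For 0xbbccddee this produces 0xe3a000ee, 0xe3800cdd, 0xe38008cc and 0xe38004bb, the encodings of the four instructions listed above.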

i've been reading more on how ldr works on cpu/mpu level and it depends on a lot of factors. it will normally take 2 cycles, but it can take one cycle if it can be pipelined, which is the case when no operation on the involved register follows right afterwards. this is somewhat difficult to achieve if constant loading is going to be a macro.

using ldr rx, [pc, n] can be a 2-3 cycle operation and if a stall occurs it will be caused by the "fetch unit" (fetch-decode stage). if cache performance is of consideration here, it would be interesting to compare what are the benefits (if any) of using a global pool (address of, stored in a register, e.g. GLUE_CALL_CODE) in comparison to the pc relative offsetting in regard of caching, mapping, tlb, fetch timing, etc.

but in general, we can still provide a local pool per section that will be outside of the return branch (which will theoretically still take 2-3 cycles):
Code:
.section_name
ldr r1, [pc, #some_offset]
add r1, r0, #1
...
# return
mov pc, lr
# pool here
.word 0xff00ff00
.word 0xff0fff00
.word 0xfff0ff00
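The `#some_offset` in a layout like this follows from the distance between the load and its pool entry. A hedged C sketch of that calculation, with `arm_encode_ldr_literal` as a made-up name:

```c
#include <stdint.h>

/* encode "ldr<cond> rd, [pc, #off]" for a pool entry at pool_addr.
   returns 0 if the entry is outside the 12-bit (+/-4095 byte) range */
static uint32_t arm_encode_ldr_literal(uint32_t cond, unsigned rd,
                                       uint32_t insn_addr, uint32_t pool_addr)
{
  uint32_t rel = pool_addr - (insn_addr + 8);   /* pc reads as insn + 8 */
  uint32_t up, mag;
  if (rel <= 4095u)           { up = 1; mag = rel; }        /* entry at or after pc */
  else if (0u - rel <= 4095u) { up = 0; mag = 0u - rel; }   /* entry before pc */
  else return 0;
  return (cond << 28) | 0x051f0000u | (up << 23) | ((uint32_t)rd << 12) | mag;
}
```

With rd = 15 this reproduces the encodings used earlier in the thread: 0x051ff004 for `ldreq pc, [pc, #-4]` and 0x159ff000 for `ldrne pc, [pc, #0]`.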
on the other hand simpler constants can obviously be formed by using the barrel shifter with 1-3 instructions (which guarantees 1-3 cycles and opens up to out-of-order execution and pipelining):

Code:
# 0x7fffffff
mvn r0, #0
mov r0, r0, lsr #1

# 0x3ff00000
mov r0, #0x00f00000
orr r0, r0, #0x3f000000
if performance becomes a greater concern later on, constant definition could become section specific in an attempt to speed up execution. for example - loading a constant partially, performing some other operation and then finishing loading the constant, which may obfuscate the code a bit.

--
Old 04-17-2013, 03:33 PM   #27
liteon
Human being with feelings
Join Date: Apr 2008
Posts: 510
Default

[haven't posted in a while mode]

it has been almost a year and i'm really sorry about that, but due to personal reasons and job work i've abandoned this completely and moved to less engaging open source work in my spare time... i can assure you though, that if you really want to get your hands dirty you should definitely try something of this magnitude; you will learn a lot.

one potential issue that set me back a little was the recent (at the time) refactoring, which isn't much of a problem, but more of a challenge.

the second issue was that even if targeting ARM is not that bad of an idea, usually the mobile vendors (which mostly decide on ARM due to efficiency of the platform) will apparently impose sandbox limitations that may disable this type of engine completely. so for example, even if the engine runs on Android it may not work at all on iOS, unless redesigned (of sorts and if possible). so the thing here is that you may not get a user base for the software you are writing.

unless we are very prescient and ARM suddenly decides to target the desktop, this might end up being only as a very nerdy mind-flex for developers, which certainly isn't a bad thing of course :].

i can eventually get back to it...

--

Last edited by liteon; 04-17-2013 at 03:41 PM.
Old 04-21-2013, 05:39 PM   #28
Justin
Administrator
Join Date: Jan 2005
Location: NYC
Posts: 7,864
Default

Heh no problem -- not a real goal in mind for an ARM port anyway.

I did spend some time getting Jesusonic running on my Pi, using the bytecode engine (which is portable, too!). It was, as you would imagine, incredibly slow.

Anyhoo...