[
** Updated 02/03/2015 **
The patch to implement IFUNC for arm is submitted here - https://sourceware.org/ml/binutils/2015-01/msg00258.html
]
Scenario
A nasty bug showed up in the IFUNC implementation, so I am writing down what I understand about IFUNC for future reference.
IFunc is nothing advanced. It is merely a trick for choosing one implementation of a function, usually depending on CPU features. The decision is not made on every function invocation, but just once, right before the binary starts executing.
A typical use is to select one of the following memcpy implementations for a given piece of hardware.
- memcpy_neon() …
- memcpy_vfp() …
- memcpy_generic_arm() …
The naive way
<pre><code>
void *memcpy(void *dest, const void *source, size_t size)
{
    /* The feature check runs on every single call. */
    cpu_features = get_cpu_feature();
    if (cpu_has_neon(cpu_features))
        return memcpy_neon(dest, source, size);
    else if (cpu_has_vfp(cpu_features))
        return memcpy_vfp(dest, source, size);
    return memcpy_generic_arm(dest, source, size);
}
</code></pre>
This obviously incurs a performance penalty: the same selection logic runs on every memcpy invocation.
The ifunc way
IFunc comes to the rescue in this scenario. Define a memcpy resolver function that, instead of doing the actual work, returns a pointer to the function that will do it, chosen by whatever logic is appropriate. Then mark memcpy as an ifunc whose resolver is that “memcpy resolver”, like below.
<pre><code>
void *memcpy (void *, const void *, size_t)
    __attribute__ ((ifunc ("resolve_memcpy")));
// Returns a function pointer
static void (*resolve_memcpy (void)) (void)
{
    cpu_features = xx; /* for arm, r0 is preset to the cpu feature value. */
    if (cpu_has_neon(cpu_features))
        return &memcpy_neon;
    else if (cpu_has_vfp(cpu_features))
        return &memcpy_vfp;
    return &memcpy_generic_arm;
}
</code></pre>
The big difference from “the naive way” is that resolve_memcpy is guaranteed to be called exactly once, and that call happens before main executes (usually in __start).
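The difference can be made visible with a small, portable sketch. It does not use the real ifunc attribute: an ordinary function pointer (`copy_slot`) plays the role of the GOT entry, and `resolve_all()` stands in for the startup code. All names here (`copy_naive`, `copy_resolved`, `probe_count`, `cpu_has_neon` as a stub that always answers yes) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-ins for the NEON and generic variants. */
static void *copy_neon(void *dst, const void *src, size_t n)    { return memcpy(dst, src, n); }
static void *copy_generic(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

/* Counts how often the CPU features were queried. */
static int probe_count;

/* Hypothetical feature probe; a real one would inspect the hardware capabilities. */
static int cpu_has_neon(void) { probe_count++; return 1; }

/* The naive way: the probe runs on every call. */
void *copy_naive(void *dst, const void *src, size_t n)
{
    return cpu_has_neon() ? copy_neon(dst, src, n) : copy_generic(dst, src, n);
}

/* The ifunc-like way: the resolver only picks an implementation. */
static copy_fn resolve_copy(void)
{
    return cpu_has_neon() ? copy_neon : copy_generic;
}

/* Plays the role of the GOT entry that the startup code patches. */
static copy_fn copy_slot;

/* Stands in for the startup code that runs every resolver once. */
void resolve_all(void) { copy_slot = resolve_copy(); }

/* Every call goes straight through the already-resolved slot. */
void *copy_resolved(void *dst, const void *src, size_t n)
{
    assert(copy_slot != NULL);
    return copy_slot(dst, src, n);
}
```

Calling `copy_naive` three times bumps `probe_count` three times, while one `resolve_all()` followed by three `copy_resolved` calls bumps it only once.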
Implementation
Compiler side
Whenever the compiler sees “__attribute__((ifunc(...)))”, it marks the function symbol as “IFUNC” in the symbol table. That’s it, simple enough.
Static linker side
[
** Updated 02/03/2015 ** – note that arm and aarch64 have slightly different implementations here. For aarch64, the resolver function address is encoded in the addend field of the relocation, while for arm, the address is written into the got entry.
<pre><code>
// This is aarch64 implementation - aarch64/dl-irel.h
if (__glibc_likely (r_type == R_AARCH64_IRELATIVE))
{
// the resolve function address is encoded in addend field.
ElfW(Addr) value = elf_ifunc_invoke (reloc->r_addend);
*reloc_addr = value;
}
// This is arm implementation – arm/dl-irel.h
if (__builtin_expect (r_type == R_ARM_IRELATIVE, 1))
{
// the resolve function address is written into the relocation address (the got entry)
Elf32_Addr value = elf_ifunc_invoke (*reloc_addr);
*reloc_addr = value;
}
</code></pre>
The example below is based on the arm implementation.
]
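To make the arm/aarch64 difference concrete, here is a minimal sketch that applies one IRELATIVE-like relocation in both styles. It is not glibc code: `struct fake_rela` is a simplified, hypothetical record (not the real ELF relocation layout), `demo_resolver` and `demo_target` are made-up functions, and the sketch assumes a platform where a function pointer round-trips through `uintptr_t`, which holds on common ELF targets.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t (*resolver_fn)(void);

/* Hypothetical, simplified relocation record. */
struct fake_rela {
    uintptr_t *slot;    /* address to patch, i.e. the got entry */
    uintptr_t  addend;  /* only meaningful for the aarch64 style */
};

/* Call the resolver living at resolver_addr and return the address it picks. */
static uintptr_t invoke(uintptr_t resolver_addr)
{
    assert(resolver_addr != 0);
    return ((resolver_fn)resolver_addr)();
}

/* aarch64 style: the resolver address comes from the addend field. */
void apply_irelative_addend(const struct fake_rela *r)
{
    *r->slot = invoke(r->addend);
}

/* arm style: the resolver address is read from the slot itself. */
void apply_irelative_in_place(const struct fake_rela *r)
{
    *r->slot = invoke(*r->slot);
}

/* Made-up implementation and resolver used to exercise the two helpers. */
int demo_target(void) { return 42; }
uintptr_t demo_resolver(void) { return (uintptr_t)demo_target; }
```

In both styles the slot ends up holding the address of `demo_target`. The only difference is where the resolver address is found before the relocation is applied.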
Whenever it sees a call to an ifunc, the linker does these 3 things -
- makes the call go via the plt
- sets the corresponding plt.got entry to the address of the resolver function.
- attaches an IRELATIVE relocation to that plt.got entry.
For example -
<pre><code>
memcpy_pltentry:
0 add r12, pc, #4
4 add r12, r12, #0
8 ldr pc, [r12, #0] // transfer pc to 2000, the content of [12]
memcpy_gotentry:
12 2000 // Attach an IRELATIVE relocation here.
a_routine:
1000 b 0 // call memcpy via plt,
// 0 is the address of memcpy_pltentry
...
memcpy_resolver:
2000 mov r0, 3000
bx lr
memcpy_neon:
3000 ...
memcpy_vfp:
4000 ...
memcpy_generic_arm:
5000 ...
</code></pre>
Right before executing main
glibc iterates over all IRELATIVE relocations. For each such relocation it
- loads the content at the IRELATIVE address
- sets PC to this content (basically, this runs the function whose address is stored at the location denoted by the IRELATIVE relocation). For the example above, the IRELATIVE address is 12 and its content is 2000, so PC is set to 2000, which is memcpy_resolver.
- rewrites the memory at the IRELATIVE address with the return value of that function invocation. For the example above, the IRELATIVE address is 12 and the return value is 3000, so 3000 is written to 12.
All later invocations of memcpy go to memcpy_neon, and memcpy_resolver will **never be called again**.
After these steps, the above memory layout becomes -
<pre><code>
memcpy_pltentry:
0 add r12, pc, #4
4 add r12, r12, #0
8 ldr pc, [r12, #0] // transfer pc to 3000 now,
// the content of [12]
memcpy_gotentry:
12 3000 // 3000 is the value returned by memcpy_resolver.
a_routine:
1000 b 0 // call memcpy via plt,
// 0 is the address of memcpy_pltentry
...
memcpy_resolver:
2000 mov r0, 3000
bx lr
memcpy_neon:
3000 ...
memcpy_vfp:
4000 ...
memcpy_generic_arm:
5000 ...
</code></pre>
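The transition between the two layouts can be replayed with a toy model. This is only a sketch: the "addresses" 2000 and 3000 are plain integers copied from the listings, `got[0]` stands for memcpy_gotentry, and `run_at` is a hypothetical stand-in for "set PC to this address". Nothing here touches real relocations.

```c
#include <assert.h>
#include <stddef.h>

/* Toy addresses taken from the listings above. */
enum { MEMCPY_RESOLVER = 2000, MEMCPY_NEON = 3000 };

/* memcpy_gotentry: starts out holding the resolver address. */
static unsigned got[1] = { MEMCPY_RESOLVER };

/* Indices of the got slots that carry an IRELATIVE relocation. */
static const size_t irelative[] = { 0 };

/* Counts how many times the resolver ran. */
static int resolver_calls;

/* Stand-in for "set PC to addr": run the code living at a toy address. */
static unsigned run_at(unsigned addr)
{
    assert(addr == MEMCPY_RESOLVER); /* the only code address this model knows */
    resolver_calls++;
    return MEMCPY_NEON;              /* memcpy_resolver: mov r0, 3000 */
}

/* What happens right before main: walk every IRELATIVE relocation once. */
void apply_irelative_relocs(void)
{
    for (size_t i = 0; i < sizeof irelative / sizeof irelative[0]; i++) {
        unsigned *slot = &got[irelative[i]];
        unsigned resolver = *slot;         /* load the content of the slot */
        unsigned target = run_at(resolver); /* run the resolver */
        *slot = target;                    /* rewrite the slot with the result */
    }
}

/* What the plt entry does: jump to whatever the got slot holds. */
unsigned plt_jump_target(void) { return got[0]; }
```

Before `apply_irelative_relocs()` the plt would jump to 2000. Afterwards every call jumps to 3000, and `resolver_calls` stays at 1 no matter how often `plt_jump_target()` is consulted.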