mips/o32: fix internal_syscall5/6/7

Message ID alpine.DEB.2.00.1708171313580.17596@tp.orcam.me.uk
State Superseded

Commit Message

Maciej W. Rozycki Aug. 17, 2017, 4:17 p.m. UTC
  On Wed, 16 Aug 2017, Joseph Myers wrote:

> >  If the answer to any of these questions is "yes", then would factoring 
> > out the syscall `asm' along with the associated VLA declaration to a 
> > helper `always_inline' function help or would it not?
> 
> I don't think that would help.  An asm can never make assumptions about 
> which parts of the stack are used for what, only use its operands.

 There may be ABI restrictions however, which could provide guarantees 
beyond those resulting from the lone `asm' operands.  And it would be 
enough if we could prove that a certain arrangement has to be done in 
order not to break the ABI.  I can't think of anything right now though 
and if neither you nor anyone else can, then we'll have to live with what 
we have right now.

> >  I mean it is a tiny optimisation, but some syscalls are frequently 
> > called, so if we can avoid a waste of resources, then why not?
> 
> Are any 5/6/7-argument syscalls frequently called?

 Good question, however I have no data available.

 Anyway, here's my counter-proposal implementing the approach previously 
outlined.  I have passed it through regular MIPS o32 testing with these 
changes in test outputs resulting:

@@ -2575,7 +2575,7 @@
 PASS: nptl/tst-cond22
 PASS: nptl/tst-cond23
 PASS: nptl/tst-cond24
-FAIL: nptl/tst-cond25
+PASS: nptl/tst-cond25
 PASS: nptl/tst-cond3
 PASS: nptl/tst-cond4
 PASS: nptl/tst-cond5
@@ -2704,7 +2704,7 @@
 PASS: nptl/tst-rwlock12
 PASS: nptl/tst-rwlock13
 PASS: nptl/tst-rwlock14
-FAIL: nptl/tst-rwlock15
+PASS: nptl/tst-rwlock15
 PASS: nptl/tst-rwlock16
 PASS: nptl/tst-rwlock17
 PASS: nptl/tst-rwlock18

 The drawback is that it adds a bit to the code generated, e.g. `__libc_pwrite' 
(from nptl/pwrite.o and nptl/pwrite.os) grows by 4 and 6 instructions for 
non-PIC and PIC code respectively, and the whole libraries:

   text    data     bss     dec     hex filename
1483315   21129   11560 1516004  1721e4 libc.so
 105482     960    8448  114890   1c0ca nptl/libpthread.so

vs:

   text    data     bss     dec     hex filename
1484295   21133   11560 1516988  1725bc libc.so
 105974     960    8448  115382   1c2b6 nptl/libpthread.so

due to the insertion of the VLA size calculation (although GCC is smart 
enough to reuse a value of 0 already available, e.g.:

  38:	7c03e83b 	rdhwr	v1,$29
  3c:	8c638b70 	lw	v1,-29840(v1)
  40:	14600018 	bnez	v1,a4 <__libc_pwrite+0xa4>
  44:	000787c3 	sra	s0,a3,0x1f
  48:	000318c0 	sll	v1,v1,0x3
  4c:	03a08825 	move	s1,sp
  50:	03a3e823 	subu	sp,sp,v1

and save an instruction) and the use of an extra register to preserve the 
value of $sp across the block containing the VLA (as also seen with $s1 in 
the disassembly above) even though it could use $fp that holds the same 
value instead (e.g. continuing from the above:

  74:	0220e825 	move	sp,s1
  78:	03c0e825 	move	sp,s8

).  It would be good to know how this compares to Adhemerval's proposal.

  Maciej

	* sysdeps/unix/sysv/linux/mips/mips32/sysdep.h 
	(FORCE_FRAME_POINTER): Remove macro.
	(internal_syscall5): Use a variable-length array to force the
	use of a frame pointer.
	(internal_syscall6): Likewise.
	(internal_syscall7): Likewise.
---
 sysdeps/unix/sysv/linux/mips/mips32/sysdep.h |   24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

glibc-mips-o32-syscall-stack.diff
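
For readers who want to try the construct outside glibc, here is a minimal
sketch of the idea described above.  It assumes GCC (or a compatible
compiler) and GNU C semantics for a zero-sized VLA, which ISO C does not
permit.  The name `vla_block_demo' is hypothetical, and the "r" constraint
stands in for the "p" constraint used in the patch purely to keep the
example portable.  Whether a frame pointer is actually set up has to be
checked in the disassembly for the target at hand.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: does some work inside a
   block that declares a VLA whose size is always 0 at run time.  */
long
vla_block_demo (long a, long b)
{
  long result;
  size_t s = 0;

  /* Empty asm with a read-write operand: the compiler can no longer
     prove that s is 0, so the array below stays variable-length.  */
  __asm__ ("" : "+r" (s));
  {
    char vla[s << 3];

    /* Feed the array address to an empty asm so that the allocation
       is not discarded as unused.  The patch uses "p" here; "r" is an
       assumption made for this sketch.  */
    __asm__ ("" : : "r" (vla));

    result = a + b;
  }
  /* The VLA goes out of scope here, so $sp is restored at block exit.  */
  return result;
}
```

Calling `vla_block_demo (2, 3)' simply returns 5; the interesting part is
the prologue and epilogue the compiler emits around the inner block.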
  

Comments

Adhemerval Zanella Aug. 17, 2017, 5:25 p.m. UTC | #1
On 17/08/2017 13:17, Maciej W. Rozycki wrote:
> On Wed, 16 Aug 2017, Joseph Myers wrote:
> 
>>>  If the answer to any of these questions is "yes", then would factoring 
>>> out the syscall `asm' along with the associated VLA declaration to a 
>>> helper `always_inline' function help or would it not?
>>
>> I don't think that would help.  An asm can never make assumptions about 
>> which parts of the stack are used for what, only use its operands.
> 
>  There may be ABI restrictions however, which could provide guarantees 
> beyond those resulting from the lone `asm' operands.  And it would be 
> enough if we could prove that a certain arrangement has to be done in 
> order not to break the ABI.  I can't think of anything right now though 
> and if neither you nor anyone else can, then we'll have to live with what 
> we have right now.
> 
>>>  I mean it is a tiny optimisation, but some syscalls are frequently 
>>> called, so if we can avoid a waste of resources, then why not?
>>
>> Are any 5/6/7-argument syscalls frequently called?
> 
>  Good question, however I have no data available.
> 
>  Anyway, here's my counter-proposal implementing the approach previously 
> outlined.  I have passed it through regular MIPS o32 testing with these 
> changes in test outputs resulting:
> 
> @@ -2575,7 +2575,7 @@
>  PASS: nptl/tst-cond22
>  PASS: nptl/tst-cond23
>  PASS: nptl/tst-cond24
> -FAIL: nptl/tst-cond25
> +PASS: nptl/tst-cond25
>  PASS: nptl/tst-cond3
>  PASS: nptl/tst-cond4
>  PASS: nptl/tst-cond5
> @@ -2704,7 +2704,7 @@
>  PASS: nptl/tst-rwlock12
>  PASS: nptl/tst-rwlock13
>  PASS: nptl/tst-rwlock14
> -FAIL: nptl/tst-rwlock15
> +PASS: nptl/tst-rwlock15
>  PASS: nptl/tst-rwlock16
>  PASS: nptl/tst-rwlock17
>  PASS: nptl/tst-rwlock18
> 
>  The drawback is that it adds a bit to the code generated, e.g. `__libc_pwrite' 
> (from nptl/pwrite.o and nptl/pwrite.os) grows by 4 and 6 instructions for 
> non-PIC and PIC code respectively, and the whole libraries:
> 
>    text    data     bss     dec     hex filename
> 1483315   21129   11560 1516004  1721e4 libc.so
>  105482     960    8448  114890   1c0ca nptl/libpthread.so
> 
> vs:
> 
>    text    data     bss     dec     hex filename
> 1484295   21133   11560 1516988  1725bc libc.so
>  105974     960    8448  115382   1c2b6 nptl/libpthread.so
> 
> due to the insertion of the VLA size calculation (although GCC is smart 
> enough to reuse a value of 0 already available, e.g.:
> 
>   38:	7c03e83b 	rdhwr	v1,$29
>   3c:	8c638b70 	lw	v1,-29840(v1)
>   40:	14600018 	bnez	v1,a4 <__libc_pwrite+0xa4>
>   44:	000787c3 	sra	s0,a3,0x1f
>   48:	000318c0 	sll	v1,v1,0x3
>   4c:	03a08825 	move	s1,sp
>   50:	03a3e823 	subu	sp,sp,v1
> 
> and save an instruction) and the use of an extra register to preserve the 
> value of $sp across the block containing the VLA (as also seen with $s1 in 
> the disassembly above) even though it could use $fp that holds the same 
> value instead (e.g. continuing from the above:
> 
>   74:	0220e825 	move	sp,s1
>   78:	03c0e825 	move	sp,s8
> 
> ).  It would be good to know how this compares to Adhemerval's proposal.

My point is that we should aim for safety against compiler optimizations
(to avoid code breakage under compiler-defined default flags).  Given the
current effort to *avoid* VLAs in glibc, I do not think it is a good
approach to use one as a bridge to force GCC to generate the expected
code.

I still think that trying to optimize for 5/6/7-argument syscalls is
over-engineering in this *specific* case.  As I put in my last message,
5/6/7-argument syscalls are used for

pread, pwrite, lseek, llseek, ppoll, posix_fadvise, posix_fallocate,
sync_file_range, fallocate, preadv, pwritev, preadv2, pwritev2, select,
pselect, mmap, readahead, epoll_pwait, splice, recvfrom, sendto, recvmmsg,
msgsnd, msgrcv, msgget, msgctl, semop, semget, semctl, semtimedop, shmat,
shmdt, shmget, and shmctl.

These are the ones generated from C implementations (some are still
auto-generated).  The majority of them are blocking syscalls, so the context
switch plus the work required to complete the syscall itself will take
proportionally almost all of the time.  So trying to squeeze out some cycles
doesn't really pay off compared to code maintainability (all this
discussion of which C construct would be safe enough to generate the
correct stack spill, plus the current issue, should indicate that we should
aim for correctness first).
 


> 
>   Maciej
> 
> 	* sysdeps/unix/sysv/linux/mips/mips32/sysdep.h 
> 	(FORCE_FRAME_POINTER): Remove macro.
> 	(internal_syscall5): Use a variable-length array to force the
> 	use of a frame pointer.
> 	(internal_syscall6): Likewise.
> 	(internal_syscall7): Likewise.
> ---
>  sysdeps/unix/sysv/linux/mips/mips32/sysdep.h |   24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> glibc-mips-o32-syscall-stack.diff
> Index: glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
> ===================================================================
> --- glibc.orig/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-04-11 21:27:25.000000000 +0100
> +++ glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-08-16 20:49:15.758029215 +0100
> @@ -264,18 +264,20 @@
>  
>  /* We need to use a frame pointer for the functions in which we
>     adjust $sp around the syscall, or debug information and unwind
> -   information will be $sp relative and thus wrong during the syscall.  As
> -   of GCC 4.7, this is sufficient.  */
> -#define FORCE_FRAME_POINTER						\
> -  void *volatile __fp_force __attribute__ ((unused)) = alloca (4)
> +   information will be $sp relative and thus wrong during the syscall.
> +   We use a variable-length array to persuade GCC to use $fp.  */
>  
>  #define internal_syscall5(v0_init, input, number, err,			\
>  			  arg1, arg2, arg3, arg4, arg5)			\
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
> @@ -306,8 +308,12 @@
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
> @@ -339,8 +345,12 @@
>  ({									\
>  	long _sys_result;						\
>  									\
> -	FORCE_FRAME_POINTER;						\
> +	size_t s = 0;							\
> +	asm ("" : "+r" (s));						\
>  	{								\
> +	char vla[s << 3];						\
> +	asm ("" : : "p" (vla));						\
> +									\
>  	register long __s0 asm ("$16") __attribute__ ((unused))		\
>  	  = (number);							\
>  	register long __v0 asm ("$2");					\
>
  
Joseph Myers Aug. 17, 2017, 5:32 p.m. UTC | #2
On Thu, 17 Aug 2017, Adhemerval Zanella wrote:

> My point is that we should aim for safety against compiler optimizations
> (to avoid code breakage under compiler-defined default flags).  Given the
> current effort to *avoid* VLAs in glibc, I do not think it is a good
> approach to use one as a bridge to force GCC to generate the expected
> code.

I think the point that -Werror=alloca -Werror=vla would be desirable for 
building glibc (if you don't have any variable-size stack allocations, you 
don't need to worry about problems with unbounded stack allocations, which 
are always bad, even given reliable stack checking, because of the 
inability to report errors from them) is a good one about why to avoid 
using the VLA approach.
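
To make the error-reporting point concrete, here is a minimal sketch of
the style that remains available when building with -Werror=alloca
-Werror=vla: a fixed-size buffer with an explicit bound check, so an
oversized request becomes a reportable failure instead of an unchecked
stack allocation.  The function name and the bound are hypothetical and
not taken from glibc.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound chosen for this example only.  */
enum { SCRATCH_MAX = 256 };

/* Copy LEN bytes from SRC to DST through a bounded scratch buffer.
   Returns 0 on success, or -1 with errno set to ENOMEM when LEN does
   not fit; a VLA or alloca could not report that condition.  */
int
copy_via_scratch (char *dst, const char *src, size_t len)
{
  char scratch[SCRATCH_MAX];

  if (len > sizeof scratch)
    {
      errno = ENOMEM;
      return -1;
    }

  memcpy (scratch, src, len);
  memcpy (dst, scratch, len);
  return 0;
}
```

With a VLA in place of `scratch', the oversized case would have no return
path at all; the bound check is what makes the failure visible to the
caller.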
  
Aurelien Jarno Aug. 17, 2017, 6:18 p.m. UTC | #3
On 2017-08-17 17:17, Maciej W. Rozycki wrote:
>  The drawback is that it adds a bit to the code generated, e.g. `__libc_pwrite' 
> (from nptl/pwrite.o and nptl/pwrite.os) grows by 4 and 6 instructions for 
> non-PIC and PIC code respectively, and the whole libraries:
> 
>    text    data     bss     dec     hex filename
> 1483315   21129   11560 1516004  1721e4 libc.so
>  105482     960    8448  114890   1c0ca nptl/libpthread.so
> 
> vs:
> 
>    text    data     bss     dec     hex filename
> 1484295   21133   11560 1516988  1725bc libc.so
>  105974     960    8448  115382   1c2b6 nptl/libpthread.so
> 
> due to the insertion of the VLA size calculation (although GCC is smart 
> enough to reuse a value of 0 already available, e.g.:
> 
>   38:	7c03e83b 	rdhwr	v1,$29
>   3c:	8c638b70 	lw	v1,-29840(v1)
>   40:	14600018 	bnez	v1,a4 <__libc_pwrite+0xa4>
>   44:	000787c3 	sra	s0,a3,0x1f
>   48:	000318c0 	sll	v1,v1,0x3
>   4c:	03a08825 	move	s1,sp
>   50:	03a3e823 	subu	sp,sp,v1
> 
> and save an instruction) and the use of an extra register to preserve the 
> value of $sp across the block containing the VLA (as also seen with $s1 in 
> the disassembly above) even though it could use $fp that holds the same 
> value instead (e.g. continuing from the above:
> 
>   74:	0220e825 	move	sp,s1
>   78:	03c0e825 	move	sp,s8
> 
> ).  It would be good to know how this compares to Adhemerval's proposal.

I have been trying to improve Adhemerval's patches a bit by returning
the error value in v1, in addition to the return code in v0. Here are
the corresponding numbers:

w/o patch:
   text    data     bss     dec     hex filename
1489767   21085   11560 1522412  173aec libc.so
 107908     956    8448  117312   1ca40 nptl/libpthread.so

with patch:
   text    data     bss     dec     hex filename
1488135   21089   11560 1520784  173490 libc.so
 107244     960    8448  116652   1c7ac nptl/libpthread.so


When looking at a given function like `__libc_pwrite' it gets reduced
by 13 instructions in both PIC and non-PIC cases. However we need to
add the 16 instructions of __libc_do_syscall.
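
For convenience, the text-size differences implied by the two pairs of
tables can be worked out with a trivial helper.  This is only arithmetic
on the figures quoted in this thread; `size_delta' is a hypothetical
name, and reading the first table of each pair as "before" is an
assumption.  The two builds have different baselines, so only the deltas
are comparable, not the absolute sizes.

```c
/* Change in section size, in bytes: positive means growth.  */
long
size_delta (long before, long after)
{
  return after - before;
}

/* VLA proposal, text column:
     size_delta (1483315, 1484295)   libc.so
     size_delta (105482, 105974)     nptl/libpthread.so
   __libc_do_syscall proposal, text column:
     size_delta (1489767, 1488135)   libc.so
     size_delta (107908, 107244)     nptl/libpthread.so  */
```

That gives +980 and +492 bytes of text for the VLA approach against
-1632 and -664 bytes for the out-of-line stub.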

Aurelien
  
Maciej W. Rozycki Aug. 17, 2017, 8:34 p.m. UTC | #4
On Thu, 17 Aug 2017, Adhemerval Zanella wrote:

> My point is that we should aim for safety against compiler optimizations
> (to avoid code breakage under compiler-defined default flags).  Given the
> current effort to *avoid* VLAs in glibc, I do not think it is a good
> approach to use one as a bridge to force GCC to generate the expected
> code.

 You certainly have a point here overall, although I don't think a VLA 
whose size is always 0 really hurts.  And we've used the approach with 
`alloca' since forever with no adverse effects until we added a place 
where the caller invokes the syscall wrapper in a loop.  So I wouldn't 
necessarily call it an issue.  Mind that this is target-specific code, so 
we can rely on a target-specific execution model rather than limiting 
ourselves to what generic ISO C guarantees.
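
 As a side illustration of the loop case, here is a minimal sketch of why 
an `alloca' inside a loop body behaves differently from a block-scoped 
array: memory obtained from `alloca' lives until the function returns, so 
every iteration takes a fresh piece of stack.  The helper name and round 
count are hypothetical, and this is only my reading of how the wrapper 
ended up misbehaving when inlined into a loop.

```c
#include <alloca.h>
#include <stdint.h>

enum { ROUNDS = 4 };	/* Hypothetical iteration count.  */

/* Call alloca once per loop iteration and return 1 if every call
   produced a different address, 0 otherwise.  All allocations stay
   live until this function returns, so they cannot overlap.  */
int
alloca_in_loop_distinct (void)
{
  uintptr_t seen[ROUNDS];

  for (int i = 0; i < ROUNDS; i++)
    {
      /* Same shape as the old FORCE_FRAME_POINTER macro.  */
      void *volatile p = alloca (4);
      seen[i] = (uintptr_t) p;
    }

  for (int i = 0; i < ROUNDS; i++)
    for (int j = i + 1; j < ROUNDS; j++)
      if (seen[i] == seen[j])
	return 0;

  return 1;
}
```

 With an unbounded loop in the caller the same pattern keeps moving $sp, 
whereas an array declared in the inner block is released at the end of 
each iteration.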

 Aurelien's figures indicating a clear size reduction certainly count as a 
pro though.

> I still think that trying to optimize for 5/6/7-argument syscalls is
> over-engineering in this *specific* case.  As I put in my last message,
> 5/6/7-argument syscalls are used for
> 
> pread, pwrite, lseek, llseek, ppoll, posix_fadvise, posix_fallocate,
> sync_file_range, fallocate, preadv, pwritev, preadv2, pwritev2, select,
> pselect, mmap, readahead, epoll_pwait, splice, recvfrom, sendto, recvmmsg,
> msgsnd, msgrcv, msgget, msgctl, semop, semget, semctl, semtimedop, shmat,
> shmdt, shmget, and shmctl.
> 
> These are the ones generated from C implementations (some are still
> auto-generated).  The majority of them are blocking syscalls, so the context
> switch plus the work required to complete the syscall itself will take
> proportionally almost all of the time.  So trying to squeeze out some cycles
> doesn't really pay off compared to code maintainability (all this
> discussion of which C construct would be safe enough to generate the
> correct stack spill, plus the current issue, should indicate that we should
> aim for correctness first).

 TBH, I find it questionable whether it's really the approach I proposed 
that requires more engineering (and long-term maintenance) effort rather 
than using a separate handwritten assembly-language call stub.  Especially 
if a non-standard calling convention is used.

 If everyone but me thinks there's a clear advantage in using such a 
handcoded stub though, then as I previously noted please adjust the 
affected MIPS16 stubs to avoid the extra indirection, i.e. you can call 
`__libc_do_syscall' directly from MIPS16 code as you'd do from regular 
MIPS or microMIPS code, as the lone reason for the existence of the MIPS16 
stubs is the inexistence of a MIPS16 SYSCALL instruction.

 Once you're done with that I can push your proposed change through MIPS16 
regression testing if that helps.  I can see if I can run microMIPS 
testing as well, although I'd have to double-check for an available board 
as I don't use one regularly.

  Maciej
  
Adhemerval Zanella Aug. 17, 2017, 9:09 p.m. UTC | #5
On 17/08/2017 17:34, Maciej W. Rozycki wrote:
> On Thu, 17 Aug 2017, Adhemerval Zanella wrote:
> 
>> My point is that we should aim for safety against compiler optimizations
>> (to avoid code breakage under compiler-defined default flags).  Given the
>> current effort to *avoid* VLAs in glibc, I do not think it is a good
>> approach to use one as a bridge to force GCC to generate the expected
>> code.
> 
>  You certainly have a point here overall, although I don't think a VLA 
> whose size is always 0 really hurts.  And we've used the approach with 
> `alloca' since forever with no adverse effects until we added a place 
> where the caller invokes the syscall wrapper in a loop.  So I wouldn't 
> necessarily call it an issue.  Mind that this is target-specific code, so 
> we can rely on a target-specific execution model rather than limiting 
> ourselves to what generic ISO C guarantees.
> 
>  Aurelien's figures indicating a clear size reduction certainly count as a 
> pro though.

Joseph pointed out another advantage of avoiding VLAs (building with
-Werror=alloca -Werror=vla).  My main problem here is that we are betting
that the compiler won't break our assumptions and will generate the desired
code, rather than relying on what it is supposed to provide.  Target-generic
ISO C gives us a better guarantee, and any deviation indicates a possible
compiler issue, not the other way around (as in this case).  My other point
is that we can optimize later if required, and IMHO that is hardly the case
here (at least for latency).

If I understood correctly, Aurelien's suggestion of returning err in v1
is not strictly ABI-conformant, so it will end up calling __libc_do_syscall
with a non-conformant calling convention (similar to the pipe implementation,
which requires an assembly-specific implementation on a lot of architectures
to get this right).  Again, this is something I would really like to avoid.

> 
>> I still think that trying to optimize for 5/6/7-argument syscalls is
>> over-engineering in this *specific* case.  As I put in my last message,
>> 5/6/7-argument syscalls are used for
>>
>> pread, pwrite, lseek, llseek, ppoll, posix_fadvise, posix_fallocate,
>> sync_file_range, fallocate, preadv, pwritev, preadv2, pwritev2, select,
>> pselect, mmap, readahead, epoll_pwait, splice, recvfrom, sendto, recvmmsg,
>> msgsnd, msgrcv, msgget, msgctl, semop, semget, semctl, semtimedop, shmat,
>> shmdt, shmget, and shmctl.
>>
>> These are the ones generated from C implementations (some are still
>> auto-generated).  The majority of them are blocking syscalls, so the context
>> switch plus the work required to complete the syscall itself will take
>> proportionally almost all of the time.  So trying to squeeze out some cycles
>> doesn't really pay off compared to code maintainability (all this
>> discussion of which C construct would be safe enough to generate the
>> correct stack spill, plus the current issue, should indicate that we should
>> aim for correctness first).
> 
>  TBH, I find it questionable whether it's really the approach I proposed 
> that requires more engineering (and long-term maintenance) effort rather 
> than using a separate handwritten assembly-language call stub.  Especially 
> if a non-standard calling convention is used.

IMHO the VLA suggestion is more fragile in the long term.

> 
>  If everyone but me thinks there's a clear advantage in using such a 
> handcoded stub though, then as I previously noted please adjust the 
> affected MIPS16 stubs to avoid the extra indirection, i.e. you can call 
> `__libc_do_syscall' directly from MIPS16 code as you'd do from regular 
> MIPS or microMIPS code, as the lone reason for the existence of the MIPS16 
> stubs is the inexistence of a MIPS16 SYSCALL instruction.

Ok, I will try to at least check it on qemu.  If you have any pointers on how
to correctly build a mips16 glibc, that would be helpful.

> 
>  Once you're done with that I can push your proposed change through MIPS16 
> regression testing if that helps.  I can see if I can run microMIPS 
> testing as well, although I'd have to double-check for an available board 
> as I don't use one regularly.
> 
>   Maciej
>
  
Aurelien Jarno Aug. 17, 2017, 9:34 p.m. UTC | #6
On 2017-08-17 18:09, Adhemerval Zanella wrote:
> 
> 
> On 17/08/2017 17:34, Maciej W. Rozycki wrote:
> > On Thu, 17 Aug 2017, Adhemerval Zanella wrote:
> > 
> >> My point is that we should aim for safety against compiler optimizations
> >> (to avoid code breakage under compiler-defined default flags).  Given the
> >> current effort to *avoid* VLAs in glibc, I do not think it is a good
> >> approach to use one as a bridge to force GCC to generate the expected
> >> code.
> > 
> >  You certainly have a point here overall, although I don't think a VLA 
> > whose size is always 0 really hurts.  And we've used the approach with 
> > `alloca' since forever with no adverse effects until we added a place 
> > where the caller invokes the syscall wrapper in a loop.  So I wouldn't 
> > necessarily call it an issue.  Mind that this is target-specific code, so 
> > we can rely on a target-specific execution model rather than limiting 
> > ourselves to what generic ISO C guarantees.
> > 
> >  Aurelien's figures indicating a clear size reduction certainly count as a 
> > pro though.
> 
> Joseph pointed out another advantage of avoiding VLAs (building with
> -Werror=alloca -Werror=vla).  My main problem here is that we are betting
> that the compiler won't break our assumptions and will generate the desired
> code, rather than relying on what it is supposed to provide.  Target-generic
> ISO C gives us a better guarantee, and any deviation indicates a possible
> compiler issue, not the other way around (as in this case).  My other point
> is that we can optimize later if required, and IMHO that is hardly the case
> here (at least for latency).
> 
> If I understood correctly, Aurelien's suggestion of returning err in v1
> is not strictly ABI-conformant, so it will end up calling __libc_do_syscall
> with a non-conformant calling convention (similar to the pipe implementation,
> which requires an assembly-specific implementation on a lot of architectures
> to get this right).  Again, this is something I would really like to avoid.
> 

In the ABI v1 is used in pair with v0 to return 64-bit values. In my
patch the __libc_do_syscall is declared as returning a long long. The
value is then split using a union, in a similar way to what is already
done for the mips16 code.
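
As a rough illustration of the splitting described above, here is a
minimal sketch, not the code from my patch.  The union layout, the field
names and the `fake_do_syscall' stand-in are all hypothetical; 32-bit
fields are used so that the example also behaves on a 64-bit host, where
`long' would be too wide.

```c
#include <stdint.h>

/* Hypothetical view of a 64-bit return value as two 32-bit words.
   On o32 the pair would come back in $v0/$v1.  */
union syscall_ret
{
  long long val;
  struct
  {
    int32_t v0;		/* Syscall result.  */
    int32_t v1;		/* Error flag.  */
  } reg;
};

/* Stand-in for the assembly stub: packs a result and an error flag
   into a single long long, the way the caller would receive them.  */
long long
fake_do_syscall (int32_t result, int32_t err)
{
  union syscall_ret r;
  r.reg.v0 = result;
  r.reg.v1 = err;
  return r.val;
}

/* Caller side: split the 64-bit value back into its two words.  */
int32_t
syscall_result (long long packed)
{
  union syscall_ret r;
  r.val = packed;
  return r.reg.v0;
}

int32_t
syscall_error (long long packed)
{
  union syscall_ret r;
  r.val = packed;
  return r.reg.v1;
}
```

Because both sides go through the same union, the round trip does not
depend on the endianness of the machine running the example.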
  
Maciej W. Rozycki Aug. 17, 2017, 9:47 p.m. UTC | #7
On Thu, 17 Aug 2017, Adhemerval Zanella wrote:

> If I understood correctly, Aurelien's suggestion of returning err in v1
> is not strictly ABI-conformant, so it will end up calling __libc_do_syscall
> with a non-conformant calling convention (similar to the pipe implementation,
> which requires an assembly-specific implementation on a lot of architectures
> to get this right).  Again, this is something I would really like to avoid.

 Using $v1 is fine, in ABI terms it's just a part of a `long long' result, 
and you can access it in plain C in the caller (shifting and masking 
individual 32-bit halves if necessary).  I've done it myself in the past 
in some bare-metal library code.
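
 A minimal sketch of the shift-and-mask variant, for illustration only: 
the helper names are hypothetical, and which half ends up in $v0 and which 
in $v1 depends on the endianness of the target, which this sketch does not 
model.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit value.  */
uint32_t
high_half (long long x)
{
  return (uint32_t) ((unsigned long long) x >> 32);
}

/* Lower 32 bits of a 64-bit value.  */
uint32_t
low_half (long long x)
{
  return (uint32_t) ((unsigned long long) x & 0xffffffffULL);
}

/* Inverse operation: combine two 32-bit halves into one value.  */
long long
join_halves (uint32_t high, uint32_t low)
{
  return (long long) (((unsigned long long) high << 32) | low);
}
```

 Going through `unsigned long long' avoids shifting a negative signed 
value, so the helpers stay within what plain C defines.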

> >  If everyone but me thinks there's a clear advantage in using such a 
> > handcoded stub though, then as I previously noted please adjust the 
> > affected MIPS16 stubs to avoid the extra indirection, i.e. you can call 
> > `__libc_do_syscall' directly from MIPS16 code as you'd do from regular 
> > MIPS or microMIPS code, as the lone reason for the existence of the MIPS16 
> > stubs is the inexistence of a MIPS16 SYSCALL instruction.
> 
> Ok, I will try to at least check it on qemu.  If you have any pointers on how
> to correctly build a mips16 glibc, that would be helpful.

 Just pass `-mips16' along with CFLAGS.  You may have to make sure your 
GCC configuration includes/supports a suitable MIPS16 multilib though 
(i.e. MIPS16 libgcc.a and CRT files of your chosen endianness; check with 
`-print-multi-lib' for entries with `@mips16'), to avoid interlinking 
scenarios that may not be supported.  I don't remember offhand what the 
defaults for the individual GCC configurations are, although I'm fairly 
sure at least one of `mips-mti-linux-gnu' and `mips-img-linux-gnu' 
configurations does have MIPS16 multilibs.  Let me know if you have 
troubles with that.

  Maciej
  

Patch

Index: glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
===================================================================
--- glibc.orig/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-04-11 21:27:25.000000000 +0100
+++ glibc/sysdeps/unix/sysv/linux/mips/mips32/sysdep.h	2017-08-16 20:49:15.758029215 +0100
@@ -264,18 +264,20 @@ 
 
 /* We need to use a frame pointer for the functions in which we
    adjust $sp around the syscall, or debug information and unwind
-   information will be $sp relative and thus wrong during the syscall.  As
-   of GCC 4.7, this is sufficient.  */
-#define FORCE_FRAME_POINTER						\
-  void *volatile __fp_force __attribute__ ((unused)) = alloca (4)
+   information will be $sp relative and thus wrong during the syscall.
+   We use a variable-length array to persuade GCC to use $fp.  */
 
 #define internal_syscall5(v0_init, input, number, err,			\
 			  arg1, arg2, arg3, arg4, arg5)			\
 ({									\
 	long _sys_result;						\
 									\
-	FORCE_FRAME_POINTER;						\
+	size_t s = 0;							\
+	asm ("" : "+r" (s));						\
 	{								\
+	char vla[s << 3];						\
+	asm ("" : : "p" (vla));						\
+									\
 	register long __s0 asm ("$16") __attribute__ ((unused))		\
 	  = (number);							\
 	register long __v0 asm ("$2");					\
@@ -306,8 +308,12 @@ 
 ({									\
 	long _sys_result;						\
 									\
-	FORCE_FRAME_POINTER;						\
+	size_t s = 0;							\
+	asm ("" : "+r" (s));						\
 	{								\
+	char vla[s << 3];						\
+	asm ("" : : "p" (vla));						\
+									\
 	register long __s0 asm ("$16") __attribute__ ((unused))		\
 	  = (number);							\
 	register long __v0 asm ("$2");					\
@@ -339,8 +345,12 @@ 
 ({									\
 	long _sys_result;						\
 									\
-	FORCE_FRAME_POINTER;						\
+	size_t s = 0;							\
+	asm ("" : "+r" (s));						\
 	{								\
+	char vla[s << 3];						\
+	asm ("" : : "p" (vla));						\
+									\
 	register long __s0 asm ("$16") __attribute__ ((unused))		\
 	  = (number);							\
 	register long __v0 asm ("$2");					\