[v2,6/7] Alpha: Add option to avoid data races for sub-longword memory stores [PR117759]

Message ID alpine.DEB.2.21.2501050344030.49841@angie.orcam.me.uk
State Under Review
Series Fix data races with sub-longword accesses on Alpha

Checks

Context                                     Check  Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm  fail   Patch failed to apply

Commit Message

Maciej W. Rozycki Jan. 6, 2025, 1:03 p.m. UTC
  With non-BWX Alpha implementations we have a problem of data races where 
an 8-bit byte or 16-bit word quantity is to be written to memory in that 
in those cases we use an unprotected RMW access of a 32-bit longword or 
64-bit quadword width.  If contents of the longword or quadword accessed 
outside the byte or word to be written are changed midway through by a 
concurrent write executing on the same CPU such as by a signal handler 
or a parallel write executing on another CPU such as by another thread 
or via a shared memory segment, then the concluding write of the RMW 
access will clobber them.  This is especially important for the safety 
of RCU algorithms, but is otherwise an issue anyway.
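
For illustration only, the race can be modelled in portable C.  This is a
minimal sketch and not code from the patch: `store_byte_rmw', the
`interloper' callback and `demo_lost_update' are hypothetical names, and
the callback merely stands in for a signal handler or another CPU writing
a neighbouring byte between the load and the store of the RMW sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hook: models a concurrent writer (signal handler, other
   CPU) that runs in the middle of the read-modify-write sequence.  */
typedef void (*interloper_fn) (uint32_t *word);

/* Model of the unprotected non-BWX byte store: load the whole longword,
   replace one byte lane, store the whole longword back.  */
static void
store_byte_rmw (uint32_t *word, unsigned lane, uint8_t value,
                interloper_fn interloper)
{
  assert (lane < 4);
  unsigned shift = lane * 8;

  uint32_t tmp = *word;                 /* Read.  */
  if (interloper)
    interloper (word);                  /* Concurrent write lands here.  */
  tmp &= ~((uint32_t) 0xff << shift);   /* Modify: clear the lane...  */
  tmp |= (uint32_t) value << shift;     /* ...and insert the new byte.  */
  *word = tmp;                          /* Write: stale neighbours go back.  */
}

/* The "other" writer: updates byte 1 of the same longword.  */
static void
set_byte1_to_0x55 (uint32_t *word)
{
  *word = (*word & ~(uint32_t) 0xff00) | 0x5500;
}

/* Store 0xaa to byte 0 while byte 1 is updated midway through.  */
static uint32_t
demo_lost_update (void)
{
  uint32_t word = 0;
  store_byte_rmw (&word, 0, 0xaa, set_byte1_to_0x55);
  return word;
}
```

Calling `demo_lost_update ()' returns 0x000000aa: the 0x55 written to byte
1 by the interloper has been overwritten by the stale copy of the longword.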

To guard against these data races with byte and aligned word quantities 
introduce the `-msafe-bwa' command-line option (standing for Safe Byte & 
Word Access) that instructs the compiler to instead use an atomic RMW 
access sequence where byte and word memory access machine instructions 
are not available.  There is no change to code produced for BWX targets.
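
The intended effect can be approximated in portable C11.  This is a sketch
under stated assumptions rather than the sequence the compiler emits: a
compare-and-swap retry loop stands in for the LDL_L/STL_C pair,
`store_byte_atomic' is a hypothetical name, and relaxed ordering is
assumed on the grounds that an ordinary store promises no ordering either.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Store VALUE into byte lane LANE (0..3, least significant first) of
   *WORD without ever writing back stale contents of the other lanes.  */
static void
store_byte_atomic (_Atomic uint32_t *word, unsigned lane, uint8_t value)
{
  assert (lane < 4);
  unsigned shift = lane * 8;
  uint32_t mask = (uint32_t) 0xff << shift;
  uint32_t old = atomic_load_explicit (word, memory_order_relaxed);
  uint32_t merged;

  do
    /* Merge the new byte into the most recently observed longword.  On
       failure the compare-exchange refreshes OLD with the current
       contents, so the merge is redone against up-to-date neighbours.  */
    merged = (old & ~mask) | ((uint32_t) value << shift);
  while (!atomic_compare_exchange_weak_explicit (word, &old, merged,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed));
}
```

The retry on failure plays the role of the backward branch taken when the
store-conditional reports that the location was written in the meantime.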

It would be sufficient for the secondary reload handler to use a pair of 
scratch registers, as requested by `reload_out<mode>', but it would end 
with poor code produced as one of the scratches would be occupied by 
data retrieved and the other one would have to be reloaded with repeated 
calculations, all within the LL/SC sequence.

Therefore I chose to add a dedicated `reload_out<mode>_safe_bwa' handler 
and ask for more scratches there by defining a 256-bit OI integer mode.  
While reload is documented in our manual to support an arbitrary number 
of scratches, in reality it hasn't been implemented for IRA:

/* ??? It would be useful to be able to handle only two, or more than
   three, operands, but for now we can only handle the case of having
   exactly three: output, input and one temp/scratch.  */

and it seems to be the case for LRA as well.  Do what everyone else does 
then and just have one wide multi-register scratch.
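
As a rough illustration of why the scratch is 256 bits wide, the four
64-bit roles carved out of it by the `reload_out<mode>_safe_bwa' expander
in the patch can be written down in C.  The struct and the `split_address'
helper are hypothetical and exist only for this sketch; the roles and the
`addr & -8' calculation follow the expander.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: one field per DImode register taken from the OI
   scratch (regno, regno + 1, regno + 2, regno + 3).  */
struct safe_bwa_scratch
{
  uint64_t status;        /* Load-locked data, then the SC status.  */
  uint64_t addr;          /* Byte address of the store.  */
  uint64_t aligned_addr;  /* Enclosing quadword: addr & -8.  */
  uint64_t scratch;       /* Data shifted into its byte lane.  */
};

/* Four 64-bit registers are 32 bytes, i.e. 256 bits, which is what
   INT_MODE (OI, 32) provides.  */
_Static_assert (sizeof (struct safe_bwa_scratch) == 32,
                "four 64-bit scratches occupy 256 bits");

/* Split a byte address into the aligned quadword address used by the
   locked load and the bit position of the byte within that quadword.  */
static void
split_address (uint64_t addr, uint64_t *aligned_addr, unsigned *bit_shift)
{
  *aligned_addr = addr & ~(uint64_t) 7;     /* Same as addr & -8.  */
  *bit_shift = (unsigned) (addr & 7) * 8;   /* Byte offset times 8.  */
}
```

Keeping both addresses live across the loop is what avoids recomputing
them on every retry, at the cost of the two extra registers.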

I note that the atomic sequences emitted are suboptimal performance-wise 
as the looping branch for the unsuccessful completion of the sequence 
points backwards, which means it will be predicted as taken despite that 
in most cases it will fall through.  I do not see it as a deficiency of 
this change proposed as it takes care of recording that the branch is 
unlikely to be taken, by calling `alpha_emit_unlikely_jump'.  Therefore 
generic code elsewhere shou

Add test cases accordingly.

There are notable regressions between a plain `-mno-bwx' configuration 
and a `-mno-bwx -msafe-bwa' one:

FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O0  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O1  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O3 -g  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -Os  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: g++.dg/init/array25.C  -std=c++17 execution test
FAIL: g++.dg/init/array25.C  -std=c++98 execution test
FAIL: g++.dg/init/array25.C  -std=c++26 execution test

They come from the fact that these test cases play tricks with alignment 
and end up calling code that expects a reference to aligned data but is 
handed one to unaligned data.

This doesn't cause a visible problem with plain `-mno-bwx' code, because 
the resulting alignment exception is fixed up by Linux.  There's no such 
handling currently implemented for LDL_L or LDQ_L instructions (which 
are first in the sequence) and consequently the offender is issued with 
SIGBUS instead.  Suitable handling will be added to Linux to complement 
this change, so these regressions are seen as harmless and expected.

	gcc/
	PR target/117759
	* config/alpha/alpha-modes.def (OI): New integer mode.
	* config/alpha/alpha-protos.h (alpha_expand_mov_safe_bwa): New 
	prototype.
	* config/alpha/alpha.cc (alpha_expand_mov_safe_bwa): New 
	function.
	(alpha_secondary_reload): Handle TARGET_SAFE_BWA.
	* config/alpha/alpha.md (aligned_store_safe_bwa)
	(unaligned_store<mode>_safe_bwa, reload_out<mode>_safe_bwa)
	(reload_out<mode>_unaligned_safe_bwa): New expanders.
	(mov<mode>, movcqi, reload_out<mode>_aligned): Handle 
	TARGET_SAFE_BWA.
	(reload_out<mode>): Guard against TARGET_SAFE_BWA.
	* config/alpha/alpha.opt (msafe-bwa): New option.
	* config/alpha/alpha.opt.urls: Regenerate.
	* doc/invoke.texi (Option Summary, DEC Alpha Options): Document 
	the new option.

	gcc/testsuite/
	PR target/117759
	* gcc.target/alpha/stb.c: New file.
	* gcc.target/alpha/stb-bwa.c: New file.
	* gcc.target/alpha/stb-bwx.c: New file.
	* gcc.target/alpha/stba.c: New file.
	* gcc.target/alpha/stba-bwa.c: New file.
	* gcc.target/alpha/stba-bwx.c: New file.
	* gcc.target/alpha/stw.c: New file.
	* gcc.target/alpha/stw-bwa.c: New file.
	* gcc.target/alpha/stw-bwx.c: New file.
	* gcc.target/alpha/stwa.c: New file.
	* gcc.target/alpha/stwa-bwa.c: New file.
	* gcc.target/alpha/stwa-bwx.c: New file.
---
 NB I note that there is a warning in gcc/config/alpha/sync.md that it is 
unpredictable if the lock_flag is cleared or not by a normal load or store 
executed on the same CPU and therefore we need to make sure no register 
spill is inserted in the sequence.  I seem not to have seen it actually 
happen and testing results with actual hardware do look good.

 However, out of an abundance of caution we may want to make sure it can't
happen.  It should be quite a straightforward change, but owing to the 
number of issues encountered, as indicated by the size of the patchset, 
and the limited time I did not get to it in time for Stage 1 closure.  So 
I've chosen to post this change for review anyway with the intent to make 
a suitable update in the coming weeks.  As it only affects the newly-added 
`-msafe-bwa' option I don't think it will be disruptive to the pre-release 
stabilisation process.  I will appreciate input for this part.

 NB2 I reckon the manual ought to be updated to say only one scratch is 
permitted for secondary reloads.  While it's been documented otherwise since 
forever, it's never actually matched reality.

Changes from v1:

- Add a reference to PR target/117759.
---
 gcc/config/alpha/alpha-modes.def          |    4 
 gcc/config/alpha/alpha-protos.h           |    1 
 gcc/config/alpha/alpha.cc                 |   68 ++++++++++++-
 gcc/config/alpha/alpha.md                 |  155 +++++++++++++++++++++++++++++-
 gcc/config/alpha/alpha.opt                |    4 
 gcc/config/alpha/alpha.opt.urls           |    3 
 gcc/doc/invoke.texi                       |    9 +
 gcc/testsuite/gcc.target/alpha/stb-bwa.c  |   28 +++++
 gcc/testsuite/gcc.target/alpha/stb-bwx.c  |   16 +++
 gcc/testsuite/gcc.target/alpha/stb.c      |   25 ++++
 gcc/testsuite/gcc.target/alpha/stba-bwa.c |   35 ++++++
 gcc/testsuite/gcc.target/alpha/stba-bwx.c |   23 ++++
 gcc/testsuite/gcc.target/alpha/stba.c     |   33 ++++++
 gcc/testsuite/gcc.target/alpha/stw-bwa.c  |   28 +++++
 gcc/testsuite/gcc.target/alpha/stw-bwx.c  |   16 +++
 gcc/testsuite/gcc.target/alpha/stw.c      |   25 ++++
 gcc/testsuite/gcc.target/alpha/stwa-bwa.c |   35 ++++++
 gcc/testsuite/gcc.target/alpha/stwa-bwx.c |   23 ++++
 gcc/testsuite/gcc.target/alpha/stwa.c     |   33 ++++++
 19 files changed, 558 insertions(+), 6 deletions(-)

gcc-alpha-safe-bwa.diff
  

Comments

Jeff Law Jan. 7, 2025, 12:12 a.m. UTC | #1
On 1/6/25 6:03 AM, Maciej W. Rozycki wrote:
> With non-BWX Alpha implementations we have a problem of data races where
> an 8-bit byte or 16-bit word quantity is to be written to memory in that
> in those cases we use an unprotected RMW access of a 32-bit longword or
> 64-bit quadword width.  If contents of the longword or quadword accessed
> outside the byte or word to be written are changed midway through by a
> concurrent write executing on the same CPU such as by a signal handler
> or a parallel write executing on another CPU such as by another thread
> or via a shared memory segment, then the concluding write of the RMW
> access will clobber them.  This is especially important for the safety
> of RCU algorithms, but is otherwise an issue anyway.
But in the case of concurrent accesses, shouldn't these objects be 
declared as atomic?  Similarly for objects potentially accessed in a 
signal handler shouldn't they be accessed via sig_atomic_t?

Point being I'm not 100% sure we really need to tackle this problem in a 
fully generic manner for all 8/16 bit accesses.

What am I missing here?

jeff
  
Paul E. McKenney Jan. 7, 2025, 12:22 a.m. UTC | #2
On Mon, Jan 06, 2025 at 05:12:57PM -0700, Jeff Law wrote:
> 
> 
> On 1/6/25 6:03 AM, Maciej W. Rozycki wrote:
> > With non-BWX Alpha implementations we have a problem of data races where
> > an 8-bit byte or 16-bit word quantity is to be written to memory in that
> > in those cases we use an unprotected RMW access of a 32-bit longword or
> > 64-bit quadword width.  If contents of the longword or quadword accessed
> > outside the byte or word to be written are changed midway through by a
> > concurrent write executing on the same CPU such as by a signal handler
> > or a parallel write executing on another CPU such as by another thread
> > or via a shared memory segment, then the concluding write of the RMW
> > access will clobber them.  This is especially important for the safety
> > of RCU algorithms, but is otherwise an issue anyway.
> But in the case of concurrent accesses, shouldn't these objects be declared
> as atomic?  Similarly for objects potentially accessed in a signal handler
> shouldn't they be accessed via sig_atomic_t?
> 
> Point being I'm not 100% sure we really need to tackle this problem in a
> fully generic manner for all 8/16 bit accesses
> 
> What am I missing here?

Doesn't the behavior Maciej is describing constitute a data race injected
by the compiler?  As of C11 and C++11, this is forbidden, correct?

							Thanx, Paul
  
Linus Torvalds Jan. 7, 2025, 12:59 a.m. UTC | #3
On Mon, 6 Jan 2025 at 16:13, Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> But in the case of concurrent accesses, shouldn't these objects be
> declared as atomic?

No.

They aren't concurrent accesses to the same variable.

They are concurrent accesses to *different* memory locations, and the
compiler is not allowed to mess them up.

IOW, if you have

    struct myvar {
        pthread_mutex_t buffer_lock;
        pthread_mutex_t another_var_lock;
        char buffer[7];
        char another_var;
    };

and "buffer_lock" serializes accesses to "buffer[]", and
"another_var_lock" serializes accesses to "another_var", then the
compiler IS NOT ALLOWED TO TOUCH "another_var" when the code touches
"buffer[]".

So if a compiler turns "memset(var->buffer, 0, 7)" into "load 8 bytes,
clear 7 of the bytes, store 8 bytes", then the compiler is buggy.

Because that messes up another thread that accesses "another_var", and
the 8-byte write may write back an old value that is no longer valid.

There is absolutely no gray area here. It was always buggy, and the
alpha architecture was always completely and fundamentally
mis-designed.

C11 made it explicitly clear:

  "Different threads of execution are always allowed to access (read
and modify) different memory locations concurrently, with no
interference and no synchronization requirements"

but honestly, that was just codifying something that should have been
blindingly obvious even before.

                Linus
  
Jeff Law Jan. 7, 2025, 2:02 a.m. UTC | #4
On 1/6/25 5:59 PM, Linus Torvalds wrote:
> On Mon, 6 Jan 2025 at 16:13, Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>> But in the case of concurrent accesses, shouldn't these objects be
>> declared as atomic?
> 
> No.
> 
> They aren't concurrent accesses to the same variable.
> 
> They are concurrent accesses to *different* memory locations, and the
> compiler is not allowed to mess them up.
You're absolutely right.  Definitely not kosher.  Thanks.

Jeff
  
Jeff Law Jan. 7, 2025, 4:20 a.m. UTC | #5
On 1/6/25 6:03 AM, Maciej W. Rozycki wrote:
> With non-BWX Alpha implementations we have a problem of data races where
> an 8-bit byte or 16-bit word quantity is to be written to memory in that
> in those cases we use an unprotected RMW access of a 32-bit longword or
> 64-bit quadword width.  If contents of the longword or quadword accessed
> outside the byte or word to be written are changed midway through by a
> concurrent write executing on the same CPU such as by a signal handler
> or a parallel write executing on another CPU such as by another thread
> or via a shared memory segment, then the concluding write of the RMW
> access will clobber them.  This is especially important for the safety
> of RCU algorithms, but is otherwise an issue anyway.
> 
> To guard against these data races with byte and aligned word quantities
> introduce the `-msafe-bwa' command-line option (standing for Safe Byte &
> Word Access) that instructs the compiler to instead use an atomic RMW
> access sequence where byte and word memory access machine instructions
> are not available.  There is no change to code produced for BWX targets.
> 
> It would be sufficient for the secondary reload handler to use a pair of
> scratch registers, as requested by `reload_out<mode>', but it would end
> with poor code produced as one of the scratches would be occupied by
> data retrieved and the other one would have to be reloaded with repeated
> calculations, all within the LL/SC sequence.
> 
> Therefore I chose to add a dedicated `reload_out<mode>_safe_bwa' handler
> and ask for more scratches there by defining a 256-bit OI integer mode.
> While reload is documented in our manual to support an arbitrary number
> of scratches in reality it hasn't been implemented for IRA:
> 
> /* ??? It would be useful to be able to handle only two, or more than
>     three, operands, but for now we can only handle the case of having
>     exactly three: output, input and one temp/scratch.  */
> 
> and it seems to be the case for LRA as well.  Do what everyone else does
> then and just have one wide multi-register scratch.
> 
> I note that the atomic sequences emitted are suboptimal performance-wise
> as the looping branch for the unsuccessful completion of the sequence
> points backwards, which means it will be predicted as taken despite that
> in most cases it will fall through.  I do not see it as a deficiency of
> this change proposed as it takes care of recording that the branch is
> unlikely to be taken, by calling `alpha_emit_unlikely_jump'.  Therefore
> generic code elsewhere shou
Looks like this got truncated.  Anyway, easy to forget how limited 
branch prediction was in this era.  I haven't pondered static branch 
prediction in forever.  I wouldn't worry too much about this case -- we 
can always come back to it if generic doesn't do the right thing with 
code layout.

> 
> Add test cases accordingly.
> 
> There are notable regressions between a plain `-mno-bwx' configuration
> and a `-mno-bwx -msafe-bwa' one:
> 
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O0  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O1  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O3 -g  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -Os  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
> FAIL: gcc.dg/torture/inline-mem-cpy-cmp-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
> FAIL: g++.dg/init/array25.C  -std=c++17 execution test
> FAIL: g++.dg/init/array25.C  -std=c++98 execution test
> FAIL: g++.dg/init/array25.C  -std=c++26 execution test
> 
> They come from the fact that these test cases play tricks with alignment
> and end up calling code that expects a reference to aligned data but is
> handed one to unaligned data.
> 
> This doesn't cause a visible problem with plain `-mno-bwx' code, because
> the resulting alignment exception is fixed up by Linux.  There's no such
> handling currently implemented for LDL_L or LDQ_L instructions (which
> are first in the sequence) and consequently the offender is issued with
> SIGBUS instead.  Suitable handling will be added to Linux to complement
> this change, so these regressions are seen as harmless and expected.
> 
> 	gcc/
> 	PR target/117759
> 	* config/alpha/alpha-modes.def (OI): New integer mode.
> 	* config/alpha/alpha-protos.h (alpha_expand_mov_safe_bwa): New
> 	prototype.
> 	* config/alpha/alpha.cc (alpha_expand_mov_safe_bwa): New
> 	function.
> 	(alpha_secondary_reload): Handle TARGET_SAFE_BWA.
> 	* config/alpha/alpha.md (aligned_store_safe_bwa)
> 	(unaligned_store<mode>_safe_bwa, reload_out<mode>_safe_bwa)
> 	(reload_out<mode>_unaligned_safe_bwa): New expanders.
> 	(mov<mode>, movcqi, reload_out<mode>_aligned): Handle
> 	TARGET_SAFE_BWA.
> 	(reload_out<mode>): Guard against TARGET_SAFE_BWA.
> 	* config/alpha/alpha.opt (msafe-bwa): New option.
> 	* config/alpha/alpha.opt.urls: Regenerate.
> 	* doc/invoke.texi (Option Summary, DEC Alpha Options): Document
> 	the new option.
> 
> 	gcc/testsuite/
> 	PR target/117759
> 	* gcc.target/alpha/stb.c: New file.
> 	* gcc.target/alpha/stb-bwa.c: New file.
> 	* gcc.target/alpha/stb-bwx.c: New file.
> 	* gcc.target/alpha/stba.c: New file.
> 	* gcc.target/alpha/stba-bwa.c: New file.
> 	* gcc.target/alpha/stba-bwx.c: New file.
> 	* gcc.target/alpha/stw.c: New file.
> 	* gcc.target/alpha/stw-bwa.c: New file.
> 	* gcc.target/alpha/stw-bwx.c: New file.
> 	* gcc.target/alpha/stwa.c: New file.
> 	* gcc.target/alpha/stwa-bwa.c: New file.
> 	* gcc.target/alpha/stwa-bwx.c: New file.
> ---
>   NB I note that there is a warning in gcc/config/alpha/sync.md that it is
> unpredictable if the lock_flag is cleared or not by a normal load or store
> executed on the same CPU and therefore we need to make sure no register
> spill is inserted in the sequence.  I seem not to have seen it actually
> happen and testing results with actual hardware do look good.
If it's a real issue in practice and we have to revisit this code, then 
we could look at hard barriers before/after the sequence to prevent 
scheduling into the sequence.  Or we could emit the entire sequence as 
an atomic unit much like some ports do for inlined subword atomics.

> 
>   However out of the abundance of caution we may want to make sure it can't
> happen.  It should be quite a straightforward change, but owing to the
> number of issues encountered, as indicated by the size of the patchset,
> and the limited time I did not get to it in time for Stage 1 closure.  So
> I've chosen to post this change for review anyway with the intent to make
> a suitable update in the coming weeks.  As it only affects newly-added
> `-msafe-bwa' option I don't think it will be disruptive to the pre-release
> stabilisation process.  I will appreciate input for this part
> 
>   NB2 I reckon the manual ought to be updated to say only one scratch is
> permitted for secondary reloads.  While it's documented otherwise since
> forever, it's never actually matched reality.
ISTM that could be a follow-up.  Consider such an update pre-approved.

OK for the trunk.

jeff
  
Linus Torvalds Jan. 7, 2025, 5:18 a.m. UTC | #6
On Mon, 6 Jan 2025 at 16:59, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There is absolutely no gray area here. It was always buggy, and the
> alpha architecture was always completely and fundamentally
> mis-designed.

Note that I really do want to re-emphasize that while I think it's
kind of interesting that Maciej is trying to make gcc DTRT on alpha,
the non-BWX machines really are a completely broken architecture and
almost entirely unfixable.

Yeah, yeah, Maciej also has patches to turn all the ldq_u/stq_u
sequences for regular byte accesses into ldl_l / stl_c
sequences, but those instructions take hundreds of cycles and go out
on the bus outside the CPU.

So using actual thread-safe byte ops with ldl_l / stl_c turns those
non-BWX alpha CPUs into something very sad and pointless. You might
as well go full retro and use a 6502 or something.

And even the newer alphas that *had* BWX were designed to still do
byte operations with quadword ops and masking. Yes, byte ops existed,
but they were still very much designed as an "only when you really
can't use the masked word model".

For example, the standard "memset()" library routine for EV6 literally
does exactly the thing that I said would be a compiler bug to do.

Because that's literally the core design of the architecture: it's buggy.

So while I applaud Maciej's efforts, I'm not convinced they are all
that productive. Even with a fixed compiler, it's all broken anyway.

Of course, most of the time, you won't ever see the breakage. It's
there, but hitting it in practice is almost impossible.

The Linux kernel uses the known-broken memcpy and memset library code.
All user space does the same.

They are hopelessly and fundamentally broken, but they work in
practice as long as you don't have concurrent accesses "near-by" in
space-time.

                 Linus
  

Patch

Index: gcc/gcc/config/alpha/alpha-modes.def
===================================================================
--- gcc.orig/gcc/config/alpha/alpha-modes.def
+++ gcc/gcc/config/alpha/alpha-modes.def
@@ -17,6 +17,10 @@  You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
+/* 256-bit integer mode used by "reload_out<mode>_safe_bwa" secondary
+   reload patterns to obtain 4 scratch registers.  */
+INT_MODE (OI, 32);
+
 /* 128-bit floating point.  This gets reset in alpha_option_override
    if VAX float format is in use.  */
 FLOAT_MODE (TF, 16, ieee_quad_format);
Index: gcc/gcc/config/alpha/alpha-protos.h
===================================================================
--- gcc.orig/gcc/config/alpha/alpha-protos.h
+++ gcc/gcc/config/alpha/alpha-protos.h
@@ -43,6 +43,7 @@  extern enum reg_class alpha_preferred_re
 extern void alpha_set_memflags (rtx, rtx);
 extern bool alpha_split_const_mov (machine_mode, rtx *);
 extern bool alpha_expand_mov (machine_mode, rtx *);
+extern bool alpha_expand_mov_safe_bwa (machine_mode, rtx *);
 extern bool alpha_expand_mov_nobwx (machine_mode, rtx *);
 extern void alpha_expand_movmisalign (machine_mode, rtx *);
 extern void alpha_emit_floatuns (rtx[]);
Index: gcc/gcc/config/alpha/alpha.cc
===================================================================
--- gcc.orig/gcc/config/alpha/alpha.cc
+++ gcc/gcc/config/alpha/alpha.cc
@@ -1660,8 +1660,10 @@  alpha_secondary_reload (bool in_p, rtx x
 	      if (!aligned_memory_operand (x, mode))
 		sri->icode = direct_optab_handler (reload_in_optab, mode);
 	    }
-	  else
+	  else if (aligned_memory_operand (x, mode) || !TARGET_SAFE_BWA)
 	    sri->icode = direct_optab_handler (reload_out_optab, mode);
+	  else
+	    sri->icode = code_for_reload_out_safe_bwa (mode);
 	  return NO_REGS;
 	}
     }
@@ -2386,6 +2388,70 @@  alpha_expand_mov_nobwx (machine_mode mod
 	}
       return true;
     }
+
+  return false;
+}
+
+/* Expand a multi-thread and async-signal safe QImode or HImode
+   move instruction; return true if all work is done.  */
+
+bool
+alpha_expand_mov_safe_bwa (machine_mode mode, rtx *operands)
+{
+  /* If the output is not a register, the input must be.  */
+  if (MEM_P (operands[0]))
+    operands[1] = force_reg (mode, operands[1]);
+
+  /* If it's a memory load, the sequence is the usual non-BWX one.  */
+  if (any_memory_operand (operands[1], mode))
+    return alpha_expand_mov_nobwx (mode, operands);
+
+  /* Handle memory store cases, unaligned and aligned.  The only case
+     where we can be called during reload is for aligned loads; all
+     other cases require temporaries.  */
+  if (any_memory_operand (operands[0], mode))
+    {
+      if (aligned_memory_operand (operands[0], mode))
+	{
+	  rtx label = gen_rtx_LABEL_REF (DImode, gen_label_rtx ());
+	  emit_label (XEXP (label, 0));
+
+	  rtx aligned_mem, bitnum;
+	  rtx status = gen_reg_rtx (SImode);
+	  rtx temp = gen_reg_rtx (SImode);
+	  get_aligned_mem (operands[0], &aligned_mem, &bitnum);
+	  emit_insn (gen_aligned_store_safe_bwa (aligned_mem, operands[1],
+						 bitnum, status, temp));
+
+	  rtx cond = gen_rtx_EQ (DImode,
+				 gen_rtx_SUBREG (DImode, status, 0),
+				 const0_rtx);
+  	  alpha_emit_unlikely_jump (cond, label);
+	}
+      else
+	{
+	  rtx addr = gen_reg_rtx (DImode);
+	  emit_insn (gen_rtx_SET (addr, get_unaligned_address (operands[0])));
+
+	  rtx aligned_addr = gen_reg_rtx (DImode);
+	  emit_insn (gen_rtx_SET (aligned_addr,
+				  gen_rtx_AND (DImode, addr, GEN_INT (-8))));
+
+	  rtx label = gen_rtx_LABEL_REF (DImode, gen_label_rtx ());
+	  emit_label (XEXP (label, 0));
+
+	  rtx status = gen_reg_rtx (DImode);
+	  rtx temp = gen_reg_rtx (DImode);
+	  rtx seq = gen_unaligned_store_safe_bwa (mode, addr, operands[1],
+						  aligned_addr, status, temp);
+	  alpha_set_memflags (seq, operands[0]);
+	  emit_insn (seq);
+
+	  rtx cond = gen_rtx_EQ (DImode, status, const0_rtx);
+  	  alpha_emit_unlikely_jump (cond, label);
+	}
+      return true;
+    }
 
   return false;
 }
Index: gcc/gcc/config/alpha/alpha.md
===================================================================
--- gcc.orig/gcc/config/alpha/alpha.md
+++ gcc/gcc/config/alpha/alpha.md
@@ -4200,6 +4200,31 @@ 
 			    << INTVAL (operands[2])));
 })
 
+;; Multi-thread and async-signal safe variant.  Operand 0 is the aligned
+;; SImode MEM.  Operand 1 is the data to store. Operand 2 is the number
+;; of bits within the word that the value should be placed.  Operand 3 is
+;; the SImode status.  Operand 4 is a SImode temporary.
+
+(define_expand "aligned_store_safe_bwa"
+  [(set (match_operand:SI 3 "register_operand")
+	(unspec_volatile:SI
+	  [(match_operand:SI 0 "memory_operand")] UNSPECV_LL))
+   (set (subreg:DI (match_dup 3) 0)
+	(and:DI (subreg:DI (match_dup 3) 0) (match_dup 5)))
+   (set (subreg:DI (match_operand:SI 4 "register_operand") 0)
+	(ashift:DI (zero_extend:DI (match_operand 1 "register_operand"))
+		   (match_operand:DI 2 "const_int_operand")))
+   (set (subreg:DI (match_dup 3) 0)
+	(ior:DI (subreg:DI (match_dup 4) 0) (subreg:DI (match_dup 3) 0)))
+   (parallel [(set (subreg:DI (match_dup 3) 0)
+		   (unspec_volatile:DI [(const_int 0)] UNSPECV_SC))
+	      (set (match_dup 0) (match_dup 3))])]
+  ""
+{
+  operands[5] = GEN_INT (~(GET_MODE_MASK (GET_MODE (operands[1]))
+			   << INTVAL (operands[2])));
+})
+
 ;; For the unaligned byte and halfword cases, we use code similar to that
 ;; in the Architecture book, but reordered to lower the number of registers
 ;; required.  Operand 0 is the address.  Operand 1 is the data to store.
@@ -4227,6 +4252,31 @@ 
   ""
   "operands[5] = GEN_INT (GET_MODE_MASK (<MODE>mode));")
 
+;; Multi-thread and async-signal safe variant.  Operand 0 is the address.
+;; Operand 1 is the data to store.  Operand 2 is the aligned address.
+;; Operand 3 is the DImode status.  Operand 4 is a DImode temporary.
+
+(define_expand "@unaligned_store<mode>_safe_bwa"
+  [(set (match_operand:DI 3 "register_operand")
+	(unspec_volatile:DI
+	  [(mem:DI (match_operand:DI 2 "register_operand"))] UNSPECV_LL))
+   (set (match_dup 3)
+	(and:DI (not:DI
+		  (ashift:DI (match_dup 5)
+			     (ashift:DI (match_operand:DI 0 "register_operand")
+					(const_int 3))))
+		(match_dup 3)))
+   (set (match_operand:DI 4 "register_operand")
+	(ashift:DI (zero_extend:DI
+		     (match_operand:I12MODE 1 "register_operand"))
+		   (ashift:DI (match_dup 0) (const_int 3))))
+   (set (match_dup 3) (ior:DI (match_dup 4) (match_dup 3)))
+   (parallel [(set (match_dup 3)
+		   (unspec_volatile:DI [(const_int 0)] UNSPECV_SC))
+	      (set (mem:DI (match_dup 2)) (match_dup 3))])]
+  ""
+  "operands[5] = GEN_INT (GET_MODE_MASK (<MODE>mode));")
+
 ;; Here are the define_expand's for QI and HI moves that use the above
 ;; patterns.  We have the normal sets, plus the ones that need scratch
 ;; registers for reload.
@@ -4236,8 +4286,8 @@ 
 	(match_operand:I12MODE 1 "general_operand"))]
   ""
 {
-  if (TARGET_BWX
-      ? alpha_expand_mov (<MODE>mode, operands)
+  if (TARGET_BWX ? alpha_expand_mov (<MODE>mode, operands)
+      : TARGET_SAFE_BWA ? alpha_expand_mov_safe_bwa (<MODE>mode, operands)
       : alpha_expand_mov_nobwx (<MODE>mode, operands))
     DONE;
 })
@@ -4292,7 +4342,9 @@ 
 	  operands[1] = gen_lowpart (HImode, operands[1]);
 	do_aligned2:
 	  operands[0] = gen_lowpart (HImode, operands[0]);
-	  done = alpha_expand_mov_nobwx (HImode, operands);
+	  done = (TARGET_SAFE_BWA
+		  ? alpha_expand_mov_safe_bwa (HImode, operands)
+		  : alpha_expand_mov_nobwx (HImode, operands));
 	  gcc_assert (done);
 	  DONE;
 	}
@@ -4371,6 +4423,8 @@ 
     }
   else
     {
+      gcc_assert (!TARGET_SAFE_BWA);
+
       rtx addr = get_unaligned_address (operands[0]);
       rtx scratch1 = gen_rtx_REG (DImode, regno);
       rtx scratch2 = gen_rtx_REG (DImode, regno + 1);
@@ -4388,6 +4442,52 @@ 
   DONE;
 })
 
+(define_expand "@reload_out<mode>_safe_bwa"
+  [(parallel [(match_operand:RELOAD12 0 "any_memory_operand" "=m")
+	      (match_operand:RELOAD12 1 "register_operand" "r")
+	      (match_operand:OI 2 "register_operand" "=&r")])]
+  "!TARGET_BWX && TARGET_SAFE_BWA"
+{
+  unsigned regno = REGNO (operands[2]);
+
+  if (<MODE>mode == CQImode)
+    {
+      operands[0] = gen_lowpart (HImode, operands[0]);
+      operands[1] = gen_lowpart (HImode, operands[1]);
+    }
+
+  rtx addr = get_unaligned_address (operands[0]);
+  rtx status = gen_rtx_REG (DImode, regno);
+  rtx areg = gen_rtx_REG (DImode, regno + 1);
+  rtx aligned_addr = gen_rtx_REG (DImode, regno + 2);
+  rtx scratch = gen_rtx_REG (DImode, regno + 3);
+
+  if (REG_P (addr))
+    areg = addr;
+  else
+    emit_move_insn (areg, addr);
+  emit_move_insn (aligned_addr, gen_rtx_AND (DImode, areg, GEN_INT (-8)));
+
+  rtx label = gen_label_rtx ();
+  emit_label (label);
+  LABEL_NUSES (label) = 1;
+
+  rtx seq = gen_reload_out<reloadmode>_unaligned_safe_bwa (areg, operands[1],
+							   aligned_addr,
+							   status, scratch);
+  alpha_set_memflags (seq, operands[0]);
+  emit_insn (seq);
+
+  rtx label_ref = gen_rtx_LABEL_REF (DImode, label);
+  rtx cond = gen_rtx_EQ (DImode, status, const0_rtx);
+  rtx jump = alpha_emit_unlikely_jump (cond, label_ref);
+  JUMP_LABEL (jump) = label;
+
+  cfun->split_basic_blocks_after_reload = 1;
+
+  DONE;
+})
+
 ;; Helpers for the above.  The way reload is structured, we can't
 ;; always get a proper address for a stack slot during reload_foo
 ;; expansion, so we must delay our address manipulations until after.
@@ -4420,10 +4520,55 @@ 
 {
   rtx aligned_mem, bitnum;
   get_aligned_mem (operands[0], &aligned_mem, &bitnum);
-  emit_insn (gen_aligned_store (aligned_mem, operands[1], bitnum,
-				operands[2], operands[3]));
+  if (TARGET_SAFE_BWA)
+    {
+      rtx label = gen_label_rtx ();
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+
+      rtx status = operands[2];
+      rtx temp = operands[3];
+      emit_insn (gen_aligned_store_safe_bwa (aligned_mem, operands[1], bitnum,
+					     status, temp));
+
+      rtx label_ref = gen_rtx_LABEL_REF (DImode, label);
+      rtx cond = gen_rtx_EQ (DImode, gen_rtx_SUBREG (DImode, status, 0),
+			     const0_rtx);
+      rtx jump = alpha_emit_unlikely_jump (cond, label_ref);
+      JUMP_LABEL (jump) = label;
+
+      cfun->split_basic_blocks_after_reload = 1;
+    }
+  else
+    emit_insn (gen_aligned_store (aligned_mem, operands[1], bitnum,
+				  operands[2], operands[3]));
   DONE;
 })
+
+;; Operand 0 is the address.  Operand 1 is the data to store.  Operand 2
+;; is the aligned address.  Operand 3 is the DImode status.  Operand 4 is
+;; a DImode scratch.
+
+(define_expand "reload_out<mode>_unaligned_safe_bwa"
+  [(set (match_operand:DI 3 "register_operand")
+	(unspec_volatile:DI [(mem:DI (match_operand:DI 2 "register_operand"))]
+			    UNSPECV_LL))
+   (set (match_dup 3)
+	(and:DI (not:DI
+		  (ashift:DI (match_dup 5)
+			     (ashift:DI (match_operand:DI 0 "register_operand")
+					(const_int 3))))
+		(match_dup 3)))
+   (set (match_operand:DI 4 "register_operand")
+	(ashift:DI (zero_extend:DI
+		     (match_operand:I12MODE 1 "register_operand"))
+		   (ashift:DI (match_dup 0) (const_int 3))))
+   (set (match_dup 3) (ior:DI (match_dup 4) (match_dup 3)))
+   (parallel [(set (match_dup 3)
+		   (unspec_volatile:DI [(const_int 0)] UNSPECV_SC))
+	      (set (mem:DI (match_dup 2)) (match_dup 3))])]
+  ""
+  "operands[5] = GEN_INT (GET_MODE_MASK (<MODE>mode));")
 
 ;; Vector operations
 
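For readers less familiar with RTL, the data flow of `reload_out<mode>_unaligned_safe_bwa' above can be modeled in portable C.  This is only a sketch, not what GCC emits: the LDQ_L/STQ_C pair is approximated with a C11 compare-and-swap retry loop, and the function name `store_sub_quad' and its parameters are made up for illustration.  It assumes the datum lies entirely within the aligned quadword.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: store the low WIDTH bytes (1 or 2) of VALUE at
   byte offset OFFSET within the aligned quadword *QUAD, leaving the
   other bytes intact even if they change concurrently.  */
static void
store_sub_quad (_Atomic uint64_t *quad, unsigned offset, unsigned width,
                uint64_t value)
{
  /* Assumption: the datum must not straddle the quadword boundary.  */
  assert ((width == 1 || width == 2) && offset + width <= 8);

  uint64_t mode_mask = width == 1 ? 0xff : 0xffff; /* Like GET_MODE_MASK.  */
  unsigned shift = offset * 8;                     /* Address << 3.  */
  uint64_t clear = ~(mode_mask << shift);          /* Role of MSKxL.  */
  uint64_t insert = (value & mode_mask) << shift;  /* Role of INSxL.  */

  /* Stands in for LDQ_L.  */
  uint64_t old = atomic_load (quad);

  /* Stands in for STQ_C followed by BEQ: on failure OLD is refreshed
     with the current contents and the merge is retried.  */
  while (!atomic_compare_exchange_weak (quad, &old, (old & clear) | insert))
    ;
}
```

The byte offset plays the part of the low three address bits, and the quadword pointer that of the address masked with -8.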
Index: gcc/gcc/config/alpha/alpha.opt
===================================================================
--- gcc.orig/gcc/config/alpha/alpha.opt
+++ gcc/gcc/config/alpha/alpha.opt
@@ -69,6 +69,10 @@  mcix
 Target Mask(CIX)
 Emit code for the counting ISA extension.
 
+msafe-bwa
+Target Mask(SAFE_BWA)
+Emit multi-thread and async-signal safe code for byte and word memory accesses.
+
 mexplicit-relocs
 Target Mask(EXPLICIT_RELOCS)
 Emit code using explicit relocation directives.
Index: gcc/gcc/config/alpha/alpha.opt.urls
===================================================================
--- gcc.orig/gcc/config/alpha/alpha.opt.urls
+++ gcc/gcc/config/alpha/alpha.opt.urls
@@ -35,6 +35,9 @@  UrlSuffix(gcc/DEC-Alpha-Options.html#ind
 mcix
 UrlSuffix(gcc/DEC-Alpha-Options.html#index-mcix)
 
+msafe-bwa
+UrlSuffix(gcc/DEC-Alpha-Options.html#index-msafe-bwa)
+
 mexplicit-relocs
 UrlSuffix(gcc/DEC-Alpha-Options.html#index-mexplicit-relocs)
 
Index: gcc/gcc/doc/invoke.texi
===================================================================
--- gcc.orig/gcc/doc/invoke.texi
+++ gcc/gcc/doc/invoke.texi
@@ -976,6 +976,7 @@  Objective-C and Objective-C++ Dialects}.
 -mtrap-precision=@var{mode}  -mbuild-constants
 -mcpu=@var{cpu-type}  -mtune=@var{cpu-type}
 -mbwx  -mmax  -mfix  -mcix
+-msafe-bwa
 -mfloat-vax  -mfloat-ieee
 -mexplicit-relocs  -msmall-data  -mlarge-data
 -msmall-text  -mlarge-text
@@ -25691,6 +25692,14 @@  CIX, FIX and MAX instruction sets.  The
 sets supported by the CPU type specified via @option{-mcpu=} option or that
 of the CPU on which GCC was built if none is specified.
 
+@opindex msafe-bwa
+@opindex mno-safe-bwa
+@item -msafe-bwa
+@itemx -mno-safe-bwa
+Indicate whether, in the absence of the optional BWX instruction set,
+GCC should generate multi-thread and async-signal safe code for byte
+and aligned word memory accesses.
+
 @opindex mfloat-vax
 @opindex mfloat-ieee
 @item -mfloat-vax
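The hazard this option addresses can be illustrated with a deterministic, single-threaded replay of one bad interleaving.  This is a sketch rather than a reproducer: `insert_byte' and `replay_lost_update' are hypothetical names, and the interleaving is spelled out by hand instead of being produced by real threads or signals.

```c
#include <assert.h>
#include <stdint.h>

/* Value of quadword OLD with byte V inserted at byte offset K, i.e. the
   merge a plain load/mask/or/store sequence computes.  */
static uint64_t
insert_byte (uint64_t old, unsigned k, uint8_t v)
{
  assert (k < 8);
  uint64_t mask = (uint64_t) 0xff << (k * 8);
  return (old & ~mask) | ((uint64_t) v << (k * 8));
}

/* Replay: writer A loads the quadword, writer B then completes a store
   to a neighbouring byte, and A finally writes back its stale copy.
   Returns the resulting memory contents.  */
static uint64_t
replay_lost_update (uint64_t mem)
{
  uint64_t stale = mem;                /* A: load of the quadword.  */
  mem = insert_byte (mem, 1, 0xbb);    /* B: whole RMW store to byte 1.  */
  mem = insert_byte (stale, 0, 0xaa);  /* A: merge into stale copy, store.  */
  return mem;
}
```

Starting from zero the replay ends with 0xaa rather than 0xbbaa, so the byte written by B is lost.  With a load-locked/store-conditional sequence the final store by A would fail and the merge would be retried on fresh contents.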
Index: gcc/gcc/testsuite/gcc.target/alpha/stb-bwa.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stb-bwa.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -msafe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stb (char *p, char v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	bic $16,7,$2
+	insbl $17,$16,$17
+$L2:
+	ldq_l $1,0($2)
+	mskbl $1,$16,$1
+	bis $17,$1,$1
+	stq_c $1,0($2)
+	beq $1,$L2
+
+   with address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldq_l\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstq_c\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sinsbl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\smskbl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stb-bwx.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stb-bwx.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mbwx" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stb (char *p, char v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	stb $17,0($16)
+ */
+
+/* { dg-final { scan-assembler-times "\\sstb\\s\\\$17,0\\\(\\\$16\\\)\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stb.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stb.c
@@ -0,0 +1,25 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -mno-safe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stb (char *p, char v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	insbl $17,$16,$17
+	ldq_u $1,0($16)
+	mskbl $1,$16,$1
+	bis $17,$1,$17
+	stq_u $17,0($16)
+
+   without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldq_u\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstq_u\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sinsbl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\smskbl\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stba-bwa.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stba-bwa.c
@@ -0,0 +1,35 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -msafe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    char c;
+  }
+char_a;
+
+void
+stba (char_a *p, char v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	and $17,0xff,$17
+$L2:
+	ldl_l $1,0($16)
+	bic $1,255,$1
+	bis $17,$1,$1
+	stl_c $1,0($16)
+	beq $1,$L2
+
+   without any INSBL or MSKBL instructions and without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldl_l\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstl_c\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sand\\s\\\$\[0-9\]+,0xff,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sbic\\s\\\$\[0-9\]+,255,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
+/* { dg-final { scan-assembler-not "\\s(?:insbl|mskbl)\\s" } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stba-bwx.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stba-bwx.c
@@ -0,0 +1,23 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mbwx" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    char c;
+  }
+char_a;
+
+void
+stba (char_a *p, char v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	stb $17,0($16)
+ */
+
+/* { dg-final { scan-assembler-times "\\sstb\\s\\\$17,0\\\(\\\$16\\\)\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stba.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stba.c
@@ -0,0 +1,33 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -mno-safe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    char c;
+  }
+char_a;
+
+void
+stba (char_a *p, char v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	and $17,0xff,$17
+	ldl $1,0($16)
+	bic $1,255,$1
+	bis $17,$1,$17
+	stl $17,0($16)
+
+   without any INSBL or MSKBL instructions and without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sand\\s\\\$\[0-9\]+,0xff,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sbic\\s\\\$\[0-9\]+,255,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
+/* { dg-final { scan-assembler-not "\\s(?:insbl|mskbl)\\s" } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stw-bwa.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stw-bwa.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -msafe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stw (short *p, short v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	bic $16,7,$2
+	inswl $17,$16,$17
+$L2:
+	ldq_l $1,0($2)
+	mskwl $1,$16,$1
+	bis $17,$1,$1
+	stq_c $1,0($2)
+	beq $1,$L2
+
+   with address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldq_l\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstq_c\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sinswl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\smskwl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stw-bwx.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stw-bwx.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mbwx" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stw (short *p, short v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	stw $17,0($16)
+ */
+
+/* { dg-final { scan-assembler-times "\\sstw\\s\\\$17,0\\\(\\\$16\\\)\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stw.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stw.c
@@ -0,0 +1,25 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -mno-safe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+void
+stw (short *p, short v)
+{
+  *p = v;
+}
+
+/* Expect assembly such as:
+
+	inswl $17,$16,$17
+	ldq_u $1,0($16)
+	mskwl $1,$16,$1
+	bis $17,$1,$17
+	stq_u $17,0($16)
+
+   without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldq_u\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstq_u\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sinswl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\smskwl\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stwa-bwa.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stwa-bwa.c
@@ -0,0 +1,35 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -msafe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    short c;
+  }
+short_a;
+
+void
+stwa (short_a *p, short v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	zapnot $17,3,$17
+$L2:
+	ldl_l $1,0($16)
+	zapnot $1,252,$1
+	bis $17,$1,$1
+	stl_c $1,0($16)
+	beq $1,$L2
+
+   without any INSWL or MSKWL instructions and without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldl_l\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstl_c\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\szapnot\\s\\\$\[0-9\]+,3,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\szapnot\\s\\\$\[0-9\]+,252,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
+/* { dg-final { scan-assembler-not "\\s(?:inswl|mskwl)\\s" } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stwa-bwx.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stwa-bwx.c
@@ -0,0 +1,23 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mbwx" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    short c;
+  }
+short_a;
+
+void
+stwa (short_a *p, short v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	stw $17,0($16)
+ */
+
+/* { dg-final { scan-assembler-times "\\sstw\\s\\\$17,0\\\(\\\$16\\\)\\s" 1 } } */
Index: gcc/gcc/testsuite/gcc.target/alpha/stwa.c
===================================================================
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/alpha/stwa.c
@@ -0,0 +1,33 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mno-bwx -mno-safe-bwa" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+typedef union
+  {
+    int i;
+    short c;
+  }
+short_a;
+
+void
+stwa (short_a *p, short v)
+{
+  p->c = v;
+}
+
+/* Expect assembly such as:
+
+	zapnot $17,3,$17
+	ldl $1,0($16)
+	zapnot $1,252,$1
+	bis $17,$1,$17
+	stl $17,0($16)
+
+   without any INSWL or MSKWL instructions and without address masking.  */
+
+/* { dg-final { scan-assembler-times "\\sldl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sstl\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\szapnot\\s\\\$\[0-9\]+,3,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\szapnot\\s\\\$\[0-9\]+,252,\\\$\[0-9\]+\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sbic\\s\\\$\[0-9\]+,7,\\\$\[0-9\]+\\s" } } */
+/* { dg-final { scan-assembler-not "\\s(?:inswl|mskwl)\\s" } } */