gcov: Split atomic bitwise-or for some targets

Message ID 20251207121102.1222-1-sebastian.huber@embedded-brains.de
State New
Series gcov: Split atomic bitwise-or for some targets

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-lint warning Lint failed
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-test success Testing passed

Commit Message

Sebastian Huber Dec. 7, 2025, 12:11 p.m. UTC
  There are targets which only offer 32-bit atomic operations (for
example, 32-bit RISC-V).  For these targets, split the 64-bit atomic
bitwise-or operation into two parts.

For this test case

int a(int i);
int b(int i);

int f(int i)
{
  if (i) {
    return a(i);
  } else {
    return b(i);
  }
}

with options

-O2 -fprofile-update=atomic -fcondition-coverage

the generated code for 64-bit vs. 32-bit RISC-V looks like:

        addi    a5,a5,%lo(.LANCHOR0)
        beq     a0,zero,.L2
        li      a4,1
-       amoor.d zero,a4,0(a5)
-       addi    a5,a5,8
-       amoor.d zero,zero,0(a5)
+       amoor.w zero,a4,0(a5)
+       addi    a4,a5,4
+       amoor.w zero,zero,0(a4)
+       addi    a4,a5,8
+       amoor.w zero,zero,0(a4)
+       addi    a5,a5,12
+       amoor.w zero,zero,0(a5)
        tail    a
 .L2:
-       amoor.d zero,zero,0(a5)
+       amoor.w zero,zero,0(a5)
+       addi    a4,a5,4
+       amoor.w zero,zero,0(a4)
        li      a4,1
-       addi    a5,a5,8
-       amoor.d zero,a4,0(a5)
+       addi    a3,a5,8
+       amoor.w zero,a4,0(a3)
+       addi    a5,a5,12
+       amoor.w zero,zero,0(a5)
        tail    b

Not related to this patch: even with -O2, the compiler generates
no-op atomic operations like

amoor.d zero,zero,0(a5)

and

amoor.w zero,zero,0(a5)

Would it be possible to filter these out in instrument_decisions()?

gcc/ChangeLog:

	* tree-profile.cc (split_update_decision_counter): New.
	(instrument_decisions): Use counter_update to determine which
	atomic operations are available.  Use
	split_update_decision_counter() if 64-bit atomic operations can
	be split up into two 32-bit atomic operations.
---
 gcc/tree-profile.cc | 73 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 67 insertions(+), 6 deletions(-)
  

Comments

Jeffrey Law Dec. 26, 2025, 11:43 p.m. UTC | #1
On 12/7/2025 5:11 AM, Sebastian Huber wrote:
> There are targets, which only offer 32-bit atomic operations (for
> example 32-bit RISC-V).  For these targets, split the 64-bit atomic
> bitwise-or operation into two parts.
>
> For this test case
>
> int a(int i);
> int b(int i);
>
> int f(int i)
> {
>    if (i) {
>      return a(i);
>    } else {
>      return b(i);
>    }
> }
>
> with options
>
> -O2 -fprofile-update=atomic -fcondition-coverage
>
> the code generation to 64-bit vs. 32-bit RISC-V looks like:
>
>          addi    a5,a5,%lo(.LANCHOR0)
>          beq     a0,zero,.L2
>          li      a4,1
> -       amoor.d zero,a4,0(a5)
> -       addi    a5,a5,8
> -       amoor.d zero,zero,0(a5)
> +       amoor.w zero,a4,0(a5)
> +       addi    a4,a5,4
> +       amoor.w zero,zero,0(a4)
> +       addi    a4,a5,8
> +       amoor.w zero,zero,0(a4)
> +       addi    a5,a5,12
> +       amoor.w zero,zero,0(a5)
>          tail    a
>   .L2:
> -       amoor.d zero,zero,0(a5)
> +       amoor.w zero,zero,0(a5)
> +       addi    a4,a5,4
> +       amoor.w zero,zero,0(a4)
>          li      a4,1
> -       addi    a5,a5,8
> -       amoor.d zero,a4,0(a5)
> +       addi    a3,a5,8
> +       amoor.w zero,a4,0(a3)
> +       addi    a5,a5,12
> +       amoor.w zero,zero,0(a5)
>          tail    b
>
> Not related to this patch, even with -O2 the compiler generates
> no-operations like
>
> amoor.d zero,zero,0(a5)
>
> and
>
> amoor.w zero,zero,0(a5)
>
> Would this be possible to filter out in instrument_decisions()?

I'd bet this might be reasonably optimized in either gimple or RTL 
without major work.  Though someone would have to read up on semantics 
-- are we allowed to drop atomics like that?


>
> gcc/ChangeLog:
>
> 	* tree-profile.cc (split_update_decision_counter): New.
> 	(instrument_decisions): Use counter_update to determine which
> 	atomic operations are available.  Use
> 	split_update_decision_counter() if 64-bit atomic operations can
> 	be split up into two 32-bit atomic operations.

I was originally thinking that splitting these in the optimizers or 
target files would make more sense.  But this looks like a fairly 
practical solution pending some real optimization work around atomics 
(which I think we do need; I've seen unused relaxed loads showing up in 
profiles from jemalloc).  But until then....


> +
> +    /* Get the high 32-bit of the counter */
> +    tree shift_32 = build_int_cst (integer_type_node, 32);
> +    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
> +					       "PROF_decision");
> +    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
> +					    counter, shift_32);

Doesn't the type of shift_32 need to match the type of the object being 
shifted?  Or do we have loose requirements around type checking operands 
for this case (where the shift count is often in a smaller precision 
than the object being shifted).

Do we need to worry about logical vs arithmetic shifts here? COUNTER's 
type is going to drive that decision, so we just need to make sure it's 
sensible.


>
>
> @@ -1157,6 +1213,11 @@ instrument_decisions (array_slice<basic_block> expr, size_t condno,
>   						      next[k], relaxed);
>   		    gsi_insert_on_edge (e, flush);
>   		}
> +		else if (use_atomic_split)
> +		{
> +		    split_update_decision_counter (e, ref, next[k],
> +						   atomic_ior_32, relaxed);
> +		}

Consider dropping the extraneous curly braces.  That function seems to be 
formatted without regard to our formatting conventions, so I'm not going 
to ask that you adjust indentation on this little hunk since it mirrors 
nearby code.

Jeff
  
Sebastian Huber Dec. 28, 2025, 1:26 p.m. UTC | #2
----- Am 27. Dez 2025 um 0:43 schrieb Jeff Law jeffreyalaw@gmail.com:

> On 12/7/2025 5:11 AM, Sebastian Huber wrote:
[...]
>> +
>> +    /* Get the high 32-bit of the counter */
>> +    tree shift_32 = build_int_cst (integer_type_node, 32);
>> +    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
>> +					       "PROF_decision");
>> +    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
>> +					    counter, shift_32);
> 
> Doesn't the type of shift_32 need to match the type of the object being
> shifted?  Or do we have loose requirements around type checking operands
> for this case (where the shift count is often in a smaller precision
> than the object being shifted).

This is my attempt to write something like this:

int shift_32 = 32;
gcov_type_node counter_high_64 = counter >> shift_32;

> 
> Do we need to worry about logical vs arithmetic shifts here? COUNTER's
> type is going to drive that decision, so we just need to make sure it's
> sensible.

We have

tree
get_gcov_type (void)
{
  scalar_int_mode mode
    = smallest_int_mode_for_size
      (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32).require ();
  return lang_hooks.types.type_for_mode (mode, false);
}

So, the gcov_type_node is probably a signed type.

With

    gassign *assign4 = gimple_build_assign (counter_high_32, NOP_EXPR,
					    counter_high_64);

does it matter if it is a logical or arithmetic shift? I am sorry, but I don't really know what I am doing here. I tinkered this together by looking at examples in the code.

> 
> 
>>
>>
>> @@ -1157,6 +1213,11 @@ instrument_decisions (array_slice<basic_block> expr,
>> size_t condno,
>>   						      next[k], relaxed);
>>   		    gsi_insert_on_edge (e, flush);
>>   		}
>> +		else if (use_atomic_split)
>> +		{
>> +		    split_update_decision_counter (e, ref, next[k],
>> +						   atomic_ior_32, relaxed);
>> +		}
> 
> Consider dropping the extraneous curlys.  That function seems to be
> formatted without regard to our formatting conventions, so I'm not going
> to ask that you adjust indention on this little hunk since it mirrors
> nearby code.

Ok, I adjusted the patch.
  
Sebastian Huber Dec. 28, 2025, 1:34 p.m. UTC | #3
----- Am 28. Dez 2025 um 14:26 schrieb Sebastian Huber sebastian.huber@embedded-brains.de:

> ----- Am 27. Dez 2025 um 0:43 schrieb Jeff Law jeffreyalaw@gmail.com:
> 
>> On 12/7/2025 5:11 AM, Sebastian Huber wrote:
> [...]
>>> +
>>> +    /* Get the high 32-bit of the counter */
>>> +    tree shift_32 = build_int_cst (integer_type_node, 32);
>>> +    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
>>> +					       "PROF_decision");
>>> +    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
>>> +					    counter, shift_32);
>> 
>> Doesn't the type of shift_32 need to match the type of the object being
>> shifted?  Or do we have loose requirements around type checking operands
>> for this case (where the shift count is often in a smaller precision
>> than the object being shifted).
> 
> This is my attempt to write something like this:
> 
> int shift_32 = 32;
> gcov_type_node counter_high_64 = counter >> shift_32;

Oh, it looks like I confused left and right. This should be an RSHIFT_EXPR:

gassign *assign3 = gimple_build_assign (counter_high_64, RSHIFT_EXPR,
                                        counter, shift_32);

> 
>> 
>> Do we need to worry about logical vs arithmetic shifts here? COUNTER's
>> type is going to drive that decision, so we just need to make sure it's
>> sensible.
> 
> We have
> 
> tree
> get_gcov_type (void)
> {
>  scalar_int_mode mode
>    = smallest_int_mode_for_size
>      (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32).require ();
>  return lang_hooks.types.type_for_mode (mode, false);
> }
> 
> So, the gcov_type_node is probably a signed type.
> 
> With
> 
>    gassign *assign4 = gimple_build_assign (counter_high_32, NOP_EXPR,
>					    counter_high_64);
> 
> does it matter if it is a logical or arithmetic shift? I am sorry, but I don't
> really know what I am doing here. I tinkered this together by looking at
> examples in the code.
> 
>> 
>> 
>>>
>>>
>>> @@ -1157,6 +1213,11 @@ instrument_decisions (array_slice<basic_block> expr,
>>> size_t condno,
>>>   						      next[k], relaxed);
>>>   		    gsi_insert_on_edge (e, flush);
>>>   		}
>>> +		else if (use_atomic_split)
>>> +		{
>>> +		    split_update_decision_counter (e, ref, next[k],
>>> +						   atomic_ior_32, relaxed);
>>> +		}
>> 
>> Consider dropping the extraneous curlys.  That function seems to be
>> formatted without regard to our formatting conventions, so I'm not going
>> to ask that you adjust indention on this little hunk since it mirrors
>> nearby code.
> 
> Ok, I adjusted the patch.
> 
> --
> embedded brains GmbH & Co. KG
> Herr Sebastian HUBER
> Dornierstr. 4
> 82178 Puchheim
> Germany
> email: sebastian.huber@embedded-brains.de
> phone: +49-89-18 94 741 - 16
> fax:   +49-89-18 94 741 - 08
> 
> Registergericht: Amtsgericht München
> Registernummer: HRB 157899
> Vertretungsberechtigte Geschäftsführer: Peter Rasmussen, Thomas Dörfler
> Unsere Datenschutzerklärung finden Sie hier:
> https://embedded-brains.de/datenschutzerklaerung/
  
Sebastian Huber Dec. 29, 2025, 2:03 a.m. UTC | #4
----- Am 28. Dez 2025 um 14:34 schrieb Sebastian Huber sebastian.huber@embedded-brains.de:

> ----- Am 28. Dez 2025 um 14:26 schrieb Sebastian Huber
> sebastian.huber@embedded-brains.de:
> 
>> ----- Am 27. Dez 2025 um 0:43 schrieb Jeff Law jeffreyalaw@gmail.com:
>> 
>>> On 12/7/2025 5:11 AM, Sebastian Huber wrote:
>> [...]
>>>> +
>>>> +    /* Get the high 32-bit of the counter */
>>>> +    tree shift_32 = build_int_cst (integer_type_node, 32);
>>>> +    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
>>>> +					       "PROF_decision");
>>>> +    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
>>>> +					    counter, shift_32);
>>> 
>>> Doesn't the type of shift_32 need to match the type of the object being
>>> shifted?  Or do we have loose requirements around type checking operands
>>> for this case (where the shift count is often in a smaller precision
>>> than the object being shifted).
>> 
>> This is my attempt to write something like this:
>> 
>> int shift_32 = 32;
>> gcov_type_node counter_high_64 = counter >> shift_32;
> 
> Oh, it looks like I confused left and right. This should be an RSHIFT_EXPR:
> 
> gassign *assign3 = gimple_build_assign (counter_high_64, RSHIFT_EXPR,
>                                        counter, shift_32);

I used this test case to double check that the shifting is now correct:

int a(void);
int b(void);
int c(int);
int f(int *i)
{
  if (c(i[0]) || c(i[1]) || c(i[2]) || c(i[3]) || c(i[4]) ||
      c(i[5]) || c(i[6]) || c(i[7]) || c(i[8]) || c(i[9]) ||
      c(i[10]) || c(i[11]) || c(i[12]) || c(i[13]) || c(i[14]) ||
      c(i[15]) || c(i[16]) || c(i[17]) || c(i[18]) || c(i[19]) ||
      c(i[20]) || c(i[21]) || c(i[22]) || c(i[23]) || c(i[24]) ||
      c(i[25]) || c(i[26]) || c(i[27]) || c(i[28]) || c(i[29]) ||
      c(i[30]) || c(i[31]) || c(i[32]) || c(i[33]) || c(i[34]) ||
      c(i[35]) || c(i[36]) || c(i[37]) || c(i[38]) || c(i[39])) {
    return a();
  } else {
    return b();
  }
}

Interestingly, GCC now reuses the "amoor.w zero,zero" operations (see "j .L46").

	.type	f, @function
f:
	addi	sp,sp,-16
	sw	s0,8(sp)
	mv	s0,a0
	lw	a0,0(a0)
	sw	ra,12(sp)
	call	c
	bne	a0,zero,.L49
	lw	a0,4(s0)
	call	c
	beq	a0,zero,.L4
	lui	a5,%hi(.LANCHOR0)
	addi	a5,a5,%lo(.LANCHOR0)
	li	a4,2
.L44:
	amoor.w	zero,a4,0(a5)
	addi	a4,a5,4
.L46:
	amoor.w	zero,zero,0(a4)
	addi	a4,a5,8
	amoor.w	zero,zero,0(a4)
	addi	a5,a5,12
	amoor.w	zero,zero,0(a5)
.L3:
	lw	s0,8(sp)
	lw	ra,12(sp)
	addi	sp,sp,16
	tail	a
.L4:
	lw	a0,8(s0)
	call	c
	beq	a0,zero,.L5
	lui	a5,%hi(.LANCHOR0)
	addi	a5,a5,%lo(.LANCHOR0)
	li	a4,4
	amoor.w	zero,a4,0(a5)
	add	a4,a5,a4
	j	.L46

GCC reloads the .LANCHOR0 address about 40 times. It probably should do this only once and keep the address in a call-saved (non-volatile) register.

Once the counter exceeds 32 bits, we get this code:

.L34:
	lw	a0,128(s0)
	call	c
	beq	a0,zero,.L35
	lui	a5,%hi(.LANCHOR0)
	addi	a5,a5,%lo(.LANCHOR0)
	amoor.w	zero,zero,0(a5)
	li	a4,1
.L45:
	addi	a3,a5,4
.L47:
	amoor.w	zero,a4,0(a3)
	addi	a4,a5,8
	amoor.w	zero,zero,0(a4)
	addi	a5,a5,12
	amoor.w	zero,zero,0(a5)
	j	.L3

This is the corresponding 64-bit code:

.L34:
	lw	a0,128(s0)
	call	c
	beq	a0,zero,.L35
	lui	a5,%hi(.LANCHOR0)
	li	a4,1
	addi	a5,a5,%lo(.LANCHOR0)
	slli	a4,a4,32
	amoor.d	zero,a4,0(a5)
	addi	a5,a5,8
	amoor.d	zero,zero,0(a5)
	j	.L3
  
Jeffrey Law Dec. 29, 2025, 6:07 p.m. UTC | #5
On 12/28/2025 6:26 AM, Sebastian Huber wrote:
> ----- Am 27. Dez 2025 um 0:43 schrieb Jeff Law jeffreyalaw@gmail.com:
>
>> On 12/7/2025 5:11 AM, Sebastian Huber wrote:
> [...]
>>> +
>>> +    /* Get the high 32-bit of the counter */
>>> +    tree shift_32 = build_int_cst (integer_type_node, 32);
>>> +    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
>>> +					       "PROF_decision");
>>> +    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
>>> +					    counter, shift_32);
>> Doesn't the type of shift_32 need to match the type of the object being
>> shifted?  Or do we have loose requirements around type checking operands
>> for this case (where the shift count is often in a smaller precision
>> than the object being shifted).
> This is my attempt to write something like this:
>
> int shift_32 = 32;
> gcov_type_node counter_high_64 = counter >> shift_32;
So I went into the tree checking code and we do indeed have looser 
checks for the shift/rotate cases; essentially we allow the shift/rotate 
count to be any integral type or vector of integrals.   So we're OK with 
a constant node like you're using.

> tree
> get_gcov_type (void)
> {
>    scalar_int_mode mode
>      = smallest_int_mode_for_size
>        (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32).require ();
>    return lang_hooks.types.type_for_mode (mode, false);
> }
>
> So, the gcov_type_node is probably a signed type.
That was my conclusion as well.

>
> With
>
>      gassign *assign4 = gimple_build_assign (counter_high_32, NOP_EXPR,
> 					    counter_high_64);
>
> does it matter if it is a logical or arithmetic shift? I am sorry, but I don't really know what I am doing here. I tinkered this together by looking at examples in the code.
No worries at all.  I'm not familiar with the gcov code, so we're both 
just kind of slogging through it.

So the assignment above will just convert the types and as I think 
through it, the type of the shift isn't going to matter because you 
shift the upper 32 bits into the low 32 bit positions.  The upper 32 
bits will be copies of the original sign bit.  But then we use the 
nop-conversion to drop those upper 32 bits anyway.  So it shouldn't 
really matter if they were zeros or copies of the sign bit because we 
never use them.
>> Consider dropping the extraneous curlys.  That function seems to be
>> formatted without regard to our formatting conventions, so I'm not going
>> to ask that you adjust indention on this little hunk since it mirrors
>> nearby code.
> Ok, I adjusted the patch.
Thanks.  I'll take another looksie, but we're probably good to go after 
working through this stuff a bit on this thread.
jeff
  
Jeffrey Law Dec. 29, 2025, 7:08 p.m. UTC | #6
On 12/28/2025 7:03 PM, Sebastian Huber wrote:
> I used this test case to double check that the shifting is now correct:
>
> int a(void);
> int b(void);
> int c(int);
> int f(int *i)
> {
>    if (c(i[0]) || c(i[1]) || c(i[2]) || c(i[3]) || c(i[4]) ||
>        c(i[5]) || c(i[6]) || c(i[7]) || c(i[8]) || c(i[9]) ||
>        c(i[10]) || c(i[11]) || c(i[12]) || c(i[13]) || c(i[14]) ||
>        c(i[15]) || c(i[16]) || c(i[17]) || c(i[18]) || c(i[19]) ||
>        c(i[20]) || c(i[21]) || c(i[22]) || c(i[23]) || c(i[24]) ||
>        c(i[25]) || c(i[26]) || c(i[27]) || c(i[28]) || c(i[29]) ||
>        c(i[30]) || c(i[31]) || c(i[32]) || c(i[33]) || c(i[34]) ||
>        c(i[35]) || c(i[36]) || c(i[37]) || c(i[38]) || c(i[39])) {
>      return a();
>    } else {
>      return b();
>    }
> }
>
> Interestingly, GCC now reuses the "amoor.w zero,zero" operations (see "j .L46").
Right.  That's not a huge surprise to me.  If we look at the gimple we see:


> ;;   basic block 3, loop depth 0
> ;;    pred:       2
>   __atomic_fetch_or_4 (&__gcov8.f[0], 1, 0);
>   __atomic_fetch_or_4 (&MEM <long long int> [(void *)&__gcov8.f + 4B], 0, 0);
>   __atomic_fetch_or_4 (&__gcov8.f[1], 0, 0);
>   __atomic_fetch_or_4 (&MEM <long long int> [(void *)&__gcov8.f + 12B], 0, 0);
>   _8 = a (i_3(D)); [tail call]
>   goto <bb 5>; [100.00%]
> ;;    succ:       5
>
> ;;   basic block 4, loop depth 0
> ;;    pred:       2
>   __atomic_fetch_or_4 (&__gcov8.f[0], 0, 0);
>   __atomic_fetch_or_4 (&MEM <long long int> [(void *)&__gcov8.f + 4B], 0, 0);
>   __atomic_fetch_or_4 (&__gcov8.f[1], 1, 0);
>   __atomic_fetch_or_4 (&MEM <long long int> [(void *)&__gcov8.f + 12B], 0, 0);
>   _6 = b (0); [tail call]
So to improve the code you need to recognize the atomic_fetch_or_4 where 
the object is IOR'd with the constant 0 as a nop and remove those 
statements (or not emit them to begin with).  In general our optimizers 
don't do a whole lot with atomics right now.

I think your change is missing a check somewhere.   When I compile your 
test I initially get "target does not support atomic profile update, 
single mode is selected", but then it still takes the atomic path.  
Before your patch it just used the non-atomic updates.  So it appears 
something isn't quite right yet.

jeff
  
Sebastian Huber Dec. 30, 2025, 1:01 a.m. UTC | #7
----- Am 29. Dez 2025 um 20:08 schrieb Jeffrey Law jeffrey.law@oss.qualcomm.com:
[...]
> I think your change is missing a check somewhere.   When I compile your
> test I initially get "target does not support atomic profile update,
> single mode is selected", but then it still does the atomic path.
> Before your patch is just used the non-atomic updates.  So it appears
> something isn't quite right yet.

Yes, it seems the counter update mode selection was wrong for PROFILE_UPDATE_ATOMIC. There should be a dedicated warning if COUNTER_UPDATE_ATOMIC_PARTIAL is selected. Please have a look at this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2025-December/704607.html

If the target doesn't support libatomic, then some atomic operations cannot be carried out. If at least 32-bit atomic operations are available, we can at least do the atomic increments and the bit field updates. This case is indicated by COUNTER_UPDATE_ATOMIC_PARTIAL.
  

Patch

diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
index fe20e84838d..0ac4e826beb 100644
--- a/gcc/tree-profile.cc
+++ b/gcc/tree-profile.cc
@@ -1006,6 +1006,57 @@  resolve_counters (vec<counters>& cands)
 
 }
 
+/* At edge E, update the decision counter referenced by REF with the
+   COUNTER.  Generate two separate 32-bit atomic bitwise-or operations
+   specified by ATOMIC_IOR_32 in the RELAXED memory order.  */
+static void
+split_update_decision_counter (edge e, tree ref, tree counter, tree
+			       atomic_ior_32, tree relaxed)
+{
+    gimple_stmt_iterator gsi = gsi_last (PENDING_STMT (e));
+    ref = unshare_expr (ref);
+
+    /* Get the low and high address of the referenced counter */
+    tree addr_low = build_addr (ref);
+    tree addr_high = make_temp_ssa_name (TREE_TYPE (addr_low), NULL,
+					 "PROF_decision");
+    tree four = build_int_cst (size_type_node, 4);
+    gassign *assign1 = gimple_build_assign (addr_high, POINTER_PLUS_EXPR,
+					    addr_low, four);
+    gsi_insert_after (&gsi, assign1, GSI_NEW_STMT);
+    if (WORDS_BIG_ENDIAN)
+	std::swap (addr_low, addr_high);
+
+    /* Get the low 32-bit of the counter */
+    tree counter_low_32 = make_temp_ssa_name (uint32_type_node, NULL,
+					      "PROF_decision");
+    gassign *assign2 = gimple_build_assign (counter_low_32, NOP_EXPR, counter);
+    gsi_insert_after (&gsi, assign2, GSI_NEW_STMT);
+
+    /* Get the high 32-bit of the counter */
+    tree shift_32 = build_int_cst (integer_type_node, 32);
+    tree counter_high_64 = make_temp_ssa_name (gcov_type_node, NULL,
+					       "PROF_decision");
+    gassign *assign3 = gimple_build_assign (counter_high_64, LSHIFT_EXPR,
+					    counter, shift_32);
+    gsi_insert_after (&gsi, assign3, GSI_NEW_STMT);
+    tree counter_high_32 = make_temp_ssa_name (uint32_type_node, NULL,
+					       "PROF_decision");
+    gassign *assign4 = gimple_build_assign (counter_high_32, NOP_EXPR,
+					    counter_high_64);
+    gsi_insert_after (&gsi, assign4, GSI_NEW_STMT);
+
+    /* Atomically bitwise-or the low 32-bit counter parts */
+    gcall *call1 = gimple_build_call (atomic_ior_32, 3, addr_low,
+				      counter_low_32, relaxed);
+    gsi_insert_after (&gsi, call1, GSI_NEW_STMT);
+
+    /* Atomically bitwise-or the high 32-bit counter parts */
+    gcall *call2 = gimple_build_call (atomic_ior_32, 3, addr_high,
+				      counter_high_32, relaxed);
+    gsi_insert_after (&gsi, call2, GSI_NEW_STMT);
+}
+
 /* Add instrumentation to a decision subgraph.  EXPR should be the
    (topologically sorted) block of nodes returned by cov_blocks, MAPS the
    bitmaps returned by cov_maps, and MASKS the block of bitsets returned by
@@ -1108,11 +1159,16 @@  instrument_decisions (array_slice<basic_block> expr, size_t condno,
     gcc_assert (xi == bitmap_count_bits (core));
 
     const tree relaxed = build_int_cst (integer_type_node, MEMMODEL_RELAXED);
-    const bool atomic = flag_profile_update == PROFILE_UPDATE_ATOMIC;
-    const tree atomic_ior = builtin_decl_explicit
-	(TYPE_PRECISION (gcov_type_node) > 32
-	 ? BUILT_IN_ATOMIC_FETCH_OR_8
-	 : BUILT_IN_ATOMIC_FETCH_OR_4);
+    const bool use_atomic_builtin =
+	counter_update == COUNTER_UPDATE_ATOMIC_BUILTIN;
+    const bool use_atomic_split =
+	counter_update == COUNTER_UPDATE_ATOMIC_SPLIT ||
+	counter_update == COUNTER_UPDATE_ATOMIC_PARTIAL;
+    const tree atomic_ior_32 =
+	builtin_decl_explicit (BUILT_IN_ATOMIC_FETCH_OR_4);
+    const tree atomic_ior = TYPE_PRECISION (gcov_type_node) > 32 ?
+	builtin_decl_explicit (BUILT_IN_ATOMIC_FETCH_OR_8) :
+	atomic_ior_32;
 
     /* Flush to the gcov accumulators.  */
     for (const basic_block b : expr)
@@ -1149,7 +1205,7 @@  instrument_decisions (array_slice<basic_block> expr, size_t condno,
 	    {
 		tree ref = tree_coverage_counter_ref (GCOV_COUNTER_CONDS,
 						      2*condno + k);
-		if (atomic)
+		if (use_atomic_builtin)
 		{
 		    ref = unshare_expr (ref);
 		    gcall *flush = gimple_build_call (atomic_ior, 3,
@@ -1157,6 +1213,11 @@  instrument_decisions (array_slice<basic_block> expr, size_t condno,
 						      next[k], relaxed);
 		    gsi_insert_on_edge (e, flush);
 		}
+		else if (use_atomic_split)
+		{
+		    split_update_decision_counter (e, ref, next[k],
+						   atomic_ior_32, relaxed);
+		}
 		else
 		{
 		    tree get = emit_assign (e, ref);