AArch64: Improve address rematerialization costs

Message ID DB6PR0801MB18797423882B844F219D4A4483C69@DB6PR0801MB1879.eurprd08.prod.outlook.com
Series AArch64: Improve address rematerialization costs

Commit Message

Wilco Dijkstra May 9, 2022, 4:11 p.m. UTC
  Improve rematerialization costs of addresses.  The current costs are set too high
which results in extra register pressure and spilling.  Using lower costs means
addresses will be rematerialized more often rather than being spilled or causing
spills.  This results in significant codesize reductions and performance gains.
SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra  <wdijkstr@arm.com>

        * config/aarch64/aarch64.cc (aarch64_rtx_costs): Use better rematerialization
        costs for HIGH, LO_SUM and SYMREF.

---
  

Comments

Richard Sandiford May 9, 2022, 4:27 p.m. UTC | #1
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Improve rematerialization costs of addresses.  The current costs are set too high
> which results in extra register pressure and spilling.  Using lower costs means
> addresses will be rematerialized more often rather than being spilled or causing
> spills.  This results in significant codesize reductions and performance gains.
> SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO.  Codesize is 0.12%
> smaller.

I'm not questioning the results, but I think we need to look in more
detail why rematerialisation requires such low costs.  The point of
comparison should be against a spill and reload, so any constant
that is as cheap as a load should be rematerialised.  If that isn't
happening then it sounds like changes are needed elsewhere.

Thanks,
Richard

> Passes bootstrap and regress. OK for commit?
>
> ChangeLog:
> 2021-06-01  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * config/aarch64/aarch64.cc (aarch64_rtx_costs): Use better rematerialization
>         costs for HIGH, LO_SUM and SYMREF.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 43d87d1b9c4ef1a85094e51f81745f98f1ef27fb..7341849121ffd6b3b0b77c9730e74e751742e852 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -14529,45 +14529,28 @@ cost_plus:
>           return false;  /* All arguments need to be in registers.  */
>         }
>
> +    /* The following costs are used for rematerialization of addresses.
> +       Set a low cost for all global accesses - this ensures they are
> +       preferred for rematerialization, blocks them from being spilled
> +       and reduces register pressure.  The result is significant codesize
> +       reductions and performance gains. */
> +
>      case SYMBOL_REF:
> +      *cost = 0;
>
> -      if (aarch64_cmodel == AARCH64_CMODEL_LARGE
> -         || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
> -       {
> -         /* LDR.  */
> -         if (speed)
> -           *cost += extra_cost->ldst.load;
> -       }
> -      else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
> -              || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
> -       {
> -         /* ADRP, followed by ADD.  */
> -         *cost += COSTS_N_INSNS (1);
> -         if (speed)
> -           *cost += 2 * extra_cost->alu.arith;
> -       }
> -      else if (aarch64_cmodel == AARCH64_CMODEL_TINY
> -              || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
> -       {
> -         /* ADR.  */
> -         if (speed)
> -           *cost += extra_cost->alu.arith;
> -       }
> +      /* Use a separate rematerialization cost for GOT accesses.  */
> +      if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
> +         && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
> +       *cost = COSTS_N_INSNS (1) / 2;
>
> -      if (flag_pic)
> -       {
> -         /* One extra load instruction, after accessing the GOT.  */
> -         *cost += COSTS_N_INSNS (1);
> -         if (speed)
> -           *cost += extra_cost->ldst.load;
> -       }
>        return true;
>
>      case HIGH:
> +      *cost = 0;
> +      return true;
> +
>      case LO_SUM:
> -      /* ADRP/ADD (immediate).  */
> -      if (speed)
> -       *cost += extra_cost->alu.arith;
> +      *cost = COSTS_N_INSNS (3) / 4;
>        return true;
>
>      case ZERO_EXTRACT:
  
Wilco Dijkstra May 9, 2022, 5:18 p.m. UTC | #2
Hi Richard,

> I'm not questioning the results, but I think we need to look in more
> detail why rematerialisation requires such low costs.  The point of
> comparison should be against a spill and reload, so any constant
> that is as cheap as a load should be rematerialised.  If that isn't
> happening then it sounds like changes are needed elsewhere.

The simple answer is that rematerializable expressions must have a lower cost
than the spill cost (potentially of something else), otherwise it will never happen.
The previous costs were set way too high (eg. 12 for ADRP+LDR vs 4 for a reload).
This patch basically ensures that is indeed the case. In principle a zero cost
works fine for anything that can be rematerialized. However it may use more
instructions than a spill (of something else), so a small non-zero cost avoids
bloating codesize.
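
To put concrete numbers on that, here is a minimal sketch of the comparison,
assuming GCC's standard COSTS_N_INSNS scale from rtl.h (COSTS_N_INSNS (N) is
N * 4) and using the figures quoted above together with the values from the patch:

/* Cost comparison sketch on the assumed COSTS_N_INSNS scale.  */
#define COSTS_N_INSNS(n) ((n) * 4)

int reload_cost    = 4;                      /* the "4 for a reload" quoted above    */
int old_adrp_ldr   = 12;                     /* the old SYMBOL_REF cost quoted above */
int new_symbol_ref = 0;                      /* patch: plain global addresses        */
int new_got_4g     = COSTS_N_INSNS (1) / 2;  /* 2: small-PIC GOT accesses            */
int new_lo_sum     = COSTS_N_INSNS (3) / 4;  /* 3: the LO_SUM (ADRP+ADD) part        */

/* With the old costs an address looked far more expensive than a reload, so
   the allocator preferred to spill; with the new costs every rematerializable
   address is cheaper than a reload.  */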

There isn't really a better way of doing this within the existing costing code.
We could try doubling or quadrupling the spill costs but that would create a
lot of fallout since it affects everything.

Cheers,
Wilco
  
Richard Sandiford May 9, 2022, 5:30 p.m. UTC | #3
Wilco Dijkstra via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi Richard,
>
>> I'm not questioning the results, but I think we need to look in more
>> detail why rematerialisation requires such low costs.  The point of
>> comparison should be against a spill and reload, so any constant
>> that is as cheap as a load should be rematerialised.  If that isn't
>> happening then it sounds like changes are needed elsewhere.
>
> The simple answer is that rematerializable expressions must have a lower cost
> than the spill cost (potentially of something else), otherwise it will never happen.
> The previous costs were set way too high (eg. 12 for ADRP+LDR vs 4 for a reload).
> This patch basically ensures that is indeed the case. In principle a zero cost
> works fine for anything that can be rematerialized. However it may use more
> instructions than a spill (of something else), so a small non-zero cost avoids
> bloating codesize.
>
> There isn't really a better way of doing this within the existing costing code.

Yeah, I was wondering whether we could change something there.
ADRP+LDR is logically more expensive than a single LDR, especially
when optimising for size, so I think it's reasonable for the rtx_costs
to say so.  But that doesn't/shouldn't mean that spilling is better
(for either size or speed).

So it feels like there's something missing in the way the costs are
being applied.

Thanks,
Richard

> We could try doubling or quadrupling the spill costs but that would create a
> lot of fallout since it affects everything.
  
Wilco Dijkstra May 10, 2022, 2:57 p.m. UTC | #4
Hi Richard,

>> There isn't really a better way of doing this within the existing costing code.
>
> Yeah, I was wondering whether we could change something there.
> ADRP+LDR is logically more expensive than a single LDR, especially
> when optimising for size, so I think it's reasonable for the rtx_costs
> to say so.  But that doesn't/shouldn't mean that spilling is better
> (for either size or speed).
>
> So it feels like there's something missing in the way the costs are
> being applied.

Calculating accurate spill costs is hard. Spill optimization is done later, so
until then you can't know the actual cost of a spill decision that has already
been made. Spills are also more expensive than you might think due to store
latency, more dirty cache lines, etc. There is little benefit in lifting an ADRP
to the start of a function while keeping the ADD/LDR close to their references.
Basically ADRP/MOV are very cheap, so it's a waste to allocate them to
long-lived registers.

Given that there are significant codesize and performance improvements,
it is clear that doing more rematerialization is better even in cases where it
takes 2 instructions to recompute the address. Binaries show a significant
reduction in stack-based loads and stores.

Cheers,
Wilco
  
Richard Sandiford May 10, 2022, 3:19 p.m. UTC | #5
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>>> There isn't really a better way of doing this within the existing costing code.
>>
>> Yeah, I was wondering whether we could change something there.
>> ADRP+LDR is logically more expensive than a single LDR, especially
>> when optimising for size, so I think it's reasonable for the rtx_costs
>> to say so.  But that doesn't/shouldn't mean that spilling is better
>> (for either size or speed).
>>
>> So it feels like there's something missing in the way the costs are
>> being applied.
>
> Calculating accurate spill costs is hard. Spill optimization is done later, so
> until then you can't know the actual cost of a spill decision that has already
> been made. Spills are also more expensive than you might think due to store
> latency, more dirty cache lines, etc. There is little benefit in lifting an ADRP
> to the start of a function while keeping the ADD/LDR close to their references.
> Basically ADRP/MOV are very cheap, so it's a waste to allocate them to
> long-lived registers.
>
> Given that there are significant codesize and performance improvements,
> it is clear that doing more rematerialization is better even in cases where it
> takes 2 instructions to recompute the address. Binaries show a significant
> reduction in stack-based loads and stores.

Yeah, I'm not disagreeing with any of that.  It's just a question of
whether the problem should be fixed by artificially lowering the general
rtx costs with one particular user (RA spill costs) in mind, or whether
it should be fixed by making the RA spill code take the factors above
into account.

Thanks,
Richard
  
Wilco Dijkstra May 11, 2022, 11:13 a.m. UTC | #6
Hi Richard,

> Yeah, I'm not disagreeing with any of that.  It's just a question of
> whether the problem should be fixed by artificially lowering the general
> rtx costs with one particular user (RA spill costs) in mind, or whether
> it should be fixed by making the RA spill code take the factors above
> into account.

The RA spill code already works fine on immediates but not on address
constants. And the reason is that the current rtx costs for addresses are
set artificially high without justification (I checked the patch that increased
the costs but there was nothing explaining why it was beneficial).

It's certainly possible to experiment with increasing the spill costs, but that
won't improve the issue with address constants unless they are at least doubled.
And it has the effect of halving all rtx costs in the register allocator which is
likely to cause regressions. So we'd need to adjust many rtx costs to keep the
allocator working, plus fix any further regressions this causes.

Cheers,
Wilco
  
Richard Sandiford May 11, 2022, 12:22 p.m. UTC | #7
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> Yeah, I'm not disagreeing with any of that.  It's just a question of
>> whether the problem should be fixed by artificially lowering the general
>> rtx costs with one particular user (RA spill costs) in mind, or whether
>> it should be fixed by making the RA spill code take the factors above
>> into account.
>
> The RA spill code already works fine on immediates but not on address
> constants. And the reason is that the current rtx costs for addresses are
> set artificially high without justification (I checked the patch that increased
> the costs but there was nothing explaining why it was beneficial).

But even if the costs are too high, the patch seems to be overcompensating.
It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.

Giving X zero cost means that a sequence like:

  (set (reg x0) X)
  (set (reg x1) X)

should stay as-is rather than be changed to:

  (set (reg x0) X)
  (set (reg x1) (reg x0))

I don't think we want that for multi-instruction constants when
optimising for size.

> It's certainly possible to experiment with increasing the spill costs, but that
> won't improve the issue with address constants unless they are at least doubled.
> And it has the effect of halving all rtx costs in the register allocator which is
> likely to cause regressions. So we'd need to adjust many rtx costs to keep the
> allocator working, plus fix any further regressions this causes.

Yeah, I wasn't suggesting that we increase the spill costs.  I'm saying
that we should look at whether the target-independent RA heuristics need
to change, whether new target hooks are needed, etc.  We shouldn't go
into this with the assumption that the target-independent code is
invariant and that any fix must be in existing aarch64 hooks (rtx costs
or spill costs).

Maybe it would help to turn the question around for a minute.  Can we
describe the cases in which it's *better* for the RA to spill a constant
address to the stack and reload it, rather than rematerialise on demand?

Thanks,
Richard
  
Richard Biener May 11, 2022, 1:08 p.m. UTC | #8
On Wed, May 11, 2022 at 2:23 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > Hi Richard,
> >
> >> Yeah, I'm not disagreeing with any of that.  It's just a question of
> >> whether the problem should be fixed by artificially lowering the general
> >> rtx costs with one particular user (RA spill costs) in mind, or whether
> >> it should be fixed by making the RA spill code take the factors above
> >> into account.
> >
> > The RA spill code already works fine on immediates but not on address
> > constants. And the reason is that the current rtx costs for addresses are
> > set artificially high without justification (I checked the patch that increased
> > the costs but there was nothing explaining why it was beneficial).
>
> But even if the costs are too high, the patch seems to be overcompensating.
> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.
>
> Giving X zero cost means that a sequence like:
>
>   (set (reg x0) X)
>   (set (reg x1) X)
>
> should stay as-is rather than be changed to:
>
>   (set (reg x0) X)
>   (set (reg x1) (reg x0))
>
> I don't think we want that for multi-instruction constants when
> optimising for size.
>
> > It's certainly possible to experiment with increasing the spill costs, but that
> > won't improve the issue with address constants unless they are at least doubled.
> > And it has the effect of halving all rtx costs in the register allocator which is
> > likely to cause regressions. So we'd need to adjust many rtx costs to keep the
> > allocator working, plus fix any further regressions this causes.
>
> Yeah, I wasn't suggesting that we increase the spill costs.  I'm saying
> that we should look at whether the target-independent RA heuristics need
> to change, whether new target hooks are needed, etc.  We shouldn't go
> into this with the assumption that the target-independent code is
> invariant and that any fix must be in existing aarch64 hooks (rtx costs
> or spill costs).
>
> Maybe it would help to turn the question around for a minute.  Can we
> describe the cases in which it's *better* for the RA to spill a constant
> address to the stack and reload it, rather than rematerialise on demand?

From the discussion in PR102178 it seems that LRA cannot rematerialize
all "constants" (though here it is constant pool loads).  Some constants
might also not be 'constant'.   See the PR for more fun "spilling" behavior
on x86_64.

It's also said that the chosen alternatives might be the reason that
rematerialization is not chosen, and that alternatives are picked based on
reload heuristics, not based on actual costs.

Richard.

> Thanks,
> Richard
  
Richard Sandiford May 12, 2022, 8:21 a.m. UTC | #9
Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, May 11, 2022 at 2:23 PM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> > Hi Richard,
>> >
>> >> Yeah, I'm not disagreeing with any of that.  It's just a question of
>> >> whether the problem should be fixed by artificially lowering the general
>> >> rtx costs with one particular user (RA spill costs) in mind, or whether
>> >> it should be fixed by making the RA spill code take the factors above
>> >> into account.
>> >
>> > The RA spill code already works fine on immediates but not on address
>> > constants. And the reason is that the current rtx costs for addresses are
>> > set artificially high without justification (I checked the patch that increased
>> > the costs but there was nothing explaining why it was beneficial).
>>
>> But even if the costs are too high, the patch seems to be overcompensating.
>> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.
>>
>> Giving X zero cost means that a sequence like:
>>
>>   (set (reg x0) X)
>>   (set (reg x1) X)
>>
>> should stay as-is rather than be changed to:
>>
>>   (set (reg x0) X)
>>   (set (reg x1) (reg x0))
>>
>> I don't think we want that for multi-instruction constants when
>> optimising for size.
>>
>> > It's certainly possible to experiment with increasing the spill costs, but that
>> > won't improve the issue with address constants unless they are at least doubled.
>> > And it has the effect of halving all rtx costs in the register allocator which is
>> > likely to cause regressions. So we'd need to adjust many rtx costs to keep the
>> > allocator working, plus fix any further regressions this causes.
>>
>> Yeah, I wasn't suggesting that we increase the spill costs.  I'm saying
>> that we should look at whether the target-independent RA heuristics need
>> to change, whether new target hooks are needed, etc.  We shouldn't go
>> into this with the assumption that the target-independent code is
>> invariant and that any fix must be in existing aarch64 hooks (rtx costs
>> or spill costs).
>>
>> Maybe it would help to turn the question around for a minute.  Can we
>> describe the cases in which it's *better* for the RA to spill a constant
>> address to the stack and reload it, rather than rematerialise on demand?
>
> From the discussion in PR102178 it seems that LRA cannot rematerialize
> all "constants" (though here it is constant pool loads).  Some constants
> might also not be 'constant'.   See the PR for more fun "spilling" behavior
> on x86_64.
>
> It's also said that the chosen alternatives might be the reason that
> rematerialization is not chosen, and that alternatives are picked based on
> reload heuristics, not based on actual costs.

Thanks for the pointer.  Yeah, it'd be interesting to know if this
is the same issue, although I fear the only way of knowing for sure
is to fix it first and see whether both targets benefit. ;-)

Richard
  
Wilco Dijkstra May 12, 2022, 2:54 p.m. UTC | #10
Hi,

>> It's also said that the chosen alternatives might be the reason that
>> rematerialization is not chosen, and that alternatives are picked based on
>> reload heuristics, not based on actual costs.
>
> Thanks for the pointer.  Yeah, it'd be interesting to know if this
> is the same issue, although I fear the only way of knowing for sure
> is to fix it first and see whether both targets benefit. ;-)

I don't believe this is the same issue - there are indeed lots of register allocation
problems, many of them caused by the complex design. All the alternatives and register
classes create a huge cross product, making it almost impossible to make good allocation
decisions even if they were accurately costed.

I've found that the correct way to deal with this is to reduce all this choice as much
as possible. That means splitting instructions into simpler ones with fewer alternatives
and register classes. You also need to block the allocator from treating all register
classes as equivalent - on AArch64 we had to force floating-point values to be allocated
to floating-point registers (which is obviously how any register allocator should work
by default), but maybe x86 doesn't do that yet.

Cheers,
Wilco
  
Wilco Dijkstra May 12, 2022, 4 p.m. UTC | #11
Hi Richard,

> But even if the costs are too high, the patch seems to be overcompensating.
> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.

An LDR is not a replacement for ADRP+LDR; you need a store in addition to the
original ADRP+LDR. Basically, a simple spill comes down to comparing these two sequences:

ADRP x0, ...
LDR x0, [x0, ...]
STR x0, [SP, ...]
...
LDR x0, [SP, ...]


ADRP x0, ...
LDR x0, [x0, ...]
...
ADRP x0, ...
LDR x0, [x0, ...]

Obviously it's far cheaper to do the latter than the former.
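
As a hypothetical illustration (function and symbol names invented), this is the
kind of C code where the allocator faces that choice under -fpic: the address of
a global, loaded from the GOT with ADRP+LDR, is needed on both sides of a call,
so it must either live in a callee-saved register, be spilled to the stack (the
first sequence), or be recomputed (the second sequence):

/* Hypothetical example; with -fpic the address of 'counter' is loaded
   via ADRP + LDR from the GOT.  */
extern int counter;
extern void do_work (void);

int
bump_counter (void)
{
  int old = counter;    /* first use of the address of 'counter'      */
  do_work ();           /* the call clobbers caller-saved registers   */
  counter = old + 1;    /* second use: reload from the stack, or redo
                           the ADRP + LDR                             */
  return old;
}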

> Giving X zero cost means that a sequence like:
>
>   (set (reg x0) X)
>   (set (reg x1) X)
>
> should stay as-is rather than be changed to:
>
>   (set (reg x0) X)
>   (set (reg x1) (reg x0))
>
> I don't think we want that for multi-instruction constants when
> optimising for size.

I don't believe this is a real problem. The cost queries for address constants come
from register allocation; I don't see them affecting other optimizations.

> Yeah, I wasn't suggesting that we increase the spill costs.  I'm saying

I'm saying that because we've set the spill costs low on purpose to work around
register allocation bugs. There have been some fixes since, so increasing the spill
costs may now be feasible (but not trivial).

> that we should look at whether the target-independent RA heuristics need
> to change, whether new target hooks are needed, etc.  We shouldn't go
> into this with the assumption that the target-independent code is
> invariant and that any fix must be in existing aarch64 hooks (rtx costs
> or spill costs).

But what bug do you think exists in the target-independent code? It behaves
correctly once we supply more accurate costs. If there were no rematerialization
irrespective of the cost settings, then you could claim there was a bug.

> Maybe it would help to turn the question around for a minute.  Can we
> describe the cases in which it's *better* for the RA to spill a constant
> address to the stack and reload it, rather than rematerialise on demand?

Rematerialization is almost always better than spilling and reloading from the
stack. If the constant requires multiple instructions and there are more than 2
references, it would be better for codesize to spill, but for performance it is
better to rematerialize unless there are many references.
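
As a rough instruction-count sketch of that break-even point (assuming a
2-instruction constant such as ADRP+ADD, and that the spill path uses the
value directly for its first reference):

/* Rematerialization recomputes the constant at every use; spilling
   materializes it once, stores it, and reloads it for the other n - 1 uses.  */
static int remat_insns (int n) { return 2 * n; }
static int spill_insns (int n) { return 2 + 1 + (n - 1); }

/* remat_insns (2) == spill_insns (2) == 4; for n > 2 spilling is smaller,
   matching the "more than 2 references" rule above, but it adds stack
   traffic, so it only helps codesize, not performance.  */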

You also want to prefer rematerialization over spilling a different live range when
other aspects are comparable.

Cheers,
Wilco
  
Richard Sandiford May 12, 2022, 5:20 p.m. UTC | #12
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> But even if the costs are too high, the patch seems to be overcompensating.
>> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.
>
> An LDR is not a replacement for ADRP+LDR; you need a store in addition to the
> original ADRP+LDR. Basically, a simple spill comes down to comparing these two sequences:
>
> ADRP x0, ...
> LDR x0, [x0, ...]
> STR x0, [SP, ...]
> ...
> LDR x0, [SP, ...]
>
>
> ADRP x0, ...
> LDR x0, [x0, ...]
> ...
> ADRP x0, ...
> LDR x0, [x0, ...]
>
> Obviously it's far cheaper to do the latter than the former.

Sure.  Like I say, I'm not disagreeing with the intent of reducing
spilling and promoting rematerialisation.  I agree we should do that.

I'm just disagreeing with the approach of using rtx_costs.  The rtx_cost
hook isn't being asked the question: is spilling this value better than
rematerialising it?  It's being asked for the cost of an operation, on
the understanding that that cost will be compared with the cost of other
operations.  An ADRP+LDR operation then ought to be at least as costly
as an LDR, because in a two-way comparison, it is.

[…]

>> Maybe it would help to turn the question around for a minute.  Can we
>> describe the cases in which it's *better* for the RA to spill a constant
>> address to the stack and reload it, rather than rematerialise on demand?
>
> Rematerialization is almost always better than spilling and reloading from the
> stack. If the constant requires multiple instructions and there are more than 2
> references, it would be better for codesize to spill, but for performance it is
> better to rematerialize unless there are many references.
>
> You also want to prefer rematerialization over spilling a different live range when
> other aspects are comparable.

Yeah, that's what I thought the answer would be.  So the question is:
why is the RA choosing to spill and reload rather than rematerialise
these values?  Does it not know how to rematerialise them, and so we
rely on earlier passes not reusing the constants?  Or does the RA
know how but decide it isn't worthwhile, because of the way that
the RA uses the target costs?  If the latter, I would be much happier with
a new hook that allows the target to force the RA to rematerialise a given
value, if that's the heuristic we want to use when optimising for speed.

Thanks,
Richard
  

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 43d87d1b9c4ef1a85094e51f81745f98f1ef27fb..7341849121ffd6b3b0b77c9730e74e751742e852 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14529,45 +14529,28 @@  cost_plus:
 	  return false;  /* All arguments need to be in registers.  */
 	}
 
+    /* The following costs are used for rematerialization of addresses.
+       Set a low cost for all global accesses - this ensures they are
+       preferred for rematerialization, blocks them from being spilled
+       and reduces register pressure.  The result is significant codesize
+       reductions and performance gains. */
+
     case SYMBOL_REF:
+      *cost = 0;
 
-      if (aarch64_cmodel == AARCH64_CMODEL_LARGE
-	  || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-	{
-	  /* LDR.  */
-	  if (speed)
-	    *cost += extra_cost->ldst.load;
-	}
-      else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-	       || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-	{
-	  /* ADRP, followed by ADD.  */
-	  *cost += COSTS_N_INSNS (1);
-	  if (speed)
-	    *cost += 2 * extra_cost->alu.arith;
-	}
-      else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-	       || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-	{
-	  /* ADR.  */
-	  if (speed)
-	    *cost += extra_cost->alu.arith;
-	}
+      /* Use a separate rematerialization cost for GOT accesses.  */
+      if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC
+	  && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G)
+	*cost = COSTS_N_INSNS (1) / 2;
 
-      if (flag_pic)
-	{
-	  /* One extra load instruction, after accessing the GOT.  */
-	  *cost += COSTS_N_INSNS (1);
-	  if (speed)
-	    *cost += extra_cost->ldst.load;
-	}
       return true;
 
     case HIGH:
+      *cost = 0;
+      return true;
+
     case LO_SUM:
-      /* ADRP/ADD (immediate).  */
-      if (speed)
-	*cost += extra_cost->alu.arith;
+      *cost = COSTS_N_INSNS (3) / 4;
       return true;
 
     case ZERO_EXTRACT: