[RFC] Teach vectorizer to deal with bitfield reads

Message ID 4a6f2350-f070-1473-63a5-3232968d3a89@arm.com

Commit Message

Andre Vieira July 26, 2022, 10 a.m. UTC
  Hi,

This is an RFC for my prototype of bit-field read vectorization. The
patch enables bit-field read vectorization by removing the rejection of
bit-field reads during DR analysis and by adding two vect patterns. The
first transforms COMPONENT_REFs of bit-field FIELD_DECLs into
BIT_FIELD_REFs; this is a temporary measure, as I believe there are plans
to do this lowering earlier in the compiler. To avoid having to wait for
that to happen, we decided to implement this temporary vect pattern.
The second looks for conversions of the result of BIT_FIELD_REFs
from a 'partial' type to a 'full' type and transforms them into a
'full-type' load followed by the necessary shifting and masking.

The patch is not perfect; one thing I'd like to change, for instance, is
the way the 'full-type' load is represented. I currently abuse the fact
that the vectorizer transforms the original COMPONENT_REF of a bit-field
FIELD_DECL into a full-type vector load, because it uses the smallest
mode necessary for that precision. I do this because I was struggling to
construct a MEM_REF that the vectorizer would accept, and this, for some
reason, seemed to work... I'd appreciate some pointers on how to do this
properly :)

Another aspect I haven't paid much attention to yet is costing; I've
noticed some testcases fail to vectorize due to costing decisions that I
suspect are wrong, but, like I said, I haven't looked into it closely.

Finally, another aspect I'd eventually like to tackle is sinking the
masking out of the loop when possible; for instance, in bit-field-read-3.c
the masking does not need to be inside the loop because we are only doing
bitwise operations. Though I suspect we are better off handling things
like this elsewhere, maybe where we do codegen for the reduction...
I haven't looked at this at all yet.

Let me know whether you believe this is a good approach. I've run
regression tests and this hasn't broken anything so far...

Kind regards,
Andre

PS: Once we do have lowering of bit-field FIELD_DECLs to BIT_FIELD_REFs
earlier in the compiler, I suspect we will require some further changes
to the DR analysis part, but that's difficult to test right now.
  

Comments

Richard Biener July 27, 2022, 11:37 a.m. UTC | #1
On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:

> Hi,
> 
> This is a RFC for my prototype for bitfield read vectorization. This patch
> enables bit-field read vectorization by removing the rejection of bit-field
> read's during DR analysis and by adding two vect patterns. The first one
> transforms TREE_COMPONENT's with BIT_FIELD_DECL's into BIT_FIELD_REF's, this
> is a temporary one as I believe there are plans to do this lowering earlier in
> the compiler. To avoid having to wait for that to happen we decided to
> implement this temporary vect pattern.
> The second one looks for conversions of the result of BIT_FIELD_REF's from a
> 'partial' type to a 'full-type' and transforms it into a 'full-type' load
> followed by the necessary shifting and masking.
> 
> The patch is not perfect, one thing I'd like to change for instance is the way
> the 'full-type' load is represented. I currently abuse the fact that the
> vectorizer transforms the original TREE_COMPONENT with a BIT_FIELD_DECL into a
> full-type vector load, because it uses the smallest mode necessary for that
> precision. The reason I do this is because I was struggling to construct a
> MEM_REF that the vectorizer would accept and this for some reason seemed to
> work ... I'd appreciate some pointers on how to do this properly :)
> 
> Another aspect that I haven't paid much attention to yet is costing, I've
> noticed some testcases fail to vectorize due to costing where I think it might
> be wrong, but like I said, I haven't paid much attention to it.
> 
> Finally another aspect I'd eventually like to tackle is the sinking of the
> masking when possible, for instance in bit-field-read-3.c the 'masking' does
> not need to be inside the loop because we are doing bitwise operations. Though
> I suspect we are better off looking at things like this elsewhere, maybe where
> we do codegen for the reduction... Haven't looked at this at all yet.
> 
> Let me know if you believe this is a good approach? I've ran regression tests
> and this hasn't broken anything so far...

I don't think this is a good approach given what you gain and how
limited it will necessarily be.  As with the recent experiment in
handling _Complex loads/stores, this is much better tackled by lowering
things earlier (it will be lowered at RTL expansion time anyway).

One place to do this experimentation would be to piggy-back on the
if-conversion pass so the lowering would happen only on the
vectorized code path.  Note that the desired lowering would look like
the following for reads:

  _1 = a.b;

to

  _2 = a.<representative for b>;
  _1 = BIT_FIELD_REF <_2, ...>; // extract bits

and for writes:

  a.b = _1;

to

  _2 = a.<representative for b>;
  _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
  a.<representative for b> = _3;

so you trade currently handled loads/stores for currently unhandled
BIT_FIELD_REF / BIT_INSERT_EXPR, which you would then need to
pattern-match to shifts and logical ops in the vectorizer.

There's a separate thing of actually promoting all uses, for
example

struct { long long x : 33; } a;

 a.x = a.x + 1;

will get you 33-bit precision adds (bit-fields narrower than 32 bits
get promoted to int, but larger ones do not).  RTL expansion will again
rewrite this into wider ops plus masking.

So I think the time is better spent working on the lowering of
bit-field accesses; if sufficiently separated, it could be used
from if-conversion by working on loop SEME regions.  The patches
from previous implementation attempts are probably not too useful
anymore (I found one from 2011 and one from 2016, both pre-dating
BIT_INSERT_EXPR).

Richard.
  
Andre Vieira July 29, 2022, 8:57 a.m. UTC | #2
Hi Richard,

Thanks for the review. I don't completely understand all of the below,
so I've added some extra questions to help me understand :)

On 27/07/2022 12:37, Richard Biener wrote:
> On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:
>
> I don't think this is a good approach for what you gain and how
> necessarily limited it will be.  Similar to the recent experiment with
> handling _Complex loads/stores this is much better tackled by lowering
> things earlier (it will be lowered at RTL expansion time).
I assume the approach you are referring to here is the lowering of the
bit-field FIELD_DECL to BIT_FIELD_REF in the vect_recog part of the
vectorizer. I am all for lowering earlier; the reason I did it there was
as a 'temporary' approach until we have that earlier lowering.
>
> One place to do this experimentation would be to piggy-back on the
> if-conversion pass so the lowering would happen only on the
> vectorized code path.
This was one of my initial thoughts, though the if-conversion changes
are a bit more intrusive for a temporary approach and not that much
earlier. It does, however, have the added benefit of not requiring any
later changes to the vectorizer itself if we do do the earlier
lowering, assuming the lowering results in the same IL.

The 'only on the vectorized code path' property remains the same
though, as vect_recog also only happens on the vectorized code path,
right?
>   Note that the desired lowering would look like
> the following for reads:
>
>    _1 = a.b;
>
> to
>
>    _2 = a.<representative for b>;
>    _1 = BIT_FIELD_REF <2, ...>; // extract bits
I don't yet have a well-formed idea of what '<representative for b>' is
supposed to look like in terms of tree expressions. I understand what
it's supposed to represent, the 'larger than bit-field' load. But is it
going to be a COMPONENT_REF with a fake FIELD_DECL of the larger size?
Like I said on IRC, the description of BIT_FIELD_REF makes it sound like
this isn't how we are supposed to use it; are we intending to change
that here?

> and for writes:
>
>    a.b = _1;
>
> to
>
>    _2 = a.<representative for b>;
>    _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
>    a.<representative for b> = _3;
I was going to avoid writes for now because they are somewhat more
complicated, but maybe it's not that bad; I'll add them too.
> so you trade now handled loads/stores with not handled
> BIT_FIELD_REF / BIT_INSERT_EXPR which you would then need to
> pattern match to shifts and logical ops in the vectorizer.
Yeah, that vect_recog pattern already exists in my RFC patch, though I
can probably simplify it by moving the bit-field-ref handling to ifcvt.
>
> There's a separate thing of actually promoting all uses, for
> example
>
> struct { long long x : 33; } a;
>
>   a.a = a.a + 1;
>
> will get you 33bit precision adds (for bit fields less than 32bits
> they get promoted to int but not for larger bit fields).  RTL
> expansion again will rewrite this into larger ops plus masking.
I'm not sure I understand why this is relevant here. The current way I
am doing this would likely lower a bit-field like that to a 64-bit load
followed by masking away the top 31 bits; the same would happen with an
ifcvt-lowering approach.
>
> So I think the time is better spent in working on the lowering of
> bitfield accesses, if sufficiently separated it could be used
> from if-conversion by working on loop SEME regions.
I will start to look at modifying ifcvt to add the lowering there. It
will likely require two passes though, because we can no longer look at
the number of BBs to determine whether ifcvt is even needed; we will
first need to look for bit-field accesses, then version the loops, and
then look for them again for transformation. But I guess that's fine?
> The patches
> doing previous implementations are probably not too useful anymore
> (I find one from 2011 and one from 2016, both pre-dating BIT_INSERT_EXPR)
>
> Richard.
  
Richard Biener July 29, 2022, 9:11 a.m. UTC | #3
On Fri, 29 Jul 2022, Andre Vieira (lists) wrote:

> Hi Richard,
> 
> Thanks for the review, I don't completely understand all of the below, so I
> added some extra questions to help me understand :)
> 
> On 27/07/2022 12:37, Richard Biener wrote:
> > On Tue, 26 Jul 2022, Andre Vieira (lists) wrote:
> >
> > I don't think this is a good approach for what you gain and how
> > necessarily limited it will be.  Similar to the recent experiment with
> > handling _Complex loads/stores this is much better tackled by lowering
> > things earlier (it will be lowered at RTL expansion time).
> I assume the approach you are referring to here is the lowering of the
> BIT_FIELD_DECL to BIT_FIELD_REF in the vect_recog part of the vectorizer. I am
> all for lowering earlier, the reason I did it there was as a 'temporary'
> approach until we have that earlier loading.

I understood, but "temporary" things in GCC tend to still be around
10 years later, so ...

> >
> > One place to do this experimentation would be to piggy-back on the
> > if-conversion pass so the lowering would happen only on the
> > vectorized code path.
> This was one of my initial thoughts, though the if-conversion changes are a
> bit more intrusive for a temporary approach and not that much earlier. It does
> however have the added benefit of not having to make any changes to the
> vectorizer itself later if we do do the earlier lowering, assuming the
> lowering results in the same.
> 
> The 'only on the vectorized code path' remains the same though as vect_recog
> also only happens on the vectorized code path right?
> >   Note that the desired lowering would look like
> > the following for reads:
> >
> >    _1 = a.b;
> >
> > to
> >
> >    _2 = a.<representative for b>;
> >    _1 = BIT_FIELD_REF <2, ...>; // extract bits
> I don't yet have a well formed idea of what '<representative for b>' is
> supposed to look like in terms of tree expressions. I understand what it's
> supposed to be representing, the 'larger than bit-field'-load. But is it going
> to be a COMPONENT_REF with a fake 'FIELD_DECL' with the larger size? Like I
> said on IRC, the description of BIT_FIELD_REF makes it sound like this isn't
> how we are supposed to use it, are we intending to make a change to that here?

<representative for b> is what DECL_BIT_FIELD_REPRESENTATIVE
(FIELD_DECL-for-b) gives you; it is a "fake" FIELD_DECL for the underlying
storage, conveniently made available to you by the middle-end.  For
your 31-bit field it would simply be 'int' typed.

The BIT_FIELD_REF then extracts the actual bit-field from the underlying
storage, but it's now no longer operating on memory but on the register
holding the underlying data.  In the vectorizer we'd probably have to
pattern-match this to shifts & masks and hope for the conversion to
combine with a later one.

> > and for writes:
> >
> >    a.b = _1;
> >
> > to
> >
> >    _2 = a.<representative for b>;
> >    _3 = BIT_INSERT_EXPR <_2, _1, ...>; // insert bits
> >    a.<representative for b> = _3;
> I was going to avoid writes for now because they are somewhat more
> complicated, but maybe it's not that bad, I'll add them too.

Only handling loads at the start is probably fine as an experiment, but
handling stores should be straightforward - of course the
BIT_INSERT_EXPR lowering to shifts & masks will be more
complicated.

> > so you trade now handled loads/stores with not handled
> > BIT_FIELD_REF / BIT_INSERT_EXPR which you would then need to
> > pattern match to shifts and logical ops in the vectorizer.
> Yeah that vect_recog pattern already exists in my RFC patch, though I can
> probably simplify it by moving the bit-field-ref stuff to ifcvt.
> >
> > There's a separate thing of actually promoting all uses, for
> > example
> >
> > struct { long long x : 33; } a;
> >
> >   a.a = a.a + 1;
> >
> > will get you 33bit precision adds (for bit fields less than 32bits
> > they get promoted to int but not for larger bit fields).  RTL
> > expansion again will rewrite this into larger ops plus masking.
> Not sure I understand why this is relevant here? The current way I am doing
> this would likely lower a  bit-field like that to a 64-bit load  followed by
> the masking away of the top 31 bits, same would happen with a ifcvt-lowering
> approach.

Yes, but if there's anything besides loading or storing you will have
a conversion from, say, int:31 to int in the IL before any arithmetic.
I've not looked, but your patch probably handles conversions to/from
bit-field types by masking / extending.  What I meant with the
33-bit example is that there you can have arithmetic in 33 bits
_without_ intermediate conversions, so you'd have to properly truncate
after every such operation (or choose not to vectorize, which I think
is what would happen now).

> >
> > So I think the time is better spent in working on the lowering of
> > bitfield accesses, if sufficiently separated it could be used
> > from if-conversion by working on loop SEME regions.
> I will start to look at modifying ifcvt to add the lowering there. Will likely
> require two pass though because we can no longer look at the number of BBs to
> determine whether ifcvt is even needed, so we will first need to look for
> bit-field-decls, then version the loops and then look for them again for
> transformation, but I guess that's fine?

Ah, yeah - I guess that's fine.  Just add a need_to_lower_bitfields
alongside need_to_predicate and friends.

Richard.

> > The patches
> > doing previous implementations are probably not too useful anymore
> > (I find one from 2011 and one from 2016, both pre-dating BIT_INSERT_EXPR)
> >
> > Richard.
> 
>
  
Jakub Jelinek July 29, 2022, 10:31 a.m. UTC | #4
On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via Gcc-patches wrote:
> The 'only on the vectorized code path' remains the same though as vect_recog
> also only happens on the vectorized code path right?

If-conversion (in some cases) duplicates a loop and guards one copy with
an ifn which resolves to true if that particular loop is vectorized and
false otherwise.  Changes that shouldn't be done in case of
vectorization failure can then be done on the for-vectorizer-only copy
of the loop.

	Jakub
  
Richard Biener July 29, 2022, 10:52 a.m. UTC | #5
On Fri, 29 Jul 2022, Jakub Jelinek wrote:

> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via Gcc-patches wrote:
> > The 'only on the vectorized code path' remains the same though as vect_recog
> > also only happens on the vectorized code path right?
> 
> if conversion (in some cases) duplicates a loop and guards one copy with
> an ifn which resolves to true if that particular loop is vectorized and
> false otherwise.  So, then changes that shouldn't be done in case of
> vectorization failure can be done on the for vectorizer only copy of the
> loop.

And just to mention, one issue with lowering of bit-field accesses
is bit-field inserts, which on some architectures (hello m68k) have
instructions operating on memory directly.  For those it's difficult
not to regress in code quality if a bit-field store becomes a
read-modify-write cycle.  That's one of the things holding this
back.  One idea would be to lower to .INSV directly for those targets
(though the presence of insv doesn't necessarily indicate support for
memory destinations).

Richard.
  
Andre Vieira Aug. 1, 2022, 10:13 a.m. UTC | #6
On 29/07/2022 11:31, Jakub Jelinek wrote:
> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via Gcc-patches wrote:
>> The 'only on the vectorized code path' remains the same though as vect_recog
>> also only happens on the vectorized code path right?
> if conversion (in some cases) duplicates a loop and guards one copy with
> an ifn which resolves to true if that particular loop is vectorized and
> false otherwise.  So, then changes that shouldn't be done in case of
> vectorization failure can be done on the for vectorizer only copy of the
> loop.
>
> 	Jakub
I'm pretty sure vect_recog patterns have no effect on scalar codegen if
vectorization fails, too. The patterns live as new vect_stmt_infos and
no changes are actually made to the scalar loop. That was the point I
was trying to make, but it doesn't matter much; as I said, I am happy
to do this in if-conversion.
  
Andre Vieira Aug. 1, 2022, 10:21 a.m. UTC | #7
On 29/07/2022 11:52, Richard Biener wrote:
> On Fri, 29 Jul 2022, Jakub Jelinek wrote:
>
>> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via Gcc-patches wrote:
>>> The 'only on the vectorized code path' remains the same though as vect_recog
>>> also only happens on the vectorized code path right?
>> if conversion (in some cases) duplicates a loop and guards one copy with
>> an ifn which resolves to true if that particular loop is vectorized and
>> false otherwise.  So, then changes that shouldn't be done in case of
>> vectorization failure can be done on the for vectorizer only copy of the
>> loop.
> And just to mention, one issue with lowering of bitfield accesses
> is bitfield inserts which, on some architectures (hello m68k) have
> instructions operating on memory directly.  For those it's difficult
> to not regress in code quality if a bitfield store becomes a
> read-modify-write cycle.  That's one of the things holding this
> back.  One idea would be to lower to .INSV directly for those targets
> (but presence of insv isn't necessarily indicating support for
> memory destinations).
>
> Richard.
Should I account for that when vectorizing, though? From what I can tell
(no TARGET_VECTOR_* hooks implemented), m68k does not have vectorization
support. So the question is: are there currently any targets that
vectorize and have vector bit-field insert/extract support? If they
don't exist, I suggest we worry about it when it comes around, if only
because we wouldn't be able to test it right now.

If this is about not lowering on the non-vectorized path, see my
previous reply; I never intended to do that in the vectorizer. I just
thought the plan was to do lowering eventually.
  
Richard Biener Aug. 1, 2022, 1:16 p.m. UTC | #8
On Mon, 1 Aug 2022, Andre Vieira (lists) wrote:

> 
> On 29/07/2022 11:52, Richard Biener wrote:
> > On Fri, 29 Jul 2022, Jakub Jelinek wrote:
> >
> >> On Fri, Jul 29, 2022 at 09:57:29AM +0100, Andre Vieira (lists) via
> >> Gcc-patches wrote:
> >>> The 'only on the vectorized code path' remains the same though as
> >>> vect_recog
> >>> also only happens on the vectorized code path right?
> >> if conversion (in some cases) duplicates a loop and guards one copy with
> >> an ifn which resolves to true if that particular loop is vectorized and
> >> false otherwise.  So, then changes that shouldn't be done in case of
> >> vectorization failure can be done on the for vectorizer only copy of the
> >> loop.
> > And just to mention, one issue with lowering of bitfield accesses
> > is bitfield inserts which, on some architectures (hello m68k) have
> > instructions operating on memory directly.  For those it's difficult
> > to not regress in code quality if a bitfield store becomes a
> > read-modify-write cycle.  That's one of the things holding this
> > back.  One idea would be to lower to .INSV directly for those targets
> > (but presence of insv isn't necessarily indicating support for
> > memory destinations).
> >
> > Richard.
> Should I account for that when vectorizing though? From what I can tell (no
> TARGET_VECTOR_* hooks implemented) m68k does not have vectorization support.

No.

> So the question is, are there currently any targets that vectorize and have
> vector bitfield-insert/extract support? If they don't exist I suggest we worry
> about it when it comes around, if not just for the fact that we wouldn't be
> able to test it right now.
> 
> If this is about not lowering on the non-vectorized path, see my previous
> reply, I never intended to do that in the vectorizer. I just thought it was
> the plan to do lowering eventually.

Yes, for the vectorized path this all isn't an issue - and btw, the
advantage of if-conversion is that you get VN of the result
"for free": the RMW cycles of bit-field stores likely have reads to
share (and also redundant stores in the end, but ...).

And yes, the plan was to do lowering generally.  Just the simplistic
approaches (my last one was a lowering pass somewhen after IPA, IIRC
combined with SRA) ran into some issues, like the one on m68k, but IIRC
also some others.  So I wouldn't hold my breath; somebody just needs
to do the work and think about how to deal with m68k and the
likes...

Richard.
  
Eric Botcazou Oct. 12, 2022, 9:02 a.m. UTC | #9
> Let me know if you believe this is a good approach? I've ran regression
> tests and this hasn't broken anything so far...

Small regression in Ada though, probably a missing guard somewhere:

                === gnat tests ===


Running target unix
FAIL: gnat.dg/loop_optimization23.adb 3 blank line(s) in output
FAIL: gnat.dg/loop_optimization23.adb (test for excess errors)
UNRESOLVED: gnat.dg/loop_optimization23.adb compilation failed to produce executable
FAIL: gnat.dg/loop_optimization23_pkg.adb 3 blank line(s) in output
FAIL: gnat.dg/loop_optimization23_pkg.adb (test for excess errors)

In order to reproduce, configure the compiler with Ada enabled, build it, and 
copy $(srcdir)/gcc/testsuite/gnat.dg/loop_optimization23_pkg.ad[sb] into the 
build directory, then just issue:

gcc/gnat1 -quiet loop_optimization23_pkg.adb -O2 -Igcc/ada/rts

eric@fomalhaut:~/build/gcc/native> gcc/gnat1 -quiet loop_optimization23_pkg.adb -O2 -Igcc/ada/rts
during GIMPLE pass: vect
+===========================GNAT BUG DETECTED==============================+
| 13.0.0 20221012 (experimental) [master ca7f7c3f140] (x86_64-suse-linux) GCC error:|
| in exact_div, at poly-int.h:2232                                         |
| Error detected around loop_optimization23_pkg.adb:5:3                    |
| Compiling loop_optimization23_pkg.adb
  

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bitfield-read-1.c b/gcc/testsuite/gcc.dg/vect/bitfield-read-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..c30aad365c40474109748bd03c3a5ca1d10723ed
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bitfield-read-1.c
@@ -0,0 +1,38 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+extern void abort(void);
+
+struct s { int i : 31; };
+
+#define ELT0 {0}
+#define ELT1 {1}
+#define ELT2 {2}
+#define ELT3 {3}
+#define N 32
+#define RES 48
+struct s A[N]
+  = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};
+
+int f(struct s *ptr, unsigned n) {
+    int res = 0;
+    for (int i = 0; i < n; ++i)
+      res += ptr[i].i;
+    return res;
+}
+
+int main (void)
+{
+  check_vect ();
+
+  if (f(&A[0], N) != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bitfield-read-2.c b/gcc/testsuite/gcc.dg/vect/bitfield-read-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..ab82ff347c55e78d098d194d739bcd9d7737f777
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bitfield-read-2.c
@@ -0,0 +1,42 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+extern void abort(void);
+
+struct s {
+    unsigned i : 31;
+    char a : 4;
+};
+
+#define N 32
+#define ELT0 {0x7FFFFFFFUL, 0}
+#define ELT1 {0x7FFFFFFFUL, 1}
+#define ELT2 {0x7FFFFFFFUL, 2}
+#define ELT3 {0x7FFFFFFFUL, 3}
+#define RES 48
+struct s A[N]
+  = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};
+
+int f(struct s *ptr, unsigned n) {
+    int res = 0;
+    for (int i = 0; i < n; ++i)
+      res += ptr[i].a;
+    return res;
+}
+
+int main (void)
+{
+  check_vect ();
+
+  if (f(&A[0], N) != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bitfield-read-3.c b/gcc/testsuite/gcc.dg/vect/bitfield-read-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..216611a29fd8bbfbafdbdb79d790e520f44ba672
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bitfield-read-3.c
@@ -0,0 +1,43 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+#include <stdbool.h>
+
+extern void abort(void);
+
+typedef struct {
+    int  c;
+    int  b;
+    bool a : 1;
+} struct_t;
+
+#define N 16
+#define ELT_F { 0xFFFFFFFF, 0xFFFFFFFF, 0 }
+#define ELT_T { 0xFFFFFFFF, 0xFFFFFFFF, 1 }
+
+struct_t vect_false[N] = { ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F,
+			   ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F  };
+struct_t vect_true[N]  = { ELT_F, ELT_F, ELT_T, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F,
+			   ELT_F, ELT_F, ELT_T, ELT_F, ELT_F, ELT_F, ELT_F, ELT_F  };
+int main (void)
+{
+  unsigned ret = 0;
+  for (unsigned i = 0; i < N; i++)
+  {
+      ret |= vect_false[i].a;
+  }
+  if (ret)
+    abort ();
+
+  for (unsigned i = 0; i < N; i++)
+  {
+      ret |= vect_true[i].a;
+  }
+  if (!ret)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bitfield-read-4.c b/gcc/testsuite/gcc.dg/vect/bitfield-read-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..fa566b4411e0da16f617f092eb49cceccbe7ca90
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bitfield-read-4.c
@@ -0,0 +1,44 @@ 
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include "tree-vect.h"
+
+extern void abort(void);
+
+struct s {
+    unsigned i : 31;
+    char x : 2;
+    char a : 4;
+};
+
+#define N 32
+#define ELT0 {0x7FFFFFFFUL, 3, 0}
+#define ELT1 {0x7FFFFFFFUL, 3, 1}
+#define ELT2 {0x7FFFFFFFUL, 3, 2}
+#define ELT3 {0x7FFFFFFFUL, 3, 3}
+#define RES 48
+struct s A[N]
+  = { ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3,
+      ELT0, ELT1, ELT2, ELT3, ELT0, ELT1, ELT2, ELT3};
+
+int f(struct s *ptr, unsigned n) {
+    int res = 0;
+    for (int i = 0; i < n; ++i)
+      res += ptr[i].a;
+    return res;
+}
+
+int main (void)
+{
+  check_vect ();
+
+  if (f(&A[0], N) != RES)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+
diff --git a/gcc/tree-data-ref.cc b/gcc/tree-data-ref.cc
index ff9327f6deb2bb85abbd3853dca9c666699e7a37..a27ec37a456b0c726221767a4b5e52a74057ae23 100644
--- a/gcc/tree-data-ref.cc
+++ b/gcc/tree-data-ref.cc
@@ -1137,7 +1137,22 @@  dr_analyze_innermost (innermost_loop_behavior *drb, tree ref,
   gcc_assert (base != NULL_TREE);
 
   poly_int64 pbytepos;
-  if (!multiple_p (pbitpos, BITS_PER_UNIT, &pbytepos))
+  /* If we are dealing with a bit-field reference then PBITPOS may not be
+     a multiple of BITS_PER_UNIT.  Set PBYTEPOS to 0 if PBITPOS is smaller
+     than a byte, and otherwise to the largest number of whole bytes that
+     fit in PBITPOS, i.e. PBITPOS / BITS_PER_UNIT rounded down.  */
+  if (TREE_CODE (ref) == COMPONENT_REF
+      && DECL_BIT_FIELD (TREE_OPERAND (ref, 1)))
+    {
+      if (!multiple_p (pbitpos, BITS_PER_UNIT, &pbytepos))
+	{
+	  if (known_lt (pbitpos, BITS_PER_UNIT))
+	   pbytepos = 0;
+	  else
+	   can_div_trunc_p (pbitpos, BITS_PER_UNIT, &pbytepos);
+	}
+    }
+  else if (!multiple_p (pbitpos, BITS_PER_UNIT, &pbytepos))
     return opt_result::failure_at (stmt,
 				   "failed: bit offset alignment.\n");
 
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index b279a82551eb70379804d405983ae5dc44b66bf5..cb610db9ca57e5179825e67be5aeb6af98d01aa0 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -4016,7 +4016,18 @@  vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
   if (reversep)
     return false;
 
-  poly_int64 pbytepos = exact_div (pbitpos, BITS_PER_UNIT);
+  /* If we are dealing with a bit-field reference then PBITPOS may not be
+     a multiple of BITS_PER_UNIT.  Set PBYTEPOS to 0 if PBITPOS is smaller
+     than a byte, and otherwise to the largest number of whole bytes that
+     fit in PBITPOS, i.e. PBITPOS / BITS_PER_UNIT rounded down.  */
+  poly_int64 pbytepos;
+  if (!multiple_p (pbitpos, BITS_PER_UNIT, &pbytepos))
+    {
+      if (known_lt (pbitpos, BITS_PER_UNIT))
+	pbytepos = 0;
+      else
+	can_div_trunc_p (pbitpos, BITS_PER_UNIT, &pbytepos);
+    }
 
   if (TREE_CODE (base) == MEM_REF)
     {
@@ -4296,7 +4307,8 @@  vect_find_stmt_data_reference (loop_p loop, gimple *stmt,
       }
 
   if (TREE_CODE (DR_REF (dr)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (dr), 1)))
+      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (dr), 1))
+      && !DR_IS_READ (dr))
     {
       free_data_ref (dr);
       return opt_result::failure_at (stmt,
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index dfbfb71b3c69a0205ccc1b287cb50fa02a70942e..3f64b23888086f61e5ebf928a7ee0c6ed78bde15 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -35,6 +35,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimplify.h"
 #include "gimple-iterator.h"
+#include "gimplify-me.h"
 #include "cfgloop.h"
 #include "tree-vectorizer.h"
 #include "dumpfile.h"
@@ -1828,6 +1829,157 @@  vect_recog_widen_sum_pattern (vec_info *vinfo,
   return pattern_stmt;
 }
 
+static gimple *
+vect_recog_bit_field_decl (vec_info * /* vinfo */, stmt_vec_info last_stmt_info,
+			 tree *type_out)
+{
+  gassign *decl_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
+
+  if (!decl_stmt)
+    return NULL;
+
+  data_reference *dr = STMT_VINFO_DATA_REF (last_stmt_info);
+  if (!dr)
+    return NULL;
+
+  if (TREE_CODE (DR_REF (dr)) != COMPONENT_REF
+      || !DECL_BIT_FIELD (TREE_OPERAND (DR_REF (dr), 1))
+      || !DR_IS_READ (dr))
+    return NULL;
+
+  tree type = TREE_TYPE (gimple_get_lhs (decl_stmt));
+  tree record = TREE_OPERAND (DR_REF (dr), 0);
+  tree decl_field = TREE_OPERAND (DR_REF (dr), 1);
+  tree offset = fold_build2 (PLUS_EXPR, sizetype,
+			     DECL_FIELD_OFFSET (decl_field),
+			     DECL_FIELD_BIT_OFFSET (decl_field));
+  tree bf_ref = fold_build3 (BIT_FIELD_REF, type,
+			     record,
+			     build_int_cst (sizetype,
+			     TYPE_PRECISION (type)),
+			     offset);
+  gimple *pattern_stmt
+    = gimple_build_assign (vect_recog_temp_ssa_var (type, NULL), bf_ref);
+
+  *type_out = STMT_VINFO_VECTYPE (last_stmt_info);
+
+  vect_pattern_detected ("bit_field_decl pattern", last_stmt_info->stmt);
+
+  return pattern_stmt;
+}
+
+static gimple *
+vect_recog_bit_field_ref (vec_info *vinfo, stmt_vec_info last_stmt_info,
+			  tree *type_out)
+{
+  gassign *nop_stmt = dyn_cast <gassign *> (last_stmt_info->stmt);
+  if (!nop_stmt)
+    return NULL;
+
+  if (gimple_assign_rhs_code (nop_stmt) != NOP_EXPR)
+    return NULL;
+
+  if (TREE_CODE (gimple_assign_rhs1 (nop_stmt)) != SSA_NAME)
+    return NULL;
+
+  gassign *bf_stmt
+    = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (gimple_assign_rhs1 (nop_stmt)));
+
+  if (!bf_stmt)
+    return NULL;
+
+  stmt_vec_info bf_stmt_info = vinfo->lookup_stmt (bf_stmt);
+  if (gimple_assign_rhs_code (bf_stmt) != BIT_FIELD_REF
+      && (!bf_stmt_info || !STMT_VINFO_IN_PATTERN_P (bf_stmt_info)))
+    return NULL;
+
+  if (gimple_assign_rhs_code (bf_stmt) != BIT_FIELD_REF)
+    {
+      if (!STMT_VINFO_RELATED_STMT (bf_stmt_info))
+	return NULL;
+      bf_stmt
+	= dyn_cast <gassign *> (STMT_VINFO_RELATED_STMT (bf_stmt_info)->stmt);
+    }
+
+  if (!bf_stmt || gimple_assign_rhs_code (bf_stmt) != BIT_FIELD_REF)
+    return NULL;
+
+  /* This is weird; why is rhs1 still a BIT_FIELD_REF here?  */
+  tree bf_ref = gimple_assign_rhs1 (bf_stmt);
+
+  tree record = TREE_OPERAND (bf_ref, 0);
+  tree size = TREE_OPERAND (bf_ref, 1);
+  tree offset = TREE_OPERAND (bf_ref, 2);
+
+  tree bf_type = TREE_TYPE (bf_ref);
+  unsigned HOST_WIDE_INT bf_precision = TYPE_PRECISION (bf_type);
+  unsigned HOST_WIDE_INT load_size
+    = CEIL (bf_precision, BITS_PER_UNIT) * BITS_PER_UNIT;
+
+  if (bf_precision == load_size)
+    return NULL;
+
+  tree addr = build1 (ADDR_EXPR, build_pointer_type (TREE_TYPE (record)),
+		      record);
+
+  addr = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (addr), addr,
+		      fold_convert (sizetype, offset));
+
+  tree load_type = build_nonstandard_integer_type (load_size, 1);
+  tree vectype = get_vectype_for_scalar_type (vinfo, load_type);
+  tree lhs = vect_recog_temp_ssa_var (load_type, NULL);
+
+  data_reference *dr = STMT_VINFO_DATA_REF (bf_stmt_info);
+  /* TODO: Fix this.  Rather than using the DR_REF here I'd like to
+     reconstruct the desired load, instead of relying on the 'misguided?'
+     behaviour of the vectorizer to vectorize these as normal loads.
+     However, when I tried that it led the vectorizer to think it needed
+     to vectorize the address computation too.  */
+  gimple *pattern_stmt = gimple_build_assign (lhs, DR_REF (dr));
+  gimple *load_stmt = pattern_stmt;
+
+  tree ret_type = TREE_TYPE (gimple_get_lhs (nop_stmt));
+  if (!useless_type_conversion_p (TREE_TYPE (lhs), ret_type))
+    {
+      append_pattern_def_seq (vinfo, bf_stmt_info, pattern_stmt,
+			      vectype);
+      pattern_stmt
+	= gimple_build_assign (vect_recog_temp_ssa_var (ret_type, NULL),
+			       NOP_EXPR, lhs);
+      lhs = gimple_get_lhs (pattern_stmt);
+      vectype = STMT_VINFO_VECTYPE (last_stmt_info);
+    }
+  vect_mark_pattern_stmts (vinfo, bf_stmt_info, pattern_stmt, vectype);
+
+  stmt_vec_info load_stmt_info = vinfo->lookup_stmt (load_stmt);
+  vinfo->move_dr (load_stmt_info, bf_stmt_info);
+
+  unsigned HOST_WIDE_INT offset_i = tree_to_uhwi (offset);
+  unsigned HOST_WIDE_INT shift_n = offset_i % load_size;
+
+  if (shift_n)
+    {
+      pattern_stmt
+	= gimple_build_assign (vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL),
+			       RSHIFT_EXPR, lhs,
+			       build_int_cst (integer_type_node, shift_n));
+      append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
+      lhs = gimple_get_lhs (pattern_stmt);
+    }
+
+  unsigned HOST_WIDE_INT mask_i = tree_to_uhwi (size);
+  tree mask = build_int_cst (TREE_TYPE (lhs), (1ULL << mask_i) - 1);
+  pattern_stmt
+    = gimple_build_assign (vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL),
+			   BIT_AND_EXPR, lhs, mask);
+
+  *type_out = STMT_VINFO_VECTYPE (last_stmt_info);
+  vect_pattern_detected ("bit_field_ref pattern", last_stmt_info->stmt);
+
+  return pattern_stmt;
+}
+
+
 /* Recognize cases in which an operation is performed in one type WTYPE
    but could be done more efficiently in a narrower type NTYPE.  For example,
    if we have:
@@ -5623,6 +5775,8 @@  struct vect_recog_func
    taken which means usually the more complex one needs to preceed the
    less comples onex (widen_sum only after dot_prod or sad for example).  */
 static vect_recog_func vect_vect_recog_func_ptrs[] = {
+  { vect_recog_bit_field_decl, "bit_field_decl" },
+  { vect_recog_bit_field_ref, "bit_field_ref" },
   { vect_recog_over_widening_pattern, "over_widening" },
   /* Must come after over_widening, which narrows the shift as much as
      possible beforehand.  */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index f582d238984fbd083650a45d87997f72b6cd3839..c06e96ba2973d3048a754b1c15cfae917a35e271 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4933,19 +4933,6 @@  vectorizable_conversion (vec_info *vinfo,
 	       && SCALAR_FLOAT_TYPE_P (rhs_type))))
     return false;
 
-  if (!VECTOR_BOOLEAN_TYPE_P (vectype_out)
-      && ((INTEGRAL_TYPE_P (lhs_type)
-	   && !type_has_mode_precision_p (lhs_type))
-	  || (INTEGRAL_TYPE_P (rhs_type)
-	      && !type_has_mode_precision_p (rhs_type))))
-    {
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "type conversion to/from bit-precision unsupported."
-                         "\n");
-      return false;
-    }
-
   if (op_type == binary_op)
     {
       gcc_assert (code == WIDEN_MULT_EXPR || code == WIDEN_LSHIFT_EXPR