RFC: Sort tarballs created by the src-release.sh script

Message ID 02ea8f50-7b68-6c4e-75db-e919121e8707@redhat.com
State New
Headers
Series RFC: Sort tarballs created by the src-release.sh script |

Commit Message

Nick Clifton Sept. 28, 2022, 12:59 p.m. UTC
  Hi Guys,

   A contributor recently pointed out that the binutils release tarballs are
   not sorted by name, making it hard to reproducibly recreate them.  At first
   we thought that using tar's --sort=name option would solve this, but it
   turns out that the src-release.sh script creates its own list of input file
   names, so instead I am proposing this patch:

   Any comments or objections ?

Cheers
   Nick
  

Comments

Andreas Schwab Sept. 28, 2022, 1:05 p.m. UTC | #1
On Sep 28 2022, Nick Clifton via Gdb-patches wrote:

> diff --git a/src-release.sh b/src-release.sh
> index 079b545ae7c..caae5148034 100755
> --- a/src-release.sh
> +++ b/src-release.sh
> @@ -185,7 +185,7 @@ do_tar()
>      echo "==> Making $package-$ver.tar"
>      rm -f $package-$ver.tar
>      find $package-$ver -follow \( $CVS_NAMES \) -prune \
> -	-o -type f -print \
> +	-o -type f -print | sort \

Better set LC_ALL=C to be independent from the locale sorting.
  
Nick Clifton Sept. 28, 2022, 1:34 p.m. UTC | #2
Hi Andreas,

> Better set LC_ALL=C to be independent from the locale sorting.
Like this ?

diff --git a/src-release.sh b/src-release.sh
index 079b545ae7c..8c2a8d897fd 100755
--- a/src-release.sh
+++ b/src-release.sh
@@ -185,7 +185,7 @@ do_tar()
      echo "==> Making $package-$ver.tar"
      rm -f $package-$ver.tar
      find $package-$ver -follow \( $CVS_NAMES \) -prune \
-       -o -type f -print \
+       -o -type f -print | LC_ALL=C sort \
         | tar cTfh - $package-$ver.tar
  }


Cheers
   Nick

PS.  Would sort's --dictionary-order option have a similar effect ?
  
Nick Clifton Sept. 29, 2022, 12:24 p.m. UTC | #3
Hi Guys,

Thinking about this a little more last night, it occurred to me that if
we want reproducible tarballs then we should not be storing user names,
group names or modification times either. So what do you think about
this extended version of the patch:

diff --git a/src-release.sh b/src-release.sh
index 079b545ae7c..908492c28f7 100755
--- a/src-release.sh
+++ b/src-release.sh
@@ -185,8 +185,8 @@ do_tar()
      echo "==> Making $package-$ver.tar"
      rm -f $package-$ver.tar
      find $package-$ver -follow \( $CVS_NAMES \) -prune \
-       -o -type f -print \
-       | tar cTfh - $package-$ver.tar
+       -o -type f -print | LC_ALL=C sort \
+       | tar cTfh - $package-$ver.tar --mtime=0 --group=0 --owner=0
  }

  # Compress the output with bzip2


Cheers
   Nick
  
Jan Beulich Sept. 29, 2022, 12:36 p.m. UTC | #4
On 29.09.2022 14:24, Nick Clifton via Binutils wrote:
> Thinking about this a little more last night, it occurred to me that if
> we want reproducible tarballs then we should not be storing user names,
> group names or modification times either. So what do you think about
> this extended version of the patch:
> 
> diff --git a/src-release.sh b/src-release.sh
> index 079b545ae7c..908492c28f7 100755
> --- a/src-release.sh
> +++ b/src-release.sh
> @@ -185,8 +185,8 @@ do_tar()
>       echo "==> Making $package-$ver.tar"
>       rm -f $package-$ver.tar
>       find $package-$ver -follow \( $CVS_NAMES \) -prune \
> -       -o -type f -print \
> -       | tar cTfh - $package-$ver.tar
> +       -o -type f -print | LC_ALL=C sort \
> +       | tar cTfh - $package-$ver.tar --mtime=0 --group=0 --owner=0
>   }

I wanted to indicate that an mtime of zero isn't the neatest, but the
two tar versions I've tried this with said anyway "Treating date '0' as
2022-09-29 00:00:00". A non-zero date with a time of zero is fine with
me, but won't make much of a difference in terms of reproducibility.

Jan
  
Nick Clifton Sept. 30, 2022, 11:38 a.m. UTC | #5
Hi Guys,

   Right, here is the latest and greatest - and hopefully last - version
   of the patch.  I added a parseable string to the --mtime option and a
   comment explaining why these options are being used.

   Any more comments/suggestions ?

Cheers
   Nick

diff --git a/src-release.sh b/src-release.sh
index 079b545ae7c..8a2ac125030 100755
--- a/src-release.sh
+++ b/src-release.sh
@@ -184,9 +184,11 @@ do_tar()
      ver=$2
      echo "==> Making $package-$ver.tar"
      rm -f $package-$ver.tar
+    # The sort command and --mtime, --group and --owner options are
+    # used in order to create consistent, reproducible tarballs.
      find $package-$ver -follow \( $CVS_NAMES \) -prune \
-       -o -type f -print \
-       | tar cTfh - $package-$ver.tar
+       -o -type f -print | LC_ALL=C sort \
+       | tar cTfh - $package-$ver.tar --mtime="1970-01-01 00:00:00" --group=0 --owner=0
  }

  # Compress the output with bzip2
  
Sam James Oct. 2, 2022, 7:54 a.m. UTC | #6
> On 30 Sep 2022, at 12:38, Nick Clifton via Binutils <binutils@sourceware.org> wrote:
> 
> Hi Guys,
> 
>  Right, here is the latest and greatest - and hopefully last - version
>  of the patch.  I added a parseable string to the --mtime option and a
>  comment explaining why these options are being used.
> 
>  Any more comments/suggestions ?
> 
> Cheers
>  Nick
> 
> diff --git a/src-release.sh b/src-release.sh
> index 079b545ae7c..8a2ac125030 100755
> --- a/src-release.sh
> +++ b/src-release.sh
> @@ -184,9 +184,11 @@ do_tar()
>     ver=$2
>     echo "==> Making $package-$ver.tar"
>     rm -f $package-$ver.tar
> +    # The sort command and --mtime, --group and --owner options are
> +    # used in order to create consistent, reproducible tarballs.
>     find $package-$ver -follow \( $CVS_NAMES \) -prune \
> -       -o -type f -print \
> -       | tar cTfh - $package-$ver.tar
> +       -o -type f -print | LC_ALL=C sort \
> +       | tar cTfh - $package-$ver.tar --mtime="1970-01-01 00:00:00" --group=0 --owner=0
> }
> 
> # Compress the output with bzip2
> 

I think this might hit a problem I faced when trying to do this with Go tarballs: https://www.gnu.org/software/tar/manual/tar.html#warnings.

With that date, I got "implausibly old time stamp" warnings from tar. I haven't tested this patchthough (writing from mobile, apologies).

Maybe default to the creation date of Binutils and allow overriding via https://reproducible-builds.org/docs/source-date-epoch/?

best,
sam
  
Jan Beulich Oct. 3, 2022, 6:55 a.m. UTC | #7
On 02.10.2022 09:54, Sam James wrote:
> 
> 
>> On 30 Sep 2022, at 12:38, Nick Clifton via Binutils <binutils@sourceware.org> wrote:
>>
>> Hi Guys,
>>
>>  Right, here is the latest and greatest - and hopefully last - version
>>  of the patch.  I added a parseable string to the --mtime option and a
>>  comment explaining why these options are being used.
>>
>>  Any more comments/suggestions ?
>>
>> Cheers
>>  Nick
>>
>> diff --git a/src-release.sh b/src-release.sh
>> index 079b545ae7c..8a2ac125030 100755
>> --- a/src-release.sh
>> +++ b/src-release.sh
>> @@ -184,9 +184,11 @@ do_tar()
>>     ver=$2
>>     echo "==> Making $package-$ver.tar"
>>     rm -f $package-$ver.tar
>> +    # The sort command and --mtime, --group and --owner options are
>> +    # used in order to create consistent, reproducible tarballs.
>>     find $package-$ver -follow \( $CVS_NAMES \) -prune \
>> -       -o -type f -print \
>> -       | tar cTfh - $package-$ver.tar
>> +       -o -type f -print | LC_ALL=C sort \
>> +       | tar cTfh - $package-$ver.tar --mtime="1970-01-01 00:00:00" --group=0 --owner=0
>> }
>>
>> # Compress the output with bzip2
>>
> 
> I think this might hit a problem I faced when trying to do this with Go tarballs: https://www.gnu.org/software/tar/manual/tar.html#warnings.
> 
> With that date, I got "implausibly old time stamp" warnings from tar. I haven't tested this patchthough (writing from mobile, apologies).
> 
> Maybe default to the creation date of Binutils and allow overriding via https://reproducible-builds.org/docs/source-date-epoch/?

Not sure what "creation date" might mean here. Assuming the script is
(typically) run from a git tree, perhaps the commit date of the top
level commit on the branch would be best to use?

Jan
  
Sam James Oct. 3, 2022, 6:59 a.m. UTC | #8
> On 3 Oct 2022, at 07:56, Jan Beulich <jbeulich@suse.com> wrote:
> 
> On 02.10.2022 09:54, Sam James wrote:
>> 
>> 
>>>> On 30 Sep 2022, at 12:38, Nick Clifton via Binutils <binutils@sourceware.org> wrote:
>>> 
>>> Hi Guys,
>>> 
>>> Right, here is the latest and greatest - and hopefully last - version
>>> of the patch.  I added a parseable string to the --mtime option and a
>>> comment explaining why these options are being used.
>>> 
>>> Any more comments/suggestions ?
>>> 
>>> Cheers
>>> Nick
>>> 
>>> diff --git a/src-release.sh b/src-release.sh
>>> index 079b545ae7c..8a2ac125030 100755
>>> --- a/src-release.sh
>>> +++ b/src-release.sh
>>> @@ -184,9 +184,11 @@ do_tar()
>>>    ver=$2
>>>    echo "==> Making $package-$ver.tar"
>>>    rm -f $package-$ver.tar
>>> +    # The sort command and --mtime, --group and --owner options are
>>> +    # used in order to create consistent, reproducible tarballs.
>>>    find $package-$ver -follow \( $CVS_NAMES \) -prune \
>>> -       -o -type f -print \
>>> -       | tar cTfh - $package-$ver.tar
>>> +       -o -type f -print | LC_ALL=C sort \
>>> +       | tar cTfh - $package-$ver.tar --mtime="1970-01-01 00:00:00" --group=0 --owner=0
>>> }
>>> 
>>> # Compress the output with bzip2
>>> 
>> 
>> I think this might hit a problem I faced when trying to do this with Go tarballs: https://www.gnu.org/software/tar/manual/tar.html#warnings.
>> 
>> With that date, I got "implausibly old time stamp" warnings from tar. I haven't tested this patchthough (writing from mobile, apologies).
>> 
>> Maybe default to the creation date of Binutils and allow overriding via https://reproducible-builds.org/docs/source-date-epoch/?
> 
> Not sure what "creation date" might mean here.

I meant whatever date folks consider to have been the creation of the Binutils project. First commit, announcement, first release, whatever.

Not that it really matters, it just can't be the unix epoch.
  
Andreas Schwab Oct. 3, 2022, 7:41 a.m. UTC | #9
On Okt 03 2022, Sam James via Gdb-patches wrote:

> I meant whatever date folks consider to have been the creation of the Binutils project. First commit, announcement, first release, whatever.

I think using the date of the commit from which the tarball is being
created makes the most sense (this is what git archive does).

Another thing to consider is the recorded permission of the files.
  
Andreas Schwab Oct. 3, 2022, 7:47 a.m. UTC | #10
On Okt 02 2022, Sam James via Gdb-patches wrote:

> With that date, I got "implausibly old time stamp" warnings from tar.

That happens because --mtime uses the local timezone.  To get the true
Epoch you can use --mtime=@0.
  
Nick Clifton Oct. 3, 2022, 2:40 p.m. UTC | #11
Hi Guys,

   [This appears to be getting slightly out of hand...]

> Not sure what "creation date" might mean here. Assuming the script is > (typically) run from a git tree, perhaps the commit date of the top> level commit on the branch would be best to use?
Except that a commit to the branch that does not affect something that
would go into the tarball would then result in a changed date.

We could use the src-release.sh file itself, like this:

diff --git a/src-release.sh b/src-release.sh
index 079b545ae7c..de1f98a70bb 100755
--- a/src-release.sh
+++ b/src-release.sh
@@ -184,9 +184,15 @@ do_tar()
      ver=$2
      echo "==> Making $package-$ver.tar"
      rm -f $package-$ver.tar
+    # The sort command and --mtime, --group and --owner options are
+    # used in order to create consistent, reproducible tarballs.
+    # BUILD_DATE is set to SOURCE_DATE_EPOCH if defined, or the
+    # modification date of this file otherwise.  cf:
+    # https://reproducible-builds.org/docs/source-date-epoch/
+    BUILD_DATE="$(date --utc --date="@${SOURCE_DATE_EPOCH:-$(date -r src-release.sh +%s)}" +%Y-%m-%d)"
      find $package-$ver -follow \( $CVS_NAMES \) -prune \
-       -o -type f -print \
-       | tar cTfh - $package-$ver.tar
+       -o -type f -print | LC_ALL=C sort \
+       | tar cTfh - $package-$ver.tar --mtime=$BUILD_DATE --group=0 --owner=0
  }

  # Compress the output with bzip2


Would that work ?

Cheers
   Nick
  
Andreas Schwab Oct. 3, 2022, 7:56 p.m. UTC | #12
On Okt 03 2022, Nick Clifton wrote:

> We could use the src-release.sh file itself, like this:

The timestamp of checked out files has no meaning, and is generally not
reproducible.

> diff --git a/src-release.sh b/src-release.sh
> index 079b545ae7c..de1f98a70bb 100755
> --- a/src-release.sh
> +++ b/src-release.sh
> @@ -184,9 +184,15 @@ do_tar()
>      ver=$2
>      echo "==> Making $package-$ver.tar"
>      rm -f $package-$ver.tar
> +    # The sort command and --mtime, --group and --owner options are
> +    # used in order to create consistent, reproducible tarballs.
> +    # BUILD_DATE is set to SOURCE_DATE_EPOCH if defined, or the
> +    # modification date of this file otherwise.  cf:
> +    # https://reproducible-builds.org/docs/source-date-epoch/
> +    BUILD_DATE="$(date --utc --date="@${SOURCE_DATE_EPOCH:-$(date -r src-release.sh +%s)}" +%Y-%m-%d)"
>      find $package-$ver -follow \( $CVS_NAMES \) -prune \
> -       -o -type f -print \
> -       | tar cTfh - $package-$ver.tar
> +       -o -type f -print | LC_ALL=C sort \
> +       | tar cTfh - $package-$ver.tar --mtime=$BUILD_DATE --group=0 --owner=0

That won't work, as --mtime=$BUILD_DATE is interpreted in the local time zone.
  
Jan Beulich Oct. 4, 2022, 7:10 a.m. UTC | #13
On 03.10.2022 16:40, Nick Clifton wrote:
> Hi Guys,
> 
>    [This appears to be getting slightly out of hand...]

It might seem like that, yes, but I guess that's not entirely unexpected
for a topic of that kind - different people have different expectations
and habits.

>> Not sure what "creation date" might mean here. Assuming the script is > (typically) run from a git tree, perhaps the commit date of the top> level commit on the branch would be best to use?
> Except that a commit to the branch that does not affect something that
> would go into the tarball would then result in a changed date.

Every commit should be considered to affect the tarball, imo, as such
effects could also be indirect. If you really wanted to go that route,
then perhaps an alternative would be to use the commit date of the
most recent commit touching bfd/version.m4.

Jan
  
Nick Clifton Oct. 5, 2022, 12:23 p.m. UTC | #14
Hi Guys,

On 10/4/22 08:10, Jan Beulich wrote:
> Every commit should be considered to affect the tarball, imo, as such
> effects could also be indirect. If you really wanted to go that route,
> then perhaps an alternative would be to use the commit date of the
> most recent commit touching bfd/version.m4.

Hmm, except that would probably only be appropriate for binutils tarballs,
not others.

So how about the attached patch ?  This one adds a new command line option to
src-release.sh.  If it is not used then the behaviour is not changed in any
way.  If the new option is used, it provides a date that is passed to tar's
--mtime option, along with triggering the use of sort and the other tar
options necessary to make a reproducible tarball.  So:

   src-release.sh -x -r `git log -1 --format=%cd --date=format:%F bfd/version.m4` binutils

should create a pretty consistent tarball.

Cheers
   Nick
  
Jan Beulich Oct. 5, 2022, 1 p.m. UTC | #15
On 05.10.2022 14:23, Nick Clifton wrote:
> Hi Guys,
> 
> On 10/4/22 08:10, Jan Beulich wrote:
>> Every commit should be considered to affect the tarball, imo, as such
>> effects could also be indirect. If you really wanted to go that route,
>> then perhaps an alternative would be to use the commit date of the
>> most recent commit touching bfd/version.m4.
> 
> Hmm, except that would probably only be appropriate for binutils tarballs,
> not others.
> 
> So how about the attached patch ?  This one adds a new command line option to
> src-release.sh.  If it is not used then the behaviour is not changed in any
> way.  If the new option is used, it provides a date that is passed to tar's
> --mtime option, along with triggering the use of sort and the other tar
> options necessary to make a reproducible tarball.  So:
> 
>    src-release.sh -x -r `git log -1 --format=%cd --date=format:%F bfd/version.m4` binutils
> 
> should create a pretty consistent tarball.

Lgtm, fwiw. Just one nit: You may want to add the missing 'b' for
"tarball" in the new help text line.

Jan
  

Patch

diff --git a/src-release.sh b/src-release.sh
index 079b545ae7c..caae5148034 100755
--- a/src-release.sh
+++ b/src-release.sh
@@ -185,7 +185,7 @@  do_tar()
      echo "==> Making $package-$ver.tar"
      rm -f $package-$ver.tar
      find $package-$ver -follow \( $CVS_NAMES \) -prune \
-	-o -type f -print \
+	-o -type f -print | sort \
  	| tar cTfh - $package-$ver.tar
  }