[3/4] gnu: Add Swish-e.

Message ID 20160823014912.1f80a48d@openmailbox.org
State New

Commit Message

Eric Bavier Aug. 23, 2016, 6:49 a.m. UTC
On Tue, 23 Aug 2016 02:27:26 -0400
Leo Famulari <leo@famulari.name> wrote:

> On Tue, Aug 23, 2016 at 01:15:11AM -0500, Eric Bavier wrote:
> > From: Eric Bavier <bavier@member.fsf.org>
> > 
> > * gnu/packages/search.scm (swish-e): New variable.
> > * gnu/packages/patches/swish-e-search.patch: New patch.
> > * gnu/local.mk (dist_patch_DATA): Add it.
> > ---
> >  gnu/local.mk                              |  2 +
> >  gnu/packages/patches/swish-e-search.patch | 43 ++++++++++++++++++++
> >  gnu/packages/search.scm                   | 67 ++++++++++++++++++++++++++++++-
> >  3 files changed, 111 insertions(+), 1 deletion(-)
> >  create mode 100644 gnu/packages/patches/swish-e-search.patch
> > 
> > diff --git a/gnu/local.mk b/gnu/local.mk
> > index 02a7cc4..59f22d4 100644
> > --- a/gnu/local.mk
> > +++ b/gnu/local.mk
> > @@ -779,6 +779,8 @@ dist_patch_DATA =						\
> >    %D%/packages/patches/soprano-find-clucene.patch		\
> >    %D%/packages/patches/steghide-fixes.patch			\
> >    %D%/packages/patches/superlu-dist-scotchmetis.patch		\
> > +  %D%/packages/patches/swish-e-search.patch			\
> > +  %D%/packages/patches/swish-e-format-security.patch		\  
> 
> This patch seems to be missing.

Indeed.  Updated patch attached.

`~Eric

Comments

Leo Famulari Aug. 23, 2016, 8:46 p.m. UTC | #1
On Tue, Aug 23, 2016 at 01:49:12AM -0500, Eric Bavier wrote:
> * gnu/packages/search.scm (swish-e): New variable.
> * gnu/packages/patches/swish-e-search.patch,
> gnu/packages/patches/swish-e-format-security.patch: New patches.
> * gnu/local.mk (dist_patch_DATA): Add them.

It would be ideal to present these patches to the upstream maintainers,
but their site is offline. Do we know if the project is still active?

> diff --git a/gnu/packages/patches/swish-e-format-security.patch b/gnu/packages/patches/swish-e-format-security.patch
> new file mode 100644
> index 0000000..be9d7cb
> --- /dev/null
> +++ b/gnu/packages/patches/swish-e-format-security.patch
> @@ -0,0 +1,33 @@
> +Borrowed from Debian.
> +
> +--- swish-e-2.4.7/src/parser.c	2009-04-05 03:58:32.000000000 +0200
> ++++ swish-e-2.4.7/src/parser.c	2013-06-11 13:53:08.196559035 +0200
> +@@ -1760,7 +1760,7 @@
> +     va_start(args, msg);
> +     vsnprintf(str, 1000, msg, args );
> +     va_end(args);
> +-    xmlParserError(parse_data->ctxt, str);
> ++    xmlParserError(parse_data->ctxt, "%s", str);
> + }
> + 
> + static void warning(void *data, const char *msg, ...)
> +@@ -1772,7 +1772,7 @@
> +     va_start(args, msg);
> +     vsnprintf(str, 1000, msg, args );
> +     va_end(args);
> +-    xmlParserWarning(parse_data->ctxt, str);
> ++    xmlParserWarning(parse_data->ctxt, "%s", str);
> + }

My understanding is that xmlParserWarning() is from libxml2, defined in
'xmlerror.h' like this:

XMLPUBFUN void XMLCDECL
    xmlParserWarning            (void *ctx,
                                 const char *msg,
                                 ...) LIBXML_ATTR_FORMAT(2,3);

I don't understand this definition very much, but in libxml2 file
'xmlversion.h', LIBXML_ATTR_FORMAT is commented with "Macro used to
indicate to GCC the parameter are printf like".

Somebody else should review this.
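
For reference, a minimal sketch of what that annotation appears to do,
assuming LIBXML_ATTR_FORMAT(2,3) expands to GCC's format(printf, 2, 3)
attribute as the 'xmlversion.h' comment suggests.  The names `struct
sink', `report' and `report_text' are hypothetical stand-ins, not
libxml2 or swish-e API.  With the attribute in place, GCC can flag
`report(ctx, str)' under -Wformat-security, because `str' would be
interpreted as a format string; passing "%s" keeps it as plain data:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sink standing in for the parser context. */
struct sink {
    char buf[64];
};

/* Hypothetical printf-like reporter.  The attribute tells GCC that
 * argument 2 is a printf format and the variadic arguments start at 3,
 * mirroring the (2,3) in the libxml2 declaration quoted above. */
static int report(struct sink *ctx, const char *fmt, ...)
    __attribute__((format(printf, 2, 3)));

static int report(struct sink *ctx, const char *fmt, ...)
{
    va_list args;
    int n;

    assert(ctx != NULL);
    va_start(args, fmt);
    n = vsnprintf(ctx->buf, sizeof ctx->buf, fmt, args);
    va_end(args);
    return n;
}

/* Safe pattern used by the patch: the text is an argument to "%s",
 * so any '%' characters inside it are copied, not interpreted. */
static int report_text(struct sink *ctx, const char *text)
{
    return report(ctx, "%s", text);
}
```

Calling `report_text(&s, "100%s done %n")' copies the text verbatim,
whereas handing the same string to `report' as the format would make
vsnprintf look for arguments that were never passed.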

> +--- swish-e-2.4.7/src/result_output.c	2009-04-05 03:58:32.000000000 +0200
> ++++ swish-e-2.4.7/src/result_output.c	2013-06-11 13:53:38.593550825 +0200
> +@@ -752,7 +752,7 @@
> +             s = (char *) emalloc(MAXWORDLEN + 1);
> +             n = strftime(s, (size_t) MAXWORDLEN, fmt, localtime(&(pv->value.v_date)));
> +             if (n && f)
> +-                fprintf(f, s);
> ++                fprintf(f, "%s", s);

LGTM

> diff --git a/gnu/packages/patches/swish-e-search.patch b/gnu/packages/patches/swish-e-search.patch
> new file mode 100644
> index 0000000..2a57a31
> --- /dev/null
> +++ b/gnu/packages/patches/swish-e-search.patch
> @@ -0,0 +1,43 @@
> +From http://swish-e.org/archive/2015-09/13295.html

The site is offline, but I found it on archive.org:
https://web.archive.org/web/20150907203848/http://www.swish-e.org/archive/2015-09/13295.html

Interestingly, I'm a few blocks from the patch author's office :)

As far as I can tell, nobody from swish-e ever replied.

> +
> +--- a/src/compress.c	
> ++++ a/src/compress.c	
> +@@ -995,7 +995,7 @@ void    remove_worddata_longs(unsigned char *worddata,int *sz_worddata)
> +             progerr("Internal error in remove_worddata_longs");
> + 
> +         /* dst may be smaller than src. So move the data */
> +-        memcpy(dst,src,data_len);
> ++        memmove(dst,src,data_len);

LGTM
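
The reason this matters: memcpy has undefined behavior when the source
and destination regions overlap, while memmove is specified to handle
overlap.  A minimal sketch of the same "slide the data down" situation
follows; `close_gap' is a hypothetical helper for illustration, not
code from swish-e:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove `gap' bytes at offset `at' from a buffer
 * holding `len' bytes by sliding the tail down over them.  Returns the
 * new length.  dst is below src and the two regions overlap whenever
 * the tail is longer than the gap, so memmove is required here. */
static size_t close_gap(unsigned char *buf, size_t len, size_t at, size_t gap)
{
    unsigned char *dst;
    const unsigned char *src;
    size_t tail;

    assert(at + gap <= len);
    dst = buf + at;
    src = buf + at + gap;
    tail = len - at - gap;

    memmove(dst, src, tail);    /* overlap-safe, unlike memcpy */
    return len - gap;
}
```

For example, closing a 1-byte gap at offset 2 in the 9 bytes
"abXcdefgh" moves 6 bytes over a region that starts one byte earlier,
which is exactly the overlapping case.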

> + 
> +         /* Increase pointers */
> +         src += data_len;
> +--- a/src/headers.c	
> ++++ a/src/headers.c	
> +@@ -280,7 +280,7 @@ static SWISH_HEADER_VALUE fetch_single_header( IndexFILE *indexf, HEADER_MAP *he
> + 
> +         case SWISH_NUMBER:
> +         case SWISH_BOOL:
> +-            value.number = *(unsigned long *) data_pointer;
> ++            value.number = *(unsigned int *) data_pointer;

Could there be any risk in reducing the size of the variable like this?

> + 
> +             /* $$$ Ugly hack alert! */
> +             /* correct for removed files */
> +--- a/src/swishspider	
> ++++ a/src/swishspider	
> +@@ -27,6 +27,7 @@ use LWP::UserAgent;
> + use HTTP::Status;
> + use HTML::Parser 3.00;
> + use HTML::LinkExtor;
> ++use Encode;
> + 
> +     if (scalar(@ARGV) != 2) {
> +         print STDERR "Usage: $0 localpath url\n";
> +@@ -94,7 +95,7 @@ use HTML::LinkExtor;
> +     # Don't allow links above the base
> +     $URI::ABS_REMOTE_LEADING_DOTS = 1;
> + 
> +-    $p->parse( $$content_ref );
> ++    $p->parse( decode_utf8 $$content_ref );

Can you explain why we need this?

> +(define-public swish-e

The package definition LGTM. I did not try to build it.

Eric Bavier Aug. 23, 2016, 10:34 p.m. UTC | #2
On Tue, 23 Aug 2016 16:46:51 -0400
Leo Famulari <leo@famulari.name> wrote:

> On Tue, Aug 23, 2016 at 01:49:12AM -0500, Eric Bavier wrote:
> > * gnu/packages/search.scm (swish-e): New variable.
> > * gnu/packages/patches/swish-e-search.patch,
> > gnu/packages/patches/swish-e-format-security.patch: New patches.
> > * gnu/local.mk (dist_patch_DATA): Add them.  
> 
> It would be ideal to present these patches to the upstream maintainers,
> but their site is offline. Do we know if the project is still active?

The last active maintainer stepped out a while ago, and it seems no one
else has stepped up:
https://web.archive.org/web/20150908004634/http://www.swish-e.org/archive/2014-04/13214.html

> 
> > diff --git a/gnu/packages/patches/swish-e-format-security.patch b/gnu/packages/patches/swish-e-format-security.patch
> > new file mode 100644
> > index 0000000..be9d7cb
> > --- /dev/null
> > +++ b/gnu/packages/patches/swish-e-format-security.patch
> > @@ -0,0 +1,33 @@
> > +Borrowed from Debian.
> > +
> > +--- swish-e-2.4.7/src/parser.c	2009-04-05 03:58:32.000000000 +0200
> > ++++ swish-e-2.4.7/src/parser.c	2013-06-11 13:53:08.196559035 +0200
> > +@@ -1760,7 +1760,7 @@
> > +     va_start(args, msg);
> > +     vsnprintf(str, 1000, msg, args );
> > +     va_end(args);
> > +-    xmlParserError(parse_data->ctxt, str);
> > ++    xmlParserError(parse_data->ctxt, "%s", str);
> > + }
> > + 
> > + static void warning(void *data, const char *msg, ...)
> > +@@ -1772,7 +1772,7 @@
> > +     va_start(args, msg);
> > +     vsnprintf(str, 1000, msg, args );
> > +     va_end(args);
> > +-    xmlParserWarning(parse_data->ctxt, str);
> > ++    xmlParserWarning(parse_data->ctxt, "%s", str);
> > + }  
> 
> My understanding is that xmlParserWarning() is from libxml2, defined in
> 'xmlerror.h' like this:
> 
> XMLPUBFUN void XMLCDECL
>     xmlParserWarning            (void *ctx,
>                                  const char *msg,
>                                  ...) LIBXML_ATTR_FORMAT(2,3);
> 
> I don't understand this definition very much, but in libxml2 file
> 'xmlversion.h', LIBXML_ATTR_FORMAT is commented with "Macro used to
> indicate to GCC the parameter are printf like".
> 
> Somebody else should review this.
> 
> > +--- swish-e-2.4.7/src/result_output.c	2009-04-05 03:58:32.000000000 +0200
> > ++++ swish-e-2.4.7/src/result_output.c	2013-06-11 13:53:38.593550825 +0200
> > +@@ -752,7 +752,7 @@
> > +             s = (char *) emalloc(MAXWORDLEN + 1);
> > +             n = strftime(s, (size_t) MAXWORDLEN, fmt, localtime(&(pv->value.v_date)));
> > +             if (n && f)
> > +-                fprintf(f, s);
> > ++                fprintf(f, "%s", s);  
> 
> LGTM
> 
> > diff --git a/gnu/packages/patches/swish-e-search.patch b/gnu/packages/patches/swish-e-search.patch
> > new file mode 100644
> > index 0000000..2a57a31
> > --- /dev/null
> > +++ b/gnu/packages/patches/swish-e-search.patch
> > @@ -0,0 +1,43 @@
> > +From http://swish-e.org/archive/2015-09/13295.html  
> 
> The site is offline, but I found it on archive.org:
> https://web.archive.org/web/20150907203848/http://www.swish-e.org/archive/2015-09/13295.html
> 
> Interestingly, I'm a few blocks from the patch author's office :)
> 
> As far as I can tell, nobody from swish-e ever replied.

AFAICT that's right.

> 
> > +
> > +--- a/src/compress.c	
> > ++++ a/src/compress.c	
> > +@@ -995,7 +995,7 @@ void    remove_worddata_longs(unsigned char *worddata,int *sz_worddata)
> > +             progerr("Internal error in remove_worddata_longs");
> > + 
> > +         /* dst may be smaller than src. So move the data */
> > +-        memcpy(dst,src,data_len);
> > ++        memmove(dst,src,data_len);  
> 
> LGTM
> 
> > + 
> > +         /* Increase pointers */
> > +         src += data_len;
> > +--- a/src/headers.c	
> > ++++ a/src/headers.c	
> > +@@ -280,7 +280,7 @@ static SWISH_HEADER_VALUE fetch_single_header( IndexFILE *indexf, HEADER_MAP *he
> > + 
> > +         case SWISH_NUMBER:
> > +         case SWISH_BOOL:
> > +-            value.number = *(unsigned long *) data_pointer;
> > ++            value.number = *(unsigned int *) data_pointer;  
> 
> Could there be any risk in reducing the size of the variable like this?

Assuming the value is indeed a boolean, probably not.
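
Note that the cast changes how many bytes are read through
data_pointer, not the width of the destination: on LP64 systems
`unsigned long' is 8 bytes and `unsigned int' is 4.  A rough sketch of
the idea, assuming the stored header value occupies exactly 4 bytes in
host byte order (the thread does not confirm this);
`read_u32_field' is a hypothetical name, not a swish-e function:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader: fetch a 4-byte unsigned field and widen it.
 * Reading through an `unsigned long *' instead would pull in the 4
 * neighbouring bytes as well on platforms where long is 8 bytes. */
static unsigned long read_u32_field(const unsigned char *data_pointer)
{
    uint32_t raw;

    assert(data_pointer != NULL);
    /* memcpy sidesteps alignment and aliasing concerns that a pointer
     * cast plus dereference could raise. */
    memcpy(&raw, data_pointer, sizeof raw);
    return (unsigned long) raw;
}
```

If the field really is 4 bytes wide, narrowing the read loses nothing
and avoids picking up unrelated bytes that follow it.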

> 
> > + 
> > +             /* $$$ Ugly hack alert! */
> > +             /* correct for removed files */
> > +--- a/src/swishspider	
> > ++++ a/src/swishspider	
> > +@@ -27,6 +27,7 @@ use LWP::UserAgent;
> > + use HTTP::Status;
> > + use HTML::Parser 3.00;
> > + use HTML::LinkExtor;
> > ++use Encode;
> > + 
> > +     if (scalar(@ARGV) != 2) {
> > +         print STDERR "Usage: $0 localpath url\n";
> > +@@ -94,7 +95,7 @@ use HTML::LinkExtor;
> > +     # Don't allow links above the base
> > +     $URI::ABS_REMOTE_LEADING_DOTS = 1;
> > + 
> > +-    $p->parse( $$content_ref );
> > ++    $p->parse( decode_utf8 $$content_ref );  
> 
> Can you explain why we need this?

Presumably to better handle UTF-8-encoded input.

Tomb developers have expressed interest in replacing their use of
swish-e with the "Recoll" search tool:
https://github.com/dyne/Tomb/issues/211

If maintenance of this package turns out to be burdensome, we might be
able to drop it.

Thanks for reviewing,

`~Eric

Patch

From 365fa64bb0ae6a249ad7ca3218bb008eab9a9577 Mon Sep 17 00:00:00 2001
From: Eric Bavier <bavier@member.fsf.org>
Date: Wed, 18 May 2016 01:02:02 -0500
Subject: [PATCH 3/4] gnu: Add Swish-e.

* gnu/packages/search.scm (swish-e): New variable.
* gnu/packages/patches/swish-e-search.patch,
gnu/packages/patches/swish-e-format-security.patch: New patches.
* gnu/local.mk (dist_patch_DATA): Add them.
---
 gnu/local.mk                                       |  2 +
 gnu/packages/patches/swish-e-format-security.patch | 33 +++++++++++
 gnu/packages/patches/swish-e-search.patch          | 43 ++++++++++++++
 gnu/packages/search.scm                            | 67 +++++++++++++++++++++-
 4 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 gnu/packages/patches/swish-e-format-security.patch
 create mode 100644 gnu/packages/patches/swish-e-search.patch

diff --git a/gnu/local.mk b/gnu/local.mk
index 02a7cc4..59f22d4 100644
--- a/gnu/local.mk
+++ b/gnu/local.mk
@@ -779,6 +779,8 @@  dist_patch_DATA =						\
   %D%/packages/patches/soprano-find-clucene.patch		\
   %D%/packages/patches/steghide-fixes.patch			\
   %D%/packages/patches/superlu-dist-scotchmetis.patch		\
+  %D%/packages/patches/swish-e-search.patch			\
+  %D%/packages/patches/swish-e-format-security.patch		\
   %D%/packages/patches/synfig-build-fix.patch			\
   %D%/packages/patches/t1lib-CVE-2010-2642.patch		\
   %D%/packages/patches/t1lib-CVE-2011-0764.patch		\
diff --git a/gnu/packages/patches/swish-e-format-security.patch b/gnu/packages/patches/swish-e-format-security.patch
new file mode 100644
index 0000000..be9d7cb
--- /dev/null
+++ b/gnu/packages/patches/swish-e-format-security.patch
@@ -0,0 +1,33 @@ 
+Borrowed from Debian.
+
+--- swish-e-2.4.7/src/parser.c	2009-04-05 03:58:32.000000000 +0200
++++ swish-e-2.4.7/src/parser.c	2013-06-11 13:53:08.196559035 +0200
+@@ -1760,7 +1760,7 @@
+     va_start(args, msg);
+     vsnprintf(str, 1000, msg, args );
+     va_end(args);
+-    xmlParserError(parse_data->ctxt, str);
++    xmlParserError(parse_data->ctxt, "%s", str);
+ }
+ 
+ static void warning(void *data, const char *msg, ...)
+@@ -1772,7 +1772,7 @@
+     va_start(args, msg);
+     vsnprintf(str, 1000, msg, args );
+     va_end(args);
+-    xmlParserWarning(parse_data->ctxt, str);
++    xmlParserWarning(parse_data->ctxt, "%s", str);
+ }
+ 
+ 
+--- swish-e-2.4.7/src/result_output.c	2009-04-05 03:58:32.000000000 +0200
++++ swish-e-2.4.7/src/result_output.c	2013-06-11 13:53:38.593550825 +0200
+@@ -752,7 +752,7 @@
+             s = (char *) emalloc(MAXWORDLEN + 1);
+             n = strftime(s, (size_t) MAXWORDLEN, fmt, localtime(&(pv->value.v_date)));
+             if (n && f)
+-                fprintf(f, s);
++                fprintf(f, "%s", s);
+             efree(s);
+         }
+         break;
diff --git a/gnu/packages/patches/swish-e-search.patch b/gnu/packages/patches/swish-e-search.patch
new file mode 100644
index 0000000..2a57a31
--- /dev/null
+++ b/gnu/packages/patches/swish-e-search.patch
@@ -0,0 +1,43 @@ 
+From http://swish-e.org/archive/2015-09/13295.html
+
+--- a/src/compress.c	
++++ a/src/compress.c	
+@@ -995,7 +995,7 @@ void    remove_worddata_longs(unsigned char *worddata,int *sz_worddata)
+             progerr("Internal error in remove_worddata_longs");
+ 
+         /* dst may be smaller than src. So move the data */
+-        memcpy(dst,src,data_len);
++        memmove(dst,src,data_len);
+ 
+         /* Increase pointers */
+         src += data_len;
+--- a/src/headers.c	
++++ a/src/headers.c	
+@@ -280,7 +280,7 @@ static SWISH_HEADER_VALUE fetch_single_header( IndexFILE *indexf, HEADER_MAP *he
+ 
+         case SWISH_NUMBER:
+         case SWISH_BOOL:
+-            value.number = *(unsigned long *) data_pointer;
++            value.number = *(unsigned int *) data_pointer;
+ 
+             /* $$$ Ugly hack alert! */
+             /* correct for removed files */
+--- a/src/swishspider	
++++ a/src/swishspider	
+@@ -27,6 +27,7 @@ use LWP::UserAgent;
+ use HTTP::Status;
+ use HTML::Parser 3.00;
+ use HTML::LinkExtor;
++use Encode;
+ 
+     if (scalar(@ARGV) != 2) {
+         print STDERR "Usage: $0 localpath url\n";
+@@ -94,7 +95,7 @@ use HTML::LinkExtor;
+     # Don't allow links above the base
+     $URI::ABS_REMOTE_LEADING_DOTS = 1;
+ 
+-    $p->parse( $$content_ref );
++    $p->parse( decode_utf8 $$content_ref );
+     close( LINKS );
+ 
+     exit;
diff --git a/gnu/packages/search.scm b/gnu/packages/search.scm
index 9a7bc76..60f902f 100644
--- a/gnu/packages/search.scm
+++ b/gnu/packages/search.scm
@@ -23,10 +23,14 @@ 
   #:use-module (guix packages)
   #:use-module (guix download)
   #:use-module (guix build-system gnu)
+  #:use-module (gnu packages)
   #:use-module (gnu packages compression)
   #:use-module (gnu packages check)
   #:use-module (gnu packages databases)
-  #:use-module (gnu packages linux))
+  #:use-module (gnu packages linux)
+  #:use-module (gnu packages perl)
+  #:use-module (gnu packages web)
+  #:use-module (gnu packages xml))
 
 (define-public xapian
   (package
@@ -171,4 +175,65 @@  with slocate, and attempts to be compatible to GNU locate when it does not
 conflict with slocate compatibility.")
     (license gpl2)))
 
+(define-public swish-e
+  (package
+    (name "swish-e")
+    (version "2.4.7")
+    (source (origin
+              (method url-fetch)
+              (uri (list (string-append "http://swish-e.org/distribution/"
+                                        "swish-e-" version ".tar.gz")
+                         ;; The upstream swish-e.org appears to be down... so
+                         ;; use debian's copy as a fallback.
+                         (string-append "http://http.debian.net/debian/pool/"
+                                        "main/s/swish-e/swish-e_" version
+                                        ".orig.tar.gz")))
+              (file-name (string-append name "-" version ".tar.gz"))
+              (sha256
+               (base32
+                "0qkrk7z25yp9hynj21vxkyn7yi8gcagcfxnass5cgczcz0gm9pax"))
+              (patches (search-patches "swish-e-search.patch"
+                                       "swish-e-format-security.patch"))))
+    (build-system gnu-build-system)
+    ;; Several other packages and perl modules may be installed alongside
+    ;; swish-e to extend its features at runtime, but are not required for
+    ;; building: xpdf, catdoc, MP3::Tag, Spreadsheet::ParseExcel,
+    ;; HTML::Entities.
+    (inputs
+     `(("libxml" ,libxml2)
+       ("zlib" ,zlib)
+       ("perl" ,perl)
+       ("perl-uri" ,perl-uri)
+       ("perl-html-parser" ,perl-html-parser)
+       ("perl-html-tagset" ,perl-html-tagset)
+       ("perl-mime-types" ,perl-mime-types)))
+    (arguments
+     `(#:phases (modify-phases %standard-phases
+                  (add-after 'install 'wrap-programs
+                    (lambda* (#:key inputs outputs #:allow-other-keys)
+                      (let* ((out (assoc-ref outputs "out")))
+                        (for-each
+                         (lambda (program)
+                           (wrap-program program
+                             `("PERL5LIB" ":" prefix
+                               ,(map (lambda (i)
+                                       (string-append (assoc-ref inputs i)
+                                                      "/lib/perl5/site_perl"))
+                                     ;; These perl modules have no propagated
+                                     ;; inputs, so no further analysis needed.
+                                     '("perl-uri"
+                                       "perl-html-parser"
+                                       "perl-html-tagset"
+                                       "perl-mime-types")))))
+                         (list (string-append out "/lib/swish-e/swishspider")
+                               (string-append out "/bin/swish-filter-test")))
+                        #t))))))
+    (home-page "http://swish-e.org")
+    (synopsis "Web indexing system")
+    (description
+     "Swish-e is Simple Web Indexing System for Humans - Enhanced.  Swish-e
+can quickly and easily index directories of files or remote web sites and
+search the generated indexes.")
+    (license gpl2+)))                   ;with exception
+
 ;;; search.scm ends here
-- 
2.9.2