pdfgrep 2.1.0 "Should Have Been Christmas" Released

by Hans-Peter Deifel on April, 29 2018

After a year of waiting, pdfgrep 2.1.0 has finally been released. The tarball can be downloaded from the download page. As always: Thanks to everyone who helped with this release.

pdfgrep contributors

This release is packed with new features that bring pdfgrep closer to parity with GNU grep:

New options: --files-with-matches/-l and --files-without-match/-L

These two related options open up new possibilities in scripting. Since they return only file names and not page numbers or matched text, their output can be used as input for other programs or even pdfgrep itself. As such, they are especially useful in combination with -Z.

For example, to search for PDFs in the current directory that don’t contain “foo” but contain “bar”, run:

pdfgrep -Z --files-without-match "foo" *.pdf | xargs -0 pdfgrep -H bar

To search for PDFs containing “rilz”, interactively select one with fzf and open it in the PDF viewer evince, do:

pdfgrep -RilZ rilz | fzf --read0 --print0 | xargs -0 evince

New option: --page-range

This option limits the search to certain pages. For example, to search for a PDF that contains “foo” on its title page, run:

pdfgrep --page-range 1 foo *.pdf
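The range does not have to be a single page. The following is a minimal sketch that assumes the range syntax described in the manpage (intervals such as 1-3 and comma-separated lists); the helper name search_front_matter and the page numbers are made up for illustration:

```shell
# Hypothetical helper: search only the first pages and page 10.
# Usage: search_front_matter PATTERN FILE...
search_front_matter() {
    # Pages 1 to 3 plus page 10; the range value is only an example
    # and assumes the interval/list syntax from the manpage.
    pdfgrep --page-range 1-3,10 "$@"
}
```

Calling search_front_matter foo *.pdf then behaves like the command above, just with a wider range.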

New options: --regexp/-e and --file/-f

Since its first release, pdfgrep has only allowed searching for a single pattern. And while it’s possible to combine multiple search strings into a single regular expression using the | operator, this is fiddly to do in scripts. Now there are better options (pun intended).

The new --regexp argument can be specified multiple times, and --file reads a list of patterns directly from a file. Both can be mixed, and all patterns are combined implicitly with OR.
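To illustrate why this is easier to script than joining strings with |, here is a rough sketch of a helper that turns each of its arguments into a separate -e option. The function name search_any is hypothetical and not part of pdfgrep:

```shell
# Hypothetical helper: search one PDF for any of several patterns.
# Usage: search_any FILE PATTERN...
search_any() {
    file=$1
    shift
    # Replace each remaining argument with an "-e PATTERN" pair.
    # The loop list is fixed when the loop starts, so it is safe to
    # rebuild the positional parameters inside the loop body.
    for pattern in "$@"; do
        set -- "$@" -e "$pattern"
        shift
    done
    pdfgrep "$@" "$file"
}
```

With this in place, search_any report.pdf foo bar runs pdfgrep -e foo -e bar report.pdf. A patterns file for --file can be passed the same way, e.g. pdfgrep -f patterns.txt report.pdf, where patterns.txt is an assumed file name with one pattern per line.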

Restructured Documentation

With more and more command line options, the manpage got a little unwieldy, so we split it up into multiple sections based on grep’s own manpage.

pdfgrep 2.0.1 released

by Hans-Peter Deifel on March, 6 2017

pdfgrep 2.0.1 has been released! It contains only one bugfix for the new --cache option from 2.0: When used together with recursive search, --cache failed to index files in subdirectories. Thanks to Barna Ágoston for the report.

As usual, the tarball is available on the download page.

pdfgrep 2.0 released

by Hans-Peter Deifel on January, 25 2017

pdfgrep 2.0 has been released after more than a year of development. The tarball is available on the download page. As always, thanks to everybody who has helped with this release!

This release not only contains a few cool new features, it also breaks command line API compatibility in one specific case. Read on for an overview of the most important changes or see the NEWS file for a complete list.

API change: --context/-C now behaves like grep

One annoying difference between pdfgrep and grep has always been the behavior of --context n: For historical reasons, pdfgrep printed n characters of context around each match while grep prints n lines of context.

This is now fixed and pdfgrep behaves exactly like grep for --context. Please be sure to update any scripts you have that rely on the old behavior.

New options: -A/--after-context and -B/--before-context

Together with the above change, pdfgrep’s context handling is now the same as grep’s and much more useful.

For example, to print two lines above and three lines below each match, you can now write:

pdfgrep -B 2 -A 3 pattern some.pdf

Lines with multiple matches are printed only once

In the same spirit as the last two items, this improves compatibility with grep. Previously, pdfgrep would print the surrounding line for each individual match, even if two matches were on the same line. So a line with two matches would be printed twice. This is now fixed.

Optional caching of PDF text for faster operation

Before doing any actual searching, pdfgrep has to extract the text from each PDF using the poppler library, which can take a considerable amount of time for large PDFs.

To speed things up, pdfgrep can now optionally cache the PDF’s text and use it on the next run. This is quite an improvement for people who repeatedly search the same PDFs.

Caching is enabled with --cache. This has to be used for the initial run generating the cache and for subsequent runs benefiting from it. To enable caching permanently it is recommended to add an alias to your shell, like so:

alias pdfgrep="pdfgrep --cache"

Thanks to Christian Dietrich for implementing this feature.

pdfgrep 1.4.1 released

by Hans-Peter Deifel on September, 26 2015

pdfgrep 1.4.1 is now released and can be obtained in the usual place.

This is a bugfix release, with the notable addition of a test suite that can be run from the toplevel source directory with:

make check

This test suite has already found some nasty bugs, which are now all fixed. See the NEWS file for detailed information.

As usual, thanks to everyone who contributed!

pdfgrep 1.4.0 released

by Hans-Peter Deifel on August, 14 2015

pdfgrep 1.4.0 is now available and contains many improvements and new features. Thanks to everyone who helped with this release!

Here is an overview of the changes:

New regex implementations

pdfgrep finally supports searching for fixed strings as well as Perl compatible regular expressions (PCRE). This allows for much more complex searches:

pdfgrep -P "(a|b)c\1" foo.pdf

But also simpler ones, such as searching for the string .*:

pdfgrep -F ".*" foo.pdf

More grep compatibility

The --null and --only-matching switches from grep have found their way into pdfgrep. Especially the first option allows for more robust scripting.
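As a small illustration of --only-matching in a script, the sketch below lists every distinct match once. The helper name unique_matches is invented for this example, and it assumes that -h (--no-filename) is used to keep file names out of the output:

```shell
# Hypothetical helper: list each distinct match of PATTERN once.
# Usage: unique_matches PATTERN FILE...
unique_matches() {
    # -o prints only the matched text instead of the whole line;
    # -h suppresses the file name prefix so equal matches compare equal.
    pdfgrep -o -h "$@" | sort -u
}
```

For instance, unique_matches "[0-9]+" foo.pdf would print every number found in foo.pdf, without repeats.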

Usability improvements

pdfgrep now optionally prints a warning (with --warn-empty) if a PDF file contains no searchable text. This prevents surprises when searching e.g. scanned documents, which usually consist only of images although they appear to contain text.

You can now change the prefix separator with --match-prefix-separator to something else:

$ pdfgrep -n --match-prefix-separator "|" foo foo.pdf

This is especially useful if your filenames frequently contain colons, as is the case under Windows.

Also, it is now possible to search multiple PDFs encrypted with different passwords by passing more than one --password argument to pdfgrep. Each password will be tried on each PDF.
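A minimal sketch of what this might look like in a script follows. The helper name search_encrypted and both passwords are placeholders, not real values:

```shell
# Hypothetical helper; the passwords below are placeholders.
# Usage: search_encrypted PATTERN FILE...
search_encrypted() {
    # Every --password value is tried on every PDF.
    pdfgrep --password 'first-secret' --password 'second-secret' "$@"
}
```

Keep in mind that passwords given on the command line may be visible to other users in the process list.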

Good Bye SourceForge

by Hans-Peter Deifel on June, 19 2015

SourceForge’s aggressive advertising has always been frustrating, but free alternatives that provide a mailing list were scarce. However, recent events have made it intolerable.

Because of this, pdfgrep immediately switches to new infrastructure:

Please do not use pdfgrep’s SourceForge page anymore; in particular, don’t download the tarballs from there.

A big thanks to Christoph for kindly providing the hosting!