This is a very small bugfix release that fixes the build with libunac support enabled, i.e. ./configure --with-unac. Otherwise it’s the same as 2.1.0. As usual, the tarball is available on the download page.
After a year of waiting, pdfgrep 2.1.0 has finally been released. The tarball can be downloaded from the download page. As always, thanks to everyone who helped with this release.
This release is packed with new features that bring pdfgrep closer to parity with GNU grep:
These two related options open up new possibilities in scripting. Since they return only file names and not page number or matched text, their output can be used as input for other programs or even pdfgrep itself. As such, they are especially useful in combination with
For example, to search for PDFs in the current directory that don’t contain “foo” but contain “bar”, run:
pdfgrep -Z --files-without-match "foo" *.pdf | xargs -0 pdfgrep -H bar
pdfgrep -RilZ rilz | fzf --read0 --print0 | xargs -0 evince
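The first pipeline above can be wrapped in a small shell function for reuse in scripts. This is only a minimal sketch: the function name pdfs_without_but_with is hypothetical, and it relies on nothing beyond the -Z, --files-without-match and -H options already shown, plus an xargs that supports -0.

```shell
# Hypothetical helper: show matches for pattern $2, but only in PDFs
# that do NOT contain pattern $1. Remaining arguments are the PDFs.
pdfs_without_but_with() {
    excluded=$1
    wanted=$2
    shift 2
    # -Z ends each file name with a NUL byte, so names containing
    # spaces or newlines pass through xargs -0 unharmed.
    pdfgrep -Z --files-without-match "$excluded" "$@" |
        xargs -0 pdfgrep -H "$wanted"
}
```

It would be called as pdfs_without_but_with foo bar *.pdf.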
This allows limiting the search to certain pages. For example, to search for PDFs that contain “foo” on their title page, run:
pdfgrep --page-range 1 foo *.pdf
Since its first release, pdfgrep has only supported searching for a single pattern. And while it’s possible to combine multiple search strings into a single regular expression using the | operator, this is fiddly to do in scripts. Now there are better options (pun intended).
The --regexp argument can be specified multiple times, and --file allows providing a list of patterns in a file. Both can be mixed, and all patterns are implicitly combined as alternatives, as if joined with |.
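As a rough illustration of mixing both options, the sketch below searches for two fixed patterns plus whatever is listed in a pattern file. The function name search_notes and the patterns TODO and FIXME are made up for this example, and the pattern file is assumed to hold one pattern per line, as with grep -f.

```shell
# Hypothetical helper: search PDFs for TODO, FIXME, or any pattern
# listed in a file. Usage: search_notes PATTERN_FILE FILE...
search_notes() {
    patterns_file=$1
    shift
    # --regexp may be repeated; --file adds the patterns read from a
    # file. -n prefixes each match with its page number.
    pdfgrep -n --regexp "TODO" --regexp "FIXME" --file "$patterns_file" "$@"
}
```

This avoids having to assemble one large "a|b|c" expression by hand inside a script.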
pdfgrep 2.0.1 has been released! It contains only one bugfix for the new
--cache option from 2.0: When used together with recursive search,
--cache failed to index files in subdirectories. Thanks to Barna Ágoston for the report.
As usual, the tarball is available on the download page.
pdfgrep 2.0 has been released after more than a year of development. The tarball is available on the download page. As always, thanks to everybody who has helped with this release!
This release not only contains a few cool new features, it also breaks command-line API compatibility in one specific case. Read on for an overview of the most important changes or see the NEWS file for a complete list.
--context/-C now behaves like grep
One annoying difference between pdfgrep and grep has always been the behavior of
--context n: For historic reasons, pdfgrep printed
n characters of context around each match while grep prints
n lines of context.
This is now fixed and pdfgrep behaves exactly like grep for
--context. Please be sure to update any scripts you have that rely on the old behavior.
Together with the above change, pdfgrep’s context handling is now the same as grep’s and much more useful.
For example, to print two lines above and three lines below each match, you can now write:
pdfgrep -B 2 -A 3 pattern some.pdf
In the same spirit as the last two items, this improves compatibility with grep. Previously, pdfgrep would print the surrounding line for each individual match, even if two matches were on the same line. So a line with two matches would be printed twice. This is now fixed.
Before doing any actual searching, pdfgrep has to extract the text from each PDF using the poppler library, which can take a considerable amount of time for large PDFs.
To speed things up, pdfgrep can now optionally cache the PDF’s text and use it on the next run. This is quite an improvement for people who repeatedly search the same PDFs.
Caching is enabled with
--cache. This has to be used for the initial run generating the cache and for subsequent runs benefiting from it. To enable caching permanently it is recommended to add an alias to your shell, like so:
alias pdfgrep="pdfgrep --cache"
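One caveat with the alias: bash does not expand aliases in non-interactive shells by default, so scripts would still run without the cache. A shell function is one possible alternative; the sketch below assumes only the --cache option described above.

```shell
# Wrapper function: always pass --cache to the real pdfgrep binary.
# "command" bypasses this function, which prevents endless recursion.
pdfgrep() {
    command pdfgrep --cache "$@"
}
```

Defined in a script or a sourced file, this makes every later pdfgrep call in that shell use the cache.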
Thanks to Christian Dietrich for implementing this feature.
pdfgrep 1.4.1 is now released and can be obtained in the usual place.
This is a bugfix release, with the notable addition of a test suite that can be run from the toplevel source directory with:
This test suite has already found some nasty bugs, which are now all fixed. See the
NEWS file for detailed information.
As usual, thanks to everyone who contributed!
pdfgrep 1.4.0 is now available and contains many improvements and new features. Thanks to everyone who helped with this release!
Here is an overview of the changes:
pdfgrep finally supports searching for fixed strings as well as Perl compatible regular expressions (PCRE). This allows for much more complex searches:
pdfgrep -P "(a|b)c\1" foo.pdf
But also simpler ones, such as searching for the literal string .* without any regular expression interpretation:
pdfgrep -F ".*" foo.pdf
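Fixed-string search is particularly handy when the search text comes from user input and may contain regex metacharacters. A minimal sketch, with the hypothetical function name search_literal:

```shell
# Hypothetical helper: search for text exactly as typed.
# Usage: search_literal TEXT FILE...
search_literal() {
    needle=$1
    shift
    # -F treats the pattern as a fixed string, so characters such as
    # . * ( ) [ ] need no escaping.
    pdfgrep -F "$needle" "$@"
}
```

For example, search_literal "f(x) = a*b" notes.pdf looks for that exact text instead of interpreting the parentheses and the asterisk.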
--only-matching switches from grep have found their way into pdfgrep. Especially the first option allows for more robust scripting.
pdfgrep now optionally prints a warning (with --warn-empty) if a PDF file contains no searchable text. This prevents surprises when searching e.g. scanned documents, which usually consist only of images although they appear to contain text.
You can now change the prefix separator with
--match-prefix-separator to something else:
$ pdfgrep -n --match-prefix-separator "|" foo foo.pdf
foo.pdf|4|foobar
This is especially useful if your filenames frequently contain colons, as is the case on Windows.
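A custom separator also makes the output easier to split in scripts. The sketch below prints one "file, tab, page" pair per match; the function name match_pages is hypothetical, and it assumes that no file name contains the | character.

```shell
# Hypothetical helper: print "file<TAB>page" for every match.
# Usage: match_pages PATTERN FILE...
match_pages() {
    pattern=$1
    shift
    # -H forces the file name prefix, -n adds the page number; both are
    # separated by "|" instead of the default ":".
    pdfgrep -Hn --match-prefix-separator "|" "$pattern" "$@" |
        awk -F'|' '{ printf "%s\t%s\n", $1, $2 }'
}
```

Because the split happens on | rather than on a colon, colons in file names no longer confuse the parsing.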
Also, it is now possible to search multiple PDFs encrypted with different passwords by passing more than one
--password argument to pdfgrep. Each password will be tried on each PDF.
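A minimal sketch of such a call is shown below. The function name search_encrypted and the environment variables PDF_PASSWORD_A and PDF_PASSWORD_B are hypothetical; they merely keep the passwords out of the script text.

```shell
# Hypothetical helper: search PDFs that are encrypted with one of two
# passwords. Usage: search_encrypted PATTERN FILE...
search_encrypted() {
    pattern=$1
    shift
    # Each --password value is tried on each PDF.
    pdfgrep --password "$PDF_PASSWORD_A" --password "$PDF_PASSWORD_B" \
        "$pattern" "$@"
}
```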
SourceForge’s aggressive advertising has always been frustrating, but free alternatives that provide a mailing list were scarce. However, recent events have made it intolerable.
Because of this, pdfgrep is switching to new infrastructure, effective immediately:
Please do not use pdfgrep’s SourceForge page any more, in particular don’t download the tarballs from there.
A big thanks to Christoph for kindly providing the hosting!