pdfgrep 2.0 has been released after more than a year of development.
The tarball is available on the download page. As
always, thanks to everybody who has helped with this release!
This release not only contains a few cool new features, it also
breaks command-line API compatibility in one specific case. Read on
for an overview of the most important changes, or
see the NEWS file for a complete list.
API change: --context/-C now behaves like grep
One annoying difference between pdfgrep and grep has always been the
behavior of --context n: for historical reasons, pdfgrep printed n
characters of context around each match, while grep prints n lines
of context.
This is now fixed and pdfgrep behaves exactly like grep for
--context. Please be sure to update any scripts you have that rely
on the old behavior.
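To get a feel for line-based context, here is a small sketch that uses plain grep on a text file. Since pdfgrep now follows grep for --context, searching a PDF with the same text should give similar output. The sample file and its content are made up for illustration.

```shell
# Build a small sample text file (hypothetical content, for illustration only).
sample=$(mktemp)
printf '%s\n' 'line one' 'line two' 'needle here' 'line four' 'line five' > "$sample"

# -C 1 prints one line of context before and after each matching line.
context_output=$(grep -C 1 needle "$sample")
printf '%s\n' "$context_output"

# Clean up the temporary file.
rm -f "$sample"
```

The match is printed together with the line directly above and below it, rather than with a fixed number of surrounding characters.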
New options: -A/--after-context and -B/--before-context
Together with the above change, pdfgrep’s context handling is now the
same as grep’s and much more useful.
For example, to print two lines above and three lines below each
match, you can now write:
pdfgrep -B 2 -A 3 pattern some.pdf
Lines with multiple matches are printed only once
In the same spirit as the last two items, this improves compatibility
with grep. Previously, pdfgrep would print the surrounding line for
each individual match, even if two matches were on the same line. So a
line with two matches would be printed twice. This is now fixed.
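The grep behavior that pdfgrep now matches can be seen with a quick sketch on plain text. The input lines are invented for this example, and grep stands in for pdfgrep here.

```shell
# Feed grep two lines: the first contains the pattern twice, the second not at all.
matches=$(printf '%s\n' 'foo and foo again' 'no match here' | grep foo)

# The line with two matches is printed exactly once.
printf '%s\n' "$matches"
```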
Optional caching of PDF text for faster operation
Before doing any actual searching, pdfgrep has to extract the text
from each PDF using the poppler library, which can take a
considerable amount of time for large PDFs.
To speed things up, pdfgrep can now optionally cache the PDF’s text
and use it on the next run. This is quite an improvement for people
who repeatedly search the same PDFs.
Caching is enabled with --cache. The option has to be given both on
the initial run, which generates the cache, and on subsequent runs
that should benefit from it.
To enable caching permanently, it is recommended to add an alias to
your shell, like so:
alias pdfgrep="pdfgrep --cache"
Thanks to Christian Dietrich for implementing this feature.
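Shell aliases are usually not expanded in non-interactive scripts, so a script that should benefit from the cache needs another way to pass --cache. One possible sketch is a small wrapper function; this is only a suggestion and not something the release itself ships.

```shell
# Wrapper function: always pass --cache, then forward all other arguments.
# `command` skips the function itself and runs the real pdfgrep executable.
pdfgrep() {
    command pdfgrep --cache "$@"
}
```

With this definition in place, a call such as pdfgrep pattern some.pdf runs the real pdfgrep with --cache prepended to the arguments.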