pdfgrep 2.0 has been released after more than a year of development. The tarball is available on the download page. As always, thanks to everybody who has helped with this release!
This release not only contains a few cool new features, it also breaks command line API compatibility in one specific case. Read on for an overview of the most important changes or see the NEWS file for a complete list.
API change: --context/-C now behaves like grep
One annoying difference between pdfgrep and grep has always been the behavior of --context n: for historical reasons, pdfgrep printed n characters of context around each match, while grep prints n lines of context.
This is now fixed and pdfgrep behaves exactly like grep for --context. Please be sure to update any scripts you have that rely on the old behavior.
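To illustrate the line-based semantics that pdfgrep now shares with grep, here is a small sketch that uses plain grep on generated text, so it runs without any PDF at hand. The sample words and the variable name context_out are made up for this illustration.

```shell
# Search five lines of sample text for "three" with one line of context
# on each side. grep interprets the number as a line count.
context_out=$(printf 'one\ntwo\nthree\nfour\nfive\n' | grep -C 1 three)
printf '%s\n' "$context_out"

# pdfgrep 2.0 interprets the option the same way, for example:
#   pdfgrep -C 1 pattern some.pdf
```

The grep call prints the matching line together with the line before and the line after it. Scripts that passed a large number to --context to get a few dozen characters of context will now get that many lines instead.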
New options: -A/--after-context and -B/--before-context
Together with the above change, pdfgrep’s context handling is now the same as grep’s and much more useful.
For example, to print two lines above and three lines below each match, you can now write:
pdfgrep -B 2 -A 3 pattern some.pdf
Lines with multiple matches are printed only once
In the same spirit as the last two items, this improves compatibility with grep. Previously, pdfgrep would print the surrounding line for each individual match, even if two matches were on the same line. So a line with two matches would be printed twice. This is now fixed.
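The grep behavior that pdfgrep now matches can be seen with a short sketch. It uses plain grep on a made-up line containing two matches; the variable names are chosen only for this example.

```shell
# A single line that contains the pattern twice.
line='foo bar foo'

# grep -o prints every match on its own line, so this counts matches.
match_count=$(printf '%s\n' "$line" | grep -o foo | wc -l | tr -d ' ')

# grep -c counts matching lines, i.e. how many lines would be printed.
line_count=$(printf '%s\n' "$line" | grep -c foo)

echo "matches: $match_count, printed lines: $line_count"
```

Two matches, but only one printed line: this is what pdfgrep 2.0 now does for a line of PDF text with several matches.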
Optional caching of PDF text for faster operation
Before doing any actual searching, pdfgrep has to extract the text from each PDF using the poppler library, which can take a considerable amount of time for large PDFs.
To speed things up, pdfgrep can now optionally cache the PDF’s text and use it on the next run. This is quite an improvement for people who repeatedly search the same PDFs.
Caching is enabled with --cache. The option has to be given both on the initial run, which generates the cache, and on subsequent runs, which benefit from it.
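A rough way to see the effect is to time two identical runs. The sketch below assumes bash (for the time keyword) and uses placeholder values for the pattern and the PDF file; it skips the timing if pdfgrep or the file is missing. How much faster the second run is depends on the PDF, so no timings are shown here.

```shell
# Placeholder inputs: replace with your own pattern and PDF.
pattern='pattern'
pdf='some.pdf'

if command -v pdfgrep >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # First run: extracts the text and generates the cache.
    time pdfgrep --cache "$pattern" "$pdf"
    # Second run: can use the cached text instead of extracting again.
    time pdfgrep --cache "$pattern" "$pdf"
    cache_demo=ran
else
    echo "pdfgrep or $pdf not found; skipping the timing demo"
    cache_demo=skipped
fi
```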
To enable caching permanently, it is recommended to add an alias to your shell, like so:
alias pdfgrep="pdfgrep --cache"
Thanks to Christian Dietrich for implementing this feature.