News

pdfgrep 2.0 released

by Hans-Peter Deifel on January 25, 2017

pdfgrep 2.0 has been released after more than a year of development. The tarball is available on the download page. As always, thanks to everybody who has helped with this release!

This release not only contains a few cool new features but also breaks command line API compatibility in one specific case. Read on for an overview of the most important changes, or see the NEWS file for a complete list.

API change: --context/-C now behaves like grep

One annoying difference between pdfgrep and grep has always been the behavior of --context n: for historical reasons, pdfgrep printed n characters of context around each match, while grep prints n lines of context.

This is now fixed and pdfgrep behaves exactly like grep for --context. Please be sure to update any scripts you have that rely on the old behavior.
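As a rough illustration of the line-based semantics, the sketch below uses plain grep on a throwaway text file as a stand-in, on the assumption that pdfgrep 2.0 produces the equivalent lines when run on a PDF with the same text. The file contents and variable names are made up for the example.

```shell
# Stand-in demo using grep on a text file; pdfgrep 2.0 is described above
# as following the same line-based rule for --context/-C.
sample=$(mktemp)
printf 'alpha\nbeta\nneedle here\ngamma\ndelta\n' > "$sample"

# -C 1 means one *line* of context on each side of the match,
# not one character as in older pdfgrep versions.
context_out=$(grep -C 1 needle "$sample")
printf '%s\n' "$context_out"

rm -f "$sample"
```

Scripts that passed a large value such as -C 50 to get 50 characters of context would now request 50 lines, so they are the ones most worth checking.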

New options: -A/--after-context and -B/--before-context

Together with the above change, these options make pdfgrep’s context handling the same as grep’s and much more useful.

For example, to print two lines above and three lines below each match, you can now write:

pdfgrep -B 2 -A 3 pattern some.pdf

Lines with multiple matches are printed only once

In the same spirit as the last two items, this improves compatibility with grep. Previously, pdfgrep would print the surrounding line for each individual match, even if two matches were on the same line. So a line with two matches would be printed twice. This is now fixed.
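The grep behavior being matched here can be seen with a small sketch. It again uses plain grep on made-up text as a stand-in and assumes pdfgrep 2.0 behaves the same way for the extracted text of a PDF.

```shell
# A line containing the pattern twice, followed by a non-matching line.
text=$(printf 'foo bar foo\nbaz\n')

# The matching line is printed once, even though it holds two matches.
printed=$(printf '%s\n' "$text" | grep foo)
printf '%s\n' "$printed"

# Counting individual matches (-o prints one match per line) gives 2,
# while counting matching lines (-c) gives 1.
match_count=$(printf '%s\n' "$text" | grep -o foo | grep -c foo)
line_count=$(printf '%s\n' "$text" | grep -c foo)
```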

Optional caching of PDF text for faster operation

Before doing any actual searching, pdfgrep has to extract the text from each PDF using the poppler library, which can take a considerable amount of time for large PDFs.

To speed things up, pdfgrep can now optionally cache each PDF’s text and reuse it on later runs. This is quite an improvement for people who repeatedly search the same PDFs.

Caching is enabled with --cache. The option has to be passed both on the initial run, which generates the cache, and on subsequent runs, which benefit from it. To enable caching permanently, it is recommended to add an alias to your shell, like so:

alias pdfgrep="pdfgrep --cache"
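As an alternative to the alias, some people prefer a small shell function, which can also be defined or sourced inside scripts. This is only a sketch of that variant, not something the release notes prescribe; it relies on nothing beyond the --cache option described above.

```shell
# Wrapper function: always pass --cache, then forward all other arguments.
# `command` bypasses the function itself and runs the real pdfgrep binary.
pdfgrep() {
    command pdfgrep --cache "$@"
}
```

With this in place, a call such as pdfgrep pattern some.pdf runs the real program with --cache prepended to its arguments.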

Thanks to Christian Dietrich for implementing this feature.