\PdfDocumentExtractor
Extracts text from a PDF document. Tries to use the pdftotext command-line utility if it is installed, otherwise falls back to the PDF2Text class.
The pdftotext utility gives superior text extraction, and should be used
wherever possible. It can be found for Windows and Mac OS X at:
{@link http://www.foolabs.com/xpdf/}
If you are using Linux, the utility is part of both the xpdf and
poppler-utils packages. On Ubuntu and Debian:
<code>
apt-get install poppler-utils
</code>
The path to the pdftotext binary will be detected automatically if it lives
at /usr/bin/pdftotext or /usr/local/bin/pdftotext. If your pdftotext binary is
in a non-standard place, you can set it in your _ss_environment.php file like
so:
<code>
define('PDFTOTEXT_BINARY_LOCATION', '/home/username/bin/pdftotext');
</code>
Or, if using _config.php, you can also set it directly on the class:
<code>
PdfDocumentExtractor::$binary_location = '/home/username/bin/pdftotext';
</code>
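The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual get_binary_path() code: the function name find_pdftotext_binary() is hypothetical, and the precedence (explicit setting first, then the constant, then the standard paths) is an assumption.

```php
<?php
// Hypothetical helper, not part of the module. It mirrors the documented
// String|Boolean return type: a path on success, false if nothing is found.
function find_pdftotext_binary($configured = null)
{
    // 1. Assumed precedence: an explicitly configured location wins
    //    (for example the value of PdfDocumentExtractor::$binary_location).
    if ($configured !== null && $configured !== '') {
        return $configured;
    }
    // 2. Next, the constant from _ss_environment.php, if it is defined.
    if (defined('PDFTOTEXT_BINARY_LOCATION')) {
        return PDFTOTEXT_BINARY_LOCATION;
    }
    // 3. Finally, the two standard install locations named in the text.
    foreach (array('/usr/bin/pdftotext', '/usr/local/bin/pdftotext') as $candidate) {
        if (file_exists($candidate)) {
            return $candidate;
        }
    }
    // Nothing found: a caller would fall back to the PDF2Text class.
    return false;
}
```

A false return value is the signal that the command-line utility is unavailable and the pure-PHP fallback should be used instead.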
- Author: Darren Inwood <darren.inwood@chrometoaster.com>
Synopsis
- // members
- public static array $extensions;
- public static $binary_location;
- // Inherited members from ZendSearchLuceneTextExtractor
- public static array $extensions;
- public static integer $priority;
- // methods
- public static String extract()
- protected static void commandline()
- protected static void pdf2text()
- protected static String|Boolean get_binary_path()
- // Inherited methods from ZendSearchLuceneTextExtractor
- public abstract static String extract()
Hierarchy
Extends
- ZendSearchLuceneTextExtractor
Members
public
- $binary_location
  Holds the location of the pdftotext binary. Should be a full filesystem path.
- $extensions
  The extensions that can be handled by this text extractor.
- $extensions (inherited from ZendSearchLuceneTextExtractor)
  An array of strings representing file extensions that can be handled by this TextExtractor. Do not include a dot in your extensions. Extensions should be in lower case; they are matched against scanned files regardless of case.
- $priority (inherited from ZendSearchLuceneTextExtractor)
  Controls the order in which text extractor classes are tried for a specific file extension. Default is 100. To make your custom extractor run before an inbuilt one, set this to less than 100; to make it run afterwards, set it to more than 100.
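To illustrate how $extensions and $priority could work together, here is a minimal sketch of selecting extractors for a file. It is not the module's real lookup code: the registry array, the function extractors_for_file(), and the class names CustomPdfExtractor and TxtDocumentExtractor are hypothetical and exist only for this example.

```php
<?php
// Hypothetical registry: class name => its extensions and priority.
// Only PdfDocumentExtractor is a real class name from this page.
$extractors = array(
    'PdfDocumentExtractor' => array('extensions' => array('pdf'), 'priority' => 100),
    'CustomPdfExtractor'   => array('extensions' => array('pdf'), 'priority' => 50),
    'TxtDocumentExtractor' => array('extensions' => array('txt'), 'priority' => 100),
);

function extractors_for_file($filename, array $extractors)
{
    // Lower-case the file's extension so "REPORT.PDF" matches 'pdf'.
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    $matches = array();
    foreach ($extractors as $class => $info) {
        if (in_array($ext, $info['extensions'], true)) {
            $matches[$class] = $info['priority'];
        }
    }

    // Lower priority values are tried first, so sort ascending by value.
    asort($matches);
    return array_keys($matches);
}
```

With this registry, a file named REPORT.PDF would be offered to CustomPdfExtractor (priority 50) before PdfDocumentExtractor (priority 100), which matches the ordering rule described for $priority.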
Methods
protected
- commandline()
- get_binary_path() — Try to detect where the pdftotext binary has been installed.
- pdf2text()
public
- extract() — Returns a string containing the text in the given PDF document.
Inherited from ZendSearchLuceneTextExtractor
public
- extract() — Returns text for a given full filesystem path. If a file cannot be processed, you should return an empty string.
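
The extract() contract (text for a full filesystem path, or an empty string when the file cannot be processed) could be honoured along these lines. This is a sketch under stated assumptions rather than the class's real implementation: extract_pdf_text() is a hypothetical standalone function, and the pure-PHP fallback is only marked by a comment because the PDF2Text API is not documented here.

```php
<?php
// Hypothetical standalone function, not the module's extract() method.
function extract_pdf_text($path, $binary)
{
    // A missing or unreadable file cannot be processed: return ''.
    if (!is_file($path) || !is_readable($path)) {
        return '';
    }

    if ($binary && is_executable($binary)) {
        // Passing "-" as the output file makes pdftotext write to stdout.
        // Both arguments are escaped before they reach the shell.
        $cmd = escapeshellarg($binary) . ' ' . escapeshellarg($path) . ' -';
        $output = shell_exec($cmd);
        if (is_string($output)) {
            return $output;
        }
    }

    // Fallback point: the PDF2Text class would be tried here. It is left
    // out because its API is not described in this documentation.
    return '';
}
```

Returning an empty string instead of throwing keeps a single unreadable PDF from interrupting the processing of other files.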