\CatdocDocExtractor
Extracts text from a Microsoft Word document in DOC format, using the catdoc command-line utility. Catdoc can be downloaded for Linux at:
{@link http://wagner.pp.ru/~vitus/software/catdoc/}
The path to the catdoc binary will be detected automatically if it lives at
/usr/bin/catdoc or /usr/local/bin/catdoc. If your catdoc binary is in a
non-standard place, you can set it in your _ss_environment.php file like so:
<code>
define('CATDOC_BINARY_LOCATION', '/home/username/bin/catdoc');
</code>
Or, if using _config.php, you can also set it directly on the class:
<code>
CatdocDocExtractor::$binary_location = '/home/username/bin/catdoc';
</code>
- Author: Darren Inwood <darren.inwood@chrometoaster.com>
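The detection described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual get_binary_path() source: the function name detect_catdoc_binary() and its parameters are hypothetical, and the order in which a configured path, the constant, and the standard locations are checked is an assumption.

```php
<?php
// Hypothetical helper illustrating the lookup; not part of the module.
// Assumed order: explicit path, then the CATDOC_BINARY_LOCATION constant,
// then the two standard locations named in the documentation.
function detect_catdoc_binary(
    $configured = null,
    array $candidates = array('/usr/bin/catdoc', '/usr/local/bin/catdoc')
) {
    // A path set directly (e.g. via the static property) is tried first.
    if ($configured !== null && is_file($configured) && is_executable($configured)) {
        return $configured;
    }
    // Next, a constant defined in _ss_environment.php, if any.
    if (defined('CATDOC_BINARY_LOCATION')
        && is_file(CATDOC_BINARY_LOCATION)
        && is_executable(CATDOC_BINARY_LOCATION)
    ) {
        return CATDOC_BINARY_LOCATION;
    }
    // Finally, fall back to the standard install locations.
    foreach ($candidates as $path) {
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }
    // Assumption: false signals "not found", matching the String|Boolean
    // return type listed in the synopsis.
    return false;
}
```

Returning false rather than throwing lets a caller skip this extractor when catdoc is not installed, though whether the real class behaves that way is not confirmed here.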
Synopsis
class CatdocDocExtractor
extends ZendSearchLuceneTextExtractor
{
- // members
- public static array $extensions;
- public static $binary_location;
- // Inherited members from ZendSearchLuceneTextExtractor
- public static array $extensions;
- public static integer $priority;
- // methods
- public static String extract()
- protected static String|Boolean get_binary_path()
- // Inherited methods from ZendSearchLuceneTextExtractor
- public abstract static String extract()
}
Hierarchy
Extends
- ZendSearchLuceneTextExtractor
Members
public
- $binary_location
Holds the location of the catdoc binary. Should be a full filesystem path.
- $extensions
The extensions that can be handled by this text extractor.
Inherited from ZendSearchLuceneTextExtractor
public
- $extensions
An array of strings representing file extensions that can be handled by this TextExtractor. Do not include a dot in your extensions. Extensions should be in lower case; all case variations are detected on scanned files.
- $priority
Controls the order in which text extractor classes are tried for a specific file extension. Default is 100. To make your custom extractor run before an inbuilt one, set this to less than 100; to make it run afterwards, set it to more than 100.
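To illustrate how $extensions and $priority might interact, here is a small sketch that picks the extractors for a file and orders them by priority. The function extractors_for_extension(), the plain-array registry, and the MyCustomDocExtractor and SomePdfExtractor entries are all hypothetical. The 'doc' extension listed for CatdocDocExtractor is also an assumption, since its default value is not shown here.

```php
<?php
// Hypothetical registry lookup; the module's real dispatch code is not shown
// in this documentation.
function extractors_for_extension($filename, array $extractors)
{
    // Extensions are stored in lower case, so lower-case the file's extension
    // to match all case variations (.doc, .DOC, .Doc).
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $matches = array();
    foreach ($extractors as $name => $info) {
        if (in_array($ext, $info['extensions'], true)) {
            $matches[$name] = $info['priority'];
        }
    }
    // Lower priority numbers are tried first.
    asort($matches);
    return array_keys($matches);
}

// Assumed example data: 100 is the documented default priority.
$extractors = array(
    'CatdocDocExtractor'   => array('extensions' => array('doc'), 'priority' => 100),
    'MyCustomDocExtractor' => array('extensions' => array('doc'), 'priority' => 50),
    'SomePdfExtractor'     => array('extensions' => array('pdf'), 'priority' => 100),
);

$order = extractors_for_extension('Report.DOC', $extractors);
// $order is array('MyCustomDocExtractor', 'CatdocDocExtractor')
```

With a priority of 50, the custom extractor is tried before the one left at the default of 100.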
Methods
protected
- get_binary_path() — Try to detect where the catdoc binary has been installed.
public
- extract() — Returns a string containing the text in the given Microsoft Word DOC document.
Inherited from ZendSearchLuceneTextExtractor
public
- extract() — Returns text for a given full filesystem path. If a file cannot be processed, you should return an empty string.
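The contract above (return the text for a full filesystem path, or an empty string if the file cannot be processed) could be met along these lines. This is a sketch under stated assumptions, not the class's actual extract() source: extract_doc_text() is a hypothetical standalone function, and it assumes that catdoc writes the document text to standard output when given a file path and exits with a non-zero status on failure.

```php
<?php
// Hypothetical standalone function; the real extract() is a static method
// whose implementation is not shown in this documentation.
function extract_doc_text($path, $binary)
{
    // Nothing to do if the document or the binary is unusable.
    if (!is_file($path) || !is_readable($path)) {
        return '';
    }
    if (!is_file($binary) || !is_executable($binary)) {
        return '';
    }
    // Escape both arguments so unusual file names cannot alter the command.
    $command = escapeshellarg($binary) . ' ' . escapeshellarg($path) . ' 2>/dev/null';
    $output = array();
    $status = 1;
    exec($command, $output, $status);
    // Assumption: a non-zero exit status means the file could not be processed.
    if ($status !== 0) {
        return '';
    }
    return implode("\n", $output);
}
```

A caller might pass in the path returned by get_binary_path() and treat an empty string as a signal to try the next extractor in priority order.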