For some reason the iFilter for DjVu files from LizardTech really slows indexing down. So I switched the iFilter preparator off again and now use the ExternalPreparator together with djvused instead:
<!--
 | This preparator may be used if you have an external program that can
 | extract text. It's disabled by default.
 +-->
<preparator enabled="true">
  <class>.ExternalPreparator</class>
  <config>
    <section name="command">
      <param name="urlPattern">\.(djvu|djv)$</param>
      <param name="commandLine">djvused "${filename}" -e 'print-pure-txt'</param>
      <param name="checkExitCode">false</param>
    </section>
  </config>
</preparator>
This uses the djvused command-line tool, which is part of DjVuLibre. Place it next to regain.jar in regain's root directory.
The improvement in speed is dramatic: in one of my tests djvused performed 12 times faster than iFilter.
However, if you still need iFilter for CHM indexing, you can leave both ExternalPreparator and iFilterPreparator enabled. Just move the ExternalPreparator block ahead of the iFilterPreparator block in conf/CrawlerConfiguration.xml, so that ExternalPreparator takes precedence for DjVu files.
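To illustrate the ordering, the relevant part of conf/CrawlerConfiguration.xml might look roughly like the sketch below. I'm assuming here that the preparators sit inside a <preparatorList> element and that the iFilter class is written as .IfilterPreparator. Check both against your own file and keep whatever spelling and settings it already has; the only point is which block comes first.

```xml
<!-- Sketch only: the wrapper element and the iFilter class name are assumptions. -->
<preparatorList>

  <!-- First: DjVu files are handed to djvused -->
  <preparator enabled="true">
    <class>.ExternalPreparator</class>
    <config>
      <section name="command">
        <param name="urlPattern">\.(djvu|djv)$</param>
        <param name="commandLine">djvused "${filename}" -e 'print-pure-txt'</param>
        <param name="checkExitCode">false</param>
      </section>
    </config>
  </preparator>

  <!-- Second: iFilter stays enabled for the formats it is still needed for, e.g. CHM. -->
  <!-- Leave any existing config of this block as it is in your file. -->
  <preparator enabled="true">
    <class>.IfilterPreparator</class>
  </preparator>

</preparatorList>
```

With this order, a .djvu or .djv URL matches the ExternalPreparator's urlPattern before the iFilter block is considered.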