Indexing DjVu files with iFilter

Suggestions, questions or problems with regain

Moderator: thtesche

Post by talker » Sun Apr 13, 2008 9:06 pm

I've just tried indexing DjVu files with the help of LizardTech's iFilter plug-in (www.lizardtech.com/download/dl_download.php?detail=doc_ifilter&platform=win). It didn't work; I get the message:

Code:
No preparator feels responsible for file


I also checked to see if the Windows built-in search function searches inside DjVu now, and it does. Some of my files are in DjVu, so it would be great if I could use Regain to index them as well.
Last edited by talker on Mon Apr 14, 2008 12:45 am, edited 1 time in total.
talker
 
Posts: 11
Joined: Sun Apr 13, 2008 7:50 pm

SOLVED!

Post by talker » Sun Apr 13, 2008 10:52 pm

I managed to solve it. The problem was that the IfilterPreparator is disabled by default. To fix it, find the following section in conf/CrawlerConfiguration.xml:

Code:
  <preparator enabled="false">
      <class>.IfilterPreparator</class>
  </preparator>


and change "false" to "true".
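
For reference, the section should then look roughly like this (a sketch based only on the snippet above; if your own preparator block contains other elements, leave them as they are):

Code:
  <preparator enabled="true">
      <class>.IfilterPreparator</class>
  </preparator>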
talker
 
Posts: 11
Joined: Sun Apr 13, 2008 7:50 pm

iFilter from LizardTech is slow

Post by talker » Mon Apr 14, 2008 5:36 am

I don't know why, but LizardTech's iFilter for DjVu files really slows indexing down. So I switched the IfilterPreparator off again and now use the ExternalPreparator together with djvused instead:

Code:
  <!--
   | This preparator may be used if you have an external program that can
   | extract text. It's disabled by default.
   +-->
  <preparator enabled="true">
    <class>.ExternalPreparator</class>
    <config>
      <section name="command">
        <param name="urlPattern">\.(djvu|djv)$</param>
        <param name="commandLine">djvused "${filename}" -e 'print-pure-txt'</param>
        <param name="checkExitCode">false</param>
      </section>
    </config>
  </preparator>


This uses the djvused command-line tool, which is part of DjVuLibre and should be placed next to regain.jar in Regain's root directory.

The improvement in speed is dramatic: in one of my tests djvused performed 12 times faster than iFilter.

However, if you still want to index CHM files through iFilter, you can leave both ExternalPreparator and IfilterPreparator enabled. Just move the ExternalPreparator block ahead of the IfilterPreparator block in conf/CrawlerConfiguration.xml, so that ExternalPreparator takes precedence for DjVu files.
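
As a rough sketch of that ordering (both blocks are the ones quoted earlier in this thread; the comments are only annotations, and a real configuration file will contain other preparators around these two):

Code:
  <!-- Listed first, so it handles .djvu/.djv files via djvused -->
  <preparator enabled="true">
    <class>.ExternalPreparator</class>
    <config>
      <section name="command">
        <param name="urlPattern">\.(djvu|djv)$</param>
        <param name="commandLine">djvused "${filename}" -e 'print-pure-txt'</param>
        <param name="checkExitCode">false</param>
      </section>
    </config>
  </preparator>

  <!-- Listed later, so iFilter only gets the files not claimed above, e.g. CHM -->
  <preparator enabled="true">
      <class>.IfilterPreparator</class>
  </preparator>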
talker
 
Posts: 11
Joined: Sun Apr 13, 2008 7:50 pm

Post by talker » Tue Apr 15, 2008 4:05 am

Things are not that rosy with djvused, it seems. It works well with ASCII, but does something weird to Unicode text. Most probably the culprit is not djvused itself (I checked its output and it is correct Unicode) but something between djvused and Regain.

Anyway, since DjVu is a major document format, I started thinking about writing a DjVu preparator based on the Java DjVu project, javadjvu.foxtrottechnologies.com. That would keep Regain in pure Java. The problem is that I've never programmed in Java (I'm a C++ programmer). So far, I've installed NetBeans and imported the Regain project into it. To my surprise, it compiled without any tweaking. I'll keep looking into it.

Btw, any other suggestions for going about indexing DjVu?
talker
 
Posts: 11
Joined: Sun Apr 13, 2008 7:50 pm

good news

Post by getitdunk » Wed Jan 13, 2010 12:26 am

Yes, this is a good article. Thank you for sharing. I've learned something. I'm waiting for more.
getitdunk
 
Posts: 7
Joined: Fri Jan 08, 2010 3:49 am



