Crawler Problem (Invalid dictionary...)

Suggestions, questions or problems with regain

Moderator: thtesche


Post by malibu » Wed Jan 27, 2010 7:19 pm

Hello,

Thank you in advance for any insight or assistance. I am new to Regain. I recently installed Regain and indexed a local file directory without any issues. Things worked great. This morning I pointed the crawler at another local directory that is significantly larger (6000+ directories and 35000+ files) and contains images, PDFs, Microsoft Office documents, etc.

The crawler indexed content without trouble for the first hour or so, but now all it displays is the following text, repeated over and over on its standard out. It seems to be stuck in an infinite loop.

11:08:38: Invalid dictionary, found:? but expected:''

The process has created an index directory and respective index files. I also looked at the crawler log and I don't see anything out of the ordinary in the logs. Not knowing if this is normal, I killed the crawler process 2 hours into the indexing. The last log entry is as follows:

2010-01-27 10:12:05 [main] ERROR: Preparing file://F%3A/A-PROPOSALS/MASTER+100%25Rev7+CD.xls with preparator net.sf.regain.crawler.preparator.PoiMsOfficePreparator failed
net.sf.regain.RegainException: Preparing file://F%3A/A-PROPOSALS/MASTER+100%25Rev7+CD.xls with preparator net.sf.regain.crawler.preparator.PoiMsOfficePreparator failed
    at net.sf.regain.crawler.document.DocumentFactory.createDocument(DocumentFactory.java:350)
    at net.sf.regain.crawler.document.DocumentFactory.createDocument(DocumentFactory.java:273)
    at net.sf.regain.crawler.IndexWriterManager.createNewIndexEntry(IndexWriterManager.java:737)
    at net.sf.regain.crawler.IndexWriterManager.addToIndex(IndexWriterManager.java:720)
    at net.sf.regain.crawler.Crawler.run(Crawler.java:559)
    at net.sf.regain.crawler.Main.main(Main.java:137)
Caused by: net.sf.regain.RegainException: Reading MS* (OpenXML) document failed : file://F%3A/A-PROPOSALS/MASTER+100%25Rev7+CD.xls
    at net.sf.regain.crawler.preparator.PoiMsOfficePreparator.prepare(PoiMsOfficePreparator.java:91)
    at net.sf.regain.crawler.document.DocumentFactory.createDocument(DocumentFactory.java:335)
    ... 5 more
Caused by: java.lang.IllegalStateException: bad text '&A'.
    at org.apache.poi.hssf.usermodel.HeaderFooter.splitParts(HeaderFooter.java:77)
    at org.apache.poi.hssf.usermodel.HeaderFooter.getLeft(HeaderFooter.java:87)
    at org.apache.poi.hssf.extractor.ExcelExtractor._extractHeaderFooter(ExcelExtractor.java:395)
    at org.apache.poi.hssf.extractor.ExcelExtractor.getText(ExcelExtractor.java:385)
    at net.sf.regain.crawler.preparator.PoiMsOfficePreparator.prepare(PoiMsOfficePreparator.java:84)
    ... 6 more
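
The file URL in the log entry above looks URL-encoded, which makes it harder to match against the file on disk. Below is a minimal sketch for turning it back into a readable path. It assumes the crawler writes URLs in form-style encoding, where "+" stands for a space and "%XX" for an escaped character; that is a guess based on the log line, not something confirmed from the Regain source. The class LogUrlDecoder is a hypothetical helper and not part of Regain.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper, not part of Regain. */
class LogUrlDecoder {

    /**
     * Decodes %XX escapes and '+' (treated as a space) in a URL copied from the log.
     * Assumption: the log uses form-style URL encoding.
     */
    static String decode(String loggedUrl) {
        try {
            return URLDecoder.decode(loggedUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("file://F%3A/A-PROPOSALS/MASTER+100%25Rev7+CD.xls"));
        // prints: file://F:/A-PROPOSALS/MASTER 100%Rev7 CD.xls
    }
}
```

Under that assumption, the spreadsheet named in the trace would be "MASTER 100%Rev7 CD.xls" in F:/A-PROPOSALS.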
malibu
 
Posts: 5
Joined: Wed Jan 27, 2010 7:02 pm

Re: Crawler Problem (Invalid dictionary...)

Post by malibu » Thu Jan 28, 2010 1:51 am

Well, for those who may be interested...

I have narrowed this error down to a single PDF document. When I remove it, the indexing moves along as expected. When I try to index it, even if it is the only document, something breaks, the crawler goes into the never-ending loop, and I have to explicitly kill the process.

I can open and print this file just fine with Adobe Acrobat Reader. I am not sure what is embedded in this document that is causing the problem.

Also, I am running the Regain version 1.6.8 server with Tomcat on a Windows XP machine with the latest JDK.

So I have found the file, but not the real cause or any potential fix for this problem. There may be other similarly problematic PDF files. This sounds like it may be an issue with the PDF preparator. I don't know if I can replace this preparator with a newer version or a different PDF preparator.

Any assistance or guidance is still appreciated.
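
One general way to keep a single bad document from hanging a whole run is to put a time limit on each extraction. The following is only a sketch of that idea in plain Java. It does not use any Regain or PDF library API: TimedExtraction and runWithTimeout are hypothetical names, and the Callable stands in for whatever call actually extracts the text.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Regain: bounds how long one extraction may run. */
class TimedExtraction {

    /**
     * Runs the task on a daemon worker thread and returns its result,
     * or null if it throws or does not finish within timeoutMillis.
     */
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // sends an interrupt; a tight CPU loop may ignore it
            return null;
        } catch (ExecutionException e) {
            return null; // the extraction itself threw; treat the document as unreadable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a well-behaved document.
        System.out.println(runWithTimeout(() -> "some text", 1000));
        // Stand-in for a document that never finishes parsing.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never reached in time";
        }, 200));
    }
}
```

A thread-based timeout only lets the caller move on. If the parser spins in a loop that never checks for interrupts, the worker thread keeps burning CPU in the background, so running the extraction in a separate process that can be killed would be the more robust variant.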
