[SOLVED] Strange error while crawling a PDF document

Suggestions, questions or problems with regain

Moderator: thtesche

Post by deajan » Fri Sep 04, 2015 12:03 pm

Hello,

I have just installed Regain Server 2.1.0 on a CentOS 7 box to index my own paperwork.
I scan a lot of documents, which are automatically OCRed by Abbyy OCR for Linux via a wrapper I wrote (for those interested in the wrapper, see github. com/deajan/pmOCR ).

Of the 4 test PDFs I have crawled, one triggers an error that doesn't make sense to me.
I cannot post the log file here because phpBB treats the "dot org" strings in it as URLs.
I have uploaded the log to a gist instead (remove the blank between dot and com): gist.github. com/deajan/5132423dc99139df45e2

Any ideas?

Regards,
Ozy.

Post by deajan » Fri Sep 04, 2015 4:11 pm

While googling the error

java.lang.IllegalArgumentException: Comparison method violates its general contract!

I found that it is related to pdfbox and has been fixed in version 1.8.8 and later (see issues. apache. org/jira/browse/PDFBOX-1512)
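
For anyone wondering what the message means: Java's sort (TimSort, the default since Java 7) throws this exception when it notices that a comparator gives inconsistent answers. Below is a rough Python sketch of that idea, not the pdfbox code. `find_contract_violation`, `fuzzy_cmp` and `exact_cmp` are hypothetical names made up for illustration, and the tolerance-based comparator is only an assumed example of how such an inconsistency can arise.

```python
from itertools import permutations


def sign(n):
    """Return -1, 0 or 1 depending on the sign of n."""
    return (n > 0) - (n < 0)


def find_contract_violation(items, cmp):
    """Brute-force check of a comparison function on sample items.

    Returns a tuple describing the first violation found, or None.
    """
    # cmp(a, b) and cmp(b, a) must have opposite signs.
    for a, b in permutations(items, 2):
        if sign(cmp(a, b)) != -sign(cmp(b, a)):
            return ("antisymmetry", a, b)
    for a, b, c in permutations(items, 3):
        # a > b and b > c must imply a > c.
        if cmp(a, b) > 0 and cmp(b, c) > 0 and cmp(a, c) <= 0:
            return ("transitivity", a, b, c)
        # If a and b compare equal, they must compare the same way to c.
        if cmp(a, b) == 0 and sign(cmp(a, c)) != sign(cmp(b, c)):
            return ("consistency", a, b, c)
    return None


def fuzzy_cmp(a, b, tolerance=1.0):
    """Hypothetical comparator: values within `tolerance` count as equal."""
    if abs(a - b) <= tolerance:
        return 0
    return -1 if a < b else 1


def exact_cmp(a, b):
    """Plain numeric comparison, which satisfies the contract."""
    return (a > b) - (a < b)
```

With the sample `[0.0, 0.8, 1.6]`, `fuzzy_cmp` says 0.0 equals 0.8 and 0.8 equals 1.6, yet 0.0 is smaller than 1.6, so the checker reports a violation, while `exact_cmp` passes.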

So I unzipped PdfBoxPreparator.jar, injected the latest precompiled pdfbox 1.8.10, rezipped everything into a jar file, and here we go!
Error resolved.
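
For those who prefer a script over manual unzip/zip, here is a minimal Python sketch of the same repacking step. It rests on an assumption that is not confirmed in this thread: that PdfBoxPreparator.jar carries the pdfbox classes unpacked under `org/apache/pdfbox/`. The function name `repack_jar`, its parameters and the `prefix` default are hypothetical, so check the real layout of your jar first.

```python
import zipfile


def repack_jar(original_jar, replacement_jar, output_jar,
               prefix="org/apache/pdfbox/"):
    """Write output_jar as a copy of original_jar in which every entry
    under `prefix` is swapped for the entries found under the same
    prefix in replacement_jar.

    Assumption: the library classes sit unpacked under `prefix`.
    """
    with zipfile.ZipFile(original_jar) as src, \
         zipfile.ZipFile(replacement_jar) as new, \
         zipfile.ZipFile(output_jar, "w", zipfile.ZIP_DEFLATED) as out:
        # Keep everything that is not part of the old library,
        # in the original order (so the manifest stays where it was).
        for info in src.infolist():
            if not info.filename.startswith(prefix):
                out.writestr(info, src.read(info.filename))
        # Add the library entries from the newer jar.
        for info in new.infolist():
            if info.filename.startswith(prefix):
                out.writestr(info, new.read(info.filename))
```

A call could look like `repack_jar("PdfBoxPreparator.jar", "pdfbox-1.8.10.jar", "PdfBoxPreparator-new.jar")`, where the second and third file names are placeholders. The sketch writes a new file and leaves the original jar untouched, so it is easy to roll back. It does not touch any other libraries bundled in the jar, which may also need to match the new pdfbox version.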

Any chance of updating pdfbox in the next release?

Regards,
Ozy.

Post by gillymour » Tue Jan 17, 2017 3:19 am

Hi Ozy,

I followed your process and tried to inject the latest precompiled pdfbox 1.8.10 into the PdfBoxPreparator.jar file. I also tried this with 1.8.8 and 1.8.13. Every attempt caused regain to crash.

How did you successfully inject a different pdfbox version into PdfBoxPreparator.jar?

Cheers,
Declan

Post by deajan » Tue Mar 21, 2017 9:42 pm

Hi,

Sorry, I don't visit this forum very often.
I simply unzipped the PdfBoxPreparator.jar file on Linux, replaced the pdfbox file and zipped it again.
Maybe you should try the 1.8.10 version I succeeded with.

If not, feel free to mail me at ozy [ a t ] netpower [dot] fr so I can mail you the file.