Bad file URL crashes crawler


Post by kfiles » Fri Jun 07, 2013 6:20 pm

My crawl was unexpectedly halted by the following exception:

Code:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "%2"
        at java.net.URLDecoder.decode(URLDecoder.java:173)
        at net.sf.regain.RegainToolkit.urlDecode(Unknown Source)
        at net.sf.regain.RegainToolkit.urlToWhitespacedFileName(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.IndexWriterManager.createNewIndexEntry(Unknown Source)
        at net.sf.regain.crawler.IndexWriterManager.addToIndex(Unknown Source)
        at net.sf.regain.crawler.Crawler.run(Unknown Source)
        at net.sf.regain.crawler.Main.main(Unknown Source)



Looking at the code for RegainToolkit, I found that it does not catch IllegalArgumentException, so I added the following patch (unfortunately mangled to get past the forum's URL filter). With it, the crawler no longer crashes and the error message names the offending URL:


Code:
diff --git a/src/net/sf/regain/RegainToolkit.java b/src/net/sf/regain/RegainToolkit.java
index 31c734d..085fe43 100644
--- a/src/net/sf/regain/RegainToolkit.java
+++ b/src/net/sf/regain/RegainToolkit.java
@@ -44,6 +44,7 @@
[some imports I cannot get past the filter]
@@ -1528,6 +1529,8 @@ public class RegainToolkit {
       return URLDecoder.decode(text, encoding);
     } catch (UnsupportedEncodingException exc) {
       throw new RegainException("URL-decoding failed: '" + text + "'", exc);
+    } catch (IllegalArgumentException exc) {
+      throw new RegainException("URL-decoding failed: '" + text + "'", exc);
     }
   }


This has solved the problem; it would be great if you could apply the patch to git. The crawler now gives me the following error message, rather than a crash:

Code:
14:09:20: Preparing [...]m411.php?tablename=boss_services&where=prod_nm%20=%20%27Intelligent%20Internet%20Management%27%20OR%20prod_desc%20like%20%27ITP%20Intelligent%20Network%20Firewall%%27&page_title=Firewall%20Services%20in%20BOSS&quiet=1 with preparator net.sf.regain.crawler.preparator.EmptyPreparator failed
net.sf.regain.RegainException: URL-decoding failed: 'm411.php?tablename=boss_services&where=prod_nm%20=%20%27Intelligent%20Internet%20Management%27%20OR%20prod_desc%20like%20%27ITP%20Intelligent%20Network%20Firewall%%27&page_title=Firewall%20Services%20in%20BOSS&quiet=1 m411 '
        at net.sf.regain.RegainToolkit.urlDecode(Unknown Source)
        at net.sf.regain.RegainToolkit.urlToWhitespacedFileName(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.document.DocumentFactory.createDocument(Unknown Source)
        at net.sf.regain.crawler.IndexWriterManager.createNewIndexEntry(Unknown Source)
        at net.sf.regain.crawler.IndexWriterManager.addToIndex(Unknown Source)
        at net.sf.regain.crawler.Crawler.run(Unknown Source)
        at net.sf.regain.crawler.Main.main(Unknown Source)
Caused by: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "%2"
        at java.net.URLDecoder.decode(URLDecoder.java:173)
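
The trace above shows the IllegalArgumentException surfacing as the cause of a wrapped exception instead of killing the crawl. A minimal standalone sketch of that catch-and-wrap behaviour follows. It is not regain's code: WrappedDecodeException is a hypothetical stand-in for RegainException, and the class and helper names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeFailureDemo {
    /** Hypothetical stand-in for net.sf.regain.RegainException. */
    static class WrappedDecodeException extends Exception {
        WrappedDecodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Mirrors the patched catch clauses: both failure modes are wrapped. */
    static String urlDecode(String text, String encoding) throws WrappedDecodeException {
        try {
            return URLDecoder.decode(text, encoding);
        } catch (UnsupportedEncodingException | IllegalArgumentException exc) {
            // IllegalArgumentException is unchecked, so the compiler never
            // forces a handler for it; without this clause it escapes.
            throw new WrappedDecodeException("URL-decoding failed: '" + text + "'", exc);
        }
    }

    /** Returns the decoded text, or a one-line summary naming the cause type. */
    static String describe(String text) {
        try {
            return "ok: " + urlDecode(text, "UTF-8");
        } catch (WrappedDecodeException exc) {
            return "failed: " + exc.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("Network%20Firewall%25%27")); // '%' escaped as %25
        System.out.println(describe("Network%20Firewall%%27"));   // bare '%' as in the crawl log
    }
}
```

The first call decodes normally; the second fails because the decoder tries to read "%2" as two hex digits.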


Clearly, that URL is broken at "Network%20Firewall%%27": there is a bare % that was never escaped as %25. And in fact the original (abhorrent PHP) URL has a literal '%' character in it:
... m411.php?tablename=boss_services&where=prod_nm = 'Intelligent Internet Management' OR prod_desc like 'ITP Intelligent Network Firewall%'&page_title=Firewall Services in BOSS&quiet=1
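
Since the only defect is a bare '%', one possible workaround is to repair such characters before decoding. The sketch below assumes that any '%' not followed by two hex digits was meant literally and rewrites it to %25 first. LenientUrlDecoder and decodeLeniently are hypothetical names, not part of regain, and the heuristic is only a guess at the sender's intent.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

class LenientUrlDecoder {
    // A '%' that is not followed by two hex digits cannot start a valid escape.
    private static final Pattern STRAY_PERCENT = Pattern.compile("%(?![0-9A-Fa-f]{2})");

    /** Escapes stray '%' characters as %25, then decodes as UTF-8. */
    static String decodeLeniently(String text) {
        String repaired = STRAY_PERCENT.matcher(text).replaceAll("%25");
        try {
            return URLDecoder.decode(repaired, "UTF-8");
        } catch (UnsupportedEncodingException exc) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(exc);
        }
    }

    public static void main(String[] args) {
        // Fragment taken from the failing URL in the log.
        System.out.println(decodeLeniently("Network%20Firewall%%27"));
    }
}
```

The heuristic cannot tell a literal '%' apart from a real escape when two hex digits happen to follow it, so it reduces crashes rather than guaranteeing the original text is recovered.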

However, that URL does round trip through RegainToolkit cleanly:
Code:
groovy:000> System.print(RegainToolkit.urlEncode("m411.php?tablename=boss_services&where=prod_nm = 'Intelligent Internet Management' OR prod_desc like 'ITP Intelligent Network Firewall%'&page_title=Firewall Services in BOSS&quiet=1", "UTF-8"))
m411.php%3Ftablename%3Dboss_services%26where%3Dprod_nm+%3D+%27Intelligent+Internet+Management%27+OR+prod_desc+like+%27ITP+Intelligent+Network+Firewall%25%27%26page_title%3DFirewall+Services+in+BOSS%26quiet%3D1
groovy:000> System.print(RegainToolkit.urlDecode("m411.php%3Ftablename%3Dboss_services%26where%3Dprod_nm+%3D+%27Intelligent+Internet+Management%27+OR+prod_desc+like+%27ITP+Intelligent+Network+Firewall%25%27%26page_title%3DFirewall+Services+in+BOSS%26quiet%3D1", "UTF-8"))
m411.php?tablename=boss_services&where=prod_nm = 'Intelligent Internet Management' OR prod_desc like 'ITP Intelligent Network Firewall%'&page_title=Firewall Services in BOSS&quiet=1


So the question becomes: why isn't the URL encoded the way I'd expect from RegainToolkit.urlEncode()? How does that stray '%' character end up in the encoded URL? Is there another path the crawler takes to encode the URL?
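
For what it's worth, the two encoded forms in this post already differ in more than the '%': the crawled URL uses %20 for spaces and leaves '=' alone, while the urlEncode() output uses '+' and %3D. The sketch below reproduces both forms for one fragment. The naiveEscape helper is purely hypothetical (it is not taken from regain or from the PHP site) and only illustrates how a partial escaper that forgets '%' would yield the broken form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodingComparison {
    /** Hypothetical partial escaper: handles spaces and quotes but forgets '%'. */
    static String naiveEscape(String raw) {
        return raw.replace(" ", "%20").replace("'", "%27");
    }

    /** Form-style encoding via the JDK, which does escape '%' as %25. */
    static String formEncode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException exc) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(exc);
        }
    }

    public static void main(String[] args) {
        String raw = "prod_desc like 'ITP Intelligent Network Firewall%'";
        // prints prod_desc%20like%20%27ITP%20Intelligent%20Network%20Firewall%%27
        System.out.println(naiveEscape(raw));
        // prints prod_desc+like+%27ITP+Intelligent+Network+Firewall%25%27
        System.out.println(formEncode(raw));
    }
}
```

If that reading is right, the broken URL was most likely escaped by something other than urlEncode(), possibly before the crawler ever saw it.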

Thanks,
---
Kirby Files
Software Architect
Masergy Communications