[patch] add support for framesets

Suggestions, questions or problems with regain

Moderator: thtesche

[patch] add support for framesets

Postby oliver » Sun Aug 05, 2012 4:36 pm

Hi,

it looks like regain doesn't index documents that are referenced only through a frameset (i.e. with a <frame> tag). Here's a patch that makes HtmlPreparator extract links from <frame> tags as well. It would be nice if this feature could be added to regain!

Thanks,
Oliver


Code:
commit 996f3fe466129b4098aef49f83af2dba796edde2
Author: Oliver Gerlich <oliver.gerlich@gmx.de>
Date:   Sun Aug 5 18:25:42 2012 +0200

    HtmlPreparator: extract links from <frame> tags

diff --git a/src/net/sf/regain/crawler/preparator/HtmlPreparator.java b/src/net/sf/regain/crawler/preparator/HtmlPreparator.java
index a4a51d7..30ecf70 100644
--- a/src/net/sf/regain/crawler/preparator/HtmlPreparator.java
+++ b/src/net/sf/regain/crawler/preparator/HtmlPreparator.java
@@ -50,6 +50,7 @@ import org.htmlparser.beans.StringBean;
 import org.htmlparser.lexer.Lexer;
 import org.htmlparser.lexer.Page;
 import org.htmlparser.tags.LinkTag;
+import org.htmlparser.tags.FrameTag;
 import org.htmlparser.util.ParserException;

 /**
@@ -242,6 +243,7 @@ public class HtmlPreparator extends AbstractPreparator {
       // Parse the content
       parser.visitAllNodesWith(linkVisitor);
       ArrayList<Tag> links = linkVisitor.getLinks();
+      ArrayList<Tag> frames = linkVisitor.getFrames();
       htmlPage.setBaseUrl(rawDocument.getUrl());

       // Iterate over all links found
@@ -262,6 +264,18 @@ public class HtmlPreparator extends AbstractPreparator {
         }
       }

+      // Iterate over all frames found
+      Iterator framesIter = frames.iterator();
+      while (framesIter.hasNext()) {
+        FrameTag currTag = ((FrameTag) framesIter.next());
+        String link = CrawlerToolkit.removeAnchor(currTag.getFrameLocation());
+
+        // find urls which do not end with an '/' but are a directory
+        link = CrawlerToolkit.completeDirectory(link);
+
+        rawDocument.addLink(link, "");
+      }
+
     } catch (ParserException ex) {
       throw new RegainException("Error while extracting links: ", ex);
     }
diff --git a/src/net/sf/regain/crawler/preparator/html/LinkVisitor.java b/src/net/sf/regain/crawler/preparator/html/LinkVisitor.java
index 04bd91b..419bf01 100644
--- a/src/net/sf/regain/crawler/preparator/html/LinkVisitor.java
+++ b/src/net/sf/regain/crawler/preparator/html/LinkVisitor.java
@@ -32,11 +32,16 @@ import org.htmlparser.visitors.NodeVisitor;
 public class LinkVisitor extends NodeVisitor {

   ArrayList<Tag> mExtLinks = new ArrayList<Tag>();
+  ArrayList<Tag> mExtFrames = new ArrayList<Tag>();

   public ArrayList<Tag> getLinks() {
     return mExtLinks;
   }

+  public ArrayList<Tag> getFrames() {
+    return mExtFrames;
+  }
+
   @Override
   public void visitTag(Tag tag) {

@@ -49,5 +54,14 @@ public class LinkVisitor extends NodeVisitor {
         //System.err.println("Corrupt html found!");
       }
     }
+
+    if ("frame".equalsIgnoreCase(name)) {
+      String srcValue = tag.getAttribute("src");
+      if (srcValue != null) {
+        mExtFrames.add(tag);
+      } else {
+        //System.err.println("Corrupt html found!");
+      }
+    }
   }
}
oliver
 
Posts: 2
Joined: Sun Aug 05, 2012 4:30 pm

Re: [patch] add support for framesets

Postby benjamin » Sat Aug 18, 2012 12:47 pm

Sorry, we would be happy to include your patch, but I can't figure out how to apply a git patch to SVN. Can you send me the complete files?
Also, it would be nice if you could create a test-frameset.html that we can use in our test suite.
benjamin
 
Posts: 65
Joined: Wed May 25, 2011 9:19 am
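
For readers who want to try the idea without a regain checkout, here is a minimal sketch of what such a frameset test could look like. Everything in it is an assumption made for illustration: the sample page, the class name FramesetSample and the method extractFrameLinks are hypothetical, and it uses a plain java.util.regex pattern instead of the htmlparser LinkVisitor that the patch extends. It only mirrors the behaviour described in the patch: collect the src of every <frame> tag, skip frames without a src, and strip the anchor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FramesetSample {

  // Hypothetical sample page (not the test-frameset.html from the regain test suite).
  static final String SAMPLE_FRAMESET =
      "<html><head><title>Frameset test</title></head>\n"
      + "<frameset cols=\"20%,80%\">\n"
      + "  <frame name=\"nav\" src=\"nav.html\">\n"
      + "  <FRAME name=\"main\" SRC=\"content/index.html#top\">\n"
      + "  <frame name=\"empty\">\n"
      + "</frameset></html>\n";

  // Matches a <frame ...> tag (but not <frameset>) and captures a quoted src value.
  private static final Pattern FRAME_SRC = Pattern.compile(
      "<frame\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
      Pattern.CASE_INSENSITIVE);

  // Returns the src of every <frame> tag that has one, with any #anchor removed.
  static List<String> extractFrameLinks(String html) {
    List<String> links = new ArrayList<String>();
    Matcher matcher = FRAME_SRC.matcher(html);
    while (matcher.find()) {
      String link = matcher.group(1);
      int anchorPos = link.indexOf('#');
      if (anchorPos >= 0) {
        link = link.substring(0, anchorPos);
      }
      links.add(link);
    }
    return links;
  }

  public static void main(String[] args) {
    // Prints the frame targets a crawler would be expected to follow.
    System.out.println(extractFrameLinks(SAMPLE_FRAMESET));
  }
}
```

The sample deliberately includes an upper-case tag, a link with an anchor and a frame without src, since those are the cases the patch handles explicitly. A regex is used here only to keep the sketch free of dependencies; it is not a substitute for a real HTML parser.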

Re: [patch] add support for framesets

Postby benjamin » Mon Aug 20, 2012 8:58 am

Committed in Trunk. Thank you.
benjamin
 
Posts: 65
Joined: Wed May 25, 2011 9:19 am

Re: [patch] add support for framesets

Postby oliver » Tue Aug 21, 2012 7:49 am

Thanks for adding this!
oliver
 
Posts: 2
Joined: Sun Aug 05, 2012 4:30 pm


