CS 1705 Library

cs1705.web
Class TurboWebBot

java.lang.Object
  extended by cs1705.web.WebBot
      extended by cs1705.web.TurboWebBot

public class TurboWebBot
extends WebBot

This advanced WebBot provides additional methods useful for extracting content from web pages based on tag type, tag id, CSS class, or other features.

Version:
2009.03.31
Author:
Stephen Edwards

Constructor Summary
TurboWebBot()
          Creates a new WebBot that is not yet viewing any web page.
TurboWebBot(File file)
          Creates a new WebBot for a given file.
TurboWebBot(String url)
          Creates a new WebBot for a given URL.
TurboWebBot(URI uri)
          Creates a new WebBot for a given URI.
TurboWebBot(URL url)
          Creates a new WebBot for a given URL.
 
Method Summary
 void addXpathNamespace(String name, String url)
          Bind a symbolic name to an XML namespace URL so that the symbolic name can be used as a namespace prefix on identifiers in XPATH expressions.
 void advanceToNextElement()
          Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of interest it can find.
 void advanceToNextElement(String tagType)
          Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of the specified type that it can find.
 List<HtmlElement> getAllElementsMatchingXpath(String xpathExpression)
          Find nodes within the current document using an XPATH expression.
 List<HtmlElement> getAllMatchingElements(String tagType)
          Get all HTML elements of the specified type on this web page.
 List<HtmlElement> getAllMatchingElements(String parentTag, String... childTag)
          Get all HTML elements of the specified type on this web page, based on the context where the elements appear.
 HtmlElement getElementById(String id)
          Get the first HTML element with the specified id on this web page, using the HTML id="..." attribute on the element.
 List<HtmlElement> getElementsByCssClass(String cssClass)
          Get all the HTML elements with the specified CSS class on this web page, using the HTML class="..." attribute on the elements.
 HtmlElement getFirstMatchingElement(String tagType)
          Get the first HTML element of the specified type on this web page.
 HtmlElement getFirstMatchingElement(String parentTag, String... childTag)
          Get the first HTML element of the specified type on this web page, based on the context where the element appears.
 String getPageContent()
          Get the current web page's entire content as a string.
 int getPagePhraseCount()
          Get a count of the number of times the set phrase of interest occurs in the current page.
 double getPagePhraseFrequency()
          Get the frequency of the phrase of interest in the current page.
 boolean hasNextElement()
          Determine whether there are any more HTML elements of interest further down the page from the robot's current position.
 boolean hasNextElement(String tagType)
          Determine whether there are any more HTML elements of the specified type further down the page from the robot's current position.
 boolean hasVisitedPage(File file)
          Check whether this robot has visited this page before.
 boolean isLookingAtElement()
          Is the robot looking at (or standing on) an HTML element of interest on the current page? Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).
 boolean isLookingAtElement(String tagType)
          Is the robot looking at (or standing on) an HTML element of the specified type on the current page? The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).
 void jumpToPage(File file)
          Causes the bot to temporarily leave the current page and hop over to the specified file.
 void resetElementsOfInterest(String... tagTypes)
          Move the WebBot back to the beginning of the page and reset the set of elements that it can walk over to the given set of elements.
 void setPhraseOfInterest(String phrase)
          Set the key phrase of interest to look for in documents.
 
Methods inherited from class cs1705.web.WebBot
advanceToNextHeading, advanceToNextLink, echoCurrentElementText, echoPageTitle, getCurrentElement, getCurrentElementText, getHeadingLevel, getHeadings, getHeadingsToLevel, getLinks, getLinksOffServer, getLinksToOtherPages, getLinkURI, getOutputChannel, getPageTitle, getPageURL, hasPreviousPage, hasVisitedPage, hasVisitedPage, isLookingAtEndOfPage, isLookingAtHeading, isLookingAtLink, isViewingWebPage, jumpToLinkedPage, jumpToPage, jumpToPage, jumpToPage, jumpToThisHTML, linkGoesToAnotherPage, linkGoesToAnotherServer, numberOfPreviousPages, out, outputIsHtml, resolveURIFromPage, returnToPreviousPage, returnToStartOfPage, run, setOutputChannel, setOutputIsHtml, toString
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TurboWebBot

public TurboWebBot()
Creates a new WebBot that is not yet viewing any web page.


TurboWebBot

public TurboWebBot(URI uri)
Creates a new WebBot for a given URI.

Parameters:
uri - The web page where the robot will start.

TurboWebBot

public TurboWebBot(URL url)
Creates a new WebBot for a given URL.

Parameters:
url - The web page where the robot will start.

TurboWebBot

public TurboWebBot(String url)
Creates a new WebBot for a given URL.

Parameters:
url - The web page where the robot will start.

TurboWebBot

public TurboWebBot(File file)
Creates a new WebBot for a given file.

Parameters:
file - The web page where the robot will start.

Method Detail

setPhraseOfInterest

public void setPhraseOfInterest(String phrase)
Set the key phrase of interest to look for in documents. This string will be interpreted as a case-insensitive regular expression.

Parameters:
phrase - a regular expression

getPagePhraseCount

public int getPagePhraseCount()
Get a count of the number of times the set phrase of interest occurs in the current page. Requires the bot to be viewing a web page, and that the phrase of interest has been set.

Returns:
The number of occurrences of the phrase of interest in the current web page

getPagePhraseFrequency

public double getPagePhraseFrequency()
Get the frequency of the phrase of interest in the current page. This is a number between 0 and 1 that approximates the fraction of the page made up of the target phrase. It is calculated by taking the total size of all occurrences of the target phrase in the document and dividing by the document's total size.

Note that this number tends to be small, since even interesting phrases usually constitute only a small fraction of a page with any interesting amount of information in it. However, it does provide a relative measure of how many times a phrase has been used, normalized by the size of the document.

Requires the bot to be viewing a web page, and that the phrase of interest has been set.

Returns:
The frequency of the phrase of interest in the current web page
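
The calculation described above can be sketched with java.util.regex. This is only an illustration, not the library's implementation: the class PhraseStats and its methods are hypothetical, and the sketch assumes that "size" is measured in characters and that matches do not overlap.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of cs1705.web.
class PhraseStats {
    // Compile the phrase the way the documentation describes it:
    // as a case-insensitive regular expression.
    static Pattern compile(String phraseRegex) {
        return Pattern.compile(phraseRegex, Pattern.CASE_INSENSITIVE);
    }

    // Number of non-overlapping matches of the phrase in the page text.
    static int count(String phraseRegex, String pageText) {
        Matcher m = compile(phraseRegex).matcher(pageText);
        int occurrences = 0;
        while (m.find()) {
            occurrences++;
        }
        return occurrences;
    }

    // Characters covered by matches divided by total characters.
    // Assumption: "size" is measured in characters.
    static double frequency(String phraseRegex, String pageText) {
        if (pageText.isEmpty()) {
            return 0.0;
        }
        Matcher m = compile(phraseRegex).matcher(pageText);
        int matchedChars = 0;
        while (m.find()) {
            matchedChars += m.end() - m.start();
        }
        return (double) matchedChars / pageText.length();
    }

    public static void main(String[] args) {
        String page = "Web bots crawl the web.";
        System.out.println(count("web", page));      // prints 2
        System.out.println(frequency("web", page));  // 6 matched chars / 23 total
    }
}
```

Even in this tiny sample the frequency is well below 1, which matches the note above that the value tends to be small on real pages.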

advanceToNextElement

public void advanceToNextElement()
Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of interest it can find. Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). If there are no elements of interest in the document, it will end up looking at the end of the page. Requires the bot to be viewing a web page.


advanceToNextElement

public void advanceToNextElement(String tagType)
Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of the specified type that it can find. The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). If there are no more elements of the desired type in the document, or the desired type is not an element of interest, the robot will end up looking at the end of the page. Requires the bot to be viewing a web page.

Parameters:
tagType - The type of element to look for (case-sensitive)

hasNextElement

public boolean hasNextElement()
Determine whether there are any more HTML elements of interest further down the page from the robot's current position. Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). Requires the bot to be viewing a web page.

Returns:
True if there are any more elements of interest in the remainder of the document

hasNextElement

public boolean hasNextElement(String tagType)
Determine whether there are any more HTML elements of the specified type further down the page from the robot's current position. The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). Requires the bot to be viewing a web page.

Parameters:
tagType - The type of element to look for
Returns:
True if there are any more elements of the specified type in the remainder of the document. False if there are no more elements of that type, or if the specified tag type is not an element of interest.

isLookingAtElement

public boolean isLookingAtElement()
Is the robot looking at (or standing on) an HTML element of interest on the current page? Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).

Returns:
True if the robot is positioned at an element of interest, or false otherwise.

isLookingAtElement

public boolean isLookingAtElement(String tagType)
Is the robot looking at (or standing on) an HTML element of the specified type on the current page? The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).

Parameters:
tagType - The type of element to look for
Returns:
True if the robot is positioned at an element of the desired type, or false otherwise. Also false if the specified tag type is not an element of interest.

getFirstMatchingElement

public HtmlElement getFirstMatchingElement(String tagType)
Get the first HTML element of the specified type on this web page. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type can be any HTML element, and the robot will search for and find the first such element on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
tagType - The kind of element to search for.
Returns:
The first matching element on the current web page, or null if none is found.
See Also:
getAllMatchingElements(String)

getFirstMatchingElement

public HtmlElement getFirstMatchingElement(String parentTag,
                                           String... childTag)
Get the first HTML element of the specified type on this web page, based on the context where the element appears. For example, if you want the first anchor in the first row of the first table on a page, you could use this call:
 HtmlElement result = myBot.getFirstMatchingElement("table", "tr", "a");
 
This method supports a variable number of arguments. It will find the first occurrence of the first element type listed. Then, inside that element, it will look for the first occurrence of the second element type, and then search inside that one for the first occurrence of the third element type, and so on. It returns the most deeply nested element in this series that it finds. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type(s) can be any HTML element, and the robot will search for and find the first matching element on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
parentTag - The first element to search for.
childTag - Additional elements to find--each one will be searched for within the contents of the element immediately preceding it in the argument list.
Returns:
The first matching element on the current web page, or null if none is found.
See Also:
getAllMatchingElements(String, String...)
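
The nested search described above can be sketched with the standard org.w3c.dom API. This is a rough illustration rather than the library's code: NestedSearch and its methods are hypothetical, the input must be well-formed markup, and the sketch assumes that null is returned as soon as any level in the chain has no match.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of cs1705.web.
class NestedSearch {
    // First element named parentTag in the document, then the first childTags[0]
    // inside it, then the first childTags[1] inside that, and so on.
    // Assumption: returns null as soon as one level has no match.
    static Element firstMatching(Document doc, String parentTag, String... childTags) {
        Element current = first(doc.getElementsByTagName(parentTag));
        for (String tag : childTags) {
            if (current == null) {
                return null;
            }
            current = first(current.getElementsByTagName(tag));
        }
        return current;
    }

    // getElementsByTagName returns descendants in document order.
    private static Element first(NodeList hits) {
        return hits.getLength() > 0 ? (Element) hits.item(0) : null;
    }

    // Convenience wrapper: parse well-formed markup and return the text of the match.
    static String firstMatchingText(String markup, String parentTag, String... childTags) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Element hit = firstMatching(doc, parentTag, childTags);
            return hit == null ? null : hit.getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href='x'>outside</a>"
                + "<table><tr><td><a href='y'>inside</a></td></tr></table></body></html>";
        System.out.println(firstMatchingText(page, "table", "tr", "a")); // prints inside
    }
}
```

Note that the anchor outside the table is skipped: each step searches only within the element found by the previous step.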

getAllMatchingElements

public List<HtmlElement> getAllMatchingElements(String tagType)
Get all HTML elements of the specified type on this web page. This method is just like getFirstMatchingElement(String), except that it returns all matches instead of just the first one. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type can be any HTML element, and the robot will search for and find all such elements on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
tagType - The kind of element to search for.
Returns:
A list of all the matching elements. The list will be empty if none are found.
See Also:
getFirstMatchingElement(String)

getAllMatchingElements

public List<HtmlElement> getAllMatchingElements(String parentTag,
                                                String... childTag)
Get all HTML elements of the specified type on this web page, based on the context where the elements appear. This method is just like getFirstMatchingElement(String, String...), except that it returns all matches instead of just the first one. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag types can be any HTML element, and the robot will search for and find all such elements on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
parentTag - The first element to search for.
childTag - Additional elements to find--each one will be searched for within the contents of the element immediately preceding it in the argument list.
Returns:
A list of all the matching elements. The list will be empty if none are found.
See Also:
getFirstMatchingElement(String, String...)

getElementById

public HtmlElement getElementById(String id)
Get the first HTML element with the specified id on this web page, using the HTML id="..." attribute on the element. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find the first element with the given id on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
id - The id to search for.
Returns:
The first (and usually only) element on the current web page with the given id, or null if none is found.

getElementsByCssClass

public List<HtmlElement> getElementsByCssClass(String cssClass)
Get all the HTML elements with the specified CSS class on this web page, using the HTML class="..." attribute on the elements. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find all the elements with the given CSS class on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.

Parameters:
cssClass - The CSS class to search for.
Returns:
A list of all elements on the current web page with the given CSS class. The list will be empty if none are found.

resetElementsOfInterest

public void resetElementsOfInterest(String... tagTypes)
Move the WebBot back to the beginning of the page and reset the set of elements that it can walk over to the given set of elements. By default, a WebBot is interested in links and headings (a, h1, h2, h3, h4, h5, h6), but you can change the set of elements it will step through to any group of HTML elements you like. This method supports a variable number of arguments, so you can provide as many different element types as you like. If you provide no arguments, it will reset back to the default of all links and headings.

For example, to ignore all elements (including links and headings) except for image elements, use:

 myBot.resetElementsOfInterest("img");
 

If you want to look at links and at table cells:

 myBot.resetElementsOfInterest("a", "td");
 

Requires the bot to be viewing a web page.

Parameters:
tagTypes - a list of zero or more element types to look for. If none are specified, the default of ("a", "h1", "h2", "h3", "h4", "h5", "h6") will be used instead
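
To make the interaction between the elements of interest and the stepping methods concrete, here is a toy model in which a page is just a list of tag names. It is not the cs1705 implementation: ToyCursor and currentTag() are hypothetical, and the sketch assumes that after a reset the robot stands at the start of the page, before any element.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in, NOT the cs1705 implementation: a page is a list of tag names.
class ToyCursor {
    private static final List<String> DEFAULTS =
            Arrays.asList("a", "h1", "h2", "h3", "h4", "h5", "h6");

    private final List<String> pageTags;
    private Set<String> interest = new HashSet<>(DEFAULTS);
    private int position = -1; // assumption: -1 means start of page, before any element

    ToyCursor(String... pageTags) {
        this.pageTags = Arrays.asList(pageTags);
    }

    // Go back to the start and replace the set of elements of interest.
    void resetElementsOfInterest(String... tagTypes) {
        interest = new HashSet<>(tagTypes.length == 0 ? DEFAULTS : Arrays.asList(tagTypes));
        position = -1;
    }

    boolean hasNextElement() { return indexOfNext(null) < pageTags.size(); }

    boolean hasNextElement(String tagType) { return indexOfNext(tagType) < pageTags.size(); }

    void advanceToNextElement() { position = indexOfNext(null); }

    void advanceToNextElement(String tagType) { position = indexOfNext(tagType); }

    boolean isLookingAtEndOfPage() { return position >= pageTags.size(); }

    // Tag under the cursor, or null at the start or end of the page.
    String currentTag() {
        return (position < 0 || isLookingAtEndOfPage()) ? null : pageTags.get(position);
    }

    // Index of the next element of interest (optionally of one type),
    // or pageTags.size() to signal the end of the page.
    private int indexOfNext(String tagType) {
        for (int i = position + 1; i < pageTags.size(); i++) {
            String tag = pageTags.get(i);
            if (interest.contains(tag) && (tagType == null || tagType.equals(tag))) {
                return i;
            }
        }
        return pageTags.size();
    }

    public static void main(String[] args) {
        ToyCursor bot = new ToyCursor("h1", "p", "a", "img", "a");
        bot.resetElementsOfInterest("img");
        bot.advanceToNextElement();
        System.out.println(bot.currentTag()); // prints img
    }
}
```

In this model, asking for a tag type that is not an element of interest finds nothing and leaves the cursor at the end of the page, mirroring the behavior documented for advanceToNextElement(String).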

addXpathNamespace

public void addXpathNamespace(String name,
                              String url)
Bind a symbolic name to an XML namespace URL so that the symbolic name can be used as a namespace prefix on identifiers in XPATH expressions. This method is for advanced users only. It is only necessary if your WebBot is manipulating content that is not HTML/XHTML, and you need to write XPATH expressions in some other XML namespace. By default, the prefix "html" is bound to the namespace http://www.w3.org/1999/xhtml. You can add as many additional namespaces as you need in order to build your own XPATH expressions.

Parameters:
name - The symbolic prefix to use for this namespace
url - The URL identifying this XML namespace

getAllElementsMatchingXpath

public List<HtmlElement> getAllElementsMatchingXpath(String xpathExpression)
Find nodes within the current document using an XPATH expression. This method is for advanced users only, and requires that you understand XPATH. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find all nodes on the page that match the given XPATH expression, regardless of where the robot is currently standing. Your XPATH expression must use namespaces for all element names. By default, the prefix "html" is bound to the namespace http://www.w3.org/1999/xhtml. You can add additional namespace bindings yourself using addXpathNamespace(String, String) if you need more. Requires the bot to be viewing a web page.

Parameters:
xpathExpression - The XPATH expression to search for
Returns:
A list of all matching nodes. The list will be empty if no matches were found.
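
The requirement that every element name carry a namespace prefix can be seen with the standard javax.xml.xpath API. The sketch below is not the library's code: XpathNamespaceDemo, addNamespace, and textOfMatches are hypothetical names, and the class simply mimics the documented default binding of "html" to http://www.w3.org/1999/xhtml on a well-formed XHTML string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of cs1705.web.
class XpathNamespaceDemo {
    private final Map<String, String> prefixes = new HashMap<>();

    XpathNamespaceDemo() {
        // Mirrors the documented default binding.
        prefixes.put("html", "http://www.w3.org/1999/xhtml");
    }

    // Bind an extra prefix, in the spirit of addXpathNamespace(name, url).
    void addNamespace(String name, String url) {
        prefixes.put(name, url);
    }

    // Evaluate the expression and return the text content of each matching node.
    List<String> textOfMatches(String xhtml, String xpathExpression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed name tests
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    String uri = prefixes.get(prefix);
                    return uri != null ? uri : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) { return null; } // not needed here
                @Override
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            NodeList nodes =
                    (NodeList) xpath.evaluate(xpathExpression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>one</p><p>two</p></body></html>";
        System.out.println(new XpathNamespaceDemo().textOfMatches(page, "//html:p"));
    }
}
```

With a namespace-aware parser, an unprefixed expression such as //p matches nothing in this document, because XPath 1.0 treats unprefixed names as belonging to no namespace.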

getPageContent

public String getPageContent()
Get the current web page's entire content as a string. Requires the bot to be viewing a web page.

Returns:
The page's content

jumpToPage

public void jumpToPage(File file)
Causes the bot to temporarily leave the current page and hop over to the specified file. The bot will "remember" where it came from, keeping track of past pages in a stack. After working with the other page, you can use WebBot.returnToPreviousPage() to come back to the point where you left off.

Parameters:
file - The new page to jump to

hasVisitedPage

public boolean hasVisitedPage(File file)
Check whether this robot has visited this page before.

Parameters:
file - The page to check
Returns:
True if this robot has previously visited (or is currently on) the given web page
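
The page stack mentioned under jumpToPage, together with the visited check, can be pictured with a small model. This is a sketch only, not the cs1705 implementation: ToyPageHistory and currentPage() are hypothetical, pages are plain strings rather than File objects, the position within a page is not tracked, and returning with an empty stack is assumed to do nothing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy stand-in, NOT the cs1705 implementation: pages are plain strings.
class ToyPageHistory {
    private final Deque<String> previousPages = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();
    private String currentPage;

    // Remember where we came from, then move to the new page.
    void jumpToPage(String page) {
        if (currentPage != null) {
            previousPages.push(currentPage);
        }
        currentPage = page;
        visited.add(page);
    }

    boolean hasPreviousPage() {
        return !previousPages.isEmpty();
    }

    // Assumption: a no-op when there is no previous page.
    void returnToPreviousPage() {
        if (hasPreviousPage()) {
            currentPage = previousPages.pop();
        }
    }

    // True for pages visited earlier and for the page currently being viewed.
    boolean hasVisitedPage(String page) {
        return visited.contains(page);
    }

    String currentPage() {
        return currentPage;
    }

    public static void main(String[] args) {
        ToyPageHistory bot = new ToyPageHistory();
        bot.jumpToPage("index.html");
        bot.jumpToPage("about.html");
        bot.returnToPreviousPage();
        System.out.println(bot.currentPage());                // prints index.html
        System.out.println(bot.hasVisitedPage("about.html")); // prints true
    }
}
```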

Last updated: Wed, Apr 1, 2009 • 12:29 AM EDT

Copyright © 2009 Virginia Tech.