TurboWebBot (Student Library Documentation)

java.lang.Object
- student.web.WebBot
- - student.web.TurboWebBot

```
public class TurboWebBot
extends WebBot
```
This advanced WebBot provides additional methods useful for extracting content from web pages based on tag type, tag id, CSS class, or other features.

Version:

$Revision: 1.3 $, $Date: 2010/02/23 17:06:36 $

Author:

Stephen Edwards, Last changed by $Author: stedwar2 $

Nested Class Summary
- Nested classes/interfaces inherited from class student.web.WebBot
  WebBot.Page, WebBot.PageLocation

Field Summary
- Fields inherited from class student.web.WebBot
  ALL_LINKS, HTML_NODE_PREFIX, OTHER_PAGE_LINKS, OTHER_SITE_LINKS, out, pages, targetPhrase, trueChannel

Constructor Summary

Constructors
Constructor and Description
`TurboWebBot()` Creates a new WebBot that is not yet viewing any web page.
`TurboWebBot(File file)` Creates a new WebBot for a given file.
`TurboWebBot(String url)` Creates a new WebBot for a given URL.
`TurboWebBot(URI uri)` Creates a new WebBot for a given URI.
`TurboWebBot(URL url)` Creates a new WebBot for a given URL.

Method Summary

Modifier and Type	Method and Description
`void`	`addXpathNamespace(String name, String url)` Bind a symbolic name to an XML namespace URL so that the symbolic name can be used as a namespace prefix on identifiers in XPATH expressions.
`void`	`advanceToNextElement()` Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of interest it can find.
`void`	`advanceToNextElement(String tagType)` Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of the specified type that it can find.
`List<HtmlElement>`	`getAllElementsMatchingXpath(String xpathExpression)` Find nodes within the current document using an XPATH expression.
`List<HtmlElement>`	`getAllMatchingElements(String tagType)` Get all HTML elements of the specified type on this web page.
`List<HtmlElement>`	`getAllMatchingElements(String parentTag, String... childTag)` Get all HTML elements of the specified type on this web page, based on the context where the elements appear.
`HtmlElement`	`getElementById(String id)` Get the first HTML element with the specified id on this web page, using the HTML id="..." attribute on the element.
`List<HtmlElement>`	`getElementsByCssClass(String cssClass)` Get all the HTML elements with the specified CSS class on this web page, using the HTML class="..." attribute on the elements.
`HtmlElement`	`getFirstMatchingElement(String tagType)` Get the first HTML element of the specified type on this web page.
`HtmlElement`	`getFirstMatchingElement(String parentTag, String... childTag)` Get the first HTML element of the specified type on this web page, based on the context where the element appears.
`String`	`getPageContent()` Get the current web page's entire content as a string.
`int`	`getPagePhraseCount()` Get a count of the number of times the set phrase of interest occurs in the current page.
`double`	`getPagePhraseFrequency()` Get the frequency of the phrase of interest in the current page.
`boolean`	`hasNextElement()` Determine whether there are any more HTML elements of interest further down the page from the robot's current position.
`boolean`	`hasNextElement(String tagType)` Determine whether there are any more HTML elements of the specified type further down the page from the robot's current position.
`boolean`	`hasVisitedPage(File file)` Check whether this robot has visited this page before.
`boolean`	`isLookingAtElement()` Is the robot looking at (or standing on) an HTML element of interest on the current page? Elements of interest can be controlled by calling `resetElementsOfInterest(String...)` (the default is all links and all heading tags).
`boolean`	`isLookingAtElement(String tagType)` Is the robot looking at (or standing on) an HTML element of the specified type on the current page? The specified element type must be one of the elements of interest, as specified by calling `resetElementsOfInterest(String...)` (the default is all links and all heading tags).
`void`	`jumpToPage(File file)` Causes the bot to temporarily leave the current page and hop over to the specified file.
`void`	`resetElementsOfInterest(String... tagTypes)` Move the WebBot back to the beginning of the page and reset the set of elements that it can walk over to the given set of elements.
`void`	`setPhraseOfInterest(String phrase)` Set the key phrase of interest to look for in documents.

Methods inherited from class student.web.WebBot
advanceToNextHeading, advanceToNextLink, cachedPageFor, echoCurrentElementText, echoPageTitle, getCurrentElement, getCurrentElementText, getHeadingLevel, getHeadings, getHeadingsToLevel, getLinks, getLinksOffServer, getLinksToOtherPages, getLinkURI, getOutputChannel, getPageTitle, getPageURL, hasPreviousPage, hasVisitedPage, hasVisitedPage, isHeading, isLink, isLookingAtEndOfPage, isLookingAtHeading, isLookingAtLink, isViewingWebPage, jumpToLinkedPage, jumpToNormalizedURI, jumpToNormalizedURL, jumpToNormalizedURL, jumpToPage, jumpToPage, jumpToPage, jumpToPage, jumpToThisHTML, levelOf, linkGoesToAnotherPage, linkGoesToAnotherServer, makeFileAbsolute, normalizeURL, numberOfPreviousPages, out, outputIsHtml, releaseCachedResources, resolveURIFromPage, returnToPreviousPage, returnToStartOfPage, run, setOutputChannel, setOutputIsHtml, toString, urlForString

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - TurboWebBot
```
public TurboWebBot()
```
    Creates a new WebBot that is not yet viewing any web page.
  - TurboWebBot
```
public TurboWebBot(URI uri)
```
    Creates a new WebBot for a given URI.
    
    Parameters:
    
    uri - The web page where the robot will start.
  - TurboWebBot
```
public TurboWebBot(URL url)
```
    Creates a new WebBot for a given URL.
    
    Parameters:
    
    url - The web page where the robot will start.
  - TurboWebBot
```
public TurboWebBot(String url)
```
    Creates a new WebBot for a given URL.
    
    Parameters:
    
    url - The web page where the robot will start.
  - TurboWebBot
```
public TurboWebBot(File file)
```
    Creates a new WebBot for a given file.
    
    Parameters:
    
    file - The web page where the robot will start.
- Method Detail
  - setPhraseOfInterest
```
public void setPhraseOfInterest(String phrase)
```
    Set the key phrase of interest to look for in documents. This string is interpreted as a case-insensitive regular expression.
    
    Parameters:
    
    phrase - a regular expression
  - getPagePhraseCount
```
public int getPagePhraseCount()
```
    Get a count of the number of times the phrase of interest occurs in the current page. Requires that the bot is viewing a web page and that the phrase of interest has been set.
    
    Returns:
    
    The number of occurrences of the phrase of interest in the current web page
  - getPagePhraseFrequency
```
public double getPagePhraseFrequency()
```
    Get the frequency of the phrase of interest in the current page. This is a number between 0 and 1 that approximates the fraction of the page that is made up of the target phrase. It is calculated by taking the total size of all occurrences of the target phrase in the document and dividing by the document's total size.
    Note that this number tends to be small, since even an interesting phrase usually makes up only a small fraction of a page with any substantial amount of information in it. However, it does provide a relative measure of how many times a phrase has been used, normalized by the size of the document.
    Requires that the bot is viewing a web page and that the phrase of interest has been set.
    
    Returns:
    
    The frequency of the phrase of interest in the current web page
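    
    The calculation described above can be sketched with the standard java.util.regex classes. This is only an illustration under stated assumptions, not the library's implementation: the class PhraseStatsSketch and its methods are hypothetical, and "size" is assumed to mean a count of characters, which the documentation does not specify.
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the phrase statistics; not the library's code. */
class PhraseStatsSketch {

    /** Counts non-overlapping, case-insensitive matches of the regex in the text. */
    static int count(String content, String phrase) {
        Matcher m = Pattern.compile(phrase, Pattern.CASE_INSENSITIVE).matcher(content);
        int occurrences = 0;
        while (m.find()) {
            occurrences++;
        }
        return occurrences;
    }

    /** Characters covered by all matches divided by the total character count (assumed measure). */
    static double frequency(String content, String phrase) {
        if (content.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty page
        }
        Matcher m = Pattern.compile(phrase, Pattern.CASE_INSENSITIVE).matcher(content);
        int matchedChars = 0;
        while (m.find()) {
            matchedChars += m.end() - m.start();
        }
        return (double) matchedChars / content.length();
    }

    public static void main(String[] args) {
        String page = "Java is fun. I like java.";
        System.out.println("count = " + count(page, "java"));
        System.out.println("frequency = " + frequency(page, "java"));
    }
}
```
    Because the phrase is treated as a regular expression, characters such as "." or "+" in a literal phrase would need escaping (for example with Pattern.quote) in a sketch like this one.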
  - advanceToNextElement
```
public void advanceToNextElement()
```
    Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of interest it can find. Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). If there are no elements of interest in the document, it will end up looking at the end of the page. Requires the bot to be viewing a web page.
  - advanceToNextElement
```
public void advanceToNextElement(String tagType)
```
    Advance the robot forward in the current document until it is looking at (or standing on) the next HTML element of the specified type that it can find. The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). If there are no more elements of the desired type in the document, or the desired type is not an element of interest, the robot will end up looking at the end of the page. Requires the bot to be viewing a web page.
    
    Parameters:
    
    tagType - The type of element to look for (case-sensitive)
  - hasNextElement
```
public boolean hasNextElement()
```
    Determine whether there are any more HTML elements of interest further down the page from the robot's current position. Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). Requires the bot to be viewing a web page.
    
    Returns:
    
    True if there are any more elements of interest in the remainder of the document
  - hasNextElement
```
public boolean hasNextElement(String tagType)
```
    Determine whether there are any more HTML elements of the specified type further down the page from the robot's current position. The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags). Requires the bot to be viewing a web page.
    
    Parameters:
    
    tagType - The type of element to look for
    
    Returns:
    
    True if there are any more elements of the specified type in the remainder of the document. False if there are no more elements of that type, or if the specified tag type is not an element of interest.
  - isLookingAtElement
```
public boolean isLookingAtElement()
```
    Is the robot looking at (or standing on) an HTML element of interest on the current page? Elements of interest can be controlled by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).
    
    Returns:
    
    True if the robot is positioned at an element of interest, or false otherwise.
  - isLookingAtElement
```
public boolean isLookingAtElement(String tagType)
```
    Is the robot looking at (or standing on) an HTML element of the specified type on the current page? The specified element type must be one of the elements of interest, as specified by calling resetElementsOfInterest(String...) (the default is all links and all heading tags).
    
    Parameters:
    
    tagType - The type of element to look for
    
    Returns:
    
    True if the robot is positioned at an element of the desired type, or false otherwise. Also false if the specified tag type is not an element of interest.
  - getFirstMatchingElement
```
public HtmlElement getFirstMatchingElement(String tagType)
```
    Get the first HTML element of the specified type on this web page. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type can be any HTML element, and the robot will search for and find the first such element on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    tagType - The kind of element to search for.
    
    Returns:
    
    The first matching element on the current web page, or null if none is found.
    
    See Also:
    
    getAllMatchingElements(String)
  - getFirstMatchingElement
```
public HtmlElement getFirstMatchingElement(String parentTag,
                                           String... childTag)
```
    Get the first HTML element of the specified type on this web page, based on the context where the element appears. For example, if you want the first anchor in the first row of the first table on a page, you could use this call:
```
HtmlElement result = myBot.getFirstMatchingElement("table", "tr", "a");
```
    This method supports a variable number of arguments. It will find the first occurrence of the first element type listed. Then, inside that element, it will look for the first occurrence of the second element type, and then search inside that one for the first occurrence of the third element type, and so on. It returns the most deeply nested element in this series that it finds.
    This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type(s) can be any HTML element, and the robot will search for and find the first matching element on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    parentTag - The first element to search for.
    
    childTag - Additional elements to find--each one will be searched for within the contents of the element immediately preceding it in the argument list.
    
    Returns:
    
    The first matching element on the current web page, or null if none is found.
    
    See Also:
    
    getAllMatchingElements(String, String...)
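    
    The nested "first inside first" search described for this method can be sketched with the standard org.w3c.dom classes. This is a rough model only: NestedSearchSketch, firstNested, and parse are hypothetical names, org.w3c.dom.Element stands in for HtmlElement, and the input is assumed to be well-formed markup.
```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical model of the nested first-match search; not the library's code. */
class NestedSearchSketch {

    /** Finds the first parentTag, then the first of each child tag inside the previous hit. */
    static Element firstNested(Document doc, String parentTag, String... childTags) {
        NodeList found = doc.getElementsByTagName(parentTag);
        Element current = found.getLength() > 0 ? (Element) found.item(0) : null;
        for (String tag : childTags) {
            if (current == null) {
                return null; // an outer element was missing, so nothing can match
            }
            found = current.getElementsByTagName(tag); // searches descendants in document order
            current = found.getLength() > 0 ? (Element) found.item(0) : null;
        }
        return current; // most deeply nested match, or null if the chain broke
    }

    /** Parses well-formed markup; checked exceptions are wrapped to keep callers short. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><table>"
                + "<tr><td><a>first</a></td></tr>"
                + "<tr><td><a>second</a></td></tr>"
                + "</table></body></html>");
        Element link = firstNested(doc, "table", "tr", "a");
        System.out.println(link == null ? "no match" : link.getTextContent());
    }
}
```
    As in the method described above, a missing element anywhere in the chain yields null rather than an exception, so callers should check the result before using it.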
  - getAllMatchingElements
```
public List<HtmlElement> getAllMatchingElements(String tagType)
```
    Get all HTML elements of the specified type on this web page. This method is just like getFirstMatchingElement(String), except that it returns all matches instead of just the first one. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type can be any HTML element, and the robot will search for and find all such elements on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    tagType - The kind of element to search for.
    
    Returns:
    
    A list of all the matching elements. The list will be empty if none are found.
    
    See Also:
    
    getFirstMatchingElement(String)
  - getAllMatchingElements
```
public List<HtmlElement> getAllMatchingElements(String parentTag,
                                                String... childTag)
```
    Get all HTML elements of the specified type on this web page, based on the context where the elements appear. This method is just like getFirstMatchingElement(String, String...), except that it returns all matches instead of just the first one. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag types can be any HTML element, and the robot will search for and find all such elements on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    parentTag - The first element to search for.
    
    childTag - Additional elements to find--each one will be searched for within the contents of the element immediately preceding it in the argument list.
    
    Returns:
    
    A list of all the matching elements. The list will be empty if none are found.
    
    See Also:
    
    getFirstMatchingElement(String, String...)
  - getElementById
```
public HtmlElement getElementById(String id)
```
    Get the first HTML element with the specified id on this web page, using the HTML id="..." attribute on the element. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find the first element with the given id on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    id - The id to search for.
    
    Returns:
    
    The first (and usually only) element on the current web page with the given id, or null if none is found.
  - getElementsByCssClass
```
public List<HtmlElement> getElementsByCssClass(String cssClass)
```
    Get all the HTML elements with the specified CSS class on this web page, using the HTML class="..." attribute on the elements. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find all the elements with the given CSS class on the page, regardless of where the robot is currently standing. Requires the bot to be viewing a web page.
    
    Parameters:
    
    cssClass - The CSS class to search for.
    
    Returns:
    
    A list of all elements on the current web page with the given CSS class. The list will be empty if none are found.
  - resetElementsOfInterest
```
public void resetElementsOfInterest(String... tagTypes)
```
    Move the WebBot back to the beginning of the page and reset the set of elements that it can walk over to the given set of elements. By default, a WebBot is interested in links and headings (a, h1, h2, h3, h4, h5, h6), but you can change the set of elements it will step through to any group of HTML elements you like. This method supports a variable number of arguments, so you can provide as many different element types as you like. If you provide no arguments, it will reset back to the default of all links and headings.
    For example, to ignore all elements (including links and headings) except for image elements, use:
```
myBot.resetElementsOfInterest("img");
```
    If you want to look at links and at table cells:
```
myBot.resetElementsOfInterest("a", "td");
```
    Requires the bot to be viewing a web page.
    
    Parameters:
    
    tagTypes - a list of zero or more element types to look for. If none are specified, the default of ("a", "h1", "h2", "h3", "h4", "h5", "h6") will be used instead
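    
    The interaction between the elements of interest and the advance/hasNext methods can be modelled with a small cursor over a list of tag names. This is a toy model, not the library's implementation: InterestCursorSketch and its methods are hypothetical, and it assumes that after a reset the cursor sits before the first element, which the documentation does not state.
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of an "elements of interest" cursor; not the library's code. */
class InterestCursorSketch {
    private static final String[] DEFAULT_TAGS = {"a", "h1", "h2", "h3", "h4", "h5", "h6"};

    private final List<String> pageTags; // tag names in document order
    private Set<String> interest;        // tag types the cursor may stop on
    private int position;                // -1 = start of page, pageTags.size() = end of page

    InterestCursorSketch(List<String> pageTags) {
        this.pageTags = pageTags;
        reset();
    }

    /** Returns to the start of the page; with no arguments, restores the default tags. */
    void reset(String... tagTypes) {
        String[] chosen = tagTypes.length == 0 ? DEFAULT_TAGS : tagTypes;
        interest = new HashSet<>(Arrays.asList(chosen));
        position = -1;
    }

    /** True if an element of interest lies after the current position. */
    boolean hasNext() {
        return nextIndex() < pageTags.size();
    }

    /** Moves to the next element of interest, or to the end of the page if there is none. */
    void advance() {
        position = nextIndex();
    }

    /** Tag under the cursor, or null at the start or end of the page. */
    String current() {
        return position >= 0 && position < pageTags.size() ? pageTags.get(position) : null;
    }

    private int nextIndex() {
        for (int i = position + 1; i < pageTags.size(); i++) {
            if (interest.contains(pageTags.get(i))) {
                return i;
            }
        }
        return pageTags.size();
    }

    public static void main(String[] args) {
        InterestCursorSketch cursor =
                new InterestCursorSketch(Arrays.asList("h1", "p", "img", "a", "td", "img"));
        cursor.reset("img"); // only images are of interest now
        while (cursor.hasNext()) {
            cursor.advance();
            System.out.println(cursor.current());
        }
    }
}
```
    In this model, advancing when nothing of interest remains leaves the cursor at the end of the page, mirroring the behavior described for advanceToNextElement().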
  - addXpathNamespace
```
public void addXpathNamespace(String name,
                              String url)
```
    Bind a symbolic name to an XML namespace URL so that the symbolic name can be used as a namespace prefix on identifiers in XPATH expressions. This method is for advanced users only. It is only necessary if your WebBot is manipulating content that is not HTML/XHTML, and you need to write XPATH expressions in some other XML namespace. By default, the prefix "html" is bound to the namespace http://www.w3.org/1999/xhtml. You can add as many additional namespaces as you need in order to build your own XPATH expressions.
    
    Overrides:
    
    addXpathNamespace in class WebBot
    
    Parameters:
    
    name - The symbolic prefix to use for this namespace
    
    url - The URL identifying this XML namespace
  - getAllElementsMatchingXpath
```
public List<HtmlElement> getAllElementsMatchingXpath(String xpathExpression)
```
    Find nodes within the current document using an XPATH expression. This method is for advanced users only, and requires that you understand XPATH. This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The robot will search for and find all nodes on the page that match the given XPATH expression, regardless of where the robot is currently standing. Your XPATH expression must use a namespace prefix on every element name. By default, the prefix "html" is bound to the namespace http://www.w3.org/1999/xhtml. You can add additional namespace bindings yourself using addXpathNamespace(String, String) if you need more. Requires the bot to be viewing a web page.
    
    Parameters:
    
    xpathExpression - The XPATH expression to search for
    
    Returns:
    
    A list of all matching nodes. The list will be empty if no matches were found.
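    
    The requirement that element names carry a namespace prefix can be illustrated with the standard javax.xml.xpath API. This sketch does not use TurboWebBot: XpathNamespaceSketch and countMatches are hypothetical names, and the NamespaceContext below simply mimics the default "html" binding described above.
```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demonstration of prefixed XPath names; not the library's code. */
class XpathNamespaceSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Counts the nodes selected by the expression, with "html" bound to the XHTML namespace. */
    static int countMatches(String markup, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespaces are ignored by the parser
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "html" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"" + XHTML + "\"><body><p>one</p><p>two</p></body></html>";
        System.out.println("//html:p matches " + countMatches(page, "//html:p"));
        System.out.println("//p matches " + countMatches(page, "//p"));
    }
}
```
    In XPath 1.0 an unprefixed name only matches elements in no namespace, which is why the unprefixed expression selects nothing from a document whose elements live in the XHTML namespace.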
  - getPageContent
```
public String getPageContent()
```
    Get the current web page's entire content as a string. Requires the bot to be viewing a web page.
    
    Overrides:
    
    getPageContent in class WebBot
    
    Returns:
    
    The page's content
  - jumpToPage
```
public void jumpToPage(File file)
```
    Causes the bot to temporarily leave the current page and hop over to the specified file. The bot will "remember" where it came from, keeping track of past pages in a stack. After working with the other page, you can use WebBot.returnToPreviousPage() to come back to the point where you left off.
    
    Parameters:
    
    file - The new page to jump to
  - hasVisitedPage
```
public boolean hasVisitedPage(File file)
```
    Check whether this robot has visited this page before.
    
    Parameters:
    
    file - The page to check
    
    Returns:
    
    True if this robot has previously visited (or is currently on) the given web page
