public class TurboWebBot extends WebBot
WebBot.Page, WebBot.PageLocation
ALL_LINKS, HTML_NODE_PREFIX, OTHER_PAGE_LINKS, OTHER_SITE_LINKS, out, pages, targetPhrase, trueChannel
Constructor and Description |
---|
TurboWebBot()
Creates a new WebBot that is not yet viewing any web page.
|
TurboWebBot(File file)
Creates a new WebBot for a given file.
|
TurboWebBot(String url)
Creates a new WebBot for a given URL.
|
TurboWebBot(URI uri)
Creates a new WebBot for a given URI.
|
TurboWebBot(URL url)
Creates a new WebBot for a given URL.
|
Modifier and Type | Method and Description |
---|---|
void |
addXpathNamespace(String name,
String url)
Bind a symbolic name to an XML namespace URL so that the symbolic name
can be used as a namespace prefix on identifiers in XPATH expressions.
|
void |
advanceToNextElement()
Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML element of interest it can find.
|
void |
advanceToNextElement(String tagType)
Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML element of the specified type that it
can find.
|
List<HtmlElement> |
getAllElementsMatchingXpath(String xpathExpression)
Find nodes within the current document using an XPATH expression.
|
List<HtmlElement> |
getAllMatchingElements(String tagType)
Get all HTML elements of the specified type on this web page.
|
List<HtmlElement> |
getAllMatchingElements(String parentTag,
String... childTag)
Get all HTML elements of the specified type on this web page, based
on the context where the elements appear.
|
HtmlElement |
getElementById(String id)
Get the first HTML element with the specified id on this web page,
using the HTML id="..." attribute on the element.
|
List<HtmlElement> |
getElementsByCssClass(String cssClass)
Get all the HTML elements with the specified CSS class on this web
page, using the HTML class="..." attribute on the elements.
|
HtmlElement |
getFirstMatchingElement(String tagType)
Get the first HTML element of the specified type on this web page.
|
HtmlElement |
getFirstMatchingElement(String parentTag,
String... childTag)
Get the first HTML element of the specified type on this web page, based
on the context where the element appears.
|
String |
getPageContent()
Get the current web page's entire content as a string.
|
int |
getPagePhraseCount()
Get a count of the number of times the set phrase of interest
occurs in the current page.
|
double |
getPagePhraseFrequency()
Get the frequency of the phrase of interest in the current page.
|
boolean |
hasNextElement()
Determine whether there are any more HTML elements of interest
further down the page from the robot's current position.
|
boolean |
hasNextElement(String tagType)
Determine whether there are any more HTML elements of the specified type
further down the page from the robot's current position.
|
boolean |
hasVisitedPage(File file)
Check whether this robot has visited this page before.
|
boolean |
isLookingAtElement()
Is the robot looking at (or standing on) an HTML element of interest on
the current page? Elements of interest can be controlled by calling
resetElementsOfInterest(String...) (the default is all links
and all heading tags).
|
boolean |
isLookingAtElement(String tagType)
Is the robot looking at (or standing on) an HTML element of the
specified type on the current page? The
specified element type must be one of the elements of interest, as
specified by calling
resetElementsOfInterest(String...) (the
default is all links and all heading tags).
|
void |
jumpToPage(File file)
Causes the bot to temporarily leave the current page and hop over to
the specified file.
|
void |
resetElementsOfInterest(String... tagTypes)
Move the WebBot back to the beginning of the page and reset the
set of elements that it can walk over to the given set of elements.
|
void |
setPhraseOfInterest(String phrase)
Set a key phrase of interest to look for in documents.
|
advanceToNextHeading, advanceToNextLink, cachedPageFor, echoCurrentElementText, echoPageTitle, getCurrentElement, getCurrentElementText, getHeadingLevel, getHeadings, getHeadingsToLevel, getLinks, getLinksOffServer, getLinksToOtherPages, getLinkURI, getOutputChannel, getPageTitle, getPageURL, hasPreviousPage, hasVisitedPage, hasVisitedPage, isHeading, isLink, isLookingAtEndOfPage, isLookingAtHeading, isLookingAtLink, isViewingWebPage, jumpToLinkedPage, jumpToNormalizedURI, jumpToNormalizedURL, jumpToNormalizedURL, jumpToPage, jumpToPage, jumpToPage, jumpToPage, jumpToThisHTML, levelOf, linkGoesToAnotherPage, linkGoesToAnotherServer, makeFileAbsolute, normalizeURL, numberOfPreviousPages, out, outputIsHtml, releaseCachedResources, resolveURIFromPage, returnToPreviousPage, returnToStartOfPage, run, setOutputChannel, setOutputIsHtml, toString, urlForString
public TurboWebBot()
public TurboWebBot(URI uri)
uri
- The web page where the robot will start.
public TurboWebBot(URL url)
url
- The web page where the robot will start.
public TurboWebBot(String url)
url
- The web page where the robot will start.
public TurboWebBot(File file)
file
- The web page where the robot will start.
public void setPhraseOfInterest(String phrase)
Set a key phrase of interest to look for in documents. The phrase is
interpreted as a regular expression.
phrase
- a regular expression
public int getPagePhraseCount()
public double getPagePhraseFrequency()
Note that this number tends to be small, since even interesting phrases usually constitute only a small fraction of a page with any interesting amount of information in it. However, it does provide a relative measure of how many times a phrase has been used, normalized by the size of the document.
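To make the idea of a normalized count concrete, here is a minimal sketch using only `java.util.regex`. It is not TurboWebBot's implementation: the class name `PhraseStatsSketch`, both method names, and the choice to divide by the number of whitespace-separated words are assumptions made for illustration, since the documentation does not say how document size is measured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebBot or TurboWebBot.
class PhraseStatsSketch {
    /** Counts non-overlapping matches of the regular expression in the text. */
    static int phraseCount(String regex, String pageText) {
        Matcher matcher = Pattern.compile(regex).matcher(pageText);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /**
     * Assumed normalization: number of matches divided by the number of
     * whitespace-separated words. The real method may measure size differently.
     */
    static double phraseFrequency(String regex, String pageText) {
        String trimmed = pageText.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty page
        }
        int wordCount = trimmed.split("\\s+").length;
        return (double) phraseCount(regex, pageText) / wordCount;
    }

    public static void main(String[] args) {
        String page = "Java is fun. I like java and Java likes me.";
        System.out.println(phraseCount("[Jj]ava", page));
        System.out.println(phraseFrequency("[Jj]ava", page));
    }
}
```

Under this assumption the frequency is always small for realistic pages, which matches the note above: a few matches divided by hundreds or thousands of words.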
Requires the bot to be viewing a web page, and that the phrase of interest has been set.
public void advanceToNextElement()
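Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML element of interest it can find.
Elements of interest can be controlled by calling
resetElementsOfInterest(String...) (the default is all links
and all heading tags). If there are no elements of interest in the
document, it will end up looking at the end of the page.
Requires the bot to be viewing a web page.
public void advanceToNextElement(String tagType)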
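Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML element of the specified type that it
can find. The specified element type must be one of the elements of
interest, as specified by calling
resetElementsOfInterest(String...) (the default is all links
and all heading tags). If there are no more elements of the desired
type in the document, or the desired type is not an element of interest,
the robot will end up looking at the end of the page.
Requires the bot to be viewing a web page.
tagType
- The type of element to look for (case-sensitive)
public boolean hasNextElement()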
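Determine whether there are any more HTML elements of interest
further down the page from the robot's current position.
Elements of interest can be controlled by calling
resetElementsOfInterest(String...) (the default is all links
and all heading tags).
Requires the bot to be viewing a web page.
public boolean hasNextElement(String tagType)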
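Determine whether there are any more HTML elements of the specified type
further down the page from the robot's current position. The
specified element type must be one of the elements of interest, as
specified by calling
resetElementsOfInterest(String...) (the
default is all links and all heading tags).
Requires the bot to be viewing a web page.
tagType
- The type of element to look for
public boolean isLookingAtElement()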
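Is the robot looking at (or standing on) an HTML element of interest on
the current page? Elements of interest can be controlled by calling
resetElementsOfInterest(String...) (the default is all links
and all heading tags).
public boolean isLookingAtElement(String tagType)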
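Is the robot looking at (or standing on) an HTML element of the
specified type on the current page? The
specified element type must be one of the elements of interest, as
specified by calling
resetElementsOfInterest(String...) (the
default is all links and all heading tags).
public HtmlElement getFirstMatchingElement(String tagType)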
tagType
- The kind of element to search for.
See also: getAllMatchingElements(String)
public HtmlElement getFirstMatchingElement(String parentTag, String... childTag)
Get the first HTML element of the specified type on this web page, based
on the context where the element appears. For example:
HtmlElement result = myBot.getFirstMatchingElement("table", "tr", "a");
This method supports a variable number of arguments. It will find the first occurrence of the first element type listed. Then, inside that element, it will look for the first occurrence of the second element type, and then search inside that one for the first occurrence of the third element type, and so on. It returns the most deeply nested element in this series that it finds.
This method does not affect the robot's current position (the robot will not move), and it does not depend on the elements of interest. The specified tag type(s) can be any HTML element, and the robot will search for and find the first matching element on the page, regardless of where the robot is currently standing.
Requires the bot to be viewing a web page.
parentTag
- The first element to search for.
childTag
- Additional elements to find--each one will be searched
for within the contents of the element immediately
preceding it in the argument list.
See also: getAllMatchingElements(String, String...)
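The nested search described for getFirstMatchingElement(String, String...) can be modeled with the standard `org.w3c.dom` API. This is only a sketch under stated assumptions: `NestedSearchSketch`, `parse`, `firstDescendant`, and `firstNested` are hypothetical names, the input must be well-formed XHTML, and the code assumes that "the most deeply nested element that it finds" means the search stops at the last level that matched. TurboWebBot itself may behave differently when a level is missing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical model of the nested search; not TurboWebBot's implementation.
class NestedSearchSketch {
    /** Parses a well-formed XHTML string and returns its root element. */
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    /** First descendant of scope with the given tag, in document order, or null. */
    static Element firstDescendant(Element scope, String tag) {
        NodeList matches = scope.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : (Element) matches.item(0);
    }

    /**
     * Finds the first parentTag, then the first childTags[0] inside it, and so on.
     * Assumption: if a level has no match, the deepest element found so far is returned.
     */
    static Element firstNested(Element root, String parentTag, String... childTags) {
        Element deepest = firstDescendant(root, parentTag);
        if (deepest == null) {
            return null;
        }
        for (String tag : childTags) {
            Element next = firstDescendant(deepest, tag);
            if (next == null) {
                break;
            }
            deepest = next;
        }
        return deepest;
    }
}
```

With a page that has a link before its first table, `firstNested(root, "table", "tr", "a")` skips the outer link and returns the first link inside the first row of the first table, which is the point of searching by context.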
public List<HtmlElement> getAllMatchingElements(String tagType)
This method works like getFirstMatchingElement(String),
except that it returns all matches instead of just the first one.
This method does not affect the robot's current position (the robot
will not move), and it does not depend on the elements of interest.
The specified tag type can be any HTML element, and the robot will
search for and find all such elements on the page, regardless of where
the robot is currently standing.
Requires the bot to be viewing a web page.
tagType
- The kind of element to search for.
See also: getFirstMatchingElement(String)
public List<HtmlElement> getAllMatchingElements(String parentTag, String... childTag)
This method works like getFirstMatchingElement(String, String...),
except that it returns all matches instead of just the first one.
This method does not affect the robot's current position (the robot
will not move), and it does not depend on the elements of interest.
The specified tag types can be any HTML element, and the robot will
search for and find all such elements on the page, regardless of where
the robot is currently standing.
Requires the bot to be viewing a web page.
parentTag
- The first element to search for.
childTag
- Additional elements to find--each one will be searched
for within the contents of the element immediately
preceding it in the argument list.
See also: getFirstMatchingElement(String, String...)
public HtmlElement getElementById(String id)
id
- The id to search for.
public List<HtmlElement> getElementsByCssClass(String cssClass)
cssClass
- The CSS class to search for.
public void resetElementsOfInterest(String... tagTypes)
For example, to ignore all elements (including links and headings) except for image elements, use:
myBot.resetElementsOfInterest("img");
If you want to look at links and at table cells:
myBot.resetElementsOfInterest("a", "td");
Requires the bot to be viewing a web page.
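The interaction between the elements of interest and the advance/has-next methods can be illustrated with a small in-memory model. This is not the library's code: `ElementCursorSketch` and `currentTag` are hypothetical, the page is reduced to a list of tag names, and the model assumes that a reset leaves the cursor before the first element rather than on it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the documented cursor behavior; not the real bot.
final class ElementCursorSketch {
    static final String[] DEFAULT_TAGS = {"a", "h1", "h2", "h3", "h4", "h5", "h6"};

    private final List<String> pageTags; // tag names in document order
    private Set<String> interest;
    private int position;                // -1 = start of page, size() = end of page

    ElementCursorSketch(List<String> pageTags) {
        this.pageTags = pageTags;
        resetElementsOfInterest();
    }

    /** Goes back to the start and replaces the set of elements of interest. */
    void resetElementsOfInterest(String... tagTypes) {
        String[] chosen = tagTypes.length == 0 ? DEFAULT_TAGS : tagTypes;
        interest = new HashSet<>(Arrays.asList(chosen));
        position = -1;
    }

    void advanceToNextElement() { position = nextIndex(null); }

    void advanceToNextElement(String tagType) { position = nextIndex(tagType); }

    boolean hasNextElement() { return nextIndex(null) < pageTags.size(); }

    boolean isLookingAtEndOfPage() { return position >= pageTags.size(); }

    /** Tag under the cursor, or null at the start or end of the page. */
    String currentTag() {
        return position >= 0 && position < pageTags.size() ? pageTags.get(position) : null;
    }

    // Index of the next element of interest (optionally of one type), or size() if none.
    private int nextIndex(String tagType) {
        for (int i = position + 1; i < pageTags.size(); i++) {
            String tag = pageTags.get(i);
            if (interest.contains(tag) && (tagType == null || tagType.equals(tag))) {
                return i;
            }
        }
        return pageTags.size();
    }
}
```

In this model, asking for a tag type that is not an element of interest sends the cursor to the end of the page, as the documentation for advanceToNextElement(String) describes; calling `resetElementsOfInterest("img")` afterwards makes image elements reachable again.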
tagTypes
- a list of zero or more element types to look for. If
none are specified, the default of ("a", "h1", "h2", "h3", "h4",
"h5", "h6") will be used instead.
public void addXpathNamespace(String name, String url)
Overrides: addXpathNamespace in class WebBot
name
- The symbolic prefix to use for this namespace
url
- The URL identifying this XML namespace
public List<HtmlElement> getAllElementsMatchingXpath(String xpathExpression)
Find nodes within the current document using an XPATH expression.
Use addXpathNamespace(String, String) if you need more namespace prefixes.
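Binding a prefix to a namespace URL before evaluating an expression can be sketched with the standard `javax.xml.xpath` API. How TurboWebBot does this internally is not documented here, so the class `XpathNamespaceSketch`, its nested `PrefixMap`, the method `countMatches`, and the prefix `h` are all assumptions for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not TurboWebBot's implementation.
class XpathNamespaceSketch {
    /** Maps symbolic prefixes to namespace URLs, in the spirit of addXpathNamespace(name, url). */
    static final class PrefixMap implements NamespaceContext {
        private final Map<String, String> bindings = new HashMap<>();

        void add(String name, String url) { bindings.put(name, url); }

        @Override public String getNamespaceURI(String prefix) {
            return bindings.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
        }

        @Override public String getPrefix(String uri) {
            for (Map.Entry<String, String> entry : bindings.entrySet()) {
                if (entry.getValue().equals(uri)) {
                    return entry.getKey();
                }
            }
            return null;
        }

        @Override public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singleton(prefix).iterator();
        }
    }

    /** Parses namespace-aware XML and counts the nodes the expression selects. */
    static int countMatches(String xml, String prefix, String nsUrl, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            PrefixMap namespaces = new PrefixMap();
            namespaces.add(prefix, nsUrl);
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(namespaces);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

In XPath 1.0 an unprefixed name such as `a` only matches elements in no namespace, so a document that declares the XHTML namespace needs an expression like `//h:a` with `h` bound to that namespace URL.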
Requires the bot to be viewing a web page.
xpathExpression
- The XPATH expression to search for
public String getPageContent()
Overrides: getPageContent in class WebBot
public void jumpToPage(File file)
Causes the bot to temporarily leave the current page and hop over to
the specified file. Use WebBot.returnToPreviousPage() to come back to the
point where you left off.
file
- The new page to jump to
public boolean hasVisitedPage(File file)
file
- The page to check