public class WebBot extends Object
Modifier and Type | Class and Description |
---|---|
protected class |
WebBot.Page
Represents a web page that can be visited by this bot.
|
protected static class |
WebBot.PageLocation
Represents a bot location on a specific web page.
|
Modifier and Type | Field and Description |
---|---|
protected static int |
ALL_LINKS
Internal constant used to specify the set of links to get from a
page.
|
protected static String |
HTML_NODE_PREFIX
Internal constant used as search + namespace prefix for xpath nodes.
|
protected static int |
OTHER_PAGE_LINKS
Internal constant used to specify the set of links to get from a
page.
|
protected static int |
OTHER_SITE_LINKS
Internal constant used to specify the set of links to get from a
page.
|
protected PrintWriterWithHistory |
out
The current output channel.
|
protected Stack<WebBot.PageLocation> |
pages
The stack of pages in the current history trail, where the top of
the stack is the current page.
|
protected Pattern |
targetPhrase
The target phrase to search for.
|
protected PrintWriter |
trueChannel
The current output channel.
|
Constructor and Description |
---|
WebBot()
Creates a new WebBot that is not yet viewing any web page.
|
WebBot(String url)
Creates a new WebBot for a given URL.
|
Modifier and Type | Method and Description |
---|---|
protected void |
addXpathNamespace(String name,
String url)
Bind a symbolic name to an XML namespace URL so that the symbolic name
can be used as a namespace prefix on identifiers in XPATH expressions.
|
void |
advanceToNextHeading()
Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML heading element it can find.
|
void |
advanceToNextLink()
Advance the robot forward in the current document until it is looking
at (or standing on) the next HTML anchor containing an href attribute
that it can find.
|
protected WebBot.Page |
cachedPageFor(URL url)
Retrieve the cached page for the given URL.
|
void |
echoCurrentElementText()
Echo the text of the current HTML element (heading, link, etc.) to the
robot's default output channel.
|
void |
echoPageTitle()
Echo the current web page title to the robot's default output channel.
|
HtmlElement |
getCurrentElement()
Get the HTML element of interest that the robot is currently standing
on.
|
String |
getCurrentElementText()
Get the text of the current HTML element on this web page--i.e., the
title of a heading or the text associated with a link.
|
int |
getHeadingLevel()
Get the heading level (1-6) of the current heading on this web page.
|
List<HtmlHeadingElement> |
getHeadings()
Get a list of all headings in the current document.
|
List<HtmlHeadingElement> |
getHeadingsToLevel(int level)
Get a list of all headings in the current document with a level
less than or equal to the value specified.
|
List<URI> |
getLinks()
Get a list of all links in the current document.
|
List<URI> |
getLinksOffServer()
Get a list of all links in the current document that refer to
pages on other servers.
|
List<URI> |
getLinksToOtherPages()
Get a list of all links in the current document that refer to
other web pages.
|
URI |
getLinkURI()
Get the URI of the current link on this web page.
|
PrintWriterWithHistory |
getOutputChannel()
Get the output channel where this bot is sending its output.
|
protected String |
getPageContent()
Get the current web page's entire content as a string.
|
String |
getPageTitle()
Get the title of the current web page.
|
URL |
getPageURL()
Get the URL for the current web page.
|
boolean |
hasPreviousPage()
Check to see if this bot previously visited a different page that it
can now return to.
|
boolean |
hasVisitedPage(URI uri)
Check whether this robot has visited this page before.
|
boolean |
hasVisitedPage(URL url)
Check whether this robot has visited this page before.
|
protected boolean |
isHeading(HtmlElement element)
Determine whether a given HTML element is a heading tag.
|
protected boolean |
isLink(HtmlElement element)
Determine whether a given HTML element is an anchor tag with an HREF
attribute.
|
boolean |
isLookingAtEndOfPage()
Has the robot advanced through all the contents (headings and links)
on the current page? Will also return true if
isViewingWebPage() returns false.
|
boolean |
isLookingAtHeading()
Is the robot looking at (or standing on) an HTML heading element on
the current page?
|
boolean |
isLookingAtLink()
Is the robot looking at (or standing on) an HTML anchor containing
an href attribute (that is, a link to another web page) on
the current page?
|
boolean |
isViewingWebPage()
Is the robot currently viewing a real web page with readable contents?
Normally, this would be true, but may be false if the bot has not been
given a web page to start on, or if it has been given a malformed or
nonexistent URL address, or even if the server for the targeted page
is not available.
|
void |
jumpToLinkedPage()
Causes the bot to temporarily leave the current page and hop over to
the page at the end of the current link.
|
protected void |
jumpToNormalizedURI(URI uri)
The worker method for the various flavors of
jumpToPage(URI).
|
protected void |
jumpToNormalizedURL(File file)
The worker method for the various flavors of
jumpToPage(URL).
|
protected void |
jumpToNormalizedURL(URL url)
The worker method for the various flavors of
jumpToPage(URL).
|
void |
jumpToPage(String url)
Causes the bot to temporarily leave the current page and hop over to
the page specified by the URL (as a string).
|
void |
jumpToPage(URI uri)
Causes the bot to temporarily leave the current page and hop over to
the page specified by the URI.
|
void |
jumpToPage(URL url)
Causes the bot to temporarily leave the current page and hop over to
the page specified by the URL.
|
protected void |
jumpToPage(WebBot.Page page)
Adds this page to the history stack, enforcing required stack size
limit.
|
void |
jumpToThisHTML(String html)
Causes the bot to temporarily leave the current page and hop over to
a specific HTML string provided as a parameter.
|
protected int |
levelOf(HtmlElement element)
Convert an HTML element representing a heading tag into its
corresponding level number.
|
boolean |
linkGoesToAnotherPage()
Check whether the URL of the current link on this web page refers to
a different page, or just another location within the current page.
|
boolean |
linkGoesToAnotherServer()
Check whether the URL of the current link on this web page refers to
a page on a separate server, or simply another location on the same
server.
|
protected File |
makeFileAbsolute(File file)
This is needed to get around issues with relative file names when
the current working directory is unknown or when running on a
server.
|
protected URL |
normalizeURL(URL url)
Normalize a URL.
|
int |
numberOfPreviousPages()
How deep is the stack of previous pages that this robot can return to?
Each time the robot jumps to a new page, it remembers its previous
page so you can
returnToPreviousPage().
|
PrintWriterWithHistory |
out()
Get the output channel where this bot is sending its output.
|
boolean |
outputIsHtml()
Check whether this robot's output should be treated as plain text,
or as HTML markup.
|
protected void |
releaseCachedResources()
Performs cleanup once this bot has completed all its tasks.
|
URI |
resolveURIFromPage(String uri)
Get a fully-resolved URI from a (possibly relative) string URI, such as
the value of an anchor's href or an img's src attribute.
|
void |
returnToPreviousPage()
Causes the bot to leave the current page and return to the page it was
previously visiting, at the location where it left off.
|
void |
returnToStartOfPage()
Moves the robot back to the start of the current page.
|
void |
run()
Execute this robot's built-in sequence of steps.
|
void |
setOutputChannel(PrintWriter output)
Tell this bot where to send its output.
|
void |
setOutputIsHtml(boolean value)
Set whether this robot's output should be treated as plain text,
or as HTML markup.
|
String |
toString()
Get a printable summary of this robot.
|
protected URL |
urlForString(String url)
Convert a string to a URL.
|
protected Stack<WebBot.PageLocation> pages
protected PrintWriter trueChannel
protected PrintWriterWithHistory out
protected Pattern targetPhrase
protected static final int ALL_LINKS
protected static final int OTHER_PAGE_LINKS
protected static final int OTHER_SITE_LINKS
protected static final String HTML_NODE_PREFIX
public WebBot()
public WebBot(String url)
url - The web page where the robot will start.
public boolean isViewingWebPage()
public boolean isLookingAtEndOfPage()
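Has the robot advanced through all the contents (headings and links)
on the current page? Will also return true if
isViewingWebPage() returns false.
public void returnToStartOfPage()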
public String getPageTitle()
public void echoPageTitle()
public URL getPageURL()
public String toString()
public HtmlElement getCurrentElement()
public boolean isLookingAtHeading()
public void advanceToNextHeading()
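The heading-navigation methods are typically combined into a loop that advances until the end of the page is reached. The following is a minimal sketch only: `HeadingWalker` and `ListWalker` are hypothetical stand-ins that borrow four of the documented method names so the example runs without the library, and it assumes the walker starts before the first heading, which this page does not confirm for a real `WebBot`.

```java
import java.util.List;

class HeadingOutlineSketch {
    /** Hypothetical stand-in exposing four method names documented for WebBot. */
    interface HeadingWalker {
        boolean isLookingAtEndOfPage();
        void advanceToNextHeading();
        int getHeadingLevel();
        String getCurrentElementText();
    }

    /** List-backed fake: each entry is {level, text}; starts before the first heading. */
    static class ListWalker implements HeadingWalker {
        private final List<String[]> headings;
        private int index = -1;

        ListWalker(List<String[]> headings) { this.headings = headings; }

        public boolean isLookingAtEndOfPage() { return index >= headings.size(); }
        public void advanceToNextHeading() { if (index < headings.size()) index++; }
        public int getHeadingLevel() { return Integer.parseInt(headings.get(index)[0]); }
        public String getCurrentElementText() { return headings.get(index)[1]; }
    }

    /** Walks every heading and builds an outline indented two spaces per level. */
    static String outline(HeadingWalker bot) {
        StringBuilder sb = new StringBuilder();
        bot.advanceToNextHeading();
        while (!bot.isLookingAtEndOfPage()) {
            sb.append("  ".repeat(bot.getHeadingLevel() - 1))
              .append(bot.getCurrentElementText())
              .append('\n');
            bot.advanceToNextHeading();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        HeadingWalker bot = new ListWalker(List.of(
            new String[] {"1", "Intro"},
            new String[] {"2", "Setup"},
            new String[] {"1", "Usage"}));
        System.out.print(outline(bot));
    }
}
```

With a real bot, the body of `outline` would make the same four calls on the `WebBot` instance instead of on the stand-in.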
public List<HtmlHeadingElement> getHeadings()
HtmlHeadingElement objects describing the headings in the page.
public List<HtmlHeadingElement> getHeadingsToLevel(int level)
level - Only include headings at this level or above (i.e.,
numerically less than or equal to this number)
HtmlHeadingElement objects describing the headings in the page with
levels less than or equal to the specified level.
public void echoCurrentElementText()
public String getCurrentElementText()
public int getHeadingLevel()
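The heading level (1-6) corresponds to the digit in the heading's tag name. As a rough illustration, and not the library's implementation, a hypothetical helper that maps a tag name such as `h3` to its level might look like this; the `-1` result for non-heading tags is an assumption of the sketch.

```java
class HeadingLevelSketch {
    /** Returns 1-6 for tag names h1..h6 (case-insensitive), or -1 otherwise. */
    static int levelOfTag(String tagName) {
        if (tagName == null || tagName.length() != 2) {
            return -1;
        }
        char first = Character.toLowerCase(tagName.charAt(0));
        char digit = tagName.charAt(1);
        if (first == 'h' && digit >= '1' && digit <= '6') {
            return digit - '0';
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(levelOfTag("h1"));  // 1
        System.out.println(levelOfTag("H6"));  // 6
        System.out.println(levelOfTag("hr"));  // -1
        System.out.println(levelOfTag("h7"));  // -1
    }
}
```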
public boolean isLookingAtLink()
public void advanceToNextLink()
public URI getLinkURI()
public boolean linkGoesToAnotherPage()
public boolean linkGoesToAnotherServer()
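To illustrate the distinction that linkGoesToAnotherPage() and linkGoesToAnotherServer() draw, the sketch below classifies an href relative to a page address using only `java.net.URI`. The class, the helper names, and the comparison rules (ignore the fragment when deciding "same page", compare hosts case-insensitively when deciding "same server") are assumptions made for the example; the library may decide these cases differently.

```java
import java.net.URI;

class LinkKindSketch {
    /** Drops any #fragment so two locations on one page compare equal. */
    static String withoutFragment(URI uri) {
        String text = uri.toString();
        int hash = text.indexOf('#');
        return hash < 0 ? text : text.substring(0, hash);
    }

    /** True if href, resolved against the page, names a different document. */
    static boolean goesToAnotherPage(URI page, String href) {
        URI target = page.resolve(href);
        return !withoutFragment(page).equals(withoutFragment(target));
    }

    /** True if href, resolved against the page, names a different host. */
    static boolean goesToAnotherServer(URI page, String href) {
        URI target = page.resolve(href);
        String pageHost = page.getHost();
        String targetHost = target.getHost();
        return pageHost == null
            ? targetHost != null
            : !pageHost.equalsIgnoreCase(targetHost);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/docs/index.html");
        System.out.println(goesToAnotherPage(page, "#install"));            // false
        System.out.println(goesToAnotherPage(page, "faq.html"));            // true
        System.out.println(goesToAnotherServer(page, "faq.html"));          // false
        System.out.println(goesToAnotherServer(page, "http://other.org/")); // true
    }
}
```

`URI.resolve` also turns relative hrefs into absolute addresses, which is the kind of conversion resolveURIFromPage(String) is documented to perform.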
public List<URI> getLinks()
URI objects describing the links in the page.
public List<URI> getLinksToOtherPages()
getLinks(), with any links to other locations within the same
page filtered out. This method is designed to make it easy to write
foreach-style loops over links.
Requires the bot to be viewing a web page.
URI objects describing the links in the page.
public List<URI> getLinksOffServer()
getLinks(), with any links to pages on the same server as the
current page filtered out. This method is designed to make it easy
to write foreach-style loops over links.
Requires the bot to be viewing a web page.
URI objects describing the links in the page.
public void jumpToLinkedPage()
returnToPreviousPage() to come back to the point where you left off.
Requires the bot to be looking at a link (anchor) element on
the current web page.
public void returnToPreviousPage()
jumpToLinkedPage() to explore multiple pages.
Requires the bot to have some previous page to return to.
public boolean hasPreviousPage()
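The jump and return methods behave like pushes and pops on the `pages` stack, whose top is the current page. The sketch below models only that bookkeeping. It is a simplification under stated assumptions: pages are plain strings rather than `WebBot.PageLocation` objects, no stack size limit is enforced, and throwing `IllegalStateException` when there is nothing to return to is a choice made for the example, not documented behavior.

```java
import java.util.Stack;

class HistorySketch {
    // Top of the stack is the current page, as described for the pages field.
    private final Stack<String> pages = new Stack<>();

    void jumpToPage(String url) {
        pages.push(url);
    }

    boolean hasPreviousPage() {
        return pages.size() > 1;
    }

    int numberOfPreviousPages() {
        return Math.max(0, pages.size() - 1);
    }

    void returnToPreviousPage() {
        if (!hasPreviousPage()) {
            throw new IllegalStateException("no previous page to return to");
        }
        pages.pop();
    }

    String currentPage() {
        return pages.isEmpty() ? null : pages.peek();
    }

    public static void main(String[] args) {
        HistorySketch history = new HistorySketch();
        history.jumpToPage("http://example.com/");
        history.jumpToPage("http://example.com/faq.html");
        System.out.println(history.numberOfPreviousPages()); // 1
        history.returnToPreviousPage();
        System.out.println(history.currentPage());           // http://example.com/
    }
}
```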
public int numberOfPreviousPages()
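How deep is the stack of previous pages that this robot can return to?
Each time the robot jumps to a new page, it remembers its previous
page so you can returnToPreviousPage(). These previous pages
are remembered on a stack, and this method allows you to determine
how deep this stack is--that is, how many times you can repeatedly
call returnToPreviousPage() successfully.
public void jumpToPage(String url)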
returnToPreviousPage() to come back to the point where you left off.
url - The new page to jump to
public void jumpToPage(URL url)
returnToPreviousPage() to come back to the point where you left off.
url - The new page to jump to
public void jumpToPage(URI uri)
returnToPreviousPage() to come back to the point where you left off.
uri - The new page to jump to
public void jumpToThisHTML(String html)
returnToPreviousPage() to come back to the point where you left off in the previous page.
html - A string containing an HTML document to treat as if it
came from the web
public URI resolveURIFromPage(String uri)
uri - The URI to convert to absolute form
public boolean hasVisitedPage(URI uri)
uri - The page to check
public boolean hasVisitedPage(URL url)
url - The page to check
public void setOutputChannel(PrintWriter output)
output - The output channel to send messages to
public PrintWriterWithHistory getOutputChannel()
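Because setOutputChannel(PrintWriter) accepts any `PrintWriter`, one way to inspect what a bot prints is to hand it a writer backed by an in-memory buffer. The sketch below shows only that capture pattern with JDK classes; `OutputCaptureSketch` and `capture` are hypothetical names, and a lambda stands in for the bot so the example runs without the library.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.function.Consumer;

class OutputCaptureSketch {
    /** Runs an action against an in-memory PrintWriter and returns what it printed. */
    static String capture(Consumer<PrintWriter> action) {
        StringWriter buffer = new StringWriter();
        PrintWriter channel = new PrintWriter(buffer);
        action.accept(channel);
        channel.flush(); // make sure buffered text reaches the StringWriter
        return buffer.toString();
    }

    public static void main(String[] args) {
        // With a real bot, the action would call setOutputChannel(channel)
        // and then a method such as echoPageTitle().
        String text = capture(channel -> channel.print("Example page title"));
        System.out.println(text);
    }
}
```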
public PrintWriterWithHistory out()
getOutputChannel().
public boolean outputIsHtml()
public void setOutputIsHtml(boolean value)
value - True if the output should be treated as HTML markup, false
if it should be treated as plain text
public void run()
RobotViewer.
protected String getPageContent()
protected void addXpathNamespace(String name, String url)
name - The symbolic prefix to use for this namespace
url - The URL identifying this XML namespace
protected boolean isLink(HtmlElement element)
element - The HTML element to test
protected boolean isHeading(HtmlElement element)
element - The HTML element to test
protected int levelOf(HtmlElement element)
element - The HTML element to look up
protected WebBot.Page cachedPageFor(URL url)
url - The URL to look up
protected void releaseCachedResources()
protected URL urlForString(String url)
url - The string to convert
protected URL normalizeURL(URL url)
url - The url to normalize
protected void jumpToNormalizedURI(URI uri)
The worker method for the various flavors of jumpToPage(URI).
This method assumes the given URI has been normalized.
uri - The new page to jump to
protected void jumpToNormalizedURL(URL url)
The worker method for the various flavors of jumpToPage(URL).
This method assumes the given URL has been normalized.
url - The new page to jump to
protected void jumpToNormalizedURL(File file)
The worker method for the various flavors of jumpToPage(URL).
This method assumes the given URL has been normalized.
file - The new page to jump to
protected void jumpToPage(WebBot.Page page)
page - The new page to add to the stack
protected File makeFileAbsolute(File file)
file - The file to turn into an absolute path
IOHelper.getFile(File)
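makeFileAbsolute(File) is described as a workaround for relative file names when the working directory is unknown. How the library resolves them is not shown on this page; the sketch below illustrates the general idea only, assuming the caller supplies an explicit base directory. `AbsoluteFileSketch` and `makeAbsolute` are hypothetical names.

```java
import java.io.File;

class AbsoluteFileSketch {
    /** Resolves a relative file against baseDir; absolute files are returned unchanged. */
    static File makeAbsolute(File baseDir, File file) {
        if (file.isAbsolute()) {
            return file;
        }
        return new File(baseDir, file.getPath()).getAbsoluteFile();
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir")).getAbsoluteFile();
        File resolved = makeAbsolute(base, new File("pages/index.html"));
        System.out.println(resolved.isAbsolute()); // true
    }
}
```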