java.lang.Object
  org.htmlparser.parserapplications.SiteCapturer
Save a web site locally. Illustrative program to save a web site's contents locally. It was created to demonstrate URL rewriting in its simplest form. It uses customized tags in the NodeFactory to alter the URLs. This program has a number of limitations:
Field Summary
protected boolean | mCaptureResources | If true, save resources locally too; otherwise, leave resource links pointing to the original page.
protected java.util.HashSet | mCopied | The set of resources already copied.
protected NodeFilter | mFilter | The filter to apply to the nodes retrieved.
protected java.util.HashSet | mFinished | The set of pages already captured.
protected java.util.ArrayList | mImages | The list of resources to copy.
protected java.util.ArrayList | mPages | The list of pages to capture.
protected Parser | mParser | The parser to use for processing.
protected java.lang.String | mSource | The web site to capture.
protected java.lang.String | mTarget | The local directory to capture to.
protected int | TRANSFER_SIZE | Copy buffer size.
Constructor Summary
SiteCapturer() | Create a web site capturer.
Method Summary
void | capture() | Perform the capture.
protected void | copy() | Copy a resource (image) locally.
protected java.lang.String | decode(java.lang.String raw) | Unescape a URL to form a file name.
boolean | getCaptureResources() | Getter for property captureResources.
NodeFilter | getFilter() | Getter for property filter.
java.lang.String | getSource() | Getter for property source.
java.lang.String | getTarget() | Getter for property target.
protected boolean | isHtml(java.lang.String link) | Returns true if the link contains text/html content.
protected boolean | isToBeCaptured(java.lang.String link) | Returns true if the link is one we are interested in.
static void | main(java.lang.String[] args) | Mainline to capture a web site locally.
protected java.lang.String | makeLocalLink(java.lang.String link, java.lang.String current) | Converts a link to local.
protected void | process(NodeFilter filter) | Process a single page.
void | setCaptureResources(boolean capture) | Setter for property captureResources.
void | setFilter(NodeFilter filter) | Setter for property filter.
void | setSource(java.lang.String source) | Setter for property source.
void | setTarget(java.lang.String target) | Setter for property target.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected java.lang.String mSource
protected java.lang.String mTarget
protected java.util.ArrayList mPages
protected java.util.HashSet mFinished
protected java.util.ArrayList mImages
protected java.util.HashSet mCopied
protected Parser mParser
protected boolean mCaptureResources
If true, save resources locally too; otherwise, leave resource links pointing to the original page.
protected NodeFilter mFilter
protected final int TRANSFER_SIZE
Constructor Detail
public SiteCapturer()
Method Detail
public java.lang.String getSource()
public void setSource(java.lang.String source)
Parameters:
source - New value of property source.
public java.lang.String getTarget()
public void setTarget(java.lang.String target)
Parameters:
target - New value of property target.
public boolean getCaptureResources()
If true, the images and other resources referenced by the site and within the base URL tree are also copied locally to the target directory. If false, the image links are left 'as is', still referring to the original site.
public void setCaptureResources(boolean capture)
Parameters:
capture - New value of property captureResources.
public NodeFilter getFilter()
public void setFilter(NodeFilter filter)
Parameters:
filter - New value of property filter.
protected boolean isToBeCaptured(java.lang.String link)
Returns true if the link is one we are interested in.
Parameters:
link - The link to be checked.
Returns:
true if the link has the source URL as a prefix and doesn't contain '?' or '#'; the former because we won't be able to handle server-side queries in the static target directory structure and the latter because presumably the full page with that reference has already been captured previously. This performs a case-insensitive comparison, which is cheating really, but it's cheap.
protected boolean isHtml(java.lang.String link) throws ParserException
Returns true if the link contains text/html content.
Parameters:
link - The URL to check for content type.
Returns:
true if the HTTP header indicates the type is "text/html".
Throws:
ParserException - If the supplied URL can't be read from.
protected java.lang.String makeLocalLink(java.lang.String link, java.lang.String current)
Parameters:
link - The link to make relative.
current - The current page URL, or empty if it's an absolute URL that needs to be converted.
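As a rough illustration of the conversion makeLocalLink describes, the sketch below strips a site prefix from a link and, when a current page is given, prepends one "../" for each directory level of that page. It is not the library's implementation: the class name LocalLinkSketch, the SOURCE constant, and the use of index.html for the site root are assumptions made for this example.

```java
class LocalLinkSketch {
    /** Hypothetical stand-in for the capturer's source property. */
    static final String SOURCE = "http://example.com/site/";

    /** Remove the site prefix, if present, leaving a path relative to the capture root. */
    static String stripSource(String url) {
        return url.startsWith(SOURCE) ? url.substring(SOURCE.length()) : url;
    }

    /**
     * Convert a link into a path usable from the locally saved copy.
     * An empty 'current' means the link is returned relative to the capture root.
     */
    static String makeLocalLink(String link, String current) {
        String path = stripSource(link);
        if (path.isEmpty()) {
            path = "index.html"; // assumption: the site root is saved under this name
        }
        if (current.isEmpty()) {
            return path;
        }
        // Climb one level for every directory separator in the current page's path.
        String here = stripSource(current);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < here.length(); i++) {
            if (here.charAt(i) == '/') {
                result.append("../");
            }
        }
        return result.append(path).toString();
    }
}
```

For a page saved as docs/page.html, a link to img/a.png under the same site comes out as ../img/a.png, so the saved page still resolves it from its own directory.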
protected java.lang.String decode(java.lang.String raw)
Parameters:
raw - The escaped URI.
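The summary describes decode as unescaping a URL to form a file name. A minimal sketch of that idea, using the standard java.net.URLDecoder rather than whatever SiteCapturer does internally, might look like the following. The class name DecodeSketch and the choice of UTF-8 are assumptions; URLDecoder also treats '+' as a space, so the sketch protects literal plus signs first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeSketch {
    /** Turn percent-escapes such as %20 back into the characters they stand for. */
    static String decode(String raw) {
        try {
            // URLDecoder follows form-encoding rules and would turn '+' into a space,
            // so escape literal '+' before decoding (assumption: '+' should be kept).
            return URLDecoder.decode(raw.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that URLDecoder.decode throws IllegalArgumentException on malformed escapes such as a trailing '%', which a real capturer would need to handle.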
protected void copy()
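The summary lists copy() as copying a resource (image) locally and TRANSFER_SIZE as the copy buffer size. The sketch below shows the usual fixed-buffer stream copy that such a pair suggests. It is an illustration only: the class and method names are hypothetical, and the value 4096 is an assumption, since the page does not state the actual buffer size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class CopySketch {
    /** Assumed buffer size; the documented TRANSFER_SIZE value is not given. */
    static final int TRANSFER_SIZE = 4096;

    /** Copy everything from 'in' to 'out' in TRANSFER_SIZE chunks; returns bytes copied. */
    static long transfer(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[TRANSFER_SIZE];
        long total = 0;
        int read;
        // read() returns -1 at end of stream; it may return fewer bytes than requested.
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

In a capturer, 'in' would typically come from a URL connection to the resource and 'out' from a file under the target directory.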
protected void process(NodeFilter filter) throws ParserException
Parameters:
filter - The filter to apply to the collected nodes.
Throws:
ParserException - If a parse error occurs.
public void capture()
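The fields mPages (pages to capture) and mFinished (pages already captured), together with the isToBeCaptured rule, suggest a worklist loop. The sketch below shows one way those pieces could fit together. It is not the class's actual code: CaptureLoopSketch and the linksOf map (a stand-in for parsing a page and collecting its links) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class CaptureLoopSketch {
    /** Documented rule: source URL as a case-insensitive prefix, and no '?' or '#'. */
    static boolean isToBeCaptured(String source, String link) {
        return link.toLowerCase(Locale.ROOT).startsWith(source.toLowerCase(Locale.ROOT))
                && link.indexOf('?') == -1
                && link.indexOf('#') == -1;
    }

    /** Visit pages starting at 'source'; returns the pages in the order they were captured. */
    static List<String> capture(String source, Map<String, List<String>> linksOf) {
        List<String> pages = new ArrayList<>();   // plays the role of mPages
        Set<String> finished = new HashSet<>();   // plays the role of mFinished
        List<String> order = new ArrayList<>();
        pages.add(source);
        while (!pages.isEmpty()) {
            String url = pages.remove(0);
            if (!finished.add(url)) {
                continue; // already captured
            }
            order.add(url);
            // A real capturer would parse the page here; the map stands in for that step.
            for (String link : linksOf.getOrDefault(url, Collections.<String>emptyList())) {
                if (isToBeCaptured(source, link) && !finished.contains(link)) {
                    pages.add(link);
                }
            }
        }
        return order;
    }
}
```

The finished set is what keeps pages that link back to each other from being captured twice.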
public static void main(java.lang.String[] args) throws java.net.MalformedURLException, java.io.IOException
Parameters:
args - The command line arguments. There are three arguments: the web site to capture, the local directory to save it to, and a flag (true or false) to indicate whether resources such as images and video are to be captured as well. These are requested via dialog boxes if not supplied.
Throws:
java.net.MalformedURLException - If the supplied URL is invalid.
java.io.IOException - If an error occurs reading the page or resources.
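The argument handling described for main (three positional arguments, with missing ones requested from the user) can be sketched roughly as below. This is an assumption-laden illustration, not the program's code: CaptureArgsSketch is a hypothetical class, and the prompt callback stands in for the dialog boxes the documentation mentions.

```java
import java.util.function.UnaryOperator;

class CaptureArgsSketch {
    final String source;            // the web site to capture
    final String target;            // the local directory to save it to
    final boolean captureResources; // whether images and other resources are copied too

    CaptureArgsSketch(String source, String target, boolean captureResources) {
        this.source = source;
        this.target = target;
        this.captureResources = captureResources;
    }

    /**
     * Read the three positional arguments, asking 'prompt' for any that are missing.
     * The prompt receives the property name and returns the value the user entered.
     */
    static CaptureArgsSketch parse(String[] args, UnaryOperator<String> prompt) {
        String source = args.length > 0 ? args[0] : prompt.apply("source");
        String target = args.length > 1 ? args[1] : prompt.apply("target");
        String flag = args.length > 2 ? args[2] : prompt.apply("captureResources");
        // Boolean.parseBoolean is true only for "true" (ignoring case).
        return new CaptureArgsSketch(source, target, Boolean.parseBoolean(flag));
    }
}
```

The parsed values map directly onto setSource, setTarget, and setCaptureResources, after which capture() would be called.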