java.lang.Object
  org.htmlparser.util.Translate
Translate numeric character references and character entity references to Unicode characters. Based on tables found at http://www.w3.org/TR/REC-html40/sgml/entities.html
Typical usage:
String s = Translate.decode (getTextFromHtmlPage ());
or
String s = "<HTML>" + Translate.encode (getArbitraryText ()) + "</HTML>";
Field Summary | |
protected static int |
BREAKPOINT
The dividing point between a simple table lookup and a binary search. |
static boolean |
DECODE_LINE_BY_LINE
If this member is set true, decoding of streams is done line by line in order to reduce the maximum memory required. |
static boolean |
ENCODE_HEXADECIMAL
If this member is set true, encoding of numeric character references uses hexadecimal digits instead of decimal digits. |
protected static CharacterReference[] |
mCharacterList
List of references sorted by character. |
protected static CharacterReference[] |
mCharacterReferences
Table mapping entity reference kernel to character. |
Method Summary | |
static char |
convertToChar(java.lang.String string)
Deprecated. Use decode. |
static char |
convertToChar(java.lang.String string,
int start,
int end)
Deprecated. Use decode. |
static java.lang.String |
convertToString(int character)
Deprecated. Use encode. |
static void |
decode(java.io.InputStream in,
java.io.PrintStream out)
Decode a stream containing references. |
static java.lang.String |
decode(java.lang.String string)
Decode a string containing references. |
static java.lang.String |
decode(java.lang.StringBuffer buffer)
Decode the characters in a string buffer containing references. |
static void |
encode(java.io.InputStream in,
java.io.PrintStream out)
Encode a stream to use references. |
static java.lang.String |
encode(int character)
Convert a character to a numeric character reference. |
static java.lang.String |
encode(java.lang.String string)
Encode a string to use references. |
static CharacterReference |
lookup(char character)
Look up a reference by character. |
protected static CharacterReference |
lookup(CharacterReference key)
Look up a reference by kernel. |
protected static int |
lookup(CharacterReference[] array,
char ref,
int lo,
int hi)
Binary search for a reference. |
static CharacterReference |
lookup(java.lang.String kernel,
int start,
int end)
Look up a reference by kernel. |
static void |
main(java.lang.String[] args)
Numeric character reference and character entity reference to unicode codec. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static boolean DECODE_LINE_BY_LINE
If this member is set true, decoding of streams is done line by line in order to reduce the maximum memory required.
public static boolean ENCODE_HEXADECIMAL
If this member is set true, encoding of numeric character references uses hexadecimal digits, e.g. &#x25CB;, instead of decimal digits.
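The difference between the two numeric forms can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of org.htmlparser; the letter case the real encoder uses for hexadecimal digits is not stated in this page, so the lowercase output here is only what Integer.toHexString produces.

```java
// Hypothetical helper for illustration only; not the library's implementation.
final class NumericReferenceSketch {
    /** Formats a code point as a numeric character reference. */
    static String toReference(int codePoint, boolean hexadecimal) {
        return hexadecimal
            ? "&#x" + Integer.toHexString(codePoint) + ";"  // hexadecimal form
            : "&#" + codePoint + ";";                       // decimal form
    }

    public static void main(String[] args) {
        System.out.println(toReference(0x25CB, true));   // &#x25cb;
        System.out.println(toReference(0x25CB, false));  // &#9675;
    }
}
```

Both strings refer to the same character, U+25CB; only the digit notation differs.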
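protected static final CharacterReference[] mCharacterReferences
Table mapping entity reference kernel to character.
protected static final int BREAKPOINT
The dividing point between a simple table lookup and a binary search.
protected static final CharacterReference[] mCharacterList
List of references sorted by character. The first part, up to BREAKPOINT, is stored in a direct translational table; indexing into the table with a character yields the reference. The second part is dense and sorted by character, suitable for binary lookup.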
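The two-part layout described above can be sketched as follows. This is an illustration under assumptions, not the class's actual code: the breakpoint value, the table contents, and all names are hypothetical, and plain strings stand in for CharacterReference objects.

```java
import java.util.Arrays;

// Hypothetical sketch of a direct table below a breakpoint plus a sorted, dense tail.
final class HybridLookupSketch {
    // Assumed breakpoint: characters below it index the direct table.
    static final int BREAKPOINT = 0x100;
    static final String[] DIRECT = new String[BREAKPOINT];
    // Characters at or above the breakpoint, sorted ascending, parallel to SPARSE_NAMES.
    static final char[] SPARSE_CHARS = { '\u0152', '\u2022', '\u20AC' };
    static final String[] SPARSE_NAMES = { "OElig", "bull", "euro" };

    static {
        DIRECT['&'] = "amp";
        DIRECT['<'] = "lt";
        DIRECT['>'] = "gt";
        DIRECT['\u00E9'] = "eacute";
    }

    /** Returns the entity kernel for c, or null if the sketch has no entry for it. */
    static String kernelFor(char c) {
        if (c < BREAKPOINT) {
            return DIRECT[c];                              // constant-time table lookup
        }
        int i = Arrays.binarySearch(SPARSE_CHARS, c);      // logarithmic search in the tail
        return i >= 0 ? SPARSE_NAMES[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(kernelFor('&'));
        System.out.println(kernelFor('\u20AC'));
    }
}
```

The direct table trades memory for speed on the common low characters, while the sorted tail keeps the sparse high range compact.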
Method Detail |
protected static int lookup(CharacterReference[] array, char ref, int lo, int hi)
Binary search for a reference.
Parameters:
array - The array of CharacterReference objects.
ref - The character to search for.
lo - The lower index within which to look.
hi - The upper index within which to look.
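A bounded binary search of this shape might look like the sketch below. The return convention is an assumption, since this page does not say what the int result means; here it is the index of the match or, failing that, the index at which the character would be inserted. A char array stands in for the CharacterReference array, and the names are hypothetical.

```java
// Hypothetical sketch; the return convention is assumed, not taken from the library.
final class RangeSearchSketch {
    /**
     * Searches sorted[lo..hi] (both inclusive) for target.
     * Returns the index of target, or the index where it would be inserted.
     */
    static int search(char[] sorted, char target, int lo, int hi) {
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;      // midpoint without signed overflow
            if (sorted[mid] < target) {
                lo = mid + 1;               // target lies in the upper half
            } else if (sorted[mid] > target) {
                hi = mid - 1;               // target lies in the lower half
            } else {
                return mid;                 // exact match
            }
        }
        return lo;                          // insertion point
    }

    public static void main(String[] args) {
        char[] sorted = { 'a', 'c', 'e', 'g' };
        System.out.println(search(sorted, 'e', 0, sorted.length - 1));
    }
}
```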
public static CharacterReference lookup(char character)
Look up a reference by character.
Parameters:
character - The character to be looked up.
Returns:
The reference for that character, or null.
protected static CharacterReference lookup(CharacterReference key)
Look up a reference by kernel.
Parameters:
key - A character reference with the kernel set to the string to be found. It need not be truncated at the exact end of the reference.
public static CharacterReference lookup(java.lang.String kernel, int start, int end)
Look up a reference by kernel. Use lookup(CharacterReference) instead.
Parameters:
kernel - The string to look up, e.g. "amp".
start - The starting point in the string of the kernel.
end - The ending point in the string of the kernel. This should be the index of the semicolon if it exists, or failing that, at least an index past the last character of the kernel.
Returns:
The reference, or null if it wasn't found.
public static char convertToChar(java.lang.String string, int start, int end)
Deprecated. Use decode.
Parameters:
string - The string to convert. Of the form &xxxx; or &#xxxx; with or without the leading ampersand or trailing semicolon.
start - The starting point in the string to look for a character reference.
end - The ending point in the string to stop looking for a character reference.
public static char convertToChar(java.lang.String string)
Deprecated. Use decode.
Parameters:
string - The string to convert. Of the form &xxxx; or &#xxxx; with or without the leading ampersand or trailing semicolon.
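public static java.lang.String decode(java.lang.String string)
Decode a string containing references.
Parameters:
string - The string to translate.
public static java.lang.String decode(java.lang.StringBuffer buffer)
Decode the characters in a string buffer containing references.
Parameters:
buffer - The StringBuffer containing references.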
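To make the idea of decoding concrete, here is a minimal, self-contained sketch that resolves decimal and hexadecimal numeric references and a handful of named ones. It is not the library's algorithm: the class name, the five-entry table, and the requirement that every reference end in a semicolon are all simplifying assumptions. Unrecognized references are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reference decoding; not org.htmlparser.util.Translate itself.
final class DecodeSketch {
    // Tiny illustrative table; the HTML 4.0 entity set is far larger.
    static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("nbsp", '\u00A0');
    }

    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Only an ampersand followed later by a semicolon can start a reference here.
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String replacement = (semi > i + 1) ? resolve(text.substring(i + 1, semi)) : null;
            if (replacement != null) {
                out.append(replacement);
                i = semi + 1;               // skip past the whole reference
            } else {
                out.append(c);              // not a known reference: copy literally
                i++;
            }
        }
        return out.toString();
    }

    /** Resolves the text between '&' and ';', or returns null if it is not recognized. */
    static String resolve(String body) {
        if (body.startsWith("#")) {
            boolean hex = body.length() > 1
                && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
            try {
                int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
                return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
            } catch (NumberFormatException e) {
                return null;                // malformed digits
            }
        }
        Character named = NAMED.get(body);
        return named == null ? null : String.valueOf(named);
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &amp; c &#169;"));
    }
}
```

Each reference is decoded exactly once, so "&amp;lt;" becomes "&lt;" rather than "<".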
public static void decode(java.io.InputStream in, java.io.PrintStream out)
Decode a stream containing references. If DECODE_LINE_BY_LINE is true, the input stream is broken up into lines, terminated by either carriage return or newline, in order to reduce the latency and maximum buffering memory size required.
Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the decoded stream to.
public static java.lang.String convertToString(int character)
Deprecated. Use encode.
Parameters:
character - The character to convert.
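The line-by-line stream handling described for decode(InputStream, PrintStream) can be sketched in a few lines. This is only an illustration of the buffering strategy, assuming an ISO-8859-1 input as the parameter description states; the class name, the UnaryOperator parameter, and the choice to write '\n' after each line are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical sketch: only one line is held in memory at a time.
final class LineByLineSketch {
    static void translate(InputStream in, PrintStream out, UnaryOperator<String> perLine) {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        try {
            String line;
            // readLine treats "\n", "\r", and "\r\n" as line terminators.
            while ((line = reader.readLine()) != null) {
                out.print(perLine.apply(line));
                out.print('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
    }

    public static void main(String[] args) {
        // Stand-in transformation; a real caller would pass a decoder or encoder.
        translate(System.in, System.out, line -> line.replace("&amp;", "&"));
    }
}
```

One consequence of this strategy is that a reference split across a line terminator would not be seen as a single unit by the per-line transformation.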
public static java.lang.String encode(int character)
Convert a character to a numeric character reference.
Parameters:
character - The character to convert.
public static java.lang.String encode(java.lang.String string)
Encode a string to use references.
Parameters:
string - The string to translate.
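As a rough illustration of what encoding a string to use references involves, the sketch below escapes the four markup-significant ASCII characters by name and writes every non-ASCII code point as a numeric reference. These choices are assumptions made for brevity; the class name and the boolean parameter are hypothetical, and this page does not say which characters the real method emits as named rather than numeric references.

```java
// Hypothetical sketch of reference encoding; not the library's implementation.
final class EncodeSketch {
    static String encode(String text, boolean hexadecimal) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);   // step over surrogate pairs as one unit
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp < 0x80) {
                        out.appendCodePoint(cp);            // plain ASCII passes through
                    } else if (hexadecimal) {
                        out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    } else {
                        out.append("&#").append(cp).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a<b & \u00E9", false));
    }
}
```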
public static void encode(java.io.InputStream in, java.io.PrintStream out)
Encode a stream to use references.
Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the encoded stream to.
public static void main(java.lang.String[] args)
Numeric character reference and character entity reference to Unicode codec. Translate the System.in input into an encoded or decoded stream and send the results to System.out.
Parameters:
args - If args[0] is -encode, perform an encoding on System.in; otherwise perform a decoding.