java.lang.Object
junit.framework.Assert
junit.framework.TestCase
org.htmlparser.tests.ParserTestCase
org.htmlparser.tests.lexerTests.LexerTests
Field Summary |
Fields inherited from class org.htmlparser.tests.ParserTestCase |
mLexer, node, nodeCount, parser |
Constructor Summary | |
LexerTests(java.lang.String name)
Test the Lexer class. |
Method Summary | |
void |
checkTagNames(Node node)
Check the tag name for one of the ones expected on the page. |
void |
testAttributedTag()
Test operation with attributed tags. |
void |
testCommentInScript()
See bug #1227213 Particular SCRIPT tags close too late. |
void |
testConjoined()
See bug #825820 Words conjoined |
void |
testDosEOL()
Test operation with Dos line endings. |
void |
testEOF_EOL()
Test operation with line endings near the end of input. |
void |
testEscapedQuote()
See bug #899413 bug in javascript end detection. |
void |
testFidelity()
Test the fidelity of the toHtml() method. |
void |
testJIS()
Test case for bug #789439, "Japanese page causes OutOfMemory Exception". No exception is thrown in the current version of the parser; the problem is that ISO-2022-JP (aka JIS) encoding sometimes causes spurious tags. |
void |
testJsp()
See bug #880283 Character ">" erroneously inserted by Lexer |
void |
testPureTag()
Test operation with only tags. |
void |
testPureText()
Test operation without tags. |
void |
testRemark()
Test operation with comments. |
void |
testStackOverflow()
Check for StackOverflow error. |
void |
testTagStops()
Test that tags stop string nodes. |
void |
testUnixEOL()
Test operation with Unix line endings. |
void |
testUrlInStyle()
See bug #1227213 Particular SCRIPT tags close too late. |
Methods inherited from class org.htmlparser.tests.ParserTestCase |
assertHiddenIDTagPresent, assertNodeCount, assertNodeCount, assertSameType, assertStringEquals, assertSuperType, assertTagEquals, assertType, assertXmlEquals, createParser, createParser, createParser, createParser, failWithMessage, getParser, main, parse, parseAndAssertNodeCount, parseNodes, removeEscapeCharacters, setParser |
Methods inherited from class junit.framework.TestCase |
countTestCases, createResult, getName, name, run, run, runBare, runTest, setName, setUp, tearDown, toString |
Methods inherited from class junit.framework.Assert |
assert, assert, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertNotNull, assertNotNull, assertNull, assertNull, assertSame, assertSame, assertTrue, assertTrue, fail, fail |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
public LexerTests(java.lang.String name)
Method Detail |
public void testPureText() throws ParserException
Throws: ParserException
public void testUnixEOL() throws ParserException
Throws: ParserException
public void testDosEOL() throws ParserException
Throws: ParserException
public void testEOF_EOL() throws ParserException
Throws: ParserException
public void testTagStops() throws ParserException
Throws: ParserException
public void testPureTag() throws ParserException
Throws: ParserException
public void testAttributedTag() throws ParserException
Throws: ParserException
public void testRemark() throws ParserException
Throws: ParserException
public void testFidelity() throws ParserException, java.io.IOException
Throws: ParserException, java.io.IOException
public void testJIS() throws ParserException
The referenced URL has an ISO-8859-1 encoding (the default), but the page intermixes English with Japanese characters that use the JIS encoding. We detect failure by looking for weird tag names, which indicate text that was not correctly handled as string nodes.
Here is a partial dump of the page with escape sequences:
0002420 1b 24 42 3f 79 4a 42 25 47 25 38 25 2b 25 61 43
0002440 35 44 65 43 44 1b 28 4a 20 77 69 74 68 20 43 61
..
0002720 6c 22 3e 4a 53 6b 79 1b 24 42 42 50 31 7e 25 5a
0002740 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 3c
..
0003060 20 69 1b 24 42 25 62 21 3c 25 49 42 50 31 7e 25
0003100 5a 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a
..
0003220 1b 24 42 25 2d 25 3f 25 5e 25 2f 25 69 24 4e 25
0003240 5b 21 3c 25 60 25 5a 21 3c 25 38 1b 28 4a 3c 2f
..
0003320 6e 65 31 2e 70 6c 22 3e 1b 24 42 3d 60 48 77 43
0003340 66 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 2d 2d 2d 2d
..
0004400 46 6f 72 75 6d 20 30 30 39 20 28 1b 24 42 3e 21
0004420 3c 6a 24 4b 31 4a 4a 21 44 2e 24 4a 24 49 1b 28
0004440 4a 29 3c 2f 41 3e 3c 49 4d 47 20 53 52 43 3d 22
The fix proposed by j_s_nightingale, which swallows JIS escape sequences in the string parser, is implemented. Apparently the fix won't help with EUC-JP or Shift-JIS, so this may still be a problem for those encodings. It's theoretically possible that JIS encoding, or another one, could be used in attribute names or values within tags as well, but this is considered improbable and is therefore not handled in the tag parser state machine.
Throws: ParserException
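The effect described for testJIS can be illustrated with a small, self-contained sketch. It is not the htmlparser implementation: the class name JisEscapeSketch and its methods are hypothetical, and it only assumes the standard ISO-2022-JP escape sequences seen in the dump (ESC $ B or ESC $ @ to enter double-byte mode, ESC ( J or ESC ( B to return to single-byte mode). Using rows 0002720 and 0002740 of the dump, it shows that a byte-wise scan finds a 0x3c ('<') inside the double-byte run, presumably the kind of byte that produces a spurious tag, while a scan that swallows the run sees only the real tag openers.

```java
final class JisEscapeSketch {
    static final int ESC = 0x1b;

    // Rows 0002720 and 0002740 from the dump above.
    static final String SAMPLE =
        "6c 22 3e 4a 53 6b 79 1b 24 42 42 50 31 7e 25 5a "
      + "21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 3c";

    /** Converts whitespace-separated hex byte values into a byte array. */
    static byte[] fromHex(String dump) {
        String[] tokens = dump.trim().split("\\s+");
        byte[] bytes = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return bytes;
    }

    /** Counts every 0x3c byte, as a reader unaware of JIS escapes would. */
    static int countLessThan(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '<') {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the single-byte text only, dropping the escape sequences
     * and every byte that lies inside a double-byte run.
     */
    static String swallowDoubleByteRuns(byte[] data) {
        StringBuilder out = new StringBuilder();
        boolean doubleByte = false;
        int i = 0;
        while (i < data.length) {
            if (data[i] == ESC && i + 2 < data.length) {
                int b1 = data[i + 1];
                int b2 = data[i + 2];
                if (b1 == '$' && (b2 == '@' || b2 == 'B')) {
                    doubleByte = true;   // ESC $ @ or ESC $ B: enter double-byte mode
                    i += 3;
                    continue;
                }
                if (b1 == '(' && (b2 == 'B' || b2 == 'J')) {
                    doubleByte = false;  // ESC ( B or ESC ( J: back to single-byte mode
                    i += 3;
                    continue;
                }
            }
            if (!doubleByte) {
                out.append((char) (data[i] & 0xff));
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] sample = fromHex(SAMPLE);
        String text = swallowDoubleByteRuns(sample);
        System.out.println("naive '<' count:    " + countLessThan(sample));
        System.out.println("filtered '<' count: " + text.chars().filter(c -> c == '<').count());
    }
}
```

On this excerpt the naive scan counts four '<' bytes and the filtered scan counts three; the extra one is the second byte of the double-byte pair 21 3c. As the note above says, an approach like this only covers escape-based JIS and would not help with EUC-JP or Shift-JIS, which do not use escape sequences.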
public void checkTagNames(Node node)
public void testConjoined() throws ParserException
Throws: ParserException
public void testStackOverflow() throws ParserException
Throws: ParserException
public void testJsp() throws ParserException
Throws: ParserException
public void testEscapedQuote() throws ParserException
Throws: ParserException
public void testCommentInScript() throws ParserException
Throws: ParserException
public void testUrlInStyle() throws ParserException
Throws: ParserException