org.htmlparser.tests.lexerTests
Class LexerTests

java.lang.Object
  extended by junit.framework.Assert
      extended by junit.framework.TestCase
          extended by org.htmlparser.tests.ParserTestCase
              extended by org.htmlparser.tests.lexerTests.LexerTests
All Implemented Interfaces:
junit.framework.Test

public class LexerTests
extends ParserTestCase


Field Summary
 
Fields inherited from class org.htmlparser.tests.ParserTestCase
mLexer, node, nodeCount, parser
 
Constructor Summary
LexerTests(java.lang.String name)
          Test the Lexer class.
 
Method Summary
 void checkTagNames(Node node)
          Check the tag name for one of the ones expected on the page.
 void testAttributedTag()
          Test operation with attributed tags.
 void testCommentInScript()
          See bug #1227213 Particular SCRIPT tags close too late.
 void testConjoined()
          See bug #825820 Words conjoined
 void testDosEOL()
          Test operation with DOS line endings.
 void testEOF_EOL()
          Test operation with line endings near the end of input.
 void testEscapedQuote()
          See bug #899413 bug in javascript end detection.
 void testFidelity()
          Test the fidelity of the toHtml() method.
 void testJIS()
          Test case for bug #789439, "Japanese page causes OutOfMemory Exception". No exception is thrown in the current version of the parser; the problem is that ISO-2022-JP (aka JIS) encoding sometimes causes spurious tags.
 void testJsp()
          See bug #880283 Character ">" erroneously inserted by Lexer
 void testPureTag()
          Test operation with only tags.
 void testPureText()
          Test operation without tags.
 void testRemark()
          Test operation with comments.
 void testStackOverflow()
          Check for StackOverflow error.
 void testTagStops()
          Test that tags stop string nodes.
 void testUnixEOL()
          Test operation with Unix line endings.
 void testUrlInStyle()
          See bug #1227213 Particular SCRIPT tags close too late.
 
Methods inherited from class org.htmlparser.tests.ParserTestCase
assertHiddenIDTagPresent, assertNodeCount, assertNodeCount, assertSameType, assertStringEquals, assertSuperType, assertTagEquals, assertType, assertXmlEquals, createParser, createParser, createParser, createParser, failWithMessage, getParser, main, parse, parseAndAssertNodeCount, parseNodes, removeEscapeCharacters, setParser
 
Methods inherited from class junit.framework.TestCase
countTestCases, createResult, getName, name, run, run, runBare, runTest, setName, setUp, tearDown, toString
 
Methods inherited from class junit.framework.Assert
assert, assert, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertNotNull, assertNotNull, assertNull, assertNull, assertSame, assertSame, assertTrue, assertTrue, fail, fail
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LexerTests

public LexerTests(java.lang.String name)
Test the Lexer class.

Method Detail

testPureText

public void testPureText()
                  throws ParserException
Test operation without tags.

Throws:
ParserException

testUnixEOL

public void testUnixEOL()
                 throws ParserException
Test operation with Unix line endings.

Throws:
ParserException

testDosEOL

public void testDosEOL()
                throws ParserException
Test operation with DOS line endings.

Throws:
ParserException

testEOF_EOL

public void testEOF_EOL()
                 throws ParserException
Test operation with line endings near the end of input.

Throws:
ParserException

testTagStops

public void testTagStops()
                  throws ParserException
Test that tags stop string nodes.

Throws:
ParserException

testPureTag

public void testPureTag()
                 throws ParserException
Test operation with only tags.

Throws:
ParserException

testAttributedTag

public void testAttributedTag()
                       throws ParserException
Test operation with attributed tags.

Throws:
ParserException

testRemark

public void testRemark()
                throws ParserException
Test operation with comments.

Throws:
ParserException

testFidelity

public void testFidelity()
                  throws ParserException,
                         java.io.IOException
Test the fidelity of the toHtml() method.

Throws:
ParserException
java.io.IOException

testJIS

public void testJIS()
             throws ParserException
Test case for bug #789439, "Japanese page causes OutOfMemory Exception". No exception is thrown in the current version of the parser; the problem is that ISO-2022-JP (aka JIS) encoding sometimes causes spurious tags. The root cause is characters bracketed by [esc]$B and [esc](J (contrary to what is indicated in the j_s_nightingale analysis of the problem) that sometimes have an angle bracket (< or 0x3c) embedded in them. The parser takes these to be tags instead of treating them as strings.

The URL referenced has an ISO-8859-1 encoding (the default), but the page intermixes Japanese characters, in the JIS encoding, with English. We detect failure by looking for weird tag names that result from text not being handled correctly as string nodes.

Here is a partial dump of the page with escape sequences:

 0002420 1b 24 42 3f 79 4a 42 25 47 25 38 25 2b 25 61 43
 0002440 35 44 65 43 44 1b 28 4a 20 77 69 74 68 20 43 61
 ..
 0002720 6c 22 3e 4a 53 6b 79 1b 24 42 42 50 31 7e 25 5a
 0002740 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 3c
 ..
 0003060 20 69 1b 24 42 25 62 21 3c 25 49 42 50 31 7e 25
 0003100 5a 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a
 ..
 0003220 1b 24 42 25 2d 25 3f 25 5e 25 2f 25 69 24 4e 25
 0003240 5b 21 3c 25 60 25 5a 21 3c 25 38 1b 28 4a 3c 2f
 ..
 0003320 6e 65 31 2e 70 6c 22 3e 1b 24 42 3d 60 48 77 43
 0003340 66 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 2d 2d 2d 2d
 ..
 0004400 46 6f 72 75 6d 20 30 30 39 20 28 1b 24 42 3e 21
 0004420 3c 6a 24 4b 31 4a 4a 21 44 2e 24 4a 24 49 1b 28
 0004440 4a 29 3c 2f 41 3e 3c 49 4d 47 20 53 52 43 3d 22
 

The fix proposed by j_s_nightingale is implemented: the string parser swallows JIS escape sequences. Apparently the fix won't help EUC-JP and Shift-JIS though, so this may still be a problem. It's theoretically possible that JIS-encoded text, or text in another such encoding, could appear in attribute names or values within tags as well, but this is considered improbable and is therefore not handled in the tag parser state machine.
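
To illustrate the idea of swallowing JIS escape sequences, here is a minimal sketch, not the lexer's actual code. The class JisAwareScanner and the method indexOfTagStart are hypothetical names; the sketch assumes that only the [esc]$B and [esc](J sequences named above need to be recognised. It uses bytes from the 0002720/0002740 rows of the dump to show that a naive search stops at the 0x3c embedded in the double-byte run, while the escape-aware search finds the real </A>.

```java
/** Hypothetical helper for illustration only; not part of org.htmlparser. */
final class JisAwareScanner {
    private static final char ESC = 0x1b;

    /**
     * Returns the index of the first '<' at or after start that lies outside
     * a run opened by ESC $ B and closed by ESC ( J, or -1 if there is none.
     * Assumption: other ISO-2022-JP escape sequences are ignored here.
     */
    static int indexOfTagStart(String text, int start) {
        boolean inDoubleByte = false;
        int i = start;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == ESC && i + 2 < text.length()) {
                char a = text.charAt(i + 1);
                char b = text.charAt(i + 2);
                if (a == '$' && b == 'B') {
                    inDoubleByte = true;   // swallow the escape, enter the JIS run
                    i += 3;
                    continue;
                }
                if (a == '(' && b == 'J') {
                    inDoubleByte = false;  // swallow the escape, leave the JIS run
                    i += 3;
                    continue;
                }
            }
            if (c == '<' && !inDoubleByte) {
                return i;                  // a genuine tag opener
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Bytes 4a 53 6b 79 1b 24 42 ... 21 3c 25 38 1b 28 4a 3c 2f 41 3e from the dump.
        String sample = "JSky\u001b$BBP1~%Z!<%8\u001b(J</A>";
        System.out.println("naive search:        " + sample.indexOf('<'));
        System.out.println("escape-aware search: " + indexOfTagStart(sample, 0));
    }
}
```

Tracking a single boolean keeps the sketch small; as noted above, it would not help with EUC-JP or Shift-JIS, which do not use escape sequences.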

Throws:
ParserException

checkTagNames

public void checkTagNames(Node node)
Check the tag name for one of the ones expected on the page. Recursively check the children.
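
A recursive check of this kind might look like the sketch below. It is an assumption-laden illustration rather than the test's actual code: SimpleNode is a hypothetical stand-in for the parser's node type, and the set of expected tag names is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a recursive tag-name check; not the real test code. */
final class TagNameCheck {
    /** Hypothetical stand-in for a parsed node; tagName is null for text nodes. */
    static final class SimpleNode {
        final String tagName;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String tagName, SimpleNode... kids) {
            this.tagName = tagName;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Assumed list of tag names expected on the page (illustrative only). */
    static final Set<String> EXPECTED =
        new HashSet<>(Arrays.asList("HTML", "BODY", "A", "P", "IMG"));

    /** Adds every unexpected tag name in the subtree rooted at node to out. */
    static void collectUnexpected(SimpleNode node, List<String> out) {
        if (node.tagName != null && !EXPECTED.contains(node.tagName)) {
            out.add(node.tagName);
        }
        for (SimpleNode child : node.children) {
            collectUnexpected(child, out);  // recursively check the children
        }
    }

    public static void main(String[] args) {
        // "%8" stands in for a spurious tag produced from mis-lexed JIS text.
        SimpleNode tree = new SimpleNode("HTML",
            new SimpleNode("BODY",
                new SimpleNode("A", new SimpleNode(null)),
                new SimpleNode("%8")));
        List<String> unexpected = new ArrayList<>();
        collectUnexpected(tree, unexpected);
        System.out.println("unexpected tag names: " + unexpected);
    }
}
```

A test built this way would fail whenever the collected list is non-empty.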


testConjoined

public void testConjoined()
                   throws ParserException
See bug #825820 Words conjoined

Throws:
ParserException

testStackOverflow

public void testStackOverflow()
                       throws ParserException
Check for StackOverflow error.

Throws:
ParserException

testJsp

public void testJsp()
             throws ParserException
See bug #880283 Character ">" erroneously inserted by Lexer

Throws:
ParserException

testEscapedQuote

public void testEscapedQuote()
                      throws ParserException
See bug #899413 bug in javascript end detection.

Throws:
ParserException

testCommentInScript

public void testCommentInScript()
                         throws ParserException
See bug #1227213 Particular SCRIPT tags close too late.

Throws:
ParserException

testUrlInStyle

public void testUrlInStyle()
                    throws ParserException
See bug #1227213 Particular SCRIPT tags close too late. This was actually working prior to the patch, since the ScriptScanner didn't use smartquote processing. I'm not sure why jwilsonsprings1 said the patch worked for him. I can only assume he was mistaken in thinking it was the URL that caused the failure.

Throws:
ParserException