LexerTests

org.htmlparser.tests.lexerTests
Class LexerTests

java.lang.Object
  junit.framework.Assert
      junit.framework.TestCase
          org.htmlparser.tests.ParserTestCase
              org.htmlparser.tests.lexerTests.LexerTests

All Implemented Interfaces: junit.framework.Test

public class LexerTests
extends ParserTestCase

Field Summary

Fields inherited from class org.htmlparser.tests.ParserTestCase

mLexer, node, nodeCount, parser

Constructor Summary
`LexerTests(java.lang.String name)` Test the Lexer class.

Method Summary
`void`	`checkTagNames(Node node)` Check the tag name for one of the ones expected on the page.
`void`	`testAttributedTag()` Test operation with attributed tags.
`void`	`testCommentInScript()` See bug #1227213 Particular SCRIPT tags close too late.
`void`	`testConjoined()` See bug #825820 Words conjoined
`void`	`testDosEOL()` Test operation with Dos line endings.
`void`	`testEOF_EOL()` Test operation with line endings near the end of input.
`void`	`testEscapedQuote()` See bug #899413 bug in javascript end detection.
`void`	`testFidelity()` Test the fidelity of the toHtml() method.
`void`	`testJIS()` Test case for bug #789439, Japanese page causes OutOfMemory Exception. No exception is thrown in the current version of the parser; the problem is that the ISO-2022-JP (aka JIS) encoding sometimes causes spurious tags.
`void`	`testJsp()` See bug #880283 Character ">" erroneously inserted by Lexer
`void`	`testPureTag()` Test operation with only tags.
`void`	`testPureText()` Test operation without tags.
`void`	`testRemark()` Test operation with comments.
`void`	`testStackOverflow()` Check for StackOverflow error.
`void`	`testTagStops()` Test that tags stop string nodes.
`void`	`testUnixEOL()` Test operation with Unix line endings.
`void`	`testUrlInStyle()` See bug #1227213 Particular SCRIPT tags close too late.

Methods inherited from class org.htmlparser.tests.ParserTestCase

assertHiddenIDTagPresent, assertNodeCount, assertNodeCount, assertSameType, assertStringEquals, assertSuperType, assertTagEquals, assertType, assertXmlEquals, createParser, createParser, createParser, createParser, failWithMessage, getParser, main, parse, parseAndAssertNodeCount, parseNodes, removeEscapeCharacters, setParser

Methods inherited from class junit.framework.TestCase

countTestCases, createResult, getName, name, run, run, runBare, runTest, setName, setUp, tearDown, toString

Methods inherited from class junit.framework.Assert

assert, assert, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertEquals, assertNotNull, assertNotNull, assertNull, assertNull, assertSame, assertSame, assertTrue, assertTrue, fail, fail

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Detail

LexerTests

public LexerTests(java.lang.String name)

Test the Lexer class.

Method Detail

testPureText

public void testPureText()
                  throws ParserException

Test operation without tags.

Throws: ParserException

testUnixEOL

public void testUnixEOL()
                 throws ParserException

Test operation with Unix line endings.

Throws: ParserException

testDosEOL

public void testDosEOL()
                throws ParserException

Test operation with Dos line endings.

Throws: ParserException

testEOF_EOL

public void testEOF_EOL()
                 throws ParserException

Test operation with line endings near the end of input.

Throws: ParserException

testTagStops

public void testTagStops()
                  throws ParserException

Test that tags stop string nodes.

Throws: ParserException

testPureTag

public void testPureTag()
                 throws ParserException

Test operation with only tags.

Throws: ParserException

testAttributedTag

public void testAttributedTag()
                       throws ParserException

Test operation with attributed tags.

Throws: ParserException

testRemark

public void testRemark()
                throws ParserException

Test operation with comments.

Throws: ParserException

testFidelity

public void testFidelity()
                  throws ParserException,
                         java.io.IOException

Test the fidelity of the toHtml() method.

Throws: ParserException, java.io.IOException

testJIS

public void testJIS()
             throws ParserException

Test case for bug #789439, Japanese page causes OutOfMemory Exception. No exception is thrown in the current version of the parser; the problem is that the ISO-2022-JP (aka JIS) encoding sometimes causes spurious tags. The root cause is that characters bracketed by [esc]$B and [esc](J (contrary to what is indicated in the j_s_nightingale analysis of the problem) sometimes have an angle bracket (< or 0x3c) embedded in them. These are taken to be tags by the parser, instead of being considered strings.

The URL referenced has an ISO-8859-1 encoding (the default), but Japanese characters are intermixed on the page with English, using the JIS encoding. We detect failure by looking for weird tag names which were not correctly handled as string nodes.

Here is a partial dump of the page with escape sequences:

 0002420 1b 24 42 3f 79 4a 42 25 47 25 38 25 2b 25 61 43
 0002440 35 44 65 43 44 1b 28 4a 20 77 69 74 68 20 43 61
 ..
 0002720 6c 22 3e 4a 53 6b 79 1b 24 42 42 50 31 7e 25 5a
 0002740 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 3c
 ..
 0003060 20 69 1b 24 42 25 62 21 3c 25 49 42 50 31 7e 25
 0003100 5a 21 3c 25 38 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a
 ..
 0003220 1b 24 42 25 2d 25 3f 25 5e 25 2f 25 69 24 4e 25
 0003240 5b 21 3c 25 60 25 5a 21 3c 25 38 1b 28 4a 3c 2f
 ..
 0003320 6e 65 31 2e 70 6c 22 3e 1b 24 42 3d 60 48 77 43
 0003340 66 1b 28 4a 3c 2f 41 3e 3c 50 3e 0a 2d 2d 2d 2d
 ..
 0004400 46 6f 72 75 6d 20 30 30 39 20 28 1b 24 42 3e 21
 0004420 3c 6a 24 4b 31 4a 4a 21 44 2e 24 4a 24 49 1b 28
 0004440 4a 29 3c 2f 41 3e 3c 49 4d 47 20 53 52 43 3d 22

The fix proposed by j_s_nightingale is implemented to swallow JIS escape sequences in the string parser. Apparently the fix won't help EUC-JP and Shift-JIS though, so this may still be a problem. It's theoretically possible that JIS encoding, or another one, could be used as attribute names or values within tags as well, but this is considered improbable and is therefore not handled in the tag parser state machine.
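
The root cause described above can be illustrated with a small, self-contained sketch. It is not the parser's actual code: the class `JisEscapeSketch` and the method `embeddedAngleBrackets` are hypothetical names, and the scan only recognizes the two escape sequences mentioned in the text ([esc]$B and [esc](J). The sample bytes are copied from rows 0002720 and 0002740 of the dump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.htmlparser.
class JisEscapeSketch {
    static final int ESC = 0x1b;

    // Bytes from dump rows 0002720 and 0002740.
    static final int[] SAMPLE = {
        0x6c, 0x22, 0x3e, 0x4a, 0x53, 0x6b, 0x79, 0x1b,
        0x24, 0x42, 0x42, 0x50, 0x31, 0x7e, 0x25, 0x5a,
        0x21, 0x3c, 0x25, 0x38, 0x1b, 0x28, 0x4a, 0x3c,
        0x2f, 0x41, 0x3e, 0x3c, 0x50, 0x3e, 0x0a, 0x3c
    };

    // Returns the offsets of 0x3c bytes that lie between ESC $ B and ESC ( J.
    // Such bytes are part of double-byte characters, not tag openers.
    static List<Integer> embeddedAngleBrackets(int[] bytes) {
        List<Integer> hits = new ArrayList<>();
        boolean inJis = false;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == ESC && i + 2 < bytes.length) {
                if (bytes[i + 1] == 0x24 && bytes[i + 2] == 0x42) {
                    inJis = true;   // ESC $ B: switch to double-byte JIS
                    i += 2;
                    continue;
                }
                if (bytes[i + 1] == 0x28 && bytes[i + 2] == 0x4a) {
                    inJis = false;  // ESC ( J: switch back to single-byte
                    i += 2;
                    continue;
                }
            }
            if (inJis && bytes[i] == 0x3c) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Offset 17 holds a 0x3c that sits inside the escaped run.
        System.out.println("embedded '<' offsets: " + embeddedAngleBrackets(SAMPLE));
    }
}
```

In this sample the 0x3c at offset 17 sits inside the escaped run, while the 0x3c at offset 23 begins the genuine `</A>` tag that follows the closing escape sequence.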

Throws: ParserException

checkTagNames

public void checkTagNames(Node node)

Check the tag name for one of the ones expected on the page. Recursively check the children.
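
A recursive check of this kind might look roughly like the sketch below. It does not use the org.htmlparser `Node` interface: `SimpleNode`, the `EXPECTED` set, and `unexpectedTagNames` are hypothetical stand-ins chosen for illustration, and the expected tag names are assumed rather than taken from the real test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of org.htmlparser.
class TagNameCheckSketch {
    // Minimal stand-in for a parsed node: a tag name (null for text) plus children.
    static final class SimpleNode {
        final String tagName;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String tagName, SimpleNode... kids) {
            this.tagName = tagName;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    // Assumed list of tag names considered legitimate on the page.
    static final Set<String> EXPECTED =
        new HashSet<>(Arrays.asList("HTML", "BODY", "A", "P", "IMG"));

    // Collects every tag name in the tree that is not in EXPECTED.
    static List<String> unexpectedTagNames(SimpleNode root) {
        List<String> bad = new ArrayList<>();
        collect(root, bad);
        return bad;
    }

    private static void collect(SimpleNode node, List<String> bad) {
        if (node.tagName != null
                && !EXPECTED.contains(node.tagName.toUpperCase(Locale.ROOT))) {
            bad.add(node.tagName);
        }
        for (SimpleNode child : node.children) {
            collect(child, bad);  // recurse into the children
        }
    }

    public static void main(String[] args) {
        SimpleNode tree = new SimpleNode("HTML",
            new SimpleNode("BODY",
                new SimpleNode("A", new SimpleNode(null)),
                new SimpleNode("j$K1JJ!D.$J$I")));  // illustrative spurious name
        System.out.println(unexpectedTagNames(tree));
    }
}
```

Returning the list of offending names, rather than failing at the first one, is a design choice of this sketch; a test could simply assert that the list is empty.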

testConjoined

public void testConjoined()
                   throws ParserException

See bug #825820 Words conjoined

Throws: ParserException

testStackOverflow

public void testStackOverflow()
                       throws ParserException

Check for StackOverflow error.

Throws: ParserException

testJsp

public void testJsp()
             throws ParserException

See bug #880283 Character ">" erroneously inserted by Lexer

Throws: ParserException

testEscapedQuote

public void testEscapedQuote()
                      throws ParserException

See bug #899413 bug in javascript end detection.

Throws: ParserException

testCommentInScript

public void testCommentInScript()
                         throws ParserException

See bug #1227213 Particular SCRIPT tags close too late.

Throws: ParserException

testUrlInStyle

public void testUrlInStyle()
                    throws ParserException

See bug #1227213 Particular SCRIPT tags close too late. This was actually working prior to the patch, since the ScriptScanner didn't use smartquote processing. I'm not sure why jwilsonsprings1 said the patch worked for him. I can only assume he was mistaken in thinking it was the URL that caused the failure.

Throws: ParserException
