Working with Text |
You need to locate character boundaries if your application allows the end-user to highlight individual characters, or to move a cursor through text a character at a time. To create a BreakIterator that locates character boundaries, you invoke the getCharacterInstance method:

    BreakIterator characterIterator =
        BreakIterator.getCharacterInstance(currentLocale);

This type of BreakIterator detects boundaries between user characters, not just Unicode characters. User characters differ with language, but the BreakIterator
class can recognize these differences because it is locale-sensitive. A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters '\u0075' (u) and '\u0308' (the combining diaeresis). This isn't the best example, however, because the character ü may also be represented by the single Unicode character '\u00fc'. We'll draw upon the Arabic language for a more realistic example.

In Arabic the word for house is:
Although this word contains three user characters, it is composed of six Unicode characters:

    String house = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";

The Unicode characters at positions 1, 3, and 5 in the house string are diacritics. In Arabic, diacritics are required because they can alter the meanings of words. The diacritics in the example are non-spacing characters, since they appear above the base characters. In an Arabic word processor, you cannot move the cursor on the screen once for every Unicode character in the string. Instead, you must move it once for every user character, which may be composed of more than one Unicode character. Therefore, you must use a BreakIterator
to scan the user characters in the string.

The sample program, BreakIteratorDemo.java, creates a BreakIterator to scan Arabic characters. The program passes this BreakIterator, along with the String object created previously, to a method named listPositions:

    BreakIterator arCharIterator =
        BreakIterator.getCharacterInstance(new Locale("ar", "SA"));
    listPositions(house, arCharIterator);

The listPositions
method uses a BreakIterator to locate the character boundaries in the string. Note that the BreakIteratorDemo program assigns a particular string to the BreakIterator with the setText method. The program retrieves the first character boundary with the first method, and then invokes the next method until the constant BreakIterator.DONE is returned. The code for this routine is as follows:

    static void listPositions(String target, BreakIterator iterator) {
        // Point the iterator at the text to be scanned.
        iterator.setText(target);
        int boundary = iterator.first();
        // next() returns BreakIterator.DONE once the last boundary has been passed.
        while (boundary != BreakIterator.DONE) {
            System.out.println(boundary);
            boundary = iterator.next();
        }
    }

The listPositions
method prints out the following boundary positions for the user characters in the string house. Note that the positions of the diacritics (1, 3, 5) are not listed:

    0
    2
    4
    6
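
For readers who want to try this without the full BreakIteratorDemo.java, here is a minimal, self-contained sketch. It is not the tutorial's sample program: the class name CharacterBoundaries and the helper boundaries are hypothetical names chosen for illustration, the helper returns the offsets as a list instead of printing them, and Locale.forLanguageTag("ar-SA") is used in place of the new Locale("ar", "SA") constructor on the assumption that the two identify the same locale. It also scans the two representations of ü discussed earlier, using Locale.ROOT as an assumed neutral locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of the tutorial's sample code.
class CharacterBoundaries {

    // Collects every boundary offset that the iterator reports for the target string.
    static List<Integer> boundaries(String target, BreakIterator iterator) {
        List<Integer> result = new ArrayList<>();
        iterator.setText(target);
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // The six-character Arabic string from the text: three base letters,
        // each followed by one non-spacing diacritic.
        String house = "\u0628\u064e\u064a\u0652\u067a\u064f";
        BreakIterator arabic =
            BreakIterator.getCharacterInstance(Locale.forLanguageTag("ar-SA"));
        System.out.println(boundaries(house, arabic));        // [0, 2, 4, 6]

        // The letter u followed by a combining diaeresis: one user character, two code units.
        BreakIterator generic = BreakIterator.getCharacterInstance(Locale.ROOT);
        System.out.println(boundaries("u\u0308", generic));   // [0, 2]

        // The precomposed form: one user character, one code unit.
        System.out.println(boundaries("\u00fc", generic));    // [0, 1]
    }
}
```

Returning the offsets rather than printing them makes the result easy to reuse, for example to decide where a cursor may legally stop when the user presses an arrow key.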