org.apache.lucene.analysis.icu.segmentation
Class LaoBreakIterator
java.lang.Object
com.ibm.icu.text.BreakIterator
org.apache.lucene.analysis.icu.segmentation.LaoBreakIterator
- All Implemented Interfaces:
- Cloneable
public class LaoBreakIterator
- extends com.ibm.icu.text.BreakIterator
Syllable iterator for Lao text.
This breaks Lao text into syllables according to:
Syllabification of Lao Script for Line Breaking
Phonpasit Phissamay, Valaxay Dalolay, Chitaphone Chanhsililath, Oulaiphone Silimasak,
Sarmad Hussain, Nadir Durrani, Science Technology and Environment Agency, CRULP.
- http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
- http://www.panl10n.net/Presentations/Cambodia/Phonpassit/LineBreakingAlgo.pdf
Most of the work is done by RBBI (RuleBasedBreakIterator) rules. Some additional special logic
cannot be expressed in a grammar, so it is implemented here.
For example, what appears to be a final consonant might instead be part of the next syllable.
Because the rules match greedily, the leftover text can be an illegal sequence that matches no rule.
Take, for instance, the text ກວ່າດອກ.
The first rule greedily matches ກວ່າດ, but the remainder ອກ is illegal.
What LaoBreakIterator does, according to the paper:
1. Backtrack and remove the ດ from the last syllable, placing it on the current syllable.
2. Verify the modified previous syllable (ກວ່າ) is still legal.
3. Verify the modified current syllable (ດອກ) is now legal.
4. If step 2 or 3 fails, restore the ດ to the last syllable and skip the current character.
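The backtracking steps above can be illustrated with a small sketch. This is not the class's actual implementation: the class name LaoBacktrackSketch, the method repair, and the isLegal predicate are hypothetical, and syllable legality is left to the caller instead of being decided by RBBI rules. Lao letters are written as Unicode escapes so the snippet compiles regardless of source encoding.

```java
import java.util.function.Predicate;

final class LaoBacktrackSketch {
    /**
     * Tries to move the last char of the previous syllable onto the front of
     * the current (illegal) one. Returns {newPrevious, newCurrent} when both
     * parts are legal, or null when the move must be undone.
     * isLegal is an assumed stand-in for the rule-based legality check.
     */
    static String[] repair(String previous, String current, Predicate<String> isLegal) {
        if (previous.length() < 2) {
            return null; // nothing left of the previous syllable after the move
        }
        // Step 1: take the trailing consonant off the previous syllable.
        int cut = previous.length() - 1;
        String newPrevious = previous.substring(0, cut);
        String newCurrent = previous.substring(cut) + current;
        // Steps 2 and 3: both modified syllables must be legal.
        if (isLegal.test(newPrevious) && isLegal.test(newCurrent)) {
            return new String[] { newPrevious, newCurrent };
        }
        // Step 4: signal failure; the caller keeps the original syllable
        // and skips the current character.
        return null;
    }
}
```

With a toy predicate that accepts only ກວ່າ, ດອກ and ກວ່າດ, calling repair with ກວ່າດ and ອກ yields the pair ກວ່າ and ດອກ, matching the example in the text.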
Finally, LaoBreakIterator also takes care of the second concern mentioned in the paper.
This is the issue of combining marks being in the wrong order (typos).
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
Constructor Summary
LaoBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rules)
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getBreakInstance, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, preceding, registerInstance, registerInstance, unregister
LaoBreakIterator
public LaoBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rules)
current
public int current()
- Specified by:
current in class com.ibm.icu.text.BreakIterator
first
public int first()
- Specified by:
first in class com.ibm.icu.text.BreakIterator
following
public int following(int offset)
- Specified by:
following in class com.ibm.icu.text.BreakIterator
getText
public CharacterIterator getText()
- Specified by:
getText in class com.ibm.icu.text.BreakIterator
last
public int last()
- Specified by:
last in class com.ibm.icu.text.BreakIterator
next
public int next()
- Specified by:
next in class com.ibm.icu.text.BreakIterator
next
public int next(int n)
- Specified by:
next in class com.ibm.icu.text.BreakIterator
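The first(), next() and DONE members follow the usual BreakIterator walking protocol. The sketch below shows that loop using java.text.BreakIterator from the JDK as a stand-in, since it has the same first()/next()/DONE shape and needs no extra dependency. With ICU on the classpath, the same loop typed against com.ibm.icu.text.BreakIterator should accept a LaoBreakIterator; that substitution is an assumption and is not exercised here. The class name BoundaryWalkSketch and the method segments are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

final class BoundaryWalkSketch {
    /** Collects the text between each pair of consecutive boundaries. */
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns DONE once the last boundary has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }
}
```

Concatenating the returned segments in order reproduces the input text, which is a convenient sanity check for any iterator passed in.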
previous
public int previous()
- Specified by:
previous in class com.ibm.icu.text.BreakIterator
setText
public void setText(CharacterIterator text)
- Specified by:
setText in class com.ibm.icu.text.BreakIterator
setText
public void setText(String newText)
- Overrides:
setText in class com.ibm.icu.text.BreakIterator
clone
public Object clone()
- Clone method. Creates another LaoBreakIterator with the same behavior
and current state as this one.
- Overrides:
clone in class com.ibm.icu.text.BreakIterator
- Returns:
- The clone.