Boost.Locale
Boost.Locale provides a boundary analysis tool, allowing you to split text into characters, words, or sentences. It is commonly used to find appropriate places for line breaks.
Boost.Locale provides two major classes for boundary analysis: segment_index and boundary_point_index. Each of these classes uses an iterator type as a template parameter. Both accept in their constructors the boundary type to analyze (such as word or sentence), the pair of iterators delimiting the text, and the locale to use.
For example:
Each class implements begin(), end() and find() members, making it possible to iterate over the selected segments or boundaries in the text, or to find the location of a segment or boundary for a given iterator.
Convenience typedefs like ssegment_index or wcboundary_point_index are provided as well, where the "w", "u16" and "u32" prefixes select the character types wchar_t, char16_t and char32_t, and the "c" and "s" prefixes define whether CharType const * or std::basic_string<CharType>::const_iterator is used as the underlying iterator.
Text segment analysis is done using the segment_index class. It provides a bidirectional iterator that returns a segment object. The segment object represents a pair of iterators that define the segment, together with the rule according to which it was selected, and it can be automatically converted to a std::basic_string object.
To perform boundary analysis, we first create an index object and then iterate over it:
For example:
Would print:
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
For example, the sentence "生きるか死ぬか、それが問題だ。" (from the Tatoeba database) would be split into the following segments in the ja_JP.UTF-8 (Japanese) locale:
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
The boundary analysis performed by Boost.Locale is much more sophisticated than merely splitting text on white space characters, although it is not always perfect.
The segments selection can be customized using rule() and full_select() member functions.
By default, segment_index's iterator returns each text segment defined by two boundary points, regardless of the way they were selected. Thus, in the example above, we could see text segments like "." or " " that were selected as words.
Using the rule() member function, we can specify a binary mask of rules we want to use for the selection of boundary points, using word, line and sentence boundary rules.
For example, by calling map.rule(word_any) before starting the iteration process, we specify a selection mask that fetches numbers, letters, Kana letters and ideographic characters, ignoring all non-word characters like white space or punctuation marks.
So the code:
Would print:
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
And for the given text "生きるか死ぬか、それが問題だ。" and rule(word_ideo), the example above would print:
"生", "死", "問題",
You can determine why a segment was selected by using the segment::rule() member function. The return value is a bit-mask of rules.
For example:
Would print:
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
Note that rules are applied to the end boundary of a segment when deciding whether to include a segment. In some cases this can cause unexpected behavior.
For example, consider the text:
Hello! How\nare you?
Suppose we want to fetch all sentences from the text.
The sentence rules have two options: sentence_term, which selects boundaries created by sentence terminators like ".!?", and sentence_sep, which selects boundaries created by sentence separators like the line feed character.
Naturally, to ignore sentence separators, we would call segment_index::rule(rule_type v) with the sentence_term parameter and then run the iterator.
Would result in:
Sentence [Hello! ]
Sentence [are you? ]
These (potentially unexpected) results occur because "How\n" is still considered a sentence but is selected by a different rule.
This behavior can be changed by setting segment_index::full_select(bool) to true. It forces the iterator to join the current segment with all previous segments, even if they do not fit the required rule.
So we add the line map.full_select(true); right after "map.rule(sentence_term);" and get the expected output:
Sentence [Hello! ]
Sentence [How are you? ]
Sometimes it is useful to find a segment that some specific iterator is pointing to.
For example, suppose we want to find the word a user clicked on.
segment_index provides find(base_iterator p) member function for this purpose.
This function returns an iterator to the segment that includes p.
For example:
Would print:
be
If the iterator is inside a segment, that segment is returned. If that segment does not fit the selection rules, the first segment following the requested position that does fit the rules is returned.
For example, for word boundary analysis with the word_any rule:
The boundary_point_index is similar to segment_index in its interface, but it has a different role. Instead of returning text chunks (segments), it returns a boundary_point object that represents a position in the text: a base iterator pointing into the source text's C++ characters. The boundary_point object also provides a rule() member function that returns the reason this boundary was selected, i.e. the matched rule.
Let's see an example of selecting the first two sentences from a text:
Would print:
First two sentences are: First sentence. Second sentence!
Just like segment_index, boundary_point_index provides a rule(rule_type mask) member function to filter the boundary points that interest us. It allows setting word, line and sentence rules for filtering boundary points.
Let's change the example above a bit:
If we run our program as is on this sample, we would get:
First two sentences are: First sentence. Second
This is not really what we expected, because "Second\n" is considered an independent sentence that was separated by a line separator, "Line Feed".
However, we can set the rule sentence_term, and the iterator will use only boundary points that are created by sentence terminators like ".!?".
So by adding map.rule(sentence_term); right after the generation of the index, we would get the desired output:
First two sentences are: First sentence. Second sentence!
You can also use the boundary_point::rule() member function to learn why a boundary point was created, by comparing it with an appropriate mask.
For example:
Would give the following output:
There is a sentence terminator: [First sentence. |Second sentence! Third one?]
There is a sentence separator: [First sentence. Second |sentence! Third one?]
There is a sentence terminator: [First sentence. Second sentence! |Third one?]
There is a sentence terminator: [First sentence. Second sentence! Third one?|]
Sometimes it is useful to find a specific boundary point according to a given iterator.
boundary_point_index provides an iterator find(base_iterator p) member function. It returns the boundary point at p, or at the location following p if p does not point to an appropriate position.
For example, for word boundary analysis:
For example, if we want to select six words around a specific boundary point, we can use the following code:
This would print:
be or not to be, that