Rich Text Document Structure

Text documents are represented by the QTextDocument class, which holds information about the document's internal representation and structure, and keeps track of modifications to provide undo/redo facilities.

The structured representation of a text document presents its contents as a hierarchy of text blocks, frames, tables, and other objects. These provide a logical structure to the document and describe how their contents will be displayed. Generally, frames and tables are used to group other structures while text blocks contain the actual textual information.

New elements are created and inserted into the document programmatically with a QTextCursor or by using an editor widget, such as QTextEdit. Elements can be given a particular format when they are created; otherwise they take the cursor's current format for the element.

Basic Structure

The "top level" of a document always has the same basic shape: each document contains a root frame, and this always contains at least one text block.

For documents with some textual content, the root frame usually contains a sequence of blocks and other elements.

Sequences of frames and tables are always separated by text blocks in a document, even if the text blocks contain no information. This ensures that new elements can always be inserted between existing structures.

In this chapter, we look at each of the structural elements used in a rich text document, outline their features and uses, and show how to examine their contents. Document editing is described in The QTextCursor Interface.

Rich Text Documents

QTextDocument objects contain all the information required to construct rich text documents. Text documents can be accessed in two complementary ways: as a linear buffer for editors to use, and as an object hierarchy that is useful to layout engines. In the hierarchical document model, objects generally correspond to visual elements such as frames, tables, and lists. At a lower level, these elements describe properties such as the text style and alignment. The linear representation of the document is used for editing and manipulation of the document's contents.

Although QTextEdit makes it easy to display and edit rich text, documents can also be used independently of any editor widget, for example:

     QTextDocument *newDocument = new QTextDocument;

Alternatively, they can be extracted from an existing editor:

     QTextEdit *editor = new QTextEdit;
     QTextDocument *editorDocument = editor->document();

This flexibility enables applications to handle multiple rich text documents without the overhead of multiple editor widgets, and without requiring documents to be stored in some intermediate format.

An empty document contains a root frame which itself contains a single empty text block. Frames provide logical separation between parts of the document, but also have properties that determine how they will appear when rendered. A table is a specialized type of frame that consists of a number of cells, arranged into rows and columns, each of which can contain further structure and text. Tables provide management and layout features that allow flexible configurations of cells to be created.

Text blocks contain text fragments, each of which specifies text and character format information. Textual properties are defined both at the character level and at the block level. At the character level, properties such as font family, text color, and font weight can be specified. The block level properties control the higher level appearance and behavior of the text, such as the direction of text flow, alignment, and background color.

The document structure is not manipulated directly. Editing is performed through a cursor-based interface. The text cursor interface automatically inserts new document elements into the root frame, and ensures that it is padded with empty blocks where necessary.

We obtain the root frame in the following manner:

     QTextDocument *editorDocument = editor->document();
     QTextFrame *root = editorDocument->rootFrame();

When navigating the document structure, it is useful to begin at the root frame because it provides access to the entire document structure.

Document Elements

Rich text documents usually consist of common elements such as paragraphs, frames, tables, and lists. These are represented in a QTextDocument by the QTextBlock, QTextFrame, QTextTable, and QTextList classes. Unlike the other elements in a document, images are represented by specially formatted text fragments. This enables them to be placed inline with the surrounding text.

The basic structural building blocks in documents are QTextBlock and QTextFrame. Blocks themselves contain fragments of rich text (QTextFragment), but these do not directly influence the high level structure of a document.

Elements which can group together other document elements are typically subclasses of QTextObject, and fall into two categories: Elements that group together text blocks are subclasses of QTextBlockGroup, and those that group together frames and other elements are subclasses of QTextFrame.

Text Blocks

Text blocks are provided by the QTextBlock class.

Text blocks group together fragments of text with different character formats, and are used to represent paragraphs in the document. Each block typically contains a number of text fragments with different styles. Fragments are created when text is inserted into the document, and more of them are added when the document is edited. The document splits, merges, and removes fragments to efficiently represent the different styles of text in the block.

The fragments within a given block can be examined by using a QTextBlock::iterator to traverse the block's internal structure:

     QTextBlock::iterator it;
     for (it = currentBlock.begin(); !(it.atEnd()); ++it) {
         QTextFragment currentFragment = it.fragment();
         if (currentFragment.isValid())
             processFragment(currentFragment);
     }

Blocks are also used to represent list items. As a result, blocks can define their own character formats which contain information about block-level decoration, such as the type of bullet points used for list items. The formatting for the block itself is described by the QTextBlockFormat class, and describes properties such as text alignment, indentation, and background color.

Although a given document may contain complex structures, once we have a reference to a valid block in the document, we can navigate through the text blocks in the order in which they appear in the document:

     QTextBlock currentBlock = textDocument->begin();

     while (currentBlock.isValid()) {
         processBlock(currentBlock);
         currentBlock = currentBlock.next();
     }

This approach is useful when you want to extract just the rich text from a document, because it ignores frames, tables, and other types of structure.

QTextBlock provides comparison operators that make it easier to manipulate blocks: operator==() and operator!=() are used to test whether two blocks are the same, and operator<() is used to determine which one occurs first in a document.

Frames

Frames are provided by the QTextFrame class.

Text frames group together blocks of text and child frames, creating document structures that are larger than paragraphs. The format of a frame specifies how it is rendered and positioned on the page. Frames are either inserted into the text flow, or they float on the left or right hand side of the page. Each document contains a root frame that contains all the other document elements. As a result, all frames except the root frame have a parent frame.

Since text blocks are used to separate other document elements, each frame will always contain at least one text block, and zero or more child frames. We can inspect the contents of a frame by using a QTextFrame::iterator to traverse the frame's child elements:

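     QTextFrame::iterator it;
     for (it = frame->begin(); !(it.atEnd()); ++it) {

         // The iterator refers to either a child frame or a text block;
         // the accessor for the other kind returns a null or invalid value.
         QTextFrame *childFrame = it.currentFrame();
         QTextBlock childBlock = it.currentBlock();

         if (childFrame)
             processFrame(childFrame);
         else if (childBlock.isValid())
             processBlock(childBlock);
     }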

Note that the iterator selects both frames and blocks, so it is necessary to check which kind of element it currently refers to. This allows us to navigate the document structure on a frame-by-frame basis yet still access text blocks if required. Both the QTextBlock::iterator and QTextFrame::iterator classes can be used in complementary ways to extract the required structure from a document.

Tables

Tables are provided by the QTextTable class.

Tables are collections of cells that are arranged in rows and columns. Each table cell is a document element with its own character format, but it can also contain other elements, such as frames and text blocks. Table cells are automatically created when the table is constructed, or when extra rows or columns are added. They can also be moved between tables.

QTextTable is a subclass of QTextFrame, so tables are treated like frames in the document structure. For each frame that we encounter in the document, we can test whether it represents a table, and deal with it in a different way:

     QTextFrame::iterator it;
     for (it = frame->begin(); !(it.atEnd()); ++it) {

         QTextFrame *childFrame = it.currentFrame();
         QTextBlock childBlock = it.currentBlock();

         if (childFrame) {
             QTextTable *childTable = qobject_cast<QTextTable*>(childFrame);

             if (childTable)
                 processTable(childTable);
             else
                 processFrame(childFrame);

         } else if (childBlock.isValid()) {
             processBlock(childBlock);
         }
     }

The cells within an existing table can be examined by iterating through the rows and columns:

     for (int row = 0; row < table->rows(); ++row) {
         for (int column = 0; column < table->columns(); ++column) {
             QTextTableCell tableCell = table->cellAt(row, column);
             processTableCell(tableCell);
         }
     }

Lists

Lists are provided by the QTextList class.

Lists are sequences of text blocks that are formatted in the usual way, but which also provide the standard list decorations such as bullet points and enumerated items. Lists can be nested, and will be indented if the list's format specifies a non-zero indentation.

We can refer to each list item by its index in the list:

     for (int index = 0; index < list->count(); ++index) {
         QTextBlock listItem = list->item(index);
         processListItem(listItem);
     }

Since QTextList is a subclass of QTextBlockGroup, it does not group the list items as child elements, but instead provides various functions for managing them. This means that any text block we find when traversing a document may actually be a list item. We can ensure that list items are correctly identified by using the following code:

     QTextFrame::iterator it;
     for (it = frame->begin(); !(it.atEnd()); ++it) {

         QTextBlock block = it.currentBlock();

         if (block.isValid()) {

             QTextList *list = block.textList();

             if (list) {
                 int index = list->itemNumber(block);
                 processListItem(list, index);
             }
         }
     }

Images

Images in QTextDocument are represented by text fragments that reference external images via the resource mechanism. Images are created using the cursor interface, and can be modified later by changing the character format of the image's text fragment:

     if (fragment.isValid()) {
         QTextImageFormat newImageFormat = fragment.charFormat().toImageFormat();

         // The converted format is only valid if the fragment is an image.
         if (newImageFormat.isValid()) {
             newImageFormat.setName(":/images/newimage.png");
             QTextCursor helper = cursor;

             // Select exactly the image fragment, then apply the new format.
             helper.setPosition(fragment.position());
             helper.setPosition(fragment.position() + fragment.length(),
                                 QTextCursor::KeepAnchor);
             helper.setCharFormat(newImageFormat);
         }
     }

The fragment that represents the image can be found by iterating over the fragments in the text block that contains the image.