How to write a scintilla lexer A lexer for a particular language determines how a specified range of text shall be colored. Writing a lexer is relatively straightforward because the lexer need only color given text. The harder job of determining how much text actually needs to be colored is handled by Scintilla itself, that is, the lexer's caller. Parameters The lexer for language LLL has the following prototype: static void ColouriseLLLDoc ( unsigned int startPos, int length, int initStyle, WordList *keywordlists[], Accessor &styler); The styler parameter is an Accessor object. The lexer must use this object to access the text to be colored. The lexer gets the character at position i using styler.SafeGetCharAt(i); The startPos and length parameters indicate the range of text to be recolored; the lexer must determine the proper color for all characters in positions startPos through startPos+length. The initStyle parameter indicates the initial state, that is, the state at the character before startPos. States also indicate the coloring to be used for a particular range of text. Note: the character at StartPos is assumed to start a line, so if a newline terminates the initStyle state the lexer should enter its default state (or whatever state should follow initStyle). The keywordlists parameter specifies the keywords that the lexer must recognize. A WordList class object contains methods that simplify the recognition of keywords. Present lexers use a helper function called classifyWordLLL to recognize keywords. These functions show how to use the keywordlists parameter to recognize keywords. This documentation will not discuss keywords further. The lexer code The task of a lexer can be summarized briefly: for each range r of characters that are to be colored the same, the lexer should call styler.ColourTo(i, state) where i is the position of the last character of the range r. The lexer should set the state variable to the coloring state of the character at position i and continue until the entire text has been colored. Note 1: the styler (Accessor) object remembers the i parameter in the previous calls to styler.ColourTo, so the single i parameter suffices to indicate a range of characters. Note 2: As a side effect of calling styler.ColourTo(i,state), the coloring states of all characters in the range are remembered so that Scintilla may set the initStyle parameter correctly on future calls to the lexer. Lexer organization There are at least two ways to organize the code of each lexer. Present lexers use what might be called a "character-based" approach: the outer loop iterates over characters, like this: lengthDoc = startPos + length ; for (unsigned int i = startPos; i < lengthDoc; i++) { chNext = styler.SafeGetCharAt(i + 1); << handle special cases >> switch(state) { // Handlers examine only ch and chNext. // Handlers call styler.ColorTo(i,state) if the state changes. case state_1: << handle ch in state 1 >> case state_2: << handle ch in state 2 >> ... case state_n: << handle ch in state n >> } chPrev = ch; } styler.ColourTo(lengthDoc - 1, state); An alternative would be to use a "state-based" approach. The outer loop would iterate over states, like this: lengthDoc = startPos+lenth ; for ( unsigned int i = startPos ;; ) { char ch = styler.SafeGetCharAt(i); int new_state = 0 ; switch ( state ) { // scanners set new_state if they set the next state. case state_1: << scan to the end of state 1 >> break ; case state_2: << scan to the end of state 2 >> break ; case default_state: << scan to the next non-default state and set new_state >> } styler.ColourTo(i, state); if ( i >= lengthDoc ) break ; if ( ! new_state ) { ch = styler.SafeGetCharAt(i); << set state based on ch in the default state >> } } styler.ColourTo(lengthDoc - 1, state); This approach might seem to be more natural. State scanners are simpler than character scanners because less needs to be done. For example, there is no need to test for the start of a C string inside the scanner for a C comment. Also this way makes it natural to define routines that could be used by more than one scanner; for example, a scanToEndOfLine routine. However, the special cases handled in the main loop in the character-based approach would have to be handled by each state scanner, so both approaches have advantages. These special cases are discussed below. Special case: Lead characters Lead bytes are part of DBCS processing for languages such as Japanese using an encoding such as Shift-JIS. In these encodings, extended (16-bit) characters are encoded as a lead byte followed by a trail byte. Lead bytes are rarely of any lexical significance, normally only being allowed within strings and comments. In such contexts, lexers should ignore ch if styler.IsLeadByte(ch) returns TRUE. Note: UTF-8 is simpler than Shift-JIS, so no special handling is applied for it. All UTF-8 extended characters are >= 128 and none are lexically significant in programming languages which, so far, use only characters in ASCII for operators, comment markers, etc. Special case: Folding Folding may be performed in the lexer function. It is better to use a separate folder function as that avoids some troublesome interaction between styling and folding. The folder function will be run after the lexer function if folding is enabled. The rest of this section explains how to perform folding within the lexer function. During initialization, lexers that support folding set bool fold = styler.GetPropertyInt("fold"); If folding is enabled in the editor, fold will be TRUE and the lexer should call: styler.SetLevel(line, level); at the end of each line and just before exiting. The line parameter is simply the count of the number of newlines seen. It's initial value is styler.GetLine(startPos) and it is incremented (after calling styler.SetLevel) whenever a newline is seen. The level parameter is the desired indentation level in the low 12 bits, along with flag bits in the upper four bits. The indentation level depends on the language. For C++, it is incremented when the lexer sees a '{' and decremented when the lexer sees a '}' (outside of strings and comments, of course). The following flag bits, defined in Scintilla.h, may be set or cleared in the flags parameter. The SC_FOLDLEVELWHITEFLAG flag is set if the lexer considers that the line contains nothing but whitespace. The SC_FOLDLEVELHEADERFLAG flag indicates that the line is a fold point. This normally means that the next line has a greater level than present line. However, the lexer may have some other basis for determining a fold point. For example, a lexer might create a header line for the first line of a function definition rather than the last. The SC_FOLDLEVELNUMBERMASK mask denotes the level number in the low 12 bits of the level param. This mask may be used to isolate either flags or level numbers. For example, the C++ lexer contains the following code when a newline is seen: if (fold) { int lev = levelPrev; // Set the "all whitespace" bit if the line is blank. if (visChars == 0) lev |= SC_FOLDLEVELWHITEFLAG; // Set the "header" bit if needed. if ((levelCurrent > levelPrev) && (visChars > 0)) lev |= SC_FOLDLEVELHEADERFLAG; styler.SetLevel(lineCurrent, lev); // reinitialize the folding vars describing the present line. lineCurrent++; visChars = 0; // Number of non-whitespace characters on the line. levelPrev = levelCurrent; } The following code appears in the C++ lexer just before exit: // Fill in the real level of the next line, keeping the current flags // as they will be filled in later. if (fold) { // Mask off the level number, leaving only the previous flags. int flagsNext = styler.LevelAt(lineCurrent); flagsNext &= ~SC_FOLDLEVELNUMBERMASK; styler.SetLevel(lineCurrent, levelPrev | flagsNext); } Don't worry about performance The writer of a lexer may safely ignore performance considerations: the cost of redrawing the screen is several orders of magnitude greater than the cost of function calls, etc. Moreover, Scintilla performs all the important optimizations; Scintilla ensures that a lexer will be called only to recolor text that actually needs to be recolored. Finally, it is not necessary to avoid extra calls to styler.ColourTo: the sytler object buffers calls to ColourTo to avoid multiple updates of the screen. Page contributed by Edward K. Ream