Lexer changes

To fix some problems with lexing in Scintilla and to add new capabilities, there are going to be some major changes. It is likely these will go into the release after the next one. That release will be called 2.20 as the changes are not completely backwards compatible.

Single lexer per document

A problem with current Scintilla is that lexers and lexer options such as properties and keywords are attached to the view (ScintillaBase) object rather than the Document object. When two views are showing one document, it is possible for two different lexers to be called to style the text, leading to arbitrary and confusing results.

To fix this, lexer state is being moved from ScintillaBase to Document, although the state is still set up through ScintillaBase because it provides the API to client code.

This will change the scope of some settings so may require changes to applications. Applications that only set up properties or word lists at initialisation or when changing languages will have to repeat these for each document. Conversely, there will no longer be a need to set parameters for each view on a document or when switching between documents on a view since documents retain settings.
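The change of scope can be pictured with a toy model. This is only a sketch: LexState, DocumentModel and ViewModel are hypothetical names invented for illustration and are not Scintilla's classes. It shows two views on one document seeing the same lexer settings, and a view picking up a document's settings again when it switches back to it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; these are not Scintilla's classes.
struct LexState {
    std::string language;
    std::map<std::string, std::string> properties;
};

struct DocumentModel {
    LexState lexState;  // lexer settings live with the document
};

class ViewModel {
    std::shared_ptr<DocumentModel> doc;
public:
    explicit ViewModel(std::shared_ptr<DocumentModel> d) : doc(std::move(d)) {
        assert(doc);  // a view always shows some document
    }
    void SetDocument(std::shared_ptr<DocumentModel> d) {
        assert(d);
        doc = std::move(d);
    }
    // The view still offers the API but forwards the state to the document.
    void SetLexerLanguage(const std::string &name) {
        doc->lexState.language = name;
    }
    void SetProperty(const std::string &key, const std::string &value) {
        doc->lexState.properties[key] = value;
    }
    const std::string &LexerLanguage() const {
        return doc->lexState.language;
    }
    // Returns an empty string for a property that was never set.
    std::string Property(const std::string &key) const {
        const auto it = doc->lexState.properties.find(key);
        return it == doc->lexState.properties.end() ? std::string() : it->second;
    }
};
```

In this model a setting made through one view is visible through every other view of the same document, and a freshly created document starts with no settings, which is why an application would have to repeat its set-up for each document.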

Stateful lexers

Some languages may benefit from features like styling local variables differently to global variables or showing fields that are not present in a structure in an error style. These sorts of features require that something like a symbol table is maintained by the lexer.

Lexers currently have only limited space to store information about each document: the document's style bytes and line state (a single integer per line). There are some other locations that could be used, like unused bits in folding state, but using these for lexer state may not be compatible with future changes. With only these features, implementing a symbol table is too difficult.

The solution is to create a lexer object which can contain arbitrary additional data. Each document has a separate lexer object. Lexer objects implement the ILexer interface.

ILexer

class ILexer {
public:
    virtual int SCI_METHOD Version() const = 0;
    virtual void SCI_METHOD Release() = 0;
    virtual const char * SCI_METHOD PropertyNames() = 0;
    virtual int SCI_METHOD PropertyType(const char *name) = 0;
    virtual const char * SCI_METHOD DescribeProperty(const char *name) = 0;
    virtual int SCI_METHOD PropertySet(const char *key, const char *val) = 0;
    virtual const char * SCI_METHOD DescribeWordListSets() = 0;
    virtual int SCI_METHOD WordListSet(int n, const char *wl) = 0;
    virtual void SCI_METHOD Lex(unsigned int startPos, int lengthDoc, int initStyle, IDocument *pAccess) = 0;
    virtual void SCI_METHOD Fold(unsigned int startPos, int lengthDoc, int initStyle, IDocument *pAccess) = 0;
    virtual void * SCI_METHOD PrivateCall(int operation, void *pointer) = 0;
};

The lexer object may contain any data required for the functioning of the lexer. This can include information extracted from the document such as a list of functions.
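As a rough illustration of a lexer object carrying data extracted from the document, the sketch below keeps a list of function names between calls. It is not a real ILexer implementation: FunctionListLexer, the style numbers and the toy "def NAME" syntax are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical style numbers used only by this sketch.
enum { styleDefault = 0, styleKnownFunction = 1 };

// Reduced stand-in for a lexer object; it only shows where the state lives.
class FunctionListLexer {
    std::set<std::string> functions;  // survives between Lex calls
public:
    // Splits text into whitespace separated words, records every word that
    // follows "def" as a function name, then returns one style per word.
    std::vector<int> Lex(const std::string &text) {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string word; in >> word;) {
            words.push_back(word);
        }
        for (std::size_t i = 0; i + 1 < words.size(); i++) {
            if (words[i] == "def") {
                functions.insert(words[i + 1]);
            }
        }
        std::vector<int> styles;
        for (const std::string &word : words) {
            styles.push_back(functions.count(word) ? styleKnownFunction : styleDefault);
        }
        return styles;
    }
    bool Knows(const std::string &name) const {
        return functions.count(name) != 0;
    }
};
```

A name recorded during one call is still known on the next call, which is the kind of behaviour that style bytes and line state alone could not support.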

There is currently no way for a lexer to indicate that changing a property or keyword list should cause restyling. In SciTE, for example, you can add a keyword to keywordclass.cpp, then return to a C++ document and not see any change to existing styles. Only lexing done after the addition will use the new keywords.

Lexer objects will be responsible for storing properties and word lists. They provide PropertySet and WordListSet methods to receive these parameters. Each method returns the position from which lexing should be restarted (normally 0, although lexers may be more intelligent about this) or -1 if the change does not affect lexing or folding.
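The return value convention for PropertySet and WordListSet might be handled along these lines. This is a minimal sketch: LexerOptions is a hypothetical helper rather than part of Scintilla, and it takes the simplest policy of restarting from position 0 whenever a stored value actually changes.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper, not part of Scintilla: stores properties and word
// lists and reports where lexing should restart.
class LexerOptions {
    std::map<std::string, std::string> properties;
    std::map<int, std::string> wordLists;
public:
    // Returns 0 (restart from the start of the document) when the stored
    // value changes and -1 when it does not. A property that was never set
    // is treated as empty, so setting it to "" reports no change.
    int PropertySet(const char *key, const char *val) {
        assert(key && val);
        std::string &slot = properties[key];
        if (slot == val) {
            return -1;
        }
        slot = val;
        return 0;
    }
    // Same convention for word list number n.
    int WordListSet(int n, const char *wl) {
        assert(wl);
        std::string &slot = wordLists[n];
        if (slot == wl) {
            return -1;
        }
        slot = wl;
        return 0;
    }
};
```

A more intelligent lexer could return a later position, for example the first place in the document where a changed keyword occurs.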

Release is called to destroy the lexer object.

PrivateCall allows for direct communication between the application and a lexer. An example would be where an application maintains a single large data structure containing symbolic information about system headers (like Windows.h) and provides this to the lexer where it can be applied to each document. This avoids the cost of constructing the system header information for each document. PrivateCall is invoked with the SCI_PRIVATELEXERCALL API.
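The shared system header example could look something like the following sketch. SharedSymbols, opSetSharedSymbols and SymbolAwareLexer are hypothetical names; the operation number is an assumption that an application and its lexer would agree on privately. Only the shape of PrivateCall (an int operation and a void pointer) comes from the interface above.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical table owned by the application and built once.
struct SharedSymbols {
    std::set<std::string> names;
};

// Hypothetical operation code agreed between the application and the lexer.
const int opSetSharedSymbols = 1;

class SymbolAwareLexer {
    const SharedSymbols *shared = nullptr;  // borrowed from the application, not owned
public:
    // Same parameters as ILexer::PrivateCall, without SCI_METHOD.
    // Unknown operations are ignored.
    void *PrivateCall(int operation, void *pointer) {
        if (operation == opSetSharedSymbols) {
            shared = static_cast<const SharedSymbols *>(pointer);
        }
        return nullptr;
    }
    bool IsSystemSymbol(const std::string &name) const {
        return shared && shared->names.count(name) != 0;
    }
};
```

Each document's lexer object holds only a pointer, so the table is constructed once no matter how many documents are open. The application must keep the table alive for as long as any lexer refers to it.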

IDocument

Currently lexers interact with the document through a concrete class derived from the Accessor abstract base class with Accessor providing most of the functionality and the derived class implementing communication with the document. This is either direct (DocumentAccessor) for lexers linked into Scintilla or through messages (WindowAccessor) for external lexers. In the new scheme, the only way of performing communications with the document is through the IDocument interface which can be used for external lexers as well as lexers linked into Scintilla.

This avoids dependence on GUI windowing code and makes it easier to move lexers between being housed in shared libraries and being linked into Scintilla. It could also be used for lexers housed within application code although this has not yet been implemented.

class IDocument {
public:
    virtual int SCI_METHOD Version() const = 0;
    virtual void SCI_METHOD SetErrorStatus(int status) = 0;
    virtual int SCI_METHOD Length() const = 0;
    virtual void SCI_METHOD GetCharRange(char *buffer, int position, int lengthRetrieve) const = 0;
    virtual char SCI_METHOD StyleAt(int position) const = 0;
    virtual int SCI_METHOD LineFromPosition(int position) const = 0;
    virtual int SCI_METHOD LineStart(int line) const = 0;
    virtual int SCI_METHOD GetLevel(int line) const = 0;
    virtual int SCI_METHOD SetLevel(int line, int level) = 0;
    virtual int SCI_METHOD GetLineState(int line) const = 0;
    virtual int SCI_METHOD SetLineState(int line, int state) = 0;
    virtual void SCI_METHOD StartStyling(int position, char mask) = 0;
    virtual bool SCI_METHOD SetStyleFor(int length, char style) = 0;
    virtual bool SCI_METHOD SetStyles(int length, const char *styles) = 0;
    virtual void SCI_METHOD DecorationSetCurrentIndicator(int indicator) = 0;
    virtual void SCI_METHOD DecorationFillRange(int position, int value, int fillLength) = 0;
    virtual void SCI_METHOD ChangeLexerState(int start, int end) = 0;
    virtual int SCI_METHOD CodePage() const = 0;
    virtual bool SCI_METHOD IsDBCSLeadByte(char ch) const = 0;
};

Since IDocument is an interface, it can be used across build boundaries (such as between two DLLs) where the implementation cannot be seen from the client and so cannot be optimized by the compiler. This gets in the way of efficient buffering, so the task of buffering is moved to a helper class that is local to the lexer. Example helper classes are the simple LexAccessor and its subclass Accessor which provides more services. These may be used by lexers or lexers may create their own helper classes.
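A lexer-local buffering helper might be sketched as below. This is not LexAccessor: ICharSource is a reduced stand-in for IDocument with only Length and GetCharRange, and BufferedReader, its tiny buffer size and the StringSource test double are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <utility>

// Reduced stand-in for IDocument with just the two methods the sketch needs.
class ICharSource {
public:
    virtual ~ICharSource() {}
    virtual int Length() const = 0;
    virtual void GetCharRange(char *buffer, int position, int lengthRetrieve) const = 0;
};

// Hypothetical helper local to the lexer: one virtual call fetches a block
// of characters, after which reads are served from the local buffer.
class BufferedReader {
    enum { bufferSize = 16 };  // deliberately tiny so refills are easy to see
    const ICharSource &source;
    char buf[bufferSize];
    int startPos = 0;
    int endPos = 0;
    int fills = 0;
    void Fill(int position) {
        startPos = position;
        endPos = std::min(position + bufferSize, source.Length());
        source.GetCharRange(buf, startPos, endPos - startPos);
        fills++;
    }
public:
    explicit BufferedReader(const ICharSource &s) : source(s) {}
    // Returns a space for positions outside the document.
    char At(int position) {
        if (position < 0 || position >= source.Length()) {
            return ' ';
        }
        if (position < startPos || position >= endPos) {
            Fill(position);
        }
        return buf[position - startPos];
    }
    int Fills() const { return fills; }
};

// Test double standing in for a document.
class StringSource : public ICharSource {
    std::string text;
public:
    explicit StringSource(std::string t) : text(std::move(t)) {}
    int Length() const override { return static_cast<int>(text.size()); }
    void GetCharRange(char *buffer, int position, int lengthRetrieve) const override {
        assert(position >= 0 && lengthRetrieve >= 0);
        std::memcpy(buffer, text.data() + position, static_cast<std::size_t>(lengthRetrieve));
    }
};
```

Sequential reads within one window cost a single call across the interface, which is the saving the helper classes are meant to provide. A real helper would use a much larger buffer and would probably position the window to suit the lexer's access pattern.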

The use of interfaces between components is similar to COM or XPCOM. Using actual COM or XPCOM would add complexity. The interfaces are defined in C++ but can be emulated in C and probably in other languages that are compatible with COM. SCI_METHOD is defined to be whatever is needed to specify a reasonable calling convention on each platform so that each side of the interface can call the other. This is currently __stdcall on Windows and is unspecified on Unix.

The ILexer and IDocument interfaces may be expanded in the future with extended versions (ILexer2...). The Version method indicates which interface is implemented and thus which methods may be called.
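How a caller might use Version before calling a method from an extended interface is sketched below. No extended interface exists yet, so everything here is hypothetical: ILexerSketch, ILexerSketch2, the Description method and the version numbers are invented purely to show the pattern.

```cpp
#include <cassert>
#include <string>

// Hypothetical version numbers for this sketch.
enum { versionOriginal = 0, versionExtended = 1 };

// Reduced stand-in for an original interface.
class ILexerSketch {
public:
    virtual ~ILexerSketch() {}
    virtual int Version() const = 0;
    virtual const char *Name() = 0;
};

// Hypothetical extended interface that adds one method.
class ILexerSketch2 : public ILexerSketch {
public:
    virtual const char *Description() = 0;
};

// The caller only uses the extended method when Version says it exists.
std::string DescribeLexer(ILexerSketch *lexer) {
    assert(lexer);
    if (lexer->Version() >= versionExtended) {
        return static_cast<ILexerSketch2 *>(lexer)->Description();
    }
    return lexer->Name();
}

class OldLexer : public ILexerSketch {
public:
    int Version() const override { return versionOriginal; }
    const char *Name() override { return "old"; }
};

class NewLexer : public ILexerSketch2 {
public:
    int Version() const override { return versionExtended; }
    const char *Name() override { return "new"; }
    const char *Description() override { return "new lexer with description"; }
};
```

The cast is only safe because an object that reports the extended version promises to implement the extended interface.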

Scintilla tries to minimize the consequences of modifying text to only relex and redraw the line of the change where possible. Lexer objects contain their own private extra state which can affect later lines. For example, if the C++ lexer is greying out inactive code segments then changing the statement #define BEOS 0 to #define BEOS 1 may require restyling and redisplaying later parts of the document. The lexer can call ChangeLexerState to signal to the document that it should relex and display more.
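The #define example could be handled roughly as follows. This is a sketch under stated assumptions: IRelexTarget is a reduced stand-in for IDocument, DefineTracker and RecordingDocument are hypothetical, and asking for a relex all the way to the end of the document is simply the most conservative choice.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Reduced stand-in for IDocument with the two methods the sketch needs.
class IRelexTarget {
public:
    virtual ~IRelexTarget() {}
    virtual int Length() const = 0;
    virtual void ChangeLexerState(int start, int end) = 0;
};

// Hypothetical private lexer state: the preprocessor definitions seen so far.
class DefineTracker {
    std::map<std::string, std::string> defines;
public:
    // Examines one line that starts at document position lineStart. When a
    // #define introduces a new name or changes a value, later lines may be
    // styled differently, so the rest of the document is marked for relexing.
    void LexLine(const std::string &line, int lineStart, IRelexTarget &doc) {
        assert(lineStart >= 0);
        std::istringstream in(line);
        std::string directive;
        std::string name;
        std::string value;
        if (in >> directive >> name && directive == "#define") {
            in >> value;  // stays empty when the define has no value
            const auto it = defines.find(name);
            if (it == defines.end() || it->second != value) {
                defines[name] = value;
                doc.ChangeLexerState(lineStart, doc.Length());
            }
        }
    }
};

// Test double that records the calls made by the lexer.
class RecordingDocument : public IRelexTarget {
public:
    int length;
    int calls = 0;
    int lastStart = -1;
    int lastEnd = -1;
    explicit RecordingDocument(int length_) : length(length_) {}
    int Length() const override { return length; }
    void ChangeLexerState(int start, int end) override {
        calls++;
        lastStart = start;
        lastEnd = end;
    }
};
```

Relexing a line whose definition is unchanged makes no call, so ordinary edits keep the cheap single-line behaviour.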

SetErrorStatus is used to notify the document of exceptions. Exceptions should not be thrown over build boundaries as the two sides may be built with different compilers or incompatible exception options.
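One way to keep exceptions on the lexer's side of the boundary is to wrap the lexing work and convert any exception into a status. The sketch below assumes hypothetical names (IErrorSink, RunGuarded, StatusRecorder) and assumed status values; a real lexer would use whatever status codes Scintilla defines.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>

// Assumed status values for this sketch.
const int statusFailure = 1;
const int statusBadAlloc = 2;

// Reduced stand-in for IDocument with only the error reporting method.
class IErrorSink {
public:
    virtual ~IErrorSink() {}
    virtual void SetErrorStatus(int status) = 0;
};

// Runs work and reports any exception as a status instead of letting it
// propagate to the caller.
template <typename Work>
void RunGuarded(IErrorSink &doc, Work work) {
    try {
        work();
    } catch (const std::bad_alloc &) {
        doc.SetErrorStatus(statusBadAlloc);
    } catch (...) {
        doc.SetErrorStatus(statusFailure);
    }
}

// Test double that remembers the last status reported.
class StatusRecorder : public IErrorSink {
public:
    int status = 0;
    void SetErrorStatus(int s) override { status = s; }
};
```

A lexer's Lex and Fold methods could run their bodies through such a wrapper so that nothing is thrown back into code built with a different compiler.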

External lexers

External lexers will require changes. They will have to implement a lexer object factory function (exposed through GetLexerFactory) instead of the current Lex and Fold functions. Once a lexer object has been created, it is called in exactly the same way as internal lexer objects.
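The factory arrangement might look something like this sketch. The exact signature of GetLexerFactory is an assumption here, as are IMiniLexer (a reduced stand-in for ILexer), DemoLexer and the LexerFactoryFunction typedef; a real external lexer would also need the platform's export decoration and the full ILexer interface.

```cpp
#include <cassert>

// Reduced stand-in for ILexer with only the methods the sketch needs.
class IMiniLexer {
public:
    virtual int Version() const = 0;
    virtual void Release() = 0;
protected:
    virtual ~IMiniLexer() {}  // destroyed only through Release
};

class DemoLexer : public IMiniLexer {
public:
    static int liveCount;  // lets the example observe creation and release
    DemoLexer() { liveCount++; }
    int Version() const override { return 0; }
    void Release() override { delete this; }
    // The factory function handed to the host.
    static IMiniLexer *Factory() { return new DemoLexer(); }
private:
    ~DemoLexer() override { liveCount--; }
};
int DemoLexer::liveCount = 0;

// Assumed shape of the factory function type.
typedef IMiniLexer *(*LexerFactoryFunction)();

// Assumed shape of the exported entry point: returns the factory for the
// lexer at index, or a null pointer when there is no such lexer.
LexerFactoryFunction GetLexerFactory(unsigned int index) {
    static const LexerFactoryFunction factories[] = { DemoLexer::Factory };
    const unsigned int count = sizeof(factories) / sizeof(factories[0]);
    return index < count ? factories[index] : nullptr;
}
```

The host never uses new or delete on the lexer itself: it obtains the object from the factory and hands it back with Release, so allocation and deallocation both happen inside the lexer's own module.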

Migration

Existing lexers do not have to change much as the LexerModule and LexerSimple classes provide a very similar environment. The set of headers used by lexers has changed but is fairly consistent among lexers so can just be copied from a lexer included with Scintilla. Lexers should not include Platform.h and only use headers from the include and lexlib directories. Using headers from the src, win32, or gtk directories makes the code dependent on features that may change so should not be done.

A lexer may be converted to an ILexer implementing class by defining a class derived from ILexer, defining a factory function, and changing the LexerModule to use the factory function rather than lexing and folding functions. Initially it is simplest to derive the class from LexerBase as this provides some default functionality including standard handling of property sets and word lists. Later these should be overridden to optimize changes to parameters.

Around 60 lines of boiler-plate additional code are needed to convert an existing lexer into an external lexer that implements ILexer.

Code

An implementation of all this is available from http://www.scintilla.org/nulex.zip

Additional directories have been used to impose some more order on the source code. Lexers have been moved into the lexers directory and classes used by lexers are in the lexlib directory. The build files work for Windows and GTK+, but those for OS X have not been updated.

The C++ lexer included has some code to show whether or not code is active based on preprocessor state, with inactive code shown in different styles to active code. This is turned on with lexer.cpp.track.preprocessor=1 and keywords5 containing a set of preprocessor definitions in the form <var>=<value> <var>=<value> ... Definitions within the source will be picked up if lexer.cpp.update.preprocessor=1. Both these options have some cost in terms of speed and memory. The inactive states are 64 greater than their active counterparts.

Example properties to achieve this:

lexer.cpp.track.preprocessor=1
lexer.cpp.update.preprocessor=1

keywords5.$(file.patterns.cpp)=\
PLAT_GTK=1 \
_MSC_VER \
PLAT_GTK_WIN32=1

# White space
style.cpp.64=fore:#C0C0C0
# Comment: /* */.
style.cpp.65=$(style.cpp.1),fore:#90B090
style.cpp.66=$(style.cpp.2),fore:#90B090
style.cpp.67=$(style.cpp.3),fore:#D0D0D0
style.cpp.68=$(style.cpp.4),fore:#90B0B0
style.cpp.69=$(style.cpp.5),fore:#9090B0
style.cpp.70=$(style.cpp.6),fore:#B090B0
style.cpp.71=$(style.cpp.7),fore:#B090B0
style.cpp.72=$(style.cpp.8),fore:#C0C0C0
style.cpp.73=$(style.cpp.9),fore:#B0B090
style.cpp.74=$(style.cpp.10),fore:#B0B0B0
style.cpp.75=$(style.cpp.11),fore:#B0B0B0
style.cpp.76=$(style.cpp.12),fore:#000000
style.cpp.77=$(style.cpp.13),fore:#007F00
style.cpp.78=$(style.cpp.14),fore:#7FAF7F
style.cpp.79=$(style.cpp.15),fore:#C0C0C0
style.cpp.80=$(style.cpp.16),fore:#C0C0C0
style.cpp.81=$(style.cpp.17),fore:#C0C0C0
style.cpp.82=$(style.cpp.18),fore:#C0C0C0
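The <var>=<value> form of keywords5 and the offset of 64 between active and inactive styles can be sketched as below. This is not the code in the C++ lexer: ParseDefinitions and StyleFor are hypothetical, and storing an empty value for a name given without "=" is an assumption of this sketch that the real lexer may treat differently.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Inactive styles are 64 greater than their active counterparts.
const int inactiveOffset = 64;

// Parses a whitespace separated list such as "PLAT_GTK=1 _MSC_VER" into a
// map from name to value. A name without '=' is stored with an empty value
// (an assumption of this sketch).
std::map<std::string, std::string> ParseDefinitions(const std::string &wordList) {
    std::map<std::string, std::string> definitions;
    std::istringstream in(wordList);
    for (std::string item; in >> item;) {
        const std::string::size_type equals = item.find('=');
        if (equals == std::string::npos) {
            definitions[item] = "";
        } else {
            definitions[item.substr(0, equals)] = item.substr(equals + 1);
        }
    }
    return definitions;
}

// Maps an active style number to the style used in active or inactive code.
int StyleFor(int activeStyle, bool active) {
    assert(activeStyle >= 0 && activeStyle < inactiveOffset);
    return active ? activeStyle : activeStyle + inactiveOffset;
}
```

With this mapping, active style 1 corresponds to inactive style 65, which matches the style.cpp.65=$(style.cpp.1) pattern in the properties above.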