Posts Tagged ‘clang’

Location, Location, Location

April 27, 2021

As of a few days ago, a new feature in clang-query allows introspecting the source locations for a given clang AST node. The feature is also available for experimentation in Compiler Explorer. I previously delivered a talk at EuroLLVM 2019 and blogged in 2018 about this feature and others to assist in discovery of AST matchers and source locations. This is a major step in getting the Tooling API discovery features upstream into LLVM/Clang.

Background

When creating clang-tidy checks to perform source to source transformation, there are generally two steps common to all checks:

  • Matching on the AST
  • Replacing particular source ranges in source files with new text

To complete the latter, you will need to become familiar with the source locations clang provides for the AST. A diagnostic is then issued with zero or more “fix it hints” which indicate changes to the code. Almost all clang-tidy checks are implemented in this way.

Some of the source locations which might be interesting for a FunctionDecl are illustrated here:

Pick Your Name

A common use case for this kind of tooling is to port a large codebase from a deprecated API to a new API.

A tool might replace a member call pushBack with push_back on a custom container, for the purpose of making the API more like standard containers. It might be the case that you have multiple classes with a pushBack method and you only want to change uses of it on a particular class, so you can not simply find and replace across the entire repository.

Given test code like

    struct MyContainer
    {
        // deprected:
        void pushBack(int t);

        // new:
        void push_back(int t);    
    };

    void calls()
    {
        MyContainer mc;

        mc.pushBack(42);
    }

A matcher could look something like:

    match cxxMemberCallExpr(
    on(expr(hasType(cxxRecordDecl(hasName("MyContainer"))))),
    callee(cxxMethodDecl(hasName("pushBack")))
    )

Try experimenting with it on Compiler Explorer.

An explanation of how to discover how to write this AST matcher expression is out of scope for this blog post, but you can see blogs passim for that too.

Know Your Goal

Having matched a call to pushBack the next step is to replace the source text of the call with push_back. The call to mc.pushBack() is represented by an instance of CXXMemberCallExpr. Given the instance, we need to identify the location in the source of the first character after the “.” and the location of the opening paren. Given those locations, we create a diagnostic with a FixItHint to replace that source range with the new method name:

    diag(MethodCallLocation, "Use push_back instead of pushBack")
        << FixItHint::CreateReplacement(
            sourceRangeForCall, "push_back");

When we run our porting tool in clang-tidy, we get output similiar to:

warning: Use push_back instead of pushBack [misc-update-pushBack]
    mc.pushBack(42);
       ^~~~~~~~
       push_back

Running clang-tidy with -fix then causes the tooling to apply the suggested fix. Once we have tested it, we can run the tool to apply the change to all of our code at once.

Find Your Place

So, how do we identify the sourceRangeForCall?

One way is to study the documentation of the Clang AST to try to identify what API calls might be useful to access that particular source range. That is quite difficult to determine for newcomers to the Clang AST API.

The new clang-query feature allows users to introspect all available locations for a given AST node instance.

note: source locations here
    mc.pushBack(42);
       ^
 * "getExprLoc()"

note: source locations here
    mc.pushBack(42);
                  ^
 * "getEndLoc()"
 * "getRParenLoc()"

With this output, we can see that the location of the member call is retrievable by calling getExprLoc() on the CXXMemberCallExpr, which happens to be defined on the Expr base class. Because clang replacements can operate on token ranges, the location for the start of the member call is actually all we need to complete the replacement.

One of the design choices of the srcloc output of clang-query is that only locations on the “current” AST node are part of the output. That’s why for example, the arguments of a function call are not part of the locations output for a CXXMemberCallExpr. Instead it is necessary to traverse to the argument and introspect the locations of the node which represents the argument.

By traversing to the MemberExpr of the CXXMethodCallExpr we can see more locations. In particular, we can see that getOperatorLoc() can be used to get the location of the operator (a “.” in this case, but it could be a “->” for example) and getMemberNameInfo().getSourceRange() can be used to get a source range for the name of the member being called.

The Best Location

Given the choice of using getExprLoc() or getMemberNameInfo().getSourceRange(), the latter is preferable because it is more semantically related to what we want to replace. Aside from the hint that we want the “source range” of the “member name”, the getExprLoc() should be disfavored as that API is usually only used to choose a position to indicate in a diagnostic. That is not specifically what we wish to use the location for.

Additionally, by experimenting with slightly more complex code, we can see that getExprLoc() on a template-dependent call expression does not give the desired source location (At time of publishing! – This is likely undesirable in this case). At any rate, getMemberNameInfo().getSourceRange() gives the correct source range in all cases.

In the end, our diagnostic can look something like:

    diag(MethodCallLocation, "Use push_back instead of pushBack")
        << FixItHint::CreateReplacement(
            theMember->getMemberNameInfo().getSourceRange(), "push_back");

This feature is a powerful way to discover source locations and source ranges while creating and maintaining clang-tidy checks. Let me know if you find it useful!

AST Matchmaking made easy

February 14, 2021

The upcoming version of Clang 12 includes a new traversal mode which can be used for easier matching of AST nodes.

I presented this mode at EuroLLVM and ACCU 2019, but at the time I was calling it “ignoring invisible” mode. The primary aim is to make AST Matchers easier to write by requiring less “activation learning” of the newcomer to the AST Matcher API. I’m analogizing to “activation energy” here – this mode reduces the amount of learning of new concepts must be done before starting to use AST Matchers.

The new mode is a mouthful – IgnoreUnlessSpelledInSource – but it makes AST Matchers easier to use correctly and harder to use incorrectly. Some examples of the mode are available in the AST Matchers reference documentation.

In clang-query, the mode affects both matching and dumping of AST nodes and it is enabled with:

set traversal IgnoreUnlessSpelledInSource

while in the C++ API of AST Matchers, it is enabled by wrapping a matcher in:

traverse(TK_IgnoreUnlessSpelledInSource, ...)

The result is that matching of AST nodes corresponds closely to what is written syntactically in the source, rather than corresponding to the somewhat arbitrary structure implicit in the clang::RecursiveASTVisitor class.

Using this new mode makes it possible to “add features by removing code” in clang-tidy, making the checks more maintainable and making it possible to run checks in all language modes.

Clang does not use this new mode by default.

Implicit nodes in expressions

One of the issues identified is that the Clang AST contains many nodes which must exist in order to satisfy the requirements of the language. For example, a simple function relying on an implicit conversion might look like.

struct A {
    A(int);
    ~A();
};

A f()
{
    return 42;
}

In the new IgnoreUnlessSpelledInSource mode, this is represented as

ReturnStmt
`-IntegerLiteral '42'
and the integer literal can be matched with
returnStmt(hasReturnValue(integerLiteral().bind("returnVal")))

In the default mode, the AST might be (depending on C++ language dialect) represented by something like:

ReturnStmt
`-ExprWithCleanups
  `-CXXConstructExpr
    `-MaterializeTemporaryExpr
      `-ImplicitCastExpr
        `-CXXBindTemporaryExpr
          `-ImplicitCastExpr
            `-CXXConstructExpr
              `-IntegerLiteral '42'

To newcomers to the Clang AST, and to me, it is not obvious what all of the nodes there are for. I can reason that an instance of A must be constructed. However, there are two CXXConstructExprs in this AST and many other nodes, some of which are due to the presence of a user-provided destructor, others due to the temporary object. These kinds of extra nodes appear in most expressions, such as when processing arguments to a function call or constructor, declaring or assigning a variable, converting something to bool in an if condition etc.

There are already AST Matchers such as ignoringImplicit() which skip over some of the implicit nodes in AST Matchers. Still though, a complete matcher for the return value of this return statement looks something like

returnStmt(hasReturnValue(
    ignoringImplicit(
        ignoringElidableConstructorCall(
            ignoringImplicit(
                cxxConstructExpr(hasArgument(0,
                    ignoringImplicit(
                        integerLiteral().bind("returnVal")
                        )
                    ))
                )
            )
        )
    ))

Another mouthful.

There are several problems with this.

  • Typical clang-tidy checks which deal with expressions tend to require extensive use of such ignoring...() matchers. This makes the matcher expressions in such clang-tidy checks quite noisy
  • Different language dialects represent the same C++ code with different AST structures/extra nodes, necessitating testing and implementing the check in multiple language dialects
  • The requirement or possibility to use these intermediate matchers at all is not easily discoverable, nor are the required matchers to saitsfy all language modes easily discoverable
  • If an AST Matcher is written without explicitly ignoring implicit nodes, Clang produces lots of surprising results and incorrect transformations

Implicit declarations nodes

Aside from implicit expression nodes, Clang AST Matchers also match on implicit declaration nodes in the AST. That means that if we wish to make copy constructors in our codebase explicit we might use a matcher such as

cxxConstructorDecl(
    isCopyConstructor()
    ).bind("prepend_explicit")

This will work fine in the new IgnoreUnlessSpelledInSource mode.

However, in the default mode, if we have a struct with a compiler-provided copy constructor such as:

struct Copyable {
    OtherStruct m_o;
    Copyable();
};

we will match the compiler provided copy constructor. When our check inserts explicit at the copy constructor location it will result in:

struct explicit Copyable {
    OtherStruct m_o;
    Copyable();
};

Clearly this is an incorrect transformation despite the transformation code “looking” correct. This AST Matcher API is hard to use correctly and easy to use incorrectly. Because of this, the isImplicit() matcher is typically used in clang-tidy checks to attempt to exclude such transformations, making the matcher expression more complicated.

Implicit template instantiations

Another surpise in the behavior of AST Matchers is that template instantiations are matched by default. That means that if we wish to change class members of type int to type safe_int for example, we might write a matcher something like

fieldDecl(
    hasType(asString("int"))
    ).bind("use_safe_int")

This works fine for non-template code.

If we have a template like

template  
struct TemplStruct {
    TemplStruct() {}
    ~TemplStruct() {}

private:
    T m_t;
};

then clang internally creates an instantiation of the template with a substituted type for each template instantation in our translation unit.

The new IgnoreUnlessSpelledInSource mode ignores those internal instantiations and matches only on the template declaration (ie, with the T un-substituted).

However, in the default mode, our template will be transformed to use safe_int too:

template  
struct TemplStruct {
    TemplStruct() {}
    ~TemplStruct() {}

private:
    safe_int m_t;
};

This is clearly an incorrect transformation. Because of this, isTemplateInstantiation() and similar matchers are often used in clang-tidy to exclude AST matches which produce such transformations.

Matching metaphorical code

C++ has multiple features which are designed to be simple expressions which the compiler expands to something less-convenient to write. Range-based for loops are a good example as they are a metaphor for an explicit loop with calls to begin and end among other things. Lambdas are another good example as they are a metaphor for a callable object. C++20 adds several more, including rewriting use of operator!=(...) to use !operator==(...) and operator<(...) to use the spaceship operator.

[I admit that in writing this blog post I searched for a metaphor for “a device which aids understanding by replacing the thing it describes with something more familiar” before realizing the recursion. I haven’t heard these features described as metaphorical before though…]

All of these metaphorical replacements can be explored in the Clang AST or on CPP Insights.

Matching these internal representations is confusing and can cause incorrect transformations. None of these internal representations are matchable in the new IgnoreUnlessSpelledInSource mode.

In the default matching mode, the CallExprs for begin and end are matched, as are the CXXRecordDecl implicit in the lambda and hidden comparisons within rewritten binary operators such as spaceship (causing bugs in clang-tidy checks).

Easy Mode

This new mode of AST Matching is designed to be easier for users, especially newcomers to the Clang AST, to use and discover while offering protection from typical transformation traps. It will likely be used in my Qt-based Gui Quaplah, but it must be enabled explicitly in existing clang tools.

As usual, feedback is very welcome!