Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.
The Roslyn SDK exposes its syntax tree, symbol table, and semantic model, with the primary use being custom code analysis. Following the quick-start tutorial, I surprisingly easily made a linter ('analyzer') for a personal style preference, along with a 'code fix' (the lightbulb suggestion that appears in Visual Studio). The resulting .NET assembly integrated impressively with msbuild and Visual Studio, my custom analyzer being indistinguishable in UX from the built-in ones. Seeing the actual syntax tree, especially where the compiler had recovered from syntax errors, also seemed a great learning experience for getting a feel for how the compiler treats errors.
It seems to now be fairly common for .NET projects to develop their own analyzers to enforce specialized best practices; I wonder if other languages have similar customs?
https://docs.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/
> Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.
I've often thought it would be cool to use ASTs, and perhaps code embeddings generated from machine learning, as a tool to help students improve.
If you've ever taught a course with intro-level Python, it quickly becomes apparent how repeatable the mistakes are, or where you didn't spend enough time. As a student, this is frustrating because the correction comes too late; it's why having someone knowledgeable looking over your shoulder can speed up your learning.
The challenge that I believe ASTs present is that they only parse compliant code. So if someone makes a syntax error, it becomes a whole new ball game. I'd glanced at tree sitter to see if this could fix some of these issues, but I think it's a more fundamental problem than that.
Tree sitter can definitely help with this problem, but so can regular AST parsers; the idea is the same: add code or grammar rules that consume the "invalid" input, mark it as invalid, and resume parsing valid code as soon as possible.
Existing code editors like VSCode do exactly this for better syntax highlighting of incomplete code.
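Tree sitter has real error recovery built into its grammars; as a very crude, stdlib-only illustration of the same "mark the bad part and keep going" idea (parse_with_recovery is a made-up helper here, and blanking a whole line is obviously too blunt for real use):

```python
import ast

def parse_with_recovery(source: str, max_retries: int = 20):
    """Crude recovery: blank out the line that failed, remember it as invalid,
    and try to parse the rest again."""
    lines = source.splitlines()
    invalid = []
    for _ in range(max_retries):
        try:
            return ast.parse("\n".join(lines)), invalid
        except SyntaxError as err:
            bad = (err.lineno or 0) - 1
            if bad < 0 or bad >= len(lines) or bad in invalid:
                break  # can't pinpoint the error any further; give up
            invalid.append(bad)
            lines[bad] = ""  # drop the offending line, keep the valid code
    return None, invalid

tree, bad_lines = parse_with_recovery("x = 1\ny = = 2\nz = 3\n")
print(bad_lines)       # [1]  (the broken second line)
print(ast.dump(tree))  # AST for the lines that did parse
```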
Wouldn't that be impossible? The structure of Python is finite, and invalid deviations are infinite. Sure, any language's AST compiler could be more helpful, but it can't take trash and turn it into gold.
In the context of programming lessons, there is a known correct program. Wouldn't it be possible to calculate the distance between what has been typed and what the correct finished program should be, to guide students toward correcting the non-parsing parts of their code?
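A back-of-the-envelope sketch of that idea, diffing the raw source against the known solution (which conveniently doesn't care whether the attempt parses); the helper names are made up:

```python
import difflib

def closeness(attempt: str, reference: str) -> float:
    """Similarity ratio between the student's attempt and the known solution."""
    return difflib.SequenceMatcher(None, attempt, reference).ratio()

def hints(attempt: str, reference: str) -> list[str]:
    """Diff lines showing where the attempt diverges from the reference."""
    return [
        line
        for line in difflib.unified_diff(
            attempt.splitlines(), reference.splitlines(),
            fromfile="attempt", tofile="reference", lineterm="",
        )
        if line.startswith(("+", "-"))
    ]
```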
In general, for me at least, I find the best way to learn about something is to work on the 'internals'. For instance, when React came out I couldn't wrap my head around it, so I started my own JS framework, and it ended up almost exactly like React (then I dumped it, as it ended up just being a learning exercise).
I'm trying to become a top 0.01% JS user and creating linters, flavors, etc is my plan as well. I've read through and annotated the React codebase but it didn't stick very well. I would have done better to create my own framework! I keep having to relearn that lesson... I can have a lot of knowledge about a thing through reading, but knowledge of the thing requires some practical application.
A tangent, but as it relates to that, if anyone reading has ideas on how to apply traditional computer science curriculum, I would love to hear it. I can think of toy CPU emulators, system architecture diagramming, language creation... But not sure if there's a thing I can build that would say, "I understand computer science."
In the context of this article -- it's mostly just talking about Python-specific ASTs.
Reading this article might be confusing to someone who's trying to learn what an AST is. ASTs are not unique to Python, they're just a common data structure used in compiler design.
ASTs are used by compilers like this:
1) A compiler will take source code and process it into little pieces called tokens (e.g., a number, an equals sign, a variable type, etc.) with a little program called a "lexer".
2) Then, those tokens are processed by a "parser" -- a little program that takes as input the tokens from the lexer, along with a description of the programming language (e.g. a context-free grammar in Backus-Naur form), and outputs an AST.
3) Then finally, the AST nodes are walked and machine code is generated.
This article hooks into the AST inside the Python "compiler" between steps 2 and 3 to do some analysis on the AST instead of converting it to something that can be executed (e.g. machine code or some other IR). That's a very useful thing, but probably not a good introduction to compilers.
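Just to make steps 1 and 2 concrete, the standard library lets you peek at both stages (CPython's compiler doesn't literally call the tokenize module internally, but the output mirrors what its lexer and parser produce):

```python
import ast
import io
import token
import tokenize

source = "answer = 40 + 2\n"

# Step 1: the lexer turns raw text into tokens.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(token.tok_name[tok.type], repr(tok.string))

# Step 2: the parser turns those tokens into a tree.
print(ast.dump(ast.parse(source)))
```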
If you're new to compilers, I suggest staying away from the Python "ast" module until you're comfortable with general compiler design. Maybe start with playing around with something like PLY instead -- create a simple little language yourself and write a compiler for it:
I maintain some code that relies on the Python AST for finding and packaging modules with appropriate class signatures when building customer-specific distributions. It works really well most of the time. And it is a lot easier to maintain than 50+ separate wheel definitions.
The one big drawback is that the AST for even trivial code patterns has a history of changing between Python versions. This makes it more annoying than usual to support multiple versions at the same time. Luckily 3.9 and 3.10 haven't brought any changes that impacted my codebase, as far as I've noticed.
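A rough sketch of that kind of scan; PluginBase is a hypothetical marker base class standing in for whatever class signature the packaging code actually looks for:

```python
import ast
from pathlib import Path

def modules_with_plugin_classes(root: str):
    """Return (file, class name) pairs for classes that inherit from PluginBase."""
    hits = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
                if "PluginBase" in bases:  # hypothetical marker base class
                    hits.append((path, node.name))
    return hits
```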
The only major changes that I'm aware of since Python 3 have been the change with keyword arguments in 3.6, and the deprecation of Index and introduction of Constant more recently. Those are breaking changes, but relatively small and maintainable IMO. What challenges have you faced?
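For literals, a small shim like this (just a sketch) is usually enough to keep a visitor working across the old and new node shapes:

```python
import ast

def int_literal_value(node):
    """Value of an integer literal node, tolerant of old and new AST shapes."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):  # 3.8+
        return node.value
    if node.__class__.__name__ == "Num":  # pre-3.8 parsers emitted ast.Num
        return node.n if isinstance(node.n, int) else None
    return None
```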
That's a pretty awesome read, and the approach is pretty flexible.
I've written simple code using the AST-visitor approach to enforce some common standards on code within our company. Simple things like ensuring that when we use Troposphere to generate AWS CloudFormation templates we always set up some specific values. (For example, I wrote a checker to ensure that every time an ECR instance is created we must enable ScanOnPush, or that every time we declare a security group we must have a comment "[cloudformation] ..." with it, so that manual edits stand out.)
The stuff that ASTs let you do really flexibly is almost always lost on people because they're not aware of it. A lot of other developers would try to do this with string or regex matching, and that often leads to painful experiences. With an AST visitor, it's easy to enforce a rule like:
    A call to function "Foo"
    must always have an argument matching the regexp "/blah/";
    otherwise raise an error.
And they're so lightweight you can add them to any CI/CD/automation steps in your repository. Once you get a few things like that, or validating naming-standards, you can roll them up into a simple "linter".
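A minimal sketch of that exact rule with ast.NodeVisitor (Foo and /blah/ being the placeholder names from above):

```python
import ast
import re
import sys

PATTERN = re.compile(r"blah")

class FooArgChecker(ast.NodeVisitor):
    """Flag calls to Foo() that lack a string argument matching /blah/."""

    def __init__(self):
        self.errors = []

    def visit_Call(self, node):
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "Foo":
            ok = any(
                isinstance(arg, ast.Constant)
                and isinstance(arg.value, str)
                and PATTERN.search(arg.value)
                for arg in node.args
            )
            if not ok:
                self.errors.append(f"line {node.lineno}: Foo() call without a /blah/ argument")
        self.generic_visit(node)

if __name__ == "__main__":
    checker = FooArgChecker()
    checker.visit(ast.parse(open(sys.argv[1]).read()))
    print("\n".join(checker.errors))
    sys.exit(1 if checker.errors else 0)
```

The non-zero exit status is what lets a pre-commit hook or CI step fail the build.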
I really wasn't expecting anyone to read all of it; I was afraid people would either find it too trivial or too complex depending on skill level. So that's great to hear.
One thing I don't understand from the article: "Hopefully this explains why we explicitly need to tell each variable whether it is in a load or store context in the AST."
Why does the Name() need to know its own context? It seems like the manner in which Name() is used, whether load, store, or delete, is fully determined by the Name()'s parent and whether the parent uses the Name() as a target or as a value.
I can't think of any exceptions to this right away, so you might be right.
But not having ctx available as an explicit value would make certain AST manipulations, as well as bytecode generation by the interpreter, more complicated. This is because in both scenarios, checking the context would involve looking at the parent instead of a child, and finding a node's parent in a tree is often much harder than finding its child.
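A quick way to see why that's convenient, using nothing but the stdlib:

```python
import ast

tree = ast.parse("x = y")
print(ast.dump(tree))
# The assignment target is Name(id='x', ctx=Store()) and the value is
# Name(id='y', ctx=Load()): identical node types, told apart by ctx alone,
# without ever having to walk back up to the parent Assign node.
```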
Yeah, that's not really explained in the article, and it's not explained in the `ast` docs either [1]. For reference, the full list of assignment targets is: Name, Attribute, Subscript, Starred, Tuple, and List.
Perhaps it's just neater to have the context right there on every one of them.
Also, from a pragmatic standpoint, when actually processing the AST and analyzing it semantically, you're pretty much always going to be handling expressions and patterns (i.e. rvalues/lvalues) differently, because they mean completely different things! And having the context right there makes them more convenient to handle. So when designing an AST datatype, you could just as well not include the context in Name() and it'd be fine, but the Python `ast` module's primary use case is compiling the AST to bytecode, where it's more convenient to just have that info around, so that's what they did.
This was a really nice read! The best part was learning that there’s no need to actually parse tokens when building a Python linter (well, maybe there’s an exception) because you can leverage the already parsed AST or CST.
True! Although there are some lints that would require you to look at tokens, such as checking for single vs. double quotes, or the number of spaces used for indentation.
However, Python has a built-in tokenize module for that as well.
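For example, quote style is gone by the time the AST is built (a string literal is just a Constant value), but it's right there at the token level. A small sketch:

```python
import io
import token
import tokenize

def single_quoted_strings(source: str):
    """Return (line, token) pairs for string literals written with single quotes."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.STRING and tok.string.lstrip("rbufRBUF").startswith("'"):
            hits.append((tok.start[0], tok.string))
    return hits

print(single_quoted_strings("a = 'single'\nb = \"double\"\n"))  # [(1, "'single'")]
```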