Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.
The Roslyn SDK exposes its syntax tree, symbol table, and semantic model, with the primary use being custom code analysis. Following the quick-start tutorial, I surprisingly easily made a linter ('analyzer') for a personal style preference, along with a 'code fix' (the lightbulb suggestion that appears in Visual Studio). The resulting .NET assembly integrated impressively with msbuild and Visual Studio, my custom analyzer being indistinguishable in UX from the built-in ones. Seeing the actual syntax tree, especially where the compiler had recovered from syntax errors, also seemed a great learning experience for getting a feel for how the compiler treats errors.
It seems to now be fairly common for .NET projects to develop their own analyzers to enforce specialized best practices; I wonder if other languages have similar customs?
https://docs.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/
> Quite a few languages can access their own ASTs, but I don't know of one other than C# (and VB.NET--Roslyn is the compiler for both) where the API is so deeply integrated and hence useful.
I've often thought it would be cool to use ASTs, and perhaps code embeddings generated from machine learning, as a tool to help students improve.
If you've ever taught a course with intro-level Python, it quickly becomes apparent how repeatable the mistakes are, or where you didn't spend enough time. As a student, this is frustrating because the correction comes too late; it's why having someone knowledgeable looking over your shoulder can speed up your learning.
The challenge that I believe ASTs present is that they only parse compliant code. So if someone makes a syntax error, it becomes a whole new ball game. I'd glanced at tree sitter to see if this could fix some of these issues, but I think it's a more fundamental problem than that.
Tree sitter can definitely help with this problem, but so can regular AST parsers; the idea is the same: add code or grammar rules that consume the "invalid" input, mark it as invalid, and resume parsing valid code as soon as possible.
Existing code editors like VSCode do exactly this for better syntax highlighting of incomplete code.
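Tree sitter has real error recovery built into its grammars; as a very crude, stdlib-only illustration of the same "mark the bad part and keep going" idea (parse_with_recovery is a made-up helper here, and blanking a whole line is obviously too blunt for real use):

```python
import ast

def parse_with_recovery(source: str, max_retries: int = 20):
    """Crude recovery: blank out the line that failed, remember it as invalid,
    and try to parse the rest again."""
    lines = source.splitlines()
    invalid = []
    for _ in range(max_retries):
        try:
            return ast.parse("\n".join(lines)), invalid
        except SyntaxError as err:
            bad = (err.lineno or 0) - 1
            if bad < 0 or bad >= len(lines) or bad in invalid:
                break  # can't pinpoint the error any further; give up
            invalid.append(bad)
            lines[bad] = ""  # drop the offending line, keep the valid code
    return None, invalid

tree, bad_lines = parse_with_recovery("x = 1\ny = = 2\nz = 3\n")
print(bad_lines)       # [1]  (the broken second line)
print(ast.dump(tree))  # AST for the lines that did parse
```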
Wouldn't that be impossible? The structure of Python is finite, and invalid deviations are infinite. Sure, any language's AST compiler could be more helpful, but it can't take trash and turn it into gold.
In the context of programming lessons, there is a known correct program. Wouldn't it be possible to calculate the distance between what has been typed and what the correct finished program should be, to guide students toward correcting the non-parsing parts of their code?
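A back-of-the-envelope sketch of that idea, diffing the raw source against the known solution (which conveniently doesn't care whether the attempt parses); the helper names are made up:

```python
import difflib

def closeness(attempt: str, reference: str) -> float:
    """Similarity ratio between the student's attempt and the known solution."""
    return difflib.SequenceMatcher(None, attempt, reference).ratio()

def hints(attempt: str, reference: str) -> list[str]:
    """Diff lines showing where the attempt diverges from the reference."""
    return [
        line
        for line in difflib.unified_diff(
            attempt.splitlines(), reference.splitlines(),
            fromfile="attempt", tofile="reference", lineterm="",
        )
        if line.startswith(("+", "-"))
    ]
```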
In general, for me at least, I find the best way to learn about something is to work on the 'internals'. For instance, when React came out I couldn't wrap my head around it, so I started my own JS framework, and it ended up almost exactly like React (then I dumped it, as it ended up just being a learning exercise).
I'm trying to become a top 0.01% JS user and creating linters, flavors, etc is my plan as well. I've read through and annotated the React codebase but it didn't stick very well. I would have done better to create my own framework! I keep having to relearn that lesson... I can have a lot of knowledge about a thing through reading, but knowledge of the thing requires some practical application.
A tangent, but as it relates to that, if anyone reading has ideas on how to apply traditional computer science curriculum, I would love to hear it. I can think of toy CPU emulators, system architecture diagramming, language creation... But not sure if there's a thing I can build that would say, "I understand computer science."
In the context of this article -- it's mostly just talking about Python-specific ASTs.
Reading this article might be confusing to someone who's trying to learn what an AST is. ASTs are not unique to Python, they're just a common data structure used in compiler design.
ASTs are used by compilers like this:
1) A compiler will take source code and process it into little pieces called tokens (e.g., a number, an equals sign, a variable type, etc.) with a little program called a "lexer".
2) Then, those tokens are processed by a "parser" -- a little program that takes as input the tokens from the lexer, along with a description of the programming language (e.g. a context-free grammar in Backus-Naur form), and outputs an AST.
3) Then finally, the AST nodes are walked and machine code is generated.
This article hooks into the AST inside the Python "compiler" between steps 2 and 3 to do some analysis on the AST instead of converting it to something that can be executed (e.g. machine code or some other IR). That's a very useful thing, but probably not a good introduction to compilers.
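Just to make steps 1 and 2 concrete, the standard library lets you peek at both stages (CPython's compiler doesn't literally call the tokenize module internally, but the output mirrors what its lexer and parser produce):

```python
import ast
import io
import token
import tokenize

source = "answer = 40 + 2\n"

# Step 1: the lexer turns raw text into tokens.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(token.tok_name[tok.type], repr(tok.string))

# Step 2: the parser turns those tokens into a tree.
print(ast.dump(ast.parse(source)))
```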
If you're new to compilers, I suggest staying away from the Python "ast" module until you're comfortable with general compiler design. Maybe start with playing around with something like PLY instead -- create a simple little language yourself and write a compiler for it:
I maintain some code that relies on the Python AST for finding and packaging modules with appropriate class signatures when building customer-specific distributions. It works really well most of the time. And it is a lot easier to maintain than 50+ separate wheel definitions.
The one big drawback is that the AST for even trivial code patterns has a history of changing between Python versions. This makes it more annoying than usual to support multiple versions at the same time. Luckily 3.9 and 3.10 haven't brought any changes that impacted my codebase, as far as I've noticed.
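A rough sketch of that kind of scan; PluginBase is a hypothetical marker base class standing in for whatever class signature the packaging code actually looks for:

```python
import ast
from pathlib import Path

def modules_with_plugin_classes(root: str):
    """Return (file, class name) pairs for classes that inherit from PluginBase."""
    hits = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
                if "PluginBase" in bases:  # hypothetical marker base class
                    hits.append((path, node.name))
    return hits
```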
The only major changes that I'm aware of since Python 3 have been the change with keyword arguments in 3.6, and the deprecation of Index and introduction of Constant more recently. Those are breaking changes, but relatively small and maintainable IMO. What challenges have you faced?
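For literals, a small shim like this (just a sketch) is usually enough to keep a visitor working across the old and new node shapes:

```python
import ast

def int_literal_value(node):
    """Value of an integer literal node, tolerant of old and new AST shapes."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):  # 3.8+
        return node.value
    if node.__class__.__name__ == "Num":  # pre-3.8 parsers emitted ast.Num
        return node.n if isinstance(node.n, int) else None
    return None
```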
That's a pretty awesome read, and the approach is pretty flexible.
I've written simple code using the AST-visitor approach to enforce some common standards on code within our company. Simple things like ensuring that when we use Troposphere to generate AWS CloudFormation templates we always set up some specific values. (For example, I wrote a checker to ensure that every time an ECR instance is created we must enable ScanOnPush, or that every time we declare a security group we must have a comment "[cloudformation] ..." with it, so that manual edits stand out.)
The stuff that ASTs let you do really flexibly is almost always lost on people because they're not aware of it. A lot of other developers would try to do this with string or regex matching, and that often leads to painful experiences. With an AST visitor, it's easy to enforce a rule like:
    A call to function "Foo"
    must always have an argument matching the regexp "/blah/";
    otherwise raise an error.
And they're so lightweight you can add them to any CI/CD/automation steps in your repository. Once you get a few things like that, or validating naming-standards, you can roll them up into a simple "linter".
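A minimal sketch of that exact rule with ast.NodeVisitor (Foo and /blah/ being the placeholder names from above):

```python
import ast
import re
import sys

PATTERN = re.compile(r"blah")

class FooArgChecker(ast.NodeVisitor):
    """Flag calls to Foo() that lack a string argument matching /blah/."""

    def __init__(self):
        self.errors = []

    def visit_Call(self, node):
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "Foo":
            ok = any(
                isinstance(arg, ast.Constant)
                and isinstance(arg.value, str)
                and PATTERN.search(arg.value)
                for arg in node.args
            )
            if not ok:
                self.errors.append(f"line {node.lineno}: Foo() call without a /blah/ argument")
        self.generic_visit(node)

if __name__ == "__main__":
    checker = FooArgChecker()
    checker.visit(ast.parse(open(sys.argv[1]).read()))
    print("\n".join(checker.errors))
    sys.exit(1 if checker.errors else 0)
```

The non-zero exit status is what lets a pre-commit hook or CI step fail the build.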
I really wasn't expecting anyone to read all of it; I was afraid people would either find it too trivial or too complex depending on skill level. So that's great to hear.
One thing I don't understand from the article: "Hopefully this explains why we explicitly need to tell each variable whether it is in a load or store context in the AST."
Why does the Name() need to know its own context? It seems like the manner in which Name() is used, whether load, store, or delete, is fully determined by the Name()'s parent and whether the parent uses the Name() as a target or as a value.
I can't think of any exceptions to this right away, so you might be right.
But not having ctx available as an explicit value would make certain AST manipulations, as well as bytecode generation by the interpreter, more complicated. This is because in both scenarios, checking the context would involve looking at the parent instead of a child, and finding a node's parent in a tree is often much harder than finding its child.
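A quick way to see why that's convenient, using nothing but the stdlib:

```python
import ast

tree = ast.parse("x = y")
print(ast.dump(tree))
# The assignment target is Name(id='x', ctx=Store()) and the value is
# Name(id='y', ctx=Load()): identical node types, told apart by ctx alone,
# without ever having to walk back up to the parent Assign node.
```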
Yeah, that's not really explained in the article, and it's not explained in the `ast` docs either [1]. For reference, the full list of assignment targets is: Name, Attribute, Subscript, Starred, Tuple, and List.
Perhaps it's just neater to have the context right there on every one of them.
Also, from a pragmatic standpoint, when actually processing the AST and analyzing it semantically, you're pretty much always going to be handling expressions and patterns (i.e. rvalues/lvalues) differently, because they mean completely different things! And having the context right there makes them more convenient to handle. So when designing an AST datatype, you could just as well not include the context in Name() and it'd be fine, but the Python `ast` module's primary use case is compiling the AST to bytecode, where it's more convenient to just have that info around, so that's what they did.
This was a really nice read! The best part was learning that there’s no need to actually parse tokens when building a Python linter (well, maybe there’s an exception) because you can leverage the already parsed AST or CST.
True! Although there are some lints that would require you to look at tokens, such as checking for single vs. double quotes, or the number of spaces used for indentation.
However, Python has a built-in tokenize module for that as well.
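For example, quote style is gone by the time the AST is built (a string literal is just a Constant value), but it's right there at the token level. A small sketch:

```python
import io
import token
import tokenize

def single_quoted_strings(source: str):
    """Return (line, token) pairs for string literals written with single quotes."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.STRING and tok.string.lstrip("rbufRBUF").startswith("'"):
            hits.append((tok.start[0], tok.string))
    return hits

print(single_quoted_strings("a = 'single'\nb = \"double\"\n"))  # [(1, "'single'")]
```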