OK, but highlighting the code correctly has essentially nothing to do with context free grammars.
This happens in Vim and Emacs with languages other than C and C++ -- here docs in shell, multiline strings in Python -- and Python does use a CFG, etc.
I agree it's annoying although I think most people view it as a minor thing. They stick with Vim and Emacs for other reasons.
I'm not sure anyone has based their language design around Vim/Emacs syntax highlighting, although ironically that is one of my criteria for language design. I was just confused by the advice to use a CFG, since it's not the relevant issue.
I would say the relevant issue is that your lexer shouldn't be too clever or have too many modes. You should also avoid mixing languages in the same file, or else use a very obvious lexical construct to mix them.
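To make the "modes" point concrete, here is a minimal sketch in Python. It is a toy, not how Vim or Emacs actually work: the `classify_lines` function and its line-based rule are made up for illustration, and it assumes triple-quoted strings are the only multi-line construct. It shows that once the lexer has a mode that spans lines, a highlighter that starts scanning in the middle of the file gets the wrong answer.

```python
TRIPLE = '"""'

def classify_lines(lines, in_string=False):
    """Label each line 'string' or 'code', carrying one lexer mode across lines."""
    labels = []
    for line in lines:
        # A line is shown as a string if we are already inside one,
        # or if a delimiter appears on it.
        if in_string or TRIPLE in line:
            labels.append("string")
        else:
            labels.append("code")
        # An odd number of delimiters on the line flips the mode.
        if line.count(TRIPLE) % 2 == 1:
            in_string = not in_string
    return labels

source = [
    'x = 1',
    'doc = """first line',
    'if this_looks_like_code:',
    'end of string"""',
    'y = 2',
]

# Scanning from the top of the file tracks the mode correctly.
full = classify_lines(source)

# Scanning from line 2 onward (say, near the viewport) starts in the wrong mode.
partial = classify_lines(source[2:])

print(full)     # ['code', 'string', 'string', 'string', 'code']
print(partial)  # ['code', 'string', 'string']
```

The grammar of the language never enters into it. The mistake comes purely from not knowing which lexer mode is active at the point where scanning starts.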
The C preprocessor is an entirely separate language from C or C++, so that is the core of the issue in your example. Likewise, it is usually hard to highlight CSS and JavaScript embedded within HTML.
This is incorrect. Some languages have user-defined tokens. Some have contextual keywords. Both require a semantic understanding of the code to highlight them correctly.
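A small sketch of the contextual-keyword case, using Python's `match` (a soft keyword since Python 3.10) as the example. The `KEYWORDS` table and `keyword_spans` helper are hypothetical, standing in for the keyword lists that purely lexical highlighters typically use.

```python
import re

# Hypothetical keyword table, as a purely lexical highlighter might have.
KEYWORDS = {"if", "else", "match", "case"}
WORD = re.compile(r"[A-Za-z_]\w*")

def keyword_spans(line):
    """Return (offset, word) for every word the table treats as a keyword."""
    return [
        (m.start(), m.group())
        for m in WORD.finditer(line)
        if m.group() in KEYWORDS
    ]

# Here 'match' really is acting as a keyword.
print(keyword_spans("match command:"))  # [(0, 'match')]

# Here 'match' is an ordinary variable, but the table cannot tell.
print(keyword_spans("match = 1"))       # [(0, 'match')]
```

Both lines get the same answer because a word list has no notion of where the word sits in the statement. Telling the two apart requires at least some parsing.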
And it isn't just the preprocessor with C++. There's the >> problem (the same two characters can close two nested templates or be the right-shift operator). It's not just me talking through my hat: tools for C++, such as pretty-printers and refactoring tools, have been very slow to appear, and fragile. But with a language like Java the tooling is quick and easy to write.
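For readers who haven't run into the >> problem, a toy illustration follows. It is a sketch of a greedy ("maximal munch") lexer written in Python, not any real C++ front end; the `TOKEN` pattern and `lex` function are made up for the example.

```python
import re

# Longest operators come first, so '>>' wins over '>' (maximal munch).
TOKEN = re.compile(r"\s*(>>|[A-Za-z_]\w*|\d+|[<>,])")

def lex(text):
    """Split text into tokens, always taking the longest match."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(lex("x >> 2"))
# ['x', '>>', '2']

print(lex("vector<vector<int>>"))
# ['vector', '<', 'vector', '<', 'int', '>>']
```

In the second case the lexer emits a single shift token where two template closers were intended. Before C++11 you had to write `> >` with a space; since C++11 the parser is required to treat `>>` as two closers in that context, which means the token stream alone no longer tells you what the characters mean.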
You don't have to believe me. Write a tool that reads C++ source code and inserts boilerplate at the beginning and end of each function, and works 100% of the time.
I know exactly what problem you are talking about. It's exactly the problem that Clang solves.
With the Clang front end, you can write a tool to read C++ source code and insert boilerplate at the beginning and end of each function, and it will work 100% of the time. There are dozens of such tools in active use at Google and I'm sure many other places.
But it has nothing to do with context free grammars -- really. Clang uses a recursive descent parser. GCC used to use a yacc-style grammar (which BTW is only context free-ish because of semantic actions), but it could NOT perform the task you are talking about. In fact that was largely the motivation for Clang.
It also doesn't really have to do with syntax highlighting as practiced by any editor or IDE I know of. Even though Clang has the power that you want ("semantic understanding"), I don't know any editor that uses it for syntax highlighting.
Instead they use approximate lexical hacks. This is probably because of the need to highlight partial files and malformed files, as I mentioned. You don't want your syntax highlighting to turn off in the middle of typing a code fragment.
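As a rough sketch of what such a lexical hack looks like, here is a regex-based highlighter in Python for a C-like fragment. The rule table is invented for illustration and is far smaller than any real editor's syntax file. The point is that it never rejects its input: characters it doesn't recognize are skipped, and an unterminated string is still colored as a string.

```python
import re

# Hypothetical rule table; order matters because the first alternative wins.
RULES = [
    ("comment", r"//[^\n]*"),
    ("string", r'"(?:\\.|[^"\\\n])*"?'),  # closing quote is optional on purpose
    ("number", r"\d+"),
    ("keyword", r"\b(?:int|return|if|else)\b"),
    ("ident", r"[A-Za-z_]\w*"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def highlight(text):
    """Return (kind, lexeme) pairs; anything unrecognized is silently skipped."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(text)]

# Unbalanced parenthesis, unbalanced brace, unterminated string.
fragment = 'int main( { return "unterminated'

for kind, lexeme in highlight(fragment):
    print(kind, repr(lexeme))
# keyword 'int'
# ident 'main'
# keyword 'return'
# string '"unterminated'
```

A real parser would report an error on this fragment, which is exactly what you don't want while someone is halfway through typing a line.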
But editors DO use Clang for semantic understanding, e.g. the YCM plugin for Vim.
But they use CFGs for NEITHER problem. You're conflating two different issues and suggesting the wrong solution for both of them.
There are a lot of links about this issue with regard to languages like C#, Scala, Go, JavaScript, etc. in the wiki page I linked.
I agree with your general point about language design, but the terminology you're using is wrong and confusing.
Yes, and Clang appeared on the scene 20 years after C++ did. It's a long wait. If you create a new language, are you willing to wait 20 years for tooling?
I agree C++ is too hard to parse, and you should design something simpler. Simpler isn't the same thing as a context-free grammar. The issues you are pointing out are lexical (Python has a CFG but still has imprecise syntax highlighting in editors).
> A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting
I disagree with this because it's wrong. People don't use Clang or context-free grammars to syntax highlight code. Java has a CFG -- who uses it to syntax highlight code?
This conversation isn't very interesting because it's just me explaining the same thing to you over and over again. Your head is stuck in the mode of "expert" and not somebody who is curious and wants to learn something.