Basically, language models are already graph neural networks. To understand why, go back to Word2Vec/GloVe: word embedding distances represent the co-occurrence frequency of words within a sentence.
Note how this is the same as a graph embedding problem: words are nodes, the edge weight is co-occurrence frequency, and you embed the graph nodes. In fact, this is stated in formal math in the GloVe paper.
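Rough sketch of what I mean (toy numpy, not GloVe's actual weighted least-squares objective, and the sentences/words are made up): treat words as nodes, sentence co-occurrence counts as edge weights, and factorize the adjacency matrix to get node embeddings.

    # Toy illustration (not GloVe itself): words = nodes, co-occurrence = edge weights,
    # low-rank factorization of the adjacency matrix = node embeddings.
    import numpy as np
    from itertools import combinations

    sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}

    # adjacency matrix: edge weight = how often two words co-occur in a sentence
    A = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for w1, w2 in combinations(s, 2):
            A[idx[w1], idx[w2]] += 1
            A[idx[w2], idx[w1]] += 1

    # embed the nodes: factorize log(1 + counts), a crude stand-in for GloVe's objective
    U, S, _ = np.linalg.svd(np.log1p(A))
    emb = U[:, :2] * S[:2]          # 2-d node embeddings
    print(dict(zip(vocab, emb.round(2))))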
The LLM architecture is basically doing the same thing, except the graph encodes conditional occurrence given the preceding context words.
This setup makes for a graph with a truly astronomical number of nodes (word | context) and edges. That graph exists only in the land of abstract math, but it also shows why LLMs require so many parameters to perform well.
In any case, 4 years on, I'm still pretty lukewarm on the current gen of graph neural network architectures.
Case in point: the OP paper is pretty much the classic ML paper-mill setup of "take some existing algorithm, add some stuff on top of it, spend a ton of time hyperparameter searching on your algo, and show it beats some 2-year-old baseline".
Like you said, language models only started "working" for text generation once (and not before) they used attention, that is, adjacency matrices (making them GNNs). GloVe and Word2Vec did well for aggregate word representations over an entire corpus, but those representations were not sufficient to generate text. In other words, they were an incomplete representation of language. It turns out you need non-Euclidean operations to represent language in enough detail to generate text.
An embedding into a Euclidean space with a distance is a compression, and that compression can be decoded back into a graph which is by necessity symmetric. As such, its representative power for complex graphs is limited.
That is why, in an LLM, the "embeddings" are not Euclidean. They are a combination of operations on adjacency matrices.
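To make the "operations on adjacency matrices" point concrete, here is a minimal sketch of one attention head (random weights, toy sizes, no masking or multi-head details): it builds a soft, generally asymmetric adjacency matrix over the tokens and does one round of message passing on it.

    # One attention head = one step of message passing on a learned, dense,
    # asymmetric adjacency matrix over the tokens.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    n, d = 5, 8                          # 5 tokens, toy hidden size 8
    X = np.random.randn(n, d)            # token states
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # n x n adjacency, rows sum to 1
    out = A @ (X @ Wv)                   # aggregate "neighbor" messages

    # Note that A[i, j] != A[j, i] in general, unlike a Euclidean distance matrix.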
This goes for any interesting graph: for a triplet of nodes, the relations need not follow the triangle inequality, so embedding the triplet into a 2D space and applying a distance (e.g. cosine) will not work exactly. This generalizes to larger networks. For tree structures, you need a hyperbolic embedding. For non-tree graphs, there is no general Euclidean node embedding that works, and you have to rely on the high dimensionality of the embedding vector.
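A quick numerical check of the tree claim (toy example of my own, using the classical-MDS criterion): the path metric of a star with three leaves cannot be realized exactly in any Euclidean space, no matter the dimension.

    # Star tree: d(center, leaf) = 1, d(leaf, leaf) = 2. A metric is Euclidean
    # iff the double-centered squared-distance matrix is positive semidefinite.
    import numpy as np

    D = np.array([[0, 1, 1, 1],          # node 0 = center, nodes 1-3 = leaves
                  [1, 0, 2, 2],
                  [1, 2, 0, 2],
                  [1, 2, 2, 0]], dtype=float)

    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J          # Gram matrix from classical MDS
    print(np.linalg.eigvalsh(G))         # the negative eigenvalue -> not Euclidean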
This becomes icky when you have more difficult problems, such as embedding a graph onto a geometric structure, a manifold, or a more complex graph structure.
LLM embeddings were inferior even to Word2Vec for 2-3 years, even though the models were obviously very good at understanding language. The reason is simple: the models do not produce a single vector per token tied to a single Euclidean space. There are, in fact, many such spaces within an LLM. Early BERT embeddings tried to find that single space through some linear combination of token activations, and failed.
Point being: graph embeddings are important. There ARE reasons to embed a graph, since graphs can become huge (recommender systems, social graphs, materials, 3D modeling, even certain knowledge graph structures). The moving parts are both the underlying data and the desired space in which to embed it.
Whether, in many cases, not all path lengths need to be considered and a local factorization is sufficient is another matter. But that's still a "GNN".
We can show that transformer-like GNNs or factorizations are likewise not sufficient to represent arbitrary graph structure.
Yet any modern LLM is literally a GNN, and that's just for text. Now, if you add a huge knowledge or process graph on top of that, it stands to reason that you might profit from considering how to represent this additional layer of relationships between text tokens.
As for LLM embeddings being inferior: the quality of embeddings really depends on the use case. Embedding is a lossy form of compression.
In more than half the cases I've seen, a 32-dimensional GloVe embedding performs just as accurately as the 4000-dimensional LLM embeddings some people default to. The 32-d embedding even performs better once you factor in system performance (throughput, costs, etc.).
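Just to make the cost side of that trade-off concrete (random vectors, made-up corpus size, says nothing about retrieval quality): brute-force cosine search at 32 dims vs 4096 dims.

    # Throughput/memory only: cosine search over 50k random unit vectors.
    import time
    import numpy as np

    def bench(dim, n_docs=50_000):
        rng = np.random.default_rng(0)
        docs = rng.standard_normal((n_docs, dim), dtype=np.float32)
        docs /= np.linalg.norm(docs, axis=1, keepdims=True)
        query = rng.standard_normal(dim, dtype=np.float32)
        query /= np.linalg.norm(query)
        t0 = time.perf_counter()
        best = int(np.argmax(docs @ query))    # cosine = dot product on unit vectors
        return time.perf_counter() - t0, docs.nbytes / 1e6, best

    for dim in (32, 4096):
        secs, mb, _ = bench(dim)
        print(f"{dim:5d} dims: {secs * 1000:7.2f} ms per query, {mb:8.1f} MB index")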
Exactly, because GloVe embeddings directly generate a linear space. If I remember the paper correctly, this is explicitly their objective. That corresponds to a single, symmetric relationship between two tokens, so a rather simple network, embeddable into few dimensions without much loss.
And it has been shown time and time again that GloVe embeddings work really well. The underlying relational structure is linear, as mentioned, so they are for example amenable to linear operations, aggregation, etc.
By contrast, LLM embeddings are quite lossy, as they somehow have to be derived from an underlying non-linear space.
For instance, to represent a sentence from its tokens, it does not work to just sum or average them. You need to train an extra aggregation model on top, with new data, that builds a good embedding space from the LLM activations.
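A structural sketch of what such an aggregation head looks like (random weights standing in for a trained head, hidden size and token count are assumptions): mean pooling vs a learned attention-pooling layer over per-token hidden states.

    # Mean pooling vs a small learned pooling head over per-token activations H.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    T, d = 12, 768                       # 12 tokens, hidden size 768 (assumed)
    H = np.random.randn(T, d)            # per-token activations from some LLM layer

    mean_pooled = H.mean(axis=0)         # the naive "just average" baseline

    # Learned pooling: a trainable query vector scores each token and the sentence
    # embedding is the weighted sum. In practice q (plus a projection) is trained
    # on extra data, e.g. with a contrastive objective.
    q = np.random.randn(d)
    weights = softmax(H @ q / np.sqrt(d))
    attn_pooled = weights @ H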
However, there is a limit to what GloVe and Word2Vec can represent. Simple things like polysemous words are already a limitation. On the sequence level, especially for longer passages, LLM-based sentence embeddings give you much more information if you need it (insofar as the relational semantics of a bag-of-words approach are not sufficient).
Yes, transformers are a class of graph NN, but there can still be value in dealing directly with both text and knowledge graphs (for RAG-style applications, presumably), because the graph enforces additional constraints.
It is like the difference between concrete and abstract syntax: LLMs frequently generate code that won't compile, since they predict tokens, not AST nodes. They are underconstrained for the task.
How to address this? You can train a single model to handle both, as the authors did, or you can enforce the constraints manually while decoding.
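For the second option, the general shape of constrained decoding looks roughly like this (not tied to any particular library; logits_fn and allowed_fn are hypothetical stand-ins for the model and an incremental grammar check):

    # At each step, mask out every token the grammar says is illegal, then pick
    # from what's left. Greedy selection for simplicity.
    import numpy as np

    def constrained_decode(logits_fn, allowed_fn, vocab, max_len=50):
        out = []
        for _ in range(max_len):
            logits = logits_fn(out)                      # model's next-token scores
            mask = np.array([allowed_fn(out, t) for t in vocab])
            logits = np.where(mask, logits, -np.inf)     # forbid grammar-breaking tokens
            nxt = vocab[int(np.argmax(logits))]
            if nxt == "<eos>":
                break
            out.append(nxt)
        return out

    # allowed_fn(prefix, token) would ask an incremental parser whether
    # prefix + token can still be completed into a valid AST.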
Graphs are not only the future, they're the present, happening behind closed doors where you don't get to hear about it. The things that can be done already are insane and the capabilities are growing quickly. Agentic graph-based systems are the future for ASI.
Anyone using the word "agentic" has jumped on the loony train. People use it to dismiss AI systems they personally dislike, claiming they're not sufficiently "agentic". In reality there is a wide spectrum of how a system uses AI, and there is no hard line separating agentic from non-agentic.
I would argue that "agentic" means the LLM writing and running code at will.
Eric Schmidt recently commented that he thought three things would bring profound change: (a) an infinite context window, (b) chain of thought reasoning in agents and (c) the text-to-action capacity for programming.
In that view, agentic is mostly about (b) and (c). Having a way to chain together steps (which is just looping with the right context) and the ability to write and run code allows an LLM to take action on the things it is given to "do".
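A minimal sketch of that loop (run_model and run_code are hypothetical helpers, and the RUN: convention is made up for illustration): keep feeding the growing context back in, and let the model decide when to execute code versus answer.

    # "Looping with the right context" plus the ability to run code.
    def agent_loop(task, run_model, run_code, max_steps=10):
        context = [f"Task: {task}"]
        for _ in range(max_steps):
            reply = run_model("\n".join(context))        # LLM sees the full history
            if reply.startswith("RUN:"):                 # model asked to execute code
                result = run_code(reply[len("RUN:"):])
                context += [reply, f"Result: {result}"]  # feed the outcome back in
            else:
                return reply                             # model considers the task done
        return "\n".join(context)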
> Having a way to chain together steps (which is just looping with the right context) and the ability to write and run code allows an LLM to take action on the things it is given to "do".
What you describe is basically just a standard LLM workflow. It's not as if the LLM is evolving the workflow itself as it goes.