I can't speak to whether or not Nature should publish this, but I don't find the outcomes in the paper spectacular. In brief, they create an ontology -- in the initial examples, four made-up color words, each corresponding to a real color, and then three made-up 'function' words representing 'before', 'in the middle of' and 'triple'.
They then ask humans and their AI to generate sequences of colors based on short and progressively longer strings of color words and functions.
The AI is about as good at this as humans, which is to say 85% successful for instructions longer than three or four words.
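To make the setup concrete, here's a minimal sketch of the kind of mini-language this describes. The pseudo-words, the color mapping and the precedence rules below are stand-ins I made up for illustration; the paper's actual vocabulary and semantics differ.

    # Hypothetical vocabulary -- not the paper's actual word-to-color mapping.
    COLOR = {"dax": "RED", "wif": "GREEN", "lug": "BLUE", "zup": "YELLOW"}

    def interpret(words):
        # Color words become one-circle sequences; function words stay as strings.
        seq = [[COLOR[w]] if w in COLOR else w for w in words.split()]

        # 'fep' = 'triple': repeat the sequence to its left three times.
        out = []
        for item in seq:
            if item == "fep":
                out[-1] = out[-1] * 3
            else:
                out.append(item)
        seq = out

        # 'blicket' = 'in the middle of': x blicket y -> x y x.
        out, i = [], 0
        while i < len(seq):
            if seq[i] == "blicket":
                x = out.pop()
                out.append(x + seq[i + 1] + x)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

        # 'kiki' = 'before': x kiki y -> evaluate y before x.
        if "kiki" in seq:
            k = seq.index("kiki")
            flat = lambda parts: [c for p in parts for c in p]
            return flat(seq[k + 1:]) + flat(seq[:k])
        return [c for p in seq for c in p]

    print(interpret("dax fep"))           # ['RED', 'RED', 'RED']
    print(interpret("wif blicket lug"))   # ['GREEN', 'BLUE', 'GREEN']
    print(interpret("dax kiki wif fep"))  # ['GREEN', 'GREEN', 'GREEN', 'RED']

The point is that the target behaviour is a small compositional rewriting system; the interesting question is how well humans and models infer it from a handful of examples.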
The headline says that GPT4 is worse at this than MLC. I am doubtful about this claim. I feel a quality prompt engineer could beat 85% with GPT4.
The claims in the document are that MLC shows similar error types to humans, and that this backs some sort of theory of mind I don't know anything about. That's as may be.
I would be surprised if this becomes a radical new architecture based on what I'm reading. I couldn't find a lot of information as to the size of the network used; I suppose if it's very small, that might be interesting. But this reads to me very much like an 'insider' paper, its target being other academics who care about some of these theories of mind, not people who need to get sentences like 'cax kiki pup wif gep' turned into real color circles right away.
>I feel a quality prompt engineer could beat 85% with GPT4.
I generally don't pull the "you're not prompting it good enough" card unless I have direct experience with the task, but this does raise a few flags: "between 42 and 86% of the time, depending on how the researchers presented the task."
I would have liked them to show how they presented this task, at least something more than a single throwaway line, given how integral it is to the main claim.
I don't find this argument persuasive. If you need to engineer the prompt in order for the model to give you the answer you want, and if it's performing poorly without that kind of assistance from a human who understands the task and the desired outcome... then it's perfectly OK to not bother and give the LLM a low mark on a benchmark. After all, the whole point was to test the LLM's language and reasoning skills.
"Prompt engineering" makes sense if your goal is more utilitarian, but it's not a license to cheat on tests. If you want a machine learning model to generate aesthetically pleasing images, or to follow certain rules in a conversation, then it's OK to rely on hacky solutions to get there.
I kind of dislike the term "Prompt Engineering". It mystifies what is, to me, an ordinary process. I'm not talking about chanting magic words; I'm just talking about taking the structure of an LLM into account, even taking how humans would approach the problem into account.
In Microsoft's AGI paper, there is a planning test they put to GPT-4. They give detailed instructions on the constraints of a poem and expect it to spit out the constrained poem in one shot. Of course it fails, and the conclusion is that it can't plan. But if you think about it, this is a ridiculous assertion: no human on earth would pass that test with working memory alone. Even by our standards it's a weird way to present the information, but they did so anyway. This is actually a major issue with most planning benchmarks for LLMs I've come across.
These are the kinds of problems I'm talking about. If I changed the request to encourage a draft/revise generation process and it consistently passed these tests, then it's very fair to say it can plan. That's not "hacky"; it's just the truth, and saying otherwise is denying real-world usage out of a misplaced sense of propriety toward benchmarks. A benchmark is only as useful as the capabilities it can assess.
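For what it's worth, a bare-bones sketch of the draft/revise loop I mean. `call_llm` is a hypothetical placeholder, not any real client or API:

    def call_llm(prompt: str) -> str:
        # Placeholder: plug in whatever chat model / API client you actually use.
        raise NotImplementedError

    def constrained_poem(constraints: str, rounds: int = 2) -> str:
        # First get a draft, then repeatedly critique and revise it.
        draft = call_llm(f"Write a poem satisfying these constraints:\n{constraints}")
        for _ in range(rounds):
            critique = call_llm(
                "List every constraint the poem below violates.\n"
                f"Constraints:\n{constraints}\n\nPoem:\n{draft}"
            )
            draft = call_llm(
                "Revise the poem to fix the listed violations.\n"
                f"Constraints:\n{constraints}\n\nViolations:\n{critique}\n\nPoem:\n{draft}"
            )
        return draft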
a "this is how this is presented, this is how the model performed" would have been very prudent in my opinion. Maybe these guys already covered the kind of things i'm talking about(i can accept that!)....but i can't know that if they don't tell us.
how is "Engineering" "about chanting magic words"? The wires may be getting crossed on if it's an engineering discipline or just the use of the word engineer, meaning to build mindfully, I feel the latter is appropriate and the former more misguided but still it's not "Prompt Wizarding" ;)
Precisely! You’re trying to learn the art of poking the box in the right way. As an engineer, I really despise the bastardisation of the term to refer to jobs or tasks that are inherently non-scientific.
I think adjusting the prompt makes sense. We have to keep in mind that this implicitly happens on the human side as well. The person creating the initial question will consider how understandable it is for them and edit until they're happy. Then the questions will be reviewed by co-authors and possibly tested on a few people before running the full experiment. They'd probably review the explanation again or change the task if they realised half the people can't follow the instructions.
We just call it "editing for clarity" rather than "prompt engineering" when we write for other people.
I would claim that most people are very bad at communicating in a way that can be interpreted, by human or AI, in a single shot. Most of the time some back and forth is required to clear things up. I think it's unfair to expect gold from garbage input, since meat brains can't do that either without some back and forth.
Those prompts should be available. It's ridiculous that they're not.
So what if we have an AI prompter-assister... we take an input prompt and run it through an AI to transform it into proper AI-talk; then that is what's fed to the actual AI... That way the AI will always have the best prompt possible, because the prompt-helper helps your prompt...
While I think that's a valid point, for the purposes of the paper, having a competent prompt writer would be needed to make a fair comparison.
I think the researchers would know how to get the best from their own AI bot, so that's a level of competency that should be extended to comparisons; otherwise user competency becomes a source of bias. I do feel you're correct in your concerns, though: the systems shouldn't need experts to use them, nor should they need the user to already know the right answer. Which leads me to my next point:
When it comes to real world expectations, perhaps instead we need a large group of random people (with no prior experience) working with each bot to complete a set of tasks in order to determine how it truly performs - something that could be enhanced if the answers weren't easy to check.
Disagree - a model may have capabilities that it only deploys in certain circumstances. In particular, if you train on the whole internet, you probably learn what both stupid and smart outputs look like. But you probably also learn that the modal output is stupid. So it’s no ding on the model’s capabilities if it defaults to assuming it should behave stupidly.
What you're basically saying is "the model as trained can't do well at this task, so let's use our own cognitive skills to help it."
That's a problem for comparative benchmarking, right? You're no longer testing the model; you're testing the model in tandem with the prompt engineer. This raises several big questions:
1) How do you know that the engineer's knowledge of the "correct" answer isn't being subtly encoded in the prompt? Essentially the Clever Hans phenomenon.
2) If you want this to be a fair game, how do you give precisely the same kind of advantage to whatever you're benchmarking the LLM against?
3) Last but not least, if you're not going to throw in your prompt engineer for free with the product, will your results be reproducible by your customers?
To be clear, I don't think there's some cosmically objective way of doing this. If you're using prompts written for humans, you're already putting the computer at a disadvantage. But at the very least, you're measuring something meaningful: how the model will behave in the real world.
The supplement discusses how they presented the task. Notably, they first gave all the training examples and then told the model what the task was, without asking it to reason or build up any context before spitting out the answers. So basically the simplest way to interact with it, but probably not a great way to get solutions to this problem if that was the task at hand.
Agreed. And there are going to be some significant domain decisions for GPT-4 to consider: Will the made-up words be existing single tokens? Will those single tokens be heavily overloaded tokens, e.g. a letter of the alphabet? Will the representation of the circles be, say, "B" for blue, or something else?
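These questions are easy to poke at empirically. A quick check with the tiktoken package (the made-up words here are just illustrative, not necessarily the paper's actual vocabulary):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    for word in ["dax", "wif", "kiki", "B", "blue circle"]:
        ids = enc.encode(word)
        print(f"{word!r:>14} -> {len(ids)} token(s): {ids}")
    # Single letters like "B" map to one heavily reused token, while nonsense
    # words often split into several sub-word pieces -- which changes what
    # GPT-4 actually "sees" compared to the human-readable version of the task.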
Along with this are questions as to whether you're going to treat GPT-4 as a zero-shot or multi-shot oracle; while they have this idea of 'context-free' challenges in the paper, they, crucially, train their network first.
Anyway, I like this paper less the more I think about it.
The article suggested that GPT-4 FAILED between 42 and 86 percent of the time. So its success rate would be 14 to 58 percent, which, compared to the novel NN's 80 percent, seems significant.
1. The level of variance displayed here is very atypical for a language model unless you're making significant changes to how the information is presented. This alone is cause for more elaboration.
2. With this big a variance already showing, it's hard to say for sure that they hit the top end.
Maybe it seems strange, but most LLM evaluation papers still don't bother to apply even some extremely well-known, basic "output boosters" like chain of thought and so on. But at least in those instances, we know how they presented it.
a "this is how we presented the task, this is how it performed" would have been very nice.
But isn't that their point? That their new neural network needs less prompt crafting because it is partly better at reading the intent and meaning of words and partly less likely to hallucinate?
Sure, I also think GPT-4 can be brute-forced into generating correct answers with a very precise prompt, but that's not how you would describe the problem to a human, and I assume the humans performing well in this test weren't prompted that way either, but were given a fairly informal description of the exercise.
>I couldn't find a lot of information as to size of the network used
In the original paper,[0] in the section called 'Implementation of MLC', there's a description of sorts (Greek to me):
"Both the encoder and decoder have 3 layers, 8 attention heads per layer, input and hidden embeddings of size 128, and a feedforward hidden size of 512. Following GPT63, GELU64 activation functions are used instead of ReLU. In total, the architecture has about 1.4 million parameters."
If I am evaluating how intelligent an AI is against a human then IMO it's only fair they both get the same prompt.
It should be counted as a fault of the AI if it can't understand the prompt as well as the human, or another AI. That's all part of being a useful entity: how well you understand a prompt and how well you can infer the original meaning from the prompt you have been given.
Honestly I'm surprised by how low Nature has fallen. Is it even a useful signal of paper quality anymore? NeurIPS and ICLR aren't great, but in general I've found their work to be more rigorous than Nature's, despite the fact that they are shorter conference papers compared to Nature's journal papers.
Right now, sadly, the only useful signal in deep learning research is the research group. If OpenAI releases a paper, I know it's something good that works at scale; similarly, if Kaiming He, Piotr Dollar and team at Meta AI release a paper, it tends to be really good and SOTA. Google DeepMind maintains high quality; Google Brain has been more of a mixed bag. If Berkeley releases a paper, it has a 50% chance of going directly to trash; Stanford has a much lower percentage (I also disambiguate based on specific groups). Of course I'm going to be heavily biased and this system is not great, but conferences and journals have managed to become such a useless signal that I find this method to be more accurate.