I can't speak to whether or not Nature should publish this, but I don't find the outcomes in the paper spectacular. In brief, they create an ontology -- in the initial examples, four made-up color words, each corresponding to a real color, and then three made-up 'function' words representing 'before', 'in the middle of' and 'triple'.
They then ask humans and their AI to generate sequences of colors based on short and progressively longer strings of color words and functions.
The AI is about as good at this as humans, which is to say 85% successful for instructions longer than three or four words.
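To make the setup concrete, here's a minimal sketch of the kind of mini-language this describes. The pseudo-words, the color mapping and the precedence rules below are stand-ins I made up for illustration; the paper's actual vocabulary and semantics differ.

    # Hypothetical vocabulary -- not the paper's actual word-to-color mapping.
    COLOR = {"dax": "RED", "wif": "GREEN", "lug": "BLUE", "zup": "YELLOW"}

    def interpret(words):
        # Color words become one-circle sequences; function words stay as strings.
        seq = [[COLOR[w]] if w in COLOR else w for w in words.split()]

        # 'fep' = 'triple': repeat the sequence to its left three times.
        out = []
        for item in seq:
            if item == "fep":
                out[-1] = out[-1] * 3
            else:
                out.append(item)
        seq = out

        # 'blicket' = 'in the middle of': x blicket y -> x y x.
        out, i = [], 0
        while i < len(seq):
            if seq[i] == "blicket":
                x = out.pop()
                out.append(x + seq[i + 1] + x)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

        # 'kiki' = 'before': x kiki y -> evaluate y before x.
        if "kiki" in seq:
            k = seq.index("kiki")
            flat = lambda parts: [c for p in parts for c in p]
            return flat(seq[k + 1:]) + flat(seq[:k])
        return [c for p in seq for c in p]

    print(interpret("dax fep"))           # ['RED', 'RED', 'RED']
    print(interpret("wif blicket lug"))   # ['GREEN', 'BLUE', 'GREEN']
    print(interpret("dax kiki wif fep"))  # ['GREEN', 'GREEN', 'GREEN', 'RED']

The point is that the target behaviour is a small compositional rewriting system; the interesting question is how well humans and models infer it from a handful of examples.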
The headline says that GPT4 is worse at this than MLC. I am doubtful about this claim. I feel a quality prompt engineer could beat 85% with GPT4.
The claims in the document are that MLC shows similar error types to humans, and that this backs some sort of theory of mind I don't know anything about. That's as may be.
I would be surprised if this becomes a radical new architecture based on what I'm reading. I couldn't find a lot of information as to the size of the network used; I suppose if it's very small, that might be interesting. But this reads to me very much like an 'insider' paper, its target being other academics who care about some of these theories of mind, not people who need to get sentences like 'cax kiki pup wif gep' turned into real color circles right away.
>I feel a quality prompt engineer could beat 85% with GPT4.
I generally don't pull the "you're not prompting it good enough" card unless I have direct experience with the task, but this does raise a few flags: "between 42 and 86% of the time, depending on how the researchers presented the task."
I would have liked them to show how they presented this task, at least something more than a single throwaway line, given how integral it is to the main claim.
I don't find this argument persuasive. If you need to engineer the prompt in order for the model to give you the answer you want, and if it's performing poorly without that kind of assistance from a human who understands the task and the desired outcome... then it's perfectly OK to not bother and give the LLM a low mark on a benchmark. After all, the whole point was to test the LLM's language and reasoning skills.
"Prompt engineering" makes sense if your goal is more utilitarian, but it's not a license to cheat on tests. If you want a machine learning model to generate aesthetically pleasing images, or to follow certain rules in a conversation, then it's OK to rely on hacky solutions to get there.
I kind of dislike the term "Prompt Engineering". It mystifies what is, to me, an ordinary process. I'm not talking about chanting magic words; I'm just talking about taking the structure of an LLM into account, even taking how humans would approach the problem into account.
In Microsoft's AGI paper, there is a planning test they put to GPT-4. They give detailed instructions on the constraints of a poem and expect it to spit out the constrained poem in one shot. Of course it fails, and the conclusion is that it can't plan. But if you think about it, this is a ridiculous assertion: no human on earth would pass that test with working memory alone. Even by our standards it's a weird way to present the information, but they did so anyway. This is actually a major issue with most planning benchmarks for LLMs I've come across.
These are the kinds of problems I'm talking about. If I changed the request to encourage a draft/revise generation process and it consistently passed these tests, then it's very fair to say it can plan. That's not "hacky"; it's just the truth, and saying otherwise is denying real-world usage out of a misplaced sense of propriety toward benchmarks. A benchmark is only as useful as the capabilities it can assess.
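For what it's worth, a bare-bones sketch of the draft/revise loop I mean. `call_llm` is a hypothetical placeholder, not any real client or API:

    def call_llm(prompt: str) -> str:
        # Placeholder: plug in whatever chat model / API client you actually use.
        raise NotImplementedError

    def constrained_poem(constraints: str, rounds: int = 2) -> str:
        # First get a draft, then repeatedly critique and revise it.
        draft = call_llm(f"Write a poem satisfying these constraints:\n{constraints}")
        for _ in range(rounds):
            critique = call_llm(
                "List every constraint the poem below violates.\n"
                f"Constraints:\n{constraints}\n\nPoem:\n{draft}"
            )
            draft = call_llm(
                "Revise the poem to fix the listed violations.\n"
                f"Constraints:\n{constraints}\n\nViolations:\n{critique}\n\nPoem:\n{draft}"
            )
        return draft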
a "this is how this is presented, this is how the model performed" would have been very prudent in my opinion. Maybe these guys already covered the kind of things i'm talking about(i can accept that!)....but i can't know that if they don't tell us.
how is "Engineering" "about chanting magic words"? The wires may be getting crossed on if it's an engineering discipline or just the use of the word engineer, meaning to build mindfully, I feel the latter is appropriate and the former more misguided but still it's not "Prompt Wizarding" ;)
Precisely! You’re trying to learn the art of poking the box in the right way. As an engineer, I really despise the bastardisation of the term to refer to jobs or tasks that are inherently non-scientific.
I think adjusting the prompt makes sense. We have to keep in mind that this implicitly happens on the human side as well. The person creating the initial question will consider how understandable it is for them and edit until they're happy. Then the questions will be reviewed by co-authors and possibly tested on a few people before running the full experiment. They'd probably review the explanation again or change the task if they realised half the people can't follow the instructions.
We just call it "editing for clarity" rather than "prompt engineering" when we write for other people.
I would claim that most people are very bad at communicating in a way that can be interpreted, by human or AI, in a single shot. Most of the time some back and forth is required to clear things up. I think it's unfair to expect gold from garbage input, since meat brains can't do that either without some back and forth.
Those prompts should be available. It's ridiculous that they're not.
So what if we have an AI prompter-assister... we take an input prompt and run it through an AI to transform it into proper AI-talk; then that is what's fed to the actual AI... That way the AI will always have the best prompt possible, because the prompt-helper helps your prompt...
While I think that's a valid point, for the purposes of the paper, having a competent prompt writer would be needed to make a fair comparison.
I think the researchers would know how to get the best from their own AI bot, so that's a level of competency that should be extended to comparisons; otherwise user competency becomes a source of bias. I do feel you're correct in your concerns, though: the systems shouldn't need experts to use them, nor should they need the user to already know the right answer. Which leads me to my next point:
When it comes to real world expectations, perhaps instead we need a large group of random people (with no prior experience) working with each bot to complete a set of tasks in order to determine how it truly performs - something that could be enhanced if the answers weren't easy to check.
Disagree - a model may have capabilities that it only deploys in certain circumstances. In particular, if you train on the whole internet, you probably learn what both stupid and smart outputs look like. But you probably also learn that the modal output is stupid. So it’s no ding on the model’s capabilities if it defaults to assuming it should behave stupidly.
What you're basically saying is "the model as trained can't do well at this task, so let's use our own cognitive skills to help it."
That's a problem for comparative benchmarking, right? You're no longer testing the model; you're testing the model in tandem with the prompt engineer. This raises several big questions:
1) How do you know that the engineer's knowledge of the "correct" answer isn't being subtly encoded in the prompt? Essentially the Clever Hans phenomenon.
2) If you want this to be a fair game, how do you give precisely the same kind of advantage to whatever you're benchmarking the LLM against?
3) Last but not least, if you're not going to throw in your prompt engineer for free with the product, will your results be reproducible by your customers?
To be clear, I don't think there's some cosmically objective way of doing this. If you're using prompts written for humans, you're already putting the computer at a disadvantage. But at the very least, you're measuring something meaningful: how the model will behave in the real world.
The supplement discusses how they presented the task. Notably, they first gave all the training examples and then told the model what the task was, without asking it to reason or build up any context before spitting out the answers. So basically the simplest way to interact with it, but probably not a great way to get solutions to this problem if that was the task at hand.
Agreed. And there are going to be some significant domain decisions for GPT-4 to consider: Will the made-up words be existing single tokens? Will those single tokens be heavily overloaded tokens, e.g. a letter of the alphabet? Will the representation of the circles be, say, "B" for blue, or something else?
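These questions are easy to poke at empirically. A quick check with the tiktoken package (the made-up words here are just illustrative, not necessarily the paper's actual vocabulary):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    for word in ["dax", "wif", "kiki", "B", "blue circle"]:
        ids = enc.encode(word)
        print(f"{word!r:>14} -> {len(ids)} token(s): {ids}")
    # Single letters like "B" map to one heavily reused token, while nonsense
    # words often split into several sub-word pieces -- which changes what
    # GPT-4 actually "sees" compared to the human-readable version of the task.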
Along with this are questions as to whether you're going to treat GPT-4 as a zero-shot or multi-shot oracle; while they have this idea of 'context-free' challenges in the paper, they, crucially, train their network first.
Anyway, I like this paper less the more I think about it.
The article suggested that GPT-4 FAILED between 42 and 86 percent of the time. So its success rate would be 14 to 58 percent, which, compared to the novel NN's 80 percent, seems significant.
1. The level of variance displayed here is very atypical for a language model unless you're making significant changes to how the information is presented. This alone is cause for more elaboration.
2. With this big a variance already showing, it's hard to say for sure that they hit the top end.
Maybe it seems strange, but most LLM evaluation papers still don't bother to apply even some extremely well-known, basic "output boosters" like chain of thought and so on. But at least in those instances, we know how they presented it.
a "this is how we presented the task, this is how it performed" would have been very nice.
But isn't that their point? That their new neural network needs less prompt crafting because it is partly better at reading the intent and meaning of words and partly less likely to hallucinate?
Sure, I also think GPT-4 can be brute-forced into generating correct answers with a very precise prompt, but that's not how you would describe the problem to a human, and I assume the humans performing well in this test weren't prompted that way either, but were given a fairly informal description of the exercise.
>I couldn't find a lot of information as to size of the network used
In the original paper,[0] in the section called 'Implementation of MLC', there's a description of sorts (Greek to me):
"Both the encoder and decoder have 3 layers, 8 attention heads per layer, input and hidden embeddings of size 128, and a feedforward hidden size of 512. Following GPT63, GELU64 activation functions are used instead of ReLU. In total, the architecture has about 1.4 million parameters."
If I am evaluating how intelligent an AI is against a human then IMO it's only fair they both get the same prompt.
It should be counted as a fault of the AI if it can't understand the prompt as well as the human, or another AI. That's all part of being a useful entity: how well you understand a prompt and how well you can infer the original meaning from the prompt you have been given.
Honestly I'm surprised by how low Nature has fallen. Is it even a useful signal of paper quality anymore? NeurIPS and ICLR aren't great, but in general I've found their work to be more rigorous than Nature's, despite the fact that they are shorter conference papers compared to Nature's journal papers.
Right now, sadly, the only useful signal in deep learning research is the research group. If OpenAI releases a paper, I know it's something good that works at scale; similarly, if Kaiming He, Piotr Dollar and team at Meta AI release a paper, it tends to be really good and SOTA. Google DeepMind maintains high quality; Google Brain has been more of a mixed bag. If Berkeley releases a paper, it has a 50% chance of going directly to trash; Stanford has a much lower percentage (I also disambiguate based on specific groups). Of course I'm going to be heavily biased and this system is not great, but conferences and journals have managed to become such a useless signal that I find this method to be more accurate.