
LLMs certainly do write perfectly grammatical and idiomatic English (I haven't tried enough other languages to know whether this holds for, say, Japanese too). But regular people all have their own idiosyncratic styles: words and turns of phrase they like using more than others, preferred sentence structures and lengths, different levels of politeness, deference and assertiveness, etc.

LLM output usually sounds very sanitised to me, style-wise (not just content-wise): some sort of lowest-common-denominator language, which is probably why it sounds so corporate-y. I guess you can influence the style with clever prompt engineering, but I doubt you'd get a very distinctive style that way.
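To be concrete about the kind of prompt engineering I mean, a minimal sketch, assuming the official OpenAI Python client; the model name and the style directives are placeholders, not a recipe I'd vouch for:

    # Sketch: steering style with an explicit system prompt.
    # Assumes the official OpenAI Python client (openai >= 1.x);
    # "gpt-4o" and the directives are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    style = ("Write in short, blunt sentences. Prefer concrete words over "
             "abstractions. Use dry understatement. No bullet points.")

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": style},
            {"role": "user", "content": "Summarise this thread's argument."},
        ],
    )
    print(resp.choices[0].message.content)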


I have successfully gotten ChatGPT to copy a Norwegian artificial sociolect spoken by at most a few hundred people, one it wouldn't even admit to knowing, simply by describing its features. (The circle using it includes a few published authors and journalists, so there's likely some content in its training data, but not much.) So I think you might be surprised if you try. Maintaining it through a longer conversation might prove a nuisance, though.


In case it ever needed to be said: yes, they generate idiomatic language, but in Japanese it sounds like translated corporatese. Considering that there are no viable LLMs trained purely on $LANG for any LANG other than `en_US`, I suspect there's something (corporate-)English-specific in the LLM architecture that only a few people in the world understand.


You can definitely attempt to impose "in the style of X", or, if you have original samples, you can try providing them as stylistic sample data.
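A rough sketch of the samples approach, again assuming the OpenAI Python client; the writing samples here are invented stand-ins for whatever originals you have:

    # Sketch: few-shot style imitation from original writing samples.
    # The samples are invented; substitute your own text.
    from openai import OpenAI

    client = OpenAI()

    samples = [
        "Got the build green again. Turns out the cache was lying to us, as usual.",
        "I'd skip the framework. Three files and a Makefile will outlive it.",
    ]

    system = ("Imitate the voice of the writing samples below, not their topic.\n\n"
              + "\n---\n".join(samples))

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "Draft a two-sentence status update on the migration."},
        ],
    )
    print(resp.choices[0].message.content)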

But realistically, how many people are going to actually do that? Communication fed through an LLM represents a rather bleak linguistic convergence.


> some sort of lowest-common-denominator language

LLMs output the statistically most likely sequence of tokens (there's no intelligence there, "artificial" or otherwise), so yeah, that averageness is by design.
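Mechanically it looks something like this toy sketch: the model emits a probability distribution over next tokens, and greedy or low-temperature decoding keeps picking the safest ones. The vocabulary and numbers below are invented:

    # Toy decoding step: greedy picks the single most likely word every time;
    # higher temperature restores some of the distribution's variety.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["good", "nice", "solid", "rad", "splendid"]
    logits = np.array([3.0, 2.5, 2.0, 0.5, 0.2])

    def pick(temperature):
        if temperature == 0:                       # greedy decoding: argmax
            return vocab[int(np.argmax(logits))]
        p = np.exp(logits / temperature)
        p /= p.sum()                               # softmax with temperature
        return vocab[rng.choice(len(vocab), p=p)]

    print([pick(0.0) for _ in range(5)])   # ['good', 'good', 'good', ...]
    print([pick(1.5) for _ in range(5)])   # occasionally 'rad' or 'splendid'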


It can emulate a bad English speaker if prompted to do that.


Yes, there's enough explicitly tagged bad English in the training dataset to make a valid average approximation.


No. Not explicitly tagged. They are initially trained on vast amounts of data which are not tagged.

You fundamentally misunderstand how this works.

The LLMs learn the various grammars and "accents" implicitly. They automatically differentiate these grammars.

Sounds like you still have this idea that LLMs are a giant Markov chain. They are not Markov chains.

They are deep neural networks with hundreds of layers and they automatically model relations at extremely deep levels of abstraction.


The context is the explicit tagging in this case. You don't need to understand language to detect English-as-a-second-language speakers. (Indeed, Markov chains will happily solve this problem for you.)
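To illustrate that parenthetical, a sketch of how little machinery the detection takes: one character-bigram Markov model per class, and classification by log-likelihood. The two training "corpora" are invented one-line stand-ins for real data:

    # Sketch: character-bigram Markov models classifying text by likelihood.
    import math
    from collections import Counter

    def train(text):
        # transition counts for P(next char | current char), plus context counts
        return Counter(zip(text, text[1:])), Counter(text[:-1])

    def log_likelihood(text, model, alpha=1.0, vocab=128):
        pairs, ctx = model
        # Laplace-smoothed log P(text) under the bigram chain
        return sum(math.log((pairs[(a, b)] + alpha) / (ctx[a] + alpha * vocab))
                   for a, b in zip(text, text[1:]))

    native = train("i have been waiting for the results all week")
    learner = train("i am waiting the results since all the week")

    sample = "i am agree with the most of peoples here"
    label = ("native"
             if log_likelihood(sample, native) > log_likelihood(sample, learner)
             else "learner-like")
    print(label)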

> they automatically model relations

No, they do not model anything at all. If you follow the tech-bubble turtles all the way down, you find a maximum-likelihood logistic approximation.
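Concretely, the entire training objective is the standard next-token formulation (in LaTeX, with h_t the network's final hidden state); that last softmax layer is exactly multinomial logistic regression over the vocabulary:

    \hat{\theta} = \arg\max_{\theta} \sum_{t} \log p_{\theta}(x_t \mid x_{<t}),
    \qquad
    p_{\theta}(x_t \mid x_{<t}) = \operatorname{softmax}(W h_t)_{x_t}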

I know, I know: then you'll do a sleight of hand and claim that all intelligence and modeling is also just maximum likelihood, even though it's patently and obviously untrue.


It's literally a model.

Large Language Model (LLM).

Hundreds of layers with a trillion weights and you think "nothing is modelled" there. The comments on this site are ridiculous.

Studies have traced individual "neurons" in LLMs that represent specific concepts. It's not even debatable at this point.



