
LLMs certainly do write perfectly grammatical and idiomatic English (I haven't tried enough other languages to know whether this holds for, say, Japanese too). But regular people all have their own idiosyncratic styles: words and turns of phrase they like using more than others, preferred sentence structures and lengths, different levels of politeness, deference and assertiveness, etc.

LLM output usually sounds very sanitised to me, style-wise (not just content-wise): some sort of lowest-common-denominator language, which is probably why it sounds so corporate-y. I guess you can influence the style with clever prompt engineering, but I doubt you'd get a very distinctive style that way.
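To be concrete about the kind of prompt engineering I mean, a minimal sketch, assuming the official OpenAI Python client; the model name and the style directives are placeholders, not a recipe I'd vouch for:

    # Sketch: steering style with an explicit system prompt.
    # Assumes the official OpenAI Python client (openai >= 1.x);
    # "gpt-4o" and the directives are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    style = ("Write in short, blunt sentences. Prefer concrete words over "
             "abstractions. Use dry understatement. No bullet points.")

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": style},
            {"role": "user", "content": "Summarise this thread's argument."},
        ],
    )
    print(resp.choices[0].message.content)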


I have successfully gotten ChatGPT to copy a Norwegian artificial sociolect spoken by at most a few hundred people, one it wouldn't even admit to knowing, simply by describing its features. (The circle using it includes a few published authors and journalists, so there's likely some content in its training data, but not much.) So I think you might be surprised if you try. Maintaining it through a longer conversation might prove a nuisance, though.


In case it ever needed to be said: yes, they generate idiomatic language, but in Japanese it sounds like translated corporatese. Considering that there are no viable LLMs trained purely on $LANG for any LANG other than `en_US`, I suspect there's something (corporate-)English-specific in the LLM architecture that only a few people in the world understand.


You can definitely attempt to impose "in the style of X", or, if you have original samples, you can try providing them as stylistic sample data.
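A rough sketch of the samples approach, again assuming the OpenAI Python client; the writing samples here are invented stand-ins for whatever originals you have:

    # Sketch: few-shot style imitation from original writing samples.
    # The samples are invented; substitute your own text.
    from openai import OpenAI

    client = OpenAI()

    samples = [
        "Got the build green again. Turns out the cache was lying to us, as usual.",
        "I'd skip the framework. Three files and a Makefile will outlive it.",
    ]

    system = ("Imitate the voice of the writing samples below, not their topic.\n\n"
              + "\n---\n".join(samples))

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "Draft a two-sentence status update on the migration."},
        ],
    )
    print(resp.choices[0].message.content)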

But realistically, how many people are going to actually do that? Communication fed through an LLM represents a rather bleak linguistic convergence.


> some sort of lowest-common-denominator language

LLMs output the statistically most likely sequence of tokens (there's no intelligence there, "artificial" or otherwise), so yeah, that averageness is by design.
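Mechanically it looks something like this toy sketch: the model emits a probability distribution over next tokens, and greedy or low-temperature decoding keeps picking the safest ones. The vocabulary and numbers below are invented:

    # Toy decoding step: greedy picks the single most likely word every time;
    # higher temperature restores some of the distribution's variety.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["good", "nice", "solid", "rad", "splendid"]
    logits = np.array([3.0, 2.5, 2.0, 0.5, 0.2])

    def pick(temperature):
        if temperature == 0:                       # greedy decoding: argmax
            return vocab[int(np.argmax(logits))]
        p = np.exp(logits / temperature)
        p /= p.sum()                               # softmax with temperature
        return vocab[rng.choice(len(vocab), p=p)]

    print([pick(0.0) for _ in range(5)])   # ['good', 'good', 'good', ...]
    print([pick(1.5) for _ in range(5)])   # occasionally 'rad' or 'splendid'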


It can emulate a bad English speaker if prompted to do that.


Yes, there's enough explicitly tagged bad English in the training dataset to make a valid average approximation.


No. Not explicitly tagged. They are initially trained on vast amounts of data which are not tagged.

You fundamentally misunderstand how this works.

The LLMs learn the various grammars and "accents" implicitly. They automatically differentiate these grammars.

Sounds like you still have this idea that LLMs are a giant Markov chain. They are not Markov chains.

They are deep neural networks with hundreds of layers and they automatically model relations at extremely deep levels of abstraction.


The context is the explicit tagging in this case. You don't need to understand language to detect English-as-a-second-language speakers. (Indeed, Markov chains will happily solve this problem for you.)
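To illustrate that parenthetical, a sketch of how little machinery the detection takes: one character-bigram Markov model per class, and classification by log-likelihood. The two training "corpora" are invented one-line stand-ins for real data:

    # Sketch: character-bigram Markov models classifying text by likelihood.
    import math
    from collections import Counter

    def train(text):
        # transition counts for P(next char | current char), plus context counts
        return Counter(zip(text, text[1:])), Counter(text[:-1])

    def log_likelihood(text, model, alpha=1.0, vocab=128):
        pairs, ctx = model
        # Laplace-smoothed log P(text) under the bigram chain
        return sum(math.log((pairs[(a, b)] + alpha) / (ctx[a] + alpha * vocab))
                   for a, b in zip(text, text[1:]))

    native = train("i have been waiting for the results all week")
    learner = train("i am waiting the results since all the week")

    sample = "i am agree with the most of peoples here"
    label = ("native"
             if log_likelihood(sample, native) > log_likelihood(sample, learner)
             else "learner-like")
    print(label)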

> they automatically model relations

No, they do not model anything at all. If you follow the tech-bubble turtles all the way down, you find a maximum-likelihood logistic approximation.
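Concretely, the entire training objective is the standard next-token formulation (in LaTeX, with h_t the network's final hidden state); that last softmax layer is exactly multinomial logistic regression over the vocabulary:

    \hat{\theta} = \arg\max_{\theta} \sum_{t} \log p_{\theta}(x_t \mid x_{<t}),
    \qquad
    p_{\theta}(x_t \mid x_{<t}) = \operatorname{softmax}(W h_t)_{x_t}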

I know, I know: then you'll do a sleight of hand and claim that all intelligence and modeling is also just maximum likelihood, even though it's patently and obviously untrue.


It's literally a model.

Large Language Model (LLM).

Hundreds of layers with a trillion weights and you think "nothing is modelled" there. The comments on this site are ridiculous.

Studies have traced individual "neurons" in LLMs that represent specific concepts. It's not even debatable at this point.



