
Oddly enough, though, LLM-generated text is far less likely to sound like a non-native speaker's writing. Once you understand the differences in grammar rules, or just from experience, certain types of non-native English always have a feel to them that reflects the mismatch between the two languages - e.g. rough Chinese-to-English translations tend to retain Chinese grammatical structure and also mix up the formality levels of words.

LLM text just plain doesn't do this: the models are very good at writing perfectly formed English; it just winds up saying nothing (and models like ChatGPT have been optimized so that they speak in a particular voice as well).




> certain types of non-native english always have a feel to them which reflects the mismatch between two languages

This. My partner always speaks Frenglish (French English) after talking to her parents. You have to know a little French to understand her sentences. They're all English words, but the phraseology is all French.

I do the same with Slovenian. The words are all English, but the shape is Slovenian. It adds a lot of soul to your words.

It can also be topic dependent. When I describe memories from home in English, the language sounds more Slovenian. Likewise when I talk about American stuff to my parents, my Slovenian sounds more English.

ChatGPT would lose all that color.

Read The Man in the High Castle to see this for yourself. Whole book is English, but you can tell the different nationalities of the characters because the shape of their English changes. Philip K. Dick used this masterfully.


> Whole book is English

Amusingly, I think this phrase illustrates your point. To the best of my knowledge, a native speaker (which I'm not) would always say "The whole book is (in?) English". Leaving off articles seems to be very common for Slavic people (since I believe you don't really have them in your languages).


> leaving off articles seems to be very common for Slavic people

Whenever I come across text with a lot of missing articles, the voice inside my head automatically switches to a Russian accent; and in the instances where I've bothered to find out who the author was, it was always someone from Russia or another ex-USSR country. So it seems this characteristic has been ingrained in me at a subconscious level.


Poles, Czechs, etc. also do this, and IMHO their accent sounds quite different from the Russian one.


I think this is more about formality and modern usage. I'm nearly 50 and am British. I sometimes write in this abbreviated form, omitting things like articles when they are unnecessary. Especially in text messages, social media posts, etc.


I used to work in academia with a Chilean guy who added extra articles where they weren’t needed and a Slovakian guy who didn’t put any in at all. I had fun editing the papers we wrote!


Spanish has definite and indefinite articles like English, so at least the concept is not unknown. However, even then, the correct usage is sometimes really arbitrary and varies across languages, e.g. why is it typically "mankind" and not "the mankind" (by contrast, in German it's "die Menschheit", with an article)?


It also helps refute the point, because you could certainly ask an LLM to speak as though it's a character from the book.

And if what it produces now is unimpressive, that makes it a good way to monitor the rapid progress of LLMs.


Just to corroborate as a native English speaker, yes, in my experience the "the" would only be left off in quite informal registers or in haste.


There is sure to be plenty of training data from people with French as a first language and English as a second, and it can be pulled up with some prompting.
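Something like this hypothetical prompt (untested, details made up) would probably surface it:

    Reply in English the way a native French speaker with intermediate
    English would write it: keep French word order where it differs,
    calque idioms literally ("I have 30 years", "I am agree"), and use
    faux amis like "actually" when you mean "currently".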


LLMs certainly do write perfectly grammatical and idiomatic English (I haven't tried enough other languages to know whether this is true for, say, Japanese too). But regular people all have their own idiosyncratic styles - words and turns of phrase they like using more than others, preferred sentence structures and lengths, different levels of politeness, deference and assertiveness, etc.

LLM output usually sounds very sanitised to me style-wise (not just content-wise) - some sort of lowest-common-denominator language, which is probably why it sounds so corporate-y. I guess you can influence the style with clever prompt engineering, but I doubt you'd get a very distinctive style that way.


I have successfully gotten ChatGPT to copy a Norwegian artificial sociolect spoken by at most a few hundred people - one it wouldn't even admit to knowing - just by describing its features. (The circle using it includes a few published authors and journalists, so there's likely some content in its training data, but not much.) So I think you might be surprised if you try. Maintaining it through a longer conversation might prove a nuisance, though.


In case it ever needed to be said: yes, they do generate idiomatic language, but in Japanese it sounds like translated corporatese. Considering that there is no viable LLM trained purely on any $LANG other than LANG=`en_US`, I suspect there's something (corporate-)English-specific in the LLM architecture that only a few people in the world understand.


You can definitely attempt to impose "in the style of X" - or if you have original samples you can try to provide them as stylistic sample data.
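For instance, a minimal sketch with the OpenAI Python client (model name and samples are placeholders, and no claim this yields a faithful imitation):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A few short writing samples from the person whose voice you want.
    samples = [
        "honestly the trick nobody tells you about sourdough is patience",
        "ok so. three reasons this bug took a week, none of them good:",
    ]

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Imitate the writing style of these samples:\n\n"
                        + "\n---\n".join(samples)},
            {"role": "user",
             "content": "Write a two-sentence status update."},
        ],
    )
    print(resp.choices[0].message.content)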

But realistically, how many people are going to actually do that? Communication fed through an LLM represents a rather bleak linguistic convergence.


> some sort of lowest-common-denominator language

LLMs output the statistically most "average" sequence of tokens (there's no intelligence there, "artificial" or otherwise), so yeah, that's by design.
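Concretely, a toy sketch of the decoding step (made-up next-token scores; real models are vastly bigger but the shape is the same):

    import math, random

    # Made-up next-token scores after some prefix like "It was a".
    logits = {"nice": 2.0, "good": 1.3, "liminal": -0.5, "rhizomatic": -2.0}

    def softmax(scores, temperature=1.0):
        exps = {t: math.exp(s / temperature) for t, s in scores.items()}
        z = sum(exps.values())
        return {t: e / z for t, e in exps.items()}

    # Greedy decoding: always emit the single most likely token.
    # Do this at every step and you get the blandest path through the model.
    probs = softmax(logits)
    print(max(probs, key=probs.get))  # always "nice"

    # Sampling with a temperature flattens the distribution and lets
    # less likely (more distinctive) words through occasionally.
    hot = softmax(logits, temperature=1.5)
    print(random.choices(list(hot), weights=list(hot.values()))[0])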


It can emulate a bad English speaker if prompted to do that.


Yes, there's enough explicitly tagged bad English in the training dataset to make a valid average approximation.


No. Not explicitly tagged. They are initially trained on vast amounts of data which are not tagged.

You fundamentally misunderstand how this works.

The LLMs learn the various grammars and "accents" implicitly. They automatically differentiate these grammars.

Sounds like you still have this idea that LLMs are a giant Markov chain. They are not Markov chains.

They are deep neural networks with hundreds of layers and they automatically model relations at extremely deep levels of abstraction.


The context is the explicit tagging in this case. You don't need to understand language to detect English-as-a-second language speakers. (Indeed Markov chains will happily solve this problem for you.)
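A toy sketch of what I mean - one character-bigram model per class, where the one-line "corpora" are stand-ins for real labelled text:

    import math
    from collections import defaultdict

    def bigram_model(texts):
        # Count character bigrams; return crudely smoothed log-probabilities.
        counts = defaultdict(lambda: defaultdict(int))
        for t in texts:
            for a, b in zip(t, t[1:]):
                counts[a][b] += 1
        return {a: {b: math.log((n + 1) / (sum(f.values()) + 30))
                    for b, n in f.items()}
                for a, f in counts.items()}

    def score(model, text, unseen=math.log(1e-6)):
        # Log-likelihood of the text under the bigram model.
        return sum(model.get(a, {}).get(b, unseen)
                   for a, b in zip(text, text[1:]))

    # Stand-in training data; in reality you'd use labelled corpora.
    native = bigram_model(["the whole book is in english, isn't it?"])
    esl    = bigram_model(["whole book is english, very interesting thing"])

    text = "whole book is english"
    print("ESL-ish" if score(esl, text) > score(native, text) else "native-ish")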

> they automatically model relations

No, they do not model anything at all. If you follow the tech bubble turtles all the way down you find a maximum likelihood logistic approximation.
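That is, pre-training just minimizes next-token cross-entropy:

    \mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t}),
    \qquad p_\theta(\cdot \mid x_{<t}) = \operatorname{softmax}(f_\theta(x_{<t}))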

I know, I know - then you'll do a sleight of hand and claim that all intelligence and modeling is also just maximum likelihood, even though it's patently and obviously untrue.


It's literally a model.

Large Language Model (LLM).

Hundreds of layers with a trillion weights and you think "nothing is modelled" there. The comments on this site are ridiculous.

Studies have traced individual "neurons" in LLMs that represent specific concepts. It's not even debatable at this point.


> Chinese-English rough translations tend to retain the Chinese grammar structure

Those would be _really_ rough translations. Yes, I've seen "It's an achieve my dream's place" written, but that was in a high school essay.


LLMs do whatever you ask them to. They have a default, but they can be directed to use a different response style.

And of course you could build a corpus of text written by Chinese English speakers for more authenticity.
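A minimal sketch of that, fine-tuning on such a corpus with Hugging Face transformers (model name, file name and hyperparameters are placeholders, not a tested recipe):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # corpus.txt: one passage per line, written by L1-Chinese English speakers.
    data = load_dataset("text", data_files="corpus.txt")["train"]
    data = data.map(lambda ex: tok(ex["text"], truncation=True), batched=True)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()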



