Oddly enough, though, LLM-generated text is far less likely to sound like a non-native speaker's writing. Once you understand the differences in grammar rules, or just from experience, certain types of non-native English always have a feel to them which reflects the mismatch between two languages - e.g. Chinese-English rough translations tend to retain the Chinese grammar structure and also mix up the formality levels of words.
LLM text just plain doesn't do this: the models are very good at writing perfectly formed English, but it just winds up saying nothing (and models like ChatGPT have been optimized to the point that they have a particular voice they speak in as well).
> certain types of non-native English always have a feel to them which reflects the mismatch between two languages
This. My partner always speaks Frenglish (French English) after talking to her parents. You have to know a little French to understand her sentences. They’re all English words, but the phraseology is all French.
I do the same with Slovenian. The words are all English, but the shape is Slovenian. It adds a lot of soul to your words.
It can also be topic dependent. When I describe memories from home in English, the language sounds more Slovenian. Likewise when I talk about American stuff to my parents, my Slovenian sounds more English.
ChatGPT would lose all that color.
Read Man In The High Castle to see this for yourself. Whole book is English but you can tell the different nationalities of each character because the shape of their English changes. Philip K Dick used this masterfully.
Amusingly, I think this phrase illustrates your point. To the best of my knowledge, a native speaker (which I'm not) would always say "The whole book is (in?) English". Leaving off articles seems to be very common for Slavic people (since I believe you don't really have them in your languages).
> leaving off articles seems to be very common for Slavic people
Whenever I come across text that has a lot of missing articles, the voice inside my head automatically changes to a Russian accent. In the instances where I've bothered to find out who the author was, it was always someone from Russia or some other ex-USSR country, so it seems this characteristic has been ingrained in me at a subconscious level.
I think this is more about formality and modern usage. I'm nearly 50 and am British. I sometimes write in this abbreviated form, omitting things like articles when they are unnecessary. Especially in text messages, social media posts, etc.
I used to work in academia with a Chilean guy who added extra articles where they weren’t needed and a Slovakian guy who didn’t put any in at all. I had fun editing the papers we wrote!
Spanish has definite and indefinite articles like English, so at least the concept is not unknown. However, even then, the correct usage is sometimes really arbitrary and varies across languages, e.g. why is it typically "mankind" and not "the mankind" (by contrast, in German it's "die Menschheit", with an article)?
There is sure to be lots of training data from people with French as a first language and English as a second language that can be pulled up with some prompting.
LLMs certainly do write perfectly grammatical and idiomatic English (I haven't tried enough other languages to know whether this holds for, say, Japanese too). But regular people all have their own idiosyncratic styles - words and turns of phrase they like using more than others, preferred sentence structures and lengths, different levels of politeness, deference and assertiveness, etc.
LLM output to me usually sounds very sanitised style-wise (not just content-wise), some sort of lowest-common-denominator language, which is probably why it sounds so corporate-y. I guess you can influence the style with clever prompt engineering, but I doubt you'd get a truly distinctive style that way.
I have successfully gotten ChatGPT to copy a Norwegian artificial sociolect, spoken by at most a few hundred people, that it wouldn't admit to even knowing, simply by describing its features (the circle using it includes a few published authors and journalists, so there's likely some content in its training data, but not much). So I think you might be surprised if you try. Maintaining it through a longer conversation might prove a nuisance, though.
In case it ever needed to be said: yes, they do generate idiomatic language, but in Japanese they sound like translated corporatese. Considering that there is no viable LLM trained purely on a single $LANG other than LANG=`en_US`, I suspect there's something (corporate-)English-specific in LLM architecture that only a few people in the world understand.
The context is the explicit tagging in this case. You don't need to understand language to detect English-as-a-second-language speakers. (Indeed, Markov chains will happily solve this problem for you.)
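To make that concrete, here's a minimal sketch of the Markov-chain approach (the corpora and test sentences below are made-up toys, not real data): train a word-bigram model per variety and label a sentence by whichever model assigns it the higher smoothed likelihood. Dropped articles surface directly as low-probability bigrams.

    import math
    from collections import Counter

    def train(sentences):
        # Count word unigrams and bigrams, with a start-of-sentence marker.
        uni, bi = Counter(), Counter()
        for s in sentences:
            words = ["<s>"] + s.split()
            uni.update(words)
            bi.update(zip(words, words[1:]))
        return uni, bi

    def avg_log_prob(model, sentence):
        # Add-one smoothed average log-likelihood per bigram.
        uni, bi = model
        vocab = len(uni)
        words = ["<s>"] + sentence.split()
        lp = sum(math.log((bi[(a, b)] + 1) / (uni[a] + vocab))
                 for a, b in zip(words, words[1:]))
        return lp / (len(words) - 1)

    # Tiny illustrative corpora; a real classifier would need far more text.
    native = train(["the whole book is in english", "i read the book"])
    l2 = train(["whole book is english", "i read book"])

    for s in ["the book is in english", "book is english"]:
        guess = "native" if avg_log_prob(native, s) > avg_log_prob(l2, s) else "L2"
        print(s, "->", guess)

With enough text per author, the same likelihood-ratio trick separates the varieties surprisingly well, no understanding required.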
> they automatically model relations
No, they do not model anything at all. If you follow the tech-bubble turtles all the way down, you find a maximum-likelihood logistic approximation.
I know, I know - then you'll do a sleight of hand and claim that all intelligence and modeling is also just maximum likelihood, even though it's patently and obviously untrue.
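To spell out what I mean by that: strip away the architecture and the training objective is plain maximum likelihood over next tokens, with a softmax (i.e. multinomial logistic) output layer. In my notation (h_t is the hidden state at position t, w_v the output embedding of token v; the labels are mine, not anyone's spec):

    % \theta = model parameters, x_{<t} = preceding tokens, V = vocabulary
    \max_\theta \sum_t \log p_\theta(x_t \mid x_{<t}),
    \qquad
    p_\theta(x_t = v \mid x_{<t})
      = \frac{\exp(h_t^\top w_v)}{\sum_{v' \in V} \exp(h_t^\top w_{v'})}

Everything else is machinery for producing h_t.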
> Chinese-English rough translations tend to retain the Chinese grammar structure
Those would be _really_ rough translations. Yes, I've seen "It’s an achieve my dream’s place" written, but that was in a high school essay.