Some of my favorites include Sinatra singing ABBA's Dancing Queen [1], six presidents rapping the NWA classic [2], and Milton Friedman rapping 50's P.I.M.P. [3]
I'm struggling to understand how they're great examples. They sound absolutely atrocious and nowhere near believable. They sound really, really bad, like something an amateur with zero editing experience could put together with cheap equipment. They're so bad they're difficult to listen to, especially the Sinatra one.
> I'm struggling to understand how they're great examples.
Taking 'Six U.S. Presidents read "Fuck Tha Police" by N.W.A.' I certainly agree there are long sections where the synthesis is obvious - stilted sounding, overlong pauses between words, changes in tone in the middle of a sentence, background noises / changing audio quality accompanying different words and speakers, and so on.
But there are sections where it manages to generate several words in a row without those problems. Like at 23 seconds, the section "so help your black ass? / Ya goddamn right / Well, won't you tell everybody what the fuck you gotta say?" sounds pretty realistic.
If you blame the bad sections on lack of training data for older presidents, take the good sections as proof of what's possible, and imagine the entire recording will be that good in 3-5 years, then it's an impressive demo.
My apologies, it wasn't meant as an insult. I was trying to make a point and added a smiley face because I thought it would certainly be misconstrued as an insult without it.
> Why would you prefer actual aggression to passive aggression?
Probably for the same reason I disliked the Sinatra content so much, too obviously fake.
The whole of the parent comment was that single remark I quoted, before they edited the comment.
They were intending to lob a passive aggressive insult on the lighter side. They didn't like my comment, so they fired off a shot about my comment being equivalent to a questionably human, possibly mediocre GPT3 text.
I don't mind that sort of mild insult in response, I just prefer it without the unnecessary facade of the smiley.
I thought they were saying the defense was like defenses of GPT-3: that it will be good someday, or that it's good if you are very selective about the output, not that your comment was poorly written.
One of those readings is just musing about the state of all this "almost there" tech, the other is just randomly insulting someone's writing.
That does seem like pretty dangerous ambiguity. Akiselev may have written a blue/gold dress of a comment, where one reading is mean-spirited. I doubt it was intentional, but it could have been clearer.
This is a fake, sure. Maybe it's not perfect, but the software is improving. So within a few years, I'm guessing that fakes will be indistinguishable from true recordings.
But this was based on the text of an actual speech written for President Nixon, to be delivered if the mission failed. So isn't it likely that he practiced it, just in case? And if he did, it might have been recorded, more or less accidentally, as Reagan's quip about nuking Russia was.
And so as others note, it becomes crucial to look at the historical context.
It is one thing to make apocalyptic predictions - they could be reasonable conclusions from the possibilities, unreasonable ones, or anywhere in between. They can be discussed on their own merits. Heck, even why someone personally concludes something can be discussed and considered, and is worth trying to understand even if you disagree with it or consider it wrong. But the "here is how you should feel" leading bullshit is a clear run-around of critical faculties that takes in far too many people.
Nothing makes me write a source off as a bad actor more than panic mongering that /tells you/ how to feel as opposed to leaving it to you to decide. It is a sure sign of a manipulator, not least because scared people are easier to manipulate. This may be unduly harsh, caustic, or sharp, but I have absolutely lost patience with this tactic.
With you 100%, but unfortunately it works too well. All of the top-grossing news outlets abuse the hell out of this technique. I don't believe it'll get any better, and now more than ever people will need to learn to think about where their news is coming from.
I don't think that the prediction "deepfakes pose a risk to the information ecosystem" can be described as apocalyptic. I feel like I'm missing something, as this article seems like a fairly un-opinionated puff piece.
I was kind of expecting that this was going to demo some step forward, but this has all the same tells as the last several high-profile deepfake demos.
Knowing the speech was a fake and paying attention to it "skeptically" I noticed the double head-bob as an odd thing which seemed suspiciously convenient as a transition between cut-n-pastes.
Now I'm going to have to go and watch archival Nixon footage to see if the same thing is there or not :)
I wonder if you could use a GAN to do a post-production fix on them. Teach it what actual people's head movements look like, and then get it to stabilize the image as a second pass.
"It is with a heavy heart that we must inform you that Apollo 11 has failed to return from the moon. We have lost contact with our crew, Neil Armstrong and Edwin Aldrin. When they left earth seven hours ago their fate was unknown but now it is certain. They died as men should, ready to die for what they believed. This loss is a tremendous loss for our nation, for their families, and for mankind but their sacrifice was not made in vain: It is a testimony to courage, determination, and human achievement that will be remembered by all who witnessed it today and for all those who come for all time so long as man walks this world or any other. For this tragedy will not mark the end of space exploration; it will strengthen our resolve to continue advancing space technology and human knowledge. We have enjoyed the fruits of their labor and shared the pride of their achievement just as we now share their isolation and grief. We will honor their memory by continuing that work and by taking that faith and that dream into a future they will not live to see but helped to build. We will remember them and we will remember what they stood for: The greatness of man and the hope for a better tomorrow."
This isn't a first shot output, I had it retry a bit and guided it a little, mostly to get it to write a longer speech instead of a short quote. I think it's much better than I could have written on my own in a minute or two.
It's an impressive draft, better than what I expected, but it needs a lot of polishing. For example:
> We have enjoyed the fruits of their labor
Or this:
> This loss is a tremendous loss for our nation, for their families, and for mankind
the official speech uses an increasing order, and puts "friends" as an intermediate step between "family" and "nation". You want to put "family" first so you don't sound insensitive or like you're hitting below the belt. You want to put "nation" because you want to squeeze a few votes from the tragedy and also avoid being blamed. You want to put "mankind" because you want to hide that this was a stunt in the middle of the Cold War. (The official speech says "people of the world".)
There is a small risk that GPT-3 is just retrieving a distorted version of the speech from the multiple sources that were available, like if you or I were forced to rewrite it from memory. Let's try a different scenario, like: "Yuri Gagarin got toasted during reentry, and for some reason we decided not to hide it."
All the text in quotes was GPT-3 generated-- with a little help from me (e.g. when it went in the wrong direction, such as ending the speech too early, I made it go back and try again, or clipped out the dumb part and had it continue).
I prompted it with a mission description and said that a speech had been prepared for President Nixon in case the astronauts were stranded.
GPT-3 doesn't yet produce consistently GREAT output without some guidance, partially as an artefact of the generation procedure. But with a little help it does very well.
The issue is that if you just take the most likely symbol it'll rapidly go into a loop of just copying text or other degenerate behaviour. So instead, everything uses the model by sampling from it-- taking less likely choices by chance, weighted by the model output. Unfortunately, that means an unlucky draw will occasionally paint it into a corner. If you see that happening you can just go back and try again, and you get much better output.
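The greedy-vs-sampling distinction is easy to sketch. Below is a toy illustration (not the actual GPT-3 API; the `logits` array is a made-up stand-in for real model output over a tiny vocabulary):

```python
import numpy as np

def greedy_next(logits):
    # Always take the single most likely token: this is what leads
    # to loops and verbatim copying in practice.
    return int(np.argmax(logits))

def sample_next(logits, temperature=0.8, rng=None):
    # Draw a token at random, weighted by the model's distribution.
    # Lower temperature pushes toward greedy; higher is more adventurous,
    # and an unlucky draw can paint the generation into a corner.
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy logits over a 4-token vocabulary.
logits = [2.0, 1.5, 0.1, -1.0]
print(greedy_next(logits))  # always token 0
print(sample_next(logits))  # weighted random pick, often 0 or 1
```

"Go back and try again" then just means re-running the sampler from an earlier point until a luckier draw comes out.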
Whether that is a fair comparison depends on what your application is... if you need it to run unsupervised, it isn't consistently great. If you just need a first draft out of it, or some raw ideas to turn an hour-long writing task into a five-minute one, it's great for that.
I don't think this kind of manual assistance is much of a cheat either-- a real speech writer also gets exactly this sort of help from others.
[And FWIW, I did this via the GPT3 based mode in the ai dungeon video game. ... I don't have access to the GPT3 API.]
I remember reading that one minute of an important political speech requires about an hour of preparation by a professional speechwriter. And I'd guess that a failure speech is even more difficult to write.
The current level of AI like GPT-3 can generate fluffy text, very good fluffy text, but still can't generate text for this kind of speech.
> but still can't generate text for this kind of speech.
I disagree. Maybe the output I gave above wouldn't quite pass for the output of one of the greatest speech writers of the time, but most speeches are rubbish and I think what you can get out of GPT3 well operated is not at all rubbish.
I was expecting the Walter Cronkite part to be the fake thing here in a clever twist.
They'd pass the Nixon questions and then say "Alright, but what about Walter Cronkite?" What I'd really advocate is using another contemporary big-three anchor from the era as the deepfake, then showing the real Cronkite footage during the quiz at the end.
That's the real lesson IMHO: not whether you can tell when you're prompted to listen for it, but whether you can when you aren't.
They don't need to be perfect. They just need to be convincing.
Consider MP3 files. Any audiophile will tell you that a compressed MP3 file is a piss-poor representation of the true sound. The best experience is to listen to a band live, then vinyl, then FLAC.
And yet the majority of people listen to MP3 files. They strike the right balance of file size and sound clarity for an overwhelming number of systems, doubly so since online streaming took off. So now that people have become accustomed to the sound of MP3, they are not used to anything else. They are convinced that an MP3 file is the "true sound" of the song.
Right now deepfakes are in their infancy. Personally I think the technology is improving at an alarming rate. If I shared the moon video on Facebook, I am willing to bet most people would think it is real. They don't notice the clipped speech or the subtle double head nods. Their minds are not as critical of video as yours and mine are. So they are convinced it is real.
What happens when only machines can detect if something is authentic or a fake? What happens when not all site administrators scan for videos and fail to mark them as illegitimate? What happens when courts use deepfakes unknowingly as evidence to convict someone, or impeach a president, or fire a worker for sexual harassment? We have already seen what happens when "verified" twitter accounts are compromised - what if a CEO puts out a video announcing some controversial new endeavor, or admission to fraud?
There are very real concerns about this technology and I believe it will very soon become a weapon that takes misinformation and turns it into very real consequences.
Misinformation is already a huge issue; deepfakes can only exacerbate and compound the issue in ways we can't even imagine at this time.
Imagine deepfakes used to create false alibis, 911 calls, etc... That's probably not even the tip of the iceberg in the coming years.
The fact that we went in KNOWING it's a deepfake gave us an immeasurable advantage/bias. For uninformed individuals who "stumble" upon this on the web, one can only imagine how it'll play out. This thought actually terrifies me... imagine a hypothetical (and conservative) 10% overall quality improvement AND production cost reduction to this video every 6-12 months.
@SamuelAdams: Wish this site had a messaging system. Would be interesting to have a 1 on 1 discussion on this topic with you.
You mention phone calls. Right now robocalls are very common. But what if those random calls legitimately sounded like your parents, or your spouse, or your kids? Furthermore, what if they sounded like they were in real trouble and really needed money fast? This technology opens up whole new arenas for far more convincing scams.
I'm sure deepfakes will be used in fake kidnappings, social engineering, scams, etc. I can't begin to imagine the scale of damage this can cause. The reality is that a "well" executed 10 second deepfake can cause public panic, etc. The potential for abuse is just too high.
It's not a matter of if, it's a matter of WHEN this is economically and technologically available to the masses. The next 5-10 years are going to be a very confusing time to be alive...
How is the best experience to listen to a band live? I have had literally the exact opposite experience in life. Concerts are lots of fun, but when it comes to listening to the actual music at the concert, it's unquestionably not even close to the quality of a sound booth.
They definitely don't all look crap, although speech style transfer is still pretty rudimentary. The first point I'd make is that you can mask most of the defects of current deepfakes simply by degrading the recording quality. A mediocre voice deepfake sounds pretty damned credible when played through a cellphone with poor signal and a bit of background noise. A mediocre video deepfake appears wholly believable if you post-process it to look like low-res CCTV. For adversarial applications, this would tend to increase rather than decrease credibility - we would expect a covert recording of someone doing something illegal or shameful to be of poor technical quality.
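The "degrade it to mask defects" point is easy to demonstrate. Here's a crude numpy sketch of a phone-quality channel: throw away bandwidth, then bury the remaining artifacts under broadband noise. (This is a toy assumption-laden simulation, not a real telephony codec; real pipelines would use proper resampling and a bandpass filter.)

```python
import numpy as np

def phone_degrade(audio, sr=44100, phone_sr=8000, noise_db=-30, seed=0):
    """Crudely simulate a low-quality phone channel over a mono signal
    in the range [-1, 1]."""
    k = sr // phone_sr                            # decimation factor (~5 here)
    lowres = audio[::k]                           # naive downsample, no filtering
    restored = np.repeat(lowres, k)[:len(audio)]  # blocky upsample back to length
    rng = np.random.default_rng(seed)
    noise = 10 ** (noise_db / 20) * rng.standard_normal(len(restored))
    return np.clip(restored + noise, -1.0, 1.0)

# A 1-second 440 Hz test tone standing in for (deepfaked) speech.
t = np.linspace(0, 1, 44100, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
degraded = phone_degrade(tone)
```

The synthesis glitches a listener would key on (unnatural transients, tonal shifts) get smeared into the same mush as everything else, which is exactly why a "covert low-quality recording" framing makes a mediocre fake more credible.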
Secondly, we do legitimately need to be concerned about what deepfakes will look like in five or ten years' time. We're making substantial algorithmic improvements in efficiency and quality, coupled with a multi-billion-dollar race to improve computational performance for deep learning tasks.
Deepfakes may never improve (unlikely), they may gradually improve with better algorithms and more compute power, or there may be a sudden breakthrough in either area that makes 100% convincing deepfakes commonplace. We could wait until that moment to start thinking about social and technological countermeasures, but I wouldn't recommend it. If there's anything to be learned from 2020, it's that we should be investing a lot more in preparing for low-probability/high-magnitude risks.
You're getting downvoted, but I agree. While the footage represents a technological marvel, it doesn't fill me with dread that there is some sort of deepfake genie waiting to jump out of the bottle.
I think the choice of speech is also quite telling. They chose a target with low video and audio resolution and a lot of interference that could mask some of the imperfections of their algorithm. Despite all this, oddness induced by the deepfake process is still quite evident.
Definitely felt a little off, but if I didn't know that this was a deepfake, I would probably accept it as real. As usual, I don't know if that fits more into the "remarkable" or "disturbing" category.
Eh, I'm not entirely sure I would have. The effect is near identical to when you have one person behind another, pretending to be the front person's arms and hands - the motions just don't match up right.
Maybe people who use faces more than full body language are more affected by this particular one?
I kind of agree, but it's hard to separate knowing that it's fake from my reaction to it. If this were presented without that context, I might feel like it's off, but not necessarily question it, especially if it came from a source I believed to be legitimate.
I also think, on top of all your points (which I agree with), keeping it shorter would make it more convincing. I think with the current usage of social media, short clips created through some synthetic means will be harder and harder to identify.
I remember watching a TV ad where Michael Jordan, around the age he was when he retired, played 1-on-1 against himself as a college player. This was 15-20 years ago, and I wrote an essay for high school English class on how that ad shows we can't believe our eyes when it comes to what we see on TV. It was visually much more convincing than the examples I see here, so I can't help but think deepfakes haven't quite caught up with professional editing, at least not yet.
For the majority of those on HN who didn't witness the live event this is a pretty good summary. It really took me back to that day as our entire family was glued to our black and white TV watching the man known to all as Uncle Walter. Watching President Nixon twitch a bit was the only tell that this wasn't real.
Got me imagining a video where Eisenhower confesses D-Day was a failure; the speech he actually wrote for that contingency does exist.
It works as a PoC, and it could fool me if I saw it at the edge of my vision, on my kitchen's TV. But otherwise it looks odd, and most importantly it sounds like the weirdest vocoder in existence.
We still can't truly fake instruments, never mind the human voice.
I really hate the fake static in this video - as a 7-year-old I watched the moon landing in 1969, and nobody's TV was that bad! If I had a TV that bad I would either repair it or throw it away!
Any chance you would be willing to share the (finetuned) models? For science? I’m interested in detection and want to look at artifacts common to speech audio generated by different models all based on the same overall implementation.
Deepfake stuff is not a threat, it's a vaccine. There is finally a chance people will realize they can believe nothing and nobody, that the news lies and misleads, and that they absolutely have to triple-check everything before they take it into consideration.
This destroys democracy. Because you can't check everything yourself, or even a tiny fraction of things, you have no meaningful basis on which to tick that box on the ballot.
Is it that much worse than the effect that photoshop had? We have been able to fake pictures/images for a long time now. What changed is that people no longer blindly trust pictures. I assume something similar will happen with video.
Trust in media with a reputation. Trust in journalism to unearth attempts of deceit. Quality journalism and reputation will become more important and powerful than ever
The "can't check everything yourself, or even a tiny amount of things" argument applies to everything. It's even more work to verify - for example - a small segment of a real hour-long Trump rally, carefully chopped off in the middle of a sentence to push a bullshit interpretation of what's being said, like I caught Snopes doing a while back.
Hell, even text suffers from this problem. Back when the UK (briefly) hit their target of 100,000 Covid-19 tests a day, the BBC News website pushed a bullshit claim that Germany was already averaging that many a month earlier. This played well into all the existing narratives about British exceptionalism, Brexit, inferiority to Europe, and the incompetence of our Covid response. It was also trivially verifiable as untrue - the German testing numbers were up on the RKI testing website, in English, and not only were they nowhere near that a month earlier, they were still well below it when the BBC published that claim. Someone at the BBC had mistaken the number of tests German labs had the capacity to process a day for the actual number of tests, which was particularly bad since literally every other part of Germany's testing process was more of a bottleneck than lab capacity. The BBC then kept this claim up and prominently linked to the article it was in on their front page for a month after they knew it was untrue. A large proportion of the UK population probably saw it. How many people do you think spotted the error? (They've since decided that because they've memory-holed it, they don't have to append a correction.)
Bullshit. Easily 80% of broadcast news today is fake, either overtly, or by omission. It only exists to advance a narrative, not to inform or support debate. As Denzel Washington quipped: "If you don't watch the news, you're uninformed. If you do watch the news, you're misinformed."
The ability to produce fake news artificially as well won't tip the scales even slightly, because nobody with a brain trusts the "news" anyway.
[1] https://www.youtube.com/watch?v=zo_w4KGifug
[2] https://www.youtube.com/watch?v=mAZVp-n-5TM
[3] https://www.youtube.com/watch?v=4mUYMvuNIas