I hope there is at least some cherry-picking here. This also seems like shots fired at some of the other generative video startups.
this is the opposite of history
One wonders how you might get a learned representation of physics into the model. Perhaps multimodal inputs with rendered objects, or physics simulations?
> That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
"We believe safety relies on real-world use and that's why we will not be allowing real-world use until we have figured out safety."
You can see some weird artifacts, but take one of these videos and compress it down to a much lower quality: with the loss of quality, you might not be able to tell the difference based on these examples. Any artifacts would likely be gone.
Given what I had seen on social media I had figured anything remotely real was a few years away, but I guess not...
I guess we have just stopped worrying about the impact of these tools?
What OpenAI does is amazing, but they obviously cannot be allowed to capture the value of every piece of media ever created — it'll both tank the economy and basically halt all new creation if everything you create will be immediately financially weaponized against you, if everything you create goes immediately into the Machine that can spit out a billion variations, flood the market, and give you nothing in return.
It's the same complaint people have had with Google Search pushed to its logical conclusion: anything you create will be anonymized and absorbed. You put in the effort and money, OpenAI gets the reward.
Again, I like OpenAI overall. But everyone's got to be brought to the table on this somehow. I wish our government were capable of giving realistic guidance and regulation on this.
There was the Seinfeld "Nothing, Forever" AI parody, but once the models improve enough and are cheap enough to deploy, studios will license their content for real and just have endless seasons.
Or even custom episodes. Imagine if every episode of a TV show was unique to the viewer.
This is neat and all but mostly just a toy. Everything I've seen has me convinced either we are optimizing the wrong loss functions or the architectures we have today are fundamentally limited. This should be understood for what it is and not for what people want it to be.
In this video, there's extremely consistent geometry as the camera moves, but the texture of the trees/shrubs on the top of the cliff on the left seems to remain very flat, reminiscent of low-poly geometry in games.
I wonder if this is an artifact of the way videos are generated. Is the model separating scene geometry from camera? Maybe some sort of video-NeRF or Gaussian Splatting under the hood?
All "proof" we have can be contested or fabricated.
Interested to know what the success rate of such amazingness is.
Pika have really impressive videos on their homepage that are borderline impossible for me to make myself.
Also, nicely timed to overshadow the Google Gemini 1.5 announcement.
“Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.”
To create a movie I need character visual consistency across scenes.
Getting that right is the hardest part of all the existing text->video tools out there.
The prompts - incredible and such quality - amazing. "Prompt: An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film."
Prompt: poll worker sneakily taking ballots labeled <INSERT POLITICAL PARTY HERE>, and throwing them in the trash.
Sam is probably going to get his $7T if he keeps this up, and when he does everybody else will be locked out forever.
I already know people who have basically opted out of life. They're addicted to porn, addicted to podcasts where it's just dudes chatting as if they're all hanging out together, and addicted to instagram influencers.
100% they would pay a lot of money to be able to hang out with Joe Rogan, or some OnlyFans person, and those porn stars or podcast hosts will never disagree with them, never get mad at them, never get bored of them, never think they're a loser, etc.
These videos are crazy. Highly suggest anybody who was playing with DALL-E a couple of years ago, and was mind-blown by "an astronaut riding a horse in space" or whatever, go back and look at the images they were creating then, and compare that to this.
As a layman watching the space, I didn't expect this level of quality for two or three more years. Pretty blown away, the puppies in the snow were really impressive.
The effect was stronger in some videos.
Based on my experience doing research on Stable Diffusion, scaling up the resolution is the conceptually easy part that only requires larger models and more high-resolution training data.
The hard part is semantic alignment with the prompt. Attempts to scale Stable Diffusion, like SDXL, have resulted only in marginally better prompt understanding (likely due to the continued reliance on CLIP prompt embeddings).
So, the key question here is how well Sora does prompt alignment.
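For anyone who wants to poke at this bottleneck: a minimal sketch (assuming Hugging Face transformers and the public SD 1.x text encoder; nothing Sora-specific) of how SD-style models squeeze a prompt through CLIP before the denoiser ever sees it:

    # Minimal sketch of the CLIP conditioning path in Stable Diffusion-style
    # models (illustrative only; not Sora's actual pipeline).
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    name = "openai/clip-vit-large-patch14"  # text encoder used by SD 1.x
    tokenizer = CLIPTokenizer.from_pretrained(name)
    text_encoder = CLIPTextModel.from_pretrained(name)

    prompt = "a red cube on top of a blue sphere, left of a green cone"
    # Everything is truncated/padded to 77 tokens; compositional detail
    # must survive this one short embedding sequence.
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = text_encoder(tokens.input_ids).last_hidden_state
    print(emb.shape)  # torch.Size([1, 77, 768])
    # The denoising U-Net cross-attends to these 77 vectors, so scaling the
    # U-Net alone can't recover what the text bottleneck already discarded.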
The prompt tho. Probably not text. Probably a stream of vibes or something.
Looks like a dramatic improvement in video generation but still a miss in terms of realism unless one can apply pose control to the generated videos.
Let's temper the emotions for a second. Sora is great, but it's not ready for prime time. Many people are working on this problem that haven't shared their results yet. The speed of refinement is what's more interesting to me.
The killer app for this is being able to give a prompt of a detailed description of a scene, with actor movements and all detail of environment, structure, furniture, etc. Add to that camera views/angles/movement specified in the prompt along with text for actors.
And it is still not perfect. Looking at the example of the plastic chair being dug up in the desert[1] is frankly a bit... funky. But imagine in 5 or even 10 years.
You give it data like real time stock data, feed it into Sora, the prompt is "I need a chart based on the data, show me different time ranges"
As you move the cursor, it feeds into sora again, generating the next frame in real time.
Didn't think Google would be the first of Facebook, Apple, Google, and Microsoft to get disrupted.
Actually, thinking of this from the perspective of a start-up, it could be cool to instantly demonstrate a use-case of a product (with just a little light editing of a phone screen in post). We spent a lot of money on our product demo videos and now this would basically be free.
My initial observation is that the camera moves are very similar to a camera in a 3D modeling program: an inhuman dolly flying through space on an impossibly smooth path / bezier curve. Makes me wonder if there is actually something like a 3D simulation at the root here, or maybe a 3D unsupervised training loop, and they are somehow mapping persistent AI textures onto it?
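For comparison, here's a rough sketch (plain NumPy, my own illustration, not anything from OpenAI) of the kind of mathematically perfect dolly path a 3D package produces, which is what these camera moves resemble:

    # Cubic Bezier camera dolly: the perfectly smooth, jitter-free motion
    # typical of 3D tools. Purely illustrative.
    import numpy as np

    def bezier_dolly(p0, p1, p2, p3, frames=120):
        """Camera positions along a cubic Bezier with 4 control points."""
        t = np.linspace(0.0, 1.0, frames)[:, None]
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

    path = bezier_dolly(np.array([0.0, 1.5, 0.0]), np.array([2.0, 1.6, 1.0]),
                        np.array([4.0, 1.6, 1.5]), np.array([6.0, 1.5, 0.5]))
    # No handheld noise: position and velocity vary smoothly along the curve.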
Motion-capture works fine because that's real motion, but every time people try to animate humans and animals, even in big-budget CGI movies, it's always ultimately obviously fake. There are so many subtle things that happen in terms of acceleration and deceleration of all of the different parts of an organism, that no animator ever gets it 100% right. No animation algorithm gets it to a point where it's believable, just where it's "less bad".
But these videos seem to be getting it entirely believable for both people and animals. Which is wild.
And then of course, not to mention that these are entirely believable 3D spaces, with seemingly full object permanence. As opposed to other efforts I've seen which are basically briefly animating a 2D scene to make it seem vaguely 3D.
Realistically, how do you fit this into a movie, a TV show, or a game? You write a text prompt, get a scene, and then everything is gone—the characters, props, rooms, buildings, environments, etc. won’t carry over to the next prompt.
I wonder if it could be used as a replacement for optical flow to create slow motion videos out of normal speed ones.
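For context, here's a crude sketch of the classical optical-flow interpolation that would be replaced (OpenCV's Farneback flow with a naive half-step warp; real interpolators are fancier):

    # Crude frame interpolation via dense optical flow (OpenCV Farneback).
    # Naive half-step warp, shown only as the classical baseline.
    import cv2
    import numpy as np

    def midframe(frame_a, frame_b):
        gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
        gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = gray_a.shape
        gx, gy = np.meshgrid(np.arange(w), np.arange(h))
        # Sample frame A halfway along the flow (assumes locally linear
        # motion; this breaks on occlusions, which a generative model
        # could plausibly hallucinate through).
        map_x = (gx + flow[..., 0] * 0.5).astype(np.float32)
        map_y = (gy + flow[..., 1] * 0.5).astype(np.float32)
        return cv2.remap(frame_a, map_x, map_y, cv2.INTER_LINEAR)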
If anyone's taking requests, could you do one that takes audio clips from podcasts and turns them into animations? Ideally via API rather than some PITA UI
Being able to keep the animation style between generations would be the key feature for that kind of use-case I imagine.
I really wonder what's going to come out of the company and on what timeline.
I guess we've all just been replaced.
The fear I have has less to do with these taking jobs and more with the fact that eventually this is just going to be used by a foreign actor and no one is going to know what is real anymore. This already happens in news stories; now imagine that with actual AI videos that are near indistinguishable from reality. It could get really bad. Have an insane conspiracy theory? Well, now you can have your belief validated by a completely fictional AI-generated video that even the most trained eyes have trouble debunking.
The jobs thing is also a concern, because if you have a bunch of idle hands that suddenly aren't sure what to believe or just believe lies, it can quickly turn into mass political violence. Don't be naive to think this isn't already being thought of by various national security services and militaries. We're already on the precipice of it, this could eventually be a good shove down the hill.
https://cdn.openai.com/sora/videos/puppy-cloning.mp4
Perhaps there are particular aspects of our world that the human mind has evolved to hyperfocus on.
Will we figure out an easy way to make these models match humans in those areas? Let's hope it takes some time.
Genuine question, I have no idea.
"Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI."
This also helps explain why the model is so good, since it is trained to simulate the real world rather than to imitate pixels. More importantly, its capabilities suggest AGI and general robotics could be closer than many think (even though some key weaknesses remain and further improvements are necessary before the goal is reached).
EDIT: I just saw this relevant comment by an expert at Nvidia:
“If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.
I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!
Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." ….”
https://twitter.com/DrJimFan/status/1758210245799920123
Even though many things are super impressive, there is a lot of uncanny valley happening here.
All of the AI-generated media has this quality where I can immediately tell that it's AI, and that becomes my dominant thought. I see these things on social media and think "oh, another AI pic" and keep scrolling. I've yet to be confused about whether something is AI-generated or real for more than a few seconds.
Consistency and continuity still seem to be major issues. It would be very difficult to tell a story using Sora because details and the overall style would change from scene to scene. This is also true of the newest image models.
Many people think that Sora is the second coming, and I hope it turns out to have a major impact on all of our lives. But right now it's looking to have about the same impact that DALL-E has had so far.
The bottleneck of creating a separate prompt is very limiting.
Imagine asking for a recipe or car repair and it makes a video of the exact steps. Or if you could upload a video and ask it to make a new ending.
That’s what I imagine multimodal models would be.
1. Why would Andrej Karpathy leave when he knows such an impressive breakthrough is in the pipeline?
2. Why hasn't Ilya Sutskever spoken about this?
AGI can’t be far off; that stuff clearly understands a bunch of high-level concepts.
This product looks incredible...
I guess it was anticipated.
Pika - $55M
Synthesia - $156M
Stability AI - $173M
This model shows a very good (albeit not perfect) understanding of the physics of objects and relationships between them. The announcement mentions this several times.
The OpenAI blog post lists "Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care." as one of the "failed" cases. But this (and "Reflections in the window of a train traveling through the Tokyo suburbs.") seem to me to be 2 of the most important examples.
- In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.
- In the chair one, OpenAI says the model failed to model the physics of the object (which hints that it did try to, which is not how the early diffusion models worked; they just tried to generate "plausible" images). And we can see one of the archeologists basically chasing the chair down to grab it, which does correctly model the interaction with a floating object.
I think we can't overstate how crucial that is to the building of a general model that has a strong model of the world. Not just a "theory of mind", but a literal understanding of "what will happen next", independently of "what would a human say would happen next" (which is what the usual text-based models seem to do).
This is going to be much more important, IMO, than the video aspect.
I am curious how optimised their approach is and what hardware you would need to analyse videos at reasonable speed.
It also raises the question: if more and more children are introduced to media from a young age and are fed more and more generated content, will they still be able to feel the "uncanniness", or will they become completely numb to it?
There's definitely an interesting period ahead of us; not yet sure how to feel about it...
We'll get some groundbreaking film content out of this in the hands of a few talented creatives, and a vast ocean of mediocre content from the hands of talentless people who know how to type. What's the benefit to humanity, concretely?
This video is pretty instructive: https://cdn.openai.com/sora/videos/amalfi-coast.mp4
It "eats" several people with the wall part of the way through the video, and the camera movements are odd. Strange camera movements, in response to most of the prompts, seems like the biggest problem. The model arbitrarily decides to change direction on a dime - even a drone wouldn't behave quite like that.
Yes, it's a great technical achievement, but I just worry for the future. We don't have good social safety nets, and we aren't close to UBI. It's difficult for me to see that happen unless something drastic changes.
I'm also afraid of one company just having so much power. How does anyone compete?
Exciting for the potential this creates, but scary for the social implications (e.g., this will make trial law nearly impossible).
Brother, have you seen Runway Gen 2, or SVD 1.1? I'm not excited about Sora because I think it looks like Hollywood animations; I'm excited because an open-source 3rd-gen Sora is going to be so much better, and this much progression in one step is really exciting!
Anyway, videos look incredible. I genuinely can't believe my eyes.
Sora: plays GTA V
- Disruptions like this happen to every industry every now and then, just not on the level of "communicating with people with words and pictures". Anduril and SpaceX disrupted defense contractors and United Launch Alliance; someone working for a defense contractor/ULA here affected by that might attest to the feeling?
- There will be plenty of opportunity to innovate. Industries are being created right now. People probably also felt the same way when they saw HTTP on their screens the first time. So don't think your career or life's worth of work is minuscule; it's just a moving target. Adapt & learn.
- The devil is in the details. When a bunch of large SaaS behemoths created enterprise software, an army of contractors and consultants grew to support the glue that was ETL. A lot of work remains to be done. It will just be a more imaginative glue.
Two things are interesting:
- No audio -- that must have been hard to add, or else it would have been there.
- Spelling is still probably hard to do (the familiar DallE problem)... e.g. a video showing a car driving past a billboard with specified text.
The future of these high-fidelity (but not perfect) generative AI systems is in realizing we're going to need "humans in the loop". This means designing to output human-manipulable data - perhaps models/skeletons/textures instead of whole output. Pixels are hard to manipulate directly!
As for entertainment, already we see people sick of CGI - will people really want to pay for AI-generated video?
- Local/bespoke high-quality video content creation by ordinary Joes: check.
- Ordinary Joes making fake porn videos for money: check.
- Dramatically reduced cost for real movies by editing in AI scenes: check.
A whole industry will get upturned.
We're nowhere near full automation; these are growing pains, but maybe the canary in the coal mine for the job market. Expect more enthusiasm for UBI or negative income tax and the like, and policies to follow. Cheap energy is also coming eventually, just slower.
Such achievements in technology must lead to cultural change. Look at how popular vinyl has become; why not theatre again?
It's still too easy to notice these are all AI rendered.
I also feel a sense of dread. Imagine the tidal wave of rubbish coming our way. First text, then images, and now video can be spewed out in industrial quantities. Will it lead to a better culture? In theory it could; in practice I just feel like we'll be deluged with exponentially more mediocre "content".
We've fairly quickly moved from a world where AIs would communicate with each other through text to one in which they can do so through video.
I'm very curious how something like Sora might end up being used to generate synthetic training data for multimodal models...
*Edit* Oh, I just read here (https://www.reddit.com/r/MachineLearning/comments/1armmng/d_...) that a technical paper should be released later today?
Would we be able to perceive the differences between those and the physical world? I can't help but feel that a proof of the simulation theory might be possible here.
Every industrial revolution and its resulting automation has brought not only more jobs but also a more diverse set of jobs, and with them new industries. History rhymes; the prevailing fears in such times have always been similar. Claims are being made, but without any reasonable theories, expertise, or provable facts (e.g. the Goldman Sachs unemployment prediction is absolute bs). This is even more true when such AI matters are thought through in more detail. Furthermore, even though they probably employ tens of millions of people, only a few industries like content creation, movies, etc. are affected. The affected workforce of these industries is highly creative; that is what they are being paid for. The set of jobs today is big; they won't become cleaning staff, nor homeless.
This technology also has to prove itself (its technical potential is unlimited, but it is financially limited by the size of the funds being invested, and those are finite).
The transition to such tools in corporations could take years, depending on their type, size, and other parameters. People underestimate the inefficiencies that a lot of companies embody - and I am only talking about the US and some parts of Europe here. If a company did its job the same way for two decades, a sudden switch does not happen overnight. Affected people have ways to transition to other industries, educate themselves further, and much more. Especially for someone living in the West, the opportunities are huge. And in addition, the wide array of variables in the economy, the earth, and its differing societies comes into play: some corporations want real videos made by real people; some companies want to stay the way they are and compete using their traditional methods; corporations are still going to hire ad agencies - ad agencies whose workflow is now much more efficient and more open to new creative spheres, which benefits both the customer and themselves. The list could go on endlessly.
Lots of people seem to fear or think about the alleged sole power OpenAI COULD achieve. But would that be a problem? Would "another Alphabet" be a problem? Hundreds of millions of people benefited and are benefiting today from their products. They have products that are reliable and work (this forum of tech experts is a niche case; nearly all people don't care at all if data on them is used for commercial purposes). Google had a patent-guaranteed monopoly on search. But here we have an almost non-patented (and barely patentable) market, an open-source community, other companies of all sizes competing, innovation happening, and much more. It is true that companies like OpenAI have more funds available to spend than others, but such circumstances have always driven competition and innovation. And at the end of the day, customers are still going to use whichever product they decide is best.
I know I may be stating the obvious, but: the economy and the world form a chaotic system with an unpredictable future.
“I’ve heard a lot of people say they’re leaving film,” he says. “I’ve been thinking of where I can pivot to if I can’t make a living out of this anymore.” - a concept artist responsible for the look of the Hunger Games and some other films.
"A study surveying 300 leaders across Hollywood, issued in January, reported that three-fourths of respondents indicated that AI tools supported the elimination, reduction or consolidation of jobs at their companies. Over the next three years, it estimates that nearly 204,000 positions will be adversely affected."
"Commercial production may be among the main casualties of AI video tools as quality is considered less important than in film and TV production."
[1] https://www.hollywoodreporter.com/business/business-news/ope...
Extremely hard to do, it is, but you’ll become quasi-Amish and realize how little is actually actionable and in our control.
You’ll also feel quite isolated, but peaceful. There’s always tradeoffs. You can’t have something without giving up not-something, if that makes sense.
Edit: So, essentially, ignorance is bliss, but try to look past the pejorative nature of that phrase and take it for what it is without status implications.
That being said, there is value in these systems for casual use. For example, me and my girlfriend got into the habit of sending little cartoons to each other. These are cartoons we would have never created otherwise. I think that’s pretty awesome.
You just write the rules of the game and the player input, and let the AI generate the next frame.
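A purely hypothetical sketch of that loop; generate_next_frame stands in for a model API that doesn't exist publicly:

    # Hypothetical "model as game engine" loop. `model.generate_next_frame`
    # is an imagined API, not anything OpenAI has shipped.
    def play(model, rules_prompt, first_frame, read_input, display):
        frame = first_frame
        while True:
            action = read_input()  # e.g. "walk left", "jump"
            # Condition on the rules, the last frame, and the player input,
            # and ask the model to dream up the next frame.
            frame = model.generate_next_frame(
                prompt=f"{rules_prompt}\nPlayer action: {action}",
                context_frame=frame,
            )
            display(frame)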
But I have a problem: I am unable to believe the videos I saw were dreamt up by AI. I feel deeply that there must be some trickery or severe embellishment. If I am wrong, I guess we are at an inflection point.
I can recall 10+ years ago, we were talking "in hacking groups" about AI because we thought the human brain alone was not good enough anymore... but in a maths/sciences context.
I’d imagine IRL no-tech experiences will be the new ‘escapes’ too.
Maybe I’m too idealistic about the importance of the human spirit/essence…whatever that actually is.
Jokes aside, it's becoming more apparent that power will further concentrate in big tech firms.
Letting people just create lifelike videos of anything they can put their mind to is bound to lead to the ruining of many people's lives.
For every person who is aware of and interested in this technology, there are 100 people who have no idea, don't care, or can't comprehend it. Those are the people that I fear for. Grab a few pictures of the grandkids off of Facebook, and now they have a realistic ransom video to send.
Am I being hyperbolic? I don't think so. Anything made by humans can be broken. And once it's broken and out there, good luck.
Want to form a trade union in your workplace? Best be ready for videos of you jacking off to be all over the internet.
Videotape a police officer brutalising someone? Could easily have been made with AI, not admissible.
These things will ruin the ability to trust anything online.
I guess what I'm wondering is how "new" the videos are, or how closely do they mimic a particular video in the training set? Will we generate compelling and novel works of art with this, or is this just a very round-about way of re-implementing the YouTube search bar?
- This enables everyone to be creators
- Given that everyone's creativity isn't top notch, the highest quality will be limited to the best
- So the rest of us will be consumers
- How will we consume if we don't have work and there is no UBI?
We still need nurses, cooks, theater, builders, etc.
Video 7 of 8 on the 2nd player on the page.
> Prompt: The camera rotates around a large stack of vintage televisions all showing different programs — 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery.
For instance, the generational leap in video generation capability of SORA may be possible because:
1. Instead of resizing, cropping, or trimming videos to a standard size, Sora trains on data at its native size. This preserves the original aspect ratios and improves composition and framing in the generated videos. This requires massive infrastructure. This is eerily similar to how GPT3 benefited from a blunt approach of throwing massive resources at a problem rather than extensively optimizing the architecture, dataset, or pre-training steps.
2. Sora leverages the re-captioning technique from DALL-E 3 by leveraging GPT to turn short user prompts into longer detailed captions that are sent to the video model. Although it remains unclear whether they employ GPT-4 or another internal model, it stands to reason that they have access to a superior captioning model compared to others.
This is not to say that inertia and resources are the only factors differentiating OpenAI; they may have access to a much better talent pool, but that is hard to gauge from the outside.
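The re-captioning step in point 2 is easy to approximate with the public chat API; a rough sketch (assuming GPT-4 via the openai Python client, whereas the internal captioner Sora uses is unknown):

    # Approximation of DALL-E 3-style re-captioning: expand a short user
    # prompt into a long, detailed caption. Uses the public OpenAI client
    # as a stand-in for whatever internal model Sora actually uses.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def expand_prompt(short_prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": ("Rewrite the user's short video idea as one "
                             "long, highly detailed caption: subjects, "
                             "setting, lighting, camera movement, style.")},
                {"role": "user", "content": short_prompt},
            ],
        )
        return response.choices[0].message.content

    print(expand_prompt("a dog chasing a ball on a beach at sunset"))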
As I understand the current US situation, a straight prompt-to-generate-video cannot be copyrighted. https://www.copyright.gov/ai/ai_policy_guidance.pdf
But the copyright office is apparently considering the situation more thoroughly now.
Is that where it stands?
If it can’t be copyrighted, it seems that would hamper many uses.
Except for live sporting events.
This is why I think megacorps are all going to bid for sports league streaming rights. That's the only thing AI can't touch.
It's doubly amazing when you think that the richness of video data is almost infinitely greater than text, and it requires no human-made data.
The next step is to combine an LLM with this, not for multimodality, but so the two can team up to build a 'reality model' with a shared understanding.
I called LLMs 'language-induced reality models' in the past. This, then, is a 'video-induced reality model', which is far better at modeling reality than language alone, as humans have testified.
Also (since it's been a while): there are over 2000 comments in the current thread. To read them all, you need to click More links at the bottom of the page, or like this:
https://news.ycombinator.com/item?id=39386156&p=2
...and our future lies in the hands of venture capitalists, many of whom have no moral compass, just an insatiable hunger to make ever larger sums of money.
It’s interesting reading all the comments; I think both sides of the “we should be scared” debate are right in some sense.
These models currently give some sort of superpower to experts in a lot of digital fields. I’m able to automate the mundane parts of coding and push out fun projects a lot easier today. Does it replace my work? No. Will it keep getting better? Of course!
People who are willing to build will have a greater ability to output great things. On the flip side, larger companies will also have the ability to automate some parts of their business - leading to job loss.
At some point, my view is that this must keep advancing to some sort of AGI. Maybe it’s us connecting our brains to LLMs through a tool like Neuralink. Maybe it’s a random occurrence when you keep creating things like Sora. Who knows. It seems inevitable though doesn’t it?
Now the big question: as OpenAI keeps pushing boundaries, it's fascinating to see the emergence of tools like Sora, capable of creating incredibly lifelike videos. But with this innovation comes a set of concerns we can't ignore.
So I'm worried about these tools being misused. What impact could they have on the trustworthiness of visual media, especially in an era plagued by fake news and misinformation? And what about the ethical considerations surrounding the creation and dissemination of content that looks real but isn't?
And what should we do to tackle these potential issues? Should there be rules or guidelines to govern the use of such tools, and if so, how can we make sure they're effective?
It's a nice cleansing benefit that comes with these really extraordinary tech achievements, and it should not be undervalued (after all, it produces basically an endless supply of equally trained producers, much as the industry itself did before, in a somewhat malformed way).
Poster frames and commercials are thrown at us all the time, consumed by our brains to such a degree that we actually see a goal in producing more of them to act like a pro. The inflationary availability that comes with these tools seems a great help in leaving some of that behind and drawing a clearer line between it and actual content.
That said, Dall-E still produces enough colorful weirdness to not fall into that category at all.
Yes, I'm still bitter about that.
For example this looks very much like something from a modern 3d engine:
Creating these videos in CGI is a profession that can make you serious money.
Until today.
What a leap.
I vote for Hothouse, by Brian W. Aldiss. So many images need to be imagined, like spiders that jump to the moon and back again.
"Sora" is not a video generation technology offered by OpenAI. As of my last update in April 2023, OpenAI provides access to various AI technologies, including GPT (Generative Pre-trained Transformer) for text generation and DALL·E for image generation. For video generation or enhancement, there might be other technologies or platforms available, but "Sora" as a specific product related to OpenAI or video generation does not exist in the information I have.
If you're interested in AI technologies for video generation or any other AI-related inquiries, I'd be happy to provide information or help with what's currently available!
Why would it?
Creativity being automated while humans are forced to perform menial tasks for minimum wage doesn't seem like a great future and the geriatric political class has absolutely no clue how to manage the situation.
I can see a new market for true end-to-end analogue film productions emerging for people who like film.
I've always been a digital stills guy, and have dabbled in video as a hobby. As a hobbyist, I always found the hardest thing is making something worth looking at. I don't see AI displacing the pleasure of the art for a hobbyist.
My next guess is the 80/20% or 95/5% problem is gonna be stuff like dialogue matching audio and mouth/face motion.
I do see this kind of stuff killing the stock images / media illustrator / b-roll footage / etc jobs.
Could a content mill pump out plausibly decent Netflix video series given this tool and a couple half decent writers.. maybe? Then again it may be the perpetual "5 years away". There's a wide gap between generating filler content & producing something people choose to watch willingly for entertainment.
Phase 1 (we are here now): While generative AI is not good enough to directly produce parts of the final product, it can already be used to quickly prototype styles, stories, designs, moods, etc. A good chunk of the unnamed behind-the-scenes people will lose their jobs.
Phase 2: While generative AI is still expensive, the output quality is sufficient to directly produce parts of / the entire final product. Big production outlets will use it to produce AAA titles and blockbusters. Even actors, directors and other high publicity positions will be replaced.
Phase 3: The production cost will sink further until it becomes attainable by smaller studios and indie productions. The already fierce markets will be completely flooded with more and more quantity over quality. Advertisement will not be pre-produced and cut into videos anymore but become very subtle product placements, impossible for ad-blockers to strip from the product.
Phase 4: Once the production cost falls below the price of one copy of the product, we will get completely customized entertainment products tailored to our personal taste. Online communities will emerge which craft skeletons / templates which then are filled out by the personal parameter sets of the consumers. That way you can still share the experience with friends even though everybody experiences a different variation.
Phase 5: As consumers do not hit any production limits any more (e.g. binge watch their favorite series ad infinitum) and the product becomes optimized to be maximally addictive by measuring their reaction to it, it will become impossible for most human beings to resist. The entertainment mania will reach its peak and social isolation, health issues and economic factors will bring down the human reproduction rate to basically zero.
Phase 6: Human civilization collapses within one or two generations, and the only survivors will be extremely technology-averse people, by selection. AGI might have happened in the meantime but did not have the time to gracefully take over and remodel the human infrastructure to become self-sufficient. Instead a strict religion will rule the lands and the dark ages begin anew.
Note that none of this is new; it is just the continuation and intensification of already existing trends. This is also not AGI doomerism, as it does not involve a malicious AGI gone rogue or anything like that. It is simply what happens when human nature meets powerful technology.
TLDR: While I love the technology, I can only see very negative long-term outcomes from this.
or we are heading towards a Skynet-y future
There should be an opt out from being subjected to AI content.
To me, it's becoming increasingly obvious that startups whose defensibility hinges on "hoping OpenAI doesn't do this" are probably not very enduring ones.
As several others have pointed out, the realism of these models will continue to improve, and they will soon be economically useful for producing beautiful or functional artifacts - however, prompt adherence (getting what you want or intend) is growing much more slowly.
However, I think we have a long way to go before we'll see a decent "AI film" that tells a compelling story - and this has nothing to do with some sort of naturalistic fallacy that appeals to some innate nature of humans!
It comes down to the dataset and the limits of human creators in their ability to communicate their process. Image-text and video-text pairs are mostly labeled by semi-skilled humans who describe what they see in detail. They are, for the most part, very good at capturing the obvious salient features of an image or a video: "reflections of the neon lights glisten in the sidewalk". However, what you see in a movie scene is the sum total of dozens if not hundreds of influences, large and subtle. Choices made by the actors, camera operators, lighting designers, sound designers, costuming, makeup, editors, etc... Most people are not trained to recognize these choices at all, or might not even be aware that there are choices to make. We (simply) see "Joaquin Phoenix is making awkward small-talk in the elevator with other office workers".
So much of what we experience processes on subconscious and emotional and purely sensory levels, we don't elevate those lower-level qualia to our higher-brain's awareness and label them with vocabulary without intentional training (such as tasting wine, coffee, beer, etc - developing a palate is an act of sensory-vocabulary alignment).
However, despite not raising these things to our intentional awareness, it has an influence on us -- often the desired impact of the person who made that choice in the first place. The overall effect of all of these intentional choices makes things 'feel right'.
There's no fundamental reason AI can't produce an output that has the same effect as those choices, however finding each little choice is like a needle in a haystack. Accurate labeling of the training data tells the AI where to look -- but the people labeling the data are probably not well-versed in all of the little intentional choices that can be made when creating a piece of video-media.
Beyond the issue of the labeling folks being trained in the art itself, there's the problem too of the artists themselves not being able to fully articulate their (numerous, little, snowflake-into-avalanche) choices - or simply not articulating it even if they could. Ask Jackson Pollock about paint viscosity and you'll learn a great deal, but ask about abstract painting composition and there's this ineffable gap that language seems ill-suited to cross. The painter paints what they feel, and they hope that feeling is conveyed to the viewer - but you'd be hard pressed to recreate "Autumn Rhythm (Number 30)" if you had to transmit the information via language and hope they interpreted it correctly. Art is simultaneously vague and specific!
So, to sum up the problem of conveying your intent to the model:
- The training data labels capture obvious or salient features, but not choices only visible to the trained eye
- The material itself is created by human artists who might not even be able to explain all of their choices in words
- You the prompter might not have the vocabulary that captures succinctly and specifically the intended effect
- The end result will necessarily be not quite what you imagined in your mind's eye as a result of all of this missing information
You can still get good results if you tell it to copy something, because the label "Tarantino" captures a lot of detail, even all the little things you and the training data would never have labeled in words. But it won't be yours and - until we have an army of trained artists providing precise descriptions for training data in their area of expertise, and you know how to speak those artists' language - it can't be yours.
On the other hand, people find the tech very impressive and there are a lot of mind-blowing use-cases.
Personally, this opens up the world for me to create video ads for software projects I create, since I have no financial resources or time to actually make videos, I only know how to code. So I find it pretty exciting. It's great for solo entrepreneurs.
https://trends.google.com/trends/explore?date=all&q=disrupt&...
Sora represents a monumental leap forward; it's comically a 3000% improvement in seconds of 'coherent' video generation. Coupled with a significantly enhanced understanding of contextual prompts and overall quality, it has achieved what many (most?) thought would take another year or two.
I think we will see studios like ILM pivoting to AI in the near future. There's no need for 200 VFX artists when you can have 15 artists working with AI tooling to generate all the frame-by-frame effects, backgrounds, and compositing for movies. It'll open the door for indie projects that can take place in settings that were previously the domain of big Hollywood. A sci-fi opera could be put together with a few talented actors, AI effects and a small team to handle post-production. This could conceivably include AI scoring.
Sure, Hollywood and various guilds will strongly resist but it'll require just a handful of streaming companies to pivot. Suddenly content creation costs for Netflix drops an order of magnitude. The economics of content creation will fundamentally change.
At the risk of being proven very wrong, I think replacing actors is still fairly distant in the future but again... humans are bad at conceptualizing exponential progress.
Contrary to the trends in SV, the dehumanization of creative professions will result not in a productivity boost but in utter chaos, and as a result will add more lost time to the production process.
I never liked Sam Altman in his YC years, and now I know why.
Even with the "blessings" from the "masters" in Davos/Bilderberg, a bad idea is a bad idea. Maybe this will push World ID as a result, but is it necessary?
The current trends in tech are not producing solutions to professional problems. With rare exceptions, this looks more and more like the removal of human input and the normalization of a society ruled by AI at any cost. So sad.
"porn without consent" - thought crime
"too much porn of whatever you dream of" - yes, conservatives (50% of USA) actually think this is a problem
"spam" - advancing the closed garden model email is heading towards. soon you will simply need government id to make email even though there are plenty of alternative ways to do communication aside from email which was already considered insecure and a bad protocol in 2000. this has nothing to do with AI but they are still acknowledging this absurdity by framing AI as the enabler of that.
"automated social engineering" - just weaponizing the ignorance the bad thought leaders of the industry left us. instead of giving us proper authentication methods, we still have "just send my photo id to these 33 companies, which will ask for it in random ways we dont expect and just have to trust them"
"copyright" - literally not a problem, almost nothing "protected" by copyright matters and the law is just used by aggressive capitalists to shove their products down everyone's throat
"ICBMs being automatically hacked and launched at people" - just stop being bad government and hiring completely uncredible people to implement every mission critical control system while hooking it up to the internet
"racist bias" (or whatever) - this is the dumbest fucking thing i've ever heard of
this website is a perfect snapshot of why tech sucks so hard. it's dressed up like cinematic film using a ton of js libs and css hacks or god knows what, so it can only be viewed smoothly on the latest computer hardware, and only on one of the big 3 browsers, each of which had a trillion man-hours of pointless iterations driven by digital graphics marketing companies. and on top of that they have a nice professional tone made by $300K/year PR people. please, sincerely, fuck off.