Gemini Omni is Google’s answer to the next obvious question after Veo, Nano Banana, and the rest of the Gemini stack: what if one model handled all of it? The Omni pitch is that you can combine images, audio, video, and text as input, and generate any of them as output, all grounded in Gemini’s real-world knowledge. The early rollout went to AI Pro and Ultra subscribers in May 2026.
I tested Omni on mixed-input scenarios: photo + text into video, audio + text into video, text into multimodal output. Here is where the single-model approach actually pays off and where dedicated models still win.
What is Gemini Omni?
Gemini Omni is the omni-modal model family from Google DeepMind, built on the Gemini 3 stack. The headline capability is mixing input modalities (text + image + audio + video) freely and producing output in any modality. The first release, Gemini Omni Flash, focuses on video generation grounded in real-world reasoning.
Distribution is staged: Omni is rolling out to Google AI Plus, Pro and Ultra subscribers globally through the Gemini app and Google Flow, and is arriving on YouTube Shorts Remix and the YouTube Create app for creators.
How I got access
Through Google AI Pro ($19.99/mo). Omni appeared in the Gemini app automatically. Flow integration arrived a week later. For YouTube Shorts Remix, you need a YouTube creator account in the rollout region.
The test results
Test 1. Photo + text to video
Prompt: “Input: a product still of a coffee mug. Text: "Animate the steam rising slowly, then a hand reaches in to grab the mug. Soft window light."”
Omni produced a 6-second clip that respected both the product still and the prompt direction. The steam animation matched the prompt, the hand entered from the right side, and the mug stayed identical to the source still. A strong proof point for the multi-input approach.
Test 2. Audio + text to video
Prompt: “Input: a recorded VO of someone saying "Welcome to the show." Text: "Generate a 5-second show intro where a presenter lip-syncs this audio in a TV studio."”
The test here was lip sync to the supplied audio. Omni delivered a presenter whose lips matched the timing of the VO across the 5 seconds. The studio setting was generic but coherent. This is the workflow Veo 4 cannot do natively: taking external audio as input.
Test 3. Text to multimodal output
Prompt: “Generate a complete 10-second product ad: video, voiceover script, music. Subject: a smart water bottle.”
Omni produced the 10-second video with synchronized VO and a basic music bed. The script was generic ("Stay hydrated. Stay smart.") but the timing and audio mix were correct. For full ad polish you still want a copywriter and a music pass.
The annoying parts
Rollout pace. Omni is rolling out by region and plan. EU and some Asian markets still lag.
Latency. Mixed-modality renders take 30 to 90 seconds, which is slower than dedicated models working in their own modality.
Output quality variance. Video output is good. Audio output is functional. Image output trails the dedicated Nano Banana Pro for editing work.
Is it worth the price?
For teams already on Google AI Pro, Omni is included and worth using for mixed-input scenarios that other models cannot handle.
For specialist work in one modality (pure video, pure image), the dedicated models (Veo 4, Nano Banana Pro) still produce higher quality at lower latency.
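That rule of thumb can be written down as a tiny routing helper. This is only a sketch of how I would encode it in Python: the function name, the modality labels, and the returned strings are my own illustrative choices, and nothing here calls a Google API. Sending video-to-video jobs to Veo 4 is an assumption on my part, not something I tested.

```python
# Illustrative sketch: the returned strings are plain labels, not API model IDs.

def pick_model(inputs, outputs):
    """Suggest a model for a job based on the modalities involved.

    inputs, outputs: iterables of modality names such as
    "text", "image", "audio", "video".
    """
    # Text is present in almost every prompt, so ignore it when deciding
    # whether the job is "pure" single-modality work.
    involved = (set(inputs) | set(outputs)) - {"text"}

    if involved == {"video"}:
        return "Veo 4"            # pure video work
    if involved == {"image"}:
        return "Nano Banana Pro"  # pure image work, including edits
    # Any mix of modalities (or anything not covered above) goes to Omni.
    return "Gemini Omni"


# The three tests from this review all land on Omni:
print(pick_model({"image", "text"}, {"video"}))
print(pick_model({"audio", "text"}, {"video"}))
print(pick_model({"text"}, {"video", "audio"}))
```

The design choice is deliberately blunt: the moment a second non-text modality shows up on either side of the job, the input mix is the unlock and Omni gets the work.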
How Vuela.ai fits into an Omni workflow
Gemini Omni is a powerful generalist. It does not yet handle the pipeline jobs Vuela.ai automates: cloning a viral format, lip-sync translation across languages, repurposing one render into a dozen aspect ratios.
Use Omni for the mixed-modality bridges, and use Vuela.ai for the pipeline that turns one generation into a shipped asset.
The verdict
Gemini Omni is a major Google release and the first credible omni-modal generation model. For mixed-input scenarios, nothing else competes today.
For specialist single-modality work, dedicated models (Veo 4 for video, Nano Banana Pro for images) still produce higher quality. Use Omni when the input mix is the unlock.