Hands-on review

Gemini Omni: ‘create anything from anything’ tested

Google’s new omni-modal model. Combine text, image, audio, video as input. Generate any of them as output.

By the Vuela.ai content team

What it nails

  • Multimodal input/output in one model
  • Strong physics reasoning across mediums
  • Rolling out on YouTube Shorts and Flow
  • Real-world knowledge grounding

Where it struggles

  • Still rolling out by region and tier
  • Latency is higher than single-modality models
  • Quality varies by output type
  • No advanced video controls yet (camera moves, length)

Gemini Omni is Google’s answer to the next obvious question after Veo, Nano Banana, and the rest of the Gemini stack: what if one model handled all of it? The Omni pitch is that you can combine images, audio, video and text as input, and generate any of them as output, all grounded in Gemini’s real-world knowledge. The early rollout went to Google AI Pro and Ultra subscribers in May 2026.

I tested Omni on mixed-input scenarios: photo + text into video, audio + text into video, text into multimodal output. Here is where the single-model approach actually pays off and where dedicated models still win.

What is Gemini Omni?

Gemini Omni is the omni-modal model family from Google DeepMind, built on the Gemini 3 stack. The headline capability is mixing input modalities (text + image + audio + video) freely and producing output in any modality. The first release, Gemini Omni Flash, focuses on video generation grounded in real-world reasoning.

Distribution is staged: Omni is rolling out to Google AI Plus, Pro and Ultra subscribers globally through the Gemini app and Google Flow, and is arriving on YouTube Shorts Remix and the YouTube Create app for creators.

The Google DeepMind story behind the Veo and Omni video stack. Official from Google DeepMind.

How I got access

I got access through Google AI Pro ($19.99/mo). Omni appeared in the Gemini app automatically, and Flow integration arrived a week later. For YouTube Shorts Remix, you need a YouTube creator account in a rollout region.

The test results

Test 1. Photo + text to video

Prompt: “Input: a product still of a coffee mug. Text: "Animate the steam rising slowly, then a hand reaches in to grab the mug. Soft window light."”

Photo-to-video style demo from the Google video stack. Official from Google DeepMind.

Omni produced a 6-second clip that respected both the product still and the prompt direction. The steam animation was correct, the hand entered from the right side, and the mug stayed identical to the one in the still. It is a strong proof point for the multi-input approach.

Test 2. Audio + text to video

Prompt: “Input: a recorded VO of someone saying "Welcome to the show." Text: "Generate a 5-second show intro where a presenter lip-syncs this audio in a TV studio."”

Lip sync to the supplied audio was the test. Omni delivered a presenter whose lips matched the timing of the VO across the 5 seconds. The studio setting was generic but coherent. This is the workflow Veo 4 cannot do natively: external audio as input.

Test 3. Text to multimodal output

Prompt: “Generate a complete 10-second product ad: video, voiceover script, music. Subject: a smart water bottle.”

Omni produced the 10-second video with synchronized VO and a basic music bed. The script was generic ("Stay hydrated. Stay smart.") but the timing and audio mix were correct. For full ad polish you still want a copywriter and a music pass.

The annoying parts

Rollout pace. Omni is rolling out by region and plan. EU and some Asian markets still lag.

Latency. Mixed-modality renders take 30 to 90 seconds, which is slower than dedicated models working in their own modality.

Output quality variance. Video output is good. Audio output is functional. Image output trails the dedicated Nano Banana Pro for editing work.

Is it worth the price?

For teams already on Google AI Pro, Omni is included and worth using for mixed-input scenarios that other models cannot handle.

For specialist work in one modality (pure video, pure image), the dedicated models (Veo 4, Nano Banana Pro) still produce higher quality at lower latency.
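That rule of thumb can be written down as a tiny routing helper. This is a minimal sketch, not anything Google or Vuela.ai ships: the function name `pick_model`, the modality labels, and the exact cut-offs are assumptions drawn from the three tests above, and the model names are plain display strings rather than API identifiers.

```python
from typing import Iterable

# Display labels only (assumed for illustration, not API model IDs).
OMNI = "Gemini Omni"
VEO = "Veo 4"
NANO_BANANA = "Nano Banana Pro"


def pick_model(inputs: Iterable[str], outputs: Iterable[str]) -> str:
    """Suggest a model following this review's rule of thumb.

    inputs / outputs: modality names such as "text", "image",
    "audio", "video" (hypothetical labels).
    """
    ins, outs = set(inputs), set(outputs)

    # Pure text-to-video: the dedicated video model.
    if outs == {"video"} and ins <= {"text"}:
        return VEO

    # Image generation or editing from text and/or an image:
    # the dedicated image model.
    if outs == {"image"} and ins <= {"text", "image"}:
        return NANO_BANANA

    # Anything else is a mixed-input or multi-output job,
    # which is where Omni paid off in testing.
    return OMNI


if __name__ == "__main__":
    print(pick_model({"text", "image"}, {"video"}))   # Test 1 style job
    print(pick_model({"text", "audio"}, {"video"}))   # Test 2 style job
    print(pick_model({"text"}, {"video", "audio"}))   # Test 3 style job
    print(pick_model({"text"}, {"video"}))            # pure video
```

The thresholds are deliberately coarse. Adjust them if your own tests show Omni closing the quality gap on a given output type.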

How Vuela.ai fits into an Omni workflow

Gemini Omni is a powerful generalist. It does not yet handle the pipeline jobs Vuela.ai automates: cloning a viral format, lip-sync translation across languages, repurposing one render into a dozen aspect ratios.

Use Omni for the mixed-modality bridges and Vuela.ai for the pipeline that turns one generation into a shipped asset.

The verdict

Gemini Omni is a major Google release and the first credible omni-modal generation model. For mixed-input scenarios, nothing else competes today.

For specialist single-modality work, dedicated models (Veo 4 for video, Nano Banana Pro for images) still produce higher quality. Use Omni when the input mix is the unlock.

Gemini Omni review FAQ

How do I get access to Gemini Omni?

Google AI Plus, Pro and Ultra subscribers get access through the Gemini app and Google Flow. YouTube Shorts Remix and the YouTube Create app also expose Omni for creators in rollout regions.

Is Gemini Omni the same as Veo 4?

No. Veo 4 is a dedicated text-to-video model. Omni is a multimodal model that can take video, audio, image, and text as input and output any of them. They live in the same Gemini stack and complement each other.

What is Gemini Omni Pro?

A higher-tier Omni model teased alongside the Flash release. Details and rollout timing have not been confirmed publicly as of May 2026.

Does Gemini Omni generate audio natively?

Yes. Audio is one of the input and output modalities. Voice cloning to a supplied VO is supported.

Can I use Gemini Omni inside Vuela.ai?

Vuela.ai exposes Omni-class generation alongside the rest of the Gemini and competing catalogues. One flat plan, no rollout-region uncertainty.
