
This Week in AI: Behind Gemini's New Image Generation

Cutting through the noise in AI

Welcome to the Women Who AI Newsletter, your weekly update on what actually matters in AI when you’re focused on building and scaling startups.

Was this forwarded to you? Click here to subscribe and join our community of founders shaping the future of AI.

Deep Dive: Google’s Gemini 2.0 Flash Introduces Native Image Generation

Google just rolled out native image generation capabilities in Gemini 2.0 Flash. This experimental feature is accessible as gemini-2.0-flash-exp in both Google AI Studio and the Gemini API.

Gemini is trained from the ground up to handle text, image, video, and audio together. Unlike some earlier models that bolt on visual processing, Gemini treats images as first-class citizens in its training and inference pipeline. It accepts interleaved input (text + image + audio) and can produce interleaved output, including text and generated images.

Source: Generated with Gemini 2.0 at https://aistudio.google.com/prompts/new_chat

To use it in AI Studio, select Gemini 2.0 Flash (Image Generation Experimental) from the Model dropdown on the right-hand side.

Features

Story illustration with character consistency

Gemini can maintain consistent characters and settings throughout illustrated stories, making it possible to create coherent visual narratives.

Multi-turn image editing conversations

Gemini supports natural-language dialogue around image editing while maintaining context between iterations. This conversational approach means you can request modifications and see them applied without starting from scratch each time.
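
As a rough sketch of what this looks like in code (assuming the google-genai Python SDK and an image from an earlier turn saved as scene_1.png; the file name and API key are placeholders), you can pass the current image back alongside a plain-language instruction:

from google import genai
from google.genai import types
from PIL import Image

# Client setup shown for completeness; use your own API key
client = genai.Client(api_key="YOUR_API_KEY")

# Send the current image back with a natural-language edit request;
# the model returns an updated image rather than regenerating from scratch
current_image = Image.open("scene_1.png")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        "Make the lighting warmer and add a sunset in the background.",
        current_image,
    ],
    config=types.GenerateContentConfig(response_modalities=["Text", "Image"]),
)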

In-Image Text Generation

Gemini can generate legible, high-quality text directly within images. Garbled text has long been a "tell" for AI-generated images, so this is exciting if your work calls for realistic scenes with readable text, and it's something to watch if you're concerned about distinguishing real images from AI-generated ones.

Use cases for busy founders

For AI startup founders, this addition to Gemini's toolkit presents opportunities to:

  • Create consistent visual branding assets without extensive design resources

  • Prototype product interfaces and visualizations rapidly

  • Develop more engaging content marketing with custom visuals

  • Build multimodal applications that seamlessly blend text and images

You can use the AI Studio interface, or integrate the API into workflows with tools like N8N to automate your tasks.

Here's a minimal Gemini API example (see the official documentation for full details):

from google import genai
from google.genai import types

# Create a client with your Gemini API key
client = genai.Client(api_key="YOUR_API_KEY")

# Ask for an illustrated story; response_modalities requests both text and images
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=(
        "Generate a story about a cute baby turtle in a 3d digital art style. "
        "For each scene, generate an image."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["Text", "Image"]
    ),
)
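
Because the response interleaves text and image parts, you'll also want to walk the returned parts, print the story text, and save each image. A minimal sketch, assuming the response object from the snippet above and Pillow installed (file names are ours):

from io import BytesIO
from PIL import Image

# Iterate over the interleaved parts: text parts hold the story,
# inline_data parts hold the generated images
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"scene_{i}.png")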

Research foundations behind this approach (maybe)

While Google doesn't explicitly reference specific research in its announcement, recent work by Google Research and DeepMind researchers can help us understand how it might work.

Multimodal Understanding & Control

The conversational editing capabilities in Gemini 2.0 Flash likely build upon research approaches demonstrated in Google's Instruct-Imagen. This work shows how detailed natural language instructions can coordinate multiple image-generation elements.

Instruct-Imagen introduced a framework in which natural language serves as a universal coordinator for multiple input modalities (text, images, masks, sketches). The system was trained in two phases: first, through retrieval-augmented training to help it understand context from external references, and then through instruction-tuning across diverse image generation tasks.

The innovation is in how it enables intuitive control through conversation rather than technical parameters. Instead of separate systems for style transfer, subject preservation, or layout control, Gemini unifies these capabilities under natural language instruction. This makes complex image editing accessible to anyone who can describe what they want without specialized knowledge of design tools. Multi-turn editing conversations become possible because the model maintains the context of previous edits while understanding new instructions in relation to the current image state.

Character Consistency & Scene Composition

Gemini 2.0 Flash's ability to maintain character consistency while generating images could build on Google's research paper Taming Encoder for Zero Fine-tuning Image Customization.

This research showed how specialized encoders can capture high-level identity features without per-character optimization. In that approach, an encoder extracts identity features from reference images, producing compact "identity embeddings" that capture what makes a character unique. Cross-attention mechanisms then integrate these identity signals throughout the diffusion generation process.

What makes this a breakthrough is efficiency and flexibility - Gemini maintains consistent characters across multiple scenes while adhering to new prompts about style, setting, and action, all in a single forward pass. Previous approaches required model fine-tuning (training) for each new character or multiple reference images, making iterative creative work cumbersome. With this advancement, Gemini can maintain character consistency in illustrated stories, making visual storytelling more accessible and natural.
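
None of this is confirmed for Gemini, but the core mechanism from the paper is easy to sketch: the identity embedding extracted from a reference image is appended to the prompt embedding as extra key/value tokens, so every cross-attention layer in the diffusion model can "look at" the character's identity while denoising. A toy PyTorch illustration, with all names and shapes invented for intuition only:

import torch
import torch.nn.functional as F

def identity_cross_attention(latent_tokens, text_tokens, identity_tokens):
    """Toy cross-attention where image latents attend to prompt AND identity tokens.

    latent_tokens:   (n_latent, d) image features inside the diffusion model
    text_tokens:     (n_text, d)   prompt embedding (style, setting, action)
    identity_tokens: (n_id, d)     compact identity embedding from a reference image
    Projection matrices are omitted for brevity.
    """
    # Concatenating identity tokens onto the prompt means character identity
    # conditions every denoising step, with no per-character fine-tuning
    context = torch.cat([text_tokens, identity_tokens], dim=0)

    q = latent_tokens                                    # queries from image latents
    k = v = context                                      # keys/values from conditioning
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                      # identity-aware features

# Toy shapes: 64 latent tokens, 16 prompt tokens, 4 identity tokens, width 128
out = identity_cross_attention(
    torch.randn(64, 128), torch.randn(16, 128), torch.randn(4, 128)
)
print(out.shape)  # torch.Size([64, 128])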

The bigger picture

We're witnessing the gradual merging of specialized AI capabilities into more general-purpose models—a trend that parallels the early evolution of foundation models for text. Just as we saw specialized NLP models (sentiment analysis, classification, translation) absorbed into general LLMs, we're now seeing specialized vision capabilities and image generation integrated into multimodal foundation models.

Upcoming Hackathons

Ready to build that product you've been dreaming about? Check out these upcoming hackathons!

If you'd like to find a Women Who AI team for any event, reply to this email, and we'll connect everyone interested.

MIT Bitcoin Hackathon | Boston & Virtual | April 4-6 | Beginner Friendly | $10K prize pool | Hackathon Link | Expo Link

Bio x AI Hackathon: Shaping the Future of Agentic Science | Virtual & Berlin | April 8-10 | Event Link

AI Agents Hackathon by Microsoft | Virtual | April 8-30 | $50K in prizes | Event Link

Re-Weird the Web | San Francisco | April 12 | Event Link

Jobs We Love

Vertru | Remote | Full-time

Join an MIT-led startup with strong traction as VP of Engineering/CTO. This founder-level role seeks an experienced full-stack engineer with hands-on coding and execution abilities. Using an innovative AI-based knowledge graph approach, Vertru focuses on supply chain risk mapping, monitoring, and compliance. They’re backed by leading investors, including NY Ventures and MIT Founders Fund.

To apply, send your resume to [email protected]

Reply With Questions

We want this newsletter to address the real challenges you're facing. Is there a specific AI development you'd like covered? Jargon we included but didn't properly explain? A business problem you're trying to solve with AI? Reply directly to this email with your questions; we'll tackle them in next week's edition.

Share the Knowledge

If you found value in today's newsletter, please consider forwarding it to other women in your network who are building, or thinking about building, in the AI space. The more we grow this community, the stronger our collective impact becomes.

Here's to building the future of AI, together.

Lea & Daniela