Prompting the Unreal
Three Approaches to AI Video Generation with Sora 2
From text-based prompting to dream transcription and image-guided surrealism
Abstract
This document presents three consecutive creative experiments conducted with Sora 2, OpenAI's video generation model, as part of the OpenAI Creative Lab at Paris 8 University. Each experiment represents a distinct methodology: pure text-based prompting inspired by 1990s hip-hop aesthetics, a personal pipeline converting voice-recorded dreams into cinematic video, and the animation of surrealist reference images sourced from Pinterest. The work raises practical and critical questions about the controllability of generative video, the limitations of human figure synthesis, and the growing inability to distinguish AI-generated footage from reality — with particular attention to the societal implications of these tools in the context of disinformation and influence operations.
Keywords: Sora 2 · Generative video · Prompt engineering · Dream transcription · Surrealism · AI + disinformation
Introduction
The OpenAI Creative Lab is a course offered at Paris 8 University in partnership with OpenAI, designed to give students access to production-level AI tools in an experimental, research-oriented setting. This documentation covers three video generation projects produced during the 2025–2026 academic year using Sora 2, OpenAI's state-of-the-art text-to-video and image-to-video model.
Rather than treating Sora 2 as a black box, these experiments were designed to probe it from different angles: first as a purely responsive tool driven by written instructions, then as a medium for personal and subconscious material, and finally as an animation engine guided by pre-existing visual identities. Each experiment was conceived independently, but together they form a coherent arc — moving from the general to the personal, and from the realistic to the deliberately unreal.
A secondary thread runs through all three experiments: the question of perceptual credibility. When can AI-generated video be mistaken for real footage? When can it not, and why? This question carries significant weight beyond aesthetics — it connects directly to the use of such tools in military propaganda, political influence operations, and mass disinformation. The findings presented here are partial and specific to Sora 2, but they point toward issues the field must urgently address.
Research Questions
- How does the granularity and structure of a text prompt affect the quality and coherence of AI-generated video?
- Can subjective, non-visual source material (spoken dream recordings) be systematically translated into convincing cinematic output?
- Does grounding generation in a strong reference image produce more perceptually credible results than pure text-to-video generation?
- At what point does AI-generated video become indistinguishable from documentary or fictional footage, and what are the societal implications of that threshold?
Technical Setup
All video generation was conducted through a custom interface provided by the OpenAI Creative Lab team: openai-sora-demo.vercel.app. This interface provides direct API access to Sora 2, yielding demonstrably better results than those available through the standard consumer application. The API allows finer control over generation parameters and offers higher output fidelity — a key distinction when evaluating the quality of results in this documentation.
The transcription pipeline used in Experiment II was implemented in Python using the OpenAI gpt-4o-transcribe model via the Audio API, and was assisted in its design by ChatGPT. The complete source code is included in the Appendix.
[Screen recording: the Sora 2 interface in use during production.]
Experiment I — The Hip-Hop Clip
Text-to-video · Iterative prompting · 90s aesthetic
The first experiment emerged from an afternoon spent watching old-school 1990s American hip-hop music videos. The visual language of that era — fisheye lenses, VHS grain, handheld cameras at low angles, the gritty aesthetics of basketball courts and lowriders — felt like a rich and specific enough starting point to test Sora 2's capacity for stylistic reproduction.
This experiment relies entirely on text-to-video generation: no reference image is used, and all visual information is encoded in natural language prompts. It is the most direct form of engagement with the tool, and also, as the results demonstrate, the most limited in terms of output quality.
Process — From One Prompt to Five
The experiment began with a single, comprehensive prompt attempting to describe an entire 12-second sequence across multiple scenes. This produced first.mp4 — a result that, while recognizable in its aesthetic references, suffered from inconsistencies in motion, character coherence, and temporal continuity.
The second iteration consolidated the prompt by clarifying the editing structure — naming five distinct scenes within a single clip — while maintaining the same aesthetic parameters. This produced second.mp4.
The third and most developed iteration decomposed the sequence into five separate prompts, each targeting one specific 4-second scene (lowriders, basketball, mural, dance cypher, group wide shot), and requested diegetic sound only — no music — to allow for post-production audio mixing. Each scene was generated independently, then assembled into final.mp4.
Decomposing a long sequence into independent short clips improves the quality of individual scenes but sacrifices visual coherence: characters, lighting, camera identity, and location consistency are lost between clips. Each clip belongs to a different "version" of the same imaginary world.
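The assembly step described above can be sketched as follows. This is a minimal reconstruction assuming ffmpeg's concat demuxer and hypothetical per-scene filenames; the source only names the assembled file, final.mp4.

```python
# Sketch of the Experiment I assembly step: five independently generated
# 4-second clips are concatenated into final.mp4 with ffmpeg's concat
# demuxer. Scene filenames are hypothetical.
from pathlib import Path

SCENES = ["lowriders", "basketball", "mural", "cypher", "group_wide"]

def write_concat_list(scene_names, list_path="scenes.txt"):
    """Write an ffmpeg concat-demuxer file listing one clip per scene."""
    lines = [f"file '{name}.mp4'" for name in scene_names]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines

def ffmpeg_command(list_path="scenes.txt", output="final.mp4"):
    """Build the concat command; -c copy avoids re-encoding when all
    clips share the same codec and resolution, as they do here."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]

lines = write_concat_list(SCENES)
print(" ".join(ffmpeg_command()))
```

Because each clip is generated with diegetic sound only, a single music track can then be laid over the concatenated result in the edit.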
Full Prompts
Prompt 1 — First generation (single prompt, 12s)
Old-school 1990s USA hip-hop music video, gritty street-cypher energy. Camera & look: 8mm fisheye lens, extreme wide with strong edge distortion, low-angle handheld camcorder feel, slight VHS grain + scanlines, occasional micro-jitters, high-contrast filmic grade (warm sodium-vapor streetlights, deep shadows, subtle teal highlights). Location: nighttime in an urban USA basketball court / alley corner with a chain-link fence and a big graffiti mural (generic shapes/letters, no real tags), wet asphalt reflecting light, light haze in the air. Characters: 4 adult men with a classic "gangster / gangsta-rap" vibe—tough, confident swagger—baggy jeans, bomber jackets, bandanas, gold chains, sunglasses; no visible brand logos or readable text. Action (one continuous take, choreographed): 0–4s: wide fisheye, camera close to the dancers' feet then tilts up; they bounce in time and form a tight circle. 4–9s: the lead dancer steps forward and does sharp popping + a two-step, others hype him up with synchronized head nods and shoulder hits. 9–12s: quick floorwork spin into a clean freeze pose as the camera pushes in close on the lead dancer's face (fisheye exaggeration), ending on a confident stare to lens. Sound (synced audio): instrumental 90s boom-bap beat ~92 BPM, punchy kick + snare, vinyl crackle, short DJ scratch fill near the end, light crowd "hey!" shouts (clean, no profanity). Constraints: no weapons, no violence, no drugs, no police, no explicit gang symbols, no real-person likeness, no readable signage/brands.
Prompt 2 — Second generation (structured montage, single clip)
Old-school 1990s USA hip-hop music video montage, filmed on an 8mm fisheye lens. Extreme wide-angle distortion, low-angle handheld camcorder energy, subtle VHS grain + scanlines, punchy contrast, warm streetlight glow, slight motion blur on fast moves. Same neighborhood setting throughout: a city block with a lit outdoor basketball court, chain-link fence, and a permitted community mural wall nearby; light haze in the air; wet pavement reflections. Cast & vibe: a tight crew of adult men with classic 90s street swagger (baggy jeans, bomber jackets, bandanas, gold chains, sunglasses). Confident "gangsta-rap" attitude without explicit gang symbols. No brand logos or readable text. Energetic crowd extras in the background. Editing style: 5 distinct scenes in one clip, fast music-video cuts, whip-pans and match-cuts between actions, fisheye always on. Each scene ~2 seconds. SCENE 1 (0–2s) — Lowriders: Night street curb. Two classic lowriders roll slowly, headlights flaring into the fisheye. One car does a gentle hydraulic bounce as it passes camera; chrome gleams; bass vibrations implied by subtle camera shake. SCENE 2 (2–4.5s) — Basketball: Smash cut to the outdoor court. Close-to-the-ground fisheye tracking a fast dribble, then a crossover. Cut to a wide fisheye of a clean layup off the glass; teammates react and slap hands. SCENE 3 (4.5–7s) — Graffiti mural: Whip-pan to the legal mural jam wall. Two artists in gloves and masks paint bold colorful shapes (no real tags, no readable words). Tight fisheye on spray can mist, then a quick reveal of the mural-in-progress under bright work lights. SCENE 4 (7–10s) — Dance cypher: Match cut on a shoulder turn into a circle of dancers. The lead dancer pops and locks with sharp hits, others hype him up with synchronized steps and head nods. Fisheye pushes in close, exaggerating facial expression and swagger. 
SCENE 5 (10–12s) — Group hero moment: Quick cut to a wide fisheye: lowriders in the background cruising past, basketball court lights glowing, mural wall visible, the crew dancing in the foreground. End on a confident freeze pose and direct look into the lens. Audio (synced): instrumental 90s boom-bap ~92 BPM, punchy kick/snare, vinyl crackle, short DJ scratch fill near the end, light crowd "hey!" shouts (clean, no profanity). Constraints: no weapons, no violence, no drugs, no police, no explicit gang insignia, no real-person likeness, no readable signage/brands.
Prompt 3 — Final generation (5 independent clips)
Old-school 1990s USA hip-hop music video. 8mm camcorder + fisheye lens with strong edge distortion, VHS grain and scanlines, handheld low-angle swagger, warm sodium streetlights, wet asphalt reflections, light haze. Location continuity: same neighborhood block all night: curb beside an outdoor basketball court and a nearby community mural wall (no readable text). Characters: same crew of adult men in the area with a classic "gangsta-rap" look (baggy jeans, bomber jackets, bandanas, gold chains, sunglasses), tough confident swagger without explicit gang symbols; no brand logos; no readable signage. Action (0.0–4.0s): 0.0–1.5: Two classic lowriders cruise slowly past camera; headlights bloom into lens. 1.5–3.0: One car does a smooth hydraulic bounce (gentle, controlled), chrome gleaming. 3.0–4.0: Camera dips super low near the spinning whitewall tire as it rolls by; quick whip-pan ending on the crew watching and nodding. Sound: diegetic only—engine rumble, tire hiss on wet pavement, distant crowd murmur; no music (I'll add a single track in the edit). Constraints: no weapons, no violence, no drugs, no police.
Same 90s USA hip-hop video aesthetic: fisheye, VHS/8mm texture, handheld low-angle energy, warm court lights, chain-link fence bokeh, wet ground reflections. Location & continuity: same court on the same block; same crew visible on sidelines. Action (0.0–4.0s): 0.0–1.2: Extreme low fisheye tracking a fast dribble approaching camera. 1.2–2.6: Sharp crossover (ball snaps left-right), defender's shoes skid slightly. 2.6–4.0: Clean layup off the glass; quick reaction—teammates clap and slap hands. Sound: ball bounce, sneaker squeaks, rim/backboard hit, crowd "hey!" (clean). No music. Constraints: no fighting, no injuries, no trash talk text.
Same old-school 90s USA hip-hop music video look: fisheye, VHS scanlines, gritty handheld, warm work-lights, aerosol mist visible in the haze. Setting: a clearly permitted community mural wall (authorized painting, not vandalism). The mural is bold abstract shapes and colors—no words, no tags, no readable text. Action (0.0–4.0s): 0.0–1.0: Tight fisheye close-up of a spray can; paint hiss and mist plume. 1.0–2.4: Whip-pan reveal: two artists paint geometric color blocks, clean lines, layered gradients. 2.4–4.0: They step back to admire; one quick fist-bump; camera pushes in on fresh paint texture. Sound: spray hiss, light chatter, distant car pass-by. No music. Constraints: no illegal trespassing vibe, no real-world logos or recognizable tags.
Same 90s USA hip-hop video aesthetic: 8mm fisheye, heavy lens distortion, handheld bounce, warm streetlights, VHS grain/scanlines, high-contrast grade. Characters: same adult crew—tough, confident swagger, "gangsta-rap" styling without explicit gang symbols. Action (0.0–4.0s): 0.0–1.0: Camera starts at feet—syncopated bounce steps in a tight circle. 1.0–3.2: Lead dancer steps forward: sharp popping + two-step + shoulder hits; others hype with synchronized nods and small steps. 3.2–4.0: Quick half-spin into a clean freeze; direct confident stare into lens (fisheye exaggeration). Sound: stomps, clothing rustle, crowd "hey!" (clean). No music. Constraints: no violence, no intimidation toward camera.
Final montage capstone shot—same old-school 90s USA hip-hop look: fisheye, VHS/8mm texture, handheld but readable framing, warm sodium lights, haze, wet reflections. Wide composition (single shot): In one fisheye frame you can see the whole block: Left: the lit basketball court with players still moving. Right: the legal mural wall with painters finishing details. Background: a lowrider cruises through slowly (controlled, safe). Foreground: the main crew dancing and posing. Action (0.0–4.0s): 0.0–2.0: Slow handheld push-in from wide; basketball play and mural painting both visible. 2.0–3.2: Lowrider glides behind; crew hits a synchronized step. 3.2–4.0: Everyone lands a final confident pose; hold for a "poster frame" ending. Sound: ambient neighborhood bed only. No music. Constraints: no weapons, no violence, no drugs, no police, no readable signage/brands.
Results
The results across all three generations remain identifiable as AI-produced. The primary failure point is human figure generation: body proportions shift between frames, movement patterns are irregular and lack the rhythmic specificity of trained dancers, and faces show the characteristic instability of current diffusion-based models under motion. The aesthetic references (VHS grain, fisheye distortion, warm color grading) are reproduced with reasonable fidelity, but they cannot compensate for the fundamental weakness in human motion synthesis.
The progression from Generation 1 to Generation 3 illustrates a key tradeoff in AI video production: precision scales with prompt granularity, but coherence degrades when generation is fragmented. A single character looks different across the five independently generated clips; the lighting shifts; the location drifts. These inconsistencies — invisible in a single clip — become disruptive at the edit stage.
Experiment II — Dreaming in Video
Dream transcription · OpenAI Whisper / gpt-4o-transcribe · Personal material · Pipeline
The second experiment emerged from a different kind of discovery: revisiting voice memos recorded immediately upon waking — a habit kept sporadically as a way of capturing dreams before they dissolve. The question became: could this raw, semi-conscious spoken material become the source data for AI-generated video?
This experiment is also a direct reaction to the limitations of Experiment I, where text-to-video generation worked from polished, deliberately structured descriptive prompts. The question naturally arose: what happens when the source material is instead a disorganized, first-person, spoken account of something inherently non-visual?
The Production Pipeline
The pipeline developed for this experiment has four stages:
- Voice recording of the dream immediately upon waking
- Automatic transcription of the recording with gpt-4o-transcribe
- Translation of the transcript into structured cinematic prompts
- Video generation with Sora 2
The critical step is the third: converting a spoken dream narrative into a visual language that Sora 2 can work with. A raw transcription cannot be fed directly into a video generation model — it is a first-person account structured around memory and narrative, not visual space and motion. The translation protocol, developed iteratively, follows a fixed schema: subject, setting, action, transformation, mood, visual style, camera movement, lighting, color palette, and audio.
For each dream, two levels of prompt were produced — a simple version (prioritizing clarity and stability) and a cinematic version (adding texture, camera language, and emotional detail). Where a dream contained multiple distinct moments, a three-shot storyboard structure was also produced.
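The fixed schema can be sketched as a small Python helper. The field names follow the schema above; the example values are illustrative placeholders, not the prompts actually used in the experiment.

```python
# Sketch of the dream-to-prompt translation schema. Field names mirror the
# fixed schema described in the text; example values are illustrative only.
SCHEMA_FIELDS = ["subject", "setting", "action", "transformation", "mood",
                 "visual_style", "camera_movement", "lighting",
                 "color_palette", "audio"]

def render_prompt(fields: dict, cinematic: bool = False) -> str:
    """Assemble a Sora 2 prompt from the fixed schema.

    The simple version keeps only the core scene fields; the cinematic
    version appends camera language, lighting, palette, and audio detail.
    """
    core = ["subject", "setting", "action", "transformation", "mood"]
    keys = SCHEMA_FIELDS if cinematic else core
    parts = [fields[k] for k in keys if fields.get(k)]  # skip empty fields
    return " ".join(parts)

dream = {
    "subject": "A group of college friends and a famous rapper.",
    "setting": "A warm, messy student apartment on a weekend afternoon.",
    "action": "One friend plays French rap on a speaker; everyone laughs.",
    "transformation": "",
    "mood": "Funny, surreal but natural.",
    "visual_style": "Cinematic realism with subtle dreamlike softness.",
    "camera_movement": "Slow handheld drift between group and reaction shots.",
    "lighting": "Warm late-afternoon sun mixed with soft lamps.",
    "color_palette": "Amber, beige, faded denim, muted black.",
    "audio": "Overlapping chatter, laughter, speaker music, room tone.",
}
print(render_prompt(dream))                   # simple version
print(render_prompt(dream, cinematic=True))   # cinematic version
```

The same dictionary thus yields both prompt levels, which keeps the simple and cinematic versions of a dream consistent with each other.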
The transcription script (included in full in the Appendix) uses the gpt-4o-transcribe model with a custom prompt instructing it to preserve strange imagery, proper nouns, natural punctuation, and meaningful hesitations — all of which carry interpretive weight in dream narration.
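A condensed sketch of that script follows (the complete version is in the Appendix). It assumes the official openai Python SDK with an API key in the environment; the prompt text paraphrases the instructions described above rather than reproducing the original verbatim.

```python
# Condensed sketch of the transcription stage (full script in the Appendix).
# Assumes the openai Python SDK and OPENAI_API_KEY in the environment.
# The prompt paraphrases the instructions described in the text.
TRANSCRIBE_PROMPT = (
    "Transcribe this voice memo of a dream exactly as spoken. "
    "Preserve strange imagery, proper nouns, natural punctuation, "
    "and meaningful hesitations."
)

def transcribe_dream(audio_path: str) -> str:
    """Send one waking voice memo to gpt-4o-transcribe, return plain text."""
    from openai import OpenAI  # lazy import so the sketch loads without the SDK
    client = OpenAI()
    with open(audio_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
            prompt=TRANSCRIBE_PROMPT,
            response_format="text",
        )
```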
The Three Dreams
Dream 1 — Weekend with Drake
Quick little dream: I'm with the guys from uni, Max, Nathan, Yass and all. And on top of that, I don't know why, Drake is there. But like, Drake the legend, just hanging out with us, all chill. And like he's spending the weekend with us, like it was planned. So we do a bunch of stuff and all, and at one point I play him some music, like I play him some Naps and all. And he loves it and all, it's so, so, so funny.
A cinematic dream scene of a group of college friends hanging out together during a relaxed weekend in a student apartment. Drake is casually with them, smiling and acting like part of the friend group. Everyone is joking, talking, and chilling on couches in a warm, messy apartment. One friend plays French rap music for Drake on a speaker, and Drake laughs and genuinely enjoys it. Funny, surreal but natural mood. Handheld camera, warm indoor lighting, relaxed pacing, intimate group shots, soft background chatter and laughter.
A dreamlike but realistic cinematic scene inside a cozy student apartment during a weekend with friends. The video opens with a clear medium close-up of Drake, instantly recognizable, sitting casually in the middle of a small group of college-aged friends in a warm, slightly messy living room. Drake has a neatly groomed beard, distinctive hairstyle, subtle luxury streetwear, chains, and a calm confident smile. He is clearly the visual focus of the scene. Around him, a few friends joke, lounge on couches, and talk in a playful, familiar atmosphere. Bags, snacks, jackets, and half-finished drinks are scattered around the apartment. The key moment is when one friend excitedly plays French rap music for Drake on a Bluetooth speaker. Drake listens, starts nodding along, laughs, and looks genuinely impressed while the whole group reacts with laughter. Make Drake visually prominent and consistently identifiable throughout the clip. Keep him near the center of the frame in the most important shots. Visual style: cinematic realism with subtle dreamlike softness. Camera: begin with a medium close-up reveal of Drake, then slow handheld movement drifting through the apartment, alternating between wide group shots and close reaction shots, always returning to Drake's face. Pacing: relaxed and natural, like a lived-in weekend memory. Lighting: warm late-afternoon sunlight mixed with soft apartment lamps. Color palette: warm amber, beige, faded denim, soft brown, muted black. Audio: casual overlapping conversation, laughter, soft room tone, Bluetooth speaker music, natural indoor reverb. Keep the scene coherent, intimate, playful, and visually authentic.
Sora 2's safety guardrails prevented the generation of Drake's actual face and likeness. The model systematically avoided rendering a recognizable version of the artist, producing instead a generic figure. This directly undermined the central comedic and surreal logic of the dream — the absurdity depended entirely on his specific presence. It is a clear demonstration of where platform-level content policy intersects with creative practice, and raises broader questions about the scope of "likeness protection" in generative AI systems.
Dream 2 — Trip to the United States
I'm in the United States with Tom, Iliès, Maxime, Léo and Arthur. And like, actually, we're in really weird places. We see really weird things every time, all the time while we're walking and all. And so now, we've just been mugged. Tom, he took a knife wound to the face. And so they demanded our wallets, our shoes. That's it.
A surreal cinematic scene of a group of young friends walking through strange urban streets in the United States at night. The city feels distorted and unsettling, with bizarre signs, unusual buildings, and eerie details everywhere they look. The group suddenly stops after being robbed in the street, shocked and disoriented, while one friend has a visible injury on his face. The attackers take their wallets and shoes. Tense, dreamlike, unsettling mood. Handheld camera, cold neon lighting, urban night ambience, nervous pacing.
A dreamlike but realistic cinematic scene set in a strange American city at night. A group of young male friends walks together through unsettling urban streets filled with bizarre visual details: distorted storefronts, flickering neon signs, empty intersections, unusual buildings, and uncanny background figures. Everything feels slightly wrong, as if the city is shifting into a surreal nightmare. The friends keep walking and looking around, confused by the strange things they see. The scene then cuts to the immediate aftermath of a violent street robbery. The group stands frozen in shock under cold streetlights. One friend, Tom, has a visible facial injury. The attackers have just demanded their wallets and their shoes. The friends are tense, silent, stunned, and disoriented. Visual style: cinematic realism with surreal urban distortion. Camera: nervous handheld camera, moving with the group while walking, quick glances toward strange visual details, then tighter close-ups on faces after the attack. Pacing: slow tension building, then abrupt shock and suspended silence. Lighting: cold neon signs, harsh streetlights, deep shadows, dirty urban reflections. Color palette: blue-gray, sodium yellow, muted concrete, dark red, neon green accents. Audio: distant traffic, footsteps on pavement, tense breathing, scattered voices, city hum, then a stunned silence after the robbery.
Dream 3 — Reinvented Winter Olympics
I had a dream: the Olympics are in France, so I'm there, but on the organizing side I think. And like, basically, first thing, there are cars, sort of Formula 1 cars, but basically they have a claw at the front that lets them do cool stuff, like flip over, all kinds of things and all. Then it was skateboarding, but a slightly different kind of skateboarding too, and then it was skiing, but a different kind of skiing too. Basically I reinvented all the winter sports in my head... and why there was a car in the winter sports, I don't know... but it was good. And the skiing is the one I watched the longest, I think, and like at one point there's a slope and it goes down, down, down, and if you don't take the descent at full speed, well, the climb back up is impossible. And there were people who were about to fall, and like the Olympic champion... I don't know why there were people on the course when it was the Olympics. And so we had to put the skis on our heads, but they're not really skis actually, like it's so weird, it looked like a chair, but then they were skis... really weird.
A surreal cinematic scene set during a reinvented Winter Olympics in France. Futuristic snow sports are being performed in a large Olympic venue: Formula 1-style cars with mechanical claws at the front flip and perform impossible tricks on ice, while strange redesigned skate and ski events unfold. The main focus is a dramatic ski slope that drops steeply downward, where athletes must take the descent at full speed or they cannot make the climb back up. The competitors wear bizarre ski equipment that first looks like chairs and then functions like skis. Spectacular, absurd, futuristic mood. Wide stadium shots, aerial views, fast motion, cold mountain light, crowd noise, Olympic atmosphere.
A dreamlike but realistic cinematic scene set during a strange reimagined Winter Olympics in France. The Olympic venue is built in a snowy mountain landscape with futuristic architecture, huge screens, stadium lights, packed crowds, and dramatic competition zones. Each event looks like a surreal reinvention of winter sports: Formula 1-style cars race across snow and ice, each one equipped with a mechanical claw on the front that lets the vehicle flip, pivot, and perform impossible acrobatic maneuvers. Nearby, an altered version of skateboarding takes place on icy terrain, followed by bizarre skiing events unlike any real Olympic sport. The central event is a massive ski slope with an extreme downward plunge. The slope drops so far and so sharply that athletes must commit to the descent at full speed, otherwise the climb back up becomes impossible. The track feels dangerous and spectacular, with people strangely standing too close to the course as competitors rush past. At the most surreal moment, the athletes place strange ski equipment on their heads: objects that initially resemble chairs, but then somehow function as skis within the logic of the event. Visual style: cinematic realism mixed with absurd futuristic sports design. Camera: sweeping aerial shots of the Olympic complex, low tracking shots beside the clawed racing cars, fast dynamic cuts between events, then dramatic long-lens shots and slow motion on the impossible ski slope. Pacing: escalating spectacle, energetic and unpredictable, with moments of suspended tension on the slope. Lighting: bright alpine daylight, reflective snow glare, cool shadows, stadium spotlights and LED accents. Color palette: bright snow white, icy blue, metallic silver, racing red, matte black, neon display lights. Audio: roaring crowd, stadium announcements, mechanical racing sounds, wind over the slopes, scraping on snow, sudden gasps from spectators, global Olympic event atmosphere.
Results and Limitations
The results of Experiment II are disappointing relative to the ambition of the methodology. While the translation pipeline functions well at the conceptual level — converting first-person dream narration into structured cinematic prompts — the generated videos remain unconvincing. Human figure synthesis continues to be the primary failure point: group dynamics, facial expressivity, and natural physical interaction are poorly reproduced by the model.
More fundamentally, this experiment surfaced a question about the gap between narrative logic and visual logic. Dreams are organized around meaning, feeling, and association — not space and motion. Even after systematic translation, some of their most important qualities (the mundane impossibility of Drake's presence, the escalating wrongness of a surreal American city) proved difficult to encode in a form that Sora 2 could render convincingly.
Experiment III — The Unrealistic as Reference
Image-to-video · Pinterest reference · Minimal animation · Surrealism
The third experiment was born directly from dissatisfaction with the previous two. The video outputs from Experiments I and II were unconvincing primarily because they attempted to generate realistic-looking humans performing realistic actions — a domain where current AI video models still produce visible artifacts. A new question emerged: what if the source material was already surreal?
If the reference image already contains something impossible — a giant fish floating through a city, woodlice crawling across a surreal moss landscape, a highway interchange melting into impossible geometry — then the model's output can be evaluated on different terms. The standard of realism shifts. Artifacts become part of the visual logic. The question is no longer "does this look like real life?" but "does this look like this specific image, moving?"
Methodology
Reference images were sourced from Pinterest — selected for their strong visual identity, clear composition, and already-strange atmosphere. Before writing any prompt, each image was analyzed along five axes:
- What is the primary subject?
- What is the dominant atmosphere?
- What is the ideal minimal movement?
- What must absolutely be preserved?
- What must be explicitly forbidden?
Prompts were built around four operational principles: preserve the frame (fixed or near-fixed camera), minimize motion (animate as few elements as possible), maintain internal logic (impossible scenarios with normal behavior), and use negative constraints (explicit prohibition of added elements, fast motion, camera cuts, deformation, and chaotic behavior). When initial results were insufficient, iteration targeted a single parameter at a time — more glitch, slower motion, stricter camera lock — rather than rewriting the whole prompt.
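The five-axis analysis and the negative-constraint principle can be sketched as a small prompt builder. The field names and the default ban list mirror the text, but this is a reconstruction for illustration, not the actual tooling used.

```python
# Sketch of Experiment III's prompt construction: a five-axis image
# analysis becomes a prompt with explicit negative constraints.
# Field names and the default ban list mirror the text; the example
# values paraphrase the Flying Fish analysis.
DEFAULT_FORBIDDEN = ["no scene cuts", "no fast camera motion",
                     "no deformation", "no new elements", "no text"]

def build_image_prompt(analysis: dict) -> str:
    """Assemble an image-to-video prompt from the five analysis axes."""
    parts = [
        "Use the uploaded image as the main visual reference and preserve "
        "the original composition.",
        f"Subject: {analysis['subject']}.",
        f"Atmosphere: {analysis['atmosphere']}.",
        f"Motion (minimal): {analysis['movement']}.",
        f"Preserve: {analysis['preserve']}.",
        "Constraints: "
        + ", ".join(analysis.get("forbidden", DEFAULT_FORBIDDEN)) + ".",
    ]
    return " ".join(parts)

fish = {
    "subject": "a giant silver fish floating through a city plaza at night",
    "atmosphere": "surreal urban calm, pedestrians behaving normally",
    "movement": "slow fin and tail motion, slight body rotation",
    "preserve": "framing, night lighting, and the casual crowd",
}
print(build_image_prompt(fish))
```

Iterating on a single axis at a time then amounts to changing one dictionary value and regenerating, which keeps the rest of the prompt stable across attempts.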
Works
1 — Woodlice Garden (PS1 Aesthetic)
Use the uploaded image as the main visual reference and preserve the original composition, soft foggy atmosphere, strange scale, and dreamy uncanny mood. A surreal macro landscape covered in dense green moss under a pale blue cloudy sky. Giant white woodlice slowly crawl across the mossy ground among scattered cucumber slices and orange vegetable discs, with a few tall flowers rising in the background. The world feels miniature and enormous at the same time, like a strange forgotten garden dream. Camera: slow cinematic forward creep at ground level, very subtle dolly-in, low angle, shallow parallax, no fast motion, no cuts. Motion: - the woodlice move very slowly and naturally, crawling with gentle leg motion - a few antennae twitch delicately - the moss softly ripples in the breeze - the flowers sway slightly in the background - light fog drifts slowly through the scene - the clouds move almost imperceptibly - tiny glistening moisture on the moss catches the light Style: eerie, dreamy, organic, soft-focus surrealism, retro low-resolution texture, slightly VHS-like, uncanny but calm, poetic, strange naturalism, muted highlights, humid atmosphere, highly detailed moss texture, gentle bloom. Lighting: diffused daylight, soft misty glow, overcast brightness, humid garden light. Mood: mysterious, liminal, peaceful but unsettling, alien garden, surreal nature documentary. Keep the original image identity and composition. Preserve the strange atmosphere. No aggressive movement, no horror transformation, no new creatures, no fast camera motion, no scene cuts, no deformation, no extra limbs, no melting, no text.
2 — The Flying Fish
Use the uploaded image as the main visual reference and preserve the original composition, lighting, urban night atmosphere, and surreal dreamlike mood. A giant silver fish floats through a modern city plaza at night, suspended in the air between illuminated office buildings and a glass-covered station. The fish moves slowly and gracefully as if swimming through invisible water, with gentle fin motion, subtle tail movement, and slight body rotation. Below, people continue living normally: pedestrians walk calmly, a few wait at the bus stop, others cross the plaza, all behaving naturally and casually, as if the giant floating fish is an ordinary part of city life. Camera: slow cinematic handheld-style drift or very gentle dolly shot, maintaining the framing and realism of the original image. Slight natural motion, no fast camera movement, no cuts. Motion: - the fish swims slowly through the air with soft, realistic fin and tail movement - its body subtly shimmers in the city lights - the fish slightly changes direction as if gliding through an unseen current - pedestrians walk normally, some talking, some waiting, some passing by without reacting dramatically - distant city lights flicker softly - subtle movement in reflections on the glass buildings - faint night haze and glow around streetlights Style: surreal urban realism, dreamy, strange but believable, soft-focus night photography, slightly grainy, atmospheric, cinematic, calm, uncanny, poetic. Lighting: strong nighttime city lights, glowing windows, soft halos around lamps, reflective highlights on the fish, deep black sky. Mood: mysterious, calm, surreal, everyday life mixed with impossible imagery, quiet wonder, dreamlike city atmosphere. Keep the original image identity and atmosphere. The people should act naturally and normally, not panic, not stare excessively, not run. No chaos, no horror, no aggressive motion, no scene cuts, no deformation, no extra animals, no text changes.
3 — Unicorn and Jellyfish (Vertical Diptych)
Use the uploaded image as the main visual reference and preserve the original dreamy, snowy, flash-lit, uncanny atmosphere. This is a single composition made of two clearly separated stacked worlds, like a vertical diptych. The top half and the bottom half must remain two distinct universes with a clean visual separation. Do not blend them together, do not morph one into the other, do not create transitions between them.
TOP HALF: A surreal winter forest at night covered in deep snow. Soft glowing pink jellyfish float gently through the cold air between the trees, as if they belong naturally to this frozen woodland. Snow falls slowly. The jellyfish drift lightly and gracefully, with delicate tentacles moving softly. Their movement is subtle, calm, and weightless. The forest remains still, eerie, silent, and magical.
BOTTOM HALF: A white winged unicorn stands in a snowy open landscape near a frozen shoreline under a faint rainbow-like halo in the sky. The unicorn lowers its head and gently grazes on sparse winter grass emerging through the snow. Its movement is calm and natural: small head motions, subtle breathing, slight body shifts, soft tail movement, minimal wing adjustment. The scene remains quiet, sacred, and dreamlike.
Camera: Keep the original split composition. Static or almost static camera. Very subtle cinematic motion only. No cuts, no reframing that destroys the two-world layout.
Motion:
- top half: jellyfish float slowly in the air, tentacles trailing softly, snow drifting gently, faint movement in the trees
- bottom half: unicorn calmly grazes on visible grass peeking through the snow, with natural small head and neck movements, subtle breathing, slight tail movement
- both worlds remain slow, serene, and independent from each other
Style: surreal winter dream, soft flash photography, liminal, ethereal, mysterious, gentle bloom, slightly grainy, nocturnal, magical realism, quiet and uncanny.
Lighting: cold snowy night light with soft glowing highlights, flash-lit atmosphere, gentle luminous haze, pale reflections on snow.
Mood: strange, peaceful, sacred, dreamlike, isolated, beautiful, eerie.
Important: The top and bottom are two distinct worlds in the same frame. No blending between scenes. No transition from jellyfish world to unicorn world. No new creatures. No aggressive movement. No horror. No scene cuts. No deformation. Preserve the original composition and atmosphere.
4 — Snowy Computers
Use the uploaded image as the main visual reference and preserve the original composition, foggy winter atmosphere, and eerie abandoned-computer aesthetic. A static wide shot of an outdoor snowy courtyard beside a pale building, filled with old CRT computers placed on desks and towers directly in the snow. In the background, a glass greenhouse structure is barely visible through heavy fog. The scene feels empty, cold, silent, and liminal.
Camera: completely static locked-off camera, no pan, no tilt, no zoom, no dolly, no cuts.
Action: the computers power on one by one, slowly and irregularly, across the snowy courtyard. Each screen comes to life with a soft CRT glow, flicker, scanlines, and boot-up light. After turning on, the monitors begin showing computer errors, warning windows, system failures, glitchy text, frozen blue screens, terminal errors, corrupted startup messages, and blinking alert boxes. The errors should feel old, uncanny, and realistic, like obsolete machines malfunctioning in the cold.
Timing: the power-on sequence should happen gradually, one by one, spreading through the frame over time, not all at once.
Motion:
- static environment
- faint screen flicker on each monitor
- subtle CRT glow reflecting in the fog
- very light snowfall or drifting mist
- occasional tiny glitch flickers on some screens
- no movement of the desks or computers
- no people present
Style: liminal, surreal realism, abandoned technology, cold winter atmosphere, analog horror without becoming too aggressive, quiet and haunting, pale colors, soft fog, old CRT textures, subtle grain.
Lighting: soft overcast winter daylight, dense mist, muted whites and grays, dim phosphorescent screen glow, cold blue-gray atmosphere.
Mood: strange, empty, unsettling, quiet, melancholic, mysterious.
Important: keep the exact framing and static camera. Preserve the original image identity. No characters. No fast motion. No camera shake. No scene cuts. No dramatic horror creatures. No explosions. No deformation of the computers. Only the computers turning on one by one and displaying errors.
5 — Warped Roads
Use the uploaded image as the main visual reference and preserve the original aerial perspective, highway interchange layout, and extreme road deformation. A surreal overhead view of a multilane highway interchange in daylight. The roads are dramatically warped, bent, melted, and twisted into impossible flowing shapes, but traffic continues moving across them as if this distorted infrastructure were completely normal. Cars drive forward steadily along the deformed lanes, following the curves and warped surfaces naturally and smoothly.
Camera: high aerial view, fixed or nearly fixed camera, very subtle drift only, no fast movement, no cuts.
Motion:
- cars move continuously along the distorted roads
- vehicles follow the twisted lanes smoothly and believably
- some cars enter and exit the frame naturally
- traffic remains calm and ordinary, with no crashes or panic
- the road itself stays visibly deformed and impossible-looking
- slight shimmer or heat-haze can enhance the surreal atmosphere
Style: surreal urban realism, uncanny infrastructure, impossible highway geometry, dreamlike but believable, daytime satellite-view feeling, clean daylight, photorealistic motion with bizarre architecture.
Lighting: natural daylight, clear visibility, realistic road shadows, bright outdoor atmosphere.
Mood: strange, hypnotic, unsettling, calm, surreal, like a real city with broken physics.
Important: preserve the original composition and aerial framing. Keep the road visibly warped and distorted. The cars must continue driving normally on the deformed road. No accidents. No panic. No collapsing bridge. No explosions. No extra objects. No scene cuts. No text.
6 — Glitch Clouds
Use the uploaded image as the main visual reference and preserve the original composition: dramatic storm clouds, bright sky openings, dark cloud masses, and tree silhouettes at the bottom of the frame. Create a static locked-off shot with no camera movement. The clouds move clearly from right to left across the sky.
Add a much stronger and more visible glitch effect than usual. The sky and clouds should be heavily corrupted by horizontal digital distortion: long scanline smears, repeated horizontal tearing, strong image displacement, duplicated cloud fragments, sideways stretching, broken cloud edges, layered digital bands, and aggressive flickering glitches. The distortion should feel intense, frequent, and visually dominant, as if the sky itself is malfunctioning. The glitch must stay integrated into the clouds and bright sky areas, not as a full-screen overlay. The result should still feel like a real storm sky being digitally broken apart in real time.
Style: surreal atmospheric photography, intense glitch art, stormy, uncanny, cinematic, dreamlike, corrupted weather recording.
Lighting: natural daylight through storm clouds, bright glowing whites, deep grays, subtle blue tones, strong contrast.
Mood: eerie, unstable, hypnotic, ominous, beautiful, surreal.
Important: completely static camera. Clouds move from right to left. Significantly increase the amount of glitch. Make the glitch strong, frequent, and obvious. Preserve the original framing and sky composition. No cuts. No camera shake. No new objects. No text. No people. Keep the trees mostly still.
7 — Fish in Interior Space
The results of Experiment III mark a significant qualitative leap from the previous two experiments. The surrealist reference images already contain an internal visual logic that Sora 2 can maintain and animate, rather than construct from scratch. Movement is convincing precisely because it follows a limited, clearly defined script: the fish swims, the computers boot up, the cars drive. The impossible context is pre-established — the model does not need to invent it.
Cross-Experiment Analysis
The Human Figure Problem
The most consistent limitation across all three experiments is the synthesis of human figures in motion. Whether generating dancers in a hip-hop cypher, college friends laughing in an apartment, or Olympic athletes, Sora 2 produces characters whose movement patterns, proportions, and facial expressions remain visually unstable. Limbs shift between frames; crowds homogenize into generic masses; facial expressions lack specificity. This is not a prompting failure — it is a fundamental constraint of the current generation of diffusion-based video models when dealing with complex articulated motion in social or interactive contexts.
The Realism Paradox
Experiment III produces the most perceptually convincing results — and its source material is the most deliberately unreal. This apparent contradiction resolves when we consider what "convincing" means in each context. In Experiments I and II, the target was documentary realism: the viewer compares the output against their experience of actual people, actual cities, actual sports events. Any deviation is a failure. In Experiment III, the target is internal consistency: the viewer evaluates whether the animation is faithful to the logic of the reference image. When a giant fish glides through a city plaza with the unhurried motion of something swimming through water, the scene is convincing — not because it looks like reality, but because it behaves according to its own rules.
This suggests a practical principle: AI video generation currently produces more credible output when it is not constrained by real-world physical plausibility. Surrealist, abstract, or fantastical content sidesteps the model's weaknesses. It also raises a more unsettling question, taken up in the section on societal implications below.
Text-Only vs. Image-Guided Generation
| Dimension | Text-only (Exp. I & II) | Image-guided (Exp. III) |
|---|---|---|
| Composition control | Low — model interprets freely | High — reference locks framing |
| Atmosphere fidelity | Approximate | Strong — inherits image mood |
| Human figure quality | Poor | Avoided (no humans in Exp. III) |
| Perceptual credibility | Low (AI artifacts visible) | High (especially non-human subjects) |
| Creative iteration | Full rewrite required | Single-parameter adjustment |
| Narrative capacity | Medium (sequence possible) | Low (single moment extended) |
Sora 2 and the Competitive Landscape
These experiments were conducted with direct API access to Sora 2. Despite this advantage, an honest assessment is that Sora 2 currently lags behind competing models — notably Kling, Runway Gen-3, and Pika — particularly in human motion quality and temporal consistency. The API interface provided by the Creative Lab improved output over the consumer product, but did not close this gap. This context matters for interpreting the limitations documented here: some are model-specific and may not generalize to other systems.
Societal Implications — Disinformation and the Credibility Threshold
The most significant finding of this research is not aesthetic but political. Experiment III demonstrates that it is now possible to generate video footage of physically impossible events that appears visually credible — not because it is indistinguishable from documentary footage, but because the viewer's perceptual system has no established reference against which to evaluate it.
A warped highway with normal traffic, a giant fish moving through a city, computers powering on in the snow: none of these could be fabricated by traditional CGI on a student budget. They now can be. The next generation of these tools will generate plausible human faces and bodies within such scenes. At that point, the telltale markers of AI-generated content — unstable faces, glitching hands, incoherent scene transitions — will no longer be reliable.
The implications for military propaganda, political influence operations, and social media manipulation are direct. Sora 2's content moderation system blocked the generation of Drake's likeness — a meaningful constraint in the celebrity protection context. But the same guardrails are not uniformly applicable to political figures, anonymous crowds, or events without a registered "protected person." The tools available to a research student in a university lab in 2025 could, in the hands of a state actor or well-funded organization, produce material designed to shift public opinion at scale.
The credibility of AI-generated video is currently highest precisely where the content is most impossible. As these tools improve their human figure synthesis, the threshold between credible and fabricated will collapse — not gradually, but suddenly. The current "imperfect" phase is a window; the field has limited time to develop detection infrastructure before it closes.
Conclusion
These three experiments map a coherent trajectory in the creative use of Sora 2. Beginning with pure text-based prompting — the most direct and most limited approach — they move through a pipeline that takes personal, subjective material (dream recordings) as input, and arrive at an image-guided methodology that produces the most visually convincing results by deliberately departing from realism.
Several conclusions stand out. First, prompt structure matters more than prompt length: decomposing a scene into targeted, constrained instructions consistently outperforms comprehensive single prompts. Second, image references anchor generation in ways that text cannot: a strong reference image transmits atmospheric, compositional, and tonal information that would require thousands of words to approximate. Third, negative constraints are as important as positive descriptions: telling Sora 2 what not to do — no cuts, no new objects, no deformation — is often what separates a controlled output from a chaotic one.
The deeper conclusion is about limits and possibilities. Sora 2 is currently a better tool for animating the surreal than for reproducing the real. Its failure mode — unstable human figures — is precisely the failure that would matter most in a disinformation context. The current moment is therefore a transitional one: the tools are not yet dangerous in their most obvious application, but the trajectory is clear. Understanding these tools, their mechanics, their guardrails, and their blind spots is no longer optional for anyone working in media, communication, or the arts.
Appendix
A — Transcription Pipeline (main.py)
The following Python script was used to batch-transcribe voice memos into text using OpenAI's gpt-4o-transcribe model. It was developed with the assistance of ChatGPT as part of the Experiment II pipeline.
#!/usr/bin/env python3
import os
import sys
from pathlib import Path

from openai import OpenAI

SUPPORTED_EXTS = {
    ".m4a", ".mp3", ".wav", ".webm", ".mp4", ".mpeg", ".mpga", ".flac", ".ogg"
}
MAX_SIZE_MB = 25


def human_mb(num_bytes: int) -> float:
    return num_bytes / (1024 * 1024)


def main():
    if len(sys.argv) < 2:
        print("Usage: python3 transcribe_batch.py /path/to/audio_folder")
        sys.exit(1)

    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        print("Error: OPENAI_API_KEY environment variable not set.")
        sys.exit(1)

    input_dir = Path(sys.argv[1]).expanduser().resolve()
    output_dir = input_dir / "transcriptions"
    output_dir.mkdir(exist_ok=True)

    client = OpenAI(api_key=api_key)

    audio_files = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTS
    )

    for i, audio_path in enumerate(audio_files, start=1):
        size_mb = human_mb(audio_path.stat().st_size)
        out_path = output_dir / f"{audio_path.stem}.txt"

        if size_mb > MAX_SIZE_MB:
            continue  # skip files over the 25 MB API limit
        if out_path.exists():
            continue  # already transcribed

        with open(audio_path, "rb") as f:
            transcription = client.audio.transcriptions.create(
                model="gpt-4o-transcribe",
                file=f,
                response_format="text",
                prompt=(
                    "Faithful transcription in French of a voice memo describing a dream. "
                    "Preserve strange imagery, proper nouns, natural punctuation, "
                    "and hesitations meaningful to the content."
                ),
            )

        if isinstance(transcription, str):
            text = transcription.strip()
        else:
            text = getattr(transcription, "text", str(transcription)).strip()

        out_path.write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    main()
B — Dream-to-Prompt Translation Schema
The following schema was applied to each voice transcription to produce Sora-ready prompts:
Given a cleaned dream transcription, produce:
1. Visual summary (3 lines max): what is seen / what is strange / dominant emotion
2. Scene sheet: subject · setting · action · transformation · mood · visual style · camera · light · palette · audio
3. Simple Sora prompt: one strong scene, few elements, clear central action
4. Cinematic Sora prompt: detailed setting, emotional texture, camera movement, light, rhythm, sound
5. 3-shot storyboard (if the dream contains multiple distinct moments):
— Shot 1: setup
— Shot 2: shift / strange element
— Shot 3: peak moment or final emotion
Rules:
- Translate abstractions into visible equivalents ("deep nostalgia" → "empty childhood bedroom at sunset, dust in the light")
- One clip = one strong moment
- Always specify camera movement and audio
- Do not add psychological interpretation
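To make the schema reproducible, it can be wrapped in a reusable meta-prompt and paired with each cleaned transcription before being sent to a chat model. The sketch below is an illustrative helper, not code from the original pipeline: `build_schema_messages` and the condensed `SCHEMA` string are assumptions, and the commented-out API call shows one plausible way the resulting messages could be used.

```python
# Illustrative sketch: wrap the dream-to-prompt schema in a reusable
# meta-prompt. Names here (SCHEMA, build_schema_messages) are assumptions,
# not part of the documented Experiment II pipeline.

SCHEMA = """Given a cleaned dream transcription, produce:
1. Visual summary (3 lines max): what is seen / what is strange / dominant emotion
2. Scene sheet: subject, setting, action, transformation, mood, visual style, camera, light, palette, audio
3. Simple Sora prompt: one strong scene, few elements, clear central action
4. Cinematic Sora prompt: detailed setting, emotional texture, camera movement, light, rhythm, sound
5. 3-shot storyboard (if the dream contains multiple distinct moments)
Rules: translate abstractions into visible equivalents; one clip = one strong moment;
always specify camera movement and audio; do not add psychological interpretation."""


def build_schema_messages(transcription: str) -> list[dict]:
    """Pair the fixed schema (system role) with one transcription (user role)."""
    return [
        {"role": "system", "content": SCHEMA},
        {"role": "user", "content": transcription.strip()},
    ]


messages = build_schema_messages(
    "I was in my childhood kitchen, but the floor was open ocean."
)
# The messages could then be sent to a chat model, e.g. (assumed usage):
# client.chat.completions.create(model="gpt-4o", messages=messages)
```

Keeping the schema in the system role means each dream is processed under identical rules, so output quality differences reflect the transcription, not prompt drift.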
C — Prompt Architecture (Experiment III)
Use the uploaded image as the main visual reference and preserve the original composition, lighting, and atmosphere.
[Clear scene description — lock Sora's reading of the image]
Camera: [fixed / very subtle drift / no cuts / no pan / no zoom]
Motion: [list: who/what moves · at what speed · in what way · with what intensity]
Style: [aesthetic keywords: surreal realism · dreamlike · eerie · liminal · cinematic · uncanny]
Lighting: [light type and quality]
Mood: [emotional atmosphere]
Important: preserve composition · no cuts · no new objects · no deformation · no camera shake · [scene-specific prohibitions]
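Because the section order is fixed, this architecture can be assembled mechanically, which is what makes single-parameter iteration possible (for example, regenerating with only the Motion list changed). The sketch below is a hypothetical helper, not code from the project: `build_sora_prompt` and its parameter names are assumptions that simply mirror the section order above, with the negative constraints appended last.

```python
# Hypothetical builder for the Appendix C prompt architecture.
# Function and parameter names are illustrative assumptions.

BASE = ("Use the uploaded image as the main visual reference and preserve "
        "the original composition, lighting, and atmosphere.")


def build_sora_prompt(scene: str, camera: str, motion: list[str],
                      style: str, lighting: str, mood: str,
                      prohibitions: list[str]) -> str:
    """Assemble the fixed section order: scene, camera, motion list,
    style, lighting, mood, then negative constraints last."""
    motion_block = "Motion:\n" + "\n".join(f"- {m}" for m in motion)
    important = "Important: " + " ".join(f"No {p}." for p in prohibitions)
    return "\n".join([
        BASE,
        scene,
        f"Camera: {camera}",
        motion_block,
        f"Style: {style}",
        f"Lighting: {lighting}",
        f"Mood: {mood}",
        important,
    ])


# Example: a reduced version of the Flying Fish prompt.
prompt = build_sora_prompt(
    scene="A giant silver fish floats through a modern city plaza at night.",
    camera="slow handheld-style drift, no cuts",
    motion=["the fish swims slowly with soft fin movement",
            "pedestrians walk normally without reacting"],
    style="surreal urban realism, dreamy, cinematic",
    lighting="strong nighttime city lights, glowing windows",
    mood="mysterious, calm, quiet wonder",
    prohibitions=["scene cuts", "deformation", "extra animals", "text"],
)
```

Appending the prohibitions last reflects the finding in the conclusion that negative constraints are often what separate a controlled output from a chaotic one; keeping them in a separate parameter makes them easy to audit and reuse across scenes.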