An Open-Source Multi-Modal GenAI Workflow Studio
Every AI model is a node on an infinite canvas. Wire modalities, combine results — self-host or use our cloud. Open source. Multi-modal. Runs anywhere.
Add → Transform → Combine
One canvas. Every modality.
No parameter panels. No manual wiring. Three operations — add materials, transform between modalities, combine the results.
Add
Drop any material onto the canvas: text, images, audio, video, documents, URLs, or 3D models. Everything becomes a node.
Transform
Text→image, image→video, audio→text — every AI model is a modality transform, exposed as a swappable node. Switch models without rewiring.
Combine
Image fusion, lip sync, voice cloning, character swap, motion transfer — wire multiple inputs into one output. Built in, not bolted on.
Local-first, always
Workflows and uploaded files live on your own machine. No account required. No telemetry. Your data never leaves without your say-so.
Bring your own keys
Text runs on OpenRouter, Gemini, OpenAI, or DeepSeek — your key, your choice. GPU inference runs on Modal (free tier included).
Real models. Named.
Z-Image, FLUX.2 Klein 9B, LTX-2, SeedVR2, Qwen3, ACE-Step — every model doing actual work is listed in the README, not hidden behind a product name.
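To make "swappable node" concrete, here is a minimal sketch of the idea, not TongFlow's actual data model. The TransformNode class, its fields, and the two stand-in model functions are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TransformNode:
    """Hypothetical shape of a node that converts one modality into another."""
    source: str                   # input modality, e.g. "text"
    target: str                   # output modality, e.g. "image"
    model: Callable[[str], str]   # the model doing the work; swappable

    def run(self, payload: str) -> str:
        return self.model(payload)


# Stand-in "models". Real ones would call an inference backend.
def model_a(prompt: str) -> str:
    return f"image-from-model-a({prompt})"


def model_b(prompt: str) -> str:
    return f"image-from-model-b({prompt})"


node = TransformNode(source="text", target="image", model=model_a)
first = node.run("a red fox")

# Swapping the model changes only the node's slot.
# The node's modalities, and anything wired to it, stay the same.
node.model = model_b
second = node.run("a red fox")
```

The point of the sketch is that the node's input and output modalities define its place on the canvas, while the model behind it is a replaceable detail.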
What's shipped today
Pulled directly from the README. If a row is here, it works today.
Text, image, photo, sketch, audio file, audio recording, video file, video recording, document, URL, 3D model — drop any material onto the canvas.
Image generation, image editing (inpaint/redraw), image understanding (captions/Q&A), image upscaling.
Text-to-video, image-to-video, first/last-frame extraction, video understanding, video upscaling.
Music generation, speech synthesis (preset / voice clone / instruction), speech recognition.
Generate or rewrite copy from a prompt — routed through OpenRouter, Gemini, OpenAI, or DeepSeek depending on the node's model slot.
Image fusion (multi-reference blending), lip sync (audio+video / audio+image / audio+text → video), voice cloning, character swap, motion transfer, text merging.
Concatenate clips, mux audio+video, split by shots, demux, extract audio track, split long text, merge text blocks, filter clips, batch arrange groups.
Document → text, URL → text — bring outside material into the canvas.
FFmpeg for media pipelines, Modal for GPU workers. Models and providers shipping today: Z-Image, FLUX.2 Klein 9B, LTX-2, SeedVR2, InfiniteTalk, Wan-Animate, ACE-Step, Qwen3, Whisper, Gemini, OpenAI, OpenRouter.
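As an illustration of the kind of FFmpeg call a "mux audio+video" step could issue, here is a small sketch that only builds the command line. It uses standard FFmpeg options; how TongFlow actually invokes FFmpeg is not shown in this page, and the mux_command helper and file names are assumptions.

```python
import shlex


def mux_command(video: str, audio: str, output: str) -> list[str]:
    """Build an ffmpeg argument list that pairs one video with one audio track."""
    return [
        "ffmpeg",
        "-i", video,        # input 0: the video file
        "-i", audio,        # input 1: the audio file
        "-map", "0:v:0",    # take the first video stream from input 0
        "-map", "1:a:0",    # take the first audio stream from input 1
        "-c:v", "copy",     # keep the video stream as-is (no re-encode)
        "-c:a", "aac",      # encode the audio as AAC
        "-shortest",        # stop when the shorter input ends
        output,
    ]


cmd = mux_command("clip.mp4", "voice.wav", "out.mp4")
printable = shlex.join(cmd)  # shell-safe string, handy for logging
```

Passing cmd to subprocess.run would execute it, provided ffmpeg is installed and on the PATH.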
FAQ
Straight answers
What TongFlow is. What it isn't.
Is this really open source?
Yes. AGPL-3.0. Full source at github.com/tong-io/tongflow — read it, fork it, self-host it. The cloud at app.tongflow.com runs the same code.
What's the difference between the cloud and self-hosting?
Same codebase, different setup cost. The cloud is up in seconds with no configuration. Self-hosting gives you full control: your API keys, your files, no account, nothing external. Both are first-class options.
Do I need a GPU?
Not locally. Heavy inference runs on Modal — their free tier includes real H100 time. You bring a Modal token and at least one LLM API key (OpenRouter, Gemini, OpenAI, or DeepSeek). TongFlow itself runs fine on a laptop.
How is this different from ComfyUI or n8n?
ComfyUI is built for image generation. n8n is built for API orchestration. TongFlow treats all seven modalities — text, image, video, audio, speech, music, 3D — as first-class. Combine nodes (lip sync, image fusion, motion transfer) are built in, not third-party extensions.
How do I self-host?
Run: git clone https://github.com/tong-io/tongflow && cd tongflow && pnpm install && pnpm dev. You need Node.js 20+ with pnpm, a Modal token (free tier works), and one LLM API key. The README covers everything else.
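If you want to check the Node.js requirement before installing, a small preflight helper along these lines would do it. This is a sketch, not part of TongFlow; the function names are hypothetical, and it only parses a version string of the form that node --version prints.

```python
import re


def node_major(version_output: str) -> int:
    """Extract the major version from a string such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {version_output!r}")
    return int(match.group(1))


def meets_requirement(version_output: str, minimum_major: int = 20) -> bool:
    """Return True when the reported Node.js major version is high enough."""
    return node_major(version_output) >= minimum_major


ok_new = meets_requirement("v20.11.1\n")
ok_old = meets_requirement("v18.19.0")
```

To use it against a real install, pass in the text returned by subprocess.run(["node", "--version"], capture_output=True, text=True).stdout.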
Can I build my own plugins?
Yes. Define a slot in the ABI, write a Python function decorated with @node_slot, publish it as a package. Any backend works — Modal, Replicate, a local GPU, or a plain API. See the SDK docs.
What stage is this?
Shipped June 2026. Early days. Contributions, bug reports, and model integrations are very welcome. Discord and GitHub issues are the right places.
Three ways in
Download the desktop app, run it on our cloud, or self-host from source — same open-source code, your call.