Building an Agentic AI Pipeline: Autonomous Content Creation with Human-in-the-Loop
The Agentic AI Paradigm Shift
Traditional automation is brittle: write scripts, handle edge cases, pray nothing breaks. Agentic AI flips this model. Instead of programming every decision tree, you give an AI agent:
- Goals — "Create engaging YouTube Shorts from this content"
- Tools — FFmpeg, WhisperX, YouTube API, file system access
- Autonomy — The agent decides how to achieve the goal
- Guardrails — Human review for quality-critical decisions
The agent isn't following a script. It's reasoning about what to do next, using tools to accomplish subtasks, and adapting when things go wrong.
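Concretely, that contract can be expressed as data rather than control flow. A minimal sketch in Node.js (the object shape, field names, and values here are illustrative assumptions, not Clawdbot's actual configuration format):

```js
// Hypothetical agent contract: goals, tools, and guardrails expressed as data.
// Field names and values are illustrative, not Clawdbot's real config schema.
const contentAgent = {
  goal: "Create engaging YouTube Shorts from this content",
  tools: ["ffmpeg", "whisperx", "youtube_api", "filesystem"],
  guardrails: {
    humanReviewBefore: ["upload"], // quality-critical steps gated by a person
    maxUploadsPerHour: 6,          // platform-safety rate limit
  },
};
```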
My Setup: Jarvis, the Content Agent
I run Clawdbot, an agentic AI framework that gives Claude persistent memory, tool access, and the ability to operate autonomously. My instance — Jarvis — handles everything from code reviews to calendar management.
For this project, Jarvis became my content creation agent:
```
┌──────────────────────────────────────────────────────────────────┐
│                        JARVIS (AI Agent)                         │
│                                                                  │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐           │
│  │ Analyze │   │ Extract │   │ Caption │   │ Upload  │           │
│  │  Tool   │   │  Tool   │   │  Tool   │   │  Tool   │           │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘           │
│       │             │             │             │                │
│       └─────────────┴──────┬──────┴─────────────┘                │
│                            │                                     │
│                      ┌─────▼─────┐                               │
│                      │  Planner  │ ← Decides what to do next     │
│                      │ (Claude)  │                               │
│                      └─────┬─────┘                               │
│                            │                                     │
└────────────────────────────┼─────────────────────────────────────┘
                             │
                      ┌──────▼──────┐
                      │    Human    │ ← Quality gate (Telegram)
                      │   Review    │
                      └─────────────┘
```
The key difference from traditional automation: the agent plans its own execution path.
Agentic Capabilities in Action
1. Autonomous Planning
When I say "process this movie into Shorts," Jarvis doesn't execute a fixed script. It reasons:
```
User: Process tears_of_steel.mp4 into YouTube Shorts

Jarvis (thinking):
- First, I need to analyze the video for high-engagement moments
- The video is 12 minutes, so I'll look for 30-60 second segments
- I should check if WhisperX transcription exists... it doesn't
- I'll transcribe first, then use the transcript to find interesting dialogue
- After extraction, each clip needs vertical conversion and captions
- Finally, I'll queue them for human review before upload
```
This planning happens in natural language, visible in the agent's reasoning traces. If something fails (quota exceeded, file missing), it re-plans.
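Under the hood, this kind of planning can be driven by a simple plan-act loop: ask the model what to do next, execute the tool call it requests, feed the result back, and repeat. A minimal sketch using the Anthropic Node SDK's tool-use flow (the tool definitions and prompt are placeholders, and Clawdbot's actual loop is more involved):

```js
// Minimal plan-act loop: the model picks the next tool call, we execute it,
// return the result, and repeat until it replies with plain text.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function runAgent(task, tools, executeTool) {
  const messages = [{ role: "user", content: task }];
  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5", // placeholder: any tool-capable model
      max_tokens: 2048,
      tools,
      messages,
    });
    const toolCalls = response.content.filter((block) => block.type === "tool_use");
    if (toolCalls.length === 0) return response; // no more tool calls: plan complete
    messages.push({ role: "assistant", content: response.content });
    messages.push({
      role: "user",
      content: await Promise.all(
        toolCalls.map(async (call) => ({
          type: "tool_result",
          tool_use_id: call.id,
          content: await executeTool(call.name, call.input),
        }))
      ),
    });
  }
}
```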
2. Tool Use & Composition
The agent has access to shell commands, file operations, and APIs. It composes these tools dynamically:
```js
// Agent's tool calls (simplified)
exec("whisperx tears_of_steel.mp4 --output_format json")
read("tears_of_steel.json") // Parse transcript

// Agent reasons: "Timestamp 3:42 has high-energy dialogue"
exec("ffmpeg -ss 222 -t 45 -i tears_of_steel.mp4 clip_001.mp4")
exec("node add-captions-ass.js clip_001.mp4")

// Agent decides: "Clip looks good, sending for review"
message({ action: "send", buttons: [["✅ Approve", "❌ Reject"]] })
```
The agent doesn't need a pre-built pipeline script. It understands the goal and selects appropriate tools.
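The flip side is that each tool should be a narrow, validated wrapper rather than raw shell access (more on this in the lessons below). A hypothetical example of a scoped extraction tool, with names and validation rules of my own invention:

```js
// Hypothetical scoped tool: the agent can extract clips only through this
// narrow interface. No arbitrary shell strings, and arguments are validated.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function extractClip({ source, startSeconds, durationSeconds, output }) {
  if (!/^[\w./-]+\.mp4$/.test(source) || !/^[\w./-]+\.mp4$/.test(output)) {
    throw new Error("source and output must be simple .mp4 paths");
  }
  if (durationSeconds <= 0 || durationSeconds > 60) {
    throw new Error("Shorts clips must be 1-60 seconds");
  }
  // execFile passes arguments directly to ffmpeg (no shell), avoiding injection.
  await run("ffmpeg", [
    "-ss", String(startSeconds),
    "-t", String(durationSeconds),
    "-i", source,
    "-y", output,
  ]);
  return output;
}
```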
3. Human-in-the-Loop as a Design Pattern
Full autonomy is dangerous for public-facing content. Bad titles, copyrighted clips, or low-quality extracts could hurt the channel. The solution: human-in-the-loop (HITL).
The agent operates autonomously until a quality-critical decision point:
```
               AUTONOMOUS                   HUMAN REVIEW    AUTONOMOUS
                    │                             │              │
                    ▼                             ▼              ▼
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Analyze │───▶│ Extract │───▶│ Caption │───▶│ Review  │───▶│ Upload  │
│         │    │         │    │         │    │ (Human) │    │         │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
```
I receive clips via Telegram with inline buttons:
```
🎬 Review: Tears of Steel Clip #3
Title: "The Robot's Memory Hack" 🤖

[✅ Approve] [❌ Reject] [✏️ Edit]
```
One tap. The agent handles everything else.
This pattern — automate the tedious, gate the critical — is central to production agentic systems. The agent does 95% of the work; I provide the 5% that requires judgment.
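Wiring that review gate to Telegram is mostly one Bot API call: sendVideo with an inline keyboard. A rough sketch (Node 18+ for global fetch/FormData; the callback_data convention and environment variable names are my own assumptions):

```js
// Sketch: send a clip for review with Approve / Reject / Edit buttons.
// BOT_TOKEN and CHAT_ID are assumed environment variables; the
// "approve:<clipId>" callback_data format is an assumed convention.
import { readFile } from "node:fs/promises";

async function sendForReview(clipPath, clipId, title) {
  const form = new FormData();
  form.append("chat_id", process.env.CHAT_ID);
  form.append("caption", `🎬 Review: ${title}`);
  form.append("video", new Blob([await readFile(clipPath)]), `${clipId}.mp4`);
  form.append("reply_markup", JSON.stringify({
    inline_keyboard: [[
      { text: "✅ Approve", callback_data: `approve:${clipId}` },
      { text: "❌ Reject", callback_data: `reject:${clipId}` },
      { text: "✏️ Edit", callback_data: `edit:${clipId}` },
    ]],
  }));
  const res = await fetch(
    `https://api.telegram.org/bot${process.env.BOT_TOKEN}/sendVideo`,
    { method: "POST", body: form }
  );
  if (!res.ok) throw new Error(`Telegram error: ${await res.text()}`);
}
```

The button press comes back as a callback_query update, which the agent maps to approve, reject, or edit for that clip.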
4. Memory & State Management
Agents need memory to operate over time. Jarvis maintains:
- Session memory — Current task context, what's been tried
- Persistent memory — MEMORY.md file with long-term learnings
- State files — upload-queue.json, pipeline-status.json
When I return hours later, Jarvis knows:
- Which clips are pending review
- What's been uploaded
- Rate limit status (6 uploads/hour)
- Any errors that need attention
```md
# From Jarvis's MEMORY.md

## YouTube Pipeline Learnings
- Clips under 30s perform better
- Avoid extracting segments with music (copyright risk)
- Upload queue rate: 6/hour to avoid shadowbans
- Telegram review flow working well — 10-15 clips reviewed in ~3 min
```
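Because the state lives in plain JSON files, it survives restarts and can be inspected or fixed by hand. A minimal read-modify-write sketch (the upload-queue.json schema shown is an assumption for illustration):

```js
// Minimal file-based state: load the queue, update one clip's status, save.
// The queue schema ({ clips: [{ id, status, ... }] }) is assumed, not documented.
import { readFile, writeFile } from "node:fs/promises";

const QUEUE_FILE = "upload-queue.json";

async function loadQueue() {
  try {
    return JSON.parse(await readFile(QUEUE_FILE, "utf8"));
  } catch {
    return { clips: [] }; // first run: start with an empty queue
  }
}

async function markClip(clipId, status) {
  const queue = await loadQueue();
  const clip = queue.clips.find((c) => c.id === clipId);
  if (clip) clip.status = status; // e.g. "pending_review" -> "approved"
  await writeFile(QUEUE_FILE, JSON.stringify(queue, null, 2));
}
```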
5. Error Handling & Recovery
Traditional scripts crash on unexpected errors. Agentic systems reason about failures:
```
Error: YouTube API quota exceeded

Jarvis (reasoning):
- Upload failed due to quota
- I should mark this clip as "pending_retry"
- Check when quota resets... midnight Pacific Time
- Update the queue status
- Notify John that uploads are paused
- Set a reminder to retry tomorrow
```
The agent doesn't just log an error — it adapts its plan.
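The same behavior can be mirrored in the upload tooling so a quota failure degrades into bookkeeping instead of a crash. A hedged sketch (the error-matching string, markClip, and notifyHuman are assumptions, with markClip as in the state sketch above):

```js
// Sketch: on a quota error, park the clip for retry and tell the human;
// anything unrecognized is surfaced rather than silently swallowed.
async function uploadWithRecovery(clip, uploadFn) {
  try {
    await uploadFn(clip);
    await markClip(clip.id, "uploaded");
  } catch (err) {
    if (String(err.message).includes("quotaExceeded")) {
      // YouTube Data API quota resets at midnight Pacific Time.
      await markClip(clip.id, "pending_retry");
      await notifyHuman(`Uploads paused: quota exceeded. Will retry ${clip.id} after reset.`);
    } else {
      await markClip(clip.id, "error");
      throw err; // unknown failure: surface it for the agent to re-plan
    }
  }
}
```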
Why This Matters
The Agent Advantage
| Traditional Automation | Agentic AI |
|---|---|
| Fixed scripts | Dynamic planning |
| Fails on edge cases | Adapts to failures |
| Manual error handling | Self-correcting |
| One-shot execution | Persistent operation |
| Requires developer intervention | Human-in-the-loop for quality |
Production Readiness
This isn't a demo. The pipeline has processed 100+ clips across multiple source videos with:
- Zero manual script intervention
- ~5 min total review time per batch
- Automatic retry on failures
- Rate limiting preventing platform issues
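The rate limiting falls out of the state file almost for free: count recent uploads before sending the next one. A small sketch, assuming each uploaded clip records an uploadedAt timestamp in the queue:

```js
// Sketch: allow an upload only if fewer than 6 happened in the past hour.
// Assumes each uploaded clip stores an ISO-8601 uploadedAt field.
const MAX_UPLOADS_PER_HOUR = 6;

function canUploadNow(queue, now = Date.now()) {
  const oneHourAgo = now - 60 * 60 * 1000;
  const recentUploads = queue.clips.filter(
    (c) => c.status === "uploaded" && Date.parse(c.uploadedAt) > oneHourAgo
  );
  return recentUploads.length < MAX_UPLOADS_PER_HOUR;
}
```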
Lessons Learned
- Agents need clear tool boundaries. Don't give an agent raw exec without sandboxing. Scope tools to specific capabilities.
- Human-in-the-loop isn't a crutch — it's a feature. For content creation, human judgment at key points prevents costly mistakes.
- Memory is essential. Without persistent state, agents lose context and repeat work. File-based memory works surprisingly well.
- Natural language planning > rigid workflows. The agent's ability to reason in English about what to do next makes debugging trivial.
- Start autonomous, add gates. Build the fully automated version first, then identify where human review adds value.
What's Next
- Feedback loops: Use YouTube analytics to teach the agent what clips perform well
- Multi-agent collaboration: Separate agents for analysis, editing, and distribution
- A/B testing: Agent generates title variants, learns from click-through rates
The future of content creation isn't "AI generates everything" — it's AI agents that handle the 95% that's tedious, with humans providing the 5% that requires taste.
Built with Clawdbot and Claude.

