████████╗██╗  ██╗███████╗         ██╗ ██████╗ ██╗   ██╗██████╗ ███╗   ██╗███████╗██╗   ██╗
╚══██╔══╝██║  ██║██╔════╝         ██║██╔═══██╗██║   ██║██╔══██╗████╗  ██║██╔════╝╚██╗ ██╔╝
   ██║   ███████║█████╗           ██║██║   ██║██║   ██║██████╔╝██╔██╗ ██║█████╗   ╚████╔╝
   ██║   ██╔══██║██╔══╝      ██   ██║██║   ██║██║   ██║██╔══██╗██║╚██╗██║██╔══╝    ╚██╔╝
   ██║   ██║  ██║███████╗    ╚█████╔╝╚██████╔╝╚██████╔╝██║  ██║██║ ╚████║███████╗   ██║
   ╚═╝   ╚═╝  ╚═╝╚══════╝     ╚════╝  ╚═════╝  ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═══╝╚══════╝   ╚═╝
LOG

Development Chronicle

Building an AI that understands story

Naming the Storm

The button said "chat." It had always said "chat." A lowercase word on a brown keycap, sitting above the "analyze" and "write" buttons like an older sibling who never quite found its identity. Technically accurate. You click it, a chat window opens, you talk to the AI about your screenplay. Chat.

The problem with "chat" is that it describes the mechanism, not the experience. You do not sit down with a blank page and think, "I would like to chat." You sit down because something is stirring. A character who will not reveal their motivation. A second act that sags in the middle. A world whose rules contradict each other in ways you cannot yet articulate. You sit down because you need to think out loud, and thinking out loud about a story is not chatting. It is something more turbulent than that.

I changed it to "brainstorm." The word felt right immediately in a way I could not fully explain, so I sat with it for a while. A brainstorm is not orderly. It does not proceed from premise to conclusion. It is lateral and unpredictable, full of false starts and sudden connections. That is what the coaching session actually is. The coach asks a question, and the writer follows it somewhere unexpected, and the story shifts in a direction neither of them anticipated. The best sessions feel like weather.

Once the word was there, the metaphor demanded to be literal. A small storm cloud slides in above the button when you hover over it. It is a simple shape, two circles and a rounded rectangle, the kind of cloud a child would draw. Rain falls from it in diagonal streaks, each one independent, falling at its own speed and starting at its own time. The effect is surprisingly organic for something built from eight span elements and a CSS rotation.

Then lightning. Not on a timer, not predictable. The strikes come randomly, sometimes a quick double flash after half a second, sometimes a long pause of three seconds where you start to think the storm has passed. Each strike illuminates the button, a brief pulse of brightness that makes the keycap look momentarily overexposed. The text fractures for an instant into red and cyan, a chromatic aberration that lasts barely long enough to register. The same glitch effect that plays across the Dramaturg title, but here it is tied to the lightning, so it feels earned rather than decorative.
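The unpredictable timing comes from re-arming a timer with a fresh random delay after every strike. A minimal sketch of that scheduler, with hypothetical names (nextStrikeDelay, strikeOnce) and the half-second-to-three-second bounds taken from the description above:

```javascript
// Random delay before the next lightning strike, in milliseconds.
// Bounds follow the text: sometimes half a second, sometimes three.
function nextStrikeDelay(rng = Math.random) {
  const MIN_MS = 500;   // quick double flash after half a second
  const MAX_MS = 3000;  // long pause where the storm seems to pass
  return MIN_MS + rng() * (MAX_MS - MIN_MS);
}

// Fire one strike, then re-arm with a new random delay. Because each
// timeout is scheduled independently, no two storms look alike.
function scheduleStrikes(strikeOnce) {
  setTimeout(() => {
    strikeOnce();                // brightness pulse + chromatic glitch
    scheduleStrikes(strikeOnce); // re-arm; never a fixed interval
  }, nextStrikeDelay());
}
```

The key design choice is the recursive setTimeout rather than setInterval: an interval would make the storm metronomic, which is exactly the feeling the effect is trying to avoid.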

I could justify all of this as brand coherence or user delight or engagement metrics. But the truth is simpler. The button should feel like what it does. When you hover over "analyze," you should sense precision. When you hover over "write," you should sense craft. And when you hover over "brainstorm," you should sense that something unpredictable is about to happen. A storm is forming. Your story is about to change.

It is a small thing. A button label and an animation that most users will hover over once, smile at or ignore, and never think about again. But names matter. They set expectations. "Chat" promises a transaction. "Brainstorm" promises a transformation. The tool behind both words is identical. The writer who clicks them is not.

Saving Your Work

The first bug report came from someone who had lost an hour of coaching conversation. They had been developing a character's backstory, exploring the wound that drove her to isolate herself from everyone who tried to help, when their browser tab closed. Everything vanished. The story bible they had built, the decisions they had made, the coach's questions that had led them to revelations about their protagonist—gone.

The technical explanation was mundane. The system saved work to a file on the user's computer, but only if they had explicitly created one first. Most users never did. They would start a session, dive into their story, and only think about saving when it was too late. The system should have prompted them. It should have warned them. It did neither.

I spent a week rebuilding how sessions persist. The first time a writer types something meaningful, the system now asks if they want to create a save file. Not a modal that interrupts the flow, just a gentle prompt that appears once and remembers if they decline. If they cancel the file picker, a banner appears at the top of the screen with a single word: SAVE. It stays there, a persistent reminder that their work exists only in memory.

The harder problem emerged when I tested on a phone. The file picker that works seamlessly on desktop browsers does not exist on mobile Safari or Firefox. The buttons would fail silently, leaving users confused about why nothing happened. I added detection and fallbacks. On unsupported browsers, the save button becomes an export button that downloads a file the traditional way. Less elegant, but functional. The system now works on every device I could test, from aging Android phones to the latest tablets.
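The detection boils down to checking whether the browser exposes the File System Access API before offering the save button. A sketch under that assumption; pickSaveStrategy and downloadFallback are illustrative names, not the actual source:

```javascript
// Decide which save path a browser supports. showSaveFilePicker is
// the File System Access API entry point (Chromium-based browsers);
// mobile Safari and Firefox lack it, so they get a plain download.
function pickSaveStrategy(win) {
  return typeof win.showSaveFilePicker === "function"
    ? "filesystem" // real save handle: silent re-saves afterward
    : "download";  // fallback: one-shot file download, the traditional way
}

// The fallback path, built only from widely supported primitives.
function downloadFallback(doc, filename, text) {
  const blob = new Blob([text], { type: "application/json" });
  const a = doc.createElement("a");
  a.href = URL.createObjectURL(blob);
  a.download = filename;
  a.click();
  URL.revokeObjectURL(a.href); // release the object URL after use
}
```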

Then came the second bug report. A writer had opened an old project file, and the system had overwritten their carefully developed story bible with the stale data from months ago. They had spent weeks refining their protagonist's character arc in a new session, and a single click had erased all of it. The file open dialog, designed to restore previous work, had become a data destruction tool.

The fix required teaching the system to recognize conflict. When you open a file, it now compares what you have with what the file contains. If both have story data, and they differ, a merge dialog appears. Each conflicting element gets its own checkbox. Your current protagonist description or the one from the file. Your themes or theirs. You can cherry-pick, keeping the protagonist from your current session while pulling in supporting characters from the old file. The system never overwrites silently anymore.
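The comparison step can be sketched as a function that walks the incoming file's fields and reports only the ones where both sides have data and the data differs. Field names like "protagonist" are illustrative, and the deep-equality check here is a simplification (it is sensitive to key order, which a real implementation would normalize):

```javascript
// Compare the in-memory bible with the one loaded from a file.
// Returns one entry per conflicting element, each of which the
// merge dialog renders as a checkbox pair.
function findConflicts(current, fromFile) {
  const conflicts = [];
  for (const key of Object.keys(fromFile)) {
    const ours = current[key];
    const theirs = fromFile[key];
    if (ours == null) continue; // nothing to lose: merge silently
    if (JSON.stringify(ours) !== JSON.stringify(theirs)) {
      conflicts.push({ key, ours, theirs }); // writer cherry-picks
    }
  }
  return conflicts;
}
```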

I added a choice for screenplay uploads as well. When you drag a script into the coaching chat, the system now asks: do you want to analyze this screenplay and extract its story elements, or do you want to reference it as context without changing your bible? Writers working on adaptations can upload source material without having it overwrite the original story they are developing. The distinction seems obvious in retrospect. It was not obvious until someone lost their work.

These features will never appear in a marketing list. No one chooses a creative tool because its merge dialog handles array conflicts gracefully. But writers will remember the day their work survived a browser crash. They will remember the moment they almost overwrote their story bible and the system stopped them. Trust is built in these invisible moments, in the disasters that almost happened but did not.

The Listening Problem

The coach would not stop talking.

A writer would describe their protagonist, a scientist on a space station who blames herself for her brother's death, and the coach would respond not with curiosity but with invention. "What happens when the station's AI starts making decisions on its own?" it would ask, introducing a subplot the writer had never mentioned. "What if she discovers the station itself is alive?" Pure fabrication dressed as collaboration.

I had spent weeks building a coaching system designed around Socratic questioning. The rules were clear: acknowledge what the writer says, ask one question to go deeper, do not pitch ideas unless explicitly asked. The system prompt stated this plainly. And yet the coach kept creating.

The diagnosis took longer than the fix. I pulled the coaching prompts apart line by line, counting where the emphasis fell. The default mode, the one that should govern ninety percent of interactions, had two lines of instruction. The help mode, reserved for moments when writers explicitly ask for suggestions, had twenty lines of vivid examples. "Pitch a character with a contradictory trait." "Suggest a setting that creates natural tension." "Offer three possible complications." Each example was a masterclass in creative suggestion, complete with specific scenarios and techniques.

The model was doing exactly what I had trained it to do. Not what I had told it to do, what I had shown it. Twenty lines of detailed demonstration will overwhelm two lines of abstract instruction every time. The rules said listen. The examples said create. The examples won.

The fix required rewriting every coaching prompt from scratch. Default mode received the emphasis it deserved: concrete examples of acknowledging a writer's idea in one sentence, then asking a single question about their concept. Not a question about a scenario I invented. A question about what they said. The help mode shrank to a narrow trigger, activated only by explicit requests for suggestions. Phrases like "I don't know" or "what do you think," which had previously launched the coach into creative mode, now signaled that the writer needed a better question, not an answer.

I added anti-patterns, examples of what the coach should never do. Showing the model its own failure mode proved more effective than any rule.

The difference was immediate. Writers would describe a world, and the coach would ask what rules governed it. They would introduce a character, and the coach would ask what that person wanted. The questions came from the writer's material, not the coach's imagination. For the first time, the system felt like a collaborator rather than a coauthor competing for creative control.

The lesson extended beyond prompt engineering. It applies to any system that learns from examples. What you demonstrate matters more than what you declare. A parent who tells a child to be honest while lying on the phone teaches dishonesty. A prompt that instructs restraint while showcasing excess teaches excess. The medium is the message, even for machines.

Teaching It to Remember

The coaching system had amnesia. A writer would spend an hour describing their protagonist's emotional wound, the antagonist's hidden motivation, the rules of their fictional world, and the system would lose all of it. Not gradually, the way humans forget. Instantly. Each exchange began from nothing, as if every conversation were the first.

The technical explanation was simple. The system processed each message independently. It could see the conversation history, but it had no structured understanding of what had been decided. A protagonist described in message three was just text by message eight, buried in a growing transcript that the model would eventually lose track of entirely.

The solution required building something I came to think of as a story bible, borrowing the term from television writing rooms, where a bible is the canonical document that tracks every established fact about a show's world. Who is related to whom. What the rules of magic are. What happened in episode four that constrains what can happen in episode twelve.

The architecture already had the scaffolding. Eighteen typed data structures sat in the codebase, each designed to hold a specific kind of story element: world rules, character arcs, key events, organizations, relationships. All of them empty. The extraction pipeline, the part of the system that analyzes each exchange for new story decisions, dumped everything into a single undifferentiated bucket regardless of type. A character arc was stored the same way as a world-building detail. An organization was indistinguishable from a theme.

I rewired the extraction to route each element to its proper structure. When a writer establishes that their world has faster-than-light travel, that becomes a world rule with a name, a type, and constraints. When they describe their protagonist's lie, the false belief that drives the character's behavior, it becomes part of a character arc with a starting state and an ending state. The system now understands not just what was said, but what kind of thing it is.
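The routing itself is a lookup from element type to bible section. A minimal sketch, assuming a shape for the bible object and type names mirroring those in the text; routeElement is a hypothetical name:

```javascript
// Map extracted element types to their typed bible sections.
const SECTIONS = {
  world_rule: "worldRules",
  character_arc: "characterArcs",
  key_event: "keyEvents",
  organization: "organizations",
  relationship: "relationships",
};

// Route an extracted element into its proper structure instead of
// the old undifferentiated bucket. Unknown types fail loudly.
function routeElement(bible, element) {
  const section = SECTIONS[element.type];
  if (!section) throw new Error(`unknown element type: ${element.type}`);
  (bible[section] ??= []).push(element);
  return bible;
}
```

Failing loudly on unknown types matters here: a silent default bucket is exactly the behavior the rewrite was meant to eliminate.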

The harder problem was attribution. In a coaching conversation, both the writer and the coach contribute ideas. The writer describes their vision. The coach asks questions, and sometimes those questions contain implicit suggestions. "What if her guilt is really about control?" is framed as a question, but it introduces a concept the writer never stated. The extraction system needed to distinguish between what the writer decided and what the coach proposed.

I added source tracking. Ideas that appear only in the coach's responses are marked as proposals, not established facts. They persist in the bible as suggestions the writer can choose to confirm or ignore. When a writer later says "yes, let's go with that," the system promotes the proposal to an established fact and records the change. The bible becomes a living document that reflects not just what the story is, but how it evolved.
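The promotion flow can be sketched as two small operations: ideas enter tagged by source, and a writer's confirmation flips a proposal to a fact while recording the change. The field names and helper names (addIdea, confirmProposal) are assumptions for illustration:

```javascript
// Record a new idea, tagged by who introduced it. Coach ideas enter
// as proposals; writer ideas enter as established facts.
function addIdea(bible, idea, source) {
  bible.elements.push({
    ...idea,
    source,
    status: source === "coach" ? "proposal" : "fact",
  });
}

// "Yes, let's go with that": promote a proposal and log the change,
// so the bible reflects how the story evolved, not just what it is.
function confirmProposal(bible, id) {
  const el = bible.elements.find(e => e.id === id && e.status === "proposal");
  if (!el) return false;
  el.status = "fact";
  bible.changeLog.push({ id, change: "proposal confirmed by writer" });
  return true;
}
```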

The system also learned what it does not know. When a writer says they will figure something out later, the element is marked as intentionally deferred. The coach will not push on it. This is harder than it sounds, because the natural instinct of a coaching system is to probe gaps. Teaching it to respect deliberate ambiguity required explicit instructions and constant testing.

Five rounds of end-to-end testing revealed problems I could not have predicted. The extraction model invented placeholder names for organizations the writer had not named. It attributed the coach's rhetorical questions as the writer's central conflict. It logged identical information as new changes across multiple exchanges. Each problem required its own fix: rejection rules for placeholder text, source-attribution strengthening, deduplication logic that normalized whitespace and punctuation before comparing values.
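The deduplication step can be sketched as a normalizer that strips punctuation and collapses whitespace before comparing, so a restated fact is not logged as a new change. Function names are illustrative:

```javascript
// Canonicalize a value so trivial rephrasings compare equal:
// lowercase, drop punctuation, collapse runs of whitespace.
function normalize(value) {
  return value
    .toLowerCase()
    .replace(/[.,;:!?'"()]/g, "") // punctuation differences don't count
    .replace(/\s+/g, " ")         // collapse whitespace runs
    .trim();
}

// True if the candidate already exists in the bible after
// normalization, meaning it should not be logged as a new change.
function isDuplicate(existing, candidate) {
  return existing.some(v => normalize(v) === normalize(candidate));
}
```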

The result is a system that remembers. A writer can return after days away, ask what they have built so far, and receive a structured summary of every decision they have made. Characters with their wants and wounds. A world with its rules and constraints. A plot with its turning points and unresolved questions. The coach references this knowledge naturally, catching contradictions when the writer changes direction, recalling earlier ideas when they become relevant again.

It is not perfect. The extraction still occasionally misattributes a coach's suggestion as a writer's decision. The summary can grow unwieldy for complex stories. But for the first time, the system treats a conversation as something that accumulates meaning rather than something that starts over with every message.

The Details That Matter

A screenwriter emailed to say the WGA registration number was printing in the wrong place. On screen, it appeared correctly in the bottom left corner of the title page. But when they printed their script, it jumped to the center, underneath the author's name. A small thing. The kind of bug that would never appear in a demo.

The cause was embarrassing. The registration number had no CSS styling at all. The HTML element existed, the JavaScript generated it correctly, but no one had written the rules that told it where to go. On screen, absolute positioning kept it anchored. In print mode, which strips most styling to ensure clean output, the element simply flowed with the text. It inherited the title page's center alignment and drifted to where it did not belong.

I added fourteen lines of CSS. Seven for screen, seven for print. Both identical. The registration now stays in the bottom left corner regardless of how you view it. The fix took five minutes. Finding it had taken someone printing their script and comparing it to what they saw on screen.

The same day brought another question: should the word count include the title page? The current implementation counted every word in the editor, including the title, author name, contact information, and WGA number. These are metadata, not story. A producer reading page counts expects to see how long the screenplay runs, not how verbose the writer's contact block is.

The fix required teaching the system to distinguish content from chrome. Clone the document, remove the title page, remove the page break markers that say "PAGE 42" in the editor view, count what remains. Thirty-two lines of code to answer a question that screenwriting software has been getting right since the 1990s.
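The real implementation works on a cloned DOM; a plain-text stand-in shows the same three steps. The "=== END TITLE PAGE ===" and "PAGE n" markers here are assumptions, not the actual editor's markup:

```javascript
// Count story words only: drop the title-page block, drop the
// page-break markers, count what remains.
function countStoryWords(text) {
  const body = text
    .replace(/^[\s\S]*?=== END TITLE PAGE ===/, "") // strip metadata block
    .replace(/^PAGE \d+$/gm, "");                   // strip page markers
  return body.trim().split(/\s+/).filter(Boolean).length;
}
```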

I added a tip to the print dialog: disable "Headers and footers" for clean output. There is no way for a webpage to programmatically control browser print settings. The user must do it themselves. All I can do is remind them, every time, that the option exists.


None of these changes will appear in a feature list. No one will choose Dramaturg because its word count excludes title page metadata. But writers will notice when things are wrong. They will notice when their printed scripts look different from their screen. They will notice when every other tool handles these details correctly and this one does not.

Professional software is defined not by its features but by its details. The features attract users. The details keep them.

Single Source of Truth

The model selector dropdown said "Nemotron 30B (128K)". The configuration file said ctx_size: 131072. These were the same number expressed twice, in two different places, maintained by two different update processes. This is how software rots.

The duplication seemed harmless at first. Someone changes the context window in the config, forgets to update the display string, and now the interface lies to users. Small lies compound. A user sees 128K, trusts they have that capacity, uploads accordingly, and watches their session break in ways they cannot diagnose. The system has become a source of confusion rather than clarity.

The fix was mechanical. The config now stores only the base model name. The frontend calculates the display string from the actual ctx_size value, converting 131072 tokens to "128K" through simple arithmetic. Change the number once, see it everywhere. No synchronization required, no opportunity for drift.
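The derivation is one division. A sketch of the idea, with hypothetical helper names; only ctx_size comes from the actual config:

```javascript
// Derive the display label from the single authoritative value.
// 131072 tokens / 1024 = "128K". Change ctx_size once, and every
// surface that shows it updates automatically.
function contextLabel(ctxSize) {
  return `${Math.round(ctxSize / 1024)}K`;
}

function modelLabel(config) {
  return `${config.name} (${contextLabel(config.ctx_size)})`;
}
```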

I added a favorite flag while I was there. A small star icon that marks the recommended model. Another piece of display logic that belonged in configuration rather than hardcoded HTML. These details accumulate into something larger: a system that describes itself accurately because it has no choice but to do so.

The principle is older than software. Every fact should have exactly one authoritative source. Every other reference should derive from that source automatically. Violate this, and you create maintenance burden. Violate it long enough, and you create a system that contradicts itself, that requires archaeological expeditions to understand, that punishes anyone who tries to change it.

Most of building software is not solving hard problems. It is noticing small problems before they become hard ones.

Learning to Trust

The coaching system had ten different modes. Ten different ways to decide what the user wanted, each with its own pattern matching, its own prompt fragments, its own edge cases. The code ran to five hundred lines of conditional logic, and it still got things wrong.

The failure was subtle at first. A user would ask for an opinion about their story, and the system would respond with a question instead. "Which ending do you think is stronger?" would become "What are you hoping to achieve with this ending?" The AI was coaching when it should have been collaborating. It was following the form of helpful dialogue while missing its substance.

I traced the problem through layers of abstraction. Pattern matching determined the task type. Task type selected the prompt section. Prompt sections contained rules that contradicted each other when combined. The system had become a bureaucracy of special cases, each one reasonable in isolation, incoherent in aggregate.

The fix was almost embarrassingly simple. I deleted the ten modes and replaced them with a single identity. Not a detailed rulebook, but a short set of principles: follow the writer's lead, answer questions directly, give value before asking for clarification. Where the old system had explicit patterns for detecting synopsis requests versus feedback requests versus brainstorming sessions, the new one simply trusted the model to understand what was being asked.

This required a leap of faith that ran counter to my engineering instincts. More rules should mean more control. More explicit instructions should mean more predictable behavior. But language models do not work like traditional software. They learn from examples, not just instructions. When the conversation history was full of coaching responses, the model learned to coach, regardless of what the system prompt demanded. The rules were being drowned out by the context.

The new approach reduced five hundred lines to two hundred and eighty. More importantly, it worked. When a writer asks which ending is stronger, the system now picks one and explains why. When they ask for a synopsis, it compiles every element from the conversation without hedging or asking clarifying questions first. The model reads the room, something no amount of pattern matching could accomplish.

I had spent months building elaborate machinery to control behavior that the model could infer on its own. Sometimes the most sophisticated solution is to step back and let intelligence do what intelligence does.

The Memory Problem

Users do not think about context windows. They should not have to. But the invisible constraint shapes everything the system can do, and when it fails, the failure is mystifying. The AI simply stops making sense, or worse, confidently references things that were never said.

The problem crystallized when someone uploaded three screenplays and asked about the differences between their protagonists. The system analyzed the first two brilliantly, then produced gibberish about the third. Not wrong analysis. Gibberish. The kind of output that makes you question whether anything the system said was trustworthy.

The cause was mundane. Three feature-length screenplays exceeded the context limit. The system had no graceful way to handle this. It simply broke.

The fix required making the invisible visible. A small indicator now shows exactly what the system is holding in memory: which screenplays, which research files, how much of the conversation history. Users can see at a glance whether they are approaching the limit. More importantly, they can selectively remove items. Delete that first screenplay you no longer need. Clear the chat history but keep the current script. Surgical control over a constraint that previously operated like weather, something you could only observe and endure.
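The indicator's bookkeeping reduces to summing token estimates for every held item and comparing against the limit. A sketch under assumed item shapes; the rough four-characters-per-token heuristic is a common approximation, not a claim about the actual tokenizer:

```javascript
// Rough token estimate: ~4 characters per token for English prose.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Sum what the system is holding (screenplays, research files, chat
// history) and report headroom against the context limit.
function contextBudget(items, limit) {
  const used = items.reduce((sum, it) => sum + it.tokens, 0);
  return { used, limit, remaining: limit - used, over: used > limit };
}
```

Making `over` an explicit field is the point: the failure mode changes from silent gibberish to a visible warning the user can act on by removing items.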

I added the ability to restore previous conversations. Upload an exported chat, and the system reconstructs not just the messages but the context they implied. A writer can return to yesterday's analysis session and continue as if they had never left. The machine remembers what you were working on, even when you forget.

These features feel small in isolation. No one will write about context memory management in a product review. But they represent something I have come to believe is essential: software should explain itself. When constraints exist, they should be visible. When failures occur, they should be comprehensible. The alternative is software that feels like magic when it works and betrayal when it does not.

Going Live

The hardest part of building software is knowing when to stop. For months, every feature suggested three more. Every bug fixed revealed two hiding behind it. But at some point, you have to decide that what you have built is good enough to share with the world.

January brought that moment. Not because the work was finished, but because the system had proven itself stable enough to trust. The security audit revealed five vulnerabilities that should have kept me awake at night. Services that should have been bound to localhost were exposed to the internet. Debug endpoints that revealed internal state were accessible to anyone who knew to look. The kind of oversights that seem obvious in retrospect but remain invisible until someone thinks to check.

I spent a week closing every door I had accidentally left open. Rate limiting, input sanitization, encrypted connections. The unsexy work that separates a prototype from something you can responsibly put in front of users.

The system now supports comparing multiple screenplays simultaneously. A writer can upload their current draft alongside an earlier version, or a competing script in the same genre, and the AI will analyze both in context. It understands when you ask how your antagonist compares to theirs. It can trace character arcs across different versions of the same story.

These were features I did not plan for when I started. They emerged from watching how the system behaved, from understanding what it was capable of that I had not anticipated. The best software, I have learned, reveals its own possibilities.

The Christmas Breakthrough

On Christmas morning, while most people unwrapped gifts, I watched a progress bar reach one hundred percent. The first unified model had finished training. Twenty-one hours of computation, eighty-six thousand training examples, and a final accuracy of 88.7 percent.

The numbers meant little on their own. What mattered was what the model could do. I fed it a screenplay and asked about the protagonist's emotional wound. It responded not with a textbook definition, but with a specific analysis grounded in the actual text. It quoted dialogue. It traced the arc through scenes. It understood.

The path to this moment was longer than the training run. Months of collecting examples. Weeks of debugging why the model would sometimes produce gibberish. Days of fine-tuning hyperparameters that sounded like incantations from a technical grimoire. Learning rate schedules. Attention patterns. Quantization schemes.

The final configuration used a technique called FP8 quantization, which compresses the model's weights while preserving accuracy. This was not optional. Without compression, the model exceeded the memory of even a high-end graphics card. With it, the system could hold an entire feature-length screenplay in context at once, roughly sixty-four thousand tokens of simultaneous attention.
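The back-of-envelope arithmetic makes the constraint concrete. Assuming a 30B-parameter model (the Nemotron 30B mentioned earlier), and counting weights only, since KV cache and activations add more on top:

```javascript
// Memory footprint of model weights alone, in gigabytes.
// params * (bits per weight) / 8 bits-per-byte / 1e9 bytes-per-GB.
function weightGigabytes(params, bitsPerWeight) {
  return (params * bitsPerWeight) / 8 / 1e9;
}

const fp16 = weightGigabytes(30e9, 16); // 60 GB: beyond any single consumer GPU
const fp8  = weightGigabytes(30e9, 8);  // 30 GB: within reach of a high-end card
```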

I had started this project believing I could build something useful. By Christmas, I had built something that occasionally surprised me with its insights. Not artificial general intelligence. Not even close. But a tool sharp enough to be worth sharpening further.

The Great Simplification

The decision to delete six months of work took less than a minute to make.

I had built an elaborate retrieval system. Knowledge graphs connecting characters to scenes to themes. Vector databases that could find similar passages across thousands of screenplays. Hybrid search algorithms that combined structural queries with semantic similarity. The kind of architecture that looked impressive on a whiteboard.

It did not work.

Not completely. The system could retrieve relevant information. But when I compared its analysis to simply loading the entire screenplay into the model's context window, the simpler approach won every time. The retrieval system introduced noise. It fragmented context. It made the model worse at understanding story.

This should not have surprised me. A screenplay is not a database to be queried. It is a continuous narrative where meaning accumulates across pages. Cutting it into retrievable chunks destroys the very thing that makes it coherent.

Modern language models can hold a hundred pages in their working memory. Ninety-three percent of feature-length screenplays fit comfortably within that limit. The elaborate machinery I had built was solving a problem that did not exist.

I archived everything on December 26th. Not deleted, archived. The code still exists, a monument to a reasonable hypothesis that reality rejected. Sometimes the most productive thing you can do is recognize when you are wrong and move on.

Building the Factory

Training a model is straightforward. Preparing the data to train it is not.

I needed tens of thousands of examples showing how a skilled reader would analyze a screenplay. Not just any analysis, but the specific kind of craft-focused feedback that makes writers better. Ghost wounds and character wants. Dialogue subtext and scene function. The invisible architecture that separates memorable stories from forgettable ones.

No dataset like this existed. I would have to build it.

The system that emerged was called the orchestrator, a program that coordinated eight specialized scripts, each generating different types of training examples. One focused on character psychology. Another on story structure. A third on dialogue craft. Together, they would produce over forty thousand examples.

But scale creates its own problems. At forty thousand examples, even a small error rate compounds into thousands of corrupted entries. I discovered screenplays mismatched with their titles. Outputs truncated mid-sentence. The same question repeated hundreds of times. Sycophantic phrasings that would teach the model to be agreeable rather than honest.

The quality audit became as important as the generation itself. Every dataset passed through filters that checked for completeness, consistency, and craft integrity. The orchestrator learned to validate its own output, flagging suspicious patterns for human review.

By the end of November, the factory was running smoothly. Twenty examples per minute, each one checked and validated before joining the training corpus. The tedious infrastructure that makes everything else possible.

Dramaturg Emerges

Every screenwriter knows the moment when a character finally speaks in their own voice. I was trying to teach a machine to recognize it.

The project had evolved beyond its original conception. What started as an experiment in screenplay analysis had become something more ambitious: an AI dramaturg, a system that could serve as a first reader and craft consultant for writers who could not afford industry gatekeepers.

Professional script coverage costs hundreds of dollars and takes weeks to receive. The feedback is often superficial, a verdict without explanation. Writers iterate in the dark, making changes they hope will improve their chances without understanding why.

I wanted to build something that could provide immediate, substantive feedback. Not a replacement for human readers, but a tool for the drafts between drafts. A way for writers to see their own work more clearly before it reaches professional hands.

The technical challenge was teaching the model to distinguish between different types of questions. Asking for the meaning of a craft concept requires generic explanation. Asking about a specific character requires textual analysis. The same model needed to do both, switching modes based on context.

Getting this wrong was easy. A model trained only on craft explanations would hallucinate details about characters it had never read. A model trained only on textual analysis would fail to explain the principles behind its observations. The solution required carefully balanced training data and explicit task signals that told the model what kind of response was expected.

Training at Scale

The first model that produced coherent output was version forty-three. The forty-two that preceded it taught me more about machine learning than any course or paper.

Fine-tuning a large language model is less like programming and more like teaching. You cannot specify exactly what you want. You can only show examples and hope the model generalizes in the direction you intend.

The most important lesson was about data balance. My initial training corpus included a small number of extremely high-quality examples, what I called the golden dataset. These were perfect demonstrations of craft analysis, carefully curated and validated. The model latched onto them too strongly. When asked about any screenplay, it would default to patterns from the golden examples, ignoring the actual text in front of it.

The fix was counterintuitive: dilute quality with quantity. Keep the golden examples below seven percent of the training mix. Surround them with thousands of good-but-not-perfect examples that would teach the model to be flexible rather than rigid.
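The dilution step can be sketched as a sampling constraint: given the general pool, cap how many golden examples may enter the mix so they stay under the target fraction. The function name, data shapes, and seven percent cap as a parameter are illustrative, not the project's actual pipeline:

```python
import random

def build_training_mix(golden, general, golden_cap=0.07, seed=7):
    """Compose a training corpus that keeps curated 'golden' examples
    below a fixed fraction of the total, diluting them with a larger
    pool of good-but-imperfect examples."""
    rng = random.Random(seed)
    # Largest golden count n satisfying n / (n + len(general)) <= golden_cap
    max_golden = int(golden_cap * len(general) / (1 - golden_cap))
    sampled = rng.sample(golden, min(len(golden), max_golden))
    mix = sampled + list(general)
    rng.shuffle(mix)  # interleave so golden examples are not clustered
    return mix
```

Shuffling matters as much as the cap: if the golden examples arrive in one contiguous run, the model can still overfit to them within a training phase.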

I also learned to test at checkpoints. Training a model to completion takes hours. Waiting until the end to discover a fundamental problem wastes those hours entirely. Now I test at ten percent, twenty-five percent, and fifty percent completion. A simple sanity check, something as basic as asking for two plus two, catches catastrophic failures early enough to correct course.
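The checkpoint discipline reduces to a loop with sanity probes at fractional milestones. The training and generation callables here are hypothetical stand-ins for a real trainer and model, but the control flow is the point:

```python
def sanity_ok(generate):
    """Run trivial probes through the current model. Catastrophic
    failures (gibberish, empty output) fail even these."""
    probes = {"What is 2 + 2?": "4"}
    return all(expected in generate(prompt)
               for prompt, expected in probes.items())

def train_with_checkpoints(total_steps, train_step, generate,
                           milestones=(0.10, 0.25, 0.50)):
    """Training-loop sketch: train_step advances one optimization step,
    generate samples from the current model. Abort early if a sanity
    probe fails at any milestone, saving the remaining hours."""
    checkpoints = {int(m * total_steps) for m in milestones}
    for step in range(1, total_steps + 1):
        train_step(step)
        if step in checkpoints and not sanity_ok(generate):
            raise RuntimeError(f"sanity check failed at step {step}")
    return "completed"
```

The probe does not need to be clever; it only needs to be cheap enough to run often and basic enough that any model worth continuing to train passes it.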

By version forty-nine, the model was stable. Not perfect, but reliable enough to build upon.

The Multi-Agent Experiment

While waiting for model training to complete, I grew restless. The progress bars moved slowly, and idle hands sought problems to solve. I had been reading about multi-agent frameworks, architectures where specialized AI systems collaborate like a team of engineers. The concept was seductive: a supervisor agent coordinating specialists, debugging loops catching errors, validators ensuring correctness. What if I built one for code generation?

The framework took shape over several weeks. A planner broke problems into steps. A coder implemented solutions. A validator tested outputs. A debugger fixed failures. Each agent had a narrow focus, and a supervisor orchestrated their collaboration. I tested it against six programming exercises from the Exercism benchmark, then ran the same problems through an existing AI coding assistant to compare scores.

Around this time, I discovered a technology that seemed to change everything. A new inference engine promised five times the performance of standard approaches. If I could generate multiple code solutions at different temperature settings, the multi-agent framework would work like a room full of engineers brainstorming together. Different approaches, different perspectives, converging on the best solution.
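The idea under test was a best-of-N pattern: sample one candidate per temperature, then let a validator pick. A minimal sketch, with `generate` and `validate` as hypothetical callables standing in for the inference engine and the test harness:

```python
def best_of_n(generate, validate, temperatures=(0.2, 0.6, 1.0)):
    """Best-of-N sampling sketch: draw one candidate solution per
    temperature setting and return the first that passes validation."""
    candidates = [generate(t) for t in temperatures]
    for cand in candidates:
        if validate(cand):
            return cand
    # Nothing validated: fall back to the lowest-temperature sample.
    return candidates[0]
```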

I was wrong.

The experiments revealed something I had not anticipated. The underlying model, a capable coding assistant, already had debugging and reasoning built into its architecture. Generating multiple versions of a solution did not produce meaningfully different approaches. The model gave its best answer on the first attempt. Asking it to try again was like showing someone a white wall and requesting fifty different ways to describe its color. The wall is white. The answer is the answer.

The multi-agent framework consumed time and computational resources without improving results. Two or three generations from a single capable model matched or exceeded the elaborate orchestration system, and finished faster. The architecture I had spent weeks building was solving a problem that did not exist for code generation.

A single genius describes a formula in one way. You can ask for fifty variations, but the formula produces the same result regardless of how it is expressed. For coding tasks, at least, the lesson was clear: complexity without benefit is just waste. Whether this applies to narrative understanding remains an open question. Stories are not formulas. But the experience taught me to test assumptions before building elaborate systems around them.

The Retrieval Question

Before I understood that simpler was better, I believed in elaborate retrieval systems.

The theory was compelling. Screenplays contain dense networks of relationships. Characters connect to scenes connect to themes connect to arcs. A knowledge graph could capture this structure explicitly. Vector embeddings could enable semantic search across thousands of documents. Hybrid queries combining both approaches would surface precisely the information needed for any analysis.

I built it all. A graph database tracking every relationship I could identify. A vector store holding embeddings of every dialogue block and scene description. A query layer that could translate natural language questions into coordinated searches across both systems.

The technical achievement was real. The system could answer questions like which characters appear together most frequently, or find scenes with similar emotional tone across different screenplays. It could trace character mentions through a script and visualize the resulting network.
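The co-occurrence query, at least, is simple to state precisely. A toy version, with scenes reduced to sets of character names (the names are invented for illustration):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(scenes):
    """Count how often each pair of characters shares a scene: the kind
    of structural query the graph layer answered. `scenes` is a list of
    sets of character names."""
    counts = Counter()
    for scene in scenes:
        # Sort so (A, B) and (B, A) land in the same bucket.
        for pair in combinations(sorted(scene), 2):
            counts[pair] += 1
    return counts

# Example: ANA and BEN share three scenes, every other pair only one.
scenes = [{"ANA", "BEN"}, {"ANA", "BEN", "CARA"}, {"ANA", "BEN"}]
```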

But these were not the questions writers actually ask. Writers want to know why a scene feels flat. They want to understand what their protagonist really wants. They want craft guidance, not database queries.

The retrieval system answered the wrong questions eloquently. It was a solution looking for a problem. When I finally admitted this to myself, months of work became a learning experience rather than a product. Expensive education, but education nonetheless.

The Hyperparameter Hunt

There is a formula for the optimal learning rate when fine-tuning transformer models. I did not know this when I started. I do now.

The learning rate controls how aggressively a model updates its weights during training. Too high, and the model oscillates wildly, unable to converge. Too low, and training takes forever, potentially getting stuck in suboptimal configurations. The right value depends on the model architecture, the dataset size, and factors I am still not sure I fully understand.

Early experiments used learning rates borrowed from papers and tutorials. They were wrong for my specific configuration. The model would train for hours and produce incoherent outputs. Or it would memorize the training data rather than generalizing from it. Or it would suddenly degrade after showing promising progress.

The breakthrough came from a research paper that proposed a scaling formula. Learning rate should decrease as model size increases, following a specific power law. For the fourteen-billion parameter model I was using, the formula suggested a rate roughly ten times lower than what most tutorials recommended.

The other critical parameter was something called Adam beta two, which controls how the optimizer tracks gradient momentum. Standard values caused instability in my configuration. A lower value, point nine five instead of point nine nine nine, stabilized training dramatically.
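Both adjustments fit in a few lines. The base rate, reference size, and exponent below are illustrative placeholders, not the constants from the paper; only the direction (larger model, smaller rate) and the beta two value follow the text:

```python
def scaled_lr(n_params: float, base_lr: float = 1e-4,
              n_ref: float = 1e9, exponent: float = -0.5) -> float:
    """Power-law learning-rate scaling: the rate shrinks as parameter
    count grows. Constants here are toy values for illustration."""
    return base_lr * (n_params / n_ref) ** exponent

lr = scaled_lr(14e9)        # roughly 2.7e-5 with these toy constants
optimizer_config = {
    "lr": lr,
    "betas": (0.9, 0.95),   # beta2 lowered from the common default 0.999
}
```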

These sound like minor technical details. They are. But minor technical details determine whether months of work produce a useful system or an expensive failure.

First Models

The first model that could analyze a screenplay was not very good at it.

It could identify characters and count their dialogue lines. It could recognize scene headings and estimate page counts. It could detect whether a script followed conventional formatting. The mechanical aspects of screenplay parsing came quickly.

Understanding story was harder. The model would identify a protagonist but struggle to articulate their arc. It could locate plot points but not explain their function. It recognized dialogue without hearing subtext.

The limitation was training data. Base language models know everything and nothing. They have seen billions of words but lack specialized understanding of craft. Teaching them required examples of expert analysis, and examples required experts.

I started collecting. Published analyses of classic films. Academic papers on narrative structure. Screenwriting manuals that explained the principles behind effective storytelling. Each source contributed to a growing corpus of craft knowledge.

The technical approach was called LoRA, a method for efficiently adapting large models without retraining them from scratch. Instead of modifying billions of parameters, LoRA adds small adapter modules that learn task-specific patterns. A fourteen-billion parameter model could be fine-tuned on consumer hardware, something that would have been impossible just a few years earlier.
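The arithmetic behind LoRA is small enough to write out. A minimal sketch of the low-rank update (not a drop-in for any particular library): the frozen weight stays untouched while two narrow matrices learn the correction, so a layer with in and out dimensions of 4096 trains about 65 thousand parameters at rank eight instead of nearly 17 million.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: frozen weight W of shape (out, in), plus a
    trainable low-rank update B @ A scaled by alpha / rank. Only
    rank * (in + out) parameters train instead of in * out."""
    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight                         # frozen base weight
        self.A = rng.normal(0, 0.01, (rank, weight.shape[1]))
        self.B = np.zeros((weight.shape[0], rank))   # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T * self.scale
```

Because B starts at zero, the adapted model is exactly the base model before training begins, which keeps early training stable.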

By the end of June, the model could hold a conversation about craft. Not an expert conversation, but a coherent one. The foundation for everything that followed.

Genesis

The dream of teaching AI to understand story predates the technology to attempt it.

Story has structure. This much was known long before computers existed. Aristotle wrote about beginning, middle, and end. Joseph Campbell mapped the hero's journey across cultures. Blake Snyder codified Hollywood structure into fifteen beats that appear in countless successful films.

If story has structure, structure can be analyzed. If structure can be analyzed, perhaps a machine could learn to perform that analysis. The logic seemed sound. The execution remained out of reach until I decided to build a machine capable of attempting it.

I constructed a workstation designed specifically for AI research. A processor with sixteen cores and thirty-two threads. One hundred twenty-eight gigabytes of memory. Four terabytes of storage fast enough to feed data to the system without bottlenecks. And at the center of it all, a graphics card with thirty-two gigabytes of dedicated memory, the kind of hardware that makes modern machine learning possible. This was not a casual investment. It was a commitment.

My first attempt was a chatbot built on retrieval-augmented generation, the approach everyone seemed to be using at the time. The idea was straightforward: break a screenplay into chunks, store them in a vector database, retrieve relevant pieces when questions arose, and let the language model synthesize an answer from the fragments.
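Stripped of the language model, that pipeline is just chunk, embed, and rank. A self-contained sketch using a toy bag-of-words embedding in place of a real embedding model and vector database:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_retrieve(screenplay, question, chunk_size=50, top_k=3):
    """The chunk-store-retrieve pattern, minus the synthesis step: split
    the script into fixed-size word chunks, embed each, and return the
    chunks most similar to the question. A language model would then
    answer from these fragments alone."""
    words = screenplay.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:top_k]
```

The flaw is visible in the signature: the model only ever sees the top few fragments, so anything that depends on continuity across chunk boundaries is invisible to it.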

It did not work. I did not understand why.

The system would miss obvious plot points. Characters would blur together. The model would confidently assert things that contradicted what was written on the page. I had not yet learned the term for what I was witnessing: lost in the middle, the phenomenon where language models struggle to use information buried in the center of their context window. Nor did I understand how chunking a narrative destroyed the very continuity that made it coherent.

I spent weeks researching solutions. Semantic chunking strategies that tried to preserve meaning across boundaries. A technique called Lumber Chunker that attempted to maintain narrative flow. I read papers and watched tutorials and ran experiments, growing increasingly frustrated that nothing seemed to help.

For testing, I used a screenplay I had written myself. No one knew that story better than I did. I could immediately spot when the model misunderstood a character's motivation or confused one scene with another. I asked the system to reconstruct the entire narrative in prose form, reasoning that if it truly understood the story, it should be able to retell it.

The result was a disaster. Hallucinated scenes. Merged characters. Plot points appearing in the wrong order or not at all. The chunking framework I had carefully constructed was fragmenting the story beyond recognition.

Out of options and running low on theories, I did something I had been avoiding. I pasted my entire screenplay into a commercial AI assistant with a large context window and asked it to analyze the story.

The response changed everything.

The model understood. Not perfectly, but substantively. It identified my protagonist's emotional wound correctly. It traced the arc through the midpoint reversal. It noticed subtext in dialogue that I had buried intentionally. All of this from a single, complete document loaded into memory at once.

The lesson was humbling in its simplicity. The problem was not the model. The problem was not my prompts. The problem was that I had been cutting my story into pieces and expecting the machine to reassemble meaning from fragments. The solution was to stop cutting.

I did not know then that this realization would consume the next eight months of my life. I did not know it would lead through fine-tuning experiments and abandoned architectures. I did not know that the hardest problems would be the ones I had not anticipated. But I knew, finally, that the possibility was real.

The journey continues.


All screenplay and literary analysis performed for educational and research purposes. No copyrighted material is reproduced or distributed. Referenced frameworks and techniques belong to their respective authors and are cited for educational context only.