Overview

The agent meant to do the actual development work. From what I can tell, this is the most complex agent, and it leverages a great deal of context.

Members

set_up_environment

Project Step: environment_setup

First, it loops through the system dependencies provided by get_architecture.

The system dependency objects returned from the LLM each include a test command to verify that the dependency is installed (neat idea). If it isn’t installed, the agent just prints a message about needing to install it before continuing.
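
A minimal sketch of what that check might look like; the dependency shape and the helper name here are my guesses, not the project’s actual code:

    import subprocess

    def check_system_dependencies(dependencies):
        """Loop over the dependencies returned by get_architecture and verify
        each one by running its LLM-provided test command."""
        for dep in dependencies:
            # Each dependency object carries a shell command that proves it is
            # installed, e.g. {"name": "node", "test": "node --version"}.
            result = subprocess.run(dep["test"], shell=True, capture_output=True)
            if result.returncode != 0:
                # No automatic install: just tell the user to sort it out
                # before continuing.
                print(f"Please install {dep['name']} before continuing.")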

start_coding

Internally we start tracking what percentage of the features have been implemented. Once we hit 50%, we call out to the TechnicalWriter to create a README. It looks like at some point they want to flesh out a license and API documentation as well.

We loop through each dev task in the development plan generated by the tech lead (each just a str), setting the project’s current-task properties to the values that define the task. For each dev task, we call implement_task.

After all tasks are processed, we call back to the TechnicalWriter to rewrite our README.
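
Pieced together, the flow reads roughly like the sketch below. The names (create_readme, development_plan, current_task) are illustrative stand-ins for whatever the project actually calls them:

    def start_coding(project, technical_writer):
        tasks = project.development_plan  # dev tasks from the tech lead
        readme_written = False

        for i, dev_task in enumerate(tasks):
            # Track how far along we are; at the halfway mark, hand off to the
            # TechnicalWriter to produce a first README.
            progress = (i / len(tasks)) * 100
            if progress >= 50 and not readme_written:
                technical_writer.create_readme(project)
                readme_written = True

            # Copy the task's defining properties onto the project's
            # "current task" state, then implement it.
            project.current_task = dev_task
            implement_task(project, dev_task)

        # Once every task is done, the README is rewritten to match the final code.
        technical_writer.create_readme(project)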

Feature Complete!

implement_task

I keep seeing a check that treats the first element in the project’s dev_steps_to_load as the last element to be processed when its prompt_path property is set to the string breakdown.prompt. Given the name, I assumed dev_steps_to_load was about rehydrating state from the DB, but it seems to be used in the normal flow as well?
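
As far as I can tell, the check amounts to something like this (paraphrased):

    def is_last_development_step(project) -> bool:
        # If the first queued step points at breakdown.prompt, treat it as the
        # final element to process rather than a state-rehydration step.
        steps = getattr(project, "dev_steps_to_load", [])
        return bool(steps) and steps[0].get("prompt_path") == "breakdown.prompt"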

If we aren’t on the final task, we prompt the LLM to tell us the code that needs to be written (breakdown.prompt). The prompt carries the entire project context: the features, file list, all development tasks, and technical details; a rough sketch of what rendering it might look like follows the list below. Some things I find interesting:

  1. The sheer size of this prompt. GPT-4 must have an absolutely massive working context for this to be coherent. Or is this a side effect of a very effective prompt?
  2. The output format for the code files is never specified; are they relying on GPT-4’s training to always output code in the same format?
  3. There seems to be a mechanism for user input to self-correct, but it’s unused. That isn’t abnormal; it’s good to have flexibility in the prompt template, and it’s why I like that they’re using Jinja here. I’m just curious when it stopped being utilized and why.
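
For what it’s worth, rendering a Jinja template with that much project context would look something like the sketch below; the template variables are my guesses at what gets passed in, not the project’s actual prompt interface:

    from jinja2 import Environment, FileSystemLoader

    def render_breakdown_prompt(project) -> str:
        env = Environment(loader=FileSystemLoader("prompts/development"))
        template = env.get_template("breakdown.prompt")
        # Everything the agent knows about the project goes into one prompt:
        # features, file list, all development tasks, and technical details.
        return template.render(
            name=project.name,
            features=project.features,
            files=project.file_list,
            development_tasks=project.development_plan,
            technical_details=project.technical_details,
            current_task=project.current_task,
        )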

We get a response back from the LLM. It looks like we treat the first five and last five words of the message as a prefix and postfix wrapped around the instructions, which we split out into instructions_prefix and instructions_postfix.
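
In other words, something along these lines (paraphrased):

    def extract_instruction_markers(response: str) -> tuple[str, str]:
        """Take the first five and last five words of the LLM's response and
        use them as markers around the instructions."""
        words = response.split()
        instructions_prefix = " ".join(words[:5])
        instructions_postfix = " ".join(words[-5:])
        return instructions_prefix, instructions_postfix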

We then send a message referring to the output of our last message, asking the LLM to parse it and including the prefix and postfix we extracted. The call includes a function definition named parse_development_task that returns a JSON object with an array called “tasks” (a hypothetical version of that schema is sketched after this list). Each entry in the tasks array can be one of three things:

  • A command that needs to be run to execute the action
  • A file that needs to be created or updated
  • A notification that human intervention is required
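
A function definition along these lines would produce that shape; the field names and enum values below are my assumptions rather than the actual schema:

    PARSE_DEVELOPMENT_TASK = {
        "name": "parse_development_task",
        "description": "Parse the previous breakdown into concrete task steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "tasks": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            # Discriminator: which of the three step kinds this is.
                            "type": {
                                "type": "string",
                                "enum": ["command", "code_change", "human_intervention"],
                            },
                            # When type == "command": the shell command to run.
                            "command": {"type": "string"},
                            # When type == "code_change": the file to create or update.
                            "code_change_description": {"type": "string"},
                            # When type == "human_intervention": what the user must do.
                            "human_intervention_description": {"type": "string"},
                        },
                        "required": ["type"],
                    },
                }
            },
            "required": ["tasks"],
        },
    }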

We now have our development_task object as defined above, which we execute using execute_task.

execute_task

Here’s where we get into some interesting conversation branch management. After some state management code around loading / continuing from a certain step, we generate a UUID for the branch we’re about to create in the conversation.

We loop over the task steps generated by the parse_task prompt, branching on the step’s type property and executing a different function depending on the value (step_save_file, step_command_run, step_delete_file, step_human_intervention).
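
Put together, execute_task reads roughly like this sketch. The branch handling is simplified, save_branch is my guess at the conversation API, and the step type strings are assumptions (note that delete_file shows up in the handlers even though the parse schema above only described three kinds):

    import uuid

    def execute_task(convo, task_steps):
        # Each execution gets its own conversation branch so we can retry or
        # roll back without polluting the main thread.
        branch_name = str(uuid.uuid4())
        convo.save_branch(branch_name)

        for step in task_steps:
            # Dispatch on the step's type, as produced by the parse_task prompt.
            if step["type"] == "command":
                step_command_run(convo, step)
            elif step["type"] == "code_change":
                step_save_file(convo, step)
            elif step["type"] == "delete_file":
                step_delete_file(convo, step)
            elif step["type"] == "human_intervention":
                step_human_intervention(convo, step)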

step_command_run

The action for when the LLM decides it wants to run a command like npm install. A wrapper around run_command_until_success.

step_save_file

The action for when the LLM decides it wants to write or update a file. A wrapper around implement_code_changes.

step_delete_file

The action for when the LLM decides it wants to delete a file. Interestingly, it doesn’t actually delete the file; it just logs that the LLM attempted to.
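
In other words, roughly:

    def step_delete_file(convo, step):
        # No filesystem call here: we only record that the LLM wanted the file gone.
        print(f"LLM requested deletion of {step['path']}; skipping the actual delete.")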

step_human_intervention

The action for when the LLM decides it needs user clarification. It first calls ask_for_human_intervention and provides a callback that lets the user input “R” to run the app for debugging; I assume this is so the user can capture the specific error that’s occurring.

If the user inputs “continue”, we assume all is well and return a success flag.

Otherwise, we run debug, passing the user’s description as the issue description along with the current state of the conversation.
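
Roughly, with the callback wiring and project.run_command being my interpretation rather than the actual code:

    def step_human_intervention(convo, project, step):
        description = step["human_intervention_description"]

        # Callback that lets the user type "R" to run the app, presumably to
        # reproduce a concrete error before describing the problem.
        def run_app():
            run_command_until_success(convo, project.run_command)

        user_input = ask_for_human_intervention(description, run_app_callback=run_app)

        if user_input == "continue":
            # The user says everything is fine; report success and move on.
            return {"success": True}

        # Anything else is treated as the issue description and handed to debug,
        # along with the current conversation state.
        return debug(convo, issue_description=user_input)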