Experiment Length:
Estimated Time: 3 days
Actual Time: 4 hours
Summary:
In this experiment, I tackled the challenge of ensuring LLM accuracy on coding tasks without human intervention, addressing the problem of unreliable synthetic data generation. Traditionally, language model performance is measured against human-curated test sets; more recently, evals based on synthetic data or model-graded judgments have also emerged.
Here, I proposed a simple self-aligning system that iteratively generates (1) code and (2) test sets until the test suite passes with 100% accuracy, enhancing both accuracy and reliability autonomously.
Plan:
I implemented a system in Python using GPT-4 that iteratively generates test sets and code, repeating until 100% of the tests pass.
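Concretely, the loop looks roughly like the sketch below. This is a simplified illustration rather than my exact implementation: the prompts, helper names, and timeout are assumptions, and in practice the model's output also needs markdown fences stripped before it can be executed.

```python
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()
MAX_RETRIES = 5  # the retry budget I mention under "What I Learned"


def generate(prompt: str) -> str:
    """Ask GPT-4 for a completion and return its raw text."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def self_align(task: str):
    """Regenerate code and tests until every test passes, or give up."""
    for _ in range(MAX_RETRIES):
        code = generate(f"Write only a Python function (no prose, no markdown) for this task:\n{task}")
        tests = generate(f"Write only assert-based tests (no prose, no markdown) for this function:\n\n{code}")
        # Run code + tests together in a throwaway script; exit code 0 means every assert passed.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + tests)
            script = f.name
        result = subprocess.run(["python", script], capture_output=True, timeout=30)
        if result.returncode == 0:
            return code, tests  # 100% of tests passed
    return None  # gave up after MAX_RETRIES attempts
```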
I'd hoped to evaluate the system by the percentage of successful code generations and test alignments, with alignment success rate as the dependent variable and the kinds of generated items (code, tests, etc.) as the independent variables. Unfortunately, my time on this experiment was cut to 4 hours instead of 3 days, so I only had time for qualitative checks on coding tasks rather than a proper evaluation. That will have to wait for a future experiment.
Results: 🟢 Pass
Given the time I had, I was able to generate a function and a corresponding array of tests, with their formats verified as correct by a secondary model.
What I Learned:
Format checking was a bit finicky. With more time, I would have tried integrating OpenAI's function calling to enforce output format correctness and reduce reliance on a secondary verification model.
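For reference, here is a hedged sketch of what that could look like with the tool/function-calling API; the tool name and schema are hypothetical, not something I actually ran.

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical schema: force the model to return tests as a JSON array of assert statements.
tools = [{
    "type": "function",
    "function": {
        "name": "emit_tests",
        "description": "Return generated unit tests as structured data.",
        "parameters": {
            "type": "object",
            "properties": {
                "tests": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "One assert statement per entry.",
                },
            },
            "required": ["tests"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write unit tests for a function that reverses a string."}],
    tools=tools,
    # Forcing this tool means the reply arrives as parseable JSON arguments.
    tool_choice={"type": "function", "function": {"name": "emit_tests"}},
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
tests = args["tests"]  # a plain list of assert strings, no secondary model needed
```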
I also noticed the significant computational cost of extensive text generation. When a first generation was invalid, or a test didn't pass, the system was given 5 tries to regenerate. Given that only single functions and unit tests were generated here, I can imagine this becoming quite costly when generating multiple functions and integration tests.
My future experiment would likely involve running this system across a dataset like SWE-bench.
This post was generated by running my experiment notes through ChatGPT.
parent::Back-translation for Alignment of LLM Generation of Code, Tests, and more