• Question - How does OpenAI take in new evaluation data in a crowd-sourced manner?
  • Context -
    • My friend and I have been looking at International Mathematical Olympiad (IMO) problems and exploring how we might solve them using language models. We started running a number of reasoning evals in particular: portions of GSM8K and miniF2F, plus a lot of IMO problems. Evals are a fascinating topic because only 3 to 4 primary reasoning evals for math are in common use at the moment, which got us thinking about what it would be like to create our own dataset of evals. We wanted to check out [OpenAI’s evaluation repository, openai/evals](https://github.com/openai/evals), a framework for evaluating LLMs and LLM systems and an open-source registry of benchmarks, where anyone can submit a pull request and contribute data.
  • Answer -
    • Because we were already looking at math Olympiad data, we decided to submit IMO problems from the official IMO website. We also considered submitting Lojban evals, but after spending a few hours trying to understand Lojban, we found it very hard with poor documentation online, so we discarded that idea and stuck with IMO problems.
  • Primarily for personal learning or novel discovery? - Personal learning
  • What is the delta? -
  • Observations -
    • I learned a lot combing through the dataset of IMO problems. One big learning was that I could not solve a single IMO problem myself, but I could still curate the dataset.
    • Evals in the repository are categorized as basic or complex, depending on the evaluation method required:
      • Basic evals check whether the model’s response exactly matches, includes, or fuzzy-matches the ideal answer.
      • Complex evals involve programmatic manipulation of the response data, reference external datasets, or use evaluations conducted by another language model.
    • The data we curated was a subset of IMO problems listed on the official IMO website. We only selected problems that had either a yes/no answer or a single numeric answer. This way we could submit a basic eval rather than a complex one: if the model arrives at the correct answer after some chain-of-thought reasoning, evaluating its correctness is automatic (a sketch of what such a sample might look like follows this list). We curated about 15 problems total from roughly five years of official IMO practice problems.
    • I’m reasonably confident that the data was not previously in LLM training sets, because we had to put the PDF through a LaTeX converter and comb through the problems for the ones that fit our criteria.
    • Overall, I learned that it takes a lot of time to make even 15 problems, and you need real discernment as the data curator. In this case I didn’t need to know how to solve IMO problems, but I did need to judge whether each problem was a good fit for our dataset criteria and fix any errors from the PDF-to-LaTeX conversion.
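
A minimal sketch of what one of these basic-eval samples might look like, assuming the chat-style `input`/`ideal` JSONL sample format that openai/evals describes for its basic match evals; the file name, system prompt, and problem text here are placeholders, not the actual curated data:

```python
import json

# Each sample pairs a prompt with a single ideal answer, so a basic
# (exact / includes / fuzzy match) eval can grade it automatically.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single number, or Yes/No."},
            {"role": "user", "content": "<IMO problem statement, converted from PDF to LaTeX>"},
        ],
        "ideal": "2015",  # the single numeric (or yes/no) answer we filtered for
    },
]

# One JSON object per line: the JSONL file a registry entry points at.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

In the repository, a short registry YAML entry then points one of the basic eval classes (exact match, includes, or fuzzy match) at a samples file like this one.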

Methodology for choosing problems for the dataset:

  1. Selection of Problems: Select math Olympiad problems from official IMO problem sheets, chosen for their objective nature, and filter for problems with clear, single-response answers (yes/no or a single number).
  2. Submission: Submit a Pull Request (PR) to OpenAI/evals containing the chosen IMO problems.
  3. Response Handling: Configure the eval within the OpenAI/evals framework so responses are processed and graded as a basic eval, since the answers are simple match targets.
  4. Evaluation: After the responses are generated, evaluate them for correctness, focusing on how each response matches the expected answer (exact, include, or fuzzy); a sketch of this matching logic follows the list.
  5. Assessment of Outcomes: Analyze the outcomes to determine the effectiveness of using OpenAI/evals for this type of problem. This includes measuring accuracy rates and identifying any patterns in response errors.
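
To make the matching step concrete, here is a rough sketch of the three basic matching modes (exact, include, fuzzy). It is only an illustration of the idea under simple assumptions about normalization, not the framework’s actual implementation:

```python
def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods for lenient comparison."""
    return text.strip().strip(".").lower()

def exact_match(response: str, ideal: str) -> bool:
    # The response must equal the ideal answer after normalization.
    return normalize(response) == normalize(ideal)

def includes_match(response: str, ideal: str) -> bool:
    # The ideal answer must appear somewhere in the response.
    return normalize(ideal) in normalize(response)

def fuzzy_match(response: str, ideal: str) -> bool:
    # Lenient check: either string contains the other after normalization.
    r, i = normalize(response), normalize(ideal)
    return r in i or i in r

# A chain-of-thought response that ends with the right number still passes
# the includes/fuzzy checks, while exact match requires the bare answer.
print(exact_match("2015", "2015"))                    # True
print(includes_match("The answer is 2015.", "2015"))  # True
print(fuzzy_match("yes", "Yes"))                      # True
```

Once an eval is registered in the repository, it can be run against a chosen model with the `oaieval` command-line tool, which reports aggregate accuracy over all samples.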

Tags

  • personal mnemonic: firelord ozai publicly sets openai on fire
  • type: #try-repo
  • theme: #evals
  • status: #done