
Evaluating progress of LLMs on scientific problem-solving


Programmatic and model-based evaluations

Tasks in CURIE are varied, and their ground-truth annotations come in mixed, heterogeneous forms, e.g., JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take differing forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
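The grid-point example illustrates why purely programmatic metrics are brittle: two equivalent answers can differ in surface form. A minimal sketch of the kind of normalization a programmatic comparison would need (the `parse_grid` helper is hypothetical, not part of CURIE):

```python
import re

def parse_grid(text: str) -> tuple[int, ...]:
    """Normalize grid-point strings such as '[4, 4, 2]' or '4 × 4 × 2'
    into a canonical tuple of integers (hypothetical helper for illustration)."""
    # Extract all integer runs, ignoring brackets, commas, and multiplication signs.
    return tuple(int(n) for n in re.findall(r"\d+", text))

# Both surface forms normalize to the same tuple, so they compare equal.
same = parse_grid("[4, 4, 2]") == parse_grid("4 × 4 × 2")
```

Even simple normalizers like this fail once answers involve synonyms or descriptive prose, which motivates the model-based metrics below.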

(1) LMScore: Prompts an LLM to judge how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has only a few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the label tokens to produce a final confidence.

(2) LLMSim: Used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and to output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once the ground-truth records are matched with predicted records, we can measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
