
Evaluating progress of LLMs on scientific problem-solving


Programmatic and model-based evaluations

Tasks in CURIE are varied, and their ground-truth annotations come in mixed, heterogeneous forms, e.g., JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take differing forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
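The grid-point example illustrates why purely programmatic metrics are brittle: two equivalent answers can differ in surface form. A minimal sketch of the kind of normalization a programmatic comparison would need (the `parse_grid` helper is hypothetical, not part of CURIE):

```python
import re

def parse_grid(text: str) -> tuple[int, ...]:
    """Normalize grid-point strings such as '[4, 4, 2]' or '4 × 4 × 2'
    into a canonical tuple of integers (hypothetical helper for illustration)."""
    # Extract all integer runs, ignoring brackets, commas, and multiplication signs.
    return tuple(int(n) for n in re.findall(r"\d+", text))

# Both surface forms normalize to the same tuple, so they compare equal.
same = parse_grid("[4, 4, 2]") == parse_grid("4 × 4 × 2")
```

Even simple normalizers like this fail once answers involve synonyms or descriptive prose, which motivates the model-based metrics below.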

(1) LMScore: Prompts an LLM to judge how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has only a few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the label tokens to produce a final confidence.

(2) LLMSim: Used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and to output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once the ground-truth records are matched with predicted records, we can measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
