Overview
The LlaSMol evaluation pipeline consists of three main steps:
- Generation: Generate predictions on test datasets
- Extraction: Extract and format predictions from model outputs
- Metrics: Compute task-specific evaluation metrics
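The three steps above can be pictured as a small chain of functions. The following is only a toy sketch: the function names, the lookup-table "model", and the exact-match score are all stand-ins invented for illustration and do not reflect the actual scripts.

```python
def generate(inputs, model):
    """Step 1: produce raw outputs (here, model is any callable)."""
    return [model(x) for x in inputs]


def extract(raw_outputs):
    """Step 2: normalize raw outputs into comparable predictions."""
    return [o.strip() for o in raw_outputs]


def exact_match(predictions, references):
    """Step 3: fraction of predictions equal to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Stand-in "model": a lookup table with untidy whitespace in its outputs.
toy_model = {"ethanol": " CCO ", "methane": "C\n", "water": "OO"}.get
inputs = ["ethanol", "methane", "water"]
references = ["CCO", "C", "O"]

score = exact_match(extract(generate(inputs, toy_model)), references)
```

Keeping the stages separate means the expensive generation step only has to run once, while extraction and scoring can be repeated on the saved outputs.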
Evaluation Pipeline
1. Generate Predictions
Use generate_on_dataset.py to generate predictions for a specific task.
Parameters
Name or path of the fine-tuned LlaSMol model.
Base model architecture. Auto-detected from model_name if not specified.
Dataset to evaluate on (Hugging Face dataset or local path).
Dataset split to evaluate. Use slice notation (e.g., test[:100]) to evaluate on a subset.
Tasks to evaluate. If None, evaluates on all available tasks.
Directory where prediction files will be saved (one .jsonl file per task).
Number of samples to process in parallel.
Maximum input token length. Uses task-specific or default settings if None.
Maximum tokens to generate. Uses task-specific or default settings if None.
Whether to print predictions during generation.
Device to run on (cuda or cpu). Auto-detected if None.
Additional generation parameters (e.g., num_beams, num_return_sequences).
Output Format
Generates .jsonl files in the output directory, one per task.
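As a rough illustration of the one-file-per-task layout, the sketch below writes a .jsonl file and reads it back using only the standard library. The record fields (input, output) and the file naming pattern are assumptions made for this example; the actual schema written by generate_on_dataset.py may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the field names are illustrative only.
records = [
    {"input": "ethanol", "output": ["CCO"]},
    {"input": "methane", "output": ["C"]},
]

with tempfile.TemporaryDirectory() as tmp:
    # Assumed naming convention: <task>.jsonl
    task_file = Path(tmp) / "name_conversion-i2s.jsonl"
    write_jsonl(task_file, records)
    loaded = read_jsonl(task_file)
```

Because each line is an independent JSON object, a file like this can be appended to during generation and streamed line by line afterwards.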
2. Extract Predictions
Use extract_prediction.py to extract predictions from generation outputs.
Parameters
Directory containing generation outputs from step 1.
Directory where extracted predictions will be saved.
Tasks to process. If None, processes all tasks found in output_dir.
Functionality
The extraction script:
- Reads generation outputs from output_dir
- Extracts predictions based on task-specific tags (if configured)
- Saves formatted predictions to prediction_dir
- Handles multiple prediction sequences per input
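Tag-based extraction over several returned sequences might look roughly like the sketch below. The SMILES tag name, the helper names, and the choice to return None when no tag is found are assumptions for illustration, not the behavior of extract_prediction.py.

```python
import re


def extract_tagged(text, tag):
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    if tag is None:
        # No tag configured for this task: use the whole output, trimmed.
        return text.strip()
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_all(outputs, tag):
    """Apply extraction to every returned sequence for one input."""
    return [extract_tagged(o, tag) for o in outputs]


# Hypothetical raw generations for a single input (two returned sequences).
raw_outputs = [
    "The product is <SMILES> CCO </SMILES> .",
    "I could not determine the product.",
]
predictions = extract_all(raw_outputs, "SMILES")
```

Keeping a None placeholder for sequences without a usable answer preserves the alignment between inputs and predictions, so a later metrics step can count them as misses.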
Output Format
Creates .jsonl files in prediction_dir.
3. Compute Metrics
Use compute_metrics.py to calculate evaluation metrics.
Parameters
Directory containing prediction files from step 2.
Tasks to compute metrics for.
Task-Specific Metrics
Different tasks use different evaluation metrics:
SMILES Tasks (forward_synthesis, molecule_generation, name_conversion-i2s):
- Exact Match
- Validity
- Fingerprint Similarity (Tanimoto)
- MACCS FTS
- RDK FTS
- Morgan FTS
- Exact Match
- Fingerprint Similarity
- Multiple Match (top-k accuracy)
- BLEU-2, BLEU-4
- ROUGE-1, ROUGE-2, ROUGE-L
- METEOR
- Element Match: Compares molecular formulas element-by-element
- Split Match: Tokenized comparison of IUPAC names
- MAE (Mean Absolute Error)
- RMSE (Root Mean Square Error)
- Accuracy
- Precision
- Recall
- F1 Score
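Some of the metrics listed above can be sketched in plain Python to show what they measure. This is a simplified illustration, not the pipeline's code: fingerprints are represented as sets of on-bit indices (real MACCS, RDK, or Morgan fingerprints would typically come from a cheminformatics toolkit such as RDKit), and the formula parser handles only simple formulas without brackets or charges. All function names are hypothetical.

```python
import re
from collections import Counter


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def element_counts(formula):
    """Parse a simple formula such as 'C2H6O' into per-element counts."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def element_match(pred, gold):
    """True when both formulas contain the same elements with the same counts."""
    return element_counts(pred) == element_counts(gold)


def top_k_match(candidates, gold, k):
    """True when the gold answer appears among the first k candidates."""
    return gold in candidates[:k]
```

For example, tanimoto({1, 2, 3}, {2, 3, 4}) shares two bits out of four distinct bits, and element_match("C2H6O", "OC2H6") is true because element order does not matter.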
Output Format
Prints metrics to console.
Supported Tasks
The following chemistry tasks are supported:
Forward Synthesis
Predict reaction products from reactants
Retrosynthesis
Predict reactants from products
Molecule Captioning
Generate text descriptions of molecules
Molecule Generation
Generate SMILES from text descriptions
IUPAC to Formula
Convert IUPAC names to molecular formulas
IUPAC to SMILES
Convert IUPAC names to SMILES
SMILES to Formula
Convert SMILES to molecular formulas
SMILES to IUPAC
Convert SMILES to IUPAC names
Property Prediction
Predict molecular properties (ESOL, LIPO, BBBP, etc.)
Task Configuration
Task-specific generation settings are defined in config.py.
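The fallback described for the token-length parameters (an explicit value if given, otherwise task-specific or default settings) could be modeled as below. The dictionary names, keys, and numeric values are hypothetical and chosen only for illustration; the real structure of config.py may differ.

```python
# Hypothetical defaults and per-task overrides; not the actual contents of config.py.
DEFAULT_SETTINGS = {"max_input_tokens": 512, "max_new_tokens": 256}
TASK_SETTINGS = {
    "molecule_generation": {"max_new_tokens": 512},
    "forward_synthesis": {},
}


def resolve_setting(task, name, user_value=None):
    """Prefer an explicit user value, then the task override, then the default."""
    if user_value is not None:
        return user_value
    return TASK_SETTINGS.get(task, {}).get(name, DEFAULT_SETTINGS[name])


resolved = {
    task: resolve_setting(task, "max_new_tokens")
    for task in ("molecule_generation", "forward_synthesis")
}
```

Resolving settings in one place keeps the precedence order explicit and lets a task omit any key it does not need to override.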
Complete Evaluation Workflow
Example: Evaluate on Name Conversion
Example: Evaluate All Tasks
Example: Evaluate with Custom Settings
Programmatic Usage
Generate Function
Extract Predictions
Compute Metrics
Metrics Utilities
The utils/metrics.py module provides metric calculation functions:
- calculate_smiles_metrics(): SMILES validity, exact match, fingerprint similarity
- calculate_text_metrics(): BLEU, ROUGE, METEOR for text generation
- calculate_formula_metrics(): Element matching for molecular formulas
- calculate_number_metrics(): MAE, RMSE for numerical predictions
- calculate_boolean_metrics(): Accuracy, precision, recall, F1 for binary classification
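To make the numerical and binary-classification metrics concrete, here is a minimal standalone sketch. The function names are hypothetical, and the signatures and return values of the real functions in utils/metrics.py are not shown in this page, so treat this only as a reference for the underlying formulas.

```python
import math


def mae(preds, golds):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)


def rmse(preds, golds):
    """Root mean square error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds))


def binary_metrics(preds, golds):
    """Accuracy, precision, recall, and F1 for boolean predictions."""
    pairs = list(zip(preds, golds))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Returning 0.0 when a denominator is zero is one common convention; other implementations may raise an error or report the value as undefined.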
File Locations
- LLM4Chem/generate_on_dataset.py: Dataset generation script
- LLM4Chem/extract_prediction.py: Prediction extraction script
- LLM4Chem/compute_metrics.py: Metrics computation script
- LLM4Chem/config.py: Task configurations
- LLM4Chem/utils/metrics.py: Metric calculation functions
The evaluation pipeline automatically handles task-specific settings from
config.py, including generation parameters and metric selection.