LlaSMol Model

LlaSMol (Large Language Model for Small Molecules) is a family of fine-tuned models specifically trained for chemistry tasks using the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.

Overview

Paper: LlaSMol: Advancing Large Language Models for Chemistry
Project: https://osu-nlp-group.github.io/LLM4Chem

LlaSMol models are instruction-tuned on SMolInstruct, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.

Model Variants

All LlaSMol models are available on Hugging Face:

LlaSMol-Mistral-7B

Recommended - Best overall performance
osunlp/LlaSMol-Mistral-7B

LlaSMol-Llama2-7B

Meta’s Llama2 base
osunlp/LlaSMol-Llama2-7B

LlaSMol-CodeLlama-7B

Code-specialized base
osunlp/LlaSMol-CodeLlama-7B

LlaSMol-Galactica-6.7B

Science-focused base
osunlp/LlaSMol-Galactica-6.7B
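The repository IDs above can be kept in one lookup table so that the variant is chosen by a short name. The sketch below is illustrative only: `LLASMOL_VARIANTS` and `resolve_model_id` are hypothetical names, not part of LLM4Chem or ChemAgent, and the default simply mirrors the recommended variant.

```python
# Hugging Face repo IDs copied from the variant list above.
LLASMOL_VARIANTS = {
    "mistral": "osunlp/LlaSMol-Mistral-7B",
    "llama2": "osunlp/LlaSMol-Llama2-7B",
    "codellama": "osunlp/LlaSMol-CodeLlama-7B",
    "galactica": "osunlp/LlaSMol-Galactica-6.7B",
}


def resolve_model_id(variant: str = "mistral") -> str:
    """Map a short variant name to its repo ID (hypothetical helper)."""
    key = variant.strip().lower()
    if key not in LLASMOL_VARIANTS:
        known = ", ".join(sorted(LLASMOL_VARIANTS))
        raise ValueError(f"Unknown LlaSMol variant {variant!r}; expected one of: {known}")
    return LLASMOL_VARIANTS[key]
```

The returned string is what would be passed as the first argument when constructing the generator.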

Model Initialization

ChemAgent uses the Mistral variant by default:

from LLM4Chem.generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    "osunlp/LlaSMol-Mistral-7B", 
    device="cuda"
)

See plan_execute_agent/chem_tools.py:118 for integration.

Hardware Requirements

LlaSMol models require GPU acceleration and sufficient VRAM:

Minimum: 8GB GPU memory
Recommended: 16GB+ for optimal performance
CPU-only: Set LOW_VRAM=True in configuration to disable LlaSMol
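The thresholds above can be turned into a small pre-flight check. This is a minimal sketch under stated assumptions: `vram_tier`, `should_set_low_vram`, and `detect_gpu_memory_gb` are hypothetical helpers, the 8 and 16 GB cut-offs are taken from the list above, and how ChemAgent actually reads `LOW_VRAM` is not shown on this page.

```python
def vram_tier(total_gb: float) -> str:
    """Classify GPU memory against the thresholds listed above."""
    if total_gb >= 16:
        return "recommended"
    if total_gb >= 8:
        return "minimum"
    return "insufficient"


def should_set_low_vram(total_gb: float) -> bool:
    """Suggest LOW_VRAM=True when the GPU is below the stated minimum."""
    return vram_tier(total_gb) == "insufficient"


def detect_gpu_memory_gb() -> float:
    """Return total memory of GPU 0 in decimal GB, or 0.0 without CUDA or PyTorch."""
    try:
        import torch  # optional dependency, imported lazily
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1e9
```

Calling `should_set_low_vram(detect_gpu_memory_gb())` at startup gives a suggested value for the flag on the current machine.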

Supported Tasks

LlaSMol is trained on 14 chemistry tasks across 4 categories:

1. Name Conversion (4 tasks)

Converts between different molecular representations:

IUPAC → Formula
IUPAC → SMILES
SMILES → Formula
SMILES → IUPAC

Task: Convert IUPAC name to molecular formula

query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"
response = generator.generate(query)
# Output: "<MOLFORMULA> C15H11NO </MOLFORMULA>"

See LLM4Chem/README.md:23 for more examples.

Task: Convert IUPAC name to SMILES string

query = "Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>?"
response = generator.generate(query)
# Output: "Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES>."

See LLM4Chem/README.md:30 for more examples.

Task: Convert SMILES to molecular formula

query = "Given <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>, what would be its molecular formula?"
response = generator.generate(query)
# Output: "It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA>."

See LLM4Chem/README.md:37 for more examples.

Task: Convert SMILES to IUPAC name

query = "Translate <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name."
response = generator.generate(query)
# Output: "<IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>"

See LLM4Chem/README.md:44 for more examples.

2. Property Prediction (6 tasks)

Predicts molecular properties from SMILES:

ESOL - Aqueous Solubility

Predicts log solubility in mol/L:

query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
response = generator.generate(query)
# Output: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."

See LLM4Chem/README.md:52 for details.

LIPO - Lipophilicity

Predicts octanol/water distribution coefficient (logD at pH 7.4):

query = "Predict the logD for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>."
response = generator.generate(query)
# Output: "<NUMBER> 1.090 </NUMBER>"

See LLM4Chem/README.md:59 for details.

BBBP - Blood-Brain Barrier

Predicts whether a molecule can penetrate the blood-brain barrier (boolean):

query = "Is BBBP a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"
response = generator.generate(query)
# Output: "<BOOLEAN> Yes </BOOLEAN>"

See LLM4Chem/README.md:66 for details.

Clintox - Toxicity

Predicts whether a molecule is toxic (boolean):

query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"

See LLM4Chem/README.md:73 for details.

HIV - HIV Inhibition

Predicts whether a molecule inhibits HIV replication (boolean):

query = "Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> inhibit HIV?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"

See LLM4Chem/README.md:80 for details.

SIDER - Side Effects

Predicts organ-specific side effects (boolean):

query = "Are there side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"

See LLM4Chem/README.md:87 for details.
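Property predictions come back as tagged text, so downstream code usually needs a typed value. The following sketch converts `<NUMBER>` and `<BOOLEAN>` answers, in the format of the example outputs above, into `float` and `bool`. The function names are hypothetical and not part of LLM4Chem; malformed or missing answers yield `None` rather than raising.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"<NUMBER>\s*(.*?)\s*</NUMBER>", re.DOTALL)
_BOOLEAN = re.compile(r"<BOOLEAN>\s*(.*?)\s*</BOOLEAN>", re.DOTALL)


def parse_number(response: str) -> Optional[float]:
    """Return the first <NUMBER> value as a float, or None if absent or malformed."""
    match = _NUMBER.search(response)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None


def parse_boolean(response: str) -> Optional[bool]:
    """Return True for Yes, False for No, None for anything else."""
    match = _BOOLEAN.search(response)
    if match is None:
        return None
    answer = match.group(1).lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    return None
```

For instance, `parse_number("Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L.")` gives `-1.41`.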

3. Molecule Description (2 tasks)

Generates or interprets molecular descriptions:

Molecule Captioning

Describes a molecule from its SMILES:

query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
response = generator.generate(query)
# Output: "The molecule is an imidazole derivative with short-acting sedative, 
#          hypnotic, and general anesthetic properties. Etomidate appears to have 
#          gamma-aminobutyric acid (GABA) like effects..."

See LLM4Chem/README.md:96 for examples.

Molecule Generation

Generates SMILES from a text description:

query = """Give me a molecule that satisfies: The molecule is a member of the class of 
tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia 
marcescens. It has a role as an antimicrobial agent..."""

response = generator.generate(query)
# Output: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"

For molecule generation, tags are not required in the input description.

See LLM4Chem/README.md:103 for examples.

4. Chemical Reactions (2 tasks)

Predicts reaction products or reactants:

Forward Synthesis
Retrosynthesis

Forward synthesis predicts products from reactants:

query = "<SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product."
response = generator.generate(query)
# Output: "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>."

See LLM4Chem/README.md:115 for examples.

Retrosynthesis predicts reactants from products:

query = "Identify possible reactants for the product: <SMILES> CC1=CC=C(N)N=C1N </SMILES>"
response = generator.generate(query)
# Output: "<SMILES> CC(C#N)CCC#N.N </SMILES>"

See LLM4Chem/README.md:122 for examples.
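In SMILES, a dot separates disconnected components, so a reaction answer such as `CC(C#N)CCC#N.N` names two species. A minimal sketch for splitting a reaction response into its components is shown below; `split_components` is a hypothetical helper and only looks at the first `<SMILES>` span in the response.

```python
import re
from typing import List

_SMILES = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def split_components(response: str) -> List[str]:
    """Split the first tagged SMILES in a response into dot-separated molecules."""
    match = _SMILES.search(response)
    if match is None:
        return []
    # "." is the SMILES separator between disconnected components.
    return [part for part in match.group(1).split(".") if part]
```

Applied to the retrosynthesis output above, this returns `["CC(C#N)CCC#N", "N"]`.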

SMolInstruct Dataset

LlaSMol models are trained on SMolInstruct, a large-scale chemistry instruction dataset:

Dataset: https://huggingface.co/datasets/osunlp/SMolInstruct

Key Features

Scale: Millions of instruction-response pairs
Coverage: All 14 chemistry tasks with balanced distribution
Quality: Curated and validated chemical data
Format: Instruction-tuning format with special tags

Training Details

Fine-tuning uses LoRA (Low-Rank Adaptation):

MODELNAME=LlaSMol-Mistral-7B
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
  --data_path osunlp/SMolInstruct \
  --base_model mistralai/Mistral-7B-v0.1 \
  --output_dir checkpoint/$MODELNAME

See LLM4Chem/README.md:131 for training instructions.
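LoRA keeps the base weights frozen and learns a low-rank update: for a weight matrix of shape d_out × d_in it trains two factors of shapes d_out × r and r × d_in. The arithmetic below illustrates why this is cheap. The rank of 16 is an assumption chosen for illustration (the rank used for LlaSMol is not stated here); 4096 is the hidden size of Mistral-7B.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


# Illustrative numbers: one 4096 x 4096 projection, rank 16 (assumed).
lora = lora_trainable_params(4096, 4096, 16)
dense = full_params(4096, 4096)
print(lora, dense, lora / dense)
```

Under these assumptions the adapter holds 131,072 trainable parameters against 16,777,216 in the dense matrix, which is 1/128 of the original.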

Tag System

LlaSMol uses specialized tags to structure chemistry information:

Input Tags

<SMILES>

string

Wraps SMILES representations in queries.
Example: <SMILES> CC(C)Cl </SMILES>

<IUPAC>

string

Wraps IUPAC names in queries.
Example: <IUPAC> aspirin </IUPAC>

Output Tags

<MOLFORMULA>

string

Molecular formula in model responses.
Example: <MOLFORMULA> C9H8O4 </MOLFORMULA>

<NUMBER>

string

Numerical predictions (solubility, logD, etc.).
Example: <NUMBER> -1.41 </NUMBER>

<BOOLEAN>

string

Yes/No predictions (toxicity, BBB, etc.).
Example: <BOOLEAN> Yes </BOOLEAN>

See LLM4Chem/README.md:157 for complete tag documentation.
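Because every tag follows the same open/close pattern, one regular expression per tag is enough to pull the payloads out of a query or response. The sketch below is a hypothetical helper, not LLM4Chem code; `KNOWN_TAGS` lists only the five tags documented on this page.

```python
import re
from typing import Dict, List

KNOWN_TAGS = ("SMILES", "IUPAC", "MOLFORMULA", "NUMBER", "BOOLEAN")


def extract_tags(text: str) -> Dict[str, List[str]]:
    """Collect the contents of every known tag, keyed by tag name."""
    found: Dict[str, List[str]] = {}
    for tag in KNOWN_TAGS:
        pattern = rf"<{tag}>\s*(.*?)\s*</{tag}>"
        values = re.findall(pattern, text, flags=re.DOTALL)
        if values:
            found[tag] = values
    return found
```

Text without any tags yields an empty dictionary, which gives callers a simple way to detect an untagged answer.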

SMILES Canonicalization

LlaSMol automatically canonicalizes SMILES strings using RDKit:

from rdkit import Chem

def canonicalize_smiles(smiles: str) -> str:
    """Convert SMILES to canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return smiles
    return Chem.MolToSmiles(mol)

Why canonicalization matters:

CCO and OCC represent the same molecule (ethanol)
Canonical form ensures consistent training and inference
Improves model accuracy and reduces ambiguity

Canonicalization happens automatically when SMILES are wrapped in <SMILES> tags.

See LLM4Chem/README.md:168 for details.
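To illustrate what canonicalizing tagged SMILES could look like, the sketch below rewrites every `<SMILES>` span in a piece of text with a caller-supplied canonicalizer. It is not the LLM4Chem implementation: `canonicalize_tagged` and `rdkit_canonicalize` are hypothetical names, and the canonicalizer is passed in so the tagging logic can be exercised without RDKit installed.

```python
import re
from typing import Callable

_SMILES_TAG = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def canonicalize_tagged(text: str, canonicalize: Callable[[str], str]) -> str:
    """Replace each tagged SMILES with its canonical form."""
    def _replace(match: "re.Match[str]") -> str:
        # A replacement function avoids escape issues with "\" in SMILES.
        return f"<SMILES> {canonicalize(match.group(1))} </SMILES>"

    return _SMILES_TAG.sub(_replace, text)


def rdkit_canonicalize(smiles: str) -> str:
    """RDKit-backed canonicalizer; returns the input unchanged if parsing fails."""
    from rdkit import Chem  # imported lazily so the helper above works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return smiles if mol is None else Chem.MolToSmiles(mol)
```

With RDKit available, `canonicalize_tagged(query, rdkit_canonicalize)` normalizes a query before it is sent to the model.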

Usage in ChemAgent

Direct Generation

The answer_chemistry_query tool wraps LlaSMol:

@tool
def answer_chemistry_query(query: str) -> str:
    """Answer a chemistry-related query using LlaSMol."""
    response = generator.generate(query)
    return response[0]["output"][0]

See plan_execute_agent/chem_tools.py:124 for implementation.

Query Format

Queries must be properly tagged before being sent to LlaSMol:

# ❌ Incorrect - no tags
query = "What is the SMILES for aspirin?"

# ✅ Correct - IUPAC tagged
query = "What is the SMILES for <IUPAC> aspirin </IUPAC>?"

The structure_chem_prompt tool handles automatic tagging.
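For cases where a query is assembled by hand, the tagging rule can be sketched as two small helpers. These are hypothetical and do not reproduce `structure_chem_prompt`: `wrap` builds a tagged span, and `has_tagged_input` checks that a query carries at least one non-empty input tag. The check applies to tasks whose input is a molecule; as noted earlier, molecule generation takes an untagged description.

```python
import re

INPUT_TAGS = ("SMILES", "IUPAC")


def wrap(tag: str, value: str) -> str:
    """Wrap a value in an input tag, using the spacing shown in the examples."""
    if tag not in INPUT_TAGS:
        raise ValueError(f"Unsupported input tag: {tag!r}")
    return f"<{tag}> {value.strip()} </{tag}>"


def has_tagged_input(query: str) -> bool:
    """Return True if the query contains at least one non-empty input tag."""
    for tag in INPUT_TAGS:
        for value in re.findall(rf"<{tag}>(.*?)</{tag}>", query, flags=re.DOTALL):
            if value.strip():
                return True
    return False


query = f"What is the SMILES for {wrap('IUPAC', 'aspirin')}?"
```

Here `query` matches the correctly tagged example above, and `has_tagged_input` rejects the untagged one.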

Response Handling

LlaSMol responses are stored for validation and error tracking:

import plan_execute_agent.llasmol_response as llasmol_response

# Excerpt from inside a tool function; `generator` is the LlaSMolGeneration instance.
response = generator.generate(query)
llasmol_response.model_response = response  # Store for later access
return response[0]["output"][0]

See plan_execute_agent/chem_tools.py:159 for response management.
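The indexing `response[0]["output"][0]` raises if the model returns nothing. A defensive variant is sketched below, assuming the response is a list of dictionaries each holding an `"output"` list, which is the shape implied by the snippets above. `first_output` is a hypothetical helper, not part of ChemAgent.

```python
from typing import Any, Optional


def first_output(response: Any) -> Optional[str]:
    """Return the first generated string, or None if the response is empty or malformed."""
    try:
        value = response[0]["output"][0]
    except (IndexError, KeyError, TypeError):
        return None
    return value if isinstance(value, str) else None
```

Returning `None` lets the calling tool decide whether to retry, fall back, or report the failure.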

Performance Characteristics

Accuracy

Performance varies by task:

Name conversions: 80-90% exact match
Property predictions: 60-85% task-dependent
Molecule captioning: Qualitative, high coherence
Reaction prediction: 50-70% exact match
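For reference, an exact-match rate is the fraction of predictions that equal their reference string. The sketch below shows the plain string version; it is an illustration, not the metric code shipped with LLM4Chem, which may normalize answers (for example by canonicalizing SMILES) before comparing.

```python
from typing import Sequence


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```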

Latency

Single query: 2-5 seconds on GPU
Batch processing: ~1 second per query
First load: +10 seconds model initialization

Memory

Model size: ~14GB (7B parameters)
Peak memory: ~16GB during inference
Batch size 1: ~8GB VRAM minimum
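The ~14GB model size is consistent with storing 7 billion parameters at 2 bytes each (16-bit weights). The precision is an assumption here, since it is not stated above. The arithmetic can be checked as follows; activations and the KV cache add to this figure at inference time.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(7e9, 2))  # 16-bit weights
print(weight_memory_gb(7e9, 4))  # 32-bit weights
```

This prints 14.0 and 28.0.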

Evaluation Pipeline

LlaSMol includes tools for evaluation on SMolInstruct:

Step 1: Generate Responses

python generate_on_dataset.py \
  --model_name osunlp/LlaSMol-Mistral-7B \
  --output_dir eval/LlaSMol-Mistral-7B/output

Step 2: Extract Predictions

python extract_prediction.py \
  --output_dir eval/LlaSMol-Mistral-7B/output \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction

Step 3: Compute Metrics

python compute_metrics.py \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction

See LLM4Chem/README.md:172 for evaluation documentation.
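The three steps can also be chained from Python. The sketch below builds the same command lines as above and runs them in order, stopping at the first failure. The script names and flags are the ones shown in the steps; `build_eval_commands` and `run_pipeline` are hypothetical wrappers, and the `eval/<model>/...` directory layout is simply copied from the examples.

```python
import subprocess
from typing import List


def build_eval_commands(model_name: str, eval_root: str = "eval") -> List[List[str]]:
    """Return the argv lists for generation, extraction, and metric computation."""
    short_name = model_name.split("/")[-1]
    output_dir = f"{eval_root}/{short_name}/output"
    prediction_dir = f"{eval_root}/{short_name}/prediction"
    return [
        ["python", "generate_on_dataset.py",
         "--model_name", model_name, "--output_dir", output_dir],
        ["python", "extract_prediction.py",
         "--output_dir", output_dir, "--prediction_dir", prediction_dir],
        ["python", "compute_metrics.py", "--prediction_dir", prediction_dir],
    ]


def run_pipeline(model_name: str) -> None:
    """Run the three steps in order from the LLM4Chem directory."""
    for command in build_eval_commands(model_name):
        subprocess.run(command, check=True)  # raises on the first failing step
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before running them.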

Limitations

Known Limitations:

Task Scope: Only supports the 14 trained tasks
Complex Queries: May struggle with multi-step reasoning
Novel Compounds: Less accurate for molecules not in training data
Numerical Precision: Property predictions are approximate
Context Length: Limited to standard transformer context window

Next Steps

Chemistry Tags

Agent Workflow