LlaSMol (Large Language Model for Small Molecules) is a family of fine-tuned models specifically trained for chemistry tasks using the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.

Overview

LlaSMol models are instruction-tuned on SMolInstruct, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.

Model Variants

All LlaSMol models are available on Hugging Face:

LlaSMol-Mistral-7B

Recommended, best overall performance: osunlp/LlaSMol-Mistral-7B

LlaSMol-Llama2-7B

Meta’s Llama2 base: osunlp/LlaSMol-Llama2-7B

LlaSMol-CodeLlama-7B

Code-specialized base: osunlp/LlaSMol-CodeLlama-7B

LlaSMol-Galactica-6.7B

Science-focused base: osunlp/LlaSMol-Galactica-6.7B

Model Initialization

ChemAgent uses the Mistral variant by default:
from LLM4Chem.generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    "osunlp/LlaSMol-Mistral-7B", 
    device="cuda"
)
See plan_execute_agent/chem_tools.py:118 for integration.
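To switch variants without scattering repository IDs through the code, the four checkpoints listed above can be kept in one lookup table. This is a minimal sketch: the `LLASMOL_VARIANTS` dictionary and the `resolve_variant` helper are hypothetical names used for illustration and are not part of LLM4Chem or ChemAgent.

```python
# Hypothetical lookup table; the repository IDs are the ones listed above.
LLASMOL_VARIANTS = {
    "mistral": "osunlp/LlaSMol-Mistral-7B",
    "llama2": "osunlp/LlaSMol-Llama2-7B",
    "codellama": "osunlp/LlaSMol-CodeLlama-7B",
    "galactica": "osunlp/LlaSMol-Galactica-6.7B",
}


def resolve_variant(name: str = "mistral") -> str:
    """Return the Hugging Face repository ID for a short variant name."""
    try:
        return LLASMOL_VARIANTS[name.lower()]
    except KeyError:
        known = ", ".join(sorted(LLASMOL_VARIANTS))
        raise ValueError(f"Unknown variant {name!r}; expected one of: {known}") from None
```

The returned string can then be passed as the first argument to `LlaSMolGeneration`, as in the initialization snippet above.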

Hardware Requirements

LlaSMol models require GPU acceleration and sufficient VRAM:
  • Minimum: 8GB GPU memory
  • Recommended: 16GB+ for optimal performance
  • CPU-only: Set LOW_VRAM=True in the configuration to disable LlaSMol loading
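The thresholds above can be turned into a small pre-flight check. The sketch below is illustrative only: `classify_vram` is a hypothetical helper, the caller is assumed to supply the available GPU memory in GB (or `None` when no GPU is present), and how ChemAgent actually reads `LOW_VRAM` is not shown on this page.

```python
from typing import Optional

MIN_VRAM_GB = 8.0           # "Minimum" from the list above
RECOMMENDED_VRAM_GB = 16.0  # "Recommended" from the list above


def classify_vram(vram_gb: Optional[float]) -> str:
    """Map available GPU memory in GB to a tier; None means no GPU was found."""
    if vram_gb is None or vram_gb < MIN_VRAM_GB:
        # Below the documented minimum: a candidate for LOW_VRAM=True.
        return "below-minimum"
    if vram_gb < RECOMMENDED_VRAM_GB:
        return "minimum"
    return "recommended"
```

A result of `"below-minimum"` suggests setting `LOW_VRAM=True` rather than attempting to load the model.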

Supported Tasks

LlaSMol is trained on 14 chemistry tasks across 4 categories:

1. Name Conversion (4 tasks)

Converts between different molecular representations:
Task: Convert IUPAC name to molecular formula
query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"
response = generator.generate(query)
# Output: "<MOLFORMULA> C15H11NO </MOLFORMULA>"
See LLM4Chem/README.md:23 for more examples.

2. Property Prediction (6 tasks)

Predicts molecular properties from SMILES. Each of the six tasks is shown below with an example query.
Predicts log solubility in mol/L:
query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
response = generator.generate(query)
# Output: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
See LLM4Chem/README.md:52 for details.
Predicts octanol/water distribution coefficient (logD at pH 7.4):
query = "Predict the logD for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>."
response = generator.generate(query)
# Output: "<NUMBER> 1.090 </NUMBER>"
See LLM4Chem/README.md:59 for details.
Predicts whether a molecule can penetrate the blood-brain barrier, BBB (boolean):
query = "Is BBBP a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"
response = generator.generate(query)
# Output: "<BOOLEAN> Yes </BOOLEAN>"
See LLM4Chem/README.md:66 for details.
Predicts whether a molecule is toxic (boolean):
query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:73 for details.
Predicts whether a molecule inhibits HIV replication (boolean):
query = "Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> inhibit HIV?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:80 for details.
Predicts organ-specific side effects (boolean):
query = "Are there side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:87 for details.
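Because property predictions come back as tagged text, downstream code usually needs them as Python values. The following sketch converts `<NUMBER>` and `<BOOLEAN>` outputs such as those shown above; `parse_number` and `parse_boolean` are hypothetical helpers, not functions provided by LLM4Chem, and they assume the output format seen in these examples.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"<NUMBER>\s*(.*?)\s*</NUMBER>", re.DOTALL)
_BOOLEAN = re.compile(r"<BOOLEAN>\s*(.*?)\s*</BOOLEAN>", re.DOTALL)


def parse_number(output: str) -> Optional[float]:
    """Return the first <NUMBER> value as a float, or None if absent or malformed."""
    match = _NUMBER.search(output)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None


def parse_boolean(output: str) -> Optional[bool]:
    """Return True for "Yes", False for "No", or None if absent or unrecognized."""
    match = _BOOLEAN.search(output)
    if match is None:
        return None
    answer = match.group(1).lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    return None
```

Returning `None` instead of raising keeps a malformed model response from aborting a longer agent run; callers can decide whether to retry or report the failure.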

3. Molecule Description (2 tasks)

Generates or interprets molecular descriptions:

Molecule Captioning

Describes a molecule from its SMILES:
query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
response = generator.generate(query)
# Output: "The molecule is an imidazole derivative with short-acting sedative, 
#          hypnotic, and general anesthetic properties. Etomidate appears to have 
#          gamma-aminobutyric acid (GABA) like effects..."
See LLM4Chem/README.md:96 for examples.

Molecule Generation

Generates SMILES from a text description:
query = """Give me a molecule that satisfies: The molecule is a member of the class of 
tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia 
marcescens. It has a role as an antimicrobial agent..."""

response = generator.generate(query)
# Output: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"
For molecule generation, tags are not required in the input description.
See LLM4Chem/README.md:103 for examples.

4. Chemical Reactions (2 tasks)

Predicts reaction products or reactants:
Forward synthesis (predicts products from reactants):
query = "<SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product."
response = generator.generate(query)
# Output: "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>."
See LLM4Chem/README.md:115 for examples.
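In the query above, the reactants and reagents share one `<SMILES>` tag and are separated by `.`, the SMILES separator for disconnected species. A query in that shape can be assembled from a list of components as sketched below; `forward_synthesis_query` is a hypothetical helper, and the prompt wording is copied from the example rather than taken from an official template.

```python
from typing import Iterable


def forward_synthesis_query(components: Iterable[str]) -> str:
    """Build a product-prediction query from reactant and reagent SMILES."""
    parts = [c.strip() for c in components if c.strip()]
    if not parts:
        raise ValueError("at least one reactant or reagent SMILES is required")
    joined = ".".join(parts)  # "." separates disconnected species in SMILES
    return (
        f"<SMILES> {joined} </SMILES> Based on the reactants and reagents "
        "given above, suggest a possible product."
    )
```

Calling `forward_synthesis_query(["NC1=CC=C2OCOC2=C1", "O=CO"])` reproduces the example query shown above.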

SMolInstruct Dataset

LlaSMol models are trained on SMolInstruct, a large-scale chemistry instruction dataset:

Key Features

  • Scale: More than three million instruction-response pairs
  • Coverage: All 14 chemistry tasks with balanced distribution
  • Quality: Curated and validated chemical data
  • Format: Instruction-tuning format with special tags

Training Details

Fine-tuning uses LoRA (Low-Rank Adaptation):
MODELNAME=LlaSMol-Mistral-7B
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
  --data_path osunlp/SMolInstruct \
  --base_model mistralai/Mistral-7B-v0.1 \
  --output_dir checkpoint/$MODELNAME
See LLM4Chem/README.md:131 for training instructions.

Tag System

LlaSMol uses specialized tags to structure chemistry information:

Input Tags

<SMILES>
string
Wraps SMILES representations in queries. Example: <SMILES> CC(C)Cl </SMILES>
<IUPAC>
string
Wraps IUPAC names in queries. Example: <IUPAC> 2-acetyloxybenzoic acid </IUPAC> (the IUPAC name of aspirin)

Output Tags

<MOLFORMULA>
string
Molecular formula in model responses. Example: <MOLFORMULA> C9H8O4 </MOLFORMULA>
<NUMBER>
string
Numerical predictions (solubility, logD, etc.). Example: <NUMBER> -1.41 </NUMBER>
<BOOLEAN>
string
Yes/No predictions (toxicity, BBB, etc.). Example: <BOOLEAN> Yes </BOOLEAN>
See LLM4Chem/README.md:157 for complete tag documentation.
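Since every tag is a matched open/close pair, all tagged spans in a query or response can be listed with a single regular expression. The sketch below covers only the five tags documented on this page; `find_tags` is a hypothetical helper, not part of LLM4Chem.

```python
import re
from typing import List, Tuple

# Only the tags documented above; extend the alternation if others are needed.
_TAGGED = re.compile(
    r"<(SMILES|IUPAC|MOLFORMULA|NUMBER|BOOLEAN)>\s*(.*?)\s*</\1>",
    re.DOTALL,
)


def find_tags(text: str) -> List[Tuple[str, str]]:
    """Return (tag, content) pairs in order of appearance, with whitespace trimmed."""
    return [(m.group(1), m.group(2)) for m in _TAGGED.finditer(text)]
```

The backreference `\1` requires the closing tag to match the opening one, so a mismatched pair such as `<SMILES> ... </IUPAC>` is skipped rather than misread.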

SMILES Canonicalization

LlaSMol automatically canonicalizes SMILES strings using RDKit:
from rdkit import Chem

def canonicalize_smiles(smiles: str) -> str:
    """Convert SMILES to canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return smiles
    return Chem.MolToSmiles(mol)
Why canonicalization matters:
  • CCO and OCC represent the same molecule (ethanol)
  • Canonical form ensures consistent training and inference
  • Improves model accuracy and reduces ambiguity
Canonicalization happens automatically when SMILES are wrapped in <SMILES> tags.
See LLM4Chem/README.md:168 for details.
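One way to picture canonicalization of tagged SMILES is a rewrite pass over every `<SMILES>` span in a query. The sketch below is not the LLM4Chem implementation: `canonicalize_tagged` is a hypothetical helper that takes the canonicalizer as a parameter, so in practice it could be given the RDKit-based `canonicalize_smiles` defined above.

```python
import re
from typing import Callable

_SMILES_SPAN = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def canonicalize_tagged(query: str, canonicalize: Callable[[str], str]) -> str:
    """Apply `canonicalize` to the content of every <SMILES> span in `query`."""

    def _rewrite(match: re.Match) -> str:
        # Text outside the tags is left untouched; only the span is rebuilt.
        return f"<SMILES> {canonicalize(match.group(1))} </SMILES>"

    return _SMILES_SPAN.sub(_rewrite, query)
```

Passing the canonicalizer in keeps the helper testable without RDKit installed; a stub function or dictionary lookup is enough for unit tests.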

Usage in ChemAgent

Direct Generation

The answer_chemistry_query tool wraps LlaSMol:
@tool
def answer_chemistry_query(query: str) -> str:
    """Answer a chemistry-related query using LlaSMol."""
    response = generator.generate(query)
    return response[0]["output"][0]
See plan_execute_agent/chem_tools.py:124 for implementation.

Query Format

Queries must be properly tagged before being sent to LlaSMol:
# ❌ Incorrect - no tags
query = "What is the SMILES for 2-acetyloxybenzoic acid?"

# ✅ Correct - IUPAC tagged (2-acetyloxybenzoic acid is the IUPAC name of aspirin)
query = "What is the SMILES for <IUPAC> 2-acetyloxybenzoic acid </IUPAC>?"
The structure_chem_prompt tool handles automatic tagging.

Response Handling

LlaSMol responses are stored for validation and error tracking:
import plan_execute_agent.llasmol_response as llasmol_response

# Excerpt from inside a tool function; `generator` and `query` come from the surrounding code
response = generator.generate(query)
llasmol_response.model_response = response  # Store for later access
return response[0]["output"][0]
See plan_execute_agent/chem_tools.py:159 for response management.
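The expression `response[0]["output"][0]` raises an exception if the list is empty or the `"output"` entry is missing. A defensive variant might look like the sketch below; the response shape (a list of dictionaries, each with an `"output"` list) is inferred from the indexing used above, and `first_output` is a hypothetical helper rather than ChemAgent code.

```python
from typing import List, Optional


def first_output(response: List[dict]) -> Optional[str]:
    """Return response[0]["output"][0], or None if that path is missing or empty."""
    if not response:
        return None
    outputs = response[0].get("output")
    if not outputs:  # covers a missing key, None, and an empty list
        return None
    return outputs[0]
```

A caller can then treat `None` as a failed generation and surface an error message instead of letting an `IndexError` or `KeyError` propagate.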

Performance Characteristics

Accuracy
Performance varies by task:
  • Name conversions: 80-90% exact match
  • Property predictions: 60-85%, depending on the task
  • Molecule captioning: Qualitative, high coherence
  • Reaction prediction: 50-70% exact match
Latency
  • Single query: 2-5 seconds on GPU
  • Batch processing: ~1 second per query
  • First load: adds about 10 seconds for model initialization
Memory
  • Model size: ~14GB (7B parameters)
  • Peak memory: ~16GB during inference
  • Batch size 1: ~8GB VRAM minimum
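The ~14GB model size is consistent with 7 billion parameters stored at 2 bytes each. The page does not state the precision, so 16-bit weights are an assumption here, and `weight_memory_gb` is a hypothetical helper for the arithmetic.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(7e9)     # 16-bit weights: 14.0
fp32_gb = weight_memory_gb(7e9, 4)  # 32-bit weights: 28.0
```

This estimate covers the weights only. Activations and the attention cache add to it, which fits a peak inference figure slightly above the model size.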

Evaluation Pipeline

LlaSMol includes tools for evaluation on SMolInstruct:

Step 1: Generate Responses

python generate_on_dataset.py \
  --model_name osunlp/LlaSMol-Mistral-7B \
  --output_dir eval/LlaSMol-Mistral-7B/output

Step 2: Extract Predictions

python extract_prediction.py \
  --output_dir eval/LlaSMol-Mistral-7B/output \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction

Step 3: Compute Metrics

python compute_metrics.py \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction
See LLM4Chem/README.md:172 for evaluation documentation.
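The three steps share directory names, so they can be generated from the model ID to avoid mismatched paths. The sketch below only builds the command lines using the scripts and flags shown above; `evaluation_commands` is a hypothetical helper, and the `eval/<model>/...` layout is taken from the examples rather than mandated by LLM4Chem.

```python
import shlex
from typing import List


def evaluation_commands(model_name: str, eval_root: str = "eval") -> List[List[str]]:
    """Return the generate, extract, and metrics commands for one model."""
    short_name = model_name.split("/")[-1]  # e.g. "LlaSMol-Mistral-7B"
    output_dir = f"{eval_root}/{short_name}/output"
    prediction_dir = f"{eval_root}/{short_name}/prediction"
    return [
        ["python", "generate_on_dataset.py",
         "--model_name", model_name, "--output_dir", output_dir],
        ["python", "extract_prediction.py",
         "--output_dir", output_dir, "--prediction_dir", prediction_dir],
        ["python", "compute_metrics.py", "--prediction_dir", prediction_dir],
    ]


if __name__ == "__main__":
    for cmd in evaluation_commands("osunlp/LlaSMol-Mistral-7B"):
        # Printed for review; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Running the commands in order with `check=True` would stop the pipeline at the first failing step, so metrics are never computed on incomplete predictions.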

Limitations

Known Limitations:
  1. Task Scope: Only supports the 14 trained tasks
  2. Complex Queries: May struggle with multi-step reasoning
  3. Novel Compounds: Less accurate for molecules not in training data
  4. Numerical Precision: Property predictions are approximate
  5. Context Length: Limited to standard transformer context window

Next Steps

Chemistry Tags

Learn the tag system in detail

Agent Workflow

See how LlaSMol fits into the workflow