# LlaSMol Model

> Understanding the LlaSMol chemistry-specialized language model

LlaSMol (Large Language Models on Small Molecules) is a family of language models fine-tuned for chemistry tasks on the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.

## Overview

<Info>
  **Paper**: [LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset](https://arxiv.org/abs/2402.09391)

  **Project**: [https://osu-nlp-group.github.io/LLM4Chem](https://osu-nlp-group.github.io/LLM4Chem)
</Info>

LlaSMol models are instruction-tuned on **SMolInstruct**, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.

## Model Variants

All LlaSMol models are available on Hugging Face:

<CardGroup cols={2}>
  <Card title="LlaSMol-Mistral-7B" icon="star">
    **Recommended** - Best overall performance

    `osunlp/LlaSMol-Mistral-7B`
  </Card>

  <Card title="LlaSMol-Llama2-7B" icon="llama">
    Meta's Llama2 base

    `osunlp/LlaSMol-Llama2-7B`
  </Card>

  <Card title="LlaSMol-CodeLlama-7B" icon="code">
    Code-specialized base

    `osunlp/LlaSMol-CodeLlama-7B`
  </Card>

  <Card title="LlaSMol-Galactica-6.7B" icon="atom">
    Science-focused base

    `osunlp/LlaSMol-Galactica-6.7B`
  </Card>
</CardGroup>

### Model Initialization

ChemAgent uses the Mistral variant by default:

```python theme={null}
from LLM4Chem.generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    "osunlp/LlaSMol-Mistral-7B", 
    device="cuda"
)
```

See plan\_execute\_agent/chem\_tools.py:118 for integration.
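
To switch between the variants listed above, the Hugging Face IDs can be kept in a small lookup table. This is a minimal sketch: the aliases, `LLASMOL_VARIANTS`, and `resolve_model_id` are illustrative names and are not part of ChemAgent or LLM4Chem.

```python
# Hugging Face model IDs from the Model Variants cards, keyed by a short alias.
# The aliases and helper below are hypothetical conveniences for illustration.
LLASMOL_VARIANTS = {
    "mistral": "osunlp/LlaSMol-Mistral-7B",
    "llama2": "osunlp/LlaSMol-Llama2-7B",
    "codellama": "osunlp/LlaSMol-CodeLlama-7B",
    "galactica": "osunlp/LlaSMol-Galactica-6.7B",
}


def resolve_model_id(alias: str = "mistral") -> str:
    """Return the Hugging Face ID for a variant alias (case-insensitive)."""
    key = alias.strip().lower()
    if key not in LLASMOL_VARIANTS:
        known = ", ".join(sorted(LLASMOL_VARIANTS))
        raise ValueError(f"Unknown LlaSMol variant {alias!r}; expected one of: {known}")
    return LLASMOL_VARIANTS[key]
```

The returned string can then be passed as the first argument to `LlaSMolGeneration`, as in the snippet above.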

### Hardware Requirements

<Warning>
  LlaSMol models require GPU acceleration and sufficient VRAM:

  * **Minimum**: 8GB GPU memory
  * **Recommended**: 16GB+ for optimal performance
  * **CPU-only**: Set `LOW_VRAM=True` in the configuration to disable LlaSMol
</Warning>
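
The thresholds above can be turned into a simple pre-flight check. The sketch below assumes that `LOW_VRAM=True` means the model is not loaded at all; the function name and the returned labels are hypothetical, and how the available VRAM is measured is left to the caller.

```python
MIN_VRAM_GB = 8           # "Minimum" figure from the warning above
RECOMMENDED_VRAM_GB = 16  # "Recommended" figure from the warning above


def llasmol_load_plan(vram_gb: float, low_vram: bool = False) -> str:
    """Classify a machine against the documented VRAM thresholds.

    Returns one of "disabled", "insufficient", "minimum", "recommended".
    The labels are illustrative; ChemAgent does not define them.
    """
    if low_vram:
        return "disabled"      # assumption: LOW_VRAM=True skips loading LlaSMol
    if vram_gb < MIN_VRAM_GB:
        return "insufficient"
    if vram_gb < RECOMMENDED_VRAM_GB:
        return "minimum"
    return "recommended"
```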

## Supported Tasks

LlaSMol is trained on **14 chemistry tasks** across 4 categories:

### 1. Name Conversion (4 tasks)

Converts between different molecular representations:

<Tabs>
  <Tab title="IUPAC → Formula">
    **Task**: Convert IUPAC name to molecular formula

    ```python theme={null}
    query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"
    response = generator.generate(query)
    # Output: "<MOLFORMULA> C15H11NO </MOLFORMULA>"
    ```

    See LLM4Chem/README.md:23 for more examples.
  </Tab>

  <Tab title="IUPAC → SMILES">
    **Task**: Convert IUPAC name to SMILES string

    ```python theme={null}
    query = "Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>?"
    response = generator.generate(query)
    # Output: "Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES>."
    ```

    See LLM4Chem/README.md:30 for more examples.
  </Tab>

  <Tab title="SMILES → Formula">
    **Task**: Convert SMILES to molecular formula

    ```python theme={null}
    query = "Given <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>, what would be its molecular formula?"
    response = generator.generate(query)
    # Output: "It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA>."
    ```

    See LLM4Chem/README.md:37 for more examples.
  </Tab>

  <Tab title="SMILES → IUPAC">
    **Task**: Convert SMILES to IUPAC name

    ```python theme={null}
    query = "Translate <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name."
    response = generator.generate(query)
    # Output: "<IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>"
    ```

    See LLM4Chem/README.md:44 for more examples.
  </Tab>
</Tabs>

### 2. Property Prediction (6 tasks)

Predicts molecular properties from SMILES:

<AccordionGroup>
  <Accordion title="ESOL - Aqueous Solubility">
    Predicts log solubility in mol/L:

    ```python theme={null}
    query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
    response = generator.generate(query)
    # Output: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
    ```

    See LLM4Chem/README.md:52 for details.
  </Accordion>

  <Accordion title="LIPO - Lipophilicity">
    Predicts octanol/water distribution coefficient (logD at pH 7.4):

    ```python theme={null}
    query = "Predict the logD for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>."
    response = generator.generate(query)
    # Output: "<NUMBER> 1.090 </NUMBER>"
    ```

    See LLM4Chem/README.md:59 for details.
  </Accordion>

  <Accordion title="BBBP - Blood-Brain Barrier">
    Predicts whether a molecule can penetrate the blood-brain barrier (boolean):

    ```python theme={null}
    query = "Is BBBP a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"
    response = generator.generate(query)
    # Output: "<BOOLEAN> Yes </BOOLEAN>"
    ```

    See LLM4Chem/README.md:66 for details.
  </Accordion>

  <Accordion title="Clintox - Toxicity">
    Predicts whether a molecule is toxic (boolean):

    ```python theme={null}
    query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
    response = generator.generate(query)
    # Output: "<BOOLEAN> No </BOOLEAN>"
    ```

    See LLM4Chem/README.md:73 for details.
  </Accordion>

  <Accordion title="HIV - HIV Inhibition">
    Predicts whether a molecule inhibits HIV replication (boolean):

    ```python theme={null}
    query = "Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> inhibit HIV?"
    response = generator.generate(query)
    # Output: "<BOOLEAN> No </BOOLEAN>"
    ```

    See LLM4Chem/README.md:80 for details.
  </Accordion>

  <Accordion title="SIDER - Side Effects">
    Predicts organ-specific side effects (boolean):

    ```python theme={null}
    query = "Are there side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?"
    response = generator.generate(query)
    # Output: "<BOOLEAN> No </BOOLEAN>"
    ```

    See LLM4Chem/README.md:87 for details.
  </Accordion>
</AccordionGroup>
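
Property predictions come back as tagged text, so downstream code usually needs to convert them to Python values. The following is a minimal sketch using only the standard library; `parse_number` and `parse_boolean` are hypothetical helpers, not functions provided by LLM4Chem.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"<NUMBER>\s*(.*?)\s*</NUMBER>", re.DOTALL)
_BOOLEAN = re.compile(r"<BOOLEAN>\s*(.*?)\s*</BOOLEAN>", re.DOTALL)


def parse_number(text: str) -> Optional[float]:
    """Return the first <NUMBER> value as a float, or None if absent or malformed."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None


def parse_boolean(text: str) -> Optional[bool]:
    """Return True/False for a <BOOLEAN> Yes/No answer, or None otherwise."""
    match = _BOOLEAN.search(text)
    if match is None:
        return None
    return {"yes": True, "no": False}.get(match.group(1).lower())
```

Returning `None` instead of raising lets the caller decide how to handle a response that does not follow the expected format.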

### 3. Molecule Description (2 tasks)

Generates or interprets molecular descriptions:

#### Molecule Captioning

Describes a molecule from its SMILES:

```python theme={null}
query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
response = generator.generate(query)
# Output: "The molecule is an imidazole derivative with short-acting sedative, 
#          hypnotic, and general anesthetic properties. Etomidate appears to have 
#          gamma-aminobutyric acid (GABA) like effects..."
```

See LLM4Chem/README.md:96 for examples.

#### Molecule Generation

Generates SMILES from a text description:

```python theme={null}
query = """Give me a molecule that satisfies: The molecule is a member of the class of 
tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia 
marcescens. It has a role as an antimicrobial agent..."""

response = generator.generate(query)
# Output: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"
```

<Note>
  For molecule generation, tags are **not required** in the input description.
</Note>

See LLM4Chem/README.md:103 for examples.

### 4. Chemical Reactions (2 tasks)

Predicts reaction products or reactants:

<Tabs>
  <Tab title="Forward Synthesis">
    Predicts products from reactants:

    ```python theme={null}
    query = "<SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product."
    response = generator.generate(query)
    # Output: "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>."
    ```

    See LLM4Chem/README.md:115 for examples.
  </Tab>

  <Tab title="Retrosynthesis">
    Predicts reactants from products:

    ```python theme={null}
    query = "Identify possible reactants for the product: <SMILES> CC1=CC=C(N)N=C1N </SMILES>"
    response = generator.generate(query)
    # Output: "<SMILES> CC(C#N)CCC#N.N </SMILES>"
    ```

    See LLM4Chem/README.md:122 for examples.
  </Tab>
</Tabs>
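
In both reaction tasks, several molecules share one `<SMILES>` span and are separated by `.`, the SMILES notation for disconnected components. A minimal sketch for splitting such a span into individual molecules follows; `split_components` is a hypothetical helper name.

```python
import re
from typing import List

_SMILES = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def split_components(text: str) -> List[str]:
    """Split the first <SMILES> span in `text` into its dot-separated molecules."""
    match = _SMILES.search(text)
    if match is None:
        return []
    # "." separates disconnected components in SMILES; drop empty fragments.
    return [part for part in match.group(1).split(".") if part]
```

For the retrosynthesis response above, this yields two reactants: `CC(C#N)CCC#N` and `N`.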

## SMolInstruct Dataset

LlaSMol models are trained on **SMolInstruct**, a large-scale chemistry instruction dataset:

<Info>
  **Dataset**: [https://huggingface.co/datasets/osunlp/SMolInstruct](https://huggingface.co/datasets/osunlp/SMolInstruct)
</Info>

### Key Features

* **Scale**: More than three million instruction-response pairs
* **Coverage**: All 14 chemistry tasks with balanced distribution
* **Quality**: Curated and validated chemical data
* **Format**: Instruction-tuning format with special tags

### Training Details

Fine-tuning uses LoRA (Low-Rank Adaptation):

```bash theme={null}
MODELNAME=LlaSMol-Mistral-7B
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
  --data_path osunlp/SMolInstruct \
  --base_model mistralai/Mistral-7B-v0.1 \
  --output_dir checkpoint/$MODELNAME
```

See LLM4Chem/README.md:131 for training instructions.

## Tag System

LlaSMol uses specialized tags to structure chemistry information:

### Input Tags

<ParamField path="<SMILES>" type="string">
  Wraps SMILES representations in queries

  Example: `<SMILES> CC(C)Cl </SMILES>`
</ParamField>

<ParamField path="<IUPAC>" type="string">
  Wraps IUPAC names in queries

  Example: `<IUPAC> 2-acetyloxybenzoic acid </IUPAC>`
</ParamField>

### Output Tags

<ParamField path="<MOLFORMULA>" type="string">
  Molecular formula in model responses

  Example: `<MOLFORMULA> C9H8O4 </MOLFORMULA>`
</ParamField>

<ParamField path="<NUMBER>" type="string">
  Numerical predictions (solubility, logD, etc.)

  Example: `<NUMBER> -1.41 </NUMBER>`
</ParamField>

<ParamField path="<BOOLEAN>" type="string">
  Yes/No predictions (toxicity, BBB, etc.)

  Example: `<BOOLEAN> Yes </BOOLEAN>`
</ParamField>

See LLM4Chem/README.md:157 for complete tag documentation.
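
The tag format is regular enough to wrap and unwrap with a few lines of code. The sketch below is illustrative only: `KNOWN_TAGS`, `wrap`, and `extract_all` are hypothetical names, and the tag list is limited to the five tags documented on this page.

```python
import re
from typing import Dict, List

# Tags documented above; SMILES and IUPAC also appear in model responses.
KNOWN_TAGS = ("SMILES", "IUPAC", "MOLFORMULA", "NUMBER", "BOOLEAN")


def wrap(tag: str, value: str) -> str:
    """Wrap a value as `<TAG> value </TAG>`, with single spaces inside the tags."""
    return f"<{tag}> {value.strip()} </{tag}>"


def extract_all(text: str) -> Dict[str, List[str]]:
    """Collect the contents of every known tag found in `text`."""
    found: Dict[str, List[str]] = {}
    for tag in KNOWN_TAGS:
        values = re.findall(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, flags=re.DOTALL)
        if values:
            found[tag] = values
    return found
```

For example, `wrap("SMILES", "CC(C)Cl")` produces the input form shown under Input Tags, and `extract_all` returns a dictionary keyed by tag name for any response.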

## SMILES Canonicalization

LlaSMol automatically canonicalizes SMILES strings using RDKit:

```python theme={null}
from rdkit import Chem

def canonicalize_smiles(smiles: str) -> str:
    """Convert SMILES to canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return smiles
    return Chem.MolToSmiles(mol)
```

**Why canonicalization matters**:

* `CCO` and `OCC` represent the same molecule (ethanol)
* Canonical form ensures consistent training and inference
* Improves model accuracy and reduces ambiguity

<Note>
  Canonicalization happens automatically when SMILES are wrapped in `<SMILES>` tags.
</Note>

See LLM4Chem/README.md:168 for details.
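
To illustrate how canonicalization can be applied to tagged spans only, the sketch below rewrites each `<SMILES>` span with a caller-supplied function. It is not LlaSMol's actual implementation; `canonicalize_tagged` is a hypothetical name, and the canonicalizer is injected so the example has no RDKit dependency.

```python
import re
from typing import Callable

_SMILES_SPAN = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def canonicalize_tagged(text: str, canonicalize: Callable[[str], str]) -> str:
    """Apply `canonicalize` to the contents of every <SMILES> span in `text`."""

    def _rewrite(match: "re.Match[str]") -> str:
        return f"<SMILES> {canonicalize(match.group(1))} </SMILES>"

    # Text outside <SMILES> tags is left untouched.
    return _SMILES_SPAN.sub(_rewrite, text)
```

With RDKit installed, the `canonicalize_smiles` function shown above can be passed as the `canonicalize` argument.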

## Usage in ChemAgent

### Direct Generation

The `answer_chemistry_query` tool wraps LlaSMol:

```python theme={null}
@tool
def answer_chemistry_query(query: str) -> str:
    """Answer a chemistry-related query using LlaSMol."""
    response = generator.generate(query)
    return response[0]["output"][0]
```

See plan\_execute\_agent/chem\_tools.py:124 for implementation.

### Query Format

Queries must be properly tagged before being sent to LlaSMol:

```python theme={null}
# ❌ Incorrect - no tags
query = "What is the SMILES for 2-acetyloxybenzoic acid?"

# ✅ Correct - IUPAC name (here, aspirin's) wrapped in tags
query = "What is the SMILES for <IUPAC> 2-acetyloxybenzoic acid </IUPAC>?"
```

The `structure_chem_prompt` tool handles automatic tagging.
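
A cheap guard can catch untagged queries before they reach the model. This is a minimal sketch, not the logic of `structure_chem_prompt`; `has_input_tag` is a hypothetical helper, and it should not be applied to molecule generation queries, which take untagged descriptions.

```python
import re

# A non-empty <SMILES> or <IUPAC> span with a matching closing tag.
_TAGGED = re.compile(r"<(SMILES|IUPAC)>\s*[^<\s][^<]*</\1>")


def has_input_tag(query: str) -> bool:
    """Return True if the query contains at least one non-empty input tag span."""
    return _TAGGED.search(query) is not None
```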

### Response Handling

LlaSMol responses are stored for validation and error tracking:

```python theme={null}
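import plan_execute_agent.llasmol_response as llasmol_response

# Hypothetical wrapper name; assumes `generator` is the LlaSMolGeneration
# instance created during model initialization.
def generate_and_store(query: str) -> str:
    response = generator.generate(query)
    llasmol_response.model_response = response  # Store for later access
    return response[0]["output"][0]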
```

See plan\_execute\_agent/chem\_tools.py:159 for response management.
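
The `response[0]["output"][0]` lookup raises if the model returns nothing. A defensive accessor, sketched below, returns `None` instead; it assumes only the list-of-dicts shape implied by that lookup, and `first_output` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def first_output(response: List[Dict[str, Any]]) -> Optional[str]:
    """Return the first generated string, or None if the response is empty."""
    if not response:
        return None
    outputs = response[0].get("output") or []
    return outputs[0] if outputs else None
```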

## Performance Characteristics

<ResponseField name="Accuracy" type="metrics">
  Performance varies by task:

  * **Name conversions**: 80-90% exact match
  * **Property predictions**: 60-85% task-dependent
  * **Molecule captioning**: Qualitative, high coherence
  * **Reaction prediction**: 50-70% exact match
</ResponseField>

<ResponseField name="Latency" type="timing">
  * **Single query**: 2-5 seconds on GPU
  * **Batch processing**: \~1 second per query
  * **First load**: +10 seconds model initialization
</ResponseField>

<ResponseField name="Memory" type="resources">
  * **Model size**: \~14GB in 16-bit precision (7B parameters at 2 bytes each)
  * **Peak memory**: \~16GB during inference
  * **Batch size 1**: \~8GB VRAM minimum
</ResponseField>
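
The model size figure follows from the parameter count multiplied by the bytes stored per parameter. The sketch below shows the arithmetic for the weights alone; it ignores activations and cache memory, and the lower-precision cases are arithmetic only, not a statement about what ChemAgent supports.

```python
def weights_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


fp16_gb = weights_size_gb(7e9, 2)    # 16-bit weights: 14.0 GB
int8_gb = weights_size_gb(7e9, 1)    # 8-bit weights:  7.0 GB
int4_gb = weights_size_gb(7e9, 0.5)  # 4-bit weights:  3.5 GB
```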

## Evaluation Pipeline

LlaSMol includes tools for evaluation on SMolInstruct:

### Step 1: Generate Responses

```bash theme={null}
python generate_on_dataset.py \
  --model_name osunlp/LlaSMol-Mistral-7B \
  --output_dir eval/LlaSMol-Mistral-7B/output
```

### Step 2: Extract Predictions

```bash theme={null}
python extract_prediction.py \
  --output_dir eval/LlaSMol-Mistral-7B/output \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

### Step 3: Compute Metrics

```bash theme={null}
python compute_metrics.py \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

See LLM4Chem/README.md:172 for evaluation documentation.
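
The three steps share directory names, so they can be assembled from a single model ID. The sketch below only builds the argument lists using the script names and flags shown above; `build_eval_commands` and the `eval_root` parameter are hypothetical.

```python
from typing import List


def build_eval_commands(model_name: str, eval_root: str = "eval") -> List[List[str]]:
    """Build the generate, extract, and metrics commands for one model."""
    short_name = model_name.split("/")[-1]  # e.g. "LlaSMol-Mistral-7B"
    output_dir = f"{eval_root}/{short_name}/output"
    prediction_dir = f"{eval_root}/{short_name}/prediction"
    return [
        ["python", "generate_on_dataset.py",
         "--model_name", model_name, "--output_dir", output_dir],
        ["python", "extract_prediction.py",
         "--output_dir", output_dir, "--prediction_dir", prediction_dir],
        ["python", "compute_metrics.py",
         "--prediction_dir", prediction_dir],
    ]
```

Each list can be passed in order to `subprocess.run(cmd, check=True)` from the LLM4Chem directory, so that a failing step stops the pipeline.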

## Limitations

<Warning>
  **Known Limitations**:

  1. **Task Scope**: Only supports the 14 trained tasks
  2. **Complex Queries**: May struggle with multi-step reasoning
  3. **Novel Compounds**: Less accurate for molecules not in training data
  4. **Numerical Precision**: Property predictions are approximate
  5. **Context Length**: Limited by the base model's context window
</Warning>

## Next Steps

<CardGroup cols={2}>
  <Card title="Chemistry Tags" icon="tags" href="/concepts/chemistry-tags">
    Learn the tag system in detail
  </Card>

  <Card title="Agent Workflow" icon="diagram-project" href="/concepts/agent-workflow">
    See how LlaSMol fits into the workflow
  </Card>
</CardGroup>
