# LlaSMol Model

> Understanding the LlaSMol chemistry-specialized language model

LlaSMol (Large Language Model for Small Molecules) is a family of fine-tuned models specifically trained for chemistry tasks using the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.

## Overview

**Paper**: [LlaSMol: Advancing Large Language Models for Chemistry](https://arxiv.org/abs/2402.09391)

**Project**: [https://osu-nlp-group.github.io/LLM4Chem](https://osu-nlp-group.github.io/LLM4Chem)

LlaSMol models are instruction-tuned on **SMolInstruct**, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.

## Model Variants

All LlaSMol models are available on Hugging Face:

* `osunlp/LlaSMol-Mistral-7B`: **Recommended**, best overall performance
* `osunlp/LlaSMol-Llama2-7B`: Meta's Llama2 base
* `osunlp/LlaSMol-CodeLlama-7B`: Code-specialized base
* `osunlp/LlaSMol-Galactica-6.7B`: Science-focused base

### Model Initialization

ChemAgent uses the Mistral variant by default:

```python
from LLM4Chem.generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    "osunlp/LlaSMol-Mistral-7B",
    device="cuda"
)
```

See `plan_execute_agent/chem_tools.py:118` for integration.

### Hardware Requirements

LlaSMol models require GPU acceleration and sufficient VRAM:

* **Minimum**: 8GB GPU memory
* **Recommended**: 16GB+ for optimal performance
* **CPU-only**: Set `LOW_VRAM=True` in configuration to disable the LlaSMol model

## Supported Tasks

LlaSMol is trained on **14 chemistry tasks** across 4 categories:

### 1. Name Conversion (4 tasks)

Converts between different molecular representations:

**Task**: Convert IUPAC name to molecular formula

```python
query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> ?"
response = generator.generate(query)
# Output: "<MOLFORMULA> C15H11NO </MOLFORMULA>"
```

See `LLM4Chem/README.md:23` for more examples.

**Task**: Convert IUPAC name to SMILES string

```python
query = "Could you provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC> ?"
response = generator.generate(query)
# Output: "Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES> ."
```

See `LLM4Chem/README.md:30` for more examples.

**Task**: Convert SMILES to molecular formula

```python
query = "Given <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES> , what would be its molecular formula?"
response = generator.generate(query)
# Output: "It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA> ."
```

See `LLM4Chem/README.md:37` for more examples.

**Task**: Convert SMILES to IUPAC name

```python
query = "Translate <SMILES> CCC(C)C1CNCCCNC1 </SMILES> into its IUPAC name."
response = generator.generate(query)
# Output: "<IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>"
```

See `LLM4Chem/README.md:44` for more examples.

### 2. Property Prediction (6 tasks)

Predicts molecular properties from SMILES:

Predicts log solubility in mol/L:

```python
query = "How soluble is <SMILES> CC(C)Cl </SMILES> ?"
response = generator.generate(query)
# Output: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
```

See `LLM4Chem/README.md:52` for details.

Predicts the octanol/water distribution coefficient (logD at pH 7.4):

```python
query = "Predict the logD for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES> ."
response = generator.generate(query)
# Output: "<NUMBER> 1.090 </NUMBER>"
```

See `LLM4Chem/README.md:59` for details.

Predicts whether a molecule can penetrate the blood-brain barrier (BBB); the output is boolean:

```python
query = "Is BBBP a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES> ?"
response = generator.generate(query)
# Output: "<BOOLEAN> Yes </BOOLEAN>"
```

See `LLM4Chem/README.md:66` for details.

Predicts whether a molecule is toxic (boolean):

```python
query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
```

See `LLM4Chem/README.md:73` for details.

Predicts whether a molecule inhibits HIV replication (boolean):

```python
query = "Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> inhibit HIV?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
```

See `LLM4Chem/README.md:80` for details.

Predicts organ-specific side effects (boolean):

```python
query = "Are there side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
```

See `LLM4Chem/README.md:87` for details.

### 3. Molecule Description (2 tasks)

Generates or interprets molecular descriptions:

#### Molecule Captioning

Describes a molecule from its SMILES:

```python
query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
response = generator.generate(query)
# Output: "The molecule is an imidazole derivative with short-acting sedative,
# hypnotic, and general anesthetic properties. Etomidate appears to have
# gamma-aminobutyric acid (GABA) like effects..."
```

See `LLM4Chem/README.md:96` for examples.

#### Molecule Generation

Generates SMILES from a text description:

```python
query = """Give me a molecule that satisfies: The molecule is a member of the class of
tripyrroles that is a red-coloured pigment with antibiotic properties produced by
Serratia marcescens. It has a role as an antimicrobial agent..."""
response = generator.generate(query)
# Output: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"
```

For molecule generation, tags are **not required** in the input description.

See `LLM4Chem/README.md:103` for examples.

### 4. Chemical Reactions (2 tasks)

Predicts reaction products or reactants:

Predicts products from reactants:

```python
query = "<SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product."
response = generator.generate(query)
# Output: "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES> ."
```

See `LLM4Chem/README.md:115` for examples.

Predicts reactants from products:

```python
query = "Identify possible reactants for the product: <SMILES> CC1=CC=C(N)N=C1N </SMILES>"
response = generator.generate(query)
# Output: "<SMILES> CC(C#N)CCC#N.N </SMILES>"
```

See `LLM4Chem/README.md:122` for examples.

## SMolInstruct Dataset

LlaSMol models are trained on **SMolInstruct**, a large-scale chemistry instruction dataset:

**Dataset**: [https://huggingface.co/datasets/osunlp/SMolInstruct](https://huggingface.co/datasets/osunlp/SMolInstruct)

### Key Features

* **Scale**: Millions of instruction-response pairs
* **Coverage**: All 14 chemistry tasks with balanced distribution
* **Quality**: Curated and validated chemical data
* **Format**: Instruction-tuning format with special tags

### Training Details

Fine-tuning uses LoRA (Low-Rank Adaptation):

```bash
MODELNAME=LlaSMol-Mistral-7B
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
    --data_path osunlp/SMolInstruct \
    --base_model mistralai/Mistral-7B-v0.1 \
    --output_dir checkpoint/$MODELNAME
```

See `LLM4Chem/README.md:131` for training instructions.

## Tag System

LlaSMol uses specialized tags to structure chemistry information:

### Input Tags

* `<SMILES>`: Wraps SMILES representations in queries. Example: `<SMILES> CC(C)Cl </SMILES>`
* `<IUPAC>`: Wraps IUPAC names in queries. Example: `<IUPAC> aspirin </IUPAC>`

### Output Tags

* `<MOLFORMULA>`: Molecular formula in model responses. Example: `<MOLFORMULA> C9H8O4 </MOLFORMULA>`
* `<NUMBER>`: Numerical predictions (solubility, logD, etc.). Example: `<NUMBER> -1.41 </NUMBER>`
* `<BOOLEAN>`: Yes/No predictions (toxicity, BBB, etc.). Example: `<BOOLEAN> Yes </BOOLEAN>`

See `LLM4Chem/README.md:157` for complete tag documentation.
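
The tag convention can be illustrated with a minimal sketch. It assumes tags take the form `<TAG> content </TAG>`, with names such as `SMILES` and `NUMBER`. The helpers `wrap_tag` and `extract_tag` are hypothetical names introduced here for illustration; they are not part of LLM4Chem or ChemAgent.

```python
import re
from typing import Optional


def wrap_tag(value: str, tag: str) -> str:
    """Wrap a value as '<TAG> value </TAG>' (hypothetical helper)."""
    return f"<{tag}> {value.strip()} </{tag}>"


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <TAG> ... </TAG> span, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


# Build a tagged query (assumed tag name: SMILES).
query = f"How soluble is {wrap_tag('CC(C)Cl', 'SMILES')} ?"

# Parse a tagged reply (assumed tag name: NUMBER).
reply = "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
value = extract_tag(reply, "NUMBER")
log_solubility = float(value) if value is not None else None
```

Returning `None` for a missing tag, rather than raising, lets the caller decide whether an untagged reply is an error or simply free text, as in molecule captioning.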

## SMILES Canonicalization

LlaSMol automatically canonicalizes SMILES strings using RDKit:

```python
from rdkit import Chem

def canonicalize_smiles(smiles: str) -> str:
    """Convert SMILES to canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparseable input is returned unchanged
        return smiles
    return Chem.MolToSmiles(mol)
```

**Why canonicalization matters**:

* `CCO` and `OCC` represent the same molecule (ethanol)
* Canonical form ensures consistent training and inference
* Improves model accuracy and reduces ambiguity

Canonicalization happens automatically when SMILES are wrapped in `<SMILES>` tags.

See `LLM4Chem/README.md:168` for details.

## Usage in ChemAgent

### Direct Generation

The `answer_chemistry_query` tool wraps LlaSMol:

```python
# Excerpt: the `tool` decorator and `generator` come from the surrounding module
@tool
def answer_chemistry_query(query: str) -> str:
    """Answer a chemistry-related query using LlaSMol."""
    response = generator.generate(query)
    return response[0]["output"][0]
```

See `plan_execute_agent/chem_tools.py:124` for implementation.

### Query Format

Queries must be properly tagged before being sent to LlaSMol:

```python
# ❌ Incorrect - no tags
query = "What is the SMILES for aspirin?"

# ✅ Correct - IUPAC tagged
query = "What is the SMILES for <IUPAC> aspirin </IUPAC> ?"
```

The `structure_chem_prompt` tool handles automatic tagging.

### Response Handling

LlaSMol responses are stored for validation and error tracking:

```python
import plan_execute_agent.llasmol_response as llasmol_response

# Excerpt from inside the tool function
response = generator.generate(query)
llasmol_response.model_response = response  # Store for later access
return response[0]["output"][0]
```

See `plan_execute_agent/chem_tools.py:159` for response management.
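
The indexing `response[0]["output"][0]` used above raises an exception if the model returns an empty or unexpectedly shaped result. The following is a hedged sketch of a defensive accessor. `first_output` is a hypothetical helper that does not come from ChemAgent, and the response shape (a list of dicts, each with an `"output"` list of strings) is an assumption inferred from that indexing.

```python
from typing import Any, Optional


def first_output(response: Any) -> Optional[str]:
    """Return response[0]["output"][0], or None if the shape is unexpected."""
    try:
        candidate = response[0]["output"][0]
    except (IndexError, KeyError, TypeError):
        return None
    return candidate if isinstance(candidate, str) else None


# Stand-in for generator.generate(query); the shape is assumed, not confirmed.
fake_response = [{"output": ["<BOOLEAN> No </BOOLEAN>"]}]

answer = first_output(fake_response)
missing = first_output([])
```

A caller could treat `None` as a signal to retry or to report that LlaSMol produced no usable answer, instead of failing with an `IndexError` inside the tool.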

## Performance Characteristics

Performance varies by task:

* **Name conversions**: 80-90% exact match
* **Property predictions**: 60-85%, depending on the task
* **Molecule captioning**: Qualitative, high coherence
* **Reaction prediction**: 50-70% exact match

Approximate latency:

* **Single query**: 2-5 seconds on GPU
* **Batch processing**: \~1 second per query
* **First load**: +10 seconds for model initialization

Approximate memory use:

* **Model size**: \~14GB (7B parameters)
* **Peak memory**: \~16GB during inference
* **Batch size 1**: \~8GB VRAM minimum

## Evaluation Pipeline

LlaSMol includes tools for evaluation on SMolInstruct:

### Step 1: Generate Responses

```bash
python generate_on_dataset.py \
    --model_name osunlp/LlaSMol-Mistral-7B \
    --output_dir eval/LlaSMol-Mistral-7B/output
```

### Step 2: Extract Predictions

```bash
python extract_prediction.py \
    --output_dir eval/LlaSMol-Mistral-7B/output \
    --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

### Step 3: Compute Metrics

```bash
python compute_metrics.py \
    --prediction_dir eval/LlaSMol-Mistral-7B/prediction
```

See `LLM4Chem/README.md:172` for evaluation documentation.

## Limitations

**Known Limitations**:

1. **Task Scope**: Only supports the 14 trained tasks
2. **Complex Queries**: May struggle with multi-step reasoning
3. **Novel Compounds**: Less accurate for molecules not in training data
4. **Numerical Precision**: Property predictions are approximate
5. **Context Length**: Limited by the base model's context window

## Next Steps

* Learn the tag system in detail
* See how LlaSMol fits into the workflow