Generation - ChemAgent

LlaSMolGeneration

The LlaSMolGeneration class provides an interface for generating chemistry predictions using fine-tuned LlaSMol models.

Initialization

from generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    model_name="osunlp/LlaSMol-Mistral-7B",
    base_model="mistralai/Mistral-7B-v0.1",
    device="cuda"
)

Parameters

model_name

string

required

The name or path of the fine-tuned LlaSMol model. Supported models:

osunlp/LlaSMol-Mistral-7B
osunlp/LlaSMol-Galactica-6.7B
osunlp/LlaSMol-Llama2-7B
osunlp/LlaSMol-CodeLlama-7B

base_model

string

default:"None"

The base model architecture. If not specified, it will be inferred from the model_name using the BASE_MODELS mapping in config.py.
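
The lookup described above can be sketched as follows. This is a minimal illustration, not the code in config.py: `BASE_MODELS_SKETCH` and `resolve_base_model` are hypothetical names, and the mapping holds only the one pair shown in the initialization example.

```python
# Hypothetical stand-in for the BASE_MODELS mapping in config.py.
# Only the pair from the initialization example is included here.
BASE_MODELS_SKETCH = {
    "osunlp/LlaSMol-Mistral-7B": "mistralai/Mistral-7B-v0.1",
}


def resolve_base_model(model_name, base_model=None):
    """Return the explicit base_model, or look it up from model_name."""
    if base_model is not None:
        return base_model
    try:
        return BASE_MODELS_SKETCH[model_name]
    except KeyError:
        raise ValueError(
            f"No base model known for {model_name!r}; pass base_model explicitly."
        ) from None


print(resolve_base_model("osunlp/LlaSMol-Mistral-7B"))
```

Passing base_model explicitly bypasses the lookup, which is the safer choice for a local checkpoint path that a mapping cannot know about.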

device

string

default:"None"

The device to run inference on, either cuda or cpu. If None, CUDA is selected automatically when it is available.
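
A device default of this kind could be resolved roughly as shown here. `resolve_device` is a hypothetical helper, and the fallback to cpu when CUDA (or PyTorch itself) is unavailable is an assumption rather than documented behavior.

```python
def resolve_device(device=None):
    """Return the requested device, or pick one when device is None."""
    if device is not None:
        return device
    try:
        import torch  # imported lazily so this sketch also runs without PyTorch
    except ImportError:
        return "cpu"  # assumption: fall back to CPU
    # torch.cuda.is_available() reports whether a usable CUDA device exists.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("cpu"))
```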

generate()

Generate predictions for input text prompts.

outputs = generator.generate(
    input_text="Convert the IUPAC name to SMILES: 2-acetoxybenzoic acid",
    batch_size=1,
    max_input_tokens=512,
    max_new_tokens=1024,
    canonicalize_smiles=True,
    print_out=True,
    num_return_sequences=1,
    num_beams=4
)

Parameters

input_text

string | list[string]

required

The input prompt(s) for generation. Can be a single string or a list of strings.

batch_size

int

default:"1"

Number of samples to process in parallel.
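
To illustrate what batch_size controls, the sketch below splits a list of prompts into consecutive batches. How generate() batches internally is not shown in this reference, so `iter_batches` is a hypothetical helper written under that assumption.

```python
def iter_batches(items, batch_size=1):
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["prompt A", "prompt B", "prompt C"]
for batch in iter_batches(prompts, batch_size=2):
    print(batch)
```

With three prompts and a batch size of two, the last batch holds a single prompt; a larger batch size trades memory for fewer forward passes.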

max_input_tokens

int

default:"512"

Maximum number of tokens allowed in the input text. Inputs exceeding this limit are skipped, and their output is returned as None.

max_new_tokens

int

default:"1024"

Maximum number of tokens to generate in the response.

canonicalize_smiles

bool

default:"True"

Whether to canonicalize SMILES strings in the input text before generation.

print_out

bool

default:"False"

Whether to print input and output pairs during generation.

**generation_settings

dict

Additional generation parameters passed to the Hugging Face GenerationConfig:

num_return_sequences: Number of sequences to generate per input
num_beams: Number of beams for beam search
temperature: Sampling temperature
top_p: Nucleus sampling parameter
Any other generation parameter accepted by Hugging Face GenerationConfig
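
When combining num_return_sequences and num_beams, note that Hugging Face transformers rejects non-sampling generation that asks for more sequences than there are beams. The check below is a small sketch of that rule. `check_beam_settings` is a hypothetical helper, and it assumes the Hugging Face defaults of 1 beam, 1 sequence, and no sampling for omitted keys, which the LlaSMol wrapper may override.

```python
def check_beam_settings(**generation_settings):
    """Raise early if the beam settings would be refused at generation time."""
    num_beams = generation_settings.get("num_beams", 1)
    num_return_sequences = generation_settings.get("num_return_sequences", 1)
    do_sample = generation_settings.get("do_sample", False)
    # Greedy and beam search can return at most num_beams sequences per input.
    if not do_sample and num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences={num_return_sequences} exceeds "
            f"num_beams={num_beams}"
        )
    return generation_settings


settings = check_beam_settings(num_return_sequences=3, num_beams=6)
print(settings)
```

The validated dictionary can then be unpacked into the call, for example generator.generate(input_text=..., **settings).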

Returns

A list of dictionaries, one for each input, containing:

input_text: Original input text
real_input_text: Processed input text (with canonicalized SMILES and chat formatting)
output: List of generated text sequences (or None if input was too long)
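
Because output is None for skipped inputs, downstream code should guard before indexing into it. The sketch below does this for a hand-written result list in the documented shape; `first_predictions` is a hypothetical helper and the example values are placeholders, not real model output.

```python
def first_predictions(results):
    """Map each input_text to its top prediction, or None if the input was skipped."""
    predictions = {}
    for item in results:
        outputs = item["output"]
        predictions[item["input_text"]] = outputs[0] if outputs else None
    return predictions


# Placeholder data in the documented return shape (not real model output).
example_results = [
    {
        "input_text": "prompt A",
        "real_input_text": "formatted prompt A",
        "output": ["answer 1", "answer 2"],
    },
    {
        "input_text": "prompt B",
        "real_input_text": "formatted prompt B",
        "output": None,  # input exceeded max_input_tokens
    },
]

print(first_predictions(example_results))
```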

create_sample()

Create a tokenized sample from input text.

sample = generator.create_sample(
    text="Convert IUPAC to SMILES: benzene",
    canonicalize_smiles=True,
    max_input_tokens=512
)

Parameters

text

string

required

The input text to process.

canonicalize_smiles

bool

default:"True"

Whether to canonicalize SMILES strings in the input.

max_input_tokens

int

default:"None"

Maximum token limit. If exceeded, the sample will be marked with input_too_long=True.

Returns

A dictionary containing:

input_text: Original input text
real_input_text: Formatted prompt with chat template
input_ids: Tokenized input IDs
attention_mask: Attention mask
labels: Labels for training (same as input_ids)
input_too_long: Boolean flag, set to True when the input exceeds max_input_tokens
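
The input_too_long flag can be pictured as a length check on input_ids. The sketch below is an assumption about how such a flag might be derived (a strict comparison on the token count, disabled when the limit is None). `flag_too_long` is a hypothetical helper, not the library's implementation.

```python
def flag_too_long(sample, max_input_tokens=None):
    """Return a copy of sample with input_too_long derived from the token count."""
    flagged = dict(sample)
    flagged["input_too_long"] = (
        max_input_tokens is not None
        and len(sample["input_ids"]) > max_input_tokens
    )
    return flagged


# Placeholder token IDs, used only to exercise the check.
sample = {"input_ids": [5, 6, 7, 8], "attention_mask": [1, 1, 1, 1]}
print(flag_too_long(sample, max_input_tokens=3)["input_too_long"])
```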

Usage Examples

Basic Generation

from generation import LlaSMolGeneration

# Initialize generator
generator = LlaSMolGeneration(
    model_name="osunlp/LlaSMol-Mistral-7B",
    device="cuda"
)

# Generate prediction
outputs = generator.generate(
    input_text="What is the molecular formula of caffeine?",
    print_out=True
)

print(outputs[0]['output'][0])

Batch Generation

inputs = [
    "Convert IUPAC to SMILES: ethanol",
    "Convert IUPAC to SMILES: acetic acid",
    "Convert IUPAC to SMILES: benzene"
]

outputs = generator.generate(
    input_text=inputs,
    batch_size=4,
    num_return_sequences=3,
    num_beams=6
)

for inp, out in zip(inputs, outputs):
    print(f"Input: {inp}")
    print(f"Outputs: {out['output']}")

Name Conversion (IUPAC to SMILES)

from config import TASKS_GENERATION_SETTINGS

# Use task-specific settings
task = "name_conversion-i2s"
task_settings = TASKS_GENERATION_SETTINGS.get(task, {})
generation_kargs = task_settings.get("generation_kargs", {})

outputs = generator.generate(
    input_text="Convert IUPAC name to SMILES: 2-acetoxybenzoic acid",
    **generation_kargs
)

Forward Synthesis

from config import TASKS_GENERATION_SETTINGS

# Use the generation settings defined for the forward synthesis task
task = "forward_synthesis"
task_settings = TASKS_GENERATION_SETTINGS[task]

outputs = generator.generate(
    input_text="Predict the product: <SMILES> CC(=O)OC1=CC=CC=C1C(=O)O </SMILES>",
    **task_settings.get("generation_kargs", {})
)

Helper Functions

canonicalize_smiles_in_text()

Canonicalizes every SMILES string in the text that is wrapped in <SMILES> tags.

from generation import canonicalize_smiles_in_text

text = "The molecule is <SMILES> c1ccccc1 </SMILES>"
canonicalized = canonicalize_smiles_in_text(
    text,
    tags=('<SMILES>', '</SMILES>'),
    keep_text_unchanged_if_no_tags=True,
    keep_text_unchanged_if_error=False
)
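
The tag-based rewriting can be sketched with a regular expression. This is an illustration only: `canonicalize_tagged` is a hypothetical function, the canonicalizer is passed in as a callable (a toy lookup table here, where a real pipeline would presumably call a cheminformatics toolkit), and the two keep_text_unchanged_* flags are not modeled.

```python
import re


def canonicalize_tagged(text, canonicalize, tags=("<SMILES>", "</SMILES>")):
    """Apply canonicalize to every span enclosed by the given tag pair."""
    open_tag, close_tag = (re.escape(tag) for tag in tags)
    pattern = re.compile(f"{open_tag}(.*?){close_tag}", flags=re.DOTALL)

    def rewrite(match):
        smiles = match.group(1).strip()
        return f"{tags[0]} {canonicalize(smiles)} {tags[1]}"

    return pattern.sub(rewrite, text)


# Toy canonicalizer: a lookup table standing in for a real toolkit call.
toy_table = {"OCC": "CCO", "c1ccccc1": "c1ccccc1"}
print(canonicalize_tagged("Ethanol is <SMILES> OCC </SMILES>", toy_table.get))
```

Text without tags passes through unchanged, since the pattern finds nothing to substitute.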

tokenize()

Tokenizes prompt text for the model.

from generation import tokenize

tokenizer = generator.tokenizer
result = tokenize(
    tokenizer,
    prompt="Convert to SMILES: benzene",
    add_eos_token=True
)
# Returns: {'input_ids': [...], 'attention_mask': [...], 'labels': [...]}
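
The returned structure can be illustrated without loading a model. The sketch below uses a toy whitespace tokenizer in place of the Hugging Face tokenizer and assumes that add_eos_token appends the EOS ID and that labels are a copy of input_ids. `ToyTokenizer` and `tokenize_sketch` are hypothetical names, not part of generation.py.

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a Hugging Face tokenizer."""

    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        # Assign IDs 1, 2, 3, ... in order of first appearance.
        ids = [self.vocab.setdefault(tok, len(self.vocab) + 1) for tok in text.split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def tokenize_sketch(tokenizer, prompt, add_eos_token=True):
    """Tokenize prompt, optionally append EOS, and mirror input_ids into labels."""
    result = tokenizer(prompt)
    ids = result["input_ids"]
    if add_eos_token and (not ids or ids[-1] != tokenizer.eos_token_id):
        ids.append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = list(ids)  # assumption: labels are a copy of input_ids
    return result


print(tokenize_sketch(ToyTokenizer(), "Convert to SMILES: benzene"))
```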

File Location

LLM4Chem/generation.py

The generation module automatically handles chat formatting, SMILES canonicalization, and batch processing. For custom tasks, refer to config.py for task-specific generation settings.
