LlaSMolGeneration
The LlaSMolGeneration class provides an interface for generating chemistry predictions using fine-tuned LlaSMol models.
Initialization
Parameters
The name or path of the fine-tuned LlaSMol model. Supported models:
- osunlp/LlaSMol-Mistral-7B
- osunlp/LlaSMol-Galactica-6.7B
- osunlp/LlaSMol-Llama2-7B
- osunlp/LlaSMol-CodeLlama-7B
The base model architecture. If not specified, it will be inferred from the model_name using the BASE_MODELS mapping in config.py.
The device to run inference on. Options: cuda or cpu. If None, automatically selects CUDA if available.
generate()
Generate predictions for input text prompts.
Parameters
The input prompt(s) for generation. Can be a single string or a list of strings.
Number of samples to process in parallel.
Maximum number of tokens for input text. Inputs exceeding this limit will be skipped.
Maximum number of tokens to generate in the response.
Whether to canonicalize SMILES strings in the input text before generation.
Whether to print input and output pairs during generation.
Additional generation parameters passed to the Hugging Face GenerationConfig:
- num_return_sequences: Number of sequences to generate per input
- num_beams: Number of beams for beam search
- temperature: Sampling temperature
- top_p: Nucleus sampling parameter
- Any other Hugging Face generation parameters
Returns
A list of dictionaries, one for each input, containing:
- input_text: Original input text
- real_input_text: Processed input text (with canonicalized SMILES and chat formatting)
- output: List of generated text sequences (or None if the input was too long)
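As a rough sketch of how the returned list might be consumed, the helper below separates entries that produced output from inputs that were skipped for exceeding the input limit. The name `split_results` is hypothetical and not part of the library. Only the dictionary keys `input_text` and `output` come from the description above, and the sample data is hand-written for illustration.

```python
from typing import Any, Dict, List, Tuple


def split_results(
    results: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split generate() results into answered entries and skipped inputs.

    Hypothetical helper: it relies only on the documented keys
    `input_text` and `output` (None when the input was too long).
    """
    answered, skipped = [], []
    for item in results:
        if item["output"] is None:
            skipped.append(item["input_text"])
        else:
            answered.append(item)
    return answered, skipped


# Hand-written stand-in for a generate() return value, for illustration only.
example_results = [
    {"input_text": "prompt A", "real_input_text": "formatted A", "output": ["answer A"]},
    {"input_text": "prompt B", "real_input_text": "formatted B", "output": None},
]
answered, skipped = split_results(example_results)
```

Collecting the skipped inputs separately makes it easier to retry them with a larger input limit or a shorter prompt.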
create_sample()
Create a tokenized sample from input text.
Parameters
The input text to process.
Whether to canonicalize SMILES strings in the input.
Maximum token limit. If exceeded, the sample will be marked with input_too_long=True.
Returns
A dictionary containing:
- input_text: Original input text
- real_input_text: Formatted prompt with chat template
- input_ids: Tokenized input IDs
- attention_mask: Attention mask
- labels: Labels for training (same as input_ids)
- input_too_long: Boolean flag set when the input exceeds max_input_tokens
Usage Examples
Basic Generation
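The following is a minimal sketch of a single-prompt call, not the original example from this page. It assumes the class can be imported from generation.py (the file location listed in this document) and that the model name is the first constructor argument. The helper names `load_generator` and `ask`, and the prompt wording in the trailing comment, are illustrative.

```python
from typing import Any, Dict, List


def load_generator(model_name: str = "osunlp/LlaSMol-Mistral-7B") -> Any:
    """Build a generator for one of the supported checkpoints.

    Assumption: LlaSMolGeneration lives in generation.py and takes the
    model name as its first argument.
    """
    # Imported lazily so this module can be loaded without the model code.
    from generation import LlaSMolGeneration

    return LlaSMolGeneration(model_name)


def ask(generator: Any, prompt: str) -> List[str]:
    """Send one prompt and return its generated sequences.

    Returns an empty list when the input was skipped (output is None).
    """
    results: List[Dict[str, Any]] = generator.generate(prompt)
    output = results[0]["output"]
    return output if output is not None else []


# Typical use (loads a multi-billion-parameter model, so it is commented out):
# generator = load_generator()
# print(ask(generator, "Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES> ?"))
```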
Batch Generation
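A batch call might look like the sketch below, which passes a list of prompts and pairs each original prompt with its first generated sequence. The keyword name `batch_size` is an assumption, since the parameter names are not shown on this page. `generate_batch` is a hypothetical helper, and `generator` is assumed to be an LlaSMolGeneration instance.

```python
from typing import Any, List, Optional, Tuple


def generate_batch(
    generator: Any,
    prompts: List[str],
    batch_size: int = 4,
) -> List[Tuple[str, Optional[str]]]:
    """Run several prompts and return (input_text, first_output) pairs.

    Assumption: the number of samples processed in parallel is accepted
    as the keyword `batch_size`. Skipped inputs are paired with None.
    """
    results = generator.generate(prompts, batch_size=batch_size)
    pairs: List[Tuple[str, Optional[str]]] = []
    for item in results:
        output = item["output"]
        first = output[0] if output else None
        pairs.append((item["input_text"], first))
    return pairs
```

Returning pairs rather than a dictionary keeps the input order and tolerates duplicate prompts.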
Name Conversion (IUPAC to SMILES)
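The sketch below shows one way a name-conversion prompt could be built and the answer pulled out of the generated text. The `<IUPAC>` tag, the prompt wording, and the expectation that the model wraps its answer in `<SMILES>` tags are assumptions not confirmed by this page. Both helper names are hypothetical.

```python
import re
from typing import Optional

# Matches the content of the first <SMILES> ... </SMILES> span.
_SMILES_TAG = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def iupac_to_smiles_prompt(iupac_name: str) -> str:
    """Build a prompt asking for the SMILES of an IUPAC name.

    Assumption: IUPAC names are wrapped in <IUPAC> tags, by analogy with
    the <SMILES> tags mentioned in this document.
    """
    return f"Could you provide the SMILES for <IUPAC> {iupac_name} </IUPAC> ?"


def extract_smiles(generated_text: str) -> Optional[str]:
    """Return the first tagged SMILES string, or None if no tag is present."""
    match = _SMILES_TAG.search(generated_text)
    return match.group(1) if match else None


# With a loaded generator (assumed to be an LlaSMolGeneration instance):
# results = generator.generate(iupac_to_smiles_prompt("oxolane"))
# smiles = extract_smiles(results[0]["output"][0])
```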
Forward Synthesis
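For forward synthesis, a sketch under similar assumptions joins the reactant and reagent SMILES with `.` (the standard SMILES separator for disconnected components) and requests several candidates through the documented `num_beams` and `num_return_sequences` settings. The prompt wording and both helper names are illustrative, not taken from the library.

```python
from typing import Any, List


def forward_synthesis_prompt(reactants: List[str]) -> str:
    """Build a product-prediction prompt from reactant/reagent SMILES.

    Assumption: the prompt wording is illustrative; only the <SMILES>
    tagging convention is taken from this document.
    """
    joined = ".".join(reactants)
    return (
        f"<SMILES> {joined} </SMILES> Based on the reactants and reagents "
        "given above, suggest a possible product."
    )


def predict_products(generator: Any, reactants: List[str], num_candidates: int = 3) -> List[str]:
    """Return up to `num_candidates` generated product strings.

    Beam search with num_beams equal to num_return_sequences yields that
    many ranked candidates for the single prompt.
    """
    results = generator.generate(
        forward_synthesis_prompt(reactants),
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
    )
    return results[0]["output"] or []
```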
Helper Functions
canonicalize_smiles_in_text()
Canonicalizes SMILES strings within text that are wrapped in <SMILES> tags.
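The idea can be illustrated with a small regex-based sketch. This is not the library's implementation: the function name `canonicalize_tagged_smiles` is hypothetical, and the canonicalizer is passed in as a callable so the example stays free of chemistry dependencies.

```python
import re
from typing import Callable

# Captures the opening tag, the tagged content, and the closing tag.
_SMILES_SPAN = re.compile(r"(<SMILES>)(.*?)(</SMILES>)", re.DOTALL)


def canonicalize_tagged_smiles(text: str, canonicalize: Callable[[str], str]) -> str:
    """Rewrite every <SMILES> ... </SMILES> span using `canonicalize`.

    Text outside the tags is left untouched. The rewritten span is
    normalized to single spaces around the SMILES string.
    """

    def _replace(match: "re.Match[str]") -> str:
        smiles = match.group(2).strip()
        return f"{match.group(1)} {canonicalize(smiles)} {match.group(3)}"

    return _SMILES_SPAN.sub(_replace, text)


# With RDKit installed, a canonicalizer could be written as (assumption,
# not necessarily what the library does):
# from rdkit import Chem
# def rdkit_canonical(s: str) -> str:
#     return Chem.MolToSmiles(Chem.MolFromSmiles(s))
```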
tokenize()
Tokenizes prompt text for the model.
File Location
LLM4Chem/generation.py
The generation module automatically handles chat formatting, SMILES canonicalization, and batch processing. For custom tasks, refer to config.py for task-specific generation settings.