AtomDisc: An Interpretable Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure–Property Relationships
Type: publication, Submitted to Nature Machine Intelligence, 2025
Recent advances in large language models (LLMs) have spurred growing interest in applying them to molecular modeling and property prediction. However, existing molecular LLMs either rely solely on SMILES strings, and so ignore rich atomic and structural context, or introduce molecular features through auxiliary adapters, which fall short of true modality integration and limit interpretability. Here, we present AtomDisc, an interpretable atom-level tokenizer that discretizes local chemical environments into structure-aware tokens embedded directly within SMILES sequences. This unified representation enables LLMs to jointly model chemical syntax and atomic structure, yielding both performance gains and atom-level interpretability. AtomDisc achieves state-of-the-art results on molecular generation and property prediction tasks, and systematic case studies and attribution analysis reveal new structure–property relationships, demonstrating its potential for AI-driven scientific discovery in chemistry.
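To make the idea of "discretizing local chemical environments into tokens embedded within SMILES" concrete, the following toy sketch may help. It is not the AtomDisc method: the abstract does not specify how environments are encoded or discretized, so here a hand-written signature (the atom's element plus the sorted elements of its bonded neighbors) stands in for whatever representation the authors use, and a first-seen lookup table stands in for their codebook. The function names, the `<atom_k>` token format, and the explicit atom/bond lists are all hypothetical, and the sketch handles only the atom symbols in order, ignoring SMILES branches, ring closures, and bond symbols.

```python
def environment_signature(atom_idx, elements, bonds):
    """Toy local environment: the atom's element plus its sorted neighbor elements."""
    neighbors = [b if a == atom_idx else a for a, b in bonds if atom_idx in (a, b)]
    neighbor_elements = sorted(elements[n] for n in neighbors)
    return elements[atom_idx] + "|" + "".join(neighbor_elements)


def build_codebook(signatures):
    """Assign integer token IDs to signatures in first-seen order (hypothetical codebook)."""
    codebook = {}
    for sig in signatures:
        codebook.setdefault(sig, len(codebook))
    return codebook


def tokenize(elements, bonds, codebook):
    """Interleave each atom symbol with a token naming its discretized environment."""
    pieces = []
    for i, element in enumerate(elements):
        sig = environment_signature(i, elements, bonds)
        pieces.append(f"{element}<atom_{codebook[sig]}>")
    return "".join(pieces)


if __name__ == "__main__":
    # Heavy atoms of ethanol (SMILES "CCO"), written out as an explicit graph.
    elements = ["C", "C", "O"]
    bonds = [(0, 1), (1, 2)]
    signatures = [environment_signature(i, elements, bonds) for i in range(len(elements))]
    codebook = build_codebook(signatures)
    print(tokenize(elements, bonds, codebook))  # C<atom_0>C<atom_1>O<atom_2>
```

In this toy version, atoms with identical local environments share a token: in propane (`CCC`) the two terminal carbons receive the same ID while the central carbon receives a different one. That shared vocabulary over environments is the property that would let a sequence model attend to structure as well as syntax.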