Research
Nov 15, 2025

SynthonGPT: Diversity-Oriented Retrieval in Ultra-Large Enumerated Chemical Spaces

A compact synthon-conditioned transformer for navigating makeable chemical space, grounded in vendor enumerations rather than hallucinated SMILES.

Rather than generating arbitrary SMILES with no synthetic grounding, SynthonGPT conditions generation on synthesis-aware building blocks (synthons) and grounds its outputs in vendor enumerations, which keeps the model aligned with practical discovery workflows where proposed molecules must actually be makeable.

Highlights

  • On count-matched benchmarks, SynthonGPT recovers up to 3.1x more unique scaffolds than F-Trees and 1.76x more than SpaceLight, while maintaining lower mean similarity among retrieved compounds.
  • The model has roughly 90M parameters, trains in about 10 hours on a single RTX 4090, and supports sub-second inference on CPU and GPU.