Equivariant Denoisers Cannot Copy Graphs: Align your Graph Diffusion Models

DiffAlign Logo
Aalto University1, YaiYai Ltd2
International Conference on Learning Representations (ICLR) 2025

*Indicates Equal Contribution
An overview of our method

We use node identifiers to align source and target graphs, thus breaking the self-symmetries in the input space, while preserving permutation equivariance in non-matching graph portions.

Abstract

Graph diffusion models, dominant in graph generative modeling, remain underexplored for graph-to-graph translation tasks like chemical reaction prediction. We demonstrate that standard permutation equivariant denoisers face fundamental limitations in these tasks due to their inability to break symmetries in noisy inputs. To address this, we propose aligning input and target graphs to break input symmetries while preserving permutation equivariance in non-matching graph portions. Using retrosynthesis (i.e., the task of predicting precursors for synthesis of a given target molecule) as our application domain, we show how alignment dramatically improves discrete diffusion model performance from $5\%$ to a SOTA-matching $54.7\%$ top-1 accuracy.

Diffusion for graph translation

Diffusion has been successfully applied to a wide range of tasks, including molecular generation ([Vignac et al., 2023](https://arxiv.org/abs/2303.00537), [Hoogeboom et al., 2022](https://arxiv.org/abs/2203.17003)). The most notable advantages of diffusion models over other generative models are their ability to generate diverse samples and to condition on additional information. We want to bring these benefits to graph translation: the task of predicting a graph structure given another graph structure.

Diffusion models in this context start with a noisy graph (e.g. made of empty nodes and no edges) and an input (or source) graph. The denoising process iteratively transforms the noisy graph into a target graph representing a transformation of the input graph, conditioning each step on the source graph. We show the denoising process in the context of retrosynthesis (predicting precursors for a given target molecule) below, and we define the diffusion process formally in the next section.

Background

Consider a database of $N_\mathrm{obs}$ graphs $\mathcal{D} = \{ (\X_n, \Y_n, \P^{\Y\to\X}_n) \}_{n=1}^{N_\mathrm{obs}}$, where $\X_n$ represents the target graph, $\Y_n$ the input graph, and $\P^{\Y\to\X}_n$ are matrices defining node mappings between the two graphs. The graph translation task is: given that the data is sampled from an unknown distribution $p(\X,\Y,\P^{\Y\to\X})$, predict valid targets $\X \sim p(\X\mid\Y)$ for a given input $\Y$. We define the general diffusion process via a forward process \begin{equation}\textstyle \label{eq:forward-graph-diffusion} q(\X_{t+1} \mid \X_t) = \prod^{N_X}_{i=1} q(\X^{\N,i}_{t+1} \mid \X^{\N,i}_t) \prod_{i,j=1}^{N_X} q(\X^{E,ij}_{t+1} \mid \X^{E,ij}_t), \end{equation} which diffuses the target graph into noise, and a reverse process \begin{equation}\textstyle \label{eq:diff-backward} p_{\theta}(\X_{t-1} \mid \X_t, \Y) = \prod_{i=1}^{N_X} p_{\theta}(\X^{\N,i}_{t-1} \mid \X_t, \Y) \prod_{i,j=1}^{N_X} p_{\theta}(\X^{E,ij}_{t-1} \mid \X_t, \Y), \end{equation} which defines our generative model. We use the neural network to predict the ground-truth labels from noised samples, meaning that it outputs a distribution $\tilde p_{\theta}\big(\X_0 \mid \X_t, \Y\big)$. The reverse process is then parameterized by \begin{equation}\textstyle \label{eq:conditional-node-backward} p_{\theta}(\X_{t-1} \mid \X_t, \Y) = \sum_{\X_0} q\big(\X_{t-1} \mid \X_t, \X_0 \big) \tilde p_{\theta}\big(\X_0 \mid \X_t, \Y\big). \end{equation} It is common [CITE] to choose $\tilde p_{\theta}\big(\X_0 \mid \X_t, \Y\big)$ to be a permutation equivariant model. Next, we show how this choice limits the performance of the model.
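
To make the notation concrete, here is a minimal numpy sketch of the forward step and the $\X_0$-parameterized reverse step, restricted to node labels and assuming a uniform-transition kernel (a common choice in discrete diffusion). All names, the kernel, and the loop structure are illustrative rather than the paper's implementation.

```python
import numpy as np

def uniform_kernel(beta_t, K):
    """Transition matrix Q_t = (1 - beta_t) I + beta_t / K (uniform-noise assumption)."""
    return (1 - beta_t) * np.eye(K) + beta_t * np.ones((K, K)) / K

def forward_step(x_t, Q_t, rng):
    """Sample X_{t+1} ~ q(X_{t+1} | X_t) independently per node; x_t is (N, K) one-hot."""
    probs = x_t @ Q_t
    return np.eye(x_t.shape[1])[[rng.choice(len(p), p=p) for p in probs]]

def reverse_step(x_t, x0_probs, Q_t, Q_bar_tm1, rng):
    """Sample X_{t-1} ~ p_theta(X_{t-1} | X_t, Y) by mixing the posterior
    q(X_{t-1} | X_t, X_0) with the network prediction x0_probs = p_theta(X_0 | X_t, Y)."""
    N, K = x_t.shape
    out = np.zeros_like(x_t)
    for i in range(N):
        # posterior[a, b] is proportional to q(x_t | x_{t-1}=b) * q(x_{t-1}=b | x_0=a)
        posterior = (x_t[i] @ Q_t.T)[None, :] * Q_bar_tm1
        posterior /= posterior.sum(axis=1, keepdims=True)
        probs = x0_probs[i] @ posterior          # marginalise over the predicted x_0
        out[i, rng.choice(K, p=probs)] = 1.0
    return out
```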

Equivariance and Self-Symmetry

It has been noted in the literature [CITE] that permutation equivariant functions struggle to map from a highly self-symmetrical input space to a less self-symmetrical output space. To confirm this observation, we study the task of translating a "ring" molecular structure (self-symmetrical) into a "tail" molecular structure (less self-symmetrical).

Trioxane (highly self-symmetrical)

3-chloropropan-1-ol (less self-symmetrical)

We train two models to translate between the two structures. The first model is an equivariant GNN (GraphTransformer) and the second is a non-equivariant GNN (achieved by adding unique positional encodings to the input features of each node and edge). Both models are trained with the same hyperparameters (i.e. the same number of layers, hidden dimensions, etc.). We share the samples generated by the two models below, and you can run the models yourself in this notebook.
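
A minimal sketch of the symmetry-breaking trick behind the non-equivariant baseline, shown for node features only (the experiment also encodes edges); the function name and shapes are illustrative.

```python
import numpy as np

def add_unique_node_encodings(node_feats):
    """node_feats: (N, d) array of node features. Returns (N, d + N) features in which
    the appended one-hot node index makes every node distinguishable, so a GNN consuming
    them is no longer constrained to produce permutation-equivariant outputs."""
    n = node_feats.shape[0]
    return np.concatenate([node_feats, np.eye(n)], axis=-1)

feats = np.zeros((6, 4))                         # six identical nodes: a fully symmetric input
unique_feats = add_unique_node_encodings(feats)  # now every node has a distinct feature row
```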

Samples from the equivariant GNN (top row) and the non-equivariant GNN (bottom row).

What this means for Diffusion Models

Diffusion models iteratively remove noise from a noisy input to transform it into a target output. However, the noisy input is likely to be self-symmetrical (e.g. an empty graph), and the model struggles to produce a non-symmetrical output. To showcase this effect, we design a simple experiment where the goal is to copy a simple grid structure (left), starting the denoising from an empty, self-symmetrical graph (right).

Target grid structure (left) and the empty starting graph (right).

With few diffusion steps (e.g. 20), we can see that the model struggles to copy the input grid, a task that should be trivial for a deep neural network.

Samples with few diffusion steps (e.g. 20).

However, if we run the diffusion process for more steps, the model starts to produce outputs that increasingly resemble the target grid. You can explore the effect of the number of diffusion steps on the model's performance in this notebook.

Samples with more diffusion steps.

What happens, mathematically?

It turns out that an equivariant model (and, by extension, the denoiser in the diffusion process), caught between the need to break symmetries and the requirement to remain equivariant, will eventually learn a distribution equal to the marginal distribution of the input labels. Formally, we write

Theorem (The optimal permutation equivariant denoiser). Let $D_\theta(\X_T, \Y)$ be permutation equivariant s.t. $D_\theta(\P\X_T, \Y) = \P D_\theta(\X_T, \Y)$, and let $q(\X_T)$ be permutation invariant. The optimal solution with respect to the cross-entropy loss on identity data (where the target graph equals the input $\Y$) is, for all nodes $i$ and $j$, \begin{equation} \begin{cases} D_\theta(\X_T, \Y)^\N_{i,:} = \hat\y^\N, & \hat \y^\N_k = \sum_{i} \Y^\N_{i,k} / \sum_{i,k} \Y^\N_{i,k},\\ D_\theta(\X_T, \Y)^E_{i,j,:} = \hat\y^E, & \hat \y^E_k = \sum_{i,j} \Y^E_{i,j,k} / \sum_{i,j,k} \Y^E_{i,j,k}, \end{cases} \end{equation} where $\hat\y^\N$ and $\hat\y^E$ are the marginal distributions of node and edge labels in $\Y$.
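
To make the statement concrete, the following sketch computes the marginals $\hat\y^\N$ and $\hat\y^E$ for a small random $\Y$ and tiles them over all nodes and edges; this is exactly the output the theorem assigns to the optimal equivariant denoiser. Shapes, sizes, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_nodes, K_edges = 6, 3, 2
Y_nodes = np.eye(K_nodes)[rng.integers(K_nodes, size=N)]       # (N, K_nodes) one-hot node labels
Y_edges = np.eye(K_edges)[rng.integers(K_edges, size=(N, N))]  # (N, N, K_edges) one-hot edge labels

# Marginal label distributions of Y, as in the theorem.
y_hat_nodes = Y_nodes.sum(axis=0) / Y_nodes.sum()
y_hat_edges = Y_edges.sum(axis=(0, 1)) / Y_edges.sum()

# The optimal equivariant denoiser outputs these marginals for *every* node and edge,
# regardless of the noisy input X_T: it cannot single out which node should get which label.
D_nodes = np.tile(y_hat_nodes, (N, 1))       # (N, K_nodes)
D_edges = np.tile(y_hat_edges, (N, N, 1))    # (N, N, K_edges)
```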

Solution: Aligned Equivariance

So how can we help the model break self-symmetries while maintaining equivariance? We can use node identifiers to match the source and target graphs! Assume we have a source graph $G$ and a target graph $G'$. If we know some connection between the nodes of $G$ and $G'$, we can use that to align the two graphs. In retrosynthesis, we can identify where nodes in the source graph (products) are in the target graph (reactants) through atom-mapping.
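
For retrosynthesis, this correspondence can be read directly from atom-map numbers. Below is a minimal sketch of recovering a binary alignment matrix $\P^{\Y\to\X}$ from atom-mapped SMILES using RDKit; the SMILES strings and names are illustrative and not the paper's data pipeline.

```python
from rdkit import Chem
import numpy as np

# Atom-mapped SMILES: the ":1", ":2", ... suffixes label corresponding atoms.
product = Chem.MolFromSmiles("[CH3:1][C:2](=[O:3])[OH:4]")            # source graph Y (product)
reactants = Chem.MolFromSmiles("[CH3:1][C:2](=[O:3])[Cl:5].[OH2:4]")  # target graph X (reactants)

def map_num_to_idx(mol):
    """Map each atom-map number to the corresponding atom index in the molecule."""
    return {a.GetAtomMapNum(): a.GetIdx() for a in mol.GetAtoms() if a.GetAtomMapNum() > 0}

src, tgt = map_num_to_idx(product), map_num_to_idx(reactants)

# Binary alignment matrix: P[i, j] = 1 iff product atom i and reactant atom j share a map number.
P = np.zeros((product.GetNumAtoms(), reactants.GetNumAtoms()))
for m, i in src.items():
    if m in tgt:
        P[i, tgt[m]] = 1.0
```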

For other applications, if the graphs represent two configurations of the same nodes (e.g. evolution of a graph over time), we can use unique markers to identify the nodes in different configurations.


In the next section, we explore ways to pass this information to the neural network.

How to ensure alignment?

How can we tell a denoiser that specific nodes are paired? We explore three different methods, summarized below; note that they can be combined to strengthen the alignment signal. We also show that aligned equivariant denoisers remain equivariant with respect to the non-paired nodes in the generated graph.

Alignment methods
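
As one concrete illustration (corresponding in spirit to the positional-encoding variant, DiffAlign-PE, in the results below), here is a minimal sketch of aligned positional encodings: matched source and target nodes receive the same encoding, while unmatched nodes receive none, so the denoiser stays equivariant with respect to them. The sinusoidal encoding, dimensions, and function names are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def sinusoidal_pe(position, dim=16):
    """Standard sinusoidal encoding of an integer position (dim must be even)."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = position * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def aligned_positional_encodings(n_src, n_tgt, P, dim=16):
    """P: (n_src, n_tgt) binary alignment matrix (P[i, j] = 1 iff source node i is matched
    to target node j). Matched pairs share an encoding; unmatched target nodes get an
    all-zero encoding, so the denoiser remains equivariant w.r.t. them."""
    pe_src = np.stack([sinusoidal_pe(i + 1, dim) for i in range(n_src)])
    pe_tgt = np.zeros((n_tgt, dim))
    src_idx, tgt_idx = np.nonzero(P)
    pe_tgt[tgt_idx] = pe_src[src_idx]   # copy each matched source node's encoding
    return pe_src, pe_tgt
```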

Results on retrosynthesis

We show that our method achieves a SOTA-matching $54.7\%$ top-1 accuracy, compared to a $5\%$ accuracy without alignment.

Top-$k$ accuracy (%) and mean reciprocal rank (MRR):

$$ \begin{array}{lccccc} \hline \textbf{Method} & \mathbf{k=1} \uparrow & \mathbf{k=3} \uparrow & \mathbf{k=5} \uparrow & \mathbf{k=10} \uparrow & \widehat{\mathrm{\textbf{MRR}}} \uparrow \\ \hline \text{Unaligned} & 4.1 & 6.5 & 7.8 & 9.8 & 0.056 \\ \text{DiffAlign-input} & 44.1 & 65.9 & 72.2 & 78.7 & 0.554 \\ \text{DiffAlign-PE} & 49.0 & 70.7 & 76.6 & \mathbf{81.8} & 0.601 \\ \text{DiffAlign-PE+skip} & \mathbf{54.7} & \mathbf{73.3} & \mathbf{77.8} & 81.1 & \mathbf{0.639} \\ \hline \end{array} $$

The aligned models' performance persists even with few sampling steps, further indicating that the alignment signal is effective.

Sampling steps

What's next? (downstream applications)

Unlocking the potential of diffusion models in graph-to-graph translation enables a range of downstream capabilities. We show how these can benefit our main target application: retrosynthesis.

Thanks to inference guidance, we can use our denoiser to generate reactants with desired properties, e.g. synthesizability. This is especially useful for multi-step retrosynthesis, where the output of a single-step model is chained through a search until available starting materials are reached.

Atom guidance

We can also apply inpainting to generate one of the reactants conditional on the other reactants and the product, or to generate a reactant with a specific structure.
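
A minimal sketch of how such inpainting can be implemented in the discrete reverse process: at each step the known fragment is overwritten with a re-noised version of its ground-truth value, so the model only generates the remaining portion. The helpers `reverse_step` and `forward_noise_to_t` are stand-ins for the diffusion utilities sketched earlier, not functions from the paper's codebase.

```python
import numpy as np

def inpaint_step(x_t, t, known_x0, known_mask, reverse_step, forward_noise_to_t, rng):
    """One reverse step with inpainting. x_t: (N, K) one-hot noisy node labels;
    known_x0: (N, K) ground-truth labels for the fixed fragment; known_mask: (N,)
    boolean, True where the fragment is known and should be kept."""
    x_tm1 = reverse_step(x_t, t, rng)                       # model's proposal for X_{t-1}
    x_known_tm1 = forward_noise_to_t(known_x0, t - 1, rng)  # ground truth re-noised to step t-1
    x_tm1[known_mask] = x_known_tm1[known_mask]             # clamp the known portion
    return x_tm1
```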

Inpainting

BibTeX

@inproceedings{DiffAlign2025Laabid,
  title={Equivariant Denoisers Cannot Copy Graphs: Align your Graph Diffusion Models},
  author={Laabid, Najwa and Rissanen, Severi and Heinonen, Markus and Solin, Arno and Garg, Vikas},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025},
  url={https://openreview.net/forum?id=onIro14tHv}
}