Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions

González Laffitte, Marcos E.; Weinbauer, Klaus; Phan, Tieu-Long; Beier, Nora; Domschke, Nico; Flamm, Christoph; Gatter, Thomas; Merkle, Daniel; Stadler, Peter F.

doi:10.3390/sym16091217

Open AccessArticle

Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions

by

Marcos E. González Laffitte

^1,2,*

,

Klaus Weinbauer

^1,3

,

Tieu-Long Phan

^1,4

,

Nora Beier

¹

,

Nico Domschke

¹

,

Christoph Flamm

⁵

,

Thomas Gatter

^1,2

,

Daniel Merkle

^4,6

and

Peter F. Stadler

^{1,2,5,7,8,9,10,*}

¹

Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Leipzig University, Härtelstraße 16-18, D-04107 Leipzig, Germany

²

Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Leipzig University, D-04103 Leipzig, Germany

³

Machine Learning Research Unit, TU Wien Informatics, Erzherzog-Johann-Platz 1 (FB02), A-1040 Wien, Austria

⁴

Department of Mathematics and Computer Science University of Southern Denmark DK-5230 Odense, Denmark

⁵

Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria

⁶

Algorithmic Cheminformatics Group, Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany

⁷

Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany

⁸

Center for Non-Coding RNA in Technology and Health, University of Copenhagen, DK-1870 Fredriksberg, Denmark

⁹

Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Bogotá CO-111321, Colombia

¹⁰

Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA

^*

Authors to whom correspondence should be addressed.

Symmetry 2024, 16(9), 1217; https://doi.org/10.3390/sym16091217

Submission received: 30 July 2024 / Revised: 28 August 2024 / Accepted: 5 September 2024 / Published: 16 September 2024

(This article belongs to the Special Issue Symmetry in Graph Algorithms and Graph Theory III)

Download

Browse Figures

Versions Notes

Abstract

:

Atom-to-atom maps (AAMs) are bijections that establish the correspondence of reactant and product atoms across chemical reactions. They capture crucial features of the reaction mechanism and thus play a central role in modeling chemistry at the level of graph transformations. AAMs are equivalent to so-called “imaginary transition state” (ITS) graphs, making it possible to reduce tasks such as the computational comparison of AAMs to testing graph isomorphisms. In many application scenarios, nonetheless, only partial information is available, i.e., only partial maps or, equivalently, only subgraphs of the ITS graphs, are known. Here, we investigate whether and how, and to what extent, such partial chemical data can be completed and compared. The focus of this contribution is entirely on the development of a solid mathematical foundation for the analysis of partial AAMs and their associated partial ITS graphs.

Keywords:

chemical graphs; cheminformatics; chemical reaction mechanisms; atom-to-atom maps; imaginary transition state; condensed graph of the reaction; (sub)graph isomorphism; (chemical) graph alignment; rebalancing of reactions; tautomerization reactions; hydrogen atom mapping

1. Introduction

Chemical reactions, by definition, comprise the rearrangement of bonds among the atoms of a set of reactant molecules. That is, chemical reactions preserve the atoms involved. This fundamental property is formalized by an atom-to-atom map (AAM). Mathematically, an AAM is simply a bijection between corresponding atoms in the reactants and products that preserves atom types. From a chemical point of view, the AAM encodes the mechanism of the reaction by implying the bonds that are formed and broken, thereby providing a condensed description of the transition from reactants to products.

Despite the fundamental conceptual importance of AAMs in chemical reactions, and its many practical applications, AAMs constitute data that are not commonly available. Large-scale reaction databases such as

{Reaxys}^{®}

[1] typically store each reaction as a pair comprising the (multi)sets of reactant and product molecules, respectively, recording with a stoichiometric coefficient whose molecule appears more than once on either side of the reaction. The computational reconstruction of AAMs, on the other hand, turns out to be a highly non-trivial problem [2,3]. Following the development of combinatorial approaches maximizing common subgraphs or minimizing chemical distance [4,5], several machine learning tools have recently become available for this purpose; see refs. [6,7,8,9,10]. While most tools report bijections between non-hydrogen atoms for balanced reactions, usually only partial maps are reported for unbalanced, i.e., incomplete reaction data. Moreover, some tools deliberately focus on the reaction center only [11]. This raises the question of if and when such partial AAM information can be extended to a complete description of a reaction.

The task of benchmarking tools that compute AAMs and/or perceive reaction centers turns out to be more challenging than expected because it requires a precise comparison of AAMs or parts thereof. The difficulty arises from the fact that different tools report (partial) AAMs using essentially arbitrary numberings of reactant and product atoms. Hence, the problem is not the comparison of two AAMs on the same pair of reactant and product graphs but on isomorphic pairs. For balanced reactions, some of us showed in [12] that the problem reduces to determining a solution for an isomorphism search between certain graphs, in particular, the Imaginary Transition State (ITS) graph of a reaction. This graph is obtained by “superimposing” the reactant and product molecules over corresponding pairs of atoms obtained with respect to an AAM and assigning special labels to the bonds that change during the reaction. This graph representation of a reaction was introduced by Shinsaku Fujita [13] and Wilcox and Levinson [14] with the aim of facilitating storage, retrieval, and processing of chemical information. Essentially the same construction was later proposed as Condensed Graph of the Reaction (CGR) for machine learning applications [15]. ITS graphs and AAMs are equivalent in that one can be constructed uniquely from the other. As a consequence, two AAMs are equivalent if and only if their ITS graphs are isomorphic [12].

In this contribution, we are interested in situations where only partial information on AAMs is available, naturally leading to the concept of partial ITS graphs. This is particularly the case for unbalanced reaction data. Nonetheless, it should be noted first that even in balanced reactions, however, it becomes more difficult to formalize the equivalence of partial information. In addition, partial AAMs that are not defined on isomorphic subgraphs may still provide consistent information. The situation is aggravated if the reaction data are not balanced. A large fraction of the millions of chemical reactions stored in large-scale databases such as Reaxys^® [1] or (USPTO) (United States Patent and Trademark Office) [16] omits co-reactants or co-products. While these are usually “obvious” to a trained chemist, these omissions are a serious problem for computational methods, limiting the use of the data, particularly for machine learning applications. Several methods to alleviate the problem have been devised to rebalance reactions [17,18,19], highlighting the difficulty of this problem.

Partial ITS graphs that completely cover the reaction center, moreover, encode the “reaction pattern”, i.e., a combination of (a) the prerequisites and adjacent context of the reaction in terms of functional groups and their structural arrangement, and (b) the bonds that change in the transition from reactants to products. Thus, there is a 1–1 correspondence; furthermore, between such “good” reaction patterns and graph rewriting rules that implement reaction mechanisms in the so-called Double Pushout (DPO) graph grammars, a powerful tool to model large reaction networks at the atomic level, and which is part of our ongoing research [20,21]. In this context, a key open problem is the inference of comprehensive sets of rules, i.e., the determination of suitable partial ITS graphs from sets of ITS graphs for reactions that share the same mechanism.

The framework presented here is based on the assumption that there is a 1–1 correspondence between molecules and their graph-theoretic representation, i.e., their structural formula. This is an approximation, albeit an extremely useful one. Mesomerism and tautomerism, however, introduce further ambiguities. While mesomeric formulas [22,23] describe the same molecule by equivalent but distinct labeled graphs that represent delocalized electron systems, e.g., by different sequences of single and double bonds, the situation is more complicated for tautomers. Tautomers are distinct molecules that are related by well-defined intra-molecular chemical reactions that are fast enough to establish an equilibrium between the alternatives [24,25]. This begs the question of whether such equivalence classes of molecules or reactions can be described and analyzed with a similar formalism.

Reaction rebalancing, the construction of AAMs, reaction center perception, the extraction of reaction rules or patterns from reaction data, as well as the classification of reactions, are problems that share partial bijections and concepts of equivalence, consistency, isomorphism, and extensions, and also convey algorithmic problems concerning the comparison of associated graph representations. The purpose of this contribution is to provide a solid mathematical foundation by investigating the properties of partial AAMs and their associated partial ITS graphs and to clarify the relationship with conceptually similar structures, in particular the pairwise graph alignments explored in [26].

This contribution is organized as follows: we first introduce AAMs and ITS graphs as mathematical objects and fix the notation that will be used throughout. In Section 3, we present the partial ITS graphs by considering subgraphs of ITS graphs and the corresponding incomplete AAMs. The focus of this section is on the conditions required for completing such partial structures to full AAMs and complete ITS graphs, as well as on their connection to graph alignments. In Section 4, we introduce the notions of equivalence and consistency of partial AAMs and study the connection to isomorphisms of the corresponding partial ITS graphs. We proceed in Section 5 to generalize these concepts to tautomers, showing that more general maps take the place of isomorphisms. We complete our contribution with a short overview of the issues caused by hydrogen atoms, which are often omitted in chemical structural formulae. The concluding remarks focus on important open questions.

2. Preliminaries

2.1. Notation

For standard notation and terminology in Graph Theory, we reference [27]. Molecules are represented as vertex and edge labeled (simple) graphs, i.e., without loops or multiple edges. For a graph G, we write

V (G)

and

E (G)

for its vertex and edge sets, and

a_{G} : V (G) \to L_{v}

and

b_{G} : E (G) \to L_{e}

for the labeling function on the vertices and edges, respectively. We choose

L_{v}

and

L_{e}

to be non-empty and disjoint. Moreover, we reserve the symbol

⌀ \notin L_{v} \cup L_{e}

. Given vertices x and y of G, if there is an edge

e \in E (G)

joining them, we say that x and y are adjacent to each other and incident with e. As is customary, we write

e = x y

throughout. Two labeled graphs G and H are said to be isomorphic if there exists a bijection

φ : V (G) \to V (H)

, preserving adjacency, and vertex and edge labels, i.e., preserving edges and non-edges, such that

a_{H} (φ (x)) = a_{G} (x)

for all

x \in V (G)

, and

b_{H} (φ (x) φ (y)) = b_{G} (x y)

for every

x y \in E (G)

. We say that

φ

is an isomorphism from G to H and write

G ≅ H

. On the other hand,

φ

is said to be an automorphism of G if it is an isomorphism from G to G itself. In general, the composition or concatenation of maps

α : X \to Y

and

β : Y \to Z

will be written

β \circ α (x) ≔ β (α (x))

.

A chemical reaction will be considered as a pair of graphs

G ⟶ H

. Connecting to the theory of chemical reaction networks [28,29], the vertex- and edge-labeled graphs G and H represent the so-called “complexes” of the reaction. In this setting, the connected components of G and H are the reactant and product molecules, respectively.

2.2. Atom-to-Atom Map, ITS Graph, and Reaction Center

A chemical reaction

G ⟶ H

is balanced if for every reactant atom

x \in V (G)

, there is a (unique) corresponding product atom

α (x) \in V (H)

, and vice versa.

The map

α : V (G) \to V (H)

is usually called the atom-to-atom map (AAM) of the reaction. More formally, we have the following.

Definition 1.

Let G and H be two labeled graphs. A map

α : V (G) \to V (H)

is an AAM if it is a bijection and preserves vertex labels, i.e., if

a_{H} (α (x)) = a_{G} (x)

for all

x \in V (G)

.

The concept of an imaginary transition state (ITS) representing the superposition of reactants and products as well as the changes in bonds was introduced by Shinsaku Fujita [13] and Wilcox and Levinson [14]:

Definition 2.

Let

α : V (G) \to V (H)

be an AAM for two labeled graphs G and H. Then, the imaginary transition state (ITS)

Υ ≔ Υ (G, H, α)

is the graph defined by all the following:

(i): There is a bijection $η : V (Υ) \to V (G)$ ,
(ii): For $x, y \in V (Υ)$ we have $x y \in E (Υ)$ iff $η (x) η (y) \in E (G)$ or $α (η (x)) α (η (y)) \in E (H)$ ,
(iii): $x \in V (Υ)$ is labeled by the ordered pair $a_{Υ} (x) = (a_{G} (η (x)), a_{H} (α (η (x))))$ and $x y \in E (Υ)$ is labeled by the pair

b_{Υ} (x y) = \{\begin{matrix} (b_{G} (η (x) η (y)), b_{H} (α (η (x)) α (η (y)))) & if η (x) η (y) \in E (G) and α (η (x)) α (η (y)) \in E (H) \\ (b_{G} (η (x) η (y)), ⌀) & if η (x) η (y) \in E (G) and α (η (x)) α (η (y)) \notin E (H) \\ (⌀, b_{H} (α (η (x)) α (η (y)))) & if η (x) η (y) \notin E (G) and α (η (x)) α (η (y)) \in E (H) \end{matrix}

Note that by definition

η^{'} = α \circ η

is a bijection

η^{'} : V (Υ) \to V (H)

. It should be noted as well that

η

is not unique in general but only determined up to automorphisms of G and H. Since

α

is an AAM for G and H, we note

a_{H} (η^{'} (x)) = a_{H} (α (η (x))) = a_{G} (η (x))

. Given an arbitrary ITS graph

Υ

, we shall write

a_{Υ} (x) = (a_{Υ}^{1} (x), a_{Υ}^{2} (x))

for the vertex labels, and

b_{Υ} (x y) = (b_{Υ}^{1} (x y), b_{Υ}^{2} (x y))

for the edge labels of

Υ

. The coordinates

a_{Υ}^{i} (x)

and

b_{Υ}^{i} (x y)

, with

i = 1, 2

, are variables that take specific values determined by both the presence and labels of the vertices and edges of the graphs in the underlying reaction

G ⟶ H

from which

Υ

is derived. For a concrete example of the usage of this notation, see Figure 1. In the literature, the vertex sets

V (Υ)

,

V (G)

and/or

V (H)

are often identified to simplify the notation. We will, however, need the notational distinction between these three sets.

The ITS graph is closely related to earlier matrix representations of chemical reactions [30]. In the context of machine learning applications, essentially the same construction has appeared more recently as Condensed Graph of the Reaction (CGR) [15]. The notion of a reaction center arose for example, in [31,32,33].

Definition 3.

Let

α : V (G) \to V (H)

be an AAM for two labeled graphs G and H and let

Υ ≔ Υ (G, H, α)

be the corresponding ITS. The reaction center

Γ : = Γ (G, H, α)

is the subgraph of Y induced by the reaction edges, i.e., edges

x y \in E (Υ)

with

b_{Υ}^{1} (x y) \neq b_{Υ}^{2} (x y)

and comprising only the reaction vertices incident with these edges.

We say that

x y \in E (G)

is a reaction edge in G if

η^{- 1} (x) η^{- 1} (y)

is a reaction edge in

Υ (G, H, α)

, and similarly for edges in H with respect to

η^{- 1} \circ α^{- 1}

. Denote by

{\hat{G}}_{α}

,

{\hat{H}}_{α}

, and

\hat{Υ}

the subgraphs of G, H, and

Υ

obtained by removing all the reaction edges in these graphs, i.e., all those edges that correspond to reaction edges in

Υ

as defined above.

Lemma 1.

If

α : V (G) \to V (H)

is an AAM for two labeled graphs G and H, then α is an isomorphism from

{\hat{G}}_{α}

to

{\hat{H}}_{α}

.

Proof.

First note that

V (G) = V ({\hat{G}}_{α})

and

V (H) = V ({\hat{H}}_{α})

. Let

Υ

and

η

be as in Definition 2. By definition

x y \in E ({\hat{G}}_{α})

is not a reaction edge, i.e.,

b_{Υ}^{1} (η^{- 1} (x) η^{- 1} (y)) = b_{Υ}^{2} (η^{- 1} (x) η^{- 1} (y))

, which can only happen if

b_{Υ}^{2} (η^{- 1} (x) η^{- 1} (y)) \neq ⌀

by the selection of the label sets, implying

α (x) α (y) \in E (H)

with

b_{G} (x y) = b_{H} (α (x) α (y))

, and thus

α (x) α (y) \in E ({\hat{H}}_{α})

. Similarly, any edge

u v \in E ({\hat{H}}_{α})

is not a reaction edge and thus

α^{- 1} (u) α^{- 1} (v) \in E ({\hat{G}}_{α})

and

b_{G} (α^{- 1} (u) α^{- 1} (v)) = b_{H} (u v)

.

Hence,

α

is a bijection, preserves the adjacency of vertices in

{\hat{G}}_{α}

,

{\hat{H}}_{α}

, respectively, and preserves vertex and edge labels. Therefore,

α

is an isomorphism. □

In addition to the doubling of the labels, moreover,

{\hat{G}}_{α}

and

{\hat{H}}_{α}

are also isomorphic to

\hat{Υ}

, that is, isomorphic to

\hat{Υ}

up to the first and second coordinates of the labels of each vertex

x \in V (\hat{Υ})

and every edge

u v \in E (\hat{Υ})

, respectively, with

η

and

η^{'} = α \circ η

being the corresponding isomorphisms; see Figure 1.

2.3. Connectedness of the ITS Graphs, Reaction Center and AAMs

We say that an AAM

α : V (G) \to V (H)

is connected if its corresponding ITS graph is connected. From Definition 2 follows that there exist subgraphs

G^{'}

and

H^{'}

of

Υ

whose underlying graphs are isomorphic to the underlying graphs of G and H. Hence,

Observation 1.

Let α be an AAM for

G ⟶ H

. If G or H are connected, then

Υ (G, H, α)

, and thus α, are connected.

From a chemical point of view, moreover, connectedness implies that every molecule takes part in the reaction in the following sense.

Lemma 2.

Let

α : V (G) \to V (H)

be a connected AAM for the reaction

G ⟶ H

. Then, every connected component of G and H contains at least one reaction vertex.

Proof.

Suppose this is not the case. Then, there is, without loss of generality, a connected component

G_{1} ⊊ G

without a reaction vertex, i.e.,

G_{1}

and

H [α (V (G_{1}))]

are isomorphic, and

H [α (V (G_{1}))]

contains no reaction vertex. Thus, for every

x \in α (V (G_{1}))

and

x y \in E (H)

,

α^{- 1} (x) α^{- 1} (y) \in E (G_{1})

holds. Therefore,

H [α (V (G_{1}))] H_{1}

is a connected component in H, and the restriction

α_{1} : V (G_{1}) \to V (H_{1})

is a label-preserving isomorphism. It follows immediately that

Υ (G_{1}, H_{1}, α_{1})

is a connected component of the ITS

Υ (G, H, α)

. Since

G_{1} \neq G

,

Υ (G_{1}, H_{1}, α_{1}) \neq Υ (G, H, α)

, contradicting the connectedness of

α

. □

Nonetheless, neither the connectedness of G or H, nor of

Υ

, imply the connectedness of the reaction center; see Figure 2. In particular, Figure 2a exhibits an example of an AAM defining a connected ITS but a disconnected reaction center. Moreover, Figure 2b shows that the connectedness of the reaction center depends on the selection of the AAM between the same pair of graphs. We note, finally, that the converse of Lemma 2 is not true. To see this, consider two reactions

G ⟶ H

and

P ⟶ Q

, each with connected ITS graphs that do not share common molecules (i.e., isomorphic connected components). By Lemma 2, every connected component of G, P, H and Q contains a reaction vertex, and thus, this is also true for the disjoint unions

G \cup P

and

H \cup Q

. From a mathematical point of view,

(G \cup P) ⟶ (H \cup Q)

is also a valid chemical reaction. Its ITS graph is, by construction, the disjoint union of the ITS graphs for

G ⟶ H

and

P ⟶ Q

, and thus not connected.

From a chemical perspective, every reaction can be explained as a sequence of elementary reactions, each of which proceeds via a single transition state and can be described by cyclic electron pushing. The latter implies that an elementary reaction necessarily has a connected ITS that, in fact, can be described by a closed circular walk of broken and formed bonds [34]. In reaction databases, however, reactions are rarely classified with respect to their detailed reaction mechanism. Instead, “single-step reaction” may designate a single step in a synthesis plan, which may involve a more complex sequence of steps that, however, proceed without the need or even the practical possibility of isolating intermediate products. In a sequence of elementary reactions, it is entirely possible that the reaction centers of the different steps are vertex disjoint. It is also possible that bonds that are formed in an earlier step are broken again in a later step. This may also lead to an AAM with a disconnected reaction center.

Even elementary reactions, however, may appear to have a disconnected reaction center. This is typically the case if the conventional structural formulas are used to represent molecules. In general, these do not represent hydrogen atoms and thus, the ITS graphs defined by an AAM on the non-hydrogen atom, by construction, also does not include hydrogens. For example, see Figure 2c. The making and breaking of bonds to hydrogen atoms is an important part of reaction mechanisms. We shall return to this and other complications that are often associated in particular with hydrogen atoms in Section 6.

2.4. Good and Bad Reaction Patterns

Chemical reactions

G ⟶ H

can be grouped into equivalence classes that are defined by their “mechanism”, i.e., the bonds that are broken and formed and the so-called functional groups that are directly or indirectly involved. From a mathematical point of view, this amounts to a common subgraph L that appears in the different reactants G that can undergo the reaction of interest, and a common subgraph R in the product H that is the result of braking and making bonds in L. In the simplest case, L and R cover only the reaction centers; see [31,32,33,35]. In general, however, atoms adjacent to the reaction center may be critical for enabling the transformation of the reaction center. Hence, it is of interest to consider also larger subgraphs L and R. It is worth noting that several subgraphs of the ITS were considered already in [36].

Definition 4.

Let

G \to H

be a reaction with AAM

α : V (G) \to V (H)

, ITS

Υ ≔ Υ (G, H, α)

, and reaction center

Γ ≔ Γ (G, H, α)

. A good reaction pattern for

G ⟶ H

is a pair of subgraphs

L \subseteq G

and

R \subseteq H

, together with the restriction

α^{'} : V (L) \to V (R)

of α such that

Γ \subseteq Υ (L, R, α^{'}) \subseteq Υ

.

Note that by construction, R is induced from H by the image

α (V (L))

of

V (L)

under

α

; that is,

R = H [α^{'} (V (L))]

. If

(L, R, α^{'})

is a good reaction pattern, then

Γ (L, R, α^{'}) = Γ (G, H, α)

. A reaction pattern

(L, R, α^{'})

is uniquely defined by a subgraph

Ψ \subseteq Υ (G, H, α)

by virtue of the restriction of

η

and

η^{'}

to

V (Ψ)

and defining

η (u) η (v) \in E (L)

if and only if

u v \in E (Ψ)

and

b_{Υ}^{1} (u v) \neq ⌀

; similarly,

η^{'} (u) η^{'} (v) \in E (R)

if and only if

u v \in E (Ψ)

and

b_{Υ}^{2} (u v) \neq ⌀

. This implies the following.

Observation 2.

A subgraph Ψ of

Υ (G, H, α)

defines a good reaction pattern

(L, R, α^{'})

satisfying

Υ (L, R, α^{'}) = Ψ

if and only if

Γ (G, H, α) \subseteq Ψ \subseteq Υ (G, H, α)

.

We also note that

{\hat{L}}_{α^{'}}

and

{\hat{R}}_{α^{'}}

are isomorphic, and

α^{'}

serves as isomorphism. These graphs are also isomorphic to

\hat{Ψ}

up to the doubling of the labels.

3. Incomplete Atom-to-Atom Maps and partial ITS Graphs

3.1. Good and Bad Partial AAMs

Some atom mapping tools such as Indigo [7] do not compute a complete AAM but instead only produce a bijection between subsets of vertices of the graphs in the reaction.

Formally, they produce a bijection

β : U \to W

, preserving vertex labels between subsets

U \subseteq V (G)

and

W \subseteq V (H)

for the reaction

G \to H

. More recent tools such as RXNMapper [8], or LocalMapper [10] produce global AAMs by default but switch to a “local mode” if the input reactions are unbalanced. GraphormerMapper [9] also can be used to rebalance an unbalanced reaction while computing a full AAM. KEGG RCLASSes [37] in general also provide only partial AAMs; see [38].

By definition,

β

defines an AAM from the induced subgraph

G [U]

to the induced subgraph

H [W]

; hence, both the partial ITS graph

Υ (G [U], H [W], β)

and the subgraph

Γ (G [U], H [W], β)

are well defined. The subgraph

Γ (G [U], H [W], β)

describes a possibly partial reaction center; that is, it may or may not contain already all the reaction edges implied by

G ⟶ H

. Let us write

{\hat{G}}_{β}

and

{\hat{H}}_{β}

, respectively, for subgraphs obtained from G and H by removing the reaction edges determined by

β

in

G [U]

and

H [W]

. We are now interested in using

β

as an “anchor”, which we intend to extend into an AAM that is as complete as possible.

Definition 5.

Let

G ⟶ H

be a reaction and

β : U \to W

a partial AAM, for

U \subseteq V (G)

and

W \subseteq V (H)

. A proper extension of β is a bijective map

γ : U^{'} \to W^{'}

, with

U \subseteq U^{'} \subseteq V (G)

and

W \subseteq W^{'} \subseteq V (H)

, that (i) preserves vertex labels, i.e.,

a_{G} (x) = a_{H} (γ (x))

for all

x \in U^{'}

, and (ii) coincides with β on U, i.e.,

γ (x) = β (x)

for all

x \in U

.

Clearly, a proper extension

γ

of

β

is again a (partial) AAM for

G ⟶ H

and thus defines a partial ITS graph

Υ (G [U^{'}], H [W^{'}], γ)

, as well as reaction edges and subgraphs

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

. Since

β

is a bijection that conserves vertex labels, and a reaction is balanced if there exists a bijection from

V (G) \to V (H)

preserving vertex labels, such a bijection must also exist on

V (G) ∖ U \to V (H) ∖ W

. Thus, we have the following.

Observation 3.

If

G ⟶ H

is balanced and

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

is a partial AAM, then there exists a proper extension

γ : V (G) \to V (H)

of β, i.e., a complete AAM.

It is important to note that

γ

is not necessarily unique, even for fixed graphs G and H. It is possible, moreover, for the partial map

β

to explain reactions with distinct (i.e., non-isomorphic) reactants-graph G and/or distinct products-graph H, see Figure 3. This is relevant, for example, for the KEGG RCLASSes [37]. Each RCLASS specifies a partial AAM between a connected component of G and a connected component of H; see ref. [38]. Such a “seed” is not sufficient, in general, to completely determine an AAM.

Definition 6.

Let

G ⟶ H

be a (not necessarily balanced) reaction. We say that a partial AAM

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

, is good if U and W contain all reaction vertices and thus

Υ (G [U], H [W], β)

contains the reaction center as a subgraph. Otherwise, β is called bad.

From the chemical point of view, the following observation is of use.

Observation 4.

A partial AAM β for a (not necessarily balanced) reaction

G ⟶ H

, is good if and only if

{\hat{G}}_{β}

and

{\hat{H}}_{β}

contain no more reaction edges of

G ⟶ H

.

In many applications, however, the true AAM will be unknown. From a mathematical perspective, therefore, Definition 6 does not refer to a ground truth but only requires that the reaction can be explained in such a way that

β

captures all bonds that change. In the case of balanced reactions, we obtain a simple characterization of good partial AAMs together with a simple algorithmic approach to construct a full AAM.

Proposition 1.

Let

G ⟶ H

be a balanced reaction and let

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

be a partial AAM. Then, β is a good partial AAM if and only if there is a proper extension γ of β such that γ is an isomorphism for

{\hat{G}}_{β}

and

{\hat{H}}_{β}

.

Proof.

Suppose first that

γ

is an extension of

β

and an isomorphism from

{\hat{G}}_{β}

and

{\hat{H}}_{β}

, i.e., it is, in particular, a bijection that preserves adjacency, and vertex and edge labels, and thus is a complete AAM. Then,

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

contain no reaction edges. Moreover, the set of reaction edges implied by

β

, called

E (Γ (G, H, β))

, is contained in the set of reaction edges

E (Γ (G, H, γ))

implied by

γ

. Since

γ

is also an isomorphism for

{\hat{G}}_{β}

and

{\hat{H}}_{β}

, then

γ

does not imply more reaction edges than

β

, that is,

E (Γ (G, H, β)) = E (Γ (G, H, γ))

. Thus,

{\hat{G}}_{β} = {\hat{G}}_{γ}

and

{\hat{H}}_{β} = {\hat{H}}_{γ}

, and therefore

β

is a good partial AAM by Observation 4.

Suppose, on the other hand, that such an extension

γ

does not exist, i.e., for every bijection

γ : V (G) \to V (H)

that preserves vertex labels and satisfies

E (Γ (G, H, β)) \subseteq E (Γ (G, H, γ))

, there is an edge

e = u v \in E (Γ (G, H, γ)) ∖ E (Γ (G, H, β))

. By definition, therefore,

E (Υ (G [U], H [W], β))

does not contain the complete reaction center and hence,

β : U \to W

is a bad partial AAM. □

Note that for a good partial AAM

β

, we have

Γ (G, H, γ) = Γ (G, H, β) \subseteq Υ (G, H, β) \subseteq Υ (G, H, γ)

. Moreover, it is feasible in practice to compute a proper extension

γ

using any algorithm for testing graph isomorphism or subgraph isomorphism provided it is amenable to specifying a partial isomorphism, i.e.,

β

, as a constraint. Proposition 1 guarantees that finding

γ

can be performed by extending

β

to an isomorphism of

{\hat{G}}_{β}

and

{\hat{H}}_{β}

. A suitable family of algorithms for this purpose is the VF2 and relatives [39,40]. Supplied with the vertex matches defined by

β

as starting point, VF2 extends the set of vertices matched between

V (G)

and

V (H)

stepwisely such that each added match has the same adjacencies to all previously matched vertices in both G and H. Upon termination, it therefore either returns a bijection of

V (G)

and

V (H)

guaranteed to be the desired extension

γ

, or unmatched vertices remain, in which case no extension of

β

exists. Although the worst-case running time of VF2 is exponential, this algorithm is applicable in practice for chemical graphs. We note that it is not sufficient to extract

{\hat{G}}_{β}

and

{\hat{H}}_{β}

and check for their isomorphism since we are only concerned with isomorphisms

γ

that extend the given partial AAM

β

. It is possible, however, to tag pairs of vertices matched by

β

with the same unique label in both

{\hat{G}}_{β}

and

{\hat{H}}_{β}

, and test for the isomorphism of these modified graphs.

Before we proceed to the general case, let us consider a balanced reaction with partial AAM that does not cover all bonds that are broken or formed. This case is of practical importance since most AAM prediction tools ignore hydrogens. Consider the nucleophilic substitution reaction depicted in Figure 4. We can directly determine the ITS and the reaction center of reactions with a complete AAM, illustrated in Figure 4A. In this process, the C−Cl bond breaks, and a C−N bond forms.

From the mathematical point of view, the corresponding AAM

β

is trivially good on the graphs without hydrogens since it is a complete AAM for these. However, if the hydrogens are attached,

β

becomes a bad partial AAM with respect to such new graphs since now the reaction center involves a hydrogen atom that dissociates from the nitrogen and attaches to the chlorine. Indeed,

β

cannot be extended to an isomorphism between

{\hat{G}}_{β}

and

{\hat{H}}_{β}

. Instead, the correct reaction center contains a new edge representing the breaking bond between nitrogen and hydrogen, as well as the newly formed bond between hydrogen and chloride (Figure 4B). With respect to a detailed understanding of the reaction mechanism, however, we have to distinguish between hydrogens that are exchanged with the medium, and hydrogens that can be traced as individual entities through a reaction. This issue is not strictly limited to H atoms, however. We shall return to this point in Section 6 below.

If

G ⟶ H

is balanced, then by Observation 3, an extension

γ

of any bad partial AAM

β

to a complete AAM exists. By construction,

γ

is then a good AAM. Of course,

γ

is not unique.

Among all such complete extensions, it is most natural to consider those that maximize

| E ({\hat{G}}_{γ}) | = | E ({\hat{H}}_{γ}) |

, i.e., those that maximize the number of edges in

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

, which is equivalent to attributing as few reaction edges as possible to G and H. This cost function [30,41] is closely related with the Principle of Minimal Chemical Distance, which stipulates that most chemical reactions proceed via the redistribution of a minimum number of valance electrons, which in most cases is the same as minimizing the total number of broken and form bonds. Since the total number of valence electrons is constant in balanced reactions, this measure is equivalent to maximizing the number

| E (\hat{Υ}) |

of unchanged bonds in the ITS. We remark that for reactions that also involve the shifting of individual electrons it may be of interest to use slightly different quantifications of the “chemical distance”. This issue will be further investigated elsewhere.

3.2. An ILP for Completing Bad Partial AAMs on Balanced Reactions

Given a partial AAM

β

, the task of finding an extension

γ

that maximizes

| E (\hat{Υ}) |

can be phrased as an Integer Linear Program (ILP). To this end, denote by

A_{G}

and

A_{H}

the adjacency matrices of G and H. The AAM is a bijection and thus can be expressed as a permutation matrix

X

of size

| V (G) | \times | V (G) |

with entries

X_{i j} = 1

if

α (x_{i}) = y_{j}

where

x_{i} \in V (G)

and

y_{j} \in V (H)

, and

X_{i j} = 0

otherwise. To ensure that

X

is a permutation matrix, it suffices to ensure

\sum_{i = 1}^{m} X_{i j} = 1 for all j and \sum_{j = 1}^{m} X_{i j} = 1 for all i .

Two graphs are isomorphic if a permutation matrix, such as that defined above, exists with

A_{H} = X A_{G} X^{T}

[42], motivating our optimization function below.

In addition, we have the constraint

X_{i j} = 0

if

a_{G} (x_{i}) \neq a_{H} (y_{j})

to enforce the preservation of atom labels, and we have

X_{i j} = 1

if

β (x_{i}) = y_{j}

as a means of constraining

γ

to coincide with

β

of all vertices mapped by

β

. The difference of the bonds is expressed by

D ≔ A_{H} - X A_{G} X^{T}

. As

X^{T} = X^{- 1}

for permutation matrices,

D ≔ A_{H} X - X A_{G}

yields an equivalent definition suitable for linear programming. We have

D_{i j} = 0

if and only if there is no change in bond between G and H under the mapping

γ

encoded by

X

.

D_{i j}

takes bounded positive and negative integer values when a bond is formed or broken, respectively. Thus, the goal is to maximize the number of zero entries in

D

. This can be achieved, e.g., by introducing binary indicator variables with

g_{i j} = 1

if

D_{i j} > 0

and

g_{i j} = 0

otherwise, and similarly

s_{i j} = 1

if

0 > D_{i j}

and else

s_{i j} = 0

, employing the usual trick:

\begin{matrix} D_{i j} \leq K g_{i j} \\ D_{i j} \geq - K + g_{i j} (K + 1) \end{matrix} and \begin{matrix} - D_{i j} \leq K s_{i j} \\ - D_{i j} \geq - K + s_{i j} (K + 1) \end{matrix},

where K is the maximal bond order in

A_{G}

and

A_{H}

, respectively. Then,

f = \sum_{i j} (g_{i j} + s_{i j})

counts the number of non-zero entries in

D

and needs to be minimized.

An implementation of the proposed ILP completion algorithm is available as the Python package aamutils in https://github.com/klausweinbauer/AAMUtils. It was is evaluated on a test set of 1000 balanced reactions with correct complete AAMs. The test set is derived from the validation dataset used in [19], which is a composition of corrected reactions from USPTO 50k, Jaworski and Golden datasets. AAMs are generated by three atom-to-atom mapping tools (RXNMapper [8], Graphormer [9], LocalMapper [10]) and are considered correct if all three AAMs are equivalent. To test the completion rate, we construct (bad) partial AAMs (1) by removing the mapping from nodes outside the reaction center, (2) by removing the mapping from nodes in the reaction center and (3) by removing any mappings. These partial AAMs are then extended by aamutils and compared to the initial AAM. For this benchmark, the CBC solver [43] is used in version 2.10.3. The success rate is perfect (100%) for (1) for extending the AAM when only having the reaction center mapped, and (2) for up to 35% of missing nodes in the reaction center. This rate drops to 87.4% when all mappings are removed, which corresponds to performing atom mapping based on the Minimal Chemical Distance. This is expected because minimizing the number of bond changes is not a suitable mapping method for some reaction types. Using the Diels–Alder reaction as an example, this method yields a disconnected reaction center with twice a change of one double bond into two single bonds (4 reaction edges) instead of the expected

[4 + 2]

cycloaddition reaction center (6 reaction edges). However, it performs well when expanding a reaction center that is already partially mapped. Figure 5 depicts the success rate of the benchmark results for missing atom mappings between 0% and 100%.

3.3. Unbalanced Reactions and Graph Alignments

In the general case, the reaction is unbalanced and there is no bijection

α : V (G) \to V (H)

that preserves vertex labels. In other words, there will be unmapped vertices in G, H, or both graphs. Still, any proper extension

γ : U \to W

defines an isomorphism between

{\hat{G}}_{γ} [U]

and

{\hat{H}}_{γ} [W]

, while the rest of the vertices, i.e.,

V (G) ∖ U

and

V (H) ∖ W

, remain unmatched. To handle this situation, we introduce the following.

Definition 7.

A graph

A

with vertices and edges labeled by pairs

a_{A} (x) = (a_{A}^{1} (x), a_{A}^{2} (x))

with

a_{A}^{i} (x) \in L_{v} \cup {⌀}

, and

b_{A} (x y) = (b_{A}^{1} (x y), b_{A}^{2} (x y))

with

b_{A}^{i} (x y) \in L_{e} \cup {⌀}

, respectively, is said to be an alignment of two labeled graphs

G

and

H

if

G

and

H

are isomorphic, respectively, to the subgraphs G and H of

A

induced, each, by

V_{1} = {x \in V (A) | a_{A}^{1} (x) \neq ⌀}

and

V_{2} = {x \in V (A) | a_{A}^{2} (x) \neq ⌀}

with vertex labels

a_{G} (x) = a_{A}^{1} (x)

for

x \in V_{1}

and

a_{H} (x) = a_{A}^{2} (x)

for

x \in V_{2}

, and edge labels

b_{G} (x y) = b_{A}^{1} (x y)

for

x, y \in V_{1}

and

x y \in E (A)

, and

b_{H} (x y) = b_{A}^{2} (x y)

for

x, y \in V_{2}

and

x y \in E (A)

.

We note that, in practice, the analysis of unbalanced reactions becomes more difficult when compounds are missing on both sides, see ref. [19]. From a formal point of view, which is the emphasis of this contribution, it is more natural to consider the general case.

Each partial AAM

γ : U \to W

, thus, defines an alignment

A_{γ}

of

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

, whose vertex set consists of alignment columns of two types: match columns defined to be the ordered pairs

(x, γ (x))

for every

x \in U

, while all unmatched vertices

x \in V (G) ∖ U

and

y \in V (H) ∖ W

correspond, respectively, to in/del columns

(x, ⌀)

and

(⌀, y)

. Given

v \in V (A_{γ})

, we write

v^{1}

and

v^{2}

to denote, respectively, the first and second entries of v. This alignment

A_{γ}

has, furthermore, an edge

u v

whenever either

u^{1} \neq ⌀ \neq v^{1}

and

u^{1} v^{1} \in E ({\hat{G}}_{γ})

, or

u^{2} \neq ⌀ \neq v^{2}

and

u^{2} v^{2} \in E ({\hat{H}}_{γ})

, i.e., the respective non-gap entries (i.e., not ⌀) of u and v are adjacent in

{\hat{G}}_{γ}

or

{\hat{H}}_{γ}

. Labels are inherited, entrywise, from the vertex and edge labels in

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

in each case.

Thus, given a partial AAM

γ

for an unbalanced reaction, the corresponding alignment

A_{γ}

can be used to infer a completion, i.e., a rebalancing of the reaction. This idea has been proposed also in the CGRtools manual [44], which, however, cautions that the “Decomposition of CGRs for unbalanced reactions back to reaction may lead to strange (and erroneous) structures”. We proceed by showing that ITS graphs obtained from alignments yield at least a mathematically sound avenue for reaction rebalancing. Whether this yields a chemically correct solution of course depends on the correctness of the alignment.

The basic idea is to assume that vertices of G and H that appear in columns of the form

(x, ⌀)

or

(⌀, x)

in the alignment

A_{γ}

, refer to atoms x that should be present in both the reactants and the products but that, in the reaction data, were omitted in either the reactant graph or in the product graph. The alignment then specifies the partial AAM and the balancing of the reaction. More formally, we have the following.

Proposition 2.

Let γ be a partial AAM for an unbalanced reaction

G ⟶ H

. Then, the corresponding alignment

A_{γ}

of

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

uniquely defines (up to isomorphism) a balanced reaction

G^{*} ⟶ H^{*}

and a corresponding complete AAM

γ^{*} : V (G^{*}) \to V (H^{*})

.

Proof.

We observe that the alignment graph

A_{γ}

can be converted into an ITS graph

Υ^{*} (G, H, A_{γ})

by replacing gap labels ⌀ of in/del columns with the label of the corresponding atom that is present in the reactant or product graph, respectively. Similarly, edge labels are inherited. Since the alignment

A_{γ}

again refers to the graphs

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

, it does not include the reaction edges. The construction of

Υ^{*} : = Υ^{*} (G, H, A_{γ})

thus requires the re-insertion of the reaction edges. To this end, we proceed as follows:

(A1): In in/del columns $x \in V (A_{γ})$ , a gap label ⌀ is replaced by the non-gap atom label $a_{G} (x^{1})$ or $a_{H} (x^{2})$ , respectively, that appears in the same column.
(A2): If $x y \in E (A_{γ})$ , then $b_{A_{γ}} (x y) = (b_{G} (x^{1} y^{1}), ⌀)$ or $b_{A_{γ}} (x y) = (⌀, b_{H} (x^{2} y^{2}))$ are replaced, respectively, by $b_{Υ^{*}} (x y) = (b_{G} (x^{1} y^{1}), b_{G} (x^{1} y^{1}))$ and $b_{Υ^{*}} (x y) = (b_{H} (x^{2} y^{2}), b_{H} (x^{2} y^{2}))$ , only if x and y are in/del columns of the same type, i.e., they have the same gap entry; otherwise, the label is left unchanged. This step handles all non-reaction edges.
(A3): If $x, y \in V (A_{γ})$ and $x y \notin E (A_{γ})$ then let $b^{1} (x y) = ⌀$ if $x^{1} y^{1} \notin E (G)$ or $b^{1} (x y) = b_{G} (x^{1} y^{1})$ if $x^{1} y^{1} \in E (G)$ , and also $b^{2} (x y) = ⌀$ if $x^{2} y^{2} \notin E (H)$ or $b^{2} (x y) = b_{H} (x^{2} y^{2})$ if $x^{2} y^{2} \in E (H)$ . Set $b (x y) = (b^{1} (x y), b^{2} (x y))$ . If $b (x y) = (⌀, ⌀)$ , or if either x or y are in/del columns, then there is no edge between them; otherwise, the reaction edge $x y$ with label $b (x y)$ is added to $E (Υ^{*} (G, H, A_{γ}))$ . This step pertains to reaction edges.

By construction, the graph

Υ^{*} (G, H, A_{γ})

built from the alignment

A_{γ}

has the form of an ITS graph

Υ^{*} (G, H, A_{γ}) = Υ (G^{*}, H^{*}, γ^{*})

. In other words, the vertex sets of the graphs

G^{*}

and

H^{*}

correspond to the columns of

A_{γ}

, and thus contain exactly the atoms present in at least one of the reactant graph G and the product graph H. The map

γ^{*} : V (G^{*}) \to V (H^{*})

is the natural one-to-one correspondence between

V (G^{*})

and

V (H^{*})

implied by the alignment columns. Since the labels in

A_{γ}

are pairs that either contain a gap symbol or are of the form

(a, a)

with identical labels for the aligned atoms, the map

γ^{*} : V (G^{*}) \to V (H^{*})

preserves the vertex labels of

G^{*}

and

H^{*}

, and thus the same vertex label appears in

Υ (G^{*}, H^{*}, γ^{*})

. The edges and edge labels, on the other hand, are given by the first component for

G^{*}

, and the second component for

H^{*}

, of the edge labels of

Υ^{*} (G, H, A_{γ})

distinct from ⌀ as specified in (A2) and (A3) above, together with the reaction edges and their labels that were already defined by the partial AAM

γ

between G and H. Taken together,

A_{γ}

therefore defines a unique ITS graph

Υ (G^{*}, H^{*}, γ^{*})

that describes a balanced reaction

G^{*} ⟶ H^{*}

with AAM

γ^{*}

. Moreover,

γ^{*}

by construction extends

γ

. □

A concrete example is shown in Figure 6. By slight abuse of the notation, moreover, it follows in general that

G^{*} [V (G)] ≅ G

and

H^{*} [V (H)] ≅ H

, i.e., adjacency and both vertex and edge labels are preserved between

G^{*} [V (G)]

and G as well as between

H^{*} [V (H)]

and H. In other words,

G^{*}

and

H^{*}

contain G and H as induced subgraphs with preserved labels.

Similarly, we consider now two vertices

x \in V (G)

and

y \in V (G^{*}) ∖ V (G)

. The corresponding columns in

A

are of the form

(x^{1}, x^{2})

or

(x^{1}, ⌀)

and

(⌀, y^{2})

, and thus

x y \notin E (G^{*})

, i.e., there is no edge connecting

V (G^{*}) ∖ V (G)

and

V (G)

. An analogous argument holds for

V (H^{*}) ∖ V (H)

and

V (H)

. Thus,

G^{*}

and

H^{*}

are extensions of G and H by one or more connected components. Therefore, the rebalancing of an unbalanced reaction

G ⟶ H

to a balanced reaction

G^{*} ⟶ H^{*}

thus only attaches additional connected components to G and H. Of course, these connected components correspond to the molecules that are missing in the unbalanced input reaction.

Our discussion so far suggests to further investigate algorithmic approaches for the inference of alignment-based AAMs that address the following scenario, which is of obvious practical relevance: Suppose we have obtained, for an unbalanced reaction

G ⟶ H

, the partial AAM

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

. The task is then to compute an extension

γ : U^{'} \to W^{'}

of

β

that (a) covers as many vertices of

V (G)

and

V (H)

as possible, and (b) at the same time implies as few (additional) reaction edges as possible. Once

γ

has been computed, both the rebalanced reaction

G^{*} ⟶ H^{*}

and the complete AAM

γ^{*}

are then uniquely defined by Proposition 2.

Proposition 2 sets, therefore, the stage for considering the rebalancing and inference of AAMs as a joint optimization problem, possibly given an input partial AAM as a constraint. Our discussion above shows that this can also be viewed as a constrained graph alignment problem. The problem simplifies considerably for unbalanced reactions, where compounds are missing either only on the reactant side or the product side since in this case,

\hat{G}

is an induced subgraph of

\hat{H}

, or vice versa, i.e., the alignment consists only of matches and insertions, or matches and deletions. We will pursue this topic in separate work.

4. Comparison of Partial AAM

4.1. Equivalence of Partial AAMs

Consider the dehydration reaction in Figure 7. From a chemical point of view, the two hydrogen atoms H7 and H8 are equivalent. This is not necessarily inherently obvious for the computational methods for inferring AAMs; an enumeration of atom maps thus may yield two alternative reactions.

In a more general setting, different AAM tools may return AAMs using their own numbering of atoms. It is of interest, therefore, to ask under which conditions two complete AAMs

α : V (G) \to V (H)

and

α^{'} : V (G^{'}) \to V (H^{'})

for isomorphic representations of the same reaction, i.e.,

G ≅ G^{'}

and

H ≅ H^{'}

, are equivalent. More formally, we ask whether there is a pair of isomorphisms

φ : V (G) \to V (G^{'})

and

ψ : V (H) \to V (H^{'})

such that

ψ (α (x)) = α^{'} (φ (x))

for all

x \in V (G)

. In [12], we showed that this condition is equivalent to the isomorphism of the complete ITS graphs

Υ (G, H, α)

and

Υ (G^{'}, H^{'}, α^{'})

. The two alternative ITS graphs for the dehydration reaction in Figure 7 indeed are isomorphic, and hence the corresponding AAMs are equivalent.

In this section, we extend the notion of equivalence to partial AAMs. The basic idea is that partial AAMs can only be equivalent if they cover “the same” parts of the reactant and product molecules in “the same” way. That is, the mapped parts of the molecules must match just like the full AAMs in [12] under isomorphisms for reactants and products, respectively.

Definition 8.

The partial AAMs

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

and

β^{'} : U^{'} \to W^{'}

with

U^{'} \subseteq V (G^{'})

and

W^{'} \subseteq V (H^{'})

are equivalent,

β \equiv β^{'}

, if there are extensions

β_{*} : V (G) \to V (H)

and

β_{*}^{'} : V (G^{'}) \to V (H^{'})

, respectively, of β and

β^{'}

, such that there exist isomorphisms

φ : V (G) \to V (G^{'})

for

G ≅ G^{'}

and

ψ : V (H) \to V (H^{'})

for

H ≅ H^{'}

, for which

β_{*}^{'} (φ (x)) = ψ (β_{*} (x))

holds for all

x \in V (G)

.

Note that in particular, we have

U^{'} = φ (U)

and

W^{'} = ψ (W)

.

As an immediate consequence of the definition and the results in [12], we obtain the following.

Observation 5.

Let for

G ≅ G^{'}

,

H ≅ H^{'}

,

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

, and

β^{'} : U^{'} \to W^{'}

with

U^{'} \subseteq V (G^{'})

and

W^{'} \subseteq V (H^{'})

. Then,

β \equiv β^{'}

implies

Υ (G [U], H [W], β) ≅ Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

.

In general, the isomorphism of the two partial ITS graphs is of course not sufficient to conclude that ITS graphs are isomorphic. This statement remains true even if, in addition, the reactant graphs (or the product graphs) are isomorphic; see Figure 8 for an example.

Here, the Diels–Alder reaction yields two distinct products, ortho (A) and meta (B). These products, despite their differences, still result in identical partial ITS graphs, in this case, the reaction centers. In fact, the isomorphism of partial ITS graphs and, in particular, reaction centers, is a very natural starting point to the defined classes of reactions [35].

The situation becomes less obvious if we assume

G ≅ G^{'}

and

H ≅ H^{'}

. The esterification in Figure 9 shows an example with isomorphic molecular graphs and isomorphic partial ITS graphs. Nevertheless, there are two non-isomorphic extensions. The example suggests that this is a consequence of the reaction edges that are not contained in the given partial ITS graphs.

Examples of non-equivalent AAMs with isomorphic partial ITS graphs as shown in Figure 9 seem to involve partial ITS graphs that are bad reaction patterns. This suggests that the isomorphism of the reactant and product graphs together with the isomorphism of a good reaction pattern is sufficient. This is not the case, however, as the example in Figure 10 shows: two isomorphic good reaction patterns

Υ (G [U], H [W], β)

and

Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

, for

G ⟶ H

and

G^{'} ⟶ H^{'}

with

G ≅ G^{'}

and

H ≅ H^{'}

. Nevertheless, the full ITS graphs

Υ (G, H, β)

and

Υ (G^{'}, H^{'}, β^{'})

are non-isomorphic. This is possible, in particular, whenever there are automorphisms of the reaction centers of G and H preserving the structure of the partial ITS graphs but not of the extensions of such good AAMs.

In principle, it is possible to test whether

β \equiv β^{'}

by iterating over all isomorphisms

φ

of G and

G^{'}

and

ψ

of H and

H^{'}

and to test the condition in the definition directly.

For computational efficiency, it is of interest, however, to make use of the necessary condition in Observation 5. To this end, we note first that there is a 1–1 correspondence between U and W with the vertices

V (Υ (G, H, β))

. To see this, let

η_{G} : U \to V (Υ)

,

η_{H} : W \to V (Υ)

,

η_{G^{'}} : U^{'} \to V (Υ^{'})

,

η_{H^{'}} : W^{'} \to V (Υ^{'})

be known embeddings mapping the parts of G and H into

Υ = Υ (G, H, β)

and

G^{'}

and

H^{'}

into

Υ^{'} = Υ (G^{'}, H^{'}, β^{'})

, respectively, as depicted by Figure 11. Note that in practical applications, these maps are part of the output of a tool that computes partial AAMs. Thus, we can assume that they are known and fixed. Now consider an isomorphism

ϑ : V (Υ) \to V (Υ^{'})

of the partial ITS graphs. The construction of the ITS graph immediately implies that the two maps

{\tilde{φ}}_{ϑ} ≔ η_{G^{'}}^{- 1} \circ ϑ \circ η_{G} {\tilde{ψ}}_{ϑ} ≔ η_{H^{'}}^{- 1} \circ ϑ \circ η_{H}

(1)

are the isomorphisms

{\tilde{φ}}_{ϑ} : U \to U^{'}

from

G [U]

and

G^{'} [U^{'}]

, and

{\tilde{ψ}}_{ϑ} : W \to W^{'}

for

H [W]

and

H^{'} [W^{'}]

induced by the isomorphism

ϑ

of the ITS graphs. This discussion constitutes a proof of the following.

Proposition 3.

Let for

G ≅ G^{'}

,

H ≅ H^{'}

,

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

, and

β^{'} : U^{'} \to W^{'}

with

U^{'} \subseteq V (G^{'})

and

W^{'} \subseteq V (H^{'})

. Then,

β \equiv β^{'}

if and only if there is an isomorphism

ϑ : V (Υ) \to V (Υ^{'})

from

Υ = Υ (G [U], H [W], β)

and

Υ^{'} = Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

such that

{\tilde{φ}}_{ϑ}

can be extended to an isomorphism

φ_{ϑ} : V (G) \to V (G^{'})

and

{\tilde{ψ}}_{ϑ}

can be extended to an isomorphism

ψ_{ϑ} : V (H) \to V (H^{'})

.

This result provides us with a more efficient algorithm to check the equivalence of partial AAMs: it is sufficient to list all isomorphisms

ϑ

of the partial ITS graphs

Υ (G [U], H [W], β)

and

Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

and to use, for example, the VF2 algorithm to inspect whether the implied partial isomorphisms

{\tilde{φ}}_{ϑ}

and

{\tilde{ψ}}_{ϑ}

can be extended to full automorphisms of the reactant and product graphs, respectively. As soon as one such isomorphism

ϑ

is encountered, the answer is affirmative.

Analogous to the discussion following Proposition 1, one could also use an isomorphism test on suitable relabeled graphs that encode the partial isomorphisms

{\tilde{φ}}_{ϑ}

and

{\tilde{ψ}}_{ϑ}

. Figure 11 summarizes the most important (sub)graphs we have encountered so far and the maps between them.

This notion of equivalence of partial ITS graphs does not require that the reaction

G ⟶ H

is balanced. In fact, the possibility to extend that partial ITS to isomorphisms

G ≅ G^{'}

and

H ≅ H^{'}

immediately implies that one can also obtain isomorphic alignment graphs. It should be noted, however, that score-optimal alignments, and thus the ITS graphs obtained from them as described in Section 3.3, are not necessarily unique (up to isomorphism). In general, therefore, it will not suffice to only check the isomorphism of a score-optimal alignment-based ITS graph for

G ⟶ H

and

G^{'} ⟶ H^{'}

, unless all co-optimal solutions are considered.

4.2. Consistency of Partial AAMs

There is no guarantee, however, that two distinct AAM computations will produce maps on “the same” atoms. The AAMs then are clearly non-equivalent, but they may still be consistent in the sense that their “union” is a partial AAM. More formally, we consider

U, U^{'} \subseteq V (G)

and

W, W^{'} \subseteq V (H)

and two bijective maps

β : U \to W

and

β^{'} : U^{'} \to W^{'}

. Then, their union

β \cup β^{'} : U \cup U^{'} \to W \cup W^{'}

given by

β \cup β^{'} (x) = β (x)

if

x \in U

and

β \cup β^{'} (x) = β^{'} (x)

if

x \in U^{'}

is well defined if and only if

β (x) = β^{'} (x)

for all

x \in U \cap U^{'}

. Moreover,

β \cup β^{'}

is again a bijection if and only if

β (U ∖ U^{'}) = W ∖ W^{'}

and

β^{'} (U^{'} ∖ U) = W^{'} ∖ W

.

The problem with this idea is again that the notions of “the same atoms” and of “union” in the context of AAMs is again subject to isomorphisms.

Definition 9.

The partial AAMs

β : U \to W

with

U \subseteq V (G)

and

W \subseteq V (H)

and

β^{'} : U^{'} \to W^{'}

with

U^{'} \subseteq V (G^{'})

and

W^{'} \subseteq V (H^{'})

are consistent if there are isomorphisms

φ : V (G) \to V (G^{'})

for

G ≅ G^{'}

and

ψ : V (H) \to V (H^{'})

for

H ≅ H^{'}

such that the union

γ ≔ β \cup (ψ^{- 1} \circ β^{'} \circ φ) : U \cup φ^{- 1} (U^{'}) \to W \cup ψ^{- 1} (W^{'})

is well defined and bijective.

In particular, therefore

β = ψ^{- 1} \circ β^{'} \circ φ

on

\tilde{U} ≔ U \cap φ^{- 1} (U^{'})

. Equivalently,

ψ (β (x)) = β^{'} (φ (x))

for all

x \in \tilde{U}

. Let us write

\tilde{W} ≔ W \cap ψ^{- 1} (W^{'})

and

\tilde{β} : \tilde{U} \to \tilde{W}

for the restriction of

β

to the vertices also mapped under the alternative atom map

ψ^{- 1} \circ β^{'} \circ φ

. Then, there is a partial ITS graph

Υ (G [\tilde{U}], H [\tilde{W}], \tilde{β})

that is isomorphic to a common induced subgraph of

Υ (G [U], H [W], β)

and

Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

.

Moreover, if

β : U \to W

and

β^{'} : U^{'} \to W^{'}

are consistent with associated isomorphisms

φ : V (G) \to V (G^{'})

and

ψ : V (H) \to V (H^{'})

, then the union

γ = β \cup (ψ^{- 1} \circ β^{'} \circ φ)

is a partial AAM and thus there is a common partial ITS graph

Υ (G [U \cup φ^{- 1} (U^{'})], H [W \cup ψ^{- 1} (W^{'})], γ)

that contains both

Υ (G [U], H [W], β)

and

Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

as subgraphs.

Furthermore,

γ^{'} = ψ \circ γ \circ φ^{- 1}

is the corresponding union of AAMs for

G^{'} ⟶ H^{'}

, and by construction, we have

γ \equiv γ^{'}

.

The problem of consistency between partial ITS that may overlap at least partially arises, for example, when reconstructing AAMs from KEGG RCLASSes [38]. An example of partial AAMs obtained from KEGG RCLASSes pertaining to the same reaction that turn out to be inconsistent is given in Appendix A.1. This real-life example emphasizes the utility of a mathematically sound formalization of partial AAM data. The question also naturally arises when partial AAMs obtained from different tools are computed for unbalanced reactions. In this setting, it is a natural task to reconcile these partial AAMs before computing extensions and alignment-based ITS graph as outlined in Section 3.2.

5. Atom-to-Atom Maps and Tautomerism

5.1. The Problem

Up to this point, we have studied molecules and molecular graphs. The phenomenon of tautomerism, however, rapidly interconverts certain structural isomers. Tautomers are distinct chemical species whose interconversion depends on external conditions such as solvent, temperature, or other physicochemical parameters. Tautomerization is therefore a bona fide chemical reaction that, in the simplest case, matches the scheme

H - M - Q = Z \leftrightarrow M = Q - Z - H

(2)

where H is a mobile group, typically a hydrogen, and Q is a neutral atom [24,25].

A difficulty arising from tautomerism is that in some cases atoms are “chemically equivalent” despite the fact that they are not in a common orbit of any automorphism of the molecular graph. This is of relevance when comparing AAMs and their ITS graphs because non-isomorphic maps may still both correctly describe a given reaction. A well-known example is the two oxygen atoms in a carboxylic acid R−COOH. In the esterification reaction shown in Figure 12, the oxygen atom O³ and O⁴ on the left-hand side of the reactions are equivalent from a chemical point of view; thus, both atom maps can be viewed as chemically correct.

More generally, whenever M and Z in Equation(2) are atoms of the same type, both carrying no isomorphic additional substituents, M and Z become chemically equivalent due to the migration of H and the concurrent changes in bonding.

5.2. Tautomerism Equivalence

In order to handle such situations mathematically, it seems natural to combine the reaction of interest with possible tautomerizations on both the reactant and the product sides. Two reactions

G ⟶ H

and

G^{'} ⟶ H^{'}

with AAMs

α : V (G) \to V (H)

and

α^{'} : V (G^{'}) \to V (H^{'})

then could be considered “chemically the same” if G and

G^{'}

as well as H and

H^{'}

differ by tautomerizations of one or more of their constituent molecules.

From a mathematical point of view, this means that there are tautomerization reactions

G \to G^{'}

and

H \to H^{'}

with corresponding AAMs

τ : V (G) \to V (G^{'})

and

π : V (H) \to V (H^{'})

. Since the atoms in G and

G^{'}

as well as H and

H^{'}

will in general be numbered independently, we think of

τ

and

π

as maps that already include, i.e., are already composed with, the respective graph isomorphisms. The set of admissible maps

τ

and

π

is, in this new formulation, not only graph isomorphisms. Instead, we now admit ITS graphs

Υ (G, G^{'}, τ)

and

Υ (H, H^{'}, π)

that describe tautomerizations.

Definition 10.

Two reactions

G ⟶ H

with AAM

α : V (G) \to V (H)

and

G^{'} ⟶ H^{'}

with AAM

α^{'} : V (G^{'}) \to V (H^{'})

are tautomerism equivalent if there are bijections

τ : V (G) \to V (G^{'})

and

π : V (H) \to V (H^{'})

such that both

Υ (G, G^{'}, τ)

and

Υ (H, H^{'}, π)

are ITS graphs of valid tautomerization reactions and

π \circ α = α^{'} \circ τ

.

Of course, we have to include isomorphisms as “trivial tautomerizations” which correspond to ITS graphs without reaction edges. As an immediate consequence of this definition, a pair of equivalent AAMs is also tautomerism equivalent, but the converse is not true. While this may be a non-trivial problem in its own right, we assume for the remainder of this section that there is a computational procedure that decides whether or not a given ITS graph

Υ (G, H, τ)

such that

τ

describes a tautomerization reaction.

Observation 6.

The reactions

G ⟶ H

and

G^{'} ⟶ H^{'}

are tautomerism equivalent (up to automorphisms of G and

H^{'}

) if and only if there are tautomerization AAMs τ for

G ⟶ G^{'}

and π for

H ⟶ H^{'}

such that

Υ (G, H^{'}, π \circ α)

and

Υ (G, H^{'}, α^{'} \circ τ)

are isomorphic.

Proof.

Each of

τ

,

π

,

α

, and

α^{'}

is an AAM, i.e., a bijection that preserves vertex labels; thus, this remains true for their compositions. That is, both

π \circ α

and

α^{'} \circ τ

are AAMs for the composed reaction

G ⟶ H^{'}

. But by Cor.1 in [12], the equivalence of the AAMs is equivalent to the isomorphism of the corresponding ITS graphs, from where the statement follows. □

The check for tautomerization equivalence thus again reduces to checking the graph isomorphism of the ITS graph. However, now we have to check all concatenations of the given AAMs with tautomerization reactions on the reactant and product sides.

An alternative approach is to compare the ITS graphs

Υ ≔ Υ (G, H, α)

and

Υ^{'} ≔ Υ (G^{'}, H^{'}, α^{'})

directly. While they are not isomorphic in general, there is still a bijection

ξ : V (Υ) \to V (Υ^{'})

that preserves the vertex labels. Recall that both vertex labels and edge labels in the ITS graphs

Υ (G, H, α)

and

Υ (G^{'}, H^{'}, α^{'})

are pairs, whose individual components are labels of G and H, and

G^{'}

and

H^{'}

, respectively. We can thus use

ξ

to construct a graph

Ξ

whose vertices have labels

a_{Ξ} (x) = (a_{G} (x), a_{H} (x), a_{G^{'}} (ξ (x)), a_{H^{'}} (ξ (x)))

, all of which are identical by definition. There are edges

x y \in E (Ξ)

if

x y \in E (Υ (G, H, α))

or

ξ (x) ξ (y) \in E (Υ (G^{'}, H^{'}, α^{'}))

with labels

b_{Ξ} (x y) = (b_{G} (x y), b_{H} (x y), b_{G^{'}} (ξ (x) ξ (y)), b_{H^{'}} (ξ (x) ξ (y)))

, with the convention that the label is set to

(b_{G} (x y), b_{H} (x y)) = (⌀, ⌀)

if

x y \notin E (Υ (G, H, α))

and

(b_{G^{'}} (ξ (x) ξ (y)), b_{H^{'}} (ξ (x) ξ (y))) = (⌀, ⌀)

if

ξ (x) ξ (y) \notin E (Υ (G^{'}, H^{'}, α^{'}))

. Thus, for an arbitrary graph

Ξ

of this type with vertex labels

a_{Ξ} (x) = (a_{Ξ}^{1} (x), a_{Ξ}^{2} (x), a_{Ξ}^{3} (x), a_{Ξ}^{4} (x))

and edge labels

b_{Ξ} (x y) = (b_{Ξ}^{1} (x y), b_{Ξ}^{2} (x y), b_{Ξ}^{3} (x y), b_{Ξ}^{4} (x y))

, we find:

Observation 7.

Υ (G, G^{'}, τ)

is obtained from Ξ by retaining only the vertex labels

(a_{Ξ}^{1} (x), a_{Ξ}^{3} (x))

and edge labels

(b_{Ξ}^{1} (x y), b_{Ξ}^{3} (x y))

and removing all edges with empty labels, i.e.,

(b_{Ξ}^{1} (x y), b_{Ξ}^{3} (x y)) = (⌀, ⌀)

.

Υ (H, H^{'}, π)

is obtained analogously from the second and fourth components of the labels.

Thus, it is possible, at least in principle, to test whether two ITS graphs are tautomerism equivalent by constructing, for a given vertex label-preserving bijection

ξ

, the auxiliary graph

Ξ

and checking whether the subgraphs

Υ (G, G^{'}, τ)

and

Υ (H, H^{'}, π)

, which are obtained as “projections” from

Ξ

, describe tautomerization reactions. In practice, it will not be necessary to consider all bijections

ξ

.

Instead, we expect that it will usually suffice to compute the graph

Ξ

, which (almost) minimizes the total number of unmatched edges and matching edges with different labels. This corresponds to computing the minimal chemical distance of the two ITS graphs

Υ (G, G^{'}, τ)

and

Υ (H, H^{'}, π)

.

5.3. Tautomerism ITS Graphs

In order to characterize the ITS graphs of a valid tautomerization reaction with AAM

τ

, we first note that tautomerization reactions are by definition intramolecular. Thus,

τ

is a disjoint union of bijections between the connected components of

G, G^{'}

, and thus,

Υ (G, G^{'}, τ)

comprises connected components that are in 1–1 correspondence with the connected components of both G and

G^{'}

. It therefore suffices to consider tautomerism ITS graphs for single molecules. That is,

Υ (G, G^{'}, τ)

is a tautomerism ITS graph for the reactants if and only if every connected component of

Υ (G, G^{'}, τ)

is a tautomerism ITS graph. In other words, it is sufficient to specify what constitutes connected tautomerism ITS graphs.

Though Equation (2) defines elementary tautomeric reactions, it is no trivial task to determine whether a graph

Υ (G, G^{'}, τ)

connects two tautomers. The reason is that these may be connected by a sequence of reactions of the form of Equation (2) rather than by a single transformation of this type. This is complicated by the fact that tautomerization reactions in general cannot be applied in arbitrary order, and the composition of tautomerization reactions does not necessarily conform to the pattern in Equation (2). For an elaborate example, see Figure 13.

These issues are studied in considerable detail in the theory of rule composition for graph rewriting systems [20,45]. While it may be possible to construct a formal system to recognize arbitrary tautomers on this basis, such an endeavor goes beyond the scope of the present contribution. As a pragmatic alternative, tautomer databases [46] will be helpful.

We note, finally, that mesomeric (resonance) structures could be handled in the same manner. This pertains in particular to aromatic systems. Alternatively, however, in this case, aromatic bonds may also be used as a separate bond types.

6. Hydrogens—A Necessary Nuisance

The representation of hydrogen atoms causes a variety of problems in formal descriptions of chemistry. There are good reasons to omit them partially or entirely in structural formulae. There are, however, as many good reasons to include them in other contexts. The omission of hydrogens without loss of information is possible because of the valency rules for carbon C and nitrogen N. The number of adjacent hydrogen is simply given as the difference between the expected valency of 4 or 3 and the number of incident bonds. This requires a complete molecule, possibly with explicitly given wildcard residues (e.g., −R). In reaction rules, on the other hand, the presence of a hydrogen taking part in a reaction, is essential for the reaction pattern since the reaction cannot take place in its absence.

Apart from a more compact structural formula, the omission of hydrogens simplifies the computation of automorphisms and isomorphisms. If hydrogens are represented explicitly, a −CH₃ group adds an extra six automorphisms, just mapping the three hydrogens onto each other. The omission of hydrogens is therefore also common practice for AAM tools, even though this leads to bad reaction patterns whenever hydrogens are part of the reaction center.

A more subtle issue at least closely related to the common omission of hydrogen is the question of which hydrogens are actually preserved through a formally balanced reaction. As an example, consider the balanced esterification reaction

\begin{matrix} R - C O O H + R - O H ⟶ R - (C = O) - O - R + H_{2} O \end{matrix}

At face value, the hydrogen atoms in H₂O on the product side come from the acid and alcohol reactants, respectively. A closer look at the reaction mechanism, however, shows that this is indeed a multi-step reaction that can be written as

\begin{matrix} \begin{matrix} H^{+} + R - C O O H + R - O H ⟶ R - C^{+} {(O H)}_{2} + R - O H ⟶ R - C {(O H)}_{2} - O - R + H^{+} \end{matrix} \\ \begin{matrix} R - C {(O H)}_{2} - O - R ⟶ R - (C = O) - O - R + H_{2} O \end{matrix} \end{matrix}

Here, the H⁺ of the alcohol is exchanged for the H⁺ reactant that is taken from the solvent, while two hydrogens in the H₂O in the second line are the solvent-derived H⁺ and the hydrogen from the carboxylic acid. Assuming an aqueous solvent, all protons are thus exchanged with the solvent here. In reaction mechanisms such as unimolecular nucleophilic substitution (

S_{N 1}

) and unimolecular elimination (

E_{1}

), the involvement of a solvent-borne proton is crucial, and it is incorrect to assume that the hydrogen exchange occurs solely within the internal reaction system.

On the other hand, it is also not correct to assume that hydrogens are always exchanged with the solvent system. In the last step of the biosynthesis of lanosterol in Figure 14, for instance, the hydrogen atoms are relocated by 1,2-hydrogen and 1,2-methyl shifts that keep them within the system.

This suggests to address the issue in a more formal way that accommodates the correct mappings both with and without exchange with the medium. From a formal point of view, the best solution would be a decomposition of reactions into sequences of reactions that explicitly represent necessary hydrogens “borrowed” from the environment. Unfortunately, few data have this level of resolution.

The next-best approach is to attach an additional attribute label to hydrogens that can be exchanged with the medium in the course of a reaction. Together with this extra label, AAMs on the available reaction data could carry the pertinent information. Still, it is not clear how to compute such labels efficiently. Learning approaches are appealing for this purpose but will depend on well-curated datasets.

7. Concluding Remarks

We have surveyed the connections between incomplete atom-to-atom mapping data and their natural graph-theoretical description. In particular, in the context of unbalanced reactions, there is a natural but non-trivial connection to graph alignments.

The focus of this work is entirely on the mathematical framework and the formal properties of partial AAMs and their equivalent ITS graphs. Nevertheless, we show that in the case of balanced chemical reactions, the partial AAM can be extended to a global AAM with very high accuracy by solving a purely combinatorial optimization problem (framed here as an ILP that minimized chemical distance). To this end, only a small “anchor” is needed, e.g., a partial reaction center.

This task appears to become much more difficult on unbalanced reaction data. While we provide here a solid formal basis for this problem, practical implementations and an empirical analysis is the subject of ongoing work that will be published elsewhere. In particular, we have not attempted to devise efficient algorithms for the various restricted graph isomorphism problems appearing throughout the text. In this context, it may be of interest to invoke topological graph indices. While global graph invariants [48] may help to rule out the existence of isomorphisms, local versions, so-called vertex invariants, may be used to reduce the possible mappings and hence result in more efficient subgraph isomorphism checkers; see refs. [49,50].

Beyond the question of how AAMs can be obtained and completed, their comparison is a problem of practical significance, particularly in the context of benchmarking the increasing portfolio of atom-to-atom mapping tools. Again, our focus lies on the treatment of partial data, and the connection to testing isomorphisms of the corresponding subgraphs.

Not surprisingly, the comparison of subgraphs alone provides a necessary but not sufficient condition, which has to be complemented by additional requirements on the embeddings of the subgraphs. Interestingly, this framework extends to weaker forms of the equivalence of AAMs that naturally encompass tautomerism. As an open problem, however, the question remains of how to formally determine when two molecular graphs are tautomers of each other. Mesomeric formulas are captured by the same formalism.

A notoriously difficult problem is the handling of hydrogen atoms and their persistence through chemical reactions. Here, we have only touched upon this topic to the extent that it has a direct bearing on AAMs and their completion. Our treatment of the topic is far from complete or comprehensive, although we highlight some of the major issues and sketch some possible approaches that could be explored in future work.

The detailed presentation of the partial AAMs and ITS graphs results in a concise specification of several optimization problems associated with the completion of partial reaction data. As an example, we considered the completion of bad partial AAMs in the setting of balanced reactions (Section 3.2).

Related problems, touched upon in our presentation, most likely can be tackled by similar ILP formulations, providing a solid starting point for dedicated computational investigations. Algorithmic aspects, however, are beyond the scope of this contribution.

It should not be forgotten, furthermore, that even in the presence of partial chemically correct information, there is no guarantee that purely combinatorial objectives, such as the minimum edit distance or maximum common subgraphs, will lead to chemically correct solutions. We expect, however, that this will be the case in most instances.

The main practical use of the framework outlined here, thus, is to expose how partial information, that we expect to become increasingly available with the rise in learning-based approaches in computational chemistry, can be incorporated into optimization-centered approaches.

Author Contributions

M.E.G.L. and P.F.S. designed the study. K.W. and T.G. performed the computational experiments. M.E.G.L., K.W., T.-L.P., N.B., N.D., C.F., T.G., D.M. and P.F.S. contributed to the derivation of the theoretical results, the discussion, and the writing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

M.E.G.L acknowledges financial support by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus in the program Center of Excellence for AI-research “Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig”, project identification number: ScaDS.AI. The research of K.W. and T.-L.P received funding from the European Unions Horizon Europe Doctoral Network program under the Marie-Skłodowska-Curie grant agreement No 101072930 (TACsy—Training Alliance for Computational systems chemistry). The research of N.D. is supported by the German Research Foundation (DFG) as part of the Priority Program SPP 2363 (grant no. STA 850/58-1 497135079 to P.F.S.). The research of N.B. and T.G. is supported by the Novo Nordisk Foundation (grant NNF21OC0066551 “MATOMIC” to C.F., D.M., and P.F.S.). P.F.S acknowledges support by the German Federal Ministry of Education and Research BMBF through DAAD project 57616814 (SECAI, School of Embedded Composite AI). Publication costs were covered in part by ScaDS.AI and in part by the Open Access Publishing Fund of Leipzig University supported by the German Research Foundation within the program Open Access Publication Funding.

Data Availability Statement

The software package as well as the scripts and data to re-run the validation experiments are available at https://github.com/klausweinbauer/AAMUtils/tree/paper (accessed on 27 August 2024).

Acknowledgments

We thank Maria Waldl for the valuable comments on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AAM	atom-to-atom map
CGR	condensed graph of the reaction
DPO	double pushout (graph grammars)
ILP	integer linear program
ITS	imaginary transition state

Appendix A

Appendix A.1. Inconsistent AAMs from KEGG RCLASSes

The KEGG RCLASSes shown in Figure A1 determine two partial AAMs for the KEGG reaction R01144. The map AAM1 in this figure comprises atoms maintained between oxaloacetate 4-methyl ester and oxaloacetate, while AAM2 connects oxaloacetate 4-methyl ester and methanol.

The two partial AAMs are inconsistent because the O with label 2 in the reactants is being mapped to two non-equivalent atoms on the product side, i.e., there is no isomorphism between the product graphs mapping the corresponding images of this atom to each other.

Figure A1. The KEGG RCLASSes RC00460 (upper panel) and RC00461 (lower panel) available for the KEGG reaction R01144.

References

Elsevier “Reaxys”. Available online: https://www.elsevier.com/products/reaxys (accessed on 20 July 2024).
Crabtree, J.D.; Mehta, D.P. Automated reaction mapping (ACM). J. Exp. Algorithmics 2009, 13, 15. [Google Scholar] [CrossRef]
Chen, W.L.; Chen, D.Z.; Taylor, K.T. Automatic reaction mapping and reaction center detection. WIREs Comput. Mol. Sci. 2013, 3, 560–593. [Google Scholar] [CrossRef]
Fooshee, D.; Andronico, A.; Baldi, P. ReactionMap: An efficient atom-mapping algorithm for chemical reactions. J. Chem. Inf. Model. 2013, 53, 2812–2819. [Google Scholar] [CrossRef] [PubMed]
Mann, M.; Nahar, F.; Schnorr, N.; Backofen, R.; Stadler, P.F.; Flamm, C. Atom mapping with constraint programming. Alg. Mol. Biol. 2014, 9, 23. [Google Scholar] [CrossRef] [PubMed]
Rahman, S.A.; Torrance, G.; Baldacci, L.; Martínez Cuesta, S.; Fenninger, F.; Gopal, N.; Choudhary, S.; May, J.W.; Holliday, G.L.; Steinbeck, C.; et al. Reaction decoder tool (RDT): Extracting features from chemical reactions. Bioinformatics 2016, 32, 2065–2066. [Google Scholar] [CrossRef]
Savelev, A.; Puzanov, I.; Samoilov, V.; Karnaukhov, V. Indigo Toolkit. 2019. Available online: https://lifescience.opensource.epam.com/indigo/index.html (accessed on 27 August 2024).
Schwaller, P.; Hoover, B.; Reymond, J.L.; Strobelt, H.; Laino, T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv. 2021, 7, eabe4166. [Google Scholar] [CrossRef]
Nugmanov, R.; Dyubankova, N.; Gedich, A.; Wegner, J.K. Bidirectional graphormer for reactivity understanding: Neural network trained to reaction atom-to-atom mapping task. J. Chem. Inf. Model. 2022, 62, 3307–3315. [Google Scholar] [CrossRef]
Chen, S.; An, S.; Babazade, R.; Jung, Y. Precise atom-to-atom mapping for organic reactions via human-in-the-loop machine learning. Nat. Commun. 2024, 15, 2250. [Google Scholar] [CrossRef]
Lan, Z.; Zeng, Z.; Hong, B.; Liu, Z.; Ma, F. RCsearcher: Reaction center identification in retrosynthesis via deep Q-learning. Pattern Recogn. 2024, 150, 110318. [Google Scholar] [CrossRef]
González Laffitte, M.E.; Beier, N.; Domschke, N.; Stadler, P.F. Comparison of atom maps. MATCH Comm. Math. Comp. Chem. 2023, 90, 75–102. [Google Scholar] [CrossRef]
Fujita, S. Description of organic reactions based on imaginary transition structures. 1. Introduction of new concepts. J. Chem. Inf. Comput. Sci. 1986, 26, 205–212. [Google Scholar] [CrossRef]
Wilcox, C.S.; Levinson, R.A. A self-organized knowledge base for recall, design, and discovery in organic chemistry. In Artificial Intelligence Applications in Chemistry; Pierce, T.H., Hohne, B.A., Eds.; Am. Chem. Soc. Symposium Series, Chapter 18; American Chemical Society: Washington, DC, USA, 1986; Volume 306, pp. 209–230. [Google Scholar] [CrossRef]
Hoonakker, F.; Lachiche, N.; Varnek, A.; Wagner, A. A representation to apply usual data mining techniques to chemical reactions—Illustration on the rate constant of SN₂ reactions in water. Int. J. Artif. Intell. Tools 2011, 20, 253–270. [Google Scholar] [CrossRef]
Lowe, D.M. Extraction of Chemical Structures and Reactions from the Literature; Technical Report; Apollo—University of Cambridge Repository: Cambridge, UK, 2012. [Google Scholar] [CrossRef]
Nugmanov, R.I.; Mukhametgaleev, R.N.; Akhmetshin, T.; Gimadiev, T.R.; Afonina, V.A.; Madzhidov, T.I.; Varnek, A. CGRtools: Python library for molecule, reaction, and condensed graph of reaction processing. J. Chem. Inf. Mod. 2019, 59, 2516–2521. [Google Scholar] [CrossRef] [PubMed]
Zhang, C.; Arun, A.; Lapkin, A.A. Completing and balancing database excerpted chemical reactions with a hybrid mechanistic-machine learning approach. ACS Omega 2024, 9, 18385–18399. [Google Scholar] [CrossRef]
Phan, T.L.; Weinbauer, K.; Gärtner, T.; Merkle, D.; Andersen, J.L.; Fagerberg, R.; Stadler, P.F. Reaction rebalancing: A novel approach to curating reaction databases. J. Cheminf. 2024, 16, 82. [Google Scholar] [CrossRef]
Andersen, J.L.; Flamm, C.; Merkle, D.; Stadler, P.F. Inferring chemical reaction patterns using graph grammar rule composition. J. Syst. Chem. 2013, 4, 4. [Google Scholar] [CrossRef]
Andersen, J.L.; Flamm, C.; Merkle, D.; Stadler, P.F. A software package for chemically inspired graph transformation. In Proceedings of the Graph Transformation, ICGT 2016, Vienna, Austria, 5–6 July 2016; Echahed, R., Minas, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9761, pp. 73–88. [Google Scholar] [CrossRef]
IUPAC Chemical Nomenclature and Struture Representation Division; Brecher, J. Graphical representation standards for chemical structure diagrams (IUPAC Recommendations 2008). Pure Appl. Chem. 2008, 80, 277–410. [Google Scholar] [CrossRef]
Mann, M.; Thiel, B. Kekulé structure enumeration yields unique SMILES. In Proceedings of the WCB13: Workshop on Constraint Based Methods for Bioinformatics, Uppsala, Sweden, 16–20 September 2013; pp. 57–65. [Google Scholar]
Connolly Martin, Y. Let’s not forget tautomers. J. Comput. Aided. Mol. Des. 2009, 23, 693–704. [Google Scholar] [CrossRef]
Sayle, R.A. So you think you understand tautomerism? J. Comput. Aided. Mol. Des. 2010, 24, 485–496. [Google Scholar] [CrossRef]
González Laffitte, M.E.; Stadler, P.F. Progressive multiple alignment of graphs. Algorithms 2024, 17, 116. [Google Scholar] [CrossRef]
Harary, F. Graph Theory; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
Horn, F.J.M. Necessary and sufficient conditions for complex balancing in chemical kinetics. Arch. Ration. Mech. Anal. 1972, 49, 172–186. [Google Scholar] [CrossRef]
Feinberg, M. Complex balancing in general kinetic systems. Arch. Ration. Mech. Anal. 1972, 49, 187–194. [Google Scholar] [CrossRef]
Dugundji, J.; Ugi, I. An algebraic model of constitutional chemistry as a basis for chemical computer programs. Topics Curr. Chem. 1973, 39, 19–64. [Google Scholar] [CrossRef]
Lynch, M.F.; Willett, P. The automatic detection of chemical reaction sites. J. Chem. Inf. Comput. Sci. 1978, 18, 154–159. [Google Scholar] [CrossRef]
Funatsu, K.; Endo, T.; Kotera, N.; Sasaki, S.I. Automatic recognition of reaction site in organic chemical reactions. Tetrahedron Comput. Methodol 1988, 1, 53–69. [Google Scholar] [CrossRef]
Fujita, S. Imaginary transition structures. A novel approach to computer oriented representation of organic reactions. J. Synth. Org. Chem. Japan 1989, 47, 396–412. [Google Scholar] [CrossRef]
Flamm, C.; Müller, S.; Stadler, P.F. Every atom-atom map for neutral molecules can be explained by electron pair pushing diagrams. Discr. Math. Chem. 2024, in press. [Google Scholar] [CrossRef]
Hendrickson, J.B. Descriptions of reactions: Their logic and applications. Red. Trav. Chim. Pays-Bas 1992, 111, 323–334. [Google Scholar] [CrossRef]
Fujita, S. A novel approach to the enumeration of reaction types by counting reaction-center graphs which appear as the substructures of imaginary transition structures. Bull. Chem. Soc. Japan 1988, 61, 4189–4206. [Google Scholar] [CrossRef]
Hattori, M.; Okuno, Y.; Goto, S.; Kanehisa, M. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc. 2003, 125, 11853–11865. [Google Scholar] [CrossRef]
Beier, N.; Gatter, T.; Stadler, P.F. Converting KEGG RCLASS data into DPO graph rewriting rules. 2024; in preparation. [Google Scholar]
Cordella, L.P.; Foggia, P.; Sansone, C.; Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1367–1372. [Google Scholar] [CrossRef] [PubMed]
Alpár Jüttner, A.; Madarasi, P. VF2++—An improved subgraph isomorphism algorithm. Discr. Appl. Math. 2018, 242, 69–81. [Google Scholar] [CrossRef]
Jochum, C.; Gasteiger, J.; Ugi, I. The principle of minimum chemical distance (PMCD). Ang. Chem. Intl. Ed. 1980, 19, 495–505. [Google Scholar] [CrossRef]
Takapoui, R.; Boyd, S. Linear programming heuristics for the graph isomorphism problem. Technical Report 1611.00711. arXiv 2016, arXiv:1611.00711. [Google Scholar] [CrossRef]
Linderoth, J.T.; Ralphs, T.K. Noncommercial software for mixed-integer linear programming. In Integer Programming: Theory and Practice; Karlof, J., Ed.; CRC Press: Boca Raton, FL, USA, 2005; pp. 253–303. [Google Scholar]
CGRTools, Read the Docs: Data Types and Operations with Them. Available online: https://cgrtools.readthedocs.io/tutorial/1_data_types_and_operations.html#Bonds-has-order-and-p_order-attribute (accessed on 3 June 2024).
Behr, N.; Krivine, J. Compositionality of rewriting rules with conditions. Compositionality 2021, 3, 2. [Google Scholar] [CrossRef]
Dhaked, D.K.; Guasch, L.; Nicklaus, M.C. Tautomer database: A comprehensive resource for tautomerism analyses. J. Chem. Inf. Model. 2020, 60, 1090–1100. [Google Scholar] [CrossRef]
Huff, M.; Telford, D. Lord of the rings—The mechanism for oxidosqualene: Lanosterol cyclase becomes crystal clear. Trends Pharmacol. Sci. 2005, 26, 335–340. [Google Scholar] [CrossRef] [PubMed]
Dehmer, M.; Chen, Z.; Emmert-Streib, F.; Shi, Y.; Tripathi, S.; Musa, A.; Mowshowitz, A. Properties of graph distance measures by means of discrete inequalities. Appl. Math. Model. 2018, 59, 739–749. [Google Scholar] [CrossRef]
Bonnici, V.; Giugno, R.; Pulvirenti, A.; Shasha, D.; Ferro, A. A subgraph isomorphism algorithm and its application to biochemical data. BMC Bioinform. 2013, 14, S13. [Google Scholar] [CrossRef]
Aparo, A.; Bonnici, V.; Micale, G.; Ferro, A.; Shasha, D.; Pulvirenti, A.; Giugno, R. Fast subgraph matching strategies based on pattern-only heuristics. Interdiscip. Sci. 2019, 11, 21–32. [Google Scholar] [CrossRef]

Figure 1. Reaction illustrating the formalism introduced in Section 2. Atom types are determined by the color of vertices and their labels (N •, H ∘, C•, and Cl •), while edges represent single bonds depicted by label “−”. Row (a) shows an AAM

α

assigned to

G ⟶ H

, together with the ITS graph

Υ : = Υ (G, H, α)

and the reaction center

Γ (G, H, α)

. Here,

α

is implied by the position of the vertices in the drawing. Row (b) shows the graphs

{\hat{G}}_{α}

,

{\hat{H}}_{α}

and

\hat{Υ}

obtained by removing the reaction edges from their counterparts, i.e., edges belonging to, or leading to edges in,

Γ (G, H, α)

. Since the remaining edges conserve their labels under

α

, the graphs

{\hat{G}}_{α}

and

{\hat{H}}_{α}

are isomorphic to

\hat{Υ}

, with the exception that the latter is, formally, labeled by ordered pairs of the repeated values for

{\hat{G}}_{α}

and

{\hat{H}}_{α}

. The maps

η

and

α \circ η

provide the isomorphisms between

{\hat{G}}_{α}

and

\hat{Υ}

, and

{\hat{H}}_{α}

and

\hat{Υ}

, respectively. Moreover, as an example of the usage of the variables

a_{Υ}^{i}

and

b_{Υ}^{i}

, consider the vertex x and the edge

x y

of

Υ

. In this case, it holds

(a_{Υ}^{1} (x), a_{Υ}^{2} (x))

= (C, C), meaning that both variables hold the value

a_{Υ}^{i} (x) =

C. Similarly, we have

(b_{Υ}^{1} (x y), b_{Υ}^{2} (x y)) = (-, ⌀)

, meaning that

b_{Υ}^{1} (x y) = -

and

b_{Υ}^{2} (x y) = ⌀

. Introducing this notation is necessary to speak formally of arbitrary ITS graphs.

Figure 1. Reaction illustrating the formalism introduced in Section 2. Atom types are determined by the color of vertices and their labels (N •, H ∘, C•, and Cl •), while edges represent single bonds depicted by label “−”. Row (a) shows an AAM

α

assigned to

G ⟶ H

, together with the ITS graph

Υ : = Υ (G, H, α)

and the reaction center

Γ (G, H, α)

. Here,

α

is implied by the position of the vertices in the drawing. Row (b) shows the graphs

{\hat{G}}_{α}

,

{\hat{H}}_{α}

and

\hat{Υ}

obtained by removing the reaction edges from their counterparts, i.e., edges belonging to, or leading to edges in,

Γ (G, H, α)

. Since the remaining edges conserve their labels under

α

, the graphs

{\hat{G}}_{α}

and

{\hat{H}}_{α}

are isomorphic to

\hat{Υ}

, with the exception that the latter is, formally, labeled by ordered pairs of the repeated values for

{\hat{G}}_{α}

and

{\hat{H}}_{α}

. The maps

η

and

α \circ η

provide the isomorphisms between

{\hat{G}}_{α}

and

\hat{Υ}

, and

{\hat{H}}_{α}

and

\hat{Υ}

, respectively. Moreover, as an example of the usage of the variables

a_{Υ}^{i}

and

b_{Υ}^{i}

, consider the vertex x and the edge

x y

of

Υ

. In this case, it holds

(a_{Υ}^{1} (x), a_{Υ}^{2} (x))

= (C, C), meaning that both variables hold the value

a_{Υ}^{i} (x) =

C. Similarly, we have

(b_{Υ}^{1} (x y), b_{Υ}^{2} (x y)) = (-, ⌀)

, meaning that

b_{Υ}^{1} (x y) = -

and

b_{Υ}^{2} (x y) = ⌀

. Introducing this notation is necessary to speak formally of arbitrary ITS graphs.

Figure 2. Reactions relating the selection of AAMs with the (dis-)connectedness of the ITS graphs and the reaction centers. Vertex colors depict atom types: N •, H •, and C •. Only hydrogens that take part in the reaction are shown. Similarly, edge labels “−” and “=” denote single and double bonds, respectively. To simplify the drawing, not all edge labels are shown. The remaining edges have labels according to Definitions 2 and 3. We use the convention that the AAMs are implied by the relative positions of the vertices unless explicitly stated by their labels. (a) shows the reaction

G ⟶ H

subject to an AAM

α

that defines a connected ITS graph

Υ (G, H, α)

. The reaction center

Γ (G, H, α)

, however, is disconnected. In (b), the same reaction

G ⟶ H

is explained by a different AAM

β

, differing from

α

only in the map of the hydrogens u and v of G, while preserving the image of every other atom. This small change produces a connected reaction center

Γ (G, H, β)

. Finally, (c) shows the reaction

G^{'} ⟶ H^{'}

between the graphs

G^{'}

and

H^{'}

obtained by removing the hydrogen atoms from G and H, and the atom map

γ

corresponding to the restriction of both

α

and

β

to non-hydrogen atoms. Ignoring the hydrogen atoms produces a disconnected reaction center

Γ (G^{'}, H^{'}, γ)

, from which neither

Γ (G, H, α)

nor

Γ (G, H, β)

can be recovered unambiguously.

Figure 2. Reactions relating the selection of AAMs with the (dis-)connectedness of the ITS graphs and the reaction centers. Vertex colors depict atom types: N •, H •, and C •. Only hydrogens that take part in the reaction are shown. Similarly, edge labels “−” and “=” denote single and double bonds, respectively. To simplify the drawing, not all edge labels are shown. The remaining edges have labels according to Definitions 2 and 3. We use the convention that the AAMs are implied by the relative positions of the vertices unless explicitly stated by their labels. (a) shows the reaction

G ⟶ H

subject to an AAM

α

that defines a connected ITS graph

Υ (G, H, α)

. The reaction center

Γ (G, H, α)

, however, is disconnected. In (b), the same reaction

G ⟶ H

is explained by a different AAM

β

, differing from

α

only in the map of the hydrogens u and v of G, while preserving the image of every other atom. This small change produces a connected reaction center

Γ (G, H, β)

. Finally, (c) shows the reaction

G^{'} ⟶ H^{'}

between the graphs

G^{'}

and

H^{'}

obtained by removing the hydrogen atoms from G and H, and the atom map

γ

corresponding to the restriction of both

α

and

β

to non-hydrogen atoms. Ignoring the hydrogen atoms produces a disconnected reaction center

Γ (G^{'}, H^{'}, γ)

, from which neither

Γ (G, H, α)

nor

Γ (G, H, β)

can be recovered unambiguously.

Figure 3. The KEGG RCLASS entry RC00460 corresponds to the partial AAM

β

shown in the upper panel. The lower panel exhibits 3 distinct reactions with maps

γ_{1}

,

γ_{2}

and

γ_{3}

extending

β

, with blue and red patterns portraying the embeddings of the subgraphs implied by

β

. Partial AAMs therefore are not sufficient to guarantee a unique global AAM.

Figure 3. The KEGG RCLASS entry RC00460 corresponds to the partial AAM

β

shown in the upper panel. The lower panel exhibits 3 distinct reactions with maps

γ_{1}

,

γ_{2}

and

γ_{3}

extending

β

, with blue and red patterns portraying the embeddings of the subgraphs implied by

β

. Partial AAMs therefore are not sufficient to guarantee a unique global AAM.

Figure 4. The process of hydrogen insertion in a nucleophilic substitution reaction. Vertex colors correspond to the atom types stated in the upper panel. (A) ITS graph with orange edges representing the reaction center. (B) The ITS graph following the insertion of hydrogen. The AAM

β

(when defined only on atoms 1–5) is a good AAM in representation (A) but a bad partial AAM in (B).

Figure 4. The process of hydrogen insertion in a nucleophilic substitution reaction. Vertex colors correspond to the atom types stated in the upper panel. (A) ITS graph with orange edges representing the reaction center. (B) The ITS graph following the insertion of hydrogen. The AAM

β

(when defined only on atoms 1–5) is a good AAM in representation (A) but a bad partial AAM in (B).

Figure 5. The completion rate (success rate) for reactions with a given fraction of mapped atoms. (1) For missing mappings outside the reaction center, (2) for incompletely mapped reaction centers, and (3) for partially mapped reactions in general. The dashed line marks the interpolated point where the success rate first drops below 95%. Based on the figure, we estimate that around 40% of the atoms in a reaction should be mapped to obtain completions with a small error rate of at most 5%.

Figure 6. Construction of an alignment

A_{γ}

and the ITS graph

Υ (G^{*}, H^{*}, γ^{*})

produced by a partial AAM

γ

for

G ⟶ H

. For clarity of the figure, double bonds are drawn explicitly in the first two graphs of each row and only specified as labels in the last one. (a) shows the reaction center

Γ (G, H, γ)

comprising the reaction edges implied by

γ

, which are removed from G and H to obtain

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

. Then, in (b), the alignment

A_{γ}

is produced by “gluing” the graphs

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

through

γ

. Formally, this operation is performed with respect to the alignment columns and assigns labels to them depending on their presence and labels in the input graphs, e.g., column

(x, ⌀)

obtains the label (•,

⌀)

, and similarly for the edges. (c) exhibits the ITS graph

Υ (G^{*}, H^{*}, γ^{*})

obtained by applying (A1)–(A3) to

A_{γ}

. In particular, vertex labels are duplicated and so are edge labels only if their end points are in/dels of the same type; otherwise, they remain unchanged, e.g.,

(⌀, z) (1, γ (1)) \in E (A_{γ})

conserves its label. This ensures that an edge is created in

G^{*}

and

H^{*}

only if both vertices are present in the same input graph.

Figure 6. Construction of an alignment

A_{γ}

and the ITS graph

Υ (G^{*}, H^{*}, γ^{*})

produced by a partial AAM

γ

for

G ⟶ H

. For clarity of the figure, double bonds are drawn explicitly in the first two graphs of each row and only specified as labels in the last one. (a) shows the reaction center

Γ (G, H, γ)

comprising the reaction edges implied by

γ

, which are removed from G and H to obtain

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

. Then, in (b), the alignment

A_{γ}

is produced by “gluing” the graphs

{\hat{G}}_{γ}

and

{\hat{H}}_{γ}

through

γ

. Formally, this operation is performed with respect to the alignment columns and assigns labels to them depending on their presence and labels in the input graphs, e.g., column

(x, ⌀)

obtains the label (•,

⌀)

, and similarly for the edges. (c) exhibits the ITS graph

Υ (G^{*}, H^{*}, γ^{*})

obtained by applying (A1)–(A3) to

A_{γ}

. In particular, vertex labels are duplicated and so are edge labels only if their end points are in/dels of the same type; otherwise, they remain unchanged, e.g.,

(⌀, z) (1, γ (1)) \in E (A_{γ})

conserves its label. This ensures that an edge is created in

G^{*}

and

H^{*}

only if both vertices are present in the same input graph.

Figure 7. Reactions 1 and 2 illustrate two different inferred AAMs for hydrogen atom insertion (white vertices in the lower panels). Panel (A) shows the ITS of Reaction 1 and Panel (B) the ITS of Reaction 2. The two ITS graphs in (A,B) are isomorphic meaning that the two AAMs are equivalent.

Figure 8. Diels–Alder reactions (A) ortho and (B) meta. The isomorphism of reactant graphs

G ≅ G^{'}

and of partial ITS graphs, here of the reaction centers

Γ (G, H_{1}, α) ≅ Γ (G^{'}, H_{2}^{'}, α^{'})

on the right and depicted with orange edges, does not imply isomorphisms of product graphs

H_{1}

and

H_{2}^{'}

.

Figure 8. Diels–Alder reactions (A) ortho and (B) meta. The isomorphism of reactant graphs

G ≅ G^{'}

and of partial ITS graphs, here of the reaction centers

Γ (G, H_{1}, α) ≅ Γ (G^{'}, H_{2}^{'}, α^{'})

on the right and depicted with orange edges, does not imply isomorphisms of product graphs

H_{1}

and

H_{2}^{'}

.

Figure 9. Isomorphism of reactant graphs

G ≅ G^{'}

in row (A) and of product graphs

H ≅ H^{'}

in rwo (B). These are two non-equivalent AAMs for the esterification of acetic acid and methanol, having isomorphic partial ITSs that capture only the breaking of C−O bonds. The example shows that conditions

G ≅ G^{'}

,

H ≅ H^{'}

, and

Υ (G [U], H [W], β) ≅ Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

are not sufficient to ensure that all extensions of

β

are equivalent. The key feature is that the non-equivalent ITS graphs contain reaction edges that are not present in the partial ITS graphs.

Figure 9. Isomorphism of reactant graphs

G ≅ G^{'}

in row (A) and of product graphs

H ≅ H^{'}

in rwo (B). These are two non-equivalent AAMs for the esterification of acetic acid and methanol, having isomorphic partial ITSs that capture only the breaking of C−O bonds. The example shows that conditions

G ≅ G^{'}

,

H ≅ H^{'}

, and

Υ (G [U], H [W], β) ≅ Υ (G^{'} [U^{'}], H^{'} [W^{'}], β^{'})

are not sufficient to ensure that all extensions of

β

are equivalent. The key feature is that the non-equivalent ITS graphs contain reaction edges that are not present in the partial ITS graphs.

Figure 10. The vertices where

G_{0}

is being attached to U and

U^{'}

are (pairwise) in the same orbits in G and

G^{'}

, and the same for

H_{0}

and thus

G ≅ G^{'}

and

H ≅ H^{'}

. The partial ITSs produced by

β

and

β^{'}

are also isomorphic, with red, green, and bold edges depicting the edges present, correspondingly, only in

G [U]

and

G^{'} [U^{'}]

, only in

H [W]

and

H^{'} [W^{'}]

, and in both graphs of each reaction. Let

G_{0} ≅ H_{0}

. Then, the edges in

G_{0}

,

H_{0}

and in their attachments (blue edges) can be preserved by the extended maps in each case, so they contain no reaction edges and thus the maps are good. Note, nonetheless, the non-isomorphism of the full-ITS graphs due to the bold and blue edges. We emphasize, moreover, that in this counterexample, the maps

β

and

β^{'}

are degree-preserving, which is a property of interest for our research and has been studied in more depth in related work [34].

Figure 10. The vertices where

G_{0}

is being attached to U and

U^{'}

are (pairwise) in the same orbits in G and

G^{'}

, and the same for

H_{0}

and thus

G ≅ G^{'}

and

H ≅ H^{'}

. The partial ITSs produced by

β

and

β^{'}

are also isomorphic, with red, green, and bold edges depicting the edges present, correspondingly, only in

G [U]

and

G^{'} [U^{'}]

, only in

H [W]

and

H^{'} [W^{'}]

, and in both graphs of each reaction. Let

G_{0} ≅ H_{0}

. Then, the edges in

G_{0}

,

H_{0}

and in their attachments (blue edges) can be preserved by the extended maps in each case, so they contain no reaction edges and thus the maps are good. Note, nonetheless, the non-isomorphism of the full-ITS graphs due to the bold and blue edges. We emphasize, moreover, that in this counterexample, the maps

β

and

β^{'}

are degree-preserving, which is a property of interest for our research and has been studied in more depth in related work [34].

Figure 11. Visual summary of the most important graphs (e.g., reactant graph G, product graph H, and ITS graph

Υ (G, H, α)

) and the maps that connect them. The figure is intended as a visual guideline to help the reader to keep track of the most important notations.

Figure 11. Visual summary of the most important graphs (e.g., reactant graph G, product graph H, and ITS graph

Υ (G, H, α)

) and the maps that connect them. The figure is intended as a visual guideline to help the reader to keep track of the most important notations.

Figure 12. Identical AAMs in chemistry with O³ and O⁴ being symmetric due to tautomerism. However, there is no isomorphism present between their ITS graphs.

Figure 13. Three AAMs describing Keto–Enol tautomerizations between

(G)

pyridine-4-ol,

(H)

pyridin-4(5H)-one and

(F)

4-pyridone. Atom types are shown as vertex colors (N •, H •, C •, and O •). Single and double bonds are labeled as − and =, respectively. The AAMs

α

in (a) and

β

in (b) produce reaction centers comprising similar tautomerization mechanisms, i.e., reaction centers isomorphic up to edge labels and hydrogen labels. The atom map in (c) from G to F given by their composition

β \circ α

, however, does not conform to Equation (2). It is important to emphasize that the reaction centers shown here do not depict the physical migration of hydrogen atoms, but rather they can be thought of as computational diagrams conveying the graph transformations encoding the reactions, and thus this example exhibits some limitations of the composition of AAMs as mathematical functions.

Figure 13. Three AAMs describing Keto–Enol tautomerizations between

(G)

pyridine-4-ol,

(H)

pyridin-4(5H)-one and

(F)

4-pyridone. Atom types are shown as vertex colors (N •, H •, C •, and O •). Single and double bonds are labeled as − and =, respectively. The AAMs

α

in (a) and

β

in (b) produce reaction centers comprising similar tautomerization mechanisms, i.e., reaction centers isomorphic up to edge labels and hydrogen labels. The atom map in (c) from G to F given by their composition

β \circ α

, however, does not conform to Equation (2). It is important to emphasize that the reaction centers shown here do not depict the physical migration of hydrogen atoms, but rather they can be thought of as computational diagrams conveying the graph transformations encoding the reactions, and thus this example exhibits some limitations of the composition of AAMs as mathematical functions.

Figure 14. The last step of the biosynthesis of lanosterol showcases the problems arising from partial maps and hydrogen maps. Colors in the figure represent the correspondence of the hydrogen atoms and functional groups preserved by the reaction. Here, the hydrogen atoms are being rewired but conserved during multiple 1,2-hydrogen and 1,2-methyl shifts [47]. Only the hydrogen colored in yellow is being lost to the media as a proton and thus has no correspondence on the product side.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

González Laffitte, M.E.; Weinbauer, K.; Phan, T.-L.; Beier, N.; Domschke, N.; Flamm, C.; Gatter, T.; Merkle, D.; Stadler, P.F. Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions. Symmetry 2024, 16, 1217. https://doi.org/10.3390/sym16091217

AMA Style

González Laffitte ME, Weinbauer K, Phan T-L, Beier N, Domschke N, Flamm C, Gatter T, Merkle D, Stadler PF. Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions. Symmetry. 2024; 16(9):1217. https://doi.org/10.3390/sym16091217

Chicago/Turabian Style

González Laffitte, Marcos E., Klaus Weinbauer, Tieu-Long Phan, Nora Beier, Nico Domschke, Christoph Flamm, Thomas Gatter, Daniel Merkle, and Peter F. Stadler. 2024. "Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions" Symmetry 16, no. 9: 1217. https://doi.org/10.3390/sym16091217

APA Style

González Laffitte, M. E., Weinbauer, K., Phan, T. -L., Beier, N., Domschke, N., Flamm, C., Gatter, T., Merkle, D., & Stadler, P. F. (2024). Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions. Symmetry, 16(9), 1217. https://doi.org/10.3390/sym16091217

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Partial Imaginary Transition State (ITS) Graphs: A Formal Framework for Research and Analysis of Atom-to-Atom Maps of Unbalanced Chemical Reactions and Their Completions

Abstract

1. Introduction

2. Preliminaries

2.1. Notation

2.2. Atom-to-Atom Map, ITS Graph, and Reaction Center

2.3. Connectedness of the ITS Graphs, Reaction Center and AAMs

2.4. Good and Bad Reaction Patterns

3. Incomplete Atom-to-Atom Maps and partial ITS Graphs

3.1. Good and Bad Partial AAMs

3.2. An ILP for Completing Bad Partial AAMs on Balanced Reactions

3.3. Unbalanced Reactions and Graph Alignments

4. Comparison of Partial AAM

4.1. Equivalence of Partial AAMs

4.2. Consistency of Partial AAMs

5. Atom-to-Atom Maps and Tautomerism

5.1. The Problem

5.2. Tautomerism Equivalence

5.3. Tautomerism ITS Graphs

6. Hydrogens—A Necessary Nuisance

7. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A

Appendix A.1. Inconsistent AAMs from KEGG RCLASSes

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI