1Getting Started with the RDKit in Python
2%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3
4Important note
5**************
6
7Beginning with the 2019.03 release, the RDKit is no longer supporting Python 2.
8If you need to continue using Python 2, please stick with a release from the 2018.09
9release cycle.
10
11What is this?
12*************
13
14This document is intended to provide an overview of how one can use
15the RDKit functionality from Python.  It's not comprehensive and it's
16not a manual.
17
18If you find mistakes, or have suggestions for improvements, please
19either fix them yourselves in the source document (the .rst file) or
20send them to the mailing list: rdkit-devel@lists.sourceforge.net
21In particular, if you find yourself spending time working out how to
22do something that doesn't appear to be documented please contribute by writing
23it up for this document. Contributing to the documentation is a great service
24both to the RDKit community and to your future self.
25
26Reading and Writing Molecules
27*****************************
28
29Reading single molecules
30========================
31
32.. testsetup::
33
34  # clean up in case these tests are running in a python process that has already
35  # imported the IPythonConsole code
36  from rdkit.Chem.Draw import IPythonConsole
37  IPythonConsole.UninstallIPythonRenderer()
38
39The majority of the basic molecular functionality is found in module :py:mod:`rdkit.Chem`:
40
41.. doctest::
42
43  >>> from rdkit import Chem
44
45Individual molecules can be constructed using a variety of approaches:
46
47.. doctest::
48
49  >>> m = Chem.MolFromSmiles('Cc1ccccc1')
50  >>> m = Chem.MolFromMolFile('data/input.mol')
51  >>> stringWithMolData=open('data/input.mol','r').read()
52  >>> m = Chem.MolFromMolBlock(stringWithMolData)
53
54All of these functions return a :py:class:`rdkit.Chem.rdchem.Mol` object on success:
55
56.. doctest::
57
58  >>> m
59  <rdkit.Chem.rdchem.Mol object at 0x...>
60
61or None on failure:
62
63.. doctest::
64
65  >>> m = Chem.MolFromMolFile('data/invalid.mol')
66  >>> m is None
67  True
68
69An attempt is made to provide sensible error messages:
70
71.. doctest::
72
73  >>> m1 = Chem.MolFromSmiles('CO(C)C')
74
75displays a message like: ``[12:18:01] Explicit valence for atom # 1 O greater than permitted`` and
76
77.. doctest::
78
79  >>> m2 = Chem.MolFromSmiles('c1cc1')
80
81displays something like: ``[12:20:41] Can't kekulize mol``. In each case the value ``None`` is returned:
82
83.. doctest::
84
85  >>> m1 is None
86  True
87  >>> m2 is None
88  True
89
90
91Reading sets of molecules
92=========================
93
94Groups of molecules are read using a Supplier (for example, an :py:class:`rdkit.Chem.rdmolfiles.SDMolSupplier` or a :py:class:`rdkit.Chem.rdmolfiles.SmilesMolSupplier`):
95
96.. doctest::
97
98  >>> suppl = Chem.SDMolSupplier('data/5ht3ligs.sdf')
99  >>> for mol in suppl:
100  ...   print(mol.GetNumAtoms())
101  ...
102  20
103  24
104  24
105  26
106
107You can easily produce lists of molecules from a Supplier:
108
109.. doctest::
110
111  >>> mols = [x for x in suppl]
112  >>> len(mols)
113  4
114
115or just treat the Supplier itself as a random-access object:
116
117.. doctest::
118
119  >>> suppl[0].GetNumAtoms()
120  20
121
122Two good practices when working with Suppliers are to use a context manager and
123to test each molecule to see if it was correctly read before working with it:
124
125.. doctest::
126
127  >>> with Chem.SDMolSupplier('data/5ht3ligs.sdf') as suppl:
128  ...   for mol in suppl:
129  ...     if mol is None: continue
130  ...     print(mol.GetNumAtoms())
131  ...
132  20
133  24
134  24
135  26
136
137An alternate type of Supplier, the :py:class:`rdkit.Chem.rdmolfiles.ForwardSDMolSupplier`
138can be used to read from file-like objects:
139
140.. doctest::
141
142  >>> inf = open('data/5ht3ligs.sdf','rb')
143  >>> with Chem.ForwardSDMolSupplier(inf) as fsuppl:
144  ...   for mol in fsuppl:
145  ...     if mol is None: continue
146  ...     print(mol.GetNumAtoms())
147  ...
148  20
149  24
150  24
151  26
152
153This means that they can be used to read from compressed files:
154
155.. doctest::
156
157  >>> import gzip
158  >>> inf = gzip.open('data/actives_5ht3.sdf.gz')
159  >>> with Chem.ForwardSDMolSupplier(inf) as gzsuppl:
160  ...    ms = [x for x in gzsuppl if x is not None]
161  >>> len(ms)
162  180
163
164Note that ForwardSDMolSuppliers cannot be used as random-access objects:
165
166.. doctest::
167
168  >>> inf = open('data/5ht3ligs.sdf','rb')
169  >>> with Chem.ForwardSDMolSupplier(inf) as fsuppl:
170  ...   fsuppl[0]
171  Traceback (most recent call last):
172    ...
173  TypeError: 'ForwardSDMolSupplier' object does not support indexing
174
175For reading Smiles or SDF files with large number of records concurrently, MultithreadedMolSuppliers can be used like this:
176
177.. doctest::
178
179  >>> i = 0
180  >>> with Chem.MultithreadedSDMolSupplier('data/5ht3ligs.sdf') as sdSupl:
181  ...   for mol in sdSupl:
182  ...     if mol is not None:
183  ...       i += 1
184  ...
185  >>> print(i)
186  4
187
188By default a single reader thread is used to extract records from the file and a
189single writer thread is used to process them. Note that due to multithreading
190the output may not be in the expected order. Furthermore, the
191MultithreadedSmilesMolSupplier and the MultithreadedSDMolSupplier cannot be used
192as random-access objects.
193
194
195Writing molecules
196=================
197
198Single molecules can be converted to text using several functions present in the :py:mod:`rdkit.Chem` module.
199
200For example, for SMILES:
201
202.. doctest::
203
204  >>> m = Chem.MolFromMolFile('data/chiral.mol')
205  >>> Chem.MolToSmiles(m)
206  'C[C@H](O)c1ccccc1'
207  >>> Chem.MolToSmiles(m,isomericSmiles=False)
208  'CC(O)c1ccccc1'
209
210Note that the SMILES provided is canonical, so the output should be the same no matter how a particular molecule is input:
211
212.. doctest::
213
214  >>> Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC=CN=C1'))
215  'c1ccncc1'
216  >>> Chem.MolToSmiles(Chem.MolFromSmiles('c1cccnc1'))
217  'c1ccncc1'
218  >>> Chem.MolToSmiles(Chem.MolFromSmiles('n1ccccc1'))
219  'c1ccncc1'
220
221If you'd like to have the Kekule form of the SMILES, first Kekulize the molecule, then use the “kekuleSmiles” option:
222
223.. doctest::
224
225  >>> Chem.Kekulize(m)
226  >>> Chem.MolToSmiles(m,kekuleSmiles=True)
227  'C[C@H](O)C1=CC=CC=C1'
228
229Note: as of this writing (Aug 2008), the smiles provided when one requests kekuleSmiles are not canonical.
230The limitation is not in the SMILES generation, but in the kekulization itself.
231
232MDL Mol blocks are also available:
233
234.. doctest::
235
236  >>> m2 = Chem.MolFromSmiles('C1CCC1')
237  >>> print(Chem.MolToMolBlock(m2))    # doctest: +NORMALIZE_WHITESPACE
238  <BLANKLINE>
239       RDKit          2D
240  <BLANKLINE>
241    4  4  0  0  0  0  0  0  0  0999 V2000
242      1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
243     -0.0000   -1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
244     -1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
245      0.0000    1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
246    1  2  1  0
247    2  3  1  0
248    3  4  1  0
249    4  1  1  0
250  M  END
251  <BLANKLINE>
252
253To include names in the mol blocks, set the molecule's “_Name” property:
254
255.. doctest::
256
257  >>> m2.SetProp("_Name","cyclobutane")
258  >>> print(Chem.MolToMolBlock(m2))     # doctest: +NORMALIZE_WHITESPACE
259  cyclobutane
260       RDKit          2D
261  <BLANKLINE>
262    4  4  0  0  0  0  0  0  0  0999 V2000
263      1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
264     -0.0000   -1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
265     -1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
266      0.0000    1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
267    1  2  1  0
268    2  3  1  0
269    3  4  1  0
270    4  1  1  0
271  M  END
272  <BLANKLINE>
273
274In order for atom or bond stereochemistry to be recognised correctly by most
275software, it's essential that the mol block have atomic coordinates.
276It's also convenient for many reasons, such as drawing the molecules.
277Generating a mol block for a molecule that does not have coordinates will, by
278default, automatically cause coordinates to be generated. These are not,
279however, stored with the molecule.
280Coordinates can be generated and stored with the molecule using functionality
281in the :py:mod:`rdkit.Chem.AllChem` module (see the `Chem vs AllChem`_ section for
282more information).
283
284You can either include 2D coordinates (i.e. a depiction):
285
286.. doctest::
287
288  >>> from rdkit.Chem import AllChem
289  >>> AllChem.Compute2DCoords(m2)
290  0
291  >>> print(Chem.MolToMolBlock(m2))     # doctest: +NORMALIZE_WHITESPACE
292  cyclobutane
293       RDKit          2D
294  <BLANKLINE>
295    4  4  0  0  0  0  0  0  0  0999 V2000
296      1.0607   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
297     -0.0000   -1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
298     -1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
299      0.0000    1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
300    1  2  1  0
301    2  3  1  0
302    3  4  1  0
303    4  1  1  0
304  M  END
305  <BLANKLINE>
306
307Or you can add 3D coordinates by embedding the molecule (this uses the ETKDG
308method, which is described in more detail below). Note that we add Hs to the
309molecule before generating the conformer. This is essential to get good structures:
310
311.. doctest::
312
313  >>> m3 = Chem.AddHs(m2)
314  >>> AllChem.EmbedMolecule(m3,randomSeed=0xf00d)   # optional random seed for reproducibility)
315  0
316  >>> print(Chem.MolToMolBlock(m3))    # doctest: +NORMALIZE_WHITESPACE
317  cyclobutane
318       RDKit          3D
319  <BLANKLINE>
320   12 12  0  0  0  0  0  0  0  0999 V2000
321      1.0256    0.2491   -0.0964 C   0  0  0  0  0  0  0  0  0  0  0  0
322     -0.2041    0.9236    0.4320 C   0  0  0  0  0  0  0  0  0  0  0  0
323     -1.0435   -0.2466   -0.0266 C   0  0  0  0  0  0  0  0  0  0  0  0
324      0.2104   -0.9922   -0.3417 C   0  0  0  0  0  0  0  0  0  0  0  0
325      1.4182    0.7667   -0.9782 H   0  0  0  0  0  0  0  0  0  0  0  0
326      1.8181    0.1486    0.6820 H   0  0  0  0  0  0  0  0  0  0  0  0
327     -0.1697    1.0826    1.5236 H   0  0  0  0  0  0  0  0  0  0  0  0
328     -0.5336    1.8391   -0.1051 H   0  0  0  0  0  0  0  0  0  0  0  0
329     -1.6809   -0.0600   -0.8987 H   0  0  0  0  0  0  0  0  0  0  0  0
330     -1.6501   -0.6194    0.8220 H   0  0  0  0  0  0  0  0  0  0  0  0
331      0.4659   -1.7768    0.3858 H   0  0  0  0  0  0  0  0  0  0  0  0
332      0.3439   -1.3147   -1.3988 H   0  0  0  0  0  0  0  0  0  0  0  0
333    1  2  1  0
334    2  3  1  0
335    3  4  1  0
336    4  1  1  0
337    1  5  1  0
338    1  6  1  0
339    2  7  1  0
340    2  8  1  0
341    3  9  1  0
342    3 10  1  0
343    4 11  1  0
344    4 12  1  0
345  M  END
346<BLANKLINE>
347
348If we don't want the Hs in our later analysis, they are easy to remove:
349
350.. doctest::
351
352  >>> m3 = Chem.RemoveHs(m3)
353  >>> print(Chem.MolToMolBlock(m3))    # doctest: +NORMALIZE_WHITESPACE
354  cyclobutane
355       RDKit          3D
356  <BLANKLINE>
357    4  4  0  0  0  0  0  0  0  0999 V2000
358      1.0256    0.2491   -0.0964 C   0  0  0  0  0  0  0  0  0  0  0  0
359     -0.2041    0.9236    0.4320 C   0  0  0  0  0  0  0  0  0  0  0  0
360     -1.0435   -0.2466   -0.0266 C   0  0  0  0  0  0  0  0  0  0  0  0
361      0.2104   -0.9922   -0.3417 C   0  0  0  0  0  0  0  0  0  0  0  0
362    1  2  1  0
363    2  3  1  0
364    3  4  1  0
365    4  1  1  0
366  M  END
367  <BLANKLINE>
368
369If you'd like to write the molecule to a file, use Python file objects:
370
371.. doctest::
372
373  >>> print(Chem.MolToMolBlock(m2),file=open('data/foo.mol','w+'))
374  >>>
375
376
377Writing sets of molecules
378=========================
379
380Multiple molecules can be written to a file using an :py:class:`rdkit.Chem.rdmolfiles.SDWriter` object:
381
382.. doctest::
383
384  >>> with Chem.SDWriter('data/foo.sdf') as w:
385  ...   for m in mols:
386  ...     w.write(m)
387  >>>
388
389An SDWriter can also be initialized using a file-like object:
390
391.. doctest::
392
393  >>> from rdkit.six import StringIO
394  >>> sio = StringIO()
395  >>> with Chem.SDWriter(sio) as w:
396  ...   for m in mols:
397  ...     w.write(m)
398  >>> print(sio.getvalue())
399  mol-295
400       RDKit          3D
401  <BLANKLINE>
402   20 22  0  0  1  0  0  0  0  0999 V2000
403      2.3200    0.0800   -0.1000 C   0  0  0  0  0  0  0  0  0  0  0  0
404      1.8400   -1.2200    0.1200 C   0  0  0  0  0  0  0  0  0  0  0  0
405  ...
406    1  3  1  0
407    1  4  1  0
408    2  5  1  0
409  M  END
410  $$$$
411  <BLANKLINE>
412
413
414
415Other available Writers include the :py:class:`rdkit.Chem.rdmolfiles.SmilesWriter` and the :py:class:`rdkit.Chem.rdmolfiles.TDTWriter`.
416
417
418Working with Molecules
419**********************
420
421
422Looping over Atoms and Bonds
423============================
424
425Once you have a molecule, it's easy to loop over its atoms and bonds:
426
427.. doctest::
428
429  >>> m = Chem.MolFromSmiles('C1OC1')
430  >>> for atom in m.GetAtoms():
431  ...   print(atom.GetAtomicNum())
432  ...
433  6
434  8
435  6
436  >>> print(m.GetBonds()[0].GetBondType())
437  SINGLE
438
439You can also request individual bonds or atoms:
440
441.. doctest::
442
443  >>> m.GetAtomWithIdx(0).GetSymbol()
444  'C'
445  >>> m.GetAtomWithIdx(0).GetExplicitValence()
446  2
447  >>> m.GetBondWithIdx(0).GetBeginAtomIdx()
448  0
449  >>> m.GetBondWithIdx(0).GetEndAtomIdx()
450  1
451  >>> m.GetBondBetweenAtoms(0,1).GetBondType()
452  rdkit.Chem.rdchem.BondType.SINGLE
453
454Atoms keep track of their neighbors:
455
456.. doctest::
457
458  >>> atom = m.GetAtomWithIdx(0)
459  >>> [x.GetAtomicNum() for x in atom.GetNeighbors()]
460  [8, 6]
461  >>> len(atom.GetNeighbors()[-1].GetBonds())
462  2
463
464
465Ring Information
466================
467
468Atoms and bonds both carry information about the molecule's rings:
469
470.. doctest::
471
472  >>> m = Chem.MolFromSmiles('OC1C2C1CC2')
473  >>> m.GetAtomWithIdx(0).IsInRing()
474  False
475  >>> m.GetAtomWithIdx(1).IsInRing()
476  True
477  >>> m.GetAtomWithIdx(2).IsInRingSize(3)
478  True
479  >>> m.GetAtomWithIdx(2).IsInRingSize(4)
480  True
481  >>> m.GetAtomWithIdx(2).IsInRingSize(5)
482  False
483  >>> m.GetBondWithIdx(1).IsInRingSize(3)
484  True
485  >>> m.GetBondWithIdx(1).IsInRing()
486  True
487
488But note that the information is only about the smallest rings:
489
490.. doctest::
491
492  >>> m.GetAtomWithIdx(1).IsInRingSize(5)
493  False
494
495More detail about the smallest set of smallest rings (SSSR) is available:
496
497.. doctest::
498
499  >>> ssr = Chem.GetSymmSSSR(m)
500  >>> len(ssr)
501  2
502  >>> list(ssr[0])
503  [1, 2, 3]
504  >>> list(ssr[1])
505  [4, 5, 2, 3]
506
507As the name indicates, this is a symmetrized SSSR; if you are interested in the number of “true” SSSR, use the GetSSSR function.
508
509
510.. doctest::
511
512  >>> Chem.GetSSSR(m)
513  2
514
515The distinction between symmetrized and non-symmetrized SSSR is discussed in more detail below in the section `The SSSR Problem`_.
516
517For more efficient queries about a molecule's ring systems (avoiding repeated calls to Mol.GetAtomWithIdx), use the :py:class:`rdkit.Chem.rdchem.RingInfo` class:
518
519.. doctest::
520
521  >>> m = Chem.MolFromSmiles('OC1C2C1CC2')
522  >>> ri = m.GetRingInfo()
523  >>> ri.NumAtomRings(0)
524  0
525  >>> ri.NumAtomRings(1)
526  1
527  >>> ri.NumAtomRings(2)
528  2
529  >>> ri.IsAtomInRingOfSize(1,3)
530  True
531  >>> ri.IsBondInRingOfSize(1,3)
532  True
533
534Modifying molecules
535===================
536
537Normally molecules are stored in the RDKit with the hydrogen atoms implicit (e.g. not explicitly present in the molecular graph.
538When it is useful to have the hydrogens explicitly present, for example when generating or optimizing the 3D geometry, the :py:func:rdkit.Chem.rdmolops.AddHs function can be used:
539
540.. doctest::
541
542  >>> m=Chem.MolFromSmiles('CCO')
543  >>> m.GetNumAtoms()
544  3
545  >>> m2 = Chem.AddHs(m)
546  >>> m2.GetNumAtoms()
547  9
548
549The Hs can be removed again using the :py:func:`rdkit.Chem.rdmolops.RemoveHs` function:
550
551.. doctest::
552
553  >>> m3 = Chem.RemoveHs(m2)
554  >>> m3.GetNumAtoms()
555  3
556
557RDKit molecules are usually stored with the bonds in aromatic rings having aromatic bond types.
558This can be changed with the :py:func:`rdkit.Chem.rdmolops.Kekulize` function:
559
560.. doctest::
561
562  >>> m = Chem.MolFromSmiles('c1ccccc1')
563  >>> m.GetBondWithIdx(0).GetBondType()
564  rdkit.Chem.rdchem.BondType.AROMATIC
565  >>> Chem.Kekulize(m)
566  >>> m.GetBondWithIdx(0).GetBondType()
567  rdkit.Chem.rdchem.BondType.DOUBLE
568  >>> m.GetBondWithIdx(1).GetBondType()
569  rdkit.Chem.rdchem.BondType.SINGLE
570
571By default, the bonds are still marked as being aromatic:
572
573.. doctest::
574
575  >>> m.GetBondWithIdx(1).GetIsAromatic()
576  True
577
578because the flags in the original molecule are not cleared (clearAromaticFlags defaults to False).
579You can explicitly force or decline a clearing of the flags:
580
581.. doctest::
582
583  >>> m = Chem.MolFromSmiles('c1ccccc1')
584  >>> m.GetBondWithIdx(0).GetIsAromatic()
585  True
586  >>> m1 = Chem.MolFromSmiles('c1ccccc1')
587  >>> Chem.Kekulize(m1, clearAromaticFlags=True)
588  >>> m1.GetBondWithIdx(0).GetIsAromatic()
589  False
590
591Bonds can be restored to the aromatic bond type using the :py:func:`rdkit.Chem.rdmolops.SanitizeMol` function:
592
593.. doctest::
594
595  >>> Chem.SanitizeMol(m)
596  rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
597  >>> m.GetBondWithIdx(0).GetBondType()
598  rdkit.Chem.rdchem.BondType.AROMATIC
599
600The value returned by `SanitizeMol()` indicates that no problems were encountered.
601
602Working with 2D molecules: Generating Depictions
603================================================
604
605The RDKit has a library for generating depictions (sets of 2D) coordinates for molecules.
606This library, which is part of the AllChem module, is accessed using the :py:func:`rdkit.Chem.rdDepictor.Compute2DCoords` function:
607
608.. doctest::
609
610  >>> m = Chem.MolFromSmiles('c1nccc2n1ccc2')
611  >>> AllChem.Compute2DCoords(m)
612  0
613
614The 2D conformer is constructed in a canonical orientation and is
615built to minimize intramolecular clashes, i.e. to maximize the clarity
616of the drawing.
617
618If you have a set of molecules that share a common template and you'd
619like to align them to that template, you can do so as follows:
620
621.. doctest::
622
623  >>> template = Chem.MolFromSmiles('c1nccc2n1ccc2')
624  >>> AllChem.Compute2DCoords(template)
625  0
626  >>> ms = [Chem.MolFromSmiles(smi) for smi in ('OCCc1ccn2cnccc12','C1CC1Oc1cc2ccncn2c1','CNC(=O)c1nccc2cccn12')]
627  >>> for m in ms:
628  ...     _ = AllChem.GenerateDepictionMatching2DStructure(m,template)
629
630Running this process for the molecules above gives:
631
632+---------------+---------------+---------------+
633| |picture_1|   | |picture_0|   | |picture_3|   |
634+---------------+---------------+---------------+
635
636Another option for Compute2DCoords allows you to generate 2D depictions for molecules that closely mimic 3D conformers.
637This is available using the function :py:func:`rdkit.Chem.AllChem.GenerateDepictionMatching3DStructure`.
638
639Here is an illustration of the results using the ligand from PDB structure 1XP0:
640
641+---------------+---------------+
642| |picture_2|   | |picture_4|   |
643+---------------+---------------+
644
645More fine-grained control can be obtained using the core function
646:py:func:`rdkit.Chem.rdDepictor.Compute2DCoordsMimicDistmat`, but that is
647beyond the scope of this document.  See the implementation of
648GenerateDepictionMatching3DStructure in AllChem.py for an example of
649how it is used.
650
651
652Working with 3D Molecules
653=========================
654
655The RDKit can generate conformers for molecules using two different
656methods.  The original method used distance geometry. [#blaney]_
657The algorithm followed is:
658
6591. The molecule's distance bounds matrix is calculated based on the connection table and a set of rules.
660
6612. The bounds matrix is smoothed using a triangle-bounds smoothing algorithm.
662
6633. A random distance matrix that satisfies the bounds matrix is generated.
664
6654. This distance matrix is embedded in 3D dimensions (producing coordinates for each atom).
666
6675. The resulting coordinates are cleaned up somewhat using a crude force field and the bounds matrix.
668
669Note that the conformers that result from this procedure tend to be fairly ugly.
670They should be cleaned up using a force field. This can be done within the RDKit
671using its implementation of the Universal Force Field (UFF). [#rappe]_
672
673More recently, there is an implementation of the ETKDG method of Riniker and
674Landrum [#riniker2]_ which uses torsion angle preferences from the Cambridge
675Structural Database (CSD) to correct the conformers after distance geometry has
676been used to generate them.  With this method, there should be no need to use a
677minimisation step to clean up the structures.
678
679More detailed information about the conformer generator and the parameters
680controlling it can be found in the "RDKit Book".
681
682Since the 2018.09 release of the RDKit, ETKDG is the default conformer generation method.
683
684The full process of embedding a molecule is easier than all the above verbiage makes it sound:
685
686.. doctest::
687
688  >>> m2=Chem.AddHs(m)
689  >>> AllChem.EmbedMolecule(m2)
690  0
691
692The RDKit also has an implementation of the MMFF94 force field available.
693[#mmff1]_, [#mmff2]_, [#mmff3]_, [#mmff4]_, [#mmffs]_ Please note that the MMFF
694atom typing code uses its own aromaticity model, so the aromaticity flags of the
695molecule will be modified after calling MMFF-related methods.
696
697Here's an example of using MMFF94 to minimize an RDKit-generated conformer:
698.. doctest::
699
700  >>> m = Chem.MolFromSmiles('C1CCC1OC')
701  >>> m2=Chem.AddHs(m)
702  >>> AllChem.EmbedMolecule(m2)
703  0
704  >>> AllChem.MMFFOptimizeMolecule(m2)
705  0
706
707Note the calls to `Chem.AddHs()` in the examples above. By default
708RDKit molecules do not have H atoms explicitly present in the graph,
709but they are important for getting realistic geometries, so they
710generally should be added.  They can always be removed afterwards
711if necessary with a call to `Chem.RemoveHs()`.
712
713With the RDKit, multiple conformers can also be generated using the
714different embedding methods. In both cases this is simply a matter of
715running the distance geometry calculation multiple times from
716different random start points. The option `numConfs` allows the user to
717set the number of conformers that should be generated.  Otherwise the
718procedures are as before. The conformers so generated can be aligned
719to each other and the RMS values calculated.
720
721.. doctest::
722
723  >>> m = Chem.MolFromSmiles('C1CCC1OC')
724  >>> m2=Chem.AddHs(m)
725  >>> # run ETKDG 10 times
726  >>> cids = AllChem.EmbedMultipleConfs(m2, numConfs=10)
727  >>> print(len(cids))
728  10
729  >>> rmslist = []
730  >>> AllChem.AlignMolConformers(m2, RMSlist=rmslist)
731  >>> print(len(rmslist))
732  9
733
734rmslist contains the RMS values between the first conformer and all others.
735The RMS between two specific conformers (e.g. 1 and 9) can also be calculated.
736The flag prealigned lets the user specify if the conformers are already aligned
737(by default, the function aligns them).
738
739.. doctest::
740
741  >>> rms = AllChem.GetConformerRMS(m2, 1, 9, prealigned=True)
742
743If you are interested in running MMFF94 on a molecule's conformers (note that
744this is often not necessary when using ETKDG), there's a convenience
745function available:
746
747.. doctest::
748
749  >>> res = AllChem.MMFFOptimizeMoleculeConfs(m2)
750
751The result is a list a containing 2-tuples: `(not_converged, energy)` for
752each conformer. If `not_converged` is 0, the minimization for that conformer
753converged.
754
755By default `AllChem.EmbedMultipleConfs` and `AllChem.MMFFOptimizeMoleculeConfs()`
756run single threaded, but you can cause them to use
757multiple threads simultaneously for these embarassingly parallel tasks
758via the `numThreads` argument:
759
760.. doctest::
761
762  >>> cids = AllChem.EmbedMultipleConfs(m2, numThreads=0)
763  >>> res = AllChem.MMFFOptimizeMoleculeConfs(m2, numThreads=0)
764
765Setting `numThreads` to zero causes the software to use the maximum number
766of threads allowed on your computer.
767
768*Disclaimer/Warning*: Conformer generation is a difficult and subtle task. The
769plain distance-geometry 2D->3D conversion provided with the RDKit is not
770intended to be a replacement for a “real” conformer analysis tool; it merely
771provides quick 3D structures for cases when they are required. We believe,
772however, that the newer ETKDG method [#riniker2]_ is suitable for most purposes.
773
774
775Preserving Molecules
776====================
777
778Molecules can be converted to and from text using Python's pickling machinery:
779
780.. doctest::
781
782  >>> m = Chem.MolFromSmiles('c1ccncc1')
783  >>> import pickle
784  >>> pkl = pickle.dumps(m)
785  >>> m2=pickle.loads(pkl)
786  >>> Chem.MolToSmiles(m2)
787  'c1ccncc1'
788
789The RDKit pickle format is fairly compact and it is much, much faster to build a
790molecule from a pickle than from a Mol file or SMILES string, so storing
791molecules you will be working with repeatedly as pickles can be a good idea.
792
793The raw binary data that is encapsulated in a pickle can also be directly
794obtained from a molecule:
795
796.. doctest::
797
798  >>> binStr = m.ToBinary()
799
800This can be used to reconstruct molecules using the Chem.Mol constructor:
801
802.. doctest::
803
804  >>> m2 = Chem.Mol(binStr)
805  >>> Chem.MolToSmiles(m2)
806  'c1ccncc1'
807  >>> len(binStr)
808  123
809
810Note that this is smaller than the pickle:
811
812.. doctest::
813
814  >>> len(binStr) < len(pkl)
815  True
816
817The small overhead associated with python's pickling machinery normally doesn't
818end up making much of a difference for collections of larger molecules (the
819extra data associated with the pickle is independent of the size of the
820molecule, while the binary string increases in length as the molecule gets
821larger).
822
823*Tip*: The performance difference associated with storing molecules in a pickled
824form on disk instead of constantly reparsing an SD file or SMILES table is
825difficult to overstate. In a test I just ran on my laptop, loading a set of 699
826drug-like molecules from an SD file took 10.8 seconds; loading the same
827molecules from a pickle file took 0.7 seconds. The pickle file is also smaller –
8281/3 the size of the SD file – but this difference is not always so dramatic
829(it's a particularly fat SD file).
830
831Drawing Molecules
832=================
833
834The RDKit has some built-in functionality for creating images from
835molecules found in the :py:mod:`rdkit.Chem.Draw` package:
836
837.. doctest::
838
839  >>> with Chem.SDMolSupplier('data/cdk2.sdf') as suppl:
840  ...   ms = [x for x in suppl if x is not None]
841  >>> for m in ms: tmp=AllChem.Compute2DCoords(m)
842  >>> from rdkit.Chem import Draw
843  >>> Draw.MolToFile(ms[0],'images/cdk2_mol1.o.png')    # doctest: +SKIP
844  >>> Draw.MolToFile(ms[1],'images/cdk2_mol2.o.png')    # doctest: +SKIP
845
846Producing these images:
847
848+----------------------------------+----------------------------------+
849| .. image:: images/cdk2_mol1.png  | .. image:: images/cdk2_mol2.png  |
850+----------------------------------+----------------------------------+
851
852It's also possible to produce an image grid out of a set of molecules:
853
854.. doctest::
855
856  >>> img=Draw.MolsToGridImage(ms[:8],molsPerRow=4,subImgSize=(200,200),legends=[x.GetProp("_Name") for x in ms[:8]])    # doctest: +SKIP
857
858This returns a PIL image, which can then be saved to a file:
859
860.. doctest::
861
862  >>> img.save('images/cdk2_molgrid.o.png')    # doctest: +SKIP
863
864The result looks like this:
865
866.. image:: images/cdk2_molgrid.png
867
868These would of course look better if the common core were
869aligned. This is easy enough to do:
870
871.. doctest::
872
873  >>> p = Chem.MolFromSmiles('[nH]1cnc2cncnc21')
874  >>> subms = [x for x in ms if x.HasSubstructMatch(p)]
875  >>> len(subms)
876  14
877  >>> AllChem.Compute2DCoords(p)
878  0
879  >>> for m in subms:
880  ...   _ = AllChem.GenerateDepictionMatching2DStructure(m,p)
881  >>> img=Draw.MolsToGridImage(subms,molsPerRow=4,subImgSize=(200,200),legends=[x.GetProp("_Name") for x in subms])    # doctest: +SKIP
882  >>> img.save('images/cdk2_molgrid.aligned.o.png')    # doctest: +SKIP
883
884
885The result looks like this:
886
887.. image:: images/cdk2_molgrid_aligned.png
888
889Atoms in a molecule can be highlighted by drawing a coloured solid or
890open circle around them, and bonds likewise can have a coloured
891outline applied.  An obvious use is to show atoms and bonds that have
892matched a substructure query
893
894.. doctest::
895
896   >>> from rdkit.Chem.Draw import rdMolDraw2D
897   >>> smi = 'c1cc(F)ccc1Cl'
898   >>> mol = Chem.MolFromSmiles(smi)
899   >>> patt = Chem.MolFromSmarts('ClccccF')
900   >>> hit_ats = list(mol.GetSubstructMatch(patt))
901   >>> hit_bonds = []
902   >>> for bond in patt.GetBonds():
903   ...    aid1 = hit_ats[bond.GetBeginAtomIdx()]
904   ...    aid2 = hit_ats[bond.GetEndAtomIdx()]
905   ...    hit_bonds.append(mol.GetBondBetweenAtoms(aid1,aid2).GetIdx())
906   >>> d = rdMolDraw2D.MolDraw2DSVG(500, 500) # or MolDraw2DCairo to get PNGs
907   >>> rdMolDraw2D.PrepareAndDrawMolecule(d, mol, highlightAtoms=hit_ats,
908   ...                                    highlightBonds=hit_bonds)
909
910will produce:
911
912.. image:: images/atom_highlights_1.png
913
914It is possible to specify the colours for individual atoms and bonds:
915
916.. doctest::
917
918   >>> colours = [(0.8,0.0,0.8),(0.8,0.8,0),(0,0.8,0.8),(0,0,0.8)]
919   >>> atom_cols = {}
920   >>> for i, at in enumerate(hit_ats):
921   ...     atom_cols[at] = colours[i%4]
922   >>> bond_cols = {}
923   >>> for i, bd in enumerate(hit_bonds):
924   ...     bond_cols[bd] = colours[3 - i%4]
925   >>>
926   >>> d = rdMolDraw2D.MolDraw2DCairo(500, 500)
927   >>> rdMolDraw2D.PrepareAndDrawMolecule(d, mol, highlightAtoms=hit_ats,
928   ...                                    highlightAtomColors=atom_cols,
929   ...                                    highlightBonds=hit_bonds,
930   ...                                    highlightBondColors=bond_cols)
931
932to give:
933
934.. image:: images/atom_highlights_2.png
935
936Atoms and bonds can also be highlighted with multiple colours if they
937fall into multiple sets, for example if they are matched by more
938than 1 substructure pattern.  This is too complicated to show in this
939simple introduction, but there is an example in
940data/test_multi_colours.py, which produces the somewhat garish
941
942.. image:: images/atom_highlights_3.png
943
944As of version 2020.03, it is possible to add arbitrary small strings to annotate
945atoms and bonds in the drawing.  The strings are added as properties
946``atomNote`` and ``bondNote`` and they will be placed automatically close to the
947atom or bond in question in a manner intended to minimise their clash with the
948rest of the drawing.  For convenience, here are 3 flags in ``MolDraw2DOptions``
949that will add stereo information (R/S to atoms, E/Z to bonds) and atom and bond
950sequence numbers.
951
952.. doctest::
953
954   >>> mol = Chem.MolFromSmiles('Cl[C@H](F)NC\C=C\C')
955   >>> d = rdMolDraw2D.MolDraw2DCairo(250, 200) # or MolDraw2DSVG to get SVGs
956   >>> mol.GetAtomWithIdx(2).SetProp('atomNote', 'foo')
957   >>> mol.GetBondWithIdx(0).SetProp('bondNote', 'bar')
958   >>> d.drawOptions().addStereoAnnotation = True
959   >>> d.drawOptions().addAtomIndices = True
960   >>> d.DrawMolecule(mol)
961   >>> d.FinishDrawing()
962   >>> d.WriteDrawingText('atom_annotation_1.png')   # doctest: +SKIP
963
964will produce
965
966.. image:: images/atom_annotation_1.png
967
968If atoms have an ``atomLabel`` property set, this will be used when drawing them:
969
970.. doctest::
971
972   >>> smi = 'c1nc(*)ccc1* |$;;;R1;;;;R2$|'
973   >>> mol = Chem.MolFromSmiles(smi)
974   >>> mol.GetAtomWithIdx(3).GetProp("atomLabel")
975   'R1'
976   >>> mol.GetAtomWithIdx(7).GetProp("atomLabel")
977   'R2'
978   >>> d = rdMolDraw2D.MolDraw2DCairo(250, 250)
979   >>> rdMolDraw2D.PrepareAndDrawMolecule(d,mol)
980   >>> d.WriteDrawingText("./images/atom_labels_1.png")   # doctest: +SKIP
981
982gives:
983
984.. image:: images/atom_labels_1.png
985
986Since the ``atomLabel`` property is also used for other things (for example in CXSMILES as demonstrated),
987if you want to provide your own atom labels, it's better to use the ``_displayLabel`` property:
988
989   >>> smi = 'c1nc(*)ccc1* |$;;;R1;;;;R2$|'
990   >>> mol = Chem.MolFromSmiles(smi)
991   >>> mol.GetAtomWithIdx(3).SetProp("_displayLabel","R<sub>1</sub>")
992   >>> mol.GetAtomWithIdx(7).SetProp("_displayLabel","R<sub>2</sub>")
993   >>> d = rdMolDraw2D.MolDraw2DCairo(250, 250)
994   >>> rdMolDraw2D.PrepareAndDrawMolecule(d,mol)
995   >>> d.WriteDrawingText("./images/atom_labels_2.png")   # doctest: +SKIP
996
997this gives:
998
999.. image:: images/atom_labels_2.png
1000
1001Note that you can use ``<sup>`` and ``<sub>`` in these labels to provide super- and subscripts.
1002
1003Finally, if you have atom labels which should be displayed differently when the bond comes
1004into them from the right (the West), you can also set the ``_displayLabelW`` property:
1005
1006
1007.. doctest::
1008
1009   >>> smi = 'c1nc(*)ccc1* |$;;;R1;;;;R2$|'
1010   >>> mol = Chem.MolFromSmiles(smi)
1011   >>> mol.GetAtomWithIdx(3).SetProp("_displayLabel","CO<sub>2</sub>H")
1012   >>> mol.GetAtomWithIdx(3).SetProp("_displayLabelW","HO<sub>2</sub>C")
1013   >>> mol.GetAtomWithIdx(7).SetProp("_displayLabel","CO<sub>2</sub><sup>-</sup>")
1014   >>> mol.GetAtomWithIdx(7).SetProp("_displayLabelW","<sup>-</sup>OOC")
1015   >>> d = rdMolDraw2D.MolDraw2DCairo(250, 250)
1016   >>> rdMolDraw2D.PrepareAndDrawMolecule(d,mol)
1017   >>> d.WriteDrawingText("./images/atom_labels_3.png")   # doctest: +SKIP
1018
1019this gives:
1020
1021.. image:: images/atom_labels_3.png
1022
1023
1024Metadata in Molecule Images
1025===========================
1026
1027*New in 2020.09 release*
1028
1029The PNG files generated by the `MolDraw2DCairo` class by default include
1030metadata about the molecule(s) or chemical reaction included in the drawing.
1031This metadata can be used later to reconstruct the molecule(s) or reaction.
1032
1033.. doctest::
1034
1035  >>> template = Chem.MolFromSmiles('c1nccc2n1ccc2')
1036  >>> AllChem.Compute2DCoords(template)
1037  0
1038  >>> ms = [Chem.MolFromSmiles(smi) for smi in ('OCCc1ccn2cnccc12','C1CC1Oc1cc2ccncn2c1','CNC(=O)c1nccc2cccn12')]
1039  >>> _ = [AllChem.GenerateDepictionMatching2DStructure(m,template) for m in ms]
1040  >>> d = rdMolDraw2D.MolDraw2DCairo(250, 200)
1041  >>> d.DrawMolecule(ms[0])
1042  >>> d.FinishDrawing()
1043  >>> png = d.GetDrawingText()
1044  >>> mol = Chem.MolFromPNGString(png)
1045  >>> Chem.MolToSmiles(mol)
1046  'OCCc1c2ccncn2cc1'
1047
1048The molecular metadata is stored using standard metadata tags in the PNG and is,
1049of course, not visible when you look at the PNG:
1050
1051.. image:: images/mol_metadata1.png
1052
1053If the PNG contains multiple molecules we can retrieve them all at once using
1054`Chem.MolsFromPNGString()`:
1055
1056.. doctest::
1057
1058  >>> from rdkit.Chem import Draw
1059  >>> png = Draw.MolsToGridImage(ms,returnPNG=True)
1060  >>> mols = Chem.MolsFromPNGString(png)
1061  >>> for mol in mols:
1062  ...     print(Chem.MolToSmiles(mol))
1063  ...
1064  OCCc1c2ccncn2cc1
1065  c1cc2cc(OC3CC3)cn2cn1
1066  CNC(=O)c1nccc2cccn12
1067
1068Substructure Searching
1069**********************
1070
1071Substructure matching can be done using query molecules built from SMARTS:
1072
1073.. doctest::
1074
1075  >>> m = Chem.MolFromSmiles('c1ccccc1O')
1076  >>> patt = Chem.MolFromSmarts('ccO')
1077  >>> m.HasSubstructMatch(patt)
1078  True
1079  >>> m.GetSubstructMatch(patt)
1080  (0, 5, 6)
1081
1082Those are the atom indices in ``m``, ordered as ``patt``'s atoms. To get all of the matches:
1083
1084.. doctest::
1085
1086  >>> m.GetSubstructMatches(patt)
1087  ((0, 5, 6), (4, 5, 6))
1088
1089This can be used to easily filter lists of molecules:
1090
1091.. doctest::
1092
1093  >>> patt = Chem.MolFromSmarts('c[NH1]')
1094  >>> matches = []
1095  >>> with Chem.SDMolSupplier('data/actives_5ht3.sdf') as suppl:
1096  ...   for mol in suppl:
1097  ...     if mol.HasSubstructMatch(patt):
1098  ...       matches.append(mol)
1099  >>> len(matches)
1100  22
1101
1102We can write the same thing more compactly using Python's list comprehension syntax:
1103
1104.. doctest::
1105
1106  >>> with Chem.SDMolSupplier('data/actives_5ht3.sdf') as suppl:
1107  ...   matches = [x for x in suppl if x.HasSubstructMatch(patt)]
1108  >>> len(matches)
1109  22
1110
1111Substructure matching can also be done using molecules built from SMILES instead of SMARTS:
1112
1113.. doctest::
1114
1115  >>> m = Chem.MolFromSmiles('C1=CC=CC=C1OC')
1116  >>> m.HasSubstructMatch(Chem.MolFromSmarts('CO'))
1117  True
1118  >>> m.HasSubstructMatch(Chem.MolFromSmiles('CO'))
1119  True
1120
1121But don't forget that the semantics of the two languages are not exactly equivalent:
1122
1123.. doctest::
1124
1125  >>> m.HasSubstructMatch(Chem.MolFromSmiles('COC'))
1126  True
1127  >>> m.HasSubstructMatch(Chem.MolFromSmarts('COC'))
1128  False
1129  >>> m.HasSubstructMatch(Chem.MolFromSmarts('COc')) #<- need an aromatic C
1130  True
1131
1132
1133Stereochemistry in substructure matches
1134=======================================
1135
1136By default information about stereochemistry is not used in
1137substructure searches:
1138
1139.. doctest::
1140
1141  >>> m = Chem.MolFromSmiles('CC[C@H](F)Cl')
1142  >>> m.HasSubstructMatch(Chem.MolFromSmiles('C[C@H](F)Cl'))
1143  True
1144  >>> m.HasSubstructMatch(Chem.MolFromSmiles('C[C@@H](F)Cl'))
1145  True
1146  >>> m.HasSubstructMatch(Chem.MolFromSmiles('CC(F)Cl'))
1147  True
1148
1149But this can be changed via the `useChirality` argument:
1150
1151.. doctest::
1152
1153  >>> m.HasSubstructMatch(Chem.MolFromSmiles('C[C@H](F)Cl'),useChirality=True)
1154  True
1155  >>> m.HasSubstructMatch(Chem.MolFromSmiles('C[C@@H](F)Cl'),useChirality=True)
1156  False
1157  >>> m.HasSubstructMatch(Chem.MolFromSmiles('CC(F)Cl'),useChirality=True)
1158  True
1159
1160Notice that when `useChirality` is set a non-chiral query **does** match a chiral
1161molecule. The same is not true for a chiral query and a non-chiral molecule:
1162
1163.. doctest::
1164
1165  >>> m.HasSubstructMatch(Chem.MolFromSmiles('CC(F)Cl'))
1166  True
1167  >>> m2 = Chem.MolFromSmiles('CCC(F)Cl')
1168  >>> m2.HasSubstructMatch(Chem.MolFromSmiles('C[C@H](F)Cl'),useChirality=True)
1169  False
1170
1171Atom Map Indices in SMARTS
1172==========================
1173
1174It is possible to attach indices to the atoms in the SMARTS
1175pattern. This is most often done in reaction SMARTS (see `Chemical
1176Reactions`_), but is more general than that.  For example, in the
1177SMARTS patterns for torsion angle analysis published by Guba `et al.`
1178(``DOI: acs.jcim.5b00522``) indices are used to define the four atoms of
1179the torsion of interest. This allows additional atoms to be used to
1180define the environment of the four torsion atoms, as in
1181``[cH0:1][c:2]([cH0])!@[CX3!r:3]=[NX2!r:4]`` for an aromatic C=N
1182torsion.  We might wonder in passing why they didn't use
1183recursive SMARTS for this, which would have made life easier, but it
1184is what it is. The atom lists from ``GetSubstructureMatches`` are
1185guaranteed to be in order of the SMARTS, but in this case we'll get five
1186atoms so we need a way of picking out, in the correct order, the four of
1187interest.  When the SMARTS is parsed, the relevant atoms are assigned an
1188atom map number property that we can easily extract:
1189
1190.. doctest::
1191
1192  >>> qmol = Chem.MolFromSmarts( '[cH0:1][c:2]([cH0])!@[CX3!r:3]=[NX2!r:4]' )
1193  >>> ind_map = {}
1194  >>> for atom in qmol.GetAtoms() :
1195  ...     map_num = atom.GetAtomMapNum()
1196  ...     if map_num:
1197  ...         ind_map[map_num-1] = atom.GetIdx()
1198  >>> ind_map
1199  {0: 0, 1: 1, 2: 3, 3: 4}
1200  >>> map_list = [ind_map[x] for x in sorted(ind_map)]
1201  >>> map_list
1202  [0, 1, 3, 4]
1203
1204Then, when using the query on a molecule you can get the indices of the four
1205matching atoms like this:
1206
1207.. doctest::
1208
1209  >>> mol = Chem.MolFromSmiles('Cc1cccc(C)c1C(C)=NC')
1210  >>> for match in mol.GetSubstructMatches( qmol ) :
1211  ...     mas = [match[x] for x in map_list]
1212  ...     print(mas)
1213  [1, 7, 8, 10]
1214
1215
1216Advanced substructure matching
1217==============================
1218
1219Starting with the 2020.03 release, the RDKit allows you to provide an optional
1220function that is used to check whether or not a possible substructure match should
1221be accepted. This function is called with the molecule to be matched and the indices
1222of the matching atoms.
1223
1224Here's an example of how you can use the functionality to do "Markush-like" matching,
1225requiring that all atoms in a sidechain are either carbon (type "all_carbon") or aren't
1226aromatic (type "alkyl"). We start by defining the class that we'll use to test the
1227sidechains:
1228
1229.. testcode::
1230
1231  from rdkit import Chem
1232
1233  class SidechainChecker(object):
1234    matchers = {
1235      'alkyl': lambda at: not at.GetIsAromatic(),
1236      'all_carbon': lambda at: at.GetAtomicNum() == 6
1237    }
1238
1239    def __init__(self, query, pName="queryType"):
1240      # identify the atoms that have the properties we care about
1241      self._atsToExamine = [(x.GetIdx(), x.GetProp(pName)) for x in query.GetAtoms()
1242                            if x.HasProp(pName)]
1243      self._pName = pName
1244
1245    def __call__(self, mol, vect):
1246      seen = [0] * mol.GetNumAtoms()
1247      for idx in vect:
1248        seen[idx] = 1
1249      # loop over the atoms we care about:
1250      for idx, qtyp in self._atsToExamine:
1251        midx = vect[idx]
1252        stack = [midx]
1253        atom = mol.GetAtomWithIdx(midx)
1254        # now do a breadth-first search from that atom, checking
1255        # all of its neighbors that aren't in the substructure
1256        # query:
1257        stack = [atom]
1258        while stack:
1259          atom = stack.pop(0)
1260          if not self.matchers[qtyp](atom):
1261            return False
1262          seen[atom.GetIdx()] = 1
1263          for nbr in atom.GetNeighbors():
1264            if not seen[nbr.GetIdx()]:
1265              stack.append(nbr)
1266      return True
1267
1268
1269Here's the molecule we'll use:
1270
1271.. image:: images/substruct_search_parameters1.png
1272
1273And the default behavior:
1274
1275.. doctest::
1276
1277  >>> m = Chem.MolFromSmiles('C2NCC2CC1C(CCCC)C(OCCCC)C1c2ccccc2')
1278  >>> p = Chem.MolFromSmarts('C1CCC1*')
1279  >>> p.GetAtomWithIdx(4).SetProp("queryType", "all_carbon")
1280  >>> m.GetSubstructMatches(p)
1281  ((5, 6, 11, 17, 18), (5, 17, 11, 6, 7), (6, 5, 17, 11, 12), (6, 11, 17, 5, 4))
1282
1283Now let's add the final check to filter the results:
1284
1285.. doctest::
1286
1287  >>> params = Chem.SubstructMatchParameters()
1288  >>> checker = SidechainChecker(p)
1289  >>> params.setExtraFinalCheck(checker)
1290  >>> m.GetSubstructMatches(p,params)
1291  ((5, 6, 11, 17, 18), (5, 17, 11, 6, 7))
1292
1293Repeat that using the 'alkyl' query:
1294
1295.. doctest::
1296
1297  >>> p.GetAtomWithIdx(4).SetProp("queryType", "alkyl")
1298  >>> checker = SidechainChecker(p)
1299  >>> params.setExtraFinalCheck(checker)
1300  >>> m.GetSubstructMatches(p,params)
1301  ((5, 17, 11, 6, 7), (6, 5, 17, 11, 12), (6, 11, 17, 5, 4))
1302
1303
1304Chemical Transformations
1305************************
1306
1307The RDKit contains a number of functions for modifying molecules. Note
1308that these transformation functions are intended to provide an easy
1309way to make simple modifications to molecules.
1310For more complex transformations, use the `Chemical Reactions`_ functionality.
1311
1312Substructure-based transformations
1313==================================
1314
1315There's a variety of functionality for using the RDKit's
1316substructure-matching machinery for doing quick molecular transformations.
1317These transformations include deleting substructures:
1318
1319.. doctest::
1320
1321  >>> m = Chem.MolFromSmiles('CC(=O)O')
1322  >>> patt = Chem.MolFromSmarts('C(=O)[OH]')
1323  >>> rm = AllChem.DeleteSubstructs(m,patt)
1324  >>> Chem.MolToSmiles(rm)
1325  'C'
1326
1327replacing substructures:
1328
1329.. doctest::
1330
1331  >>> repl = Chem.MolFromSmiles('OC')
1332  >>> patt = Chem.MolFromSmarts('[$(NC(=O))]')
1333  >>> m = Chem.MolFromSmiles('CC(=O)N')
1334  >>> rms = AllChem.ReplaceSubstructs(m,patt,repl)
1335  >>> rms
1336  (<rdkit.Chem.rdchem.Mol object at 0x...>,)
1337  >>> Chem.MolToSmiles(rms[0])
1338  'COC(C)=O'
1339
1340as well as simple SAR-table transformations like removing side chains:
1341
1342.. doctest::
1343
1344  >>> m1 = Chem.MolFromSmiles('BrCCc1cncnc1C(=O)O')
1345  >>> core = Chem.MolFromSmiles('c1cncnc1')
1346  >>> tmp = Chem.ReplaceSidechains(m1,core)
1347  >>> Chem.MolToSmiles(tmp)
1348  '[1*]c1cncnc1[2*]'
1349
1350and removing cores:
1351
1352.. doctest::
1353
1354  >>> tmp = Chem.ReplaceCore(m1,core)
1355  >>> Chem.MolToSmiles(tmp)
1356  '[1*]CCBr.[2*]C(=O)O'
1357
1358By default the sidechains are labeled based on the order they are found.
1359They can also be labeled according by the number of that core-atom they're attached to:
1360
1361.. doctest::
1362
1363  >>> m1 = Chem.MolFromSmiles('c1c(CCO)ncnc1C(=O)O')
1364  >>> tmp=Chem.ReplaceCore(m1,core,labelByIndex=True)
1365  >>> Chem.MolToSmiles(tmp)
1366  '[1*]CCO.[5*]C(=O)O'
1367
1368:py:func:`rdkit.Chem.rdmolops.ReplaceCore` returns the sidechains in a single molecule.
1369This can be split into separate molecules using :py:func:`rdkit.Chem.rdmolops.GetMolFrags` :
1370
1371.. doctest::
1372
1373  >>> rs = Chem.GetMolFrags(tmp,asMols=True)
1374  >>> len(rs)
1375  2
1376  >>> Chem.MolToSmiles(rs[0])
1377  '[1*]CCO'
1378  >>> Chem.MolToSmiles(rs[1])
1379  '[5*]C(=O)O'
1380
1381
1382Murcko Decomposition
1383====================
1384
1385The RDKit provides standard Murcko-type decomposition [#bemis1]_ of molecules
1386into scaffolds:
1387
1388.. doctest::
1389
1390  >>> from rdkit.Chem.Scaffolds import MurckoScaffold
1391  >>> with Chem.SDMolSupplier('data/cdk2.sdf') as cdk2mols:
1392  ...   m1 = cdk2mols[0]
1393  >>> core = MurckoScaffold.GetScaffoldForMol(m1)
1394  >>> Chem.MolToSmiles(core)
1395  'c1ncc2nc[nH]c2n1'
1396
1397or into a generic framework:
1398
1399.. doctest::
1400
1401  >>> fw = MurckoScaffold.MakeScaffoldGeneric(core)
1402  >>> Chem.MolToSmiles(fw)
1403  'C1CCC2CCCC2C1'
1404
1405
1406Maximum Common Substructure
1407***************************
1408
1409The FindMCS function find a maximum common substructure (MCS) of two
1410or more molecules:
1411
1412.. doctest::
1413
1414  >>> from rdkit.Chem import rdFMCS
1415  >>> mol1 = Chem.MolFromSmiles("O=C(NCc1cc(OC)c(O)cc1)CCCC/C=C/C(C)C")
1416  >>> mol2 = Chem.MolFromSmiles("CC(C)CCCCCC(=O)NCC1=CC(=C(C=C1)O)OC")
1417  >>> mol3 = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
1418  >>> mols = [mol1,mol2,mol3]
1419  >>> res=rdFMCS.FindMCS(mols)
1420  >>> res
1421  <rdkit.Chem.rdFMCS.MCSResult object at 0x...>
1422  >>> res.numAtoms
1423  10
1424  >>> res.numBonds
1425  10
1426  >>> res.smartsString
1427  '[#6]1(-[#6]):[#6]:[#6](-[#8]-[#6]):[#6](:[#6]:[#6]:1)-[#8]'
1428  >>> res.canceled
1429  False
1430
1431It returns an MCSResult instance with information about the number of
1432atoms and bonds in the MCS, the SMARTS string which matches the
1433identified MCS, and a flag saying if the algorithm timed out. If no
1434MCS is found then the number of atoms and bonds is set to 0 and the
1435SMARTS to ``''``.
1436
1437By default, two atoms match if they are the same element and two bonds
1438match if they have the same bond type. Specify ``atomCompare`` and
1439``bondCompare`` to use different comparison functions, as in:
1440
1441.. doctest::
1442
1443  >>> mols = (Chem.MolFromSmiles('NCC'),Chem.MolFromSmiles('OC=C'))
1444  >>> rdFMCS.FindMCS(mols).smartsString
1445  '[#6]'
1446  >>> rdFMCS.FindMCS(mols, atomCompare=rdFMCS.AtomCompare.CompareAny).smartsString
1447  '[#7,#8]-[#6]'
1448  >>> rdFMCS.FindMCS(mols, bondCompare=rdFMCS.BondCompare.CompareAny).smartsString
1449  '[#6]-,=[#6]'
1450
1451The options for the atomCompare argument are: CompareAny says that any
1452atom matches any other atom, CompareElements compares by element type,
1453and CompareIsotopes matches based on the isotope label. Isotope labels
1454can be used to implement user-defined atom types. A bondCompare of
1455CompareAny says that any bond matches any other bond, CompareOrderExact says
1456bonds are equivalent if and only if they have the same bond type, and
1457CompareOrder allows single and aromatic bonds to match each other, but
1458requires an exact order match otherwise:
1459
1460.. doctest::
1461
1462  >>> mols = (Chem.MolFromSmiles('c1ccccc1'),Chem.MolFromSmiles('C1CCCC=C1'))
1463  >>> rdFMCS.FindMCS(mols,bondCompare=rdFMCS.BondCompare.CompareAny).smartsString
1464  '[#6]1:,-[#6]:,-[#6]:,-[#6]:,-[#6]:,=[#6]:,-1'
1465  >>> rdFMCS.FindMCS(mols,bondCompare=rdFMCS.BondCompare.CompareOrderExact).smartsString
1466  '[#6]'
1467  >>> rdFMCS.FindMCS(mols,bondCompare=rdFMCS.BondCompare.CompareOrder).smartsString
1468  '[#6](:,-[#6]:,-[#6]:,-[#6]):,-[#6]:,-[#6]'
1469
1470A substructure has both atoms and bonds. By default, the algorithm
1471attempts to maximize the number of bonds found. You can change this by
1472setting the ``maximizeBonds`` argument to False.
1473Maximizing the number of bonds tends to maximize the number of rings,
1474although two small rings may have fewer bonds than one large ring.
1475
1476You might not want a 3-valent nitrogen to match one which is 5-valent.
1477The default ``matchValences`` value of False ignores valence
1478information.  When True, the atomCompare setting is modified to also
1479require that the two atoms have the same valency.
1480
1481.. doctest::
1482
1483  >>> mols = (Chem.MolFromSmiles('NC1OC1'),Chem.MolFromSmiles('C1OC1[N+](=O)[O-]'))
1484  >>> rdFMCS.FindMCS(mols).numAtoms
1485  4
1486  >>> rdFMCS.FindMCS(mols, matchValences=True).numBonds
1487  3
1488
1489It can be strange to see a linear carbon chain match a carbon ring,
1490which is what the ``ringMatchesRingOnly`` default of False does. If
1491you set it to True then ring bonds will only match ring bonds.
1492
1493.. doctest::
1494
1495  >>> mols = [Chem.MolFromSmiles("C1CCC1CCC"), Chem.MolFromSmiles("C1CCCCCC1")]
1496  >>> rdFMCS.FindMCS(mols).smartsString
1497  '[#6](-[#6]-[#6])-[#6]-[#6]-[#6]-[#6]'
1498  >>> rdFMCS.FindMCS(mols, ringMatchesRingOnly=True).smartsString
1499  '[#6&R](-&@[#6&R]-&@[#6&R])-&@[#6&R]'
1500
1501Notice that the SMARTS returned now include ring queries on the atoms and bonds.
1502
1503You can further restrict things and require that partial rings (as in
1504this case) are not allowed. That is, if an atom is part of the MCS and
1505the atom is in a ring of the entire molecule then that atom is also in
1506a ring of the MCS. Setting ``completeRingsOnly`` to True toggles this
1507requirement.
1508
1509.. doctest::
1510
1511  >>> mols = [Chem.MolFromSmiles("CCC1CC2C1CN2"), Chem.MolFromSmiles("C1CC2C1CC2")]
1512  >>> rdFMCS.FindMCS(mols).smartsString
1513  '[#6]1-[#6]-[#6](-[#6]-1-[#6])-[#6]'
1514  >>> rdFMCS.FindMCS(mols, ringMatchesRingOnly=True).smartsString
1515  '[#6&R](-&@[#6&R]-&@[#6&R]-&@[#6&R]-&@[#6&R])-&@[#6&R]'
1516  >>> rdFMCS.FindMCS(mols, completeRingsOnly=True).smartsString
1517  '[#6]1-&@[#6]-&@[#6]-&@[#6]-&@1'
1518
1519Of course the two options can be combined with each other:
1520
1521.. doctest::
1522
1523  >>> ms = [Chem.MolFromSmiles(x) for x in ('CC1CCC1','CCC1CC1',)]
1524  >>> rdFMCS.FindMCS(ms,ringMatchesRingOnly=True).smartsString
1525  '[#6&!R]-&!@[#6&R](-&@[#6&R])-&@[#6&R]'
1526  >>> rdFMCS.FindMCS(ms,completeRingsOnly=True).smartsString
1527  '[#6]-&!@[#6]'
1528  >>> rdFMCS.FindMCS(ms,ringMatchesRingOnly=True,completeRingsOnly=True).smartsString
1529  '[#6&!R]-&!@[#6&R]'
1530
1531
1532The MCS algorithm will exhaustively search for a maximum common substructure.
1533Typically this takes a fraction of a second, but for some comparisons this
1534can take minutes or longer. Use the ``timeout`` parameter to stop the search
1535after the given number of seconds (wall-clock seconds, not CPU seconds) and
1536return the best match found in that time. If timeout is reached then the
1537``canceled`` property of the MCSResult will be True instead of False.
1538
1539.. doctest::
1540
1541  >>> mols = [Chem.MolFromSmiles("Nc1ccccc1"*10), Chem.MolFromSmiles("Nc1ccccccccc1"*10)]
1542  >>> rdFMCS.FindMCS(mols, timeout=1).canceled
1543  True
1544
1545(The MCS after 50 seconds contained 511 atoms.)
1546
1547
1548Fingerprinting and Molecular Similarity
1549***************************************
1550
1551The RDKit has a variety of built-in functionality for generating molecular fingerprints and using them to calculate molecular similarity.
1552
1553
1554Topological Fingerprints
1555========================
1556
1557.. doctest::
1558
1559  >>> from rdkit import DataStructs
1560  >>> ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'),
1561  ... Chem.MolFromSmiles('COC')]
1562  >>> fps = [Chem.RDKFingerprint(x) for x in ms]
1563  >>> DataStructs.FingerprintSimilarity(fps[0],fps[1])
1564  0.6...
1565  >>> DataStructs.FingerprintSimilarity(fps[0],fps[2])
1566  0.4...
1567  >>> DataStructs.FingerprintSimilarity(fps[1],fps[2])
1568  0.25
1569
1570More details about the algorithm used for the RDKit fingerprint can be found in the "RDKit Book".
1571
1572The default set of parameters used by the fingerprinter is:
1573- minimum path size: 1 bond
1574- maximum path size: 7 bonds
1575- fingerprint size: 2048 bits
1576- number of bits set per hash: 2
1577- minimum fingerprint size: 64 bits
1578- target on-bit density 0.0
1579
1580You can control these when calling
1581:py:func:`rdkit.Chem.rdmolops.RDKFingerprint`.
1582The function
1583:py:func:`rdkit.Chem.Fingerprints.FingerprintMols.FingerprintMol` (written
1584in python) shows how this is done.
1585
1586The default similarity metric used by
1587:py:func:`rdkit.DataStructs.FingerprintSimilarity` is the Tanimoto
1588similarity.  One can use different similarity metrics:
1589
1590>>> DataStructs.FingerprintSimilarity(fps[0],fps[1], metric=DataStructs.DiceSimilarity)
15910.75
1592
1593Available similarity metrics include Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky.
1594
1595
1596MACCS Keys
1597==========
1598
1599There is a SMARTS-based implementation of the 166 public MACCS keys.
1600
1601.. doctest::
1602
1603  >>> from rdkit.Chem import MACCSkeys
1604  >>> fps = [MACCSkeys.GenMACCSKeys(x) for x in ms]
1605  >>> DataStructs.FingerprintSimilarity(fps[0],fps[1])
1606  0.5
1607  >>> DataStructs.FingerprintSimilarity(fps[0],fps[2])
1608  0.538...
1609  >>> DataStructs.FingerprintSimilarity(fps[1],fps[2])
1610  0.214...
1611
1612The MACCS keys were critically evaluated and compared to other MACCS
1613implementations in Q3 2008. In cases where the public keys are fully defined,
1614things looked pretty good.
1615
1616
1617Atom Pairs and Topological Torsions
1618===================================
1619
1620Atom-pair descriptors [#carhart]_ are available in several different forms.
1621The standard form is as fingerprint including counts for each bit instead of just zeros and ones:
1622
1623.. doctest::
1624
1625  >>> from rdkit.Chem.AtomPairs import Pairs
1626  >>> ms = [Chem.MolFromSmiles('C1CCC1OCC'),Chem.MolFromSmiles('CC(C)OCC'),Chem.MolFromSmiles('CCOCC')]
1627  >>> pairFps = [Pairs.GetAtomPairFingerprint(x) for x in ms]
1628
1629Because the space of bits that can be included in atom-pair fingerprints is
1630huge, they are stored in a sparse manner. We can get the list of bits and their
1631counts for each fingerprint as a dictionary:
1632
1633.. doctest::
1634
1635  >>> d = pairFps[-1].GetNonzeroElements()
1636  >>> d[541732]
1637  1
1638  >>> d[1606690]
1639  2
1640
1641Descriptions of the bits are also available:
1642
1643.. doctest::
1644
1645  >>> Pairs.ExplainPairScore(558115)
1646  (('C', 1, 0), 3, ('C', 2, 0))
1647
1648The above means: C with 1 neighbor and 0 pi electrons which is 3 bonds
1649from a C with 2 neighbors and 0 pi electrons
1650
1651The usual metric for similarity between atom-pair fingerprints is Dice similarity:
1652
1653.. doctest::
1654
1655  >>> from rdkit import DataStructs
1656  >>> DataStructs.DiceSimilarity(pairFps[0],pairFps[1])
1657  0.333...
1658  >>> DataStructs.DiceSimilarity(pairFps[0],pairFps[2])
1659  0.258...
1660  >>> DataStructs.DiceSimilarity(pairFps[1],pairFps[2])
1661  0.56
1662
1663It's also possible to get atom-pair descriptors encoded as a standard
1664bit vector fingerprint (ignoring the count information):
1665
1666.. doctest::
1667
1668  >>> pairFps = [Pairs.GetAtomPairFingerprintAsBitVect(x) for x in ms]
1669
1670Since these are standard bit vectors, the :py:mod:`rdkit.DataStructs`
1671module can be used for similarity:
1672
1673.. doctest::
1674
1675  >>> from rdkit import DataStructs
1676  >>> DataStructs.DiceSimilarity(pairFps[0],pairFps[1])
1677  0.48
1678  >>> DataStructs.DiceSimilarity(pairFps[0],pairFps[2])
1679  0.380...
1680  >>> DataStructs.DiceSimilarity(pairFps[1],pairFps[2])
1681  0.625
1682
1683Topological torsion descriptors [#nilakantan]_ are calculated in
1684essentially the same way:
1685
1686.. doctest::
1687
1688  >>> from rdkit.Chem.AtomPairs import Torsions
1689  >>> tts = [Torsions.GetTopologicalTorsionFingerprintAsIntVect(x) for x in ms]
1690  >>> DataStructs.DiceSimilarity(tts[0],tts[1])
1691  0.166...
1692
1693At the time of this writing, topological torsion fingerprints have too many bits
1694to be encodeable using the BitVector machinery, so there is no
1695GetTopologicalTorsionFingerprintAsBitVect function.
1696
1697
1698Morgan Fingerprints (Circular Fingerprints)
1699===========================================
1700
1701This family of fingerprints, better known as circular fingerprints
1702[#rogers]_, is built by applying the Morgan algorithm to a set of
1703user-supplied atom invariants.  When generating Morgan fingerprints,
1704the radius of the fingerprint must also be provided :
1705
1706.. doctest::
1707
1708  >>> from rdkit.Chem import AllChem
1709  >>> m1 = Chem.MolFromSmiles('Cc1ccccc1')
1710  >>> fp1 = AllChem.GetMorganFingerprint(m1,2)
1711  >>> fp1
1712  <rdkit.DataStructs.cDataStructs.UIntSparseIntVect object at 0x...>
1713  >>> m2 = Chem.MolFromSmiles('Cc1ncccc1')
1714  >>> fp2 = AllChem.GetMorganFingerprint(m2,2)
1715  >>> DataStructs.DiceSimilarity(fp1,fp2)
1716  0.55...
1717
1718Morgan fingerprints, like atom pairs and topological torsions, use
1719counts by default, but it's also possible to calculate them as bit
1720vectors:
1721
1722.. doctest::
1723
1724  >>> fp1 = AllChem.GetMorganFingerprintAsBitVect(m1,2,nBits=1024)
1725  >>> fp1
1726  <rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
1727  >>> fp2 = AllChem.GetMorganFingerprintAsBitVect(m2,2,nBits=1024)
1728  >>> DataStructs.DiceSimilarity(fp1,fp2)
1729  0.51...
1730
1731The default atom invariants use connectivity information similar to
1732those used for the well known ECFP family of fingerprints.
1733Feature-based invariants, similar to those used for the FCFP
1734fingerprints, can also be used. The feature definitions used are
1735defined in the section `Feature Definitions Used in the Morgan
1736Fingerprints`_.  At times this can lead to quite different similarity
1737scores:
1738
1739.. doctest::
1740
1741  >>> m1 = Chem.MolFromSmiles('c1ccccn1')
1742  >>> m2 = Chem.MolFromSmiles('c1ccco1')
1743  >>> fp1 = AllChem.GetMorganFingerprint(m1,2)
1744  >>> fp2 = AllChem.GetMorganFingerprint(m2,2)
1745  >>> ffp1 = AllChem.GetMorganFingerprint(m1,2,useFeatures=True)
1746  >>> ffp2 = AllChem.GetMorganFingerprint(m2,2,useFeatures=True)
1747  >>> DataStructs.DiceSimilarity(fp1,fp2)
1748  0.36...
1749  >>> DataStructs.DiceSimilarity(ffp1,ffp2)
1750  0.90...
1751
1752When comparing the ECFP/FCFP fingerprints and the Morgan fingerprints
1753generated by the RDKit, remember that the 4 in ECFP4 corresponds to
1754the diameter of the atom environments considered, while the Morgan
1755fingerprints take a radius parameter.  So the examples above, with
1756radius=2, are roughly equivalent to ECFP4 and FCFP4.
1757
1758The user can also provide their own atom invariants using the optional
1759invariants argument to
1760:py:func:`rdkit.Chem.rdMolDescriptors.GetMorganFingerprint`.  Here's a
1761simple example that uses a constant for the invariant; the resulting
1762fingerprints compare the topology of molecules:
1763
1764.. doctest::
1765
1766  >>> m1 = Chem.MolFromSmiles('Cc1ccccc1')
1767  >>> m2 = Chem.MolFromSmiles('Cc1ncncn1')
1768  >>> fp1 = AllChem.GetMorganFingerprint(m1,2,invariants=[1]*m1.GetNumAtoms())
1769  >>> fp2 = AllChem.GetMorganFingerprint(m2,2,invariants=[1]*m2.GetNumAtoms())
1770  >>> fp1==fp2
1771  True
1772
1773Note that bond order is by default still considered:
1774
1775.. doctest::
1776
1777  >>> m3 = Chem.MolFromSmiles('CC1CCCCC1')
1778  >>> fp3 = AllChem.GetMorganFingerprint(m3,2,invariants=[1]*m3.GetNumAtoms())
1779  >>> fp1==fp3
1780  False
1781
1782But this can also be turned off:
1783
1784.. doctest::
1785
1786  >>> fp1 = AllChem.GetMorganFingerprint(m1,2,invariants=[1]*m1.GetNumAtoms(),
1787  ... useBondTypes=False)
1788  >>> fp3 = AllChem.GetMorganFingerprint(m3,2,invariants=[1]*m3.GetNumAtoms(),
1789  ... useBondTypes=False)
1790  >>> fp1==fp3
1791  True
1792
1793
1794Explaining bits from Morgan Fingerprints
1795----------------------------------------
1796
1797Information is available about the atoms that contribute to particular
1798bits in the Morgan fingerprint via the bitInfo argument.  The
1799dictionary provided is populated with one entry per bit set in the
1800fingerprint, the keys are the bit ids, the values are lists of (atom
1801index, radius) tuples.
1802
1803
1804.. doctest::
1805
1806  >>> m = Chem.MolFromSmiles('c1cccnc1C')
1807  >>> info={}
1808  >>> fp = AllChem.GetMorganFingerprint(m,2,bitInfo=info)
1809  >>> len(fp.GetNonzeroElements())
1810  16
1811  >>> len(info)
1812  16
1813  >>> info[98513984]
1814  ((1, 1), (2, 1))
1815  >>> info[4048591891]
1816  ((5, 2),)
1817
1818Interpreting the above: bit 98513984 is set twice: once by atom 1 and
1819once by atom 2, each at radius 1. Bit 4048591891 is set once by atom 5
1820at radius 2.
1821
1822Focusing on bit 4048591891, we can extract the submolecule consisting
1823of all atoms within a radius of 2 of atom 5:
1824
1825.. doctest::
1826
1827  >>> env = Chem.FindAtomEnvironmentOfRadiusN(m,2,5)
1828  >>> amap={}
1829  >>> submol=Chem.PathToSubmol(m,env,atomMap=amap)
1830  >>> submol.GetNumAtoms()
1831  6
1832  >>> amap
1833  {0: 0, 1: 1, 3: 2, 4: 3, 5: 4, 6: 5}
1834
1835And then “explain” the bit by generating SMILES for that submolecule:
1836
1837.. doctest::
1838
1839  >>> Chem.MolToSmiles(submol)
1840  'ccc(C)nc'
1841
1842This is more useful when the SMILES is rooted at the central atom:
1843
1844.. doctest::
1845
1846  >>> Chem.MolToSmiles(submol,rootedAtAtom=amap[5],canonical=False)
1847  'c(cc)(nc)C'
1848
1849An alternate (and faster, particularly for large numbers of molecules)
1850approach to do the same thing, using the function :py:func:`rdkit.Chem.MolFragmentToSmiles` :
1851
1852.. doctest::
1853
1854  >>> atoms=set()
1855  >>> for bidx in env:
1856  ...     atoms.add(m.GetBondWithIdx(bidx).GetBeginAtomIdx())
1857  ...     atoms.add(m.GetBondWithIdx(bidx).GetEndAtomIdx())
1858  ...
1859  >>> Chem.MolFragmentToSmiles(m,atomsToUse=list(atoms),bondsToUse=env,rootedAtAtom=5)
1860  'c(C)(cc)nc'
1861
1862Generating images of fingerprint bits
1863=====================================
1864
1865For the Morgan and RDKit fingerprint types, it's possible to generate images of
1866the atom environment that defines the bit using the functions
1867:py:func:`rdkit.Chem.Draw.DrawMorganBit()` and :py:func:`rdkit.Chem.Draw.DrawRDKitBit()`
1868
1869.. doctest::
1870
1871  >>> from rdkit.Chem import Draw
1872  >>> mol = Chem.MolFromSmiles('c1ccccc1CC1CC1')
1873  >>> bi = {}
1874  >>> fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bi)
1875  >>> bi[872]
1876  ((6, 2),)
1877  >>> mfp2_svg = Draw.DrawMorganBit(mol, 872, bi, useSVG=True)
1878  >>> rdkbi = {}
1879  >>> rdkfp = Chem.RDKFingerprint(mol, maxPath=5, bitInfo=rdkbi)
1880  >>> rdkbi[1553]
1881  [[0, 1, 9, 5, 4], [2, 3, 4, 9, 5]]
1882  >>> rdk_svg = Draw.DrawRDKitBit(mol, 1553, rdkbi, useSVG=True)
1883
1884Producing these images:
1885
1886+-----------------------------------+-----------------------------------+
1887| .. image:: images/mfp2_bit872.svg | .. image:: images/rdk_bit1553.svg |
1888+-----------------------------------+-----------------------------------+
1889|         Morgan bit                |            RDKit bit              |
1890+-----------------------------------+-----------------------------------+
1891
1892The default highlight colors for the Morgan bits indicate:
1893
1894  - blue: the central atom in the environment
1895  - yellow: aromatic atoms
1896  - gray: aliphatic ring atoms
1897
1898The default highlight colors for the RDKit bits indicate:
1899
1900  - yellow: aromatic atoms
1901
1902Note that in cases where the same bit is set by multiple atoms in the molecule
1903(as for bit 1553 for the RDKit fingerprint in the example above), the drawing
1904functions will display the first example. You can change this by specifying which
1905example to show:
1906
1907.. doctest::
1908
1909  >>> rdk_svg = Draw.DrawRDKitBit(mol, 1553, rdkbi, whichExample=1, useSVG=True)
1910
1911Producing this image:
1912
1913+-------------------------------------+
1914| .. image:: images/rdk_bit1553_2.svg |
1915+-------------------------------------+
1916|            RDKit bit                |
1917+-------------------------------------+
1918
1919
1920Picking Diverse Molecules Using Fingerprints
1921============================================
1922
1923A common task is to pick a small subset of diverse molecules from a larger set.
1924The RDKit provides a number of approaches for doing this in the
1925:py:mod:`rdkit.SimDivFilters` module.  The most efficient of these uses the
1926MaxMin algorithm. [#ashton]_ Here's an example:
1927
1928Start by reading in a set of molecules and generating Morgan fingerprints:
1929
1930.. doctest::
1931
1932  >>> from rdkit import Chem
1933  >>> from rdkit.Chem.rdMolDescriptors import GetMorganFingerprint
1934  >>> from rdkit import DataStructs
1935  >>> from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker
1936  >>> with Chem.SDMolSupplier('data/actives_5ht3.sdf') as suppl:
1937  ...   ms = [x for x in suppl if x is not None]
1938  >>> fps = [GetMorganFingerprint(x,3) for x in ms]
1939  >>> nfps = len(fps)
1940
1941The algorithm requires a function to calculate distances between
1942objects, we'll do that using DiceSimilarity:
1943
1944.. doctest::
1945
1946  >>> def distij(i,j,fps=fps):
1947  ...   return 1-DataStructs.DiceSimilarity(fps[i],fps[j])
1948
1949Now create a picker and grab a set of 10 diverse molecules:
1950
1951.. doctest::
1952
1953  >>> picker = MaxMinPicker()
1954  >>> pickIndices = picker.LazyPick(distij,nfps,10,seed=23)
1955  >>> list(pickIndices)
1956  [93, 109, 154, 6, 95, 135, 151, 61, 137, 139]
1957
1958Note that the picker just returns indices of the fingerprints; we can
1959get the molecules themselves as follows:
1960
1961.. doctest::
1962
1963  >>> picks = [ms[x] for x in pickIndices]
1964
1965Generating Similarity Maps Using Fingerprints
1966=============================================
1967
1968Similarity maps are a way to visualize the atomic contributions to
1969the similarity between a molecule and a reference molecule. The
1970methodology is described in Ref. [#riniker]_ .
1971They are in the :py:mod:`rdkit.Chem.Draw.SimilarityMaps` module :
1972
1973Start by creating two molecules:
1974
1975.. doctest::
1976
1977  >>> from rdkit import Chem
1978  >>> mol = Chem.MolFromSmiles('COc1cccc2cc(C(=O)NCCCCN3CCN(c4cccc5nccnc54)CC3)oc21')
1979  >>> refmol = Chem.MolFromSmiles('CCCN(CCCCN1CCN(c2ccccc2OC)CC1)Cc1ccc2ccccc2c1')
1980
1981The SimilarityMaps module supports three kind of fingerprints:
1982atom pairs, topological torsions and Morgan fingerprints.
1983
1984.. doctest::
1985
1986  >>> from rdkit.Chem import Draw
1987  >>> from rdkit.Chem.Draw import SimilarityMaps
1988  >>> fp = SimilarityMaps.GetAPFingerprint(mol, fpType='normal')
1989  >>> fp = SimilarityMaps.GetTTFingerprint(mol, fpType='normal')
1990  >>> fp = SimilarityMaps.GetMorganFingerprint(mol, fpType='bv')
1991
1992The types of atom pairs and torsions are normal (default), hashed and bit vector (bv).
1993The types of the Morgan fingerprint are bit vector (bv, default) and count vector (count).
1994
1995The function generating a similarity map for two fingerprints requires the
1996specification of the fingerprint function and optionally the similarity metric.
1997The default for the latter is the Dice similarity. Using all the default arguments
1998of the Morgan fingerprint function, the similarity map can be generated like this:
1999
2000.. doctest::
2001
2002  >>> fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, SimilarityMaps.GetMorganFingerprint)
2003
2004Producing this image:
2005
2006.. image:: images/similarity_map_fp1.png
2007
2008For a different type of Morgan (e.g. count) and radius = 1 instead of 2, as well as a different
2009similarity metric (e.g. Tanimoto), the call becomes:
2010
2011.. doctest::
2012
2013  >>> from rdkit import DataStructs
2014  >>> fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, mol, lambda m,idx: SimilarityMaps.GetMorganFingerprint(m, atomId=idx, radius=1, fpType='count'), metric=DataStructs.TanimotoSimilarity)
2015
2016Producing this image:
2017
2018.. image:: images/similarity_map_fp2.png
2019
2020The convenience function GetSimilarityMapForFingerprint involves the normalisation
2021of the atomic weights such that the maximum absolute weight is 1. Therefore, the
2022function outputs the maximum weight that was found when creating the map.
2023
2024.. doctest::
2025
2026  >>> print(maxweight)
2027  0.05747...
2028
2029If one does not want the normalisation step, the map can be created like:
2030
2031.. doctest::
2032
2033  >>> weights = SimilarityMaps.GetAtomicWeightsForFingerprint(refmol, mol, SimilarityMaps.GetMorganFingerprint)
2034  >>> print(["%.2f " % w for w in weights])
2035  ['0.05 ', ...
2036  >>> fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, weights)
2037
2038Producing this image:
2039
2040.. image:: images/similarity_map_fp3.png
2041
2042
2043Descriptor Calculation
2044**********************
2045
2046A variety of descriptors are available within the RDKit.
2047The complete list is provided in `List of Available Descriptors`_.
2048
2049Most of the descriptors are straightforward to use from Python via the
2050centralized :py:mod:`rdkit.Chem.Descriptors` module :
2051
2052.. doctest::
2053
2054  >>> from rdkit.Chem import Descriptors
2055  >>> m = Chem.MolFromSmiles('c1ccccc1C(=O)O')
2056  >>> Descriptors.TPSA(m)
2057  37.3
2058  >>> Descriptors.MolLogP(m)
2059  1.3848
2060
2061Partial charges are handled a bit differently:
2062
2063.. doctest::
2064
2065  >>> m = Chem.MolFromSmiles('c1ccccc1C(=O)O')
2066  >>> AllChem.ComputeGasteigerCharges(m)
2067  >>> m.GetAtomWithIdx(0).GetDoubleProp('_GasteigerCharge')
2068  -0.047...
2069
2070
2071Visualization of Descriptors
2072============================
2073
2074Similarity maps can be used to visualize descriptors that can be divided into
2075atomic contributions.
2076
2077The Gasteiger partial charges can be visualized as (using a different color scheme):
2078
2079.. doctest::
2080
2081  >>> from rdkit.Chem.Draw import SimilarityMaps
2082  >>> mol = Chem.MolFromSmiles('COc1cccc2cc(C(=O)NCCCCN3CCN(c4cccc5nccnc54)CC3)oc21')
2083  >>> AllChem.ComputeGasteigerCharges(mol)
2084  >>> contribs = [mol.GetAtomWithIdx(i).GetDoubleProp('_GasteigerCharge') for i in range(mol.GetNumAtoms())]
2085  >>> fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, contribs, colorMap='jet', contourLines=10)
2086
2087Producing this image:
2088
2089.. image:: images/similarity_map_charges.png
2090
2091Or for the Crippen contributions to logP:
2092
2093.. doctest::
2094
2095  >>> from rdkit.Chem import rdMolDescriptors
2096  >>> contribs = rdMolDescriptors._CalcCrippenContribs(mol)
2097  >>> fig = SimilarityMaps.GetSimilarityMapFromWeights(mol,[x for x,y in contribs], colorMap='jet', contourLines=10)
2098
2099Producing this image:
2100
2101.. image:: images/similarity_map_crippen.png
2102
2103Chemical Reactions
2104******************
2105
2106The RDKit also supports applying chemical reactions to sets of
2107molecules.  One way of constructing chemical reactions is to use a
2108SMARTS-based language similar to Daylight's Reaction SMILES
2109[#rxnsmarts]_:
2110
2111.. doctest::
2112
2113  >>> rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')
2114  >>> rxn
2115  <rdkit.Chem.rdChemReactions.ChemicalReaction object at 0x...>
2116  >>> rxn.GetNumProductTemplates()
2117  1
2118  >>> ps = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'),Chem.MolFromSmiles('NC')))
2119  >>> len(ps) # one entry for each possible set of products
2120  1
2121  >>> len(ps[0]) # each entry contains one molecule for each product
2122  1
2123  >>> Chem.MolToSmiles(ps[0][0])
2124  'CNC(C)=O'
2125  >>> ps = rxn.RunReactants((Chem.MolFromSmiles('C(COC(=O)O)C(=O)O'),Chem.MolFromSmiles('NC')))
2126  >>> len(ps)
2127  2
2128  >>> Chem.MolToSmiles(ps[0][0])
2129  'CNC(=O)OCCC(=O)O'
2130  >>> Chem.MolToSmiles(ps[1][0])
2131  'CNC(=O)CCOC(=O)O'
2132
2133Reactions can also be built from MDL rxn files:
2134
2135.. doctest::
2136
2137  >>> rxn = AllChem.ReactionFromRxnFile('data/AmideBond.rxn')
2138  >>> rxn.GetNumReactantTemplates()
2139  2
2140  >>> rxn.GetNumProductTemplates()
2141  1
2142  >>> ps = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'), Chem.MolFromSmiles('NC')))
2143  >>> len(ps)
2144  1
2145  >>> Chem.MolToSmiles(ps[0][0])
2146  'CNC(C)=O'
2147
2148It is, of course, possible to do reactions more complex than amide
2149bond formation:
2150
2151.. doctest::
2152
2153  >>> rxn = AllChem.ReactionFromSmarts('[C:1]=[C:2].[C:3]=[*:4][*:5]=[C:6]>>[C:1]1[C:2][C:3][*:4]=[*:5][C:6]1')
2154  >>> ps = rxn.RunReactants((Chem.MolFromSmiles('OC=C'), Chem.MolFromSmiles('C=CC(N)=C')))
2155  >>> Chem.MolToSmiles(ps[0][0])
2156  'NC1=CCCC(O)C1'
2157
2158Note in this case that there are multiple mappings of the reactants
2159onto the templates, so we have multiple product sets:
2160
2161.. doctest::
2162
2163  >>> len(ps)
2164  4
2165
2166You can use canonical smiles and a python dictionary to get the unique products:
2167
2168.. doctest::
2169
2170  >>> uniqps = {}
2171  >>> for p in ps:
2172  ...   smi = Chem.MolToSmiles(p[0])
2173  ...   uniqps[smi] = p[0]
2174  ...
2175  >>> sorted(uniqps.keys())
2176  ['NC1=CCC(O)CC1', 'NC1=CCCC(O)C1']
2177
2178Note that the molecules that are produced by the chemical reaction
2179processing code are not sanitized, as this artificial reaction
2180demonstrates:
2181
2182.. doctest::
2183
2184  >>> rxn = AllChem.ReactionFromSmarts('[C:1]=[C:2][C:3]=[C:4].[C:5]=[C:6]>>[C:1]1=[C:2][C:3]=[C:4][C:5]=[C:6]1')
2185  >>> ps = rxn.RunReactants((Chem.MolFromSmiles('C=CC=C'), Chem.MolFromSmiles('C=C')))
2186  >>> Chem.MolToSmiles(ps[0][0])
2187  'C1=CC=CC=C1'
2188  >>> p0 = ps[0][0]
2189  >>> Chem.SanitizeMol(p0)
2190  rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
2191  >>> Chem.MolToSmiles(p0)
2192  'c1ccccc1'
2193
2194Drawing Chemical Reactions
2195==========================
2196
2197The RDKit's MolDraw2D-based rendering can also handle chemical reactions.
2198
2199.. doctest::
2200
2201  >>> from rdkit.Chem import Draw
2202  >>> rxn = AllChem.ReactionFromSmarts('[cH:5]1[cH:6][c:7]2[cH:8][n:9][cH:10][cH:11][c:12]2[c:3]([cH:4]1)[C:2](=[O:1])O.[N-:13]=[N+:14]=[N-:15]>C(Cl)Cl.C(=O)(C(=O)Cl)Cl>[cH:5]1[cH:6][c:7]2[cH:8][n:9][cH:10][cH:11][c:12]2[c:3]([cH:4]1)[C:2](=[O:1])[N:13]=[N+:14]=[N-:15]',useSmiles=True)
2203  >>> d2d = Draw.MolDraw2DCairo(800,300)
2204  >>> d2d.DrawReaction(rxn)
2205  >>> png = d2d.GetDrawingText()
2206  >>> open('./images/reaction1.o.png','wb+').write(png)     # doctest: +SKIP
2207
2208the result looks like this:
2209
2210.. image:: images/reaction1.png
2211
2212There's another drawing mode which leaves out the atom map information but which
2213highlights which of the reactants atoms in the products come from:
2214
2215.. doctest::
2216
2217  >>> d2d = Draw.MolDraw2DCairo(800,300)
2218  >>> d2d.DrawReaction(rxn,highlightByReactant=True)
2219  >>> png = d2d.GetDrawingText()
2220  >>> open('./images/reaction1_highlight.o.png','wb+').write(png)    # doctest: +SKIP
2221
2222.. image:: images/reaction1_highlight.png
2223
2224As of the 2020.09 release, PNG images of reactions include metadata allowing the
2225reaction to be reconstructed:
2226
2227.. doctest::
2228
2229  >>> newRxn = AllChem.ReactionFromPNGString(png)
2230  >>> AllChem.ReactionToSmarts(newRxn)
2231  '[#6H:5]1:[#6H:6]:[#6:7]2:[#6H:8]:[#7:9]:[#6H:10]:[#6H:11]:[#6:12]:2:[#6:3](:[#6H:4]:1)-[#6:2](=[#8:1])-[#8].[#7-:13]=[#7+:14]=[#7-:15]>[#6](-[#17])-[#17].[#6](=[#8])(-[#6](=[#8])-[#17])-[#17]>[#6H:5]1:[#6H:6]:[#6:7]2:[#6H:8]:[#7:9]:[#6H:10]:[#6H:11]:[#6:12]:2:[#6:3](:[#6H:4]:1)-[#6:2](=[#8:1])-[#7:13]=[#7+:14]=[#7-:15]'
2232
2233Advanced Reaction Functionality
2234===============================
2235
2236Protecting Atoms
2237----------------
2238
2239Sometimes, particularly when working with rxn files, it is difficult
2240to express a reaction exactly enough to not end up with extraneous
2241products. The RDKit provides a method of "protecting" atoms to
2242disallow them from taking part in reactions.
2243
2244This can be demonstrated re-using the amide-bond formation reaction used
2245above. The query for amines isn't specific enough, so it matches any
2246nitrogen that has at least one H attached. So if we apply the reaction
2247to a molecule that already has an amide bond, the amide N is also
2248treated as a reaction site:
2249
2250.. doctest::
2251
2252  >>> rxn = AllChem.ReactionFromRxnFile('data/AmideBond.rxn')
2253  >>> acid = Chem.MolFromSmiles('CC(=O)O')
2254  >>> base = Chem.MolFromSmiles('CC(=O)NCCN')
2255  >>> ps = rxn.RunReactants((acid,base))
2256  >>> len(ps)
2257  2
2258  >>> Chem.MolToSmiles(ps[0][0])
2259  'CC(=O)N(CCN)C(C)=O'
2260  >>> Chem.MolToSmiles(ps[1][0])
2261  'CC(=O)NCCNC(C)=O'
2262
2263The first product corresponds to the reaction at the amide N.
2264
2265We can prevent this from happening by protecting all amide Ns. Here we
2266do it with a substructure query that matches amides and thioamides and
2267then set the "_protected" property on matching atoms:
2268
2269.. doctest::
2270
2271  >>> amidep = Chem.MolFromSmarts('[N;$(NC=[O,S])]')
2272  >>> for match in base.GetSubstructMatches(amidep):
2273  ...     base.GetAtomWithIdx(match[0]).SetProp('_protected','1')
2274
2275
2276Now the reaction only generates a single product:
2277
2278.. doctest::
2279
2280  >>> ps = rxn.RunReactants((acid,base))
2281  >>> len(ps)
2282  1
2283  >>> Chem.MolToSmiles(ps[0][0])
2284  'CC(=O)NCCNC(C)=O'
2285
2286
2287Recap Implementation
2288====================
2289
2290Associated with the chemical reaction functionality is an
2291implementation of the Recap algorithm. [#lewell]_ Recap uses a set of
2292chemical transformations mimicking common reactions carried out in the
2293lab in order to decompose a molecule into a series of reasonable
2294fragments.
2295
2296The RDKit :py:mod:`rdkit.Chem.Recap` implementation keeps track of the hierarchy of
2297transformations that were applied:
2298
2299.. doctest::
2300
2301  >>> from rdkit import Chem
2302  >>> from rdkit.Chem import Recap
2303  >>> m = Chem.MolFromSmiles('c1ccccc1OCCOC(=O)CC')
2304  >>> hierarch = Recap.RecapDecompose(m)
2305  >>> type(hierarch)
2306  <class 'rdkit.Chem.Recap.RecapHierarchyNode'>
2307
2308The hierarchy is rooted at the original molecule:
2309
2310.. doctest::
2311
2312  >>> hierarch.smiles
2313  'CCC(=O)OCCOc1ccccc1'
2314
2315and each node tracks its children using a dictionary keyed by SMILES:
2316
2317.. doctest::
2318
2319  >>> ks=hierarch.children.keys()
2320  >>> sorted(ks)
2321  ['*C(=O)CC', '*CCOC(=O)CC', '*CCOc1ccccc1', '*OCCOc1ccccc1', '*c1ccccc1']
2322
2323The nodes at the bottom of the hierarchy (the leaf nodes) are easily
2324accessible, also as a dictionary keyed by SMILES:
2325
2326.. doctest::
2327
2328  >>> ks=hierarch.GetLeaves().keys()
2329  >>> ks=sorted(ks)
2330  >>> ks
2331  ['*C(=O)CC', '*CCO*', '*CCOc1ccccc1', '*c1ccccc1']
2332
2333Notice that dummy atoms are used to mark points where the molecule was fragmented.
2334
2335The nodes themselves have associated molecules:
2336
2337.. doctest::
2338
2339  >>> leaf = hierarch.GetLeaves()[ks[0]]
2340  >>> Chem.MolToSmiles(leaf.mol)
2341  '*C(=O)CC'
2342
2343
2344BRICS Implementation
2345====================
2346
2347The RDKit also provides an implementation of the BRICS
2348algorithm. [#degen]_ BRICS provides another
2349method for fragmenting molecules along synthetically accessible bonds:
2350
2351.. doctest::
2352
2353  >>> from rdkit.Chem import BRICS
2354  >>> with Chem.SDMolSupplier('data/cdk2.sdf') as cdk2mols:
2355  ...   m1 = cdk2mols[0]
2356  ...   m2 = cdk2mols[20]
2357  >>> sorted(BRICS.BRICSDecompose(m1))
2358  ['[14*]c1nc(N)nc2[nH]cnc12', '[3*]O[3*]', '[4*]CC(=O)C(C)C']
2359  >>> sorted(BRICS.BRICSDecompose(m2))
2360  ['[1*]C(=O)NN(C)C', '[14*]c1[nH]nc2c1C(=O)c1c([16*])cccc1-2', '[16*]c1ccc([16*])cc1', '[3*]OC', '[5*]N[5*]']
2361
2362Notice that RDKit BRICS implementation returns the unique fragments
2363generated from a molecule and that the dummy atoms are tagged to
2364indicate which type of reaction applies.
2365
2366It's quite easy to generate the list of all fragments for a
2367group of molecules:
2368
2369.. doctest::
2370
2371  >>> allfrags=set()
2372  >>> with Chem.SDMolSupplier('data/cdk2.sdf') as cdk2mols:
2373  ...   for m in cdk2mols:
2374  ...      if m is None:
2375  ...        continue
2376  ...      pieces = BRICS.BRICSDecompose(m)
2377  ...      allfrags.update(pieces)
2378  >>> len(allfrags)
2379  90
2380  >>> sorted(allfrags)[:5]
2381  ['NS(=O)(=O)c1ccc(N/N=C2\\C(=O)Nc3ccc(Br)cc32)cc1', '[1*]C(=O)C(C)C', '[1*]C(=O)NN(C)C', '[1*]C(=O)NN1CC[NH+](C)CC1', '[1*]C(C)=O']
2382
2383The BRICS module also provides an option to apply the BRICS rules to a
2384set of fragments to create new molecules:
2385
2386.. doctest::
2387
2388  >>> import random
2389  >>> random.seed(127)
2390  >>> fragms = [Chem.MolFromSmiles(x) for x in sorted(allfrags)]
2391  >>> random.seed(0xf00d)
2392  >>> ms = BRICS.BRICSBuild(fragms)
2393
2394The result is a generator object:
2395
2396.. doctest::
2397
2398  >>> ms
2399  <generator object BRICSBuild at 0x...>
2400
2401That returns molecules on request:
2402
2403.. doctest::
2404
2405  >>> prods = [next(ms) for x in range(10)]
2406  >>> prods[0]
2407  <rdkit.Chem.rdchem.Mol object at 0x...>
2408
2409The molecules have not been sanitized, so it's a good idea to at least update the valences before continuing:
2410
2411.. doctest::
2412
2413  >>> for prod in prods:
2414  ...     prod.UpdatePropertyCache(strict=False)
2415  ...
2416  >>> Chem.MolToSmiles(prods[0],True)
2417  'CC(C)C(=O)N/C=C1\\C(=O)Nc2ccc3ncsc3c21'
2418  >>> Chem.MolToSmiles(prods[1],True)
2419  'CC(C)C(=O)N/C=C1\\C(=O)Nc2ccccc21'
2420  >>> Chem.MolToSmiles(prods[2],True)
2421  'CNC(=O)C(C)C'
2422
2423
2424By default those results come back in a random order (technically the example
2425above will always return the same results since we seeded Python's random number
2426generator just before calling BRICSBuild()). If you want the results to be
2427returned in a consistent order use the scrambleReagents argument:
2428
2429  >>> ms = BRICS.BRICSBuild(fragms, scrambleReagents=False)
2430  >>> prods = [next(ms) for x in range(10)]
2431  >>> for prod in prods:
2432  ...     prod.UpdatePropertyCache(strict=False)
2433  ...
2434  >>> Chem.MolToSmiles(prods[0],True)
2435  'COC(=O)C(C)C'
2436  >>> Chem.MolToSmiles(prods[1],True)
2437  'CNC(=O)C(C)C'
2438  >>> Chem.MolToSmiles(prods[2],True)
2439  'CC(C)C(=O)NC(=N)N'
2440
2441Other fragmentation approaches
2442==============================
2443
2444In addition to the methods described above, the RDKit provide a very
2445flexible generic function for fragmenting molecules along
2446user-specified bonds.
2447
2448Here's a quick demonstration of using that to break all bonds between
2449atoms in rings and atoms not in rings. We start by finding all the
2450atom pairs:
2451
2452.. doctest::
2453
2454  >>> m = Chem.MolFromSmiles('CC1CC(O)C1CCC1CC1')
2455  >>> bis = m.GetSubstructMatches(Chem.MolFromSmarts('[!R][R]'))
2456  >>> bis
2457  ((0, 1), (4, 3), (6, 5), (7, 8))
2458
2459then we get the corresponding bond indices:
2460
2461.. doctest::
2462
2463  >>> bs = [m.GetBondBetweenAtoms(x,y).GetIdx() for x,y in bis]
2464  >>> bs
2465  [0, 3, 5, 7]
2466
2467then we use those bond indices as input to the fragmentation function:
2468
2469.. doctest::
2470
2471  >>> nm = Chem.FragmentOnBonds(m,bs)
2472
2473the output is a molecule that has dummy atoms marking the places where
2474bonds were broken:
2475
2476.. doctest::
2477
2478  >>> Chem.MolToSmiles(nm,True)
2479  '*C1CC([4*])C1[6*].[1*]C.[3*]O.[5*]CC[8*].[7*]C1CC1'
2480
2481By default the attachment points are labelled (using isotopes) with
2482the index of the atom that was removed. We can also provide our own set of
2483atom labels in the form of pairs of unsigned integers. The first value
2484in each pair is used as the label for the dummy that replaces the
2485bond's begin atom, the second value in each pair is for the dummy that
2486replaces the bond's end atom. Here's an example, repeating the
2487analysis above and marking the positions where the non-ring atoms were
2488with the label 10 and marking the positions where the ring atoms were
2489with label 1:
2490
2491.. doctest::
2492
2493  >>> bis = m.GetSubstructMatches(Chem.MolFromSmarts('[!R][R]'))
2494  >>> bs = []
2495  >>> labels=[]
2496  >>> for bi in bis:
2497  ...    b = m.GetBondBetweenAtoms(bi[0],bi[1])
2498  ...    if b.GetBeginAtomIdx()==bi[0]:
2499  ...        labels.append((10,1))
2500  ...    else:
2501  ...        labels.append((1,10))
2502  ...    bs.append(b.GetIdx())
2503  >>> nm = Chem.FragmentOnBonds(m,bs,dummyLabels=labels)
2504  >>> Chem.MolToSmiles(nm,True)
2505  '[1*]C.[1*]CC[1*].[1*]O.[10*]C1CC([10*])C1[10*].[10*]C1CC1'
2506
2507
2508Chemical Features and Pharmacophores
2509************************************
2510
2511
2512Chemical Features
2513=================
2514
2515Chemical features in the RDKit are defined using a SMARTS-based feature
2516definition language (described in detail in the RDKit book). To identify
2517chemical features in molecules, you first must build a feature factory:
2518
2519.. doctest::
2520
2521  >>> from rdkit import Chem
2522  >>> from rdkit.Chem import ChemicalFeatures
2523  >>> from rdkit import RDConfig
2524  >>> import os
2525  >>> fdefName = os.path.join(RDConfig.RDDataDir,'BaseFeatures.fdef')
2526  >>> factory = ChemicalFeatures.BuildFeatureFactory(fdefName)
2527
2528and then use the factory to search for features:
2529
2530.. doctest::
2531
2532  >>> m = Chem.MolFromSmiles('OCc1ccccc1CN')
2533  >>> feats = factory.GetFeaturesForMol(m)
2534  >>> len(feats)
2535  8
2536
2537The individual features carry information about their family (e.g. donor,
2538acceptor, etc.), type (a more detailed description), and the atom(s) that is/are
2539associated with the feature:
2540
2541.. doctest::
2542
2543  >>> feats[0].GetFamily()
2544  'Donor'
2545  >>> feats[0].GetType()
2546  'SingleAtomDonor'
2547  >>> feats[0].GetAtomIds()
2548  (0,)
2549  >>> feats[4].GetFamily()
2550  'Aromatic'
2551  >>> feats[4].GetAtomIds()
2552  (2, 3, 4, 5, 6, 7)
2553
2554If the molecule has coordinates, then the features will also have reasonable locations:
2555
2556.. doctest::
2557
2558  >>> from rdkit.Chem import AllChem
2559  >>> AllChem.Compute2DCoords(m)
2560  0
2561  >>> feats[0].GetPos()
2562  <rdkit.Geometry.rdGeometry.Point3D object at 0x...>
2563  >>> list(feats[0].GetPos())
2564  [2.07..., -2.335..., 0.0]
2565
2566
25672D Pharmacophore Fingerprints
2568=============================
2569
2570Combining a set of chemical features with the 2D (topological)
2571distances between them gives a 2D pharmacophore.  When the distances
2572are binned, unique integer ids can be assigned to each of these
2573pharmacophores and they can be stored in a fingerprint.  Details of
2574the encoding are in the :doc:`RDKit_Book`.
2575
2576Generating pharmacophore fingerprints requires chemical features
2577generated via the usual RDKit feature-typing mechanism:
2578
2579.. doctest::
2580
2581  >>> from rdkit import Chem
2582  >>> from rdkit.Chem import ChemicalFeatures
2583  >>> fdefName = 'data/MinimalFeatures.fdef'
2584  >>> featFactory = ChemicalFeatures.BuildFeatureFactory(fdefName)
2585
2586The fingerprints themselves are calculated using a signature
2587(fingerprint) factory, which keeps track of all the parameters
2588required to generate the pharmacophore:
2589
2590.. doctest::
2591
2592  >>> from rdkit.Chem.Pharm2D.SigFactory import SigFactory
2593  >>> sigFactory = SigFactory(featFactory,minPointCount=2,maxPointCount=3)
2594  >>> sigFactory.SetBins([(0,2),(2,5),(5,8)])
2595  >>> sigFactory.Init()
2596  >>> sigFactory.GetSigSize()
2597  885
2598
2599The signature factory is now ready to be used to generate
2600fingerprints, a task which is done using the
2601:py:mod:`rdkit.Chem.Pharm2D.Generate` module:
2602
2603.. doctest::
2604
2605  >>> from rdkit.Chem.Pharm2D import Generate
2606  >>> mol = Chem.MolFromSmiles('OCC(=O)CCCN')
2607  >>> fp = Generate.Gen2DFingerprint(mol,sigFactory)
2608  >>> fp
2609  <rdkit.DataStructs.cDataStructs.SparseBitVect object at 0x...>
2610  >>> len(fp)
2611  885
2612  >>> fp.GetNumOnBits()
2613  57
2614
2615Details about the bits themselves, including the features that are
2616involved and the binned distance matrix between the features, can be
2617obtained from the signature factory:
2618
2619.. doctest::
2620
2621  >>> list(fp.GetOnBits())[:5]
2622  [1, 2, 6, 7, 8]
2623  >>> sigFactory.GetBitDescription(1)
2624  'Acceptor Acceptor |0 1|1 0|'
2625  >>> sigFactory.GetBitDescription(2)
2626  'Acceptor Acceptor |0 2|2 0|'
2627  >>> sigFactory.GetBitDescription(8)
2628  'Acceptor Donor |0 2|2 0|'
2629  >>> list(fp.GetOnBits())[-5:]
2630  [704, 706, 707, 708, 714]
2631  >>> sigFactory.GetBitDescription(707)
2632  'Donor Donor PosIonizable |0 1 2|1 0 1|2 1 0|'
2633  >>> sigFactory.GetBitDescription(714)
2634  'Donor Donor PosIonizable |0 2 2|2 0 0|2 0 0|'
2635
2636For the sake of convenience (to save you from having to edit the fdef
2637file every time) it is possible to disable particular feature types
2638within the SigFactory:
2639
2640.. doctest::
2641
2642  >>> sigFactory.skipFeats=['PosIonizable']
2643  >>> sigFactory.Init()
2644  >>> sigFactory.GetSigSize()
2645  510
2646  >>> fp2 = Generate.Gen2DFingerprint(mol,sigFactory)
2647  >>> fp2.GetNumOnBits()
2648  36
2649
2650Another possible set of feature definitions for 2D pharmacophore
2651fingerprints in the RDKit are those published by Gobbi and
2652Poppinger. [#gobbi]_ The module
2653:py:mod:`rdkit.Chem.Pharm2D.Gobbi_Pharm2D` has a pre-configured signature
2654factory for these fingerprint types.  Here's an example of using it:
2655
2656.. doctest::
2657
2658  >>> from rdkit import Chem
2659  >>> from rdkit.Chem.Pharm2D import Gobbi_Pharm2D,Generate
2660  >>> m = Chem.MolFromSmiles('OCC=CC(=O)O')
2661  >>> fp = Generate.Gen2DFingerprint(m,Gobbi_Pharm2D.factory)
2662  >>> fp
2663  <rdkit.DataStructs.cDataStructs.SparseBitVect object at 0x...>
2664  >>> fp.GetNumOnBits()
2665  8
2666  >>> list(fp.GetOnBits())
2667  [23, 30, 150, 154, 157, 185, 28878, 30184]
2668  >>> Gobbi_Pharm2D.factory.GetBitDescription(157)
2669  'HA HD |0 3|3 0|'
2670  >>> Gobbi_Pharm2D.factory.GetBitDescription(30184)
2671  'HA HD HD |0 3 0|3 0 3|0 3 0|'
2672
2673
2674Molecular Fragments
2675*******************
2676
2677The RDKit contains a collection of tools for fragmenting molecules and
2678working with those fragments.  Fragments are defined to be made up of
2679a set of connected atoms that may have associated functional groups.
2680This is more easily demonstrated than explained:
2681
2682.. doctest::
2683
2684  >>> fName=os.path.join(RDConfig.RDDataDir,'FunctionalGroups.txt')
2685  >>> from rdkit.Chem import FragmentCatalog
2686  >>> fparams = FragmentCatalog.FragCatParams(1,6,fName)
2687  >>> fparams.GetNumFuncGroups()
2688  39
2689  >>> fcat=FragmentCatalog.FragCatalog(fparams)
2690  >>> fcgen=FragmentCatalog.FragCatGenerator()
2691  >>> m = Chem.MolFromSmiles('OCC=CC(=O)O')
2692  >>> fcgen.AddFragsFromMol(m,fcat)
2693  3
2694  >>> fcat.GetEntryDescription(0)
2695  'C<-O>C'
2696  >>> fcat.GetEntryDescription(1)
2697  'C=C<-C(=O)O>'
2698  >>> fcat.GetEntryDescription(2)
2699  'C<-C(=O)O>=CC<-O>'
2700
2701The fragments are stored as entries in a
2702:py:class:`rdkit.Chem.rdfragcatalog.FragCatalog`.  Notice that the
2703entry descriptions include pieces in angular brackets (e.g. between
2704'<' and '>').  These describe the functional groups attached to the
2705fragment.  For example, in the above example, the catalog entry 0
2706corresponds to an ethyl fragment with an alcohol attached to one of
2707the carbons and entry 1 is an ethylene with a carboxylic acid on one
2708carbon.  Detailed information about the functional groups can be
2709obtained by asking the fragment for the ids of the functional groups
2710it contains and then looking those ids up in the
2711:py:class:`rdkit.Chem.rdfragcatalog.FragCatParams`
2712object:
2713
2714.. doctest::
2715
2716  >>> list(fcat.GetEntryFuncGroupIds(2))
2717  [34, 1]
2718  >>> fparams.GetFuncGroup(1)
2719  <rdkit.Chem.rdchem.Mol object at 0x...>
2720  >>> Chem.MolToSmarts(fparams.GetFuncGroup(1))
2721  '*-C(=O)[O&D1]'
2722  >>> Chem.MolToSmarts(fparams.GetFuncGroup(34))
2723  '*-[O&D1]'
2724  >>> fparams.GetFuncGroup(1).GetProp('_Name')
2725  '-C(=O)O'
2726  >>> fparams.GetFuncGroup(34).GetProp('_Name')
2727  '-O'
2728
2729The catalog is hierarchical: smaller fragments are combined to form
2730larger ones.  From a small fragment, one can find the larger fragments
2731to which it contributes using the
2732:py:meth:`rdkit.Chem.rdfragcatalog.FragCatalog.GetEntryDownIds`
2733method:
2734
2735.. doctest::
2736
2737  >>> fcat=FragmentCatalog.FragCatalog(fparams)
2738  >>> m = Chem.MolFromSmiles('OCC(NC1CC1)CCC')
2739  >>> fcgen.AddFragsFromMol(m,fcat)
2740  15
2741  >>> fcat.GetEntryDescription(0)
2742  'C<-O>C'
2743  >>> fcat.GetEntryDescription(1)
2744  'CN<-cPropyl>'
2745  >>> list(fcat.GetEntryDownIds(0))
2746  [3, 4]
2747  >>> fcat.GetEntryDescription(3)
2748  'C<-O>CC'
2749  >>> fcat.GetEntryDescription(4)
2750  'C<-O>CN<-cPropyl>'
2751
2752The fragments from multiple molecules can be added to a catalog:
2753
2754.. doctest::
2755
2756  >>> with Chem.SmilesMolSupplier('data/bzr.smi') as suppl:
2757  ...    ms = [x for x in suppl]
2758  >>> fcat=FragmentCatalog.FragCatalog(fparams)
2759  >>> for m in ms: nAdded=fcgen.AddFragsFromMol(m,fcat)
2760  >>> fcat.GetNumEntries()
2761  1169
2762  >>> fcat.GetEntryDescription(0)
2763  'Cc'
2764  >>> fcat.GetEntryDescription(100)
2765  'cc-nc(C)n'
2766
2767The fragments in a catalog are unique, so adding a molecule a second
2768time doesn't add any new entries:
2769
2770.. doctest::
2771
2772  >>> fcgen.AddFragsFromMol(ms[0],fcat)
2773  0
2774  >>> fcat.GetNumEntries()
2775  1169
2776
2777Once a :py:class:`rdkit.Chem.rdfragcatalog.FragCatalog` has been
2778generated, it can be used to fingerprint molecules:
2779
2780.. doctest::
2781
2782  >>> fpgen = FragmentCatalog.FragFPGenerator()
2783  >>> fp = fpgen.GetFPForMol(ms[8],fcat)
2784  >>> fp
2785  <rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
2786  >>> fp.GetNumOnBits()
2787  189
2788
2789The rest of the machinery associated with fingerprints can now be
2790applied to these fragment fingerprints.  For example, it's easy to
2791find the fragments that two molecules have in common by taking the
2792intersection of their fingerprints:
2793
2794.. doctest::
2795
2796  >>> fp2 = fpgen.GetFPForMol(ms[7],fcat)
2797  >>> andfp = fp&fp2
2798  >>> obl = list(andfp.GetOnBits())
2799  >>> fcat.GetEntryDescription(obl[-1])
2800  'ccc(cc)NC<=O>'
2801  >>> fcat.GetEntryDescription(obl[-5])
2802  'c<-X>ccc(N)cc'
2803
2804or we can find the fragments that distinguish one molecule from
2805another:
2806
2807.. doctest::
2808
2809  >>> combinedFp=fp&(fp^fp2) # can be more efficient than fp&(!fp2)
2810  >>> obl = list(combinedFp.GetOnBits())
2811  >>> fcat.GetEntryDescription(obl[-1])
2812  'cccc(N)cc'
2813
2814Or we can use the bit ranking functionality from the
2815:py:class:`rdkit.ML.InfoTheory.rdInfoTheory.InfoBitRanker` class to identify fragments
2816that distinguish actives from inactives:
2817
2818.. doctest::
2819
2820  >>> with Chem.SDMolSupplier('data/bzr.sdf') as suppl:
2821  ...    sdms = [x for x in suppl]
2822  >>> fps = [fpgen.GetFPForMol(x,fcat) for x in sdms]
2823  >>> from rdkit.ML.InfoTheory import InfoBitRanker
2824  >>> ranker = InfoBitRanker(len(fps[0]),2)
2825  >>> acts = [x.GetDoubleProp('ACTIVITY') for x in sdms]
2826  >>> for i,fp in enumerate(fps):
2827  ...   act = int(acts[i]>7)
2828  ...   ranker.AccumulateVotes(fp,act)
2829  ...
2830  >>> top5 = ranker.GetTopN(5)
2831  >>> for id,gain,n0,n1 in top5:
2832  ...   print(int(id),'%.3f'%gain,int(n0),int(n1))
2833  ...
2834  702 0.081 20 17
2835  328 0.073 23 25
2836  341 0.073 30 43
2837  173 0.073 30 43
2838  1034 0.069 5 53
2839
2840The columns above are: bitId, infoGain, nInactive, nActive. Note that
2841this approach isn't particularly effective for this artificial
2842example.
2843
2844
2845R-Group Decomposition
2846*********************
2847
2848Let's look at how it works. We'll read in a group of molecules (these were taken
2849ChEMBL), define a core with labelled R groups, and then use the simplest call to
2850do R-group decomposition:
2851:py:func:`rdkit.Chem.rdRGroupDecomposition.RGroupDecompose`
2852
2853.. doctest::
2854
2855  >>> from rdkit import Chem
2856  >>> from rdkit.Chem import rdRGroupDecomposition as rdRGD
2857  >>> with Chem.SmilesMolSupplier('data/s1p_chembldoc89753.txt',delimiter=",",
2858  ...                              smilesColumn=9,nameColumn=10) as suppl:
2859  ...   ms = [x for x in suppl if x is not None]
2860  >>> len(ms)
2861  40
2862  >>> core = Chem.MolFromSmarts('[*:1]c1nc([*:2])on1')
2863  >>> res,unmatched = rdRGD.RGroupDecompose([core],ms,asSmiles=True)
2864  >>> unmatched
2865  []
2866  >>> len(res)
2867  40
2868  >>> res[:2]            # doctest: +NORMALIZE_WHITESPACE
2869  [{'Core': 'n1oc([*:2])nc1[*:1]', 'R1': 'O=C(O)CCCC1NCCOc2c1cccc2[*:1]', 'R2': 'CC(C)Oc1ccc([*:2])cc1Cl'},
2870   {'Core': 'n1oc([*:2])nc1[*:1]', 'R1': 'O=C(O)CCC1NCCOc2c1cccc2[*:1]', 'R2': 'CC(C)Oc1ccc([*:2])cc1Cl'}]
2871
2872The `unmatched` return value has the indices of the molecules that did not match
2873a core; in this case there are none. The other result is a list with one dict
2874for each molecule; each dict contains the core that matched the molecule (in
2875this case there was only one) and the molecule's R groups.
2876
2877As an aside, if you are a Pandas user, it's very easy to get the R-group
2878decomposition results into a DataFrame:
2879
2880.. doctest::
2881
2882  >>> import pandas as pd
2883  >>> res,unmatched = rdRGD.RGroupDecompose([core],ms,asSmiles=True,asRows=False)
2884  >>> df= pd.DataFrame(res)
2885  >>> df.head()
2886                    Core                              R1                       R2
2887  0  n1oc([*:2])nc1[*:1]   O=C(O)CCCC1NCCOc2c1cccc2[*:1]  CC(C)Oc1ccc([*:2])cc1Cl
2888  1  n1oc([*:2])nc1[*:1]    O=C(O)CCC1NCCOc2c1cccc2[*:1]  CC(C)Oc1ccc([*:2])cc1Cl
2889  2  n1oc([*:2])nc1[*:1]  O=C(O)CCC1COc2ccc([*:1])cc2CN1  CC(C)Oc1ccc([*:2])cc1Cl
2890  3  n1oc([*:2])nc1[*:1]   O=C(O)CCCC1NCCOc2c1cccc2[*:1]  CC(C)Oc1ncc([*:2])cc1Cl
2891  4  n1oc([*:2])nc1[*:1]   O=C(O)CCCC1NCCOc2c1cccc2[*:1]  CC(C)Oc1ncc([*:2])cc1Cl
2892
2893It's not necessary to label the attachment points on the core, if you leave them
2894out the code will automatically assign labels:
2895
2896.. doctest::
2897
2898  >>> core2 = Chem.MolFromSmarts('c1ncon1')
2899  >>> res,unmatched = rdRGD.RGroupDecompose([core2],ms,asSmiles=True)
2900  >>> res[:2]            # doctest: +NORMALIZE_WHITESPACE
2901  [{'Core': 'n1oc([*:1])nc1[*:2]', 'R1': 'CC(C)Oc1ccc([*:1])cc1Cl', 'R2': 'O=C(O)CCCC1NCCOc2c1cccc2[*:2]'},
2902   {'Core': 'n1oc([*:1])nc1[*:2]', 'R1': 'CC(C)Oc1ccc([*:1])cc1Cl', 'R2': 'O=C(O)CCC1NCCOc2c1cccc2[*:2]'}]
2903
2904R-group decomposition is actually pretty complex, so there's a lot more there.
2905Hopefully this is enough to get you started.
2906
2907Non-Chemical Functionality
2908**************************
2909
2910
2911Bit vectors
2912===========
2913
2914Bit vectors are containers for efficiently storing a set number of binary values, e.g. for fingerprints.
2915The RDKit includes two types of fingerprints differing in how they store the values internally; the two types are easily interconverted but are best used for different purpose:
2916
2917- SparseBitVects store only the list of bits set in the vector; they are well suited for storing very large, very sparsely occupied vectors like pharmacophore fingerprints.
2918  Some operations, such as retrieving the list of on bits, are quite fast.
2919  Others, such as negating the vector, are very, very slow.
2920
2921- ExplicitBitVects keep track of both on and off bits.
2922  They are generally faster than SparseBitVects, but require more memory to store.
2923
2924
2925Discrete value vectors
2926======================
2927
2928
29293D grids
2930========
2931
2932
2933Points
2934======
2935
2936
2937Getting Help
2938************
2939
2940There is a reasonable amount of documentation available within from the RDKit's docstrings.
2941These are accessible using Python's help command:
2942
2943.. doctest::
2944
2945  >>> m = Chem.MolFromSmiles('Cc1ccccc1')
2946  >>> m.GetNumAtoms()
2947  7
2948  >>> help(m.GetNumAtoms)
2949  Help on method GetNumAtoms:
2950  <BLANKLINE>
2951  GetNumAtoms(...) method of rdkit.Chem.rdchem.Mol instance
2952      GetNumAtoms( (Mol)arg1 [, (int)onlyHeavy=-1 [, (bool)onlyExplicit=True]]) -> int :
2953          Returns the number of atoms in the molecule.
2954  <BLANKLINE>
2955            ARGUMENTS:
2956              - onlyExplicit: (optional) include only explicit atoms (atoms in the molecular graph)
2957                              defaults to 1.
2958            NOTE: the onlyHeavy argument is deprecated
2959  <BLANKLINE>
2960  <BLANKLINE>
2961          C++ signature :
2962              int GetNumAtoms(...)
2963  <BLANKLINE>
2964  >>> m.GetNumAtoms(onlyExplicit=False)
2965  15
2966
2967When working in an environment that does command completion or tooltips, one can see the available methods quite easily.
2968Here's a sample screenshot from within the Jupyter notebook:
2969
2970.. image:: images/picture_6.png
2971
2972
2973Advanced Topics/Warnings
2974************************
2975
2976
2977Editing Molecules
2978=================
2979
2980Some of the functionality provided allows molecules to be edited “in place”:
2981
2982.. doctest::
2983
2984  >>> m = Chem.MolFromSmiles('c1ccccc1')
2985  >>> m.GetAtomWithIdx(0).SetAtomicNum(7)
2986  >>> Chem.SanitizeMol(m)
2987  rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
2988  >>> Chem.MolToSmiles(m)
2989  'c1ccncc1'
2990
2991Do not forget the sanitization step, without it one can end up with results that look ok (so long as you don't think):
2992
2993.. doctest::
2994
2995  >>> m = Chem.MolFromSmiles('c1ccccc1')
2996  >>> m.GetAtomWithIdx(0).SetAtomicNum(8)
2997  >>> Chem.MolToSmiles(m)
2998  'c1ccocc1'
2999
3000but that are, of course, complete nonsense, as sanitization will indicate:
3001
3002.. doctest::
3003
3004  >>> Chem.SanitizeMol(m)
3005  Traceback (most recent call last):
3006    File "/usr/lib/python3.6/doctest.py", line 1253, in __run
3007      compileflags, 1) in test.globs
3008    File "<doctest default[0]>", line 1, in <module>
3009      Chem.SanitizeMol(m)
3010  rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5
3011  <BLANKLINE>
3012
3013More complex transformations can be carried out using the
3014:py:class:`rdkit.Chem.rdchem.RWMol` class:
3015
3016.. doctest::
3017
3018  >>> m = Chem.MolFromSmiles('CC(=O)C=CC=C')
3019  >>> mw = Chem.RWMol(m)
3020  >>> mw.ReplaceAtom(4,Chem.Atom(7))
3021  >>> mw.AddAtom(Chem.Atom(6))
3022  7
3023  >>> mw.AddAtom(Chem.Atom(6))
3024  8
3025  >>> mw.AddBond(6,7,Chem.BondType.SINGLE)
3026  7
3027  >>> mw.AddBond(7,8,Chem.BondType.DOUBLE)
3028  8
3029  >>> mw.AddBond(8,3,Chem.BondType.SINGLE)
3030  9
3031  >>> mw.RemoveAtom(0)
3032  >>> mw.GetNumAtoms()
3033  8
3034
3035The RWMol can be used just like an ROMol:
3036
3037.. doctest::
3038
3039  >>> Chem.MolToSmiles(mw)
3040  'O=CC1=NC=CC=C1'
3041  >>> Chem.SanitizeMol(mw)
3042  rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
3043  >>> Chem.MolToSmiles(mw)
3044  'O=Cc1ccccn1'
3045
3046The RDKit also has functionality enabling batch edits of molecules which provides a
3047more efficient way to remove multiple atoms or bonds at once.
3048
3049.. doctest::
3050
3051  >>> m = Chem.MolFromSmiles('CC(=O)C=CC=C')
3052  >>> mw = Chem.RWMol(m)
3053  >>> mw.BeginBatchEdit()
3054  >>> mw.RemoveAtom(3)
3055  >>> mw.RemoveBond(1,2)  #<- these are the begin and end atoms of the bond
3056
3057None of the changes actually happen until we "commit" them:
3058.. doctest::
3059>>> Chem.MolToSmiles(mw)
3060'C=CC=CC(C)=O'
3061>>> mw.CommitBatchEdit()
3062>>> Chem.MolToSmiles(mw)
3063'C=CC.CC.O'
3064
3065You can make this more concise using a context manager, which takes care of the commit for you:
3066>>> with Chem.RWMol(m) as mw:
3067...     mw.RemoveAtom(3)
3068...     mw.RemoveBond(1,2)
3069...
3070>>> Chem.MolToSmiles(mw)
3071'C=CC.CC.O'
3072
3073It is even easier to generate nonsense using the RWMol than it
3074is with standard molecules.  If you need chemically reasonable
3075results, be certain to sanitize the results.
3076
3077Miscellaneous Tips and Hints
3078****************************
3079
3080
3081Chem vs AllChem
3082===============
3083
3084The majority of “basic” chemical functionality (e.g. reading/writing
3085molecules, substructure searching, molecular cleanup, etc.) is in the
3086:py:mod:`rdkit.Chem` module.  More advanced, or less frequently used,
3087functionality is in :py:mod:`rdkit.Chem.AllChem`.  The distinction has
3088been made to speed startup and lower import times; there's no sense in
3089loading the 2D->3D library and force field implementation if one is
3090only interested in reading and writing a couple of molecules.  If you
3091find the Chem/AllChem thing annoying or confusing, you can use
3092python's “import ... as ...” syntax to remove the irritation:
3093
3094.. doctest::
3095
3096  >>> from rdkit.Chem import AllChem as Chem
3097  >>> m = Chem.MolFromSmiles('CCC')
3098
3099
3100The SSSR Problem
3101================
3102
3103As others have ranted about with more energy and eloquence than I
3104intend to, the definition of a molecule's smallest set of smallest
3105rings is not unique.  In some high symmetry molecules, a “true” SSSR
3106will give results that are unappealing.  For example, the SSSR for
3107cubane only contains 5 rings, even though there are
3108“obviously” 6. This problem can be fixed by implementing a *small*
3109(instead of *smallest*) set of smallest rings algorithm that returns
3110symmetric results.  This is the approach that we took with the RDKit.
3111
3112Because it is sometimes useful to be able to count how many SSSR rings
3113are present in the molecule, there is a
3114:py:func:`rdkit.Chem.rdmolops.GetSSSR` function, but this only returns the
3115SSSR count, not the potentially non-unique set of rings.
3116
3117
3118List of Available Descriptors
3119*****************************
3120
3121
3122+-----------------------------------------------------+------------------------------------------------------------+----------+
3123|Descriptor/Descriptor                                |Notes                                                       | Language |
3124|Family                                               |                                                            |          |
3125+-----------------------------------------------------+------------------------------------------------------------+----------+
3126|Gasteiger/Marsili                                    |*Tetrahedron*                                               | C++      |
3127|Partial Charges                                      |**36**:3219\-28                                             |          |
3128|                                                     |(1980)                                                      |          |
3129+-----------------------------------------------------+------------------------------------------------------------+----------+
3130|BalabanJ                                             |*Chem. Phys. Lett.*                                         | Python   |
3131|                                                     |**89**:399\-404                                             |          |
3132|                                                     |(1982)                                                      |          |
3133+-----------------------------------------------------+------------------------------------------------------------+----------+
3134|BertzCT                                              |*J. Am. Chem. Soc.*                                         | Python   |
3135|                                                     |**103**:3599\-601                                           |          |
3136|                                                     |(1981)                                                      |          |
3137+-----------------------------------------------------+------------------------------------------------------------+----------+
3138|Ipc                                                  |*J. Chem. Phys.*                                            | Python   |
3139|                                                     |**67**:4517\-33                                             |          |
3140|                                                     |(1977)                                                      |          |
3141+-----------------------------------------------------+------------------------------------------------------------+----------+
3142|HallKierAlpha                                        |*Rev. Comput. Chem.*                                        | C++      |
3143|                                                     |**2**:367\-422                                              |          |
3144|                                                     |(1991)                                                      |          |
3145+-----------------------------------------------------+------------------------------------------------------------+----------+
3146|Kappa1 \- Kappa3                                     |*Rev. Comput. Chem.*                                        | C++      |
3147|                                                     |**2**:367\-422                                              |          |
3148|                                                     |(1991)                                                      |          |
3149+-----------------------------------------------------+------------------------------------------------------------+----------+
3150|Phi                                                  |New in 2021.03 release                                      | C++      |
3151|                                                     |*Quant. Struct.-Act. Rel.*                                  |          |
3152|                                                     |**8**:221\-224                                              |          |
3153|                                                     |(1989)                                                      |          |
3154+-----------------------------------------------------+------------------------------------------------------------+----------+
3155|Chi0, Chi1                                           |*Rev. Comput. Chem.*                                        | Python   |
3156|                                                     |**2**:367\-422                                              |          |
3157|                                                     |(1991)                                                      |          |
3158+-----------------------------------------------------+------------------------------------------------------------+----------+
3159|Chi0n \- Chi4n                                       |*Rev. Comput. Chem.*                                        | C++      |
3160|                                                     |**2**:367\-422                                              |          |
3161|                                                     |(1991)                                                      |          |
3162+-----------------------------------------------------+------------------------------------------------------------+----------+
3163|Chi0v \- Chi4v                                       |*Rev. Comput. Chem.*                                        | C++      |
3164|                                                     |**2**:367\-422                                              |          |
3165|                                                     |(1991)                                                      |          |
3166+-----------------------------------------------------+------------------------------------------------------------+----------+
3167|MolLogP                                              |Wildman and Crippen                                         | C++      |
3168|                                                     |*JCICS*                                                     |          |
3169|                                                     |**39**:868\-73                                              |          |
3170|                                                     |(1999)                                                      |          |
3171+-----------------------------------------------------+------------------------------------------------------------+----------+
3172|MolMR                                                |Wildman and Crippen                                         | C++      |
3173|                                                     |*JCICS*                                                     |          |
3174|                                                     |**39**:868\-73                                              |          |
3175|                                                     |(1999)                                                      |          |
3176+-----------------------------------------------------+------------------------------------------------------------+----------+
3177|MolWt                                                |                                                            | C++      |
3178+-----------------------------------------------------+------------------------------------------------------------+----------+
3179|ExactMolWt                                           |                                                            | C++      |
3180+-----------------------------------------------------+------------------------------------------------------------+----------+
3181|HeavyAtomCount                                       |                                                            | C++      |
3182+-----------------------------------------------------+------------------------------------------------------------+----------+
3183|HeavyAtomMolWt                                       |                                                            | C++      |
3184+-----------------------------------------------------+------------------------------------------------------------+----------+
3185|NHOHCount                                            |                                                            | C++      |
3186+-----------------------------------------------------+------------------------------------------------------------+----------+
3187|NOCount                                              |                                                            | C++      |
3188+-----------------------------------------------------+------------------------------------------------------------+----------+
3189|NumHAcceptors                                        |                                                            | C++      |
3190+-----------------------------------------------------+------------------------------------------------------------+----------+
3191|NumHDonors                                           |                                                            | C++      |
3192+-----------------------------------------------------+------------------------------------------------------------+----------+
3193|NumHeteroatoms                                       |                                                            | C++      |
3194+-----------------------------------------------------+------------------------------------------------------------+----------+
3195|NumRotatableBonds                                    |                                                            | C++      |
3196+-----------------------------------------------------+------------------------------------------------------------+----------+
3197|NumValenceElectrons                                  |                                                            | C++      |
3198+-----------------------------------------------------+------------------------------------------------------------+----------+
3199|NumAmideBonds                                        |                                                            | C++      |
3200+-----------------------------------------------------+------------------------------------------------------------+----------+
3201|Num{Aromatic,Saturated,Aliphatic}Rings               |                                                            | C++      |
3202+-----------------------------------------------------+------------------------------------------------------------+----------+
3203|Num{Aromatic,Saturated,Aliphatic}{Hetero,Carbo}cycles|                                                            | C++      |
3204+-----------------------------------------------------+------------------------------------------------------------+----------+
3205|RingCount                                            |                                                            | C++      |
3206+-----------------------------------------------------+------------------------------------------------------------+----------+
3207|FractionCSP3                                         |                                                            | C++      |
3208+-----------------------------------------------------+------------------------------------------------------------+----------+
3209|NumSpiroAtoms                                        |Number of spiro atoms                                       | C++      |
3210|                                                     |(atoms shared between rings that share                      |          |
3211|                                                     |exactly one atom)                                           |          |
3212+-----------------------------------------------------+------------------------------------------------------------+----------+
3213|NumBridgeheadAtoms                                   |Number of bridgehead atoms                                  | C++      |
3214|                                                     |(atoms shared between rings that share                      |          |
3215|                                                     |at least two bonds)                                         |          |
3216+-----------------------------------------------------+------------------------------------------------------------+----------+
3217|TPSA                                                 |*J. Med. Chem.*                                             | C++      |
3218|                                                     |**43**:3714\-7,                                             |          |
3219|                                                     |(2000)                                                      |          |
3220|                                                     |See the section in the RDKit book describing differences    |          |
3221|                                                     |to the original publication.                                |          |
3222+-----------------------------------------------------+------------------------------------------------------------+----------+
3223|LabuteASA                                            |*J. Mol. Graph. Mod.*                                       | C++      |
3224|                                                     |**18**:464\-77 (2000)                                       |          |
3225+-----------------------------------------------------+------------------------------------------------------------+----------+
3226|PEOE_VSA1 \- PEOE_VSA14                              |MOE\-type descriptors using partial charges                 | C++      |
3227|                                                     |and surface area contributions                              |          |
3228|                                                     |http://www.chemcomp.com/journal/vsadesc.htm                 |          |
3229+-----------------------------------------------------+------------------------------------------------------------+----------+
3230|SMR_VSA1 \- SMR_VSA10                                |MOE\-type descriptors using MR                              | C++      |
3231|                                                     |contributions and surface area                              |          |
3232|                                                     |contributions                                               |          |
3233|                                                     |http://www.chemcomp.com/journal/vsadesc.htm                 |          |
3234+-----------------------------------------------------+------------------------------------------------------------+----------+
3235|SlogP_VSA1 \- SlogP_VSA12                            |MOE\-type descriptors using LogP                            | C++      |
3236|                                                     |contributions and surface area                              |          |
3237|                                                     |contributions                                               |          |
3238|                                                     |http://www.chemcomp.com/journal/vsadesc.htm                 |          |
3239+-----------------------------------------------------+------------------------------------------------------------+----------+
3240|EState_VSA1 \- EState_VSA11                          |MOE\-type descriptors using EState indices                  | Python   |
3241|                                                     |and surface area contributions (developed                   |          |
3242|                                                     |at RD, not described in the CCG paper)                      |          |
3243+-----------------------------------------------------+------------------------------------------------------------+----------+
3244|VSA_EState1 \- VSA_EState10                          |MOE\-type descriptors using EState indices                  | Python   |
3245|                                                     |and surface area contributions (developed                   |          |
3246|                                                     |at RD, not described in the CCG paper)                      |          |
3247+-----------------------------------------------------+------------------------------------------------------------+----------+
3248|MQNs                                                 |Nguyen et al. *ChemMedChem* **4**:1803\-5                   | C++      |
3249|                                                     |(2009)                                                      |          |
3250+-----------------------------------------------------+------------------------------------------------------------+----------+
3251|Topliss fragments                                    |implemented using a set of SMARTS                           | Python   |
3252|                                                     |definitions in                                              |          |
3253|                                                     |$(RDBASE)/Data/FragmentDescriptors.csv                      |          |
3254+-----------------------------------------------------+------------------------------------------------------------+----------+
3255|Autocorr2D                                           |New in 2017.09 release. Todeschini and Consoni "Descriptors | C++      |
3256|                                                     |from Molecular Geometry" Handbook of Chemoinformatics       |          |
3257|                                                     |https://doi.org/10.1002/9783527618279.ch37                  |          |
3258+-----------------------------------------------------+------------------------------------------------------------+----------+
3259|BCUT2D                                               |New in 2020.09 release. Pearlman and Smith in "3D-QSAR and  | C++      |
3260|                                                     |Drug design: Recent Advances" (1997)                        |          |
3261+-----------------------------------------------------+------------------------------------------------------------+----------+
3262
3263
3264List of Available 3D Descriptors
3265********************************
3266
3267These all require the molecule to have a 3D conformer.
3268
3269+-----------------------------------------------------+-------------------------------------------------------------+----------+
3270|Descriptor/Descriptor                                |Notes                                                        | Language |
3271|Family                                               |                                                             |          |
3272+-----------------------------------------------------+-------------------------------------------------------------+----------+
3273|Plane of best fit (PBF)                              |Nicholas C. Firth, Nathan Brown, and Julian                  | C++      |
3274|                                                     |Blagg, *JCIM* **52**:2516\-25                                |          |
3275+-----------------------------------------------------+-------------------------------------------------------------+----------+
3276|PMI1, PMI2, PMI3                                     |Principal moments of inertia                                 | C++      |
3277+-----------------------------------------------------+-------------------------------------------------------------+----------+
3278|NPR1, NPR2                                           |Normalized principal moments ratios Sauer                    | C++      |
3279|                                                     |and Schwarz *JCIM* **43**:987\-1003 (2003)                   |          |
3280+-----------------------------------------------------+-------------------------------------------------------------+----------+
3281|Radius of gyration                                   |G. A. Arteca "Molecular Shape Descriptors"                   | C++      |
3282|                                                     |Reviews in Computational Chemistry vol 9                     |          |
3283|                                                     |https://doi.org/10.1002/9780470125861.ch5                    |          |
3284+-----------------------------------------------------+-------------------------------------------------------------+----------+
3285|Inertial shape factor                                |Todeschini and Consoni "Descriptors from Molecular Geometry" | C++      |
3286|                                                     |Handbook of Chemoinformatics                                 |          |
3287|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3288+-----------------------------------------------------+-------------------------------------------------------------+----------+
3289|Eccentricity                                         |G. A. Arteca "Molecular Shape Descriptors"                   | C++      |
3290|                                                     |Reviews in Computational Chemistry vol 9                     |          |
3291|                                                     |https://doi.org/10.1002/9780470125861.ch5                    |          |
3292+-----------------------------------------------------+-------------------------------------------------------------+----------+
3293|Asphericity                                          |A. Baumgaertner, "Shapes of flexible vesicles"               | C++      |
3294|                                                     |J. Chem. Phys. 98:7496                                       |          |
3295|                                                     |(1993)                                                       |          |
3296|                                                     |https://doi.org/10.1063/1.464689                             |          |
3297+-----------------------------------------------------+-------------------------------------------------------------+----------+
3298|Spherocity Index                                     |Todeschini and Consoni "Descriptors from Molecular Geometry" | C++      |
3299|                                                     |Handbook of Chemoinformatics                                 |          |
3300|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3301+-----------------------------------------------------+-------------------------------------------------------------+----------+
3302|Autocorr3D                                           |New in 2017.09 release. Todeschini and Consoni "Descriptors  | C++      |
3303|                                                     |from Molecular Geometry" Handbook of Chemoinformatics        |          |
3304|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3305+-----------------------------------------------------+-------------------------------------------------------------+----------+
3306|RDF                                                  |New in 2017.09 release. Todeschini and Consoni "Descriptors  | C++      |
3307|                                                     |from Molecular Geometry" Handbook of Chemoinformatics        |          |
3308|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3309+-----------------------------------------------------+-------------------------------------------------------------+----------+
3310|MORSE                                                |New in 2017.09 release. Todeschini and Consoni "Descriptors  | C++      |
3311|                                                     |from Molecular Geometry" Handbook of Chemoinformatics        |          |
3312|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3313+-----------------------------------------------------+-------------------------------------------------------------+----------+
3314|WHIM                                                 |New in 2017.09 release. Todeschini and Consoni "Descriptors  | C++      |
3315|                                                     |from Molecular Geometry" Handbook of Chemoinformatics        |          |
3316|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3317|                                                     |                                                             |          |
3318|                                                     |**Note** insufficient information is available to exactly    |          |
3319|                                                     |reproduce values from DRAGON for these descriptors. We       |          |
3320|                                                     |believe that this is close.                                  |          |
3321+-----------------------------------------------------+-------------------------------------------------------------+----------+
3322|GETAWAY                                              |New in 2017.09 release. Todeschini and Consoni "Descriptors  | C++      |
3323|                                                     |from Molecular Geometry" Handbook of Chemoinformatics        |          |
3324|                                                     |https://doi.org/10.1002/9783527618279.ch37                   |          |
3325|                                                     |                                                             |          |
3326|                                                     |**Note** insufficient information is available to exactly    |          |
3327|                                                     |reproduce values from DRAGON for these descriptors. We       |          |
3328|                                                     |believe that this is close.                                  |          |
3329+-----------------------------------------------------+-------------------------------------------------------------+----------+
3330
3331
3332
3333List of Available Fingerprints
3334******************************
3335
3336+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3337| Fingerprint Type     | Notes                                                                                                     | Language |
3338+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3339| RDKit                | a Daylight\-like fingerprint based on hashing molecular subgraphs                                         | C++      |
3340+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3341| Atom Pairs           | *JCICS* **25**:64\-73 (1985)                                                                              | C++      |
3342+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3343| Topological Torsions | *JCICS* **27**:82\-5 (1987)                                                                               | C++      |
3344+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3345| MACCS keys           | Using the 166 public keys implemented as SMARTS                                                           | C++      |
3346+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3347| Morgan/Circular      | Fingerprints based on the Morgan algorithm, similar to the ECFP/FCFP fingerprints                         | C++      |
3348|                      | *JCIM* **50**:742\-54 (2010).                                                                             |          |
3349+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3350| 2D Pharmacophore     | Uses topological distances between pharmacophoric points.                                                 | C++      |
3351+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3352| Pattern              | a topological fingerprint optimized for substructure screening                                            | C++      |
3353+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3354| Extended Reduced     | Derived from the ErG fingerprint published by Stiefl et al. in                                            | C++      |
3355| Graphs               | *JCIM* **46**:208\–20 (2006).                                                                             |          |
3356|                      | NOTE: these functions return an array of floats, not the usual fingerprint types                          |          |
3357+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3358| MHFP and SECFP       | Derived from the ErG fingerprint published by Probst et al. in                                            | C++      |
3359|                      | *J Cheminformatics* **10** (2018).                                                                        |          |
3360|                      | NOTE: these functions return different types of values                                                    |          |
3361+----------------------+-----------------------------------------------------------------------------------------------------------+----------+
3362
3363
3364Feature Definitions Used in the Morgan Fingerprints
3365***************************************************
3366
3367These are adapted from the definitions in Gobbi, A. & Poppinger, D. “Genetic optimization of combinatorial libraries.” *Biotechnology and Bioengineering* **61**, 47-54 (1998).
3368
3369+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3370| Feature  | SMARTS                                                                                                                                                                 |
3371+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3372| Donor    | ``[$([N;!H0;v3,v4&+1]),$([O,S;H1;+0]),n&H1&+0]``                                                                                                                       |
3373+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3374| Acceptor | ``[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=[O,N,P,S])]),n&H0&+0,$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]``                             |
3375+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3376| Aromatic | ``[a]``                                                                                                                                                                |
3377+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3378| Halogen  | ``[F,Cl,Br,I]``                                                                                                                                                        |
3379+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3380| Basic    | ``[#7;+,$([N;H2&+0][$([C,a]);!$([C,a](=O))]),$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]`` |
3381+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3382| Acidic   | ``[$([C,S](=[O,S,P])-[O;H1,-1])]``                                                                                                                                     |
3383+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3384
3385.. rubric:: Footnotes
3386
3387.. [#blaney] Blaney, J. M.; Dixon, J. S. "Distance Geometry in Molecular Modeling".  *Reviews in Computational Chemistry*; VCH: New York, 1994.
3388.. [#rappe] Rappé, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard III, W. A.; Skiff, W. M. "UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations". *J. Am. Chem. Soc.* **114**:10024-35 (1992) .
3389.. [#carhart] Carhart, R.E.; Smith, D.H.; Venkataraghavan R. “Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications” *J. Chem. Inf. Comp. Sci.* **25**:64-73 (1985).
3390.. [#nilakantan] Nilakantan, R.; Bauman N.; Dixon J.S.; Venkataraghavan R. “Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Desciptors.” *J. Chem.Inf. Comp. Sci.* **27**:82-5 (1987).
3391.. [#rogers] Rogers, D.; Hahn, M. “Extended-Connectivity Fingerprints.” *J. Chem. Inf. and Model.* **50**:742-54 (2010).
3392.. [#ashton] Ashton, M. et al. “Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions.” *Quantitative Structure-Activity Relationships* **21**:598-604 (2002).
3393.. [#bemis1] Bemis, G. W.; Murcko, M. A. "The Properties of Known Drugs. 1. Molecular Frameworks." *J. Med. Chem.*  **39**:2887-93 (1996).
3394.. [#lewell] Lewell, X.Q.; Judd, D.B.; Watson, S.P.; Hann, M.M. “RECAP-Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry” *J. Chem. Inf. Comp. Sci.* **38**:511-22 (1998).
3395.. [#degen] Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A; Rarey, M. "On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces." *ChemMedChem* **3**:1503–7 (2008).
3396.. [#gobbi] Gobbi, A. & Poppinger, D. "Genetic optimization of combinatorial libraries." *Biotechnology and Bioengineering* **61**:47-54 (1998).
3397.. [#rxnsmarts] A more detailed description of reaction smarts, as defined by the rdkit, is in the :doc:`RDKit_Book`.
3398.. [#mmff1] Halgren, T. A. "Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94." *J. Comp. Chem.* **17**:490–19 (1996).
3399.. [#mmff2] Halgren, T. A. "Merck molecular force field. II. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions." *J. Comp. Chem.* **17**:520–52 (1996).
3400.. [#mmff3] Halgren, T. A. "Merck molecular force field. III. Molecular geometries and vibrational frequencies for MMFF94." *J. Comp. Chem.* **17**:553–86 (1996).
3401.. [#mmff4] Halgren, T. A. & Nachbar, R. B. "Merck molecular force field. IV. conformational energies and geometries for MMFF94." *J. Comp. Chem.* **17**:587-615 (1996).
3402.. [#mmffs] Halgren, T. A. "MMFF VI. MMFF94s option for energy minimization studies." *J. Comp. Chem.* **20**:720–9 (1999).
3403.. [#riniker] Riniker, S.; Landrum, G. A. "Similarity Maps - A Visualization Strategy for Molecular Fingerprints and Machine-Learning Methods" *J. Cheminf.* **5**:43 (2013).
3404.. [#riniker2] Riniker, S.; Landrum, G. A. "Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation" *J. Chem. Inf. Comp. Sci.* **55**:2562-74 (2015)
3405
3406
3407
3408
3409License
3410*******
3411
3412.. image:: images/picture_5.png
3413
3414This document is copyright (C) 2007-2021 by Greg Landrum
3415
3416This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 License.
3417To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
3418
3419
3420The intent of this license is similar to that of the RDKit itself.
3421In simple words: “Do whatever you want with it, but please give us some credit.”
3422
3423.. |picture_0| image:: images/picture_0.png
3424  :scale: 75 %
3425
3426.. |picture_1| image:: images/picture_1.png
3427  :scale: 75 %
3428
3429.. |picture_3| image:: images/picture_3.png
3430  :scale: 75 %
3431
3432.. |picture_2| image:: images/picture_2.png
3433  :scale: 50 %
3434
3435.. |picture_4| image:: images/picture_4.png
3436  :scale: 75 %
3437