Intermediate EA Example
Introduction
This tutorial builds on the previous, basic, EA example. Here we will introduce a couple more features which improve the effectiveness and customizability of our EA. Namely:
- Storing Fitness Values in a Database
In the basic example, we did not store the calculated fitness or property values in our database, which is a shame, because these values are often expensive to calculate and generally useful.
- Caching Fitness Values
In the basic example, the fitness function recalculates fitness values, even if it had already calculated one for the molecule in a previous generation. In the basic example, this is not a big deal, because the fitness function has a very low computational cost. However, in many applications, the fitness function will be expensive, likely including a structure optimization step. In cases like this, it is extremely important that the fitness value is returned from a database, instead of being recalculated, whenever possible. This will greatly reduce the computational cost of our EA.
- Normalizing Fitness Values
This tutorial will introduce you to normalizing fitness values. This is an optional step in the EA, which is carried out after the fitness values of molecules have been calculated. Like fitness calculation, fitness normalization assigns fitness values to molecules. However, unlike fitness calculation, the fitness value assigned to a molecule is determined not just by the molecule itself, but also by the other molecules in the population. For example, fitness normalization can divide the fitness value of a molecule by the average fitness value in the population, in order to get the new fitness value.
- Plotting Selection
This tutorial will introduce you to the
SelectionPlotter
, which plots which molecules were selected at each generation. This is often useful for validation, analysis and debugging.
You can get all the code associated with the tutorial by running:
$ git clone https://github.com/lukasturcani/intermediate_ea
Defining a New Fitness Calculator
In the basic example, we created a FitnessCalculator
by first
defining a Python function, which we then wrapped with
FitnessFunction
. The function we defined calculated some
property values of a molecule, namely complexity, the number of small
rings and the number of rotatable bonds, and then combined them into
a single fitness value. One problem with this approach is that
the values were combined in a very ad-hoc way, which makes it
difficult to reason about the contribution of each property to the
final fitness value.
As an alternative to this approach, we can define three Python functions, each of which returns a property of the molecule whose fitness value we want
import rdkit.Chem.AllChem as rdkit
def get_num_rotatable_bonds(record):
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
return rdkit.CalcNumRotatableBonds(rdkit_molecule)
def get_complexity(record):
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
return BertzCT(rdkit_molecule)
def get_num_bad_rings(record):
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
return sum(
1
for ring in rdkit.GetSymmSSSR(rdkit_molecule)
if len(ring) < 5
)
Now, instead of defining another Python function which combines
these property values into a single value, we can create a
FitnessCalculator
which returns a tuple
of
property values as our fitness value. This is done by using a
PropertyVector
instead of FitnessFunction
import stk
fitness_calculator = stk.PropertyVector(
property_functions=(
get_num_rotatable_bonds,
get_complexity,
get_num_bad_rings,
),
)
When we run
fitness_value = fitness_calculator.get_fitness_value(some_molecule)
Our fitness_value
is a tuple
of the form
(num_rotatable_bonds, complexity, num_bad_rings)
. This is a good
start, but our fitness value must be a single number. We can achieve
this by defining a FitnessNormalizer
.
Defining a Fitness Normalizer
Fitness normalization is a process that runs after fitness calculation.
The basic idea, is that a FitnessCalculator
takes as a
parameter a single molecule and returns its fitness value. This
fitness value will always be the same for the same molecule.
After this, we optionally perform fitness normalization with a
FitnessNormalizer
. The FitnessNormalizer
takes as
a parameter the entire population and yields a new population, holding
new fitness values. When a FitnessNormalizer
assigns a
new fitness value to a molecule, its fitness value depends both
on the molecule itself, and, if desired, the fitness values of all
other molecules in the population. For example, DivideByMean
is a FitnessNormalizer
, which assigns new fitness values
according to the formula
new_fitness_value = old_fitness_value / mean_fitness_value
In order to calculate mean_fitness_value
, we have to be able to
consider all the fitness values in the population.
It is quite common to want to do multiple fitness normalizations in
sequence, and for this there is NormalizerSequence
.
NormalizerSequence
is a compound FitnessNormalizer
.
This means it is initialized with other fitness normalizers, and
when its normalize()
method is called,
it delegates the normalization to them. For example, if we want
a FitnessNormalizer
that first divides fitness values by
the mean fitness value in the population and then takes the inverse
of each fitness value in the population we could define
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.DivideByMean(),
stk.Power(-1),
),
)
For the EA in this example, we want to perform a couple of
normalization steps, recall that the initial fitness values have the
form (num_rotatable_bonds, complexity, num_bad_rings)
.
First, we will use DivideByMean
, which in cases where
the fitness value is a tuple
, divides each member of the
tuple
by its own mean. This means the number of rotatable
bonds of each molecule is divided by the mean number of rotatable
bonds in the population, the complexity of each molecule is
divided by the mean complexity in the population and so on.
After using DivideByMean
, each fitness value
is still a tuple
, but the value for each component is scaled
by the population average. This scaling is important,
because normally the different properties of a molecule have
very different orders of magnitude, which makes them very hard to
combine reasonably into a single value. However, scaling by
the population average removes differences in orders of magnitude,
and also removes the units of each quantity. This means they can
be safely combined by something like a sum. Therefore, our
initial fitness normalizer looks like this
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.DivideByMean(),
),
)
However, you might notice an issue here. We are dividing by the
mean, but the property values we are using, such as
the number of bad rings or number of rotatable bonds have values
which are allowed to be zero. This means that it’s quite possible for
the population mean to be zero. If the population mean is zero
and we divide by zero - we will have a problem. We can prevent this
by adding 1
to every element of the tuple
before
using DivideByMean
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.Add((1, 1, 1)),
stk.DivideByMean(),
),
)
Add
is a fitness normalizer, which adds a number to
every fitness value. The number can be a tuple
of
numbers, if a our fitness value is also a tuple
.
Next, we can multiply each component of the tuple
by a
different coefficient. This will make each component contribute a
different amount to the final fitness value
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.Add((1, 1, 1)),
stk.DivideByMean(),
stk.Multiply((1, 1, 1)),
),
)
In our example, we do not actually have to use Multiply
,
because all the coefficients are set to 1
. However, if we wanted
the number of bad rings to contribute twice as much to the final
fitness value as the other properties, we would have used
stk.Multiply((1, 1, 2))
Now we want to combine the elements of tuple
into a single
fitness value. We can do this by taking a sum
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.Add((1, 1, 1)),
stk.DivideByMean(),
stk.Multiply((1, 1, 1)),
stk.Sum(),
),
)
Sum
is a FitnessNormalizer
, which assumes that
each fitness value in the population is a tuple
. It then
replaces the fitness value with the sum of all elements of the
tuple
, to get the new fitness value.
Finally, we recognize that all the elements of the tuple
,
the number of rotatable bonds, complexity and the number of bad rings,
indicate a low fitness value. This means our final fitness value
should be the inverse of the sum, because as these values grow bigger,
our fitness value should become smaller
fitness_normalizer = stk.NormalizerSequence(
fitness_normalizers=(
stk.Add((1, 1, 1)),
stk.DivideByMean(),
stk.Multiply((1, 1, 1)),
stk.Sum(),
stk.Power(-1),
),
)
That’s it, our fitness normalizer will perform these steps in
order to get the final fitness values at each generation. When a
new generation is started, the fitness values of all molecules in the
population are set to the values returned by the
FitnessCalculator
, and the fitness normalization is
started from scratch. This means that the final fitness value
can be different at each generation, even though the
FitnessCalculator
always returns the same value for a given
molecule. This can happen, for example, because the mean value of each
member of the fitness tuple
can change at each generation,
based on different molecules being present in different generations.
Caching Fitness Values
We now return to our FitnessCalculator
, recall
fitness_calculator = stk.PropertyVector(
property_functions=(
get_num_rotatable_bonds,
get_complexity,
get_num_bad_rings,
),
)
One of the improvements we want to make, is to prevent the recalculation of fitness values, for molecules where a fitness value was already found. We can do this by modifying our functions to retrieve values from our database
def get_num_rotatable_bonds(database, record):
key = stk.Inchi().get_key(record.get_molecule())
path = "$.ea.num_rotatable_bonds"
num_rotatable_bonds = database.get_property(key, path)
if num_rotatable_bonds is None:
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
num_rotatable_bonds = rdkit.CalcNumRotatableBonds(rdkit_molecule)
database.set_property(key, path, num_rotatable_bonds)
return num_rotatable_bonds
def get_complexity(database, record):
key = stk.Inchi().get_key(record.get_molecule())
path = "$.ea.complexity"
complexity = database.get_property(key, path)
if complexity is None:
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
complexity = BertzCT(rdkit_molecule)
database.set_property(key, path, complexity)
return complexity
def get_num_bad_rings(database, record):
key = stk.Inchi().get_key(record.get_molecule())
path = "$.ea.num_bad_rings"
num_bad_rings = database.get_property(key, path)
if num_bad_rings is None:
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
num_bad_rings = sum(
1
for ring in rdkit.GetSymmSSSR(rdkit_molecule)
if len(ring) < 5
)
database.set_property(key, path, num_bad_rings)
return num_bad_rings
We then bind the first argument to our previously created database object.
from functools import partial
fitness_calculator = stk.PropertyVector(
property_functions=(
partial(get_num_rotatable_bonds, db),
partial(get_complexity, db),
partial(get_num_bad_rings, db),
),
)
Plotting Selection
One final thing we would like to do, is check which molecules
were selected for mutation, crossover and the next generation by
our EA at each generation. We can do this by creating a
SelectionPlotter
for each Selector
whose
selections we want to plot.
generation_selector = stk.Best(
num_batches=25,
duplicate_molecules=False,
)
stk.SelectionPlotter('generation_selection', generation_selector)
mutation_selector = stk.Roulette(
num_batches=5,
random_seed=generator,
)
stk.SelectionPlotter('mutation_selection', mutation_selector)
crossover_selector = stk.Roulette(
num_batches=3,
batch_size=2,
random_seed=generator,
)
stk.SelectionPlotter('crossover_selection', crossover_selector)
We don’t have to assign a SelectionPlotter
to a variable,
we just have to create the instance, and it will make plots of which
molecules were selected in each select()
call.
This is an example graph, showing which molecules were
selected for mutation
You will get a new graph written for every select()
call, where the base name of the file is determined by the string
we provided to the SelectionPlotter
.
Final Version
When we combine the code in this tutorial with the code from the basic tutorial, we get our final version, which is best seen here.
Here is a plot of how the fitness value change across generations
and here is the change in the number of rotatable bonds
We can see that our EA was pretty good lowering the number the number of rotatable bonds, without us having to resort to any magic numbers in our fitness function.
Finally, we can compare the initial population
to the final population
Once again, we can see that the complexity of the molecules in the final generation is reduced, when compared to the initial population.