Basic EA Example
Introduction
This tutorial will introduce you to the basic components needed for
building an EA with stk
. You can see a basic outline of the
EA in EvolutionaryAlgorithm
.
You can get all the code associated with this tutorial by running:
$ git clone https://github.com/lukasturcani/basic_ea
EA Components
There are a couple of components of the EA, which we need to define, so that it can run. All of these provide an avenue for you to customize the algorithm. We will need to use a
Selector
- There are three different selectors we will need to define. One for selecting which molecules get mutated, one for selecting which molecules undergo crossover and one for selecting which molecules are part of the next generation of the EA.Mutator
- Used to carry out mutation operations on molecules.Crosser
- Used to carry out crossover operations on molecules.FitnessCalculator
- Used to calculate the fitness value of molecules.
An important class in our EA is also the MoleculeRecord
.
The MoleculeRecord
holds a molecule being considered by the
EA, as well as additional relevant information, such as its fitness
value.
Basic EA Loop
To build a really simple EA, we would begin by writing a really simple Python script, such as
import stk
ea = stk.EvolutionaryAlgorithm(
# We will define these components later.
initial_population=...,
fitness_calculator=...,
mutator=...,
crosser=...,
generation_selector=...,
mutation_selector=...,
crossover_selector=...,
)
# Go through 50 generations of the EA.
for i, generation in enumerate(ea.get_generations(50)):
# The generation object gives you access to the molecules
# found in the generation.
You can place whatever code you like into the loop, for example, you can write each molecule in each generation to a file
# Go through 50 generations of the EA.
writer = stk.MolWriter()
for i, generation in enumerate(ea.get_generations(50)):
# Go through the molecules in the generation, and write them
# to a file.
for molecule_id, molecule_record in enumerate(
generation.get_molecule_records()
):
writer.write(
molecule=molecule_record.get_molecule(),
path=f'generation_{i}_molecule_{molecule_id}.mol',
)
While this is a perfectly valid EA loop, we can make it a lot better.
Adding a Database
One of the main things that will significantly improve our quality of life, is replacing our file writing, with a molecular database. We will use AtomLite for this. AtomLite is a molecular database which is written to a file, making it really easy to get started.
So let’s create our database in the basic_ea.db
file:
import atomlite
db = atomlite.Database("basic_ea.db")
Now we can modify the EA loop to use the database instead of writing a bunch of files
# Go through 50 generations of the EA.
for generation in ea.get_generations(50):
db.update_entries(
(
atomlite.Entry.from_rdkit(
key=stk.Smiles().get_key(record.get_molecule()),
molecule=record.get_molecule().to_rdkit_mol(),
)
for record in generation.get_molecule_records()
)
)
Plotting the EA Progress
Usually, when we run an EA, we want to be able evaluate its
performance somehow. A very simple way to do this, is to plot how
the fitness of the population changes with generations. You
can use a ProgressPlotter
to do this.
The ProgressPlotter
needs to know what generations it
should plot, so we have to modify our loop so that it stores the
fitness values in each generation
fitness_values = []
for generation in ea.get_generations(50):
db.update_entries(
(
atomlite.Entry.from_rdkit(
key=stk.Smiles().get_key(record.get_molecule()),
molecule=record.get_molecule().to_rdkit_mol(),
)
for record in generation.get_molecule_records()
)
)
fitness_values.append(
[
fitness_value.normalized
for fitness_value in generation.get_fitness_values().values()
]
)
Now that we have the generations, we can use a
ProgressPlotter
to plot them
fitness_progress = stk.ProgressPlotter(
property=fitness_values,
y_label='Fitness Value',
)
fitness_progress.write('fitness_progress.png')
Review
Ok, we now have a half-decent EA loop, so let’s review it.
import stk
import atomlite
db = atomlite.Database("basic_ea")
ea = stk.EvolutionaryAlgorithm(
initial_population=...,
fitness_calculator=...,
mutator=...,
crosser=...,
generation_selector=...,
mutation_selector=...,
crossover_selector=...,
)
# Go through 50 generations of the EA.
fitness_values = []
for generation in ea.get_generations(50):
db.update_entries(
(
atomlite.Entry.from_rdkit(
key=stk.Smiles().get_key(record.get_molecule()),
molecule=record.get_molecule().to_rdkit_mol(),
)
for record in generation.get_molecule_records()
)
)
fitness_values.append(
[
fitness_value.normalized
for fitness_value in generation.get_fitness_values().values()
]
)
fitness_progress = stk.ProgressPlotter(
property=fitness_values,
y_label='Fitness Value',
)
fitness_progress.write('fitness_progress.png')
The only thing thats left to do, is define the components of the EA that we want to use. There are a lot of options, so for the sake of example, I will just use a couple of straight-forward ones.
Defining EA Components
When defining EA components, there are two major questions that the user must answer:
What molecular properties do I want to optimize?
What kinds of molecular structures do I want to consider?
The user answers the first question by defining a
FitnessCalculator
. The FitnessCalculator
returns
a fitness value, and this is the value that the EA will optimize.
The simplest way to define a FitnessCalculator
is to
first define a simple Python function, which takes a
MoleculeRecord
instance, and returns the fitness
value of that molecule.
For example, in many applications it is desirable to have rigid molecules. One way to measure how rigid a molecule is, is to calculate the number of rotatable bonds it has. The more rotatable bonds, the less rigid the molecule. Therefore, if we want the EA to produce rigid molecules, our fitness function should give a high fitness to molecules with few rotatable bonds. We can therefore define a function which returns the inverse of the number of rotatable bonds in a molecule
import rdkit.Chem.AllChem as rdkit
def get_rigidity(molecule):
num_rotatable_bonds = rdkit.CalcNumRotatableBonds(molecule)
# Add 1 to the denominator to prevent division by 0.
return 1 / (num_rotatable_bonds + 1)
In addition to minimizing the number of rotatable bonds, we also
want to minimize the molecular complexity, so that molecules made
by the EA look at least somewhat reasonable. rdkit
provides a
function called BertzCT()
, which returns a measure of
molecular complexity. In addition to this, we will also count the
number of rings of size less than 5, as an additional measure of
complexity
from rdkit.Chem.GraphDescriptors import BertzCT
def get_complexity(molecule):
num_bad_rings = sum(
1 for ring in rdkit.GetSymmSSSR(molecule) if len(ring) < 5
)
# Multiply by 10 and raise to the power of 2 to increase the
# penalty for having many small rings. These numbers were
# chosen by trial and error, so do don't worry about them
# too much.
return BertzCT(molecule) + 10*num_bad_rings**2
Now we can combine the rigidity and complexity into a single fitness value, that the EA can optimize. There are multiple way to do this, but an easy thing to do is just divide the rigidity by the complexity
def get_fitness_value(record):
rdkit_molecule = record.get_molecule().to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
# Multiply by 100 just to scale the values up a bit, which
# makes for nicer plots later.
return 100*(
get_rigidity(rdkit_molecule)
/ get_complexity(rdkit_molecule)
)
Now that we have our function, we can turn it into a
FitnessCalculator
by using FitnessFunction
fitness_calculator = stk.FitnessFunction(get_fitness_value)
Now we only have to answer the second question, What kinds of molecular structures do I want to consider?
The user answers this question by defining an initial population of molecules the EA should use, as well as the mutation and crossover operations. These operations will determine which molecules the EA can construct.
Lets begin by defining an initial population. The first thing we will
need is a set of building blocks, with which we can build our
molecules. In this, example we will use two files from
https://github.com/lukasturcani/basic_ea, bromos.txt
and
fluoros.txt
. Each file contains the SMILES strings of buildings
blocks, holding the respective functional groups. The building
blocks in these files are randomly generated molecular graphs.
We can define a function which will load the building blocks from
these files
def get_building_blocks(path, functional_group_factory, generator):
with open(path, 'r') as f:
content = f.readlines()
for smiles in content:
molecule = rdkit.AddHs(rdkit.MolFromSmiles(smiles))
molecule.AddConformer(
conf=rdkit.Conformer(molecule.GetNumAtoms()),
)
rdkit.Kekulize(molecule)
building_block = stk.BuildingBlock.init_from_rdkit_mol(
molecule=molecule,
functional_groups=functional_group_factory,
)
yield building_block.with_position_matrix(
position_matrix=get_position_matrix(building_block, generator),
)
def get_position_matrix(molecule, generator):
position_matrix = generator.uniform(
low=-500,
high=500,
size=(molecule.get_num_atoms(), 3),
)
molecule = molecule.with_position_matrix(position_matrix)
rdkit_molecule = molecule.to_rdkit_mol()
rdkit.SanitizeMol(rdkit_molecule)
rdkit.Compute2DCoords(rdkit_molecule)
try:
rdkit.MMFFOptimizeMolecule(rdkit_molecule)
except Exception:
pass
return rdkit_molecule.GetConformer().GetPositions()
Once these functions are defined, we can use get_building_block()
to generate our building blocks
import pathlib
import numpy as np
generator = np.random.default_rng(4)
fluoros = tuple(get_building_blocks(
# Assume that fluoros.txt is in the same folder as this
# code.
path=pathlib.Path(__file__).parent / 'fluoros.txt',
functional_group_factory=stk.FluoroFactory(),
generator=generator,
))
bromos = tuple(get_building_blocks(
# Assume that bromos.txt is in the same folder as this
# code.
path=pathlib.Path(__file__).parent / 'bromos.txt',
functional_group_factory=stk.BromoFactory(),
generator=generator,
))
In this example, the EA will create AB
dimers, using the
Linear
topology graph. The initial population of 25 such
dimers can be made by taking the first 5 bromo
and fluoro
building blocks
def get_initial_population(fluoros, bromos):
for fluoro, bromo in itertools.product(fluoros, bromos):
yield stk.MoleculeRecord(
topology_graph=stk.polymer.Linear(
building_blocks=(fluoro, bromo),
repeating_unit='AB',
num_repeating_units=1,
),
)
initial_population = tuple(get_initial_population(fluoros[:5], bromos[:5]))
Next, we can define our mutation operations. There are a multiple
options, as you can see in the sidebar. One thing that you might
notice immediately, is that there are multiple Mutator
types you would like to use during the EA, but the
EvolutionaryAlgorithm
only takes a single
Mutator
. To get around this, we can use a compound
Mutator
, such as the RandomMutator
. When you create
a RandomMutator
, you define it in terms of other mutators
you want to use, for example
def get_functional_group_type(building_block):
functional_group, = building_block.get_functional_groups(0)
return functional_group.__class__
def is_fluoro(building_block):
functional_group, = building_block.get_functional_groups(0)
return functional_group.__class__ is stk.Fluoro
def is_bromo(building_block):
functional_group, = building_block.get_functional_groups(0)
return functional_group.__class__ is stk.Bromo
mutator = stk.RandomMutator(
mutators=(
# Substitutes a building block with a fluoro group with
# a random building block in fluoros.
stk.RandomBuildingBlock(
building_blocks=fluoros,
is_replaceable=is_fluoro,
random_seed=generator,
),
# Substitutes a building block with a fluoro group with
# a similar building block in fluoros.
stk.SimilarBuildingBlock(
building_blocks=fluoros,
is_replaceable=is_fluoro,
random_seed=generator,
),
# Substitutes a building block with a bromo group with
# a random building block in bromos.
stk.RandomBuildingBlock(
building_blocks=bromos,
is_replaceable=is_bromo,
random_seed=generator,
),
# Substitutes a building block with a bromo group with
# a similar building block in bromos.
stk.SimilarBuildingBlock(
building_blocks=bromos,
is_replaceable=is_bromo,
random_seed=generator,
),
),
random_seed=generator,
)
When mutate()
is called on a RandomMutator
,
it randomly selects one of the mutators you gave it during
initialization, and asks it to perform the mutation operation on its
behalf. In this way, all of the mutators you provided it will get used
during the EA.
Now we can put all of these components together, and fill in the remaining ones too
ea = stk.EvolutionaryAlgorithm(
initial_population=tuple(
get_initial_population(fluoros[:5], bromos[:5])
),
fitness_calculator=stk.FitnessFunction(get_fitness_value),
mutator=stk.RandomMutator(
mutators=(
stk.RandomBuildingBlock(
building_blocks=fluoros,
is_replaceable=is_fluoro,
random_seed=generator,
),
stk.SimilarBuildingBlock(
building_blocks=fluoros,
is_replaceable=is_fluoro,
random_seed=generator,
),
stk.RandomBuildingBlock(
building_blocks=bromos,
is_replaceable=is_bromo,
random_seed=generator,
),
stk.SimilarBuildingBlock(
building_blocks=bromos,
is_replaceable=is_bromo,
random_seed=generator,
),
),
random_seed=generator,
),
crosser=stk.GeneticRecombination(
get_gene=get_functional_group_type,
),
generation_selector=stk.Best(
num_batches=25,
duplicate_molecules=False,
),
mutation_selector=stk.Roulette(
num_batches=5,
random_seed=generator,
),
crossover_selector=stk.Roulette(
num_batches=3,
batch_size=2,
random_seed=generator,
),
)
Final Version
The final version of our code is best seen here.
The plot of fitness we produced looks like this:
which shows us that the EA was pretty good at improving the fitness value. Another thing to look at is the plot for the number of rotatable bonds
Clearly, our EA was able to minimize the number of rotatable bonds to a low value across all members of the population.
We can also compare the molecules in the initial population
to those in the final population
where the hydrogen atoms have been left out for clarity. When considering that these were chosen out of a search space of 1,000,000 randomly constructed molecular graphs, they don’t look that bad, though you will probably want to a better measure of synthetic accessibility in your own EAs.
Next, you can read the intermediate tutorial, which will show you some additional customization options for the EA.