Basic EA Example

Introduction

This tutorial will introduce you to the basic components needed for building an EA with stk. You can see a basic outline of the EA in EvolutionaryAlgorithm.

You can get all the code associated with this tutorial by running:

$ git clone https://github.com/lukasturcani/basic_ea

EA Components

There are a couple of components of the EA, which we need to define, so that it can run. All of these provide an avenue for you to customize the algorithm. We will need to use a

Selector - There are three different selectors we will need to define. One for selecting which molecules get mutated, one for selecting which molecules undergo crossover and one for selecting which molecules are part of the next generation of the EA.
Mutator - Used to carry out mutation operations on molecules.
Crosser - Used to carry out crossover operations on molecules.
FitnessCalculator - Used to calculate the fitness value of molecules.

An important class in our EA is also the MoleculeRecord. The MoleculeRecord holds a molecule being considered by the EA, as well as additional relevant information, such as its fitness value.

Basic EA Loop

To build a really simple EA, we would begin by writing a really simple Python script, such as

import stk


ea = stk.EvolutionaryAlgorithm(
    # We will define these components later.
    initial_population=...,
    fitness_calculator=...,
    mutator=...,
    crosser=...,
    generation_selector=...,
    mutation_selector=...,
    crossover_selector=...,
)

# Go through 50 generations of the EA.
for i, generation in enumerate(ea.get_generations(50)):

    # The generation object gives you access to the molecules
    # found in the generation.

You can place whatever code you like into the loop, for example, you can write each molecule in each generation to a file

# Go through 50 generations of the EA.
writer = stk.MolWriter()
for i, generation in enumerate(ea.get_generations(50)):
    # Go through the molecules in the generation, and write them
    # to a file.
    for molecule_id, molecule_record in enumerate(
        generation.get_molecule_records()
    ):
        writer.write(
            molecule=molecule_record.get_molecule(),
            path=f'generation_{i}_molecule_{molecule_id}.mol',
        )

While this is a perfectly valid EA loop, we can make it a lot better.

Adding a Database

One of the main things that will significantly improve our quality of life, is replacing our file writing, with a molecular database. We will use AtomLite for this. AtomLite is a molecular database which is written to a file, making it really easy to get started.

So let’s create our database in the basic_ea.db file:

import atomlite
db = atomlite.Database("basic_ea.db")

Now we can modify the EA loop to use the database instead of writing a bunch of files

# Go through 50 generations of the EA.
for generation in ea.get_generations(50):
    db.update_entries(
        (
            atomlite.Entry.from_rdkit(
                key=stk.Smiles().get_key(record.get_molecule()),
                molecule=record.get_molecule().to_rdkit_mol(),
            )
            for record in generation.get_molecule_records()
        )
    )

Plotting the EA Progress

Usually, when we run an EA, we want to be able evaluate its performance somehow. A very simple way to do this, is to plot how the fitness of the population changes with generations. You can use a ProgressPlotter to do this.

The ProgressPlotter needs to know what generations it should plot, so we have to modify our loop so that it stores the fitness values in each generation

fitness_values = []
for generation in ea.get_generations(50):
    db.update_entries(
        (
            atomlite.Entry.from_rdkit(
                key=stk.Smiles().get_key(record.get_molecule()),
                molecule=record.get_molecule().to_rdkit_mol(),
            )
            for record in generation.get_molecule_records()
        )
    )
    fitness_values.append(
        [
            fitness_value.normalized
            for fitness_value in generation.get_fitness_values().values()
        ]
    )

Now that we have the generations, we can use a ProgressPlotter to plot them

fitness_progress = stk.ProgressPlotter(
    property=fitness_values,
    y_label='Fitness Value',
)
fitness_progress.write('fitness_progress.png')

Review

Ok, we now have a half-decent EA loop, so let’s review it.

import stk
import atomlite

db = atomlite.Database("basic_ea")
ea = stk.EvolutionaryAlgorithm(
    initial_population=...,
    fitness_calculator=...,
    mutator=...,
    crosser=...,
    generation_selector=...,
    mutation_selector=...,
    crossover_selector=...,
)

# Go through 50 generations of the EA.
fitness_values = []
for generation in ea.get_generations(50):
    db.update_entries(
        (
            atomlite.Entry.from_rdkit(
                key=stk.Smiles().get_key(record.get_molecule()),
                molecule=record.get_molecule().to_rdkit_mol(),
            )
            for record in generation.get_molecule_records()
        )
    )
    fitness_values.append(
        [
            fitness_value.normalized
            for fitness_value in generation.get_fitness_values().values()
        ]
    )

fitness_progress = stk.ProgressPlotter(
    property=fitness_values,
    y_label='Fitness Value',
)
fitness_progress.write('fitness_progress.png')

The only thing thats left to do, is define the components of the EA that we want to use. There are a lot of options, so for the sake of example, I will just use a couple of straight-forward ones.

Defining EA Components

When defining EA components, there are two major questions that the user must answer:

What molecular properties do I want to optimize?
What kinds of molecular structures do I want to consider?

The user answers the first question by defining a FitnessCalculator. The FitnessCalculator returns a fitness value, and this is the value that the EA will optimize. The simplest way to define a FitnessCalculator is to first define a simple Python function, which takes a MoleculeRecord instance, and returns the fitness value of that molecule.

For example, in many applications it is desirable to have rigid molecules. One way to measure how rigid a molecule is, is to calculate the number of rotatable bonds it has. The more rotatable bonds, the less rigid the molecule. Therefore, if we want the EA to produce rigid molecules, our fitness function should give a high fitness to molecules with few rotatable bonds. We can therefore define a function which returns the inverse of the number of rotatable bonds in a molecule

import rdkit.Chem.AllChem as rdkit

def get_rigidity(molecule):
    num_rotatable_bonds = rdkit.CalcNumRotatableBonds(molecule)
    # Add 1 to the denominator to prevent division by 0.
    return 1 / (num_rotatable_bonds + 1)

In addition to minimizing the number of rotatable bonds, we also want to minimize the molecular complexity, so that molecules made by the EA look at least somewhat reasonable. rdkit provides a function called BertzCT(), which returns a measure of molecular complexity. In addition to this, we will also count the number of rings of size less than 5, as an additional measure of complexity

from rdkit.Chem.GraphDescriptors import BertzCT

def get_complexity(molecule):
    num_bad_rings = sum(
        1 for ring in rdkit.GetSymmSSSR(molecule) if len(ring) < 5
    )
    # Multiply by 10 and raise to the power of 2 to increase the
    # penalty for having many small rings. These numbers were
    # chosen by trial and error, so do don't worry about them
    # too much.
    return BertzCT(molecule) + 10*num_bad_rings**2

Now we can combine the rigidity and complexity into a single fitness value, that the EA can optimize. There are multiple way to do this, but an easy thing to do is just divide the rigidity by the complexity

def get_fitness_value(record):
    rdkit_molecule = record.get_molecule().to_rdkit_mol()
    rdkit.SanitizeMol(rdkit_molecule)
    # Multiply by 100 just to scale the values up a bit, which
    # makes for nicer plots later.
    return 100*(
        get_rigidity(rdkit_molecule)
        / get_complexity(rdkit_molecule)
    )

Now that we have our function, we can turn it into a FitnessCalculator by using FitnessFunction

fitness_calculator = stk.FitnessFunction(get_fitness_value)

Now we only have to answer the second question, What kinds of molecular structures do I want to consider?

The user answers this question by defining an initial population of molecules the EA should use, as well as the mutation and crossover operations. These operations will determine which molecules the EA can construct.

Lets begin by defining an initial population. The first thing we will need is a set of building blocks, with which we can build our molecules. In this, example we will use two files from https://github.com/lukasturcani/basic_ea, bromos.txt and fluoros.txt. Each file contains the SMILES strings of buildings blocks, holding the respective functional groups. The building blocks in these files are randomly generated molecular graphs. We can define a function which will load the building blocks from these files

def get_building_blocks(path, functional_group_factory, generator):
    with open(path, 'r') as f:
        content = f.readlines()

    for smiles in content:
        molecule = rdkit.AddHs(rdkit.MolFromSmiles(smiles))
        molecule.AddConformer(
            conf=rdkit.Conformer(molecule.GetNumAtoms()),
        )
        rdkit.Kekulize(molecule)
        building_block = stk.BuildingBlock.init_from_rdkit_mol(
            molecule=molecule,
            functional_groups=functional_group_factory,
        )
        yield building_block.with_position_matrix(
            position_matrix=get_position_matrix(building_block, generator),
        )


def get_position_matrix(molecule, generator):
    position_matrix = generator.uniform(
        low=-500,
        high=500,
        size=(molecule.get_num_atoms(), 3),
    )
    molecule = molecule.with_position_matrix(position_matrix)
    rdkit_molecule = molecule.to_rdkit_mol()
    rdkit.SanitizeMol(rdkit_molecule)
    rdkit.Compute2DCoords(rdkit_molecule)
    try:
        rdkit.MMFFOptimizeMolecule(rdkit_molecule)
    except Exception:
        pass
    return rdkit_molecule.GetConformer().GetPositions()

Once these functions are defined, we can use get_building_block() to generate our building blocks

import pathlib
import numpy as np

generator = np.random.default_rng(4)
fluoros = tuple(get_building_blocks(
    # Assume that fluoros.txt is in the same folder as this
    # code.
    path=pathlib.Path(__file__).parent / 'fluoros.txt',
    functional_group_factory=stk.FluoroFactory(),
    generator=generator,
))
bromos = tuple(get_building_blocks(
    # Assume that bromos.txt is in the same folder as this
    # code.
    path=pathlib.Path(__file__).parent / 'bromos.txt',
    functional_group_factory=stk.BromoFactory(),
    generator=generator,
))

In this example, the EA will create AB dimers, using the Linear topology graph. The initial population of 25 such dimers can be made by taking the first 5 bromo and fluoro building blocks

def get_initial_population(fluoros, bromos):
    for fluoro, bromo in itertools.product(fluoros, bromos):
        yield stk.MoleculeRecord(
            topology_graph=stk.polymer.Linear(
                building_blocks=(fluoro, bromo),
                repeating_unit='AB',
                num_repeating_units=1,
            ),
        )


initial_population = tuple(get_initial_population(fluoros[:5], bromos[:5]))

Next, we can define our mutation operations. There are a multiple options, as you can see in the sidebar. One thing that you might notice immediately, is that there are multiple Mutator types you would like to use during the EA, but the EvolutionaryAlgorithm only takes a single Mutator. To get around this, we can use a compound Mutator, such as the RandomMutator. When you create a RandomMutator, you define it in terms of other mutators you want to use, for example

def get_functional_group_type(building_block):
    functional_group, = building_block.get_functional_groups(0)
    return functional_group.__class__

def is_fluoro(building_block):
    functional_group, = building_block.get_functional_groups(0)
    return functional_group.__class__ is stk.Fluoro

def is_bromo(building_block):
    functional_group, = building_block.get_functional_groups(0)
    return functional_group.__class__ is stk.Bromo

mutator = stk.RandomMutator(
    mutators=(
        # Substitutes a building block with a fluoro group with
        # a random building block in fluoros.
        stk.RandomBuildingBlock(
            building_blocks=fluoros,
            is_replaceable=is_fluoro,
            random_seed=generator,
        ),
        # Substitutes a building block with a fluoro group with
        # a similar building block in fluoros.
        stk.SimilarBuildingBlock(
            building_blocks=fluoros,
            is_replaceable=is_fluoro,
            random_seed=generator,
        ),
        # Substitutes a building block with a bromo group with
        # a random building block in bromos.
        stk.RandomBuildingBlock(
            building_blocks=bromos,
            is_replaceable=is_bromo,
            random_seed=generator,
        ),
        # Substitutes a building block with a bromo group with
        # a similar building block in bromos.
        stk.SimilarBuildingBlock(
            building_blocks=bromos,
            is_replaceable=is_bromo,
            random_seed=generator,
        ),
    ),
    random_seed=generator,
)

When mutate() is called on a RandomMutator, it randomly selects one of the mutators you gave it during initialization, and asks it to perform the mutation operation on its behalf. In this way, all of the mutators you provided it will get used during the EA.

Now we can put all of these components together, and fill in the remaining ones too

ea = stk.EvolutionaryAlgorithm(
    initial_population=tuple(
        get_initial_population(fluoros[:5], bromos[:5])
    ),
    fitness_calculator=stk.FitnessFunction(get_fitness_value),
    mutator=stk.RandomMutator(
        mutators=(
            stk.RandomBuildingBlock(
                building_blocks=fluoros,
                is_replaceable=is_fluoro,
                random_seed=generator,
            ),
            stk.SimilarBuildingBlock(
                building_blocks=fluoros,
                is_replaceable=is_fluoro,
                random_seed=generator,
            ),
            stk.RandomBuildingBlock(
                building_blocks=bromos,
                is_replaceable=is_bromo,
                random_seed=generator,
            ),
            stk.SimilarBuildingBlock(
                building_blocks=bromos,
                is_replaceable=is_bromo,
                random_seed=generator,
            ),
        ),
        random_seed=generator,
    ),
    crosser=stk.GeneticRecombination(
        get_gene=get_functional_group_type,
    ),
    generation_selector=stk.Best(
        num_batches=25,
        duplicate_molecules=False,
    ),
    mutation_selector=stk.Roulette(
        num_batches=5,
        random_seed=generator,
    ),
    crossover_selector=stk.Roulette(
        num_batches=3,
        batch_size=2,
        random_seed=generator,
    ),
)

Final Version

The final version of our code is best seen here.

The plot of fitness we produced looks like this:

which shows us that the EA was pretty good at improving the fitness value. Another thing to look at is the plot for the number of rotatable bonds

Clearly, our EA was able to minimize the number of rotatable bonds to a low value across all members of the population.

We can also compare the molecules in the initial population

to those in the final population

where the hydrogen atoms have been left out for clarity. When considering that these were chosen out of a search space of 1,000,000 randomly constructed molecular graphs, they don’t look that bad, though you will probably want to a better measure of synthetic accessibility in your own EAs.

Next, you can read the intermediate tutorial, which will show you some additional customization options for the EA.