======================= Intermediate EA Example ======================= Introduction ============ This tutorial builds on the previous, basic, EA example. Here we will introduce a couple more features which improve the effectiveness and customizability of our EA. Namely: Storing Fitness Values in a Database In the basic example, we did not store the calculated fitness or property values in our database, which is a shame, because these values are often expensive to calculate and generally useful. Caching Fitness Values In the basic example, the fitness function recalculates fitness values, even if it had already calculated one for the molecule in a previous generation. In the basic example, this is not a big deal, because the fitness function has a very low computational cost. However, in many applications, the fitness function will be expensive, likely including a structure optimization step. In cases like this, it is extremely important that the fitness value is returned from a database, instead of being recalculated, whenever possible. This will greatly reduce the computational cost of our EA. Normalizing Fitness Values This tutorial will introduce you to normalizing fitness values. This is an optional step in the EA, which is carried out after the fitness values of molecules have been calculated. Like fitness calculation, fitness normalization assigns fitness values to molecules. However, unlike fitness calculation, the fitness value assigned to a molecule is determined not just by the molecule itself, but also by the other molecules in the population. For example, fitness normalization can divide the fitness value of a molecule by the average fitness value in the population, in order to get the new fitness value. Plotting Selection This tutorial will introduce you to the :class:`.SelectionPlotter`, which plots which molecules were selected at each generation. This is often useful for validation, analysis and debugging. You can get all the code associated with the tutorial by running:: $ git clone https://github.com/lukasturcani/intermediate_ea Defining a New Fitness Calculator ================================= In the basic example, we created a :class:`.FitnessCalculator` by first defining a Python function, which we then wrapped with :class:`.FitnessFunction`. The function we defined calculated some property values of a molecule, namely complexity, the number of small rings and the number of rotatable bonds, and then combined them into a single fitness value. One problem with this approach is that the values were combined in a very ad-hoc way, which makes it difficult to reason about the contribution of each property to the final fitness value. As an alternative to this approach, we can define three Python functions, each of which returns a property of the molecule whose fitness value we want .. code-block:: python import rdkit.Chem.AllChem as rdkit def get_num_rotatable_bonds(record): rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) return rdkit.CalcNumRotatableBonds(rdkit_molecule) def get_complexity(record): rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) return BertzCT(rdkit_molecule) def get_num_bad_rings(record): rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) return sum( 1 for ring in rdkit.GetSymmSSSR(rdkit_molecule) if len(ring) < 5 ) Now, instead of defining another Python function which combines these property values into a single value, we can create a :class:`.FitnessCalculator` which returns a :class:`tuple` of property values as our fitness value. This is done by using a :class:`.PropertyVector` instead of :class:`.FitnessFunction` .. code-block:: python import stk fitness_calculator = stk.PropertyVector( property_functions=( get_num_rotatable_bonds, get_complexity, get_num_bad_rings, ), ) When we run .. code-block:: python fitness_value = fitness_calculator.get_fitness_value(some_molecule) Our ``fitness_value`` is a :class:`tuple` of the form ``(num_rotatable_bonds, complexity, num_bad_rings)``. This is a good start, but our fitness value must be a single number. We can achieve this by defining a :class:`.FitnessNormalizer`. Defining a Fitness Normalizer ============================= Fitness normalization is a process that runs after fitness calculation. The basic idea, is that a :class:`.FitnessCalculator` takes as a parameter a single molecule and returns its fitness value. This fitness value will always be the same for the same molecule. After this, we optionally perform fitness normalization with a :class:`.FitnessNormalizer`. The :class:`.FitnessNormalizer` takes as a parameter the entire population and yields a new population, holding new fitness values. When a :class:`.FitnessNormalizer` assigns a new fitness value to a molecule, its fitness value depends both on the molecule itself, and, if desired, the fitness values of all other molecules in the population. For example, :class:`.DivideByMean` is a :class:`.FitnessNormalizer`, which assigns new fitness values according to the formula .. code-block:: new_fitness_value = old_fitness_value / mean_fitness_value In order to calculate ``mean_fitness_value``, we have to be able to consider all the fitness values in the population. It is quite common to want to do multiple fitness normalizations in sequence, and for this there is :class:`.NormalizerSequence`. :class:`.NormalizerSequence` is a compound :class:`.FitnessNormalizer`. This means it is initialized with other fitness normalizers, and when its :meth:`~.FitnessNormalizer.normalize` method is called, it delegates the normalization to them. For example, if we want a :class:`.FitnessNormalizer` that first divides fitness values by the mean fitness value in the population and then takes the inverse of each fitness value in the population we could define .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.DivideByMean(), stk.Power(-1), ), ) For the EA in this example, we want to perform a couple of normalization steps, recall that the initial fitness values have the form ``(num_rotatable_bonds, complexity, num_bad_rings)``. First, we will use :class:`.DivideByMean`, which in cases where the fitness value is a :class:`tuple`, divides each member of the :class:`tuple` by its own mean. This means the number of rotatable bonds of each molecule is divided by the mean number of rotatable bonds in the population, the complexity of each molecule is divided by the mean complexity in the population and so on. After using :class:`.DivideByMean`, each fitness value is still a :class:`tuple`, but the value for each component is scaled by the population average. This scaling is important, because normally the different properties of a molecule have very different orders of magnitude, which makes them very hard to combine reasonably into a single value. However, scaling by the population average removes differences in orders of magnitude, and also removes the units of each quantity. This means they can be safely combined by something like a sum. Therefore, our initial fitness normalizer looks like this .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.DivideByMean(), ), ) However, you might notice an issue here. We are dividing by the mean, but the property values we are using, such as the number of bad rings or number of rotatable bonds have values which are allowed to be zero. This means that it's quite possible for the population mean to be zero. If the population mean is zero and we divide by zero - we will have a problem. We can prevent this by adding ``1`` to every element of the :class:`tuple` before using :class:`.DivideByMean` .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.Add((1, 1, 1)), stk.DivideByMean(), ), ) :class:`.Add` is a fitness normalizer, which adds a number to every fitness value. The number can be a :class:`tuple` of numbers, if a our fitness value is also a :class:`tuple`. Next, we can multiply each component of the :class:`tuple` by a different coefficient. This will make each component contribute a different amount to the final fitness value .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.Add((1, 1, 1)), stk.DivideByMean(), stk.Multiply((1, 1, 1)), ), ) In our example, we do not actually have to use :class:`.Multiply`, because all the coefficients are set to ``1``. However, if we wanted the number of bad rings to contribute twice as much to the final fitness value as the other properties, we would have used .. code-block:: python stk.Multiply((1, 1, 2)) Now we want to combine the elements of :class:`tuple` into a single fitness value. We can do this by taking a sum .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.Add((1, 1, 1)), stk.DivideByMean(), stk.Multiply((1, 1, 1)), stk.Sum(), ), ) :class:`.Sum` is a :class:`.FitnessNormalizer`, which assumes that each fitness value in the population is a :class:`tuple`. It then replaces the fitness value with the sum of all elements of the :class:`tuple`, to get the new fitness value. Finally, we recognize that all the elements of the :class:`tuple`, the number of rotatable bonds, complexity and the number of bad rings, indicate a *low* fitness value. This means our final fitness value should be the inverse of the sum, because as these values grow bigger, our fitness value should become smaller .. code-block:: python fitness_normalizer = stk.NormalizerSequence( fitness_normalizers=( stk.Add((1, 1, 1)), stk.DivideByMean(), stk.Multiply((1, 1, 1)), stk.Sum(), stk.Power(-1), ), ) That's it, our fitness normalizer will perform these steps in order to get the final fitness values at each generation. When a new generation is started, the fitness values of all molecules in the population are set to the values returned by the :class:`.FitnessCalculator`, and the fitness normalization is started from scratch. This means that the final fitness value can be different at each generation, even though the :class:`.FitnessCalculator` always returns the same value for a given molecule. This can happen, for example, because the mean value of each member of the fitness :class:`tuple` can change at each generation, based on different molecules being present in different generations. Caching Fitness Values ====================== We now return to our :class:`.FitnessCalculator`, recall .. code-block:: python fitness_calculator = stk.PropertyVector( property_functions=( get_num_rotatable_bonds, get_complexity, get_num_bad_rings, ), ) One of the improvements we want to make, is to prevent the recalculation of fitness values, for molecules where a fitness value was already found. We can do this by modifying our functions to retrieve values from our database .. code-block:: python def get_num_rotatable_bonds(database, record): key = stk.Inchi().get_key(record.get_molecule()) path = "$.ea.num_rotatable_bonds" num_rotatable_bonds = database.get_property(key, path) if num_rotatable_bonds is None: rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) num_rotatable_bonds = rdkit.CalcNumRotatableBonds(rdkit_molecule) database.set_property(key, path, num_rotatable_bonds) return num_rotatable_bonds def get_complexity(database, record): key = stk.Inchi().get_key(record.get_molecule()) path = "$.ea.complexity" complexity = database.get_property(key, path) if complexity is None: rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) complexity = BertzCT(rdkit_molecule) database.set_property(key, path, complexity) return complexity def get_num_bad_rings(database, record): key = stk.Inchi().get_key(record.get_molecule()) path = "$.ea.num_bad_rings" num_bad_rings = database.get_property(key, path) if num_bad_rings is None: rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) num_bad_rings = sum( 1 for ring in rdkit.GetSymmSSSR(rdkit_molecule) if len(ring) < 5 ) database.set_property(key, path, num_bad_rings) return num_bad_rings We then bind the first argument to our previously created database object. .. code-block:: python from functools import partial fitness_calculator = stk.PropertyVector( property_functions=( partial(get_num_rotatable_bonds, db), partial(get_complexity, db), partial(get_num_bad_rings, db), ), ) Plotting Selection ================== One final thing we would like to do, is check which molecules were selected for mutation, crossover and the next generation by our EA at each generation. We can do this by creating a :class:`.SelectionPlotter` for each :class:`.Selector` whose selections we want to plot. .. code-block:: python generation_selector = stk.Best( num_batches=25, duplicate_molecules=False, ) stk.SelectionPlotter('generation_selection', generation_selector) mutation_selector = stk.Roulette( num_batches=5, random_seed=generator, ) stk.SelectionPlotter('mutation_selection', mutation_selector) crossover_selector = stk.Roulette( num_batches=3, batch_size=2, random_seed=generator, ) stk.SelectionPlotter('crossover_selection', crossover_selector) We don't have to assign a :class:`.SelectionPlotter` to a variable, we just have to create the instance, and it will make plots of which molecules were selected in each :meth:`~.Selector.select` call. This is an example graph, showing which molecules were selected for mutation .. image:: https://i.imgur.com/rx8qayL.png You will get a new graph written for every :meth:`~.Selector.select` call, where the base name of the file is determined by the string we provided to the :class:`.SelectionPlotter`. Final Version ============= When we combine the code in this tutorial with the code from the basic tutorial, we get our final version, which is best seen here_. .. _here: https://github.com/lukasturcani/intermediate_ea/blob/master/intermediate_ea.py Here is a plot of how the fitness value change across generations .. image:: https://i.imgur.com/IMoMF4n.png and here is the change in the number of rotatable bonds .. image:: https://i.imgur.com/pWVMB14.png We can see that our EA was pretty good lowering the number the number of rotatable bonds, without us having to resort to any magic numbers in our fitness function. Finally, we can compare the initial population .. image:: https://i.imgur.com/0px3bL0.png to the final population .. image:: https://i.imgur.com/wYHHUBP.png Once again, we can see that the complexity of the molecules in the final generation is reduced, when compared to the initial population.