================ Basic EA Example ================ Introduction ============ This tutorial will introduce you to the basic components needed for building an EA with :mod:`stk`. You can see a basic outline of the EA in :class:`.EvolutionaryAlgorithm`. You can get all the code associated with this tutorial by running:: $ git clone https://github.com/lukasturcani/basic_ea EA Components ============= There are a couple of components of the EA, which we need to define, so that it can run. All of these provide an avenue for you to customize the algorithm. We will need to use a #. :class:`.Selector` - There are three different selectors we will need to define. One for selecting which molecules get mutated, one for selecting which molecules undergo crossover and one for selecting which molecules are part of the next generation of the EA. #. :class:`.Mutator` - Used to carry out mutation operations on molecules. #. :class:`.Crosser` - Used to carry out crossover operations on molecules. #. :class:`.FitnessCalculator` - Used to calculate the fitness value of molecules. An important class in our EA is also the :class:`.MoleculeRecord`. The :class:`.MoleculeRecord` holds a molecule being considered by the EA, as well as additional relevant information, such as its fitness value. Basic EA Loop ============= To build a really simple EA, we would begin by writing a really simple Python script, such as .. code-block:: python import stk ea = stk.EvolutionaryAlgorithm( # We will define these components later. initial_population=..., fitness_calculator=..., mutator=..., crosser=..., generation_selector=..., mutation_selector=..., crossover_selector=..., ) # Go through 50 generations of the EA. for i, generation in enumerate(ea.get_generations(50)): # The generation object gives you access to the molecules # found in the generation. You can place whatever code you like into the loop, for example, you can write each molecule in each generation to a file .. code-block:: python # Go through 50 generations of the EA. writer = stk.MolWriter() for i, generation in enumerate(ea.get_generations(50)): # Go through the molecules in the generation, and write them # to a file. for molecule_id, molecule_record in enumerate( generation.get_molecule_records() ): writer.write( molecule=molecule_record.get_molecule(), path=f'generation_{i}_molecule_{molecule_id}.mol', ) While this is a perfectly valid EA loop, we can make it a lot better. Adding a Database ----------------- One of the main things that will significantly improve our quality of life, is replacing our file writing, with a molecular database. We will use AtomLite_ for this. AtomLite_ is a molecular database which is written to a file, making it really easy to get started. .. _AtomLite: https://atomlite.readthedocs.io So let's create our database in the ``basic_ea.db`` file: .. code-block:: python import atomlite db = atomlite.Database("basic_ea.db") Now we can modify the EA loop to use the database instead of writing a bunch of files .. code-block:: python # Go through 50 generations of the EA. for generation in ea.get_generations(50): db.update_entries( ( atomlite.Entry.from_rdkit( key=stk.Smiles().get_key(record.get_molecule()), molecule=record.get_molecule().to_rdkit_mol(), ) for record in generation.get_molecule_records() ) ) Plotting the EA Progress ======================== Usually, when we run an EA, we want to be able evaluate its performance somehow. A very simple way to do this, is to plot how the fitness of the population changes with generations. You can use a :class:`.ProgressPlotter` to do this. The :class:`.ProgressPlotter` needs to know what generations it should plot, so we have to modify our loop so that it stores the fitness values in each generation .. code-block:: python fitness_values = [] for generation in ea.get_generations(50): db.update_entries( ( atomlite.Entry.from_rdkit( key=stk.Smiles().get_key(record.get_molecule()), molecule=record.get_molecule().to_rdkit_mol(), ) for record in generation.get_molecule_records() ) ) fitness_values.append( [ fitness_value.normalized for fitness_value in generation.get_fitness_values().values() ] ) Now that we have the generations, we can use a :class:`.ProgressPlotter` to plot them .. code-block:: python fitness_progress = stk.ProgressPlotter( property=fitness_values, y_label='Fitness Value', ) fitness_progress.write('fitness_progress.png') Review ====== Ok, we now have a half-decent EA loop, so let's review it. .. code-block:: python import stk import atomlite db = atomlite.Database("basic_ea") ea = stk.EvolutionaryAlgorithm( initial_population=..., fitness_calculator=..., mutator=..., crosser=..., generation_selector=..., mutation_selector=..., crossover_selector=..., ) # Go through 50 generations of the EA. fitness_values = [] for generation in ea.get_generations(50): db.update_entries( ( atomlite.Entry.from_rdkit( key=stk.Smiles().get_key(record.get_molecule()), molecule=record.get_molecule().to_rdkit_mol(), ) for record in generation.get_molecule_records() ) ) fitness_values.append( [ fitness_value.normalized for fitness_value in generation.get_fitness_values().values() ] ) fitness_progress = stk.ProgressPlotter( property=fitness_values, y_label='Fitness Value', ) fitness_progress.write('fitness_progress.png') The only thing thats left to do, is define the components of the EA that we want to use. There are a lot of options, so for the sake of example, I will just use a couple of straight-forward ones. Defining EA Components ====================== When defining EA components, there are two major questions that the user must answer: * What molecular properties do I want to optimize? * What kinds of molecular structures do I want to consider? The user answers the first question by defining a :class:`.FitnessCalculator`. The :class:`.FitnessCalculator` returns a fitness value, and this is the value that the EA will optimize. The simplest way to define a :class:`.FitnessCalculator` is to first define a simple Python function, which takes a :class:`.MoleculeRecord` instance, and returns the fitness value of that molecule. For example, in many applications it is desirable to have rigid molecules. One way to measure how rigid a molecule is, is to calculate the number of rotatable bonds it has. The more rotatable bonds, the less rigid the molecule. Therefore, if we want the EA to produce rigid molecules, our fitness function should give a high fitness to molecules with *few* rotatable bonds. We can therefore define a function which returns the inverse of the number of rotatable bonds in a molecule .. code-block:: python import rdkit.Chem.AllChem as rdkit def get_rigidity(molecule): num_rotatable_bonds = rdkit.CalcNumRotatableBonds(molecule) # Add 1 to the denominator to prevent division by 0. return 1 / (num_rotatable_bonds + 1) In addition to minimizing the number of rotatable bonds, we also want to minimize the molecular complexity, so that molecules made by the EA look at least somewhat reasonable. :mod:`rdkit` provides a function called :func:`BertzCT`, which returns a measure of molecular complexity. In addition to this, we will also count the number of rings of size less than 5, as an additional measure of complexity .. code-block:: python from rdkit.Chem.GraphDescriptors import BertzCT def get_complexity(molecule): num_bad_rings = sum( 1 for ring in rdkit.GetSymmSSSR(molecule) if len(ring) < 5 ) # Multiply by 10 and raise to the power of 2 to increase the # penalty for having many small rings. These numbers were # chosen by trial and error, so do don't worry about them # too much. return BertzCT(molecule) + 10*num_bad_rings**2 Now we can combine the rigidity and complexity into a single fitness value, that the EA can optimize. There are multiple way to do this, but an easy thing to do is just divide the rigidity by the complexity .. code-block:: python def get_fitness_value(record): rdkit_molecule = record.get_molecule().to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) # Multiply by 100 just to scale the values up a bit, which # makes for nicer plots later. return 100*( get_rigidity(rdkit_molecule) / get_complexity(rdkit_molecule) ) Now that we have our function, we can turn it into a :class:`.FitnessCalculator` by using :class:`.FitnessFunction` .. code-block:: python fitness_calculator = stk.FitnessFunction(get_fitness_value) Now we only have to answer the second question, *What kinds of molecular structures do I want to consider?* The user answers this question by defining an initial population of molecules the EA should use, as well as the mutation and crossover operations. These operations will determine which molecules the EA can construct. Lets begin by defining an initial population. The first thing we will need is a set of building blocks, with which we can build our molecules. In this, example we will use two files from https://github.com/lukasturcani/basic_ea, ``bromos.txt`` and ``fluoros.txt``. Each file contains the SMILES strings of buildings blocks, holding the respective functional groups. The building blocks in these files are randomly generated molecular graphs. We can define a function which will load the building blocks from these files .. code-block:: python def get_building_blocks(path, functional_group_factory, generator): with open(path, 'r') as f: content = f.readlines() for smiles in content: molecule = rdkit.AddHs(rdkit.MolFromSmiles(smiles)) molecule.AddConformer( conf=rdkit.Conformer(molecule.GetNumAtoms()), ) rdkit.Kekulize(molecule) building_block = stk.BuildingBlock.init_from_rdkit_mol( molecule=molecule, functional_groups=functional_group_factory, ) yield building_block.with_position_matrix( position_matrix=get_position_matrix(building_block, generator), ) def get_position_matrix(molecule, generator): position_matrix = generator.uniform( low=-500, high=500, size=(molecule.get_num_atoms(), 3), ) molecule = molecule.with_position_matrix(position_matrix) rdkit_molecule = molecule.to_rdkit_mol() rdkit.SanitizeMol(rdkit_molecule) rdkit.Compute2DCoords(rdkit_molecule) try: rdkit.MMFFOptimizeMolecule(rdkit_molecule) except Exception: pass return rdkit_molecule.GetConformer().GetPositions() Once these functions are defined, we can use :func:`get_building_block` to generate our building blocks .. code-block:: python import pathlib import numpy as np generator = np.random.default_rng(4) fluoros = tuple(get_building_blocks( # Assume that fluoros.txt is in the same folder as this # code. path=pathlib.Path(__file__).parent / 'fluoros.txt', functional_group_factory=stk.FluoroFactory(), generator=generator, )) bromos = tuple(get_building_blocks( # Assume that bromos.txt is in the same folder as this # code. path=pathlib.Path(__file__).parent / 'bromos.txt', functional_group_factory=stk.BromoFactory(), generator=generator, )) In this example, the EA will create ``AB`` dimers, using the :class:`.Linear` topology graph. The initial population of 25 such dimers can be made by taking the first 5 ``bromo`` and ``fluoro`` building blocks .. code-block:: python def get_initial_population(fluoros, bromos): for fluoro, bromo in itertools.product(fluoros, bromos): yield stk.MoleculeRecord( topology_graph=stk.polymer.Linear( building_blocks=(fluoro, bromo), repeating_unit='AB', num_repeating_units=1, ), ) initial_population = tuple(get_initial_population(fluoros[:5], bromos[:5])) Next, we can define our mutation operations. There are a multiple options, as you can see in the sidebar. One thing that you might notice immediately, is that there are multiple :class:`.Mutator` types you would like to use during the EA, but the :class:`.EvolutionaryAlgorithm` only takes a single :class:`.Mutator`. To get around this, we can use a compound :class:`.Mutator`, such as the :class:`.RandomMutator`. When you create a :class:`.RandomMutator`, you define it in terms of other mutators you want to use, for example .. code-block:: python def get_functional_group_type(building_block): functional_group, = building_block.get_functional_groups(0) return functional_group.__class__ def is_fluoro(building_block): functional_group, = building_block.get_functional_groups(0) return functional_group.__class__ is stk.Fluoro def is_bromo(building_block): functional_group, = building_block.get_functional_groups(0) return functional_group.__class__ is stk.Bromo mutator = stk.RandomMutator( mutators=( # Substitutes a building block with a fluoro group with # a random building block in fluoros. stk.RandomBuildingBlock( building_blocks=fluoros, is_replaceable=is_fluoro, random_seed=generator, ), # Substitutes a building block with a fluoro group with # a similar building block in fluoros. stk.SimilarBuildingBlock( building_blocks=fluoros, is_replaceable=is_fluoro, random_seed=generator, ), # Substitutes a building block with a bromo group with # a random building block in bromos. stk.RandomBuildingBlock( building_blocks=bromos, is_replaceable=is_bromo, random_seed=generator, ), # Substitutes a building block with a bromo group with # a similar building block in bromos. stk.SimilarBuildingBlock( building_blocks=bromos, is_replaceable=is_bromo, random_seed=generator, ), ), random_seed=generator, ) When :meth:`~.Mutator.mutate` is called on a :class:`.RandomMutator`, it randomly selects one of the mutators you gave it during initialization, and asks it to perform the mutation operation on its behalf. In this way, all of the mutators you provided it will get used during the EA. Now we can put all of these components together, and fill in the remaining ones too .. code-block:: python ea = stk.EvolutionaryAlgorithm( initial_population=tuple( get_initial_population(fluoros[:5], bromos[:5]) ), fitness_calculator=stk.FitnessFunction(get_fitness_value), mutator=stk.RandomMutator( mutators=( stk.RandomBuildingBlock( building_blocks=fluoros, is_replaceable=is_fluoro, random_seed=generator, ), stk.SimilarBuildingBlock( building_blocks=fluoros, is_replaceable=is_fluoro, random_seed=generator, ), stk.RandomBuildingBlock( building_blocks=bromos, is_replaceable=is_bromo, random_seed=generator, ), stk.SimilarBuildingBlock( building_blocks=bromos, is_replaceable=is_bromo, random_seed=generator, ), ), random_seed=generator, ), crosser=stk.GeneticRecombination( get_gene=get_functional_group_type, ), generation_selector=stk.Best( num_batches=25, duplicate_molecules=False, ), mutation_selector=stk.Roulette( num_batches=5, random_seed=generator, ), crossover_selector=stk.Roulette( num_batches=3, batch_size=2, random_seed=generator, ), ) Final Version ============= The final version of our code is best seen here_. .. _here: https://github.com/lukasturcani/basic_ea/blob/master/basic_ea.py The plot of fitness we produced looks like this: .. image:: https://i.imgur.com/9Difk6R.png which shows us that the EA was pretty good at improving the fitness value. Another thing to look at is the plot for the number of rotatable bonds .. image:: https://i.imgur.com/QJKTTEx.png Clearly, our EA was able to minimize the number of rotatable bonds to a low value across all members of the population. We can also compare the molecules in the initial population .. image:: https://i.imgur.com/C9Gisxf.png to those in the final population .. image:: https://i.imgur.com/5lq42FZ.png where the hydrogen atoms have been left out for clarity. When considering that these were chosen out of a search space of 1,000,000 randomly constructed molecular graphs, they don't look that bad, though you will probably want to a better measure of synthetic accessibility in your own EAs. Next, you can read the intermediate tutorial, which will show you some additional customization options for the EA.