By clicking “Check Writers’ Offers”, you agree to our terms of service and privacy policy. We’ll occasionally send you promo and account related email
No need to pay just yet!
About this sample
About this sample
Words: 1646 |
Pages: 4|
9 min read
Published: Apr 2, 2020
Words: 1646|Pages: 4|9 min read
Published: Apr 2, 2020
Multi-objective evolutionary algorithm is a popular approach that has been widely used in optimization problems. This research on using Multi-objective genetic algorithm for motif discovery (MOGAMOD) was the first study to apply multi-objective genetic algorithm in the motif finding problem. By maximizing three conflicting objectives: motif length, similarity, and support, the motif pattern can be achieved with high accuracy and low running time. The MOGAMOD algorithm used a popular multi-objective genetic algorithm with high performance called Non-dominated Sorting Algorithm (NSGA-II) with an adaptation to the motif finding problem to find the optimal motif. What makes NSGA-II more efficient than other algorithms is that it has two unique operations, mutation and crossover, that constantly produce different set of solutions and compare them to come up with a final optimal result. The algorithm was tested and analyzed for several samples with different properties: simple sample, corrupted sample, invaded sample, multiple pattern. The results were compared with three conventional methods, which used the statistical approaches, to show its efficiency and superiority.
Sequence motifs are defined as repeated patterns in DNA that can be found in regulatory sites on DNA. These regulatory sites and motif instances are found to be responsible for being the protein binding sties of the gene sequence in order to start the transcription process. The motif instances found in DNA sequences usually have some slight variations in their components. Finding motif instances on DNA and their regulatory regions is crucial for understanding the relationship between DNA and proteins such as nucleases and transcription factors; it is also the key factor to control gene expression and identify drug target for personalized medicine. In real world problems, DNA can contain up to 220 million base pairs of nucleotides and motif instances are usually short (30 nucleotide pairs). As a result, biological experimental approaches have been developed to extract motif instances from given DNA samples, the most popular methods are DNase foot printing, gel-shift analysis and linker-scanning. These biological approaches require a huge amount of lab work and time when the length of the sequences or the number of sequences increases.
Therefore, computational methods with statistical approaches have been developed to find motifs in given DNA samples such as Gibbs Sampler and Consensus. However, these algorithms also have high time complexities when the dimensions of the DNA array increase. They also do not consider other cases when the sample does not contain motif instances in some sequences or multiple instances exist in one sequence. In this report, a new approach using multi-objective genetic algorithm is introduced as an alternative to the typical statistical approaches. Instead of optimizing only one objective and having extremely low performance of other objectives such as similarity or length of the final motif, this new approach produces results that compromise between the objectives to solve the problems found in other methods. The multi-objective genetic algorithm is designed to maximize three properties of the final motif: similarity, length, and support.
The proposed algorithm in this paper is tested with three data sets and compared with other well-known biological methods to demonstrate its effectiveness and superiority in terms of accuracy and time complexity. It is also compared with single-objective Genetic Algorithm to provide a better understanding of the tradeoff between the problem objectives.
The Multi-objective genetic algorithm for motif discovery (MOGAMOD) was constructed based on a popular multi-objective genetic algorithm with high performance called Non-dominated Sorting Algorithm (NSGA-II). NSGA-II is a population-based often used in optimization problems to find global optima fast and effective. It is established with the foundation of Darwin’s principle of natural selection to achieve the best solution set for the given problems. The first step of a genetic algorithm is to establish a randomly-generated initial population that contains individuals representing possible solutions for the problem. In this case, an individual was created as an array that contained n genes that corresponded to n number of DNA sequences in the problem. Each gene was then split into two parts: weight (wi) and possible starting location of motif instance (si). The weight values in the array indicated the probability of the potential motif existing in the corresponding sequence, these values ranged from 0 to 1. MOGAMOD was designed to allow users to set a threshold limit of wi so that the corresponding sequence with low wi could be excluded from the motif discovery process. The starting location variables (si) indicated the potential starting position of the motif instance in that corresponding sequence, in this research, si was restricted between 7 and 64. Every individual of the population was then evaluated by a fitness function which was constructed based on three objectives: Similarity, Length of motif and Support.
In the motif discovery problem, Similarity is defined as a measure of resemblance in all motif instances of an individual. The similarity value of an individual was calculated from the position weight matrix in every sequence by taking the mean of the probability from the most popular nucleotide. This value also ranged from 0 to 1 and indicated how likely the current motif would be chosen as a motif. In the motif discovery problem, motif length is always an objective that every algorithm tries to maximize in order to reduce the probability of having false motif instances and increase the chance of getting a strong motif as a result.
The support value of an individual was determined by the number of sequences that were used to compose the candidate motif. This value was created to exclude the “corrupted” sequences that did not have any motif instances in order to achieve a strong final motif without taking into account those sequences. In conclusion, to solve the motif discovery problem, MOGAMOD was created to optimize three objectives of a final motif: Similarity – Motif length – Support. From the initial population, the stronger individuals were selected to move on to the next generation. A fitness function was created to determine whether if an objective of an individual was strong enough compared to other individuals of the current population. The individuals were first put into ranks based on their fitness by using non-dominated sort algorithm. This algorithm has a time complexity of polynomial to the second degree as described as O(M. N2), where M is the amount of objectives and N is the amount of individuals in the population. According to Deb, a solution A dominates another solution B if and only if:
Based on this rule, the individuals in the population were put into ranks where the individuals that were not dominated by any other individuals were ranked 1. The individuals in rank 1 were then removed from the population to repeat the process and determine the individuals in rank 2. The process was repeated until all the individuals of the population were ranked and sorted. After that, pairs of individuals were chosen randomly to be evaluated and moved to the next generation. The selected individual with a lower rank compared to the other individual in the pair was moved to the next generation. In case both of the individual had the same rank, their crowding distance values were calculated and the one with a larger crowding distance was chosen for next generation.
In evolutionary algorithms, genetic operators are often used to explore new solutions or cover all the possible outcomes of the problem.
Crossover: In this research, after the two individuals were randomly selected from the population, arithmetic one-point crossover method was applied with a user-defined probability to produce “offspring” that were different from their parents.
Mutation: The mutation operator in this algorithm was applied to the “offspring” by changing the value of a position in an individual with a user-defined probability. There were three types of mutation used in MOGAMOD:+ Shifting to the right+Shifting to the left+Random changing. After everything was established, the algorithm was implemented to find the most optimal set of solutions by using elitism to create new generations and it only stopped when the algorithm converged to the most optimal individual or the maximum number of generations was met.
For the first dataset, yst04r, containing 7 sequences with 1000 based pairs of nucleotides in each sequence, the algorithm was able to produce multiple solutions from one given set of inputs. A conclusion could be drawn from this first experiment that the motif length and the similarity value had an inverse relationship. The same conclusion about the relationship of the support value and the motif length. When compared to other biological methods, the MOGAMOD algorithm also demonstrated its superiority in terms of running time. For the second experiment with the dataset yst08r when the number of sequences was increased, the MOGAMOD algorithm continued to out-perform the single-objective GA when it produced better results in terms of objectives. In the last set of experiment, where the dataset array had larger dimensions in terms of number of sequences and sequence length compared to the first set of data, hmr03r, the result table indicated that when the length of the sequence was increased, the MOGAMOD algorithm was able to produce more results with similar values of objectives. It was also noted that the runtime of the algorithm did not increased even though the sequence length was increased.
MOGAMOD, a multi-objective evolutionary algorithm based on NSGA-II, was constructed to solve the motif discovery problem from a multi-objective approach. With this new algorithm, different problems with other methods such as finding motif instances in “corrupted” sequences and multiple motif instances exist in one sequence were resolved. The algorithm also demonstrated its superiority in terms of runtime compared to other methods. Finally, it provided a set of nondominated solutions for the person who is making the decision to understand the tradeoff between the three objectives: motif length, support, and similarity.
Browse our vast selection of original essay samples, each expertly formatted and styled