Learning
The learning phase finds the closest and the furthest vectors for each label pair.
public void findAllClosestAndFurthest(List<OneVsOtherBox> labelPairs,
        SimilaritySettingsBox closest, SimilaritySettingsBox furthest) {
    for (int i = 0; i < labelPairs.size(); i++) {
        OneVsOtherBox labelPair = labelPairs.get(i);
        findClosestAndFurthest(labelPair, closest, furthest);
    }
}
Finding the closest and furthest vectors is done by a single search routine that takes a Boolean flag, which sets whether the closest or the furthest vector is looked for.
for (int i = 0; i < targets.size(); i++) {
    VectorBox target = targets.get(i);
    VectorBox closestVector = closestOrFurthest(findClosestOrFurthestThis, target, settings, closest);
    if (closestVector != null) {
        if (closest) {
            // only add the vector if it is not already used as a closest vector
            if (closestVector.getClosest() == 0) {
                closestOrFurthestVectors.add(closestVector);
            }
        }
        // the furthest branch (closest == false) is handled analogously and is not shown here
    }
}
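To make the Boolean switch concrete, the following is a minimal sketch of how a single closest-or-furthest search could look. The method name, the assumed similarity(...) helper and the comparison logic are illustrative assumptions, not the project's exact implementation.

// Sketch only: pick the most or least similar candidate, controlled by a Boolean flag.
// similarity(...) is assumed to return a distance, so smaller means closer.
private VectorBox closestOrFurthestSketch(VectorBox target, List<VectorBox> candidates,
        SimilaritySettingsBox settings, boolean closest) {
    VectorBox best = null;
    double bestScore = closest ? Double.MAX_VALUE : -Double.MAX_VALUE;
    for (VectorBox candidate : candidates) {
        double score = similarity(target, candidate, settings);
        boolean better = closest ? score < bestScore : score > bestScore;
        if (better) {
            bestScore = score;
            best = candidate;
        }
    }
    return best;
}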
The search uses a settings object that holds references to the formulas used for measuring distances; swapping these formulas is the mechanism used to try to handle the inseparable-data problem.
public double euclidean(VectorBox a, VectorBox b) {
    return v.euclideanDistance(a, b);
}

public double linear(VectorBox a, VectorBox b) {
    return v.dotProduct(a, b);
}

public double RBF(VectorBox a, VectorBox b, double ro) {
    double gamma = 1 / (2 * v.square(ro));
    return gaussian(a, b, gamma);
}

public double gaussian(VectorBox a, VectorBox b, double gamma) {
    // body missing in the original snippet; the standard Gaussian kernel is
    // exp(-gamma * ||a - b||^2)
    return Math.exp(-gamma * v.square(v.euclideanDistance(a, b)));
}
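For reference, these measures reduce to the standard formulas restated in the self-contained sketch below on plain double[] vectors; it only illustrates what the VectorBox wrappers are assumed to delegate to.

// Standard definitions of the three similarity/distance measures on raw arrays.
public final class SimilarityFormulas {
    // Euclidean distance: sqrt(sum_i (a_i - b_i)^2)
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Linear kernel: the dot product sum_i a_i * b_i
    public static double linear(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Gaussian (RBF) kernel: exp(-gamma * ||a - b||^2) with gamma = 1 / (2 * ro^2)
    public static double rbf(double[] a, double[] b, double ro) {
        double gamma = 1 / (2 * ro * ro);
        double dist = euclidean(a, b);
        return Math.exp(-gamma * dist * dist);
    }
}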
Predicting
Several modifications were made to prevent artificially high precision caused by the order in which the data is sorted and the order in which class labels are counted. Shuffling the data and counting the labels with hash maps gave the most reliable results.
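The counting code itself is not shown in the original; the following is a minimal sketch, on a plain list of label strings, of how class labels can be tallied with a HashMap so that the totals do not depend on input order. The label extraction is an assumption, not the project's exact data flow.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class LabelCounting {
    // Order-independent tally of class labels.
    public static Map<String, Integer> countLabels(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum); // start at 1, then increment
        }
        return counts;
    }
}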
For each label pair, the predictor checks which class an unknown vector might belong to.
public List<VectorBox> predict(List<VectorBox> predictData, SettingsBox settings) {
    currentSettings = settings;
    Collections.shuffle(predictData);
    return predict(predictData);
}
public List<VectorBox> predict(List<VectorBox> data) {
    resetDataLabels(data);
    for (int i = 0; i < data.size(); i++) {
        assignLabels(data.get(i));
    }
    return data;
}
The “assignLabels” method iterates through the furthest vectors and measures radii in the way described in the design section.
public void assignLabels(VectorBox unknownLabel) {
    for (int i = 0; i < labelPairs.size(); i++) {
        OneVsOtherBox l = labelPairs.get(i);
        assignLabel(unknownLabel, l, currentSettings.getPredictSimilaritySettings(), l.getOneLabel());
    }
}
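assignLabel itself is not shown in the original. Based on the description that follows, it is assumed to call inSphere for the “one” side of the pair and record the label when the unknown vector falls inside the class sphere; the accessor names, the label type and the addLabel setter below are assumptions used only for illustration.

// Sketch only: assign the "one" label of a pair when the unknown vector lies in its sphere.
private void assignLabelSketch(VectorBox unknownLabel, OneVsOtherBox pair,
        SimilaritySettingsBox similaritySettings, double oneLabel) {
    boolean inside = inSphere(unknownLabel,
            pair.getClosestOnes(),    // assumed accessor: closest "one"-class vectors
            pair.getFurthestOnes(),   // assumed accessor: furthest "one"-class vectors
            pair.getClosestOthers(),  // assumed accessor: closest opposite-class vectors
            similaritySettings);
    if (inside) {
        unknownLabel.addLabel(oneLabel); // assumed setter recording a candidate label
    }
}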
The “inSphere” method checks whether a new unknown vector lies inside the sphere that represents a class. It iterates through each furthest vector and finds the closest opposite-class vector to it. From that opposite-class vector it then searches back, finding the opposite-class vector's own closest vector from the first class. These two vectors are used to find the midpoint, and from it the radius inside which points are considered to belong to the class. A second loop then checks each closest vector and its own circle radius, in case the unknown vector is outside the range of the furthest-vector radius.
public boolean inSphere(VectorBox unknownLabel, List<VectorBox> closestOnes,
        List<VectorBox> furthestOnes, List<VectorBox> closestOthers,
        SimilaritySettingsBox similaritySettings) {
    for (int i = 0; i < furthestOnes.size(); i++) {
        // furthestOne--->closestOne---><---closestOther<---furthestOther
        VectorBox furthestOne = furthestOnes.get(i);
        // furthestOne-----------------><---closestOtherToFurthestOne------------
        VectorBox closestOtherToFurthest = findClosest(furthestOne, closestOthers, similaritySettings);
        // --------->closestOneToClosestOther---><---closestOtherToFurthestOne---
        VectorBox closestOneToClosestOther = findClosest(closestOtherToFurthest, closestOnes, similaritySettings);
        VectorBox midPoint = extraMath.middle(closestOtherToFurthest, closestOneToClosestOther);
        double midDist = similarity(midPoint, closestOtherToFurthest, similaritySettings);
        double safeDist = similarity(furthestOne, closestOtherToFurthest, similaritySettings) - midDist;
        safeDist *= currentSettings.getPredictionSearchSensitivity();
        double distance = similarity(unknownLabel, furthestOne, similaritySettings);
        if (distance <= safeDist) {
            return true;
        }
    }
    for (int j = 0; j < closestOnes.size(); j++) {
        VectorBox closestOne = closestOnes.get(j);
        VectorBox closestOther = findClosest(closestOne, closestOthers, similaritySettings);
        double dist1 = similarity(unknownLabel, closestOne, similaritySettings);
        VectorBox midP = extraMath.middle(closestOne, closestOther);
        double midDist = similarity(midP, closestOne, similaritySettings);
        if (dist1 <= midDist) {
            return true;
        }
    }
    return false;
}
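extraMath.middle is used above but not shown; it is assumed to compute the component-wise midpoint of two vectors. A self-contained sketch of that computation on plain arrays:

// Sketch of a component-wise midpoint, assumed to be what extraMath.middle computes:
// middle_i = (a_i + b_i) / 2 for every component i.
public static double[] middle(double[] a, double[] b) {
    double[] mid = new double[a.length];
    for (int i = 0; i < a.length; i++) {
        mid[i] = (a[i] + b[i]) / 2.0;
    }
    return mid;
}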
The slack-variable style handling of the inseparable problem can be seen in the inSphere snippet above as “safeDist *= currentSettings.getPredictionSearchSensitivity()”. It scales the sphere radius by the given sensitivity value; for example, a radius of 10.0 with a sensitivity of 0.8 shrinks the sphere to a radius of 8.0, making the classifier stricter about what counts as being inside the class.
Slow Machine
To improve learning, an option to search for the best learning subset was also implemented. It randomly creates subsets of the learning data, learns each one, predicts, resets the original data to unknowns, then compares the prediction precision and tracks which subset performed best. Because of the difficulty and complexity of calculating precision when using subsets, these were not used for the reported results even though they scored higher. One of those difficulties arises when a subset contains vectors of only one class label, which trivially yields 100% precision because there is no other class.
public List<VectorBox> tryFindBestTrainSet(List<VectorBox> learn, List<VectorBox> predict,
        SettingsBox s, int times, int subsetSize) {
    List<List<VectorBox>> randomSubsets = getRandomSubsets(learn, times, subsetSize);
    double[] results = new double[times];
    double high = 0;
    for (int i = 0; i < times; i++) {
        List<VectorBox> randomSubset = randomSubsets.get(i);
        FastMachine lss = new FastMachine(s);
        lss.learn(randomSubset);
        resetDataLabels(predict);
        List<VectorBox> predicted = lss.predict(predict);
        results[i] = predictonPrecision(predicted, learn);
        if (results[i] > high) {
            high = results[i];
            addDisplayInfo("Trying to find best learning subset: " + "Try: " + (i + 1) + " Prediction: " + high + "%");
        }
    }
    double highest = results[0];
    List<VectorBox> bestTrainingSubset = randomSubsets.get(0);
    for (int i = 0; i < results.length; i++) {
        if (highest < results[i]) {
            highest = results[i];
            bestTrainingSubset = randomSubsets.get(i);
        }
    }
    return bestTrainingSubset;
}
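getRandomSubsets is referenced above but not shown. A minimal sketch, assuming it simply draws the requested number of random subsets by shuffling copies of the learning data (java.util.ArrayList and Collections assumed imported), could look like this:

// Sketch only: build `times` random subsets of size `subsetSize` from the learning data.
private List<List<VectorBox>> getRandomSubsetsSketch(List<VectorBox> learn, int times, int subsetSize) {
    List<List<VectorBox>> subsets = new ArrayList<>();
    for (int i = 0; i < times; i++) {
        List<VectorBox> copy = new ArrayList<>(learn);
        Collections.shuffle(copy);                       // randomise the order
        int size = Math.min(subsetSize, copy.size());    // guard against oversized requests
        subsets.add(new ArrayList<>(copy.subList(0, size)));
    }
    return subsets;
}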
The methods for finding the best settings and the best closest-similarity settings work in the same way: they repeatedly generate random settings, run a “FastMachine” with each of them, and keep whichever settings give the highest prediction precision.
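That random-search loop is not shown in the original; the sketch below illustrates the described idea. randomSettings() is a hypothetical generator, while resetDataLabels(...) and predictonPrecision(...) mirror the helpers used in the snippets above; none of the exact signatures are confirmed by the original code.

// Sketch only: random search over settings, keeping whichever scores best.
private SettingsBox tryFindBestSettingsSketch(List<VectorBox> learn, List<VectorBox> predict, int tries) {
    SettingsBox best = null;
    double bestPrecision = -1;
    for (int i = 0; i < tries; i++) {
        SettingsBox candidate = randomSettings();        // assumed random settings generator
        FastMachine machine = new FastMachine(candidate);
        machine.learn(learn);
        resetDataLabels(predict);
        List<VectorBox> predicted = machine.predict(predict);
        double precision = predictonPrecision(predicted, learn);
        if (precision > bestPrecision) {
            bestPrecision = precision;
            best = candidate;
        }
    }
    return best;
}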