Literature Review (Machine Learning in Early and Late Biotechnology)

Introduction

Automating or simulating parts of drug discovery with machine learning involves several of its branches, each contributing different techniques, because the problem is too big for any one field to solve on its own. "The interaction between drugs and the human body is a bidirectional process: drugs affect the human body, resulting in receptor inhibition, activation, and signal pathway blocking, and the human body disposes of drug by absorption, distribution, metabolism, and excretion" (Y. Wang et al., 2015, p495). This chapter reviews how different drug discovery problems, such as protein folding and genome sequencing, are tackled with machine learning techniques, drawing on papers by various authors.

Ethical, legal and social issues

Genome sequencing has always been one of the biggest issues in biotechnology, because it sits on the boundary between the cost private companies bear to develop it and the view that biological data belongs to everyone in the world and should therefore be public. Public data could be stolen if not protected and then repackaged as a private product. To prevent that, "in Europe legal protection is given to databases by the European Directive on The Legal Protection of Databases" (C. McCubbin, 2003, p250). More research into legal issues might be needed, but this project's focus is on education.

Large private projects that sequence DNA and create new molecules aim to patent their results: "many patent applications have been filed in the field of genomics/functional genomics claiming the output from largescale DNA sequencing projects" (C. McCubbin, 2003, p252). Software created for biotechnology purposes can also be patentable, because it is usually unique and not easily reproduced in the same way without copying it or knowing something about it. From the point of view of private company owners it seems unfair that a drug they developed and sequences they discovered over many costly years are not patentable, which makes it difficult to turn a profit. From another point of view, however, it would be unethical if one company discovered a cure for cancer and a government discovered it later: people could not use the free cure, because they would have to buy the previously patented one. That also raises the question of whether, if the most efficient molecule were found, others would be forbidden to use or reproduce it, even in their own way, without paying the first inventor. Unique software, patented without problems, could also produce results that are not supposed to be private: if such software creates a cure while being patented itself, would its results be patented as well? There are also examples of socially unacceptable issues, such as genetically modified foods and experiments on animals. Other problems include "social views on such issues as whether knowledge is being commoditised, whether it is acceptable to patent living organisms, innovations derived from traditional local knowledge and active ingredients from plants considered sacred, and whether the industry should share benefits with local communities" (N. Rigand, 2008, p20).

Earlier and non-machine-learning techniques

The simplest and earliest use of computing in biotechnology was knowledge-based systems, which "are being used to represent, manage and maintain, over time, knowledge obtained by interpreting the raw data stored in databases" (J. Fox et al., 1994, p288); earlier, such interpretation was done by experts using statistics and rule-based systems. Fox et al. (1994) also mention the use of regular expressions, which extract subsequences from long genetic sequences and can help identify the known parts. Regular expressions are a fairly old technique, widely used in scripting languages from Bash to JavaScript.
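To make that idea concrete, the following is a minimal sketch (not taken from Fox et al.) of using a regular expression to locate a known pattern inside a longer genetic sequence; the DNA string and the simplified TATA-box-like consensus are chosen purely for illustration.

    import re

    # Illustrative DNA string; real sequences would come from a database or FASTA file.
    sequence = "GGCGTATAAATGGCTTATATATCCGGAGTATAAAAGCC"

    # Simplified TATA-box-like consensus: TATA, then A or T, then A, then A or T.
    motif = re.compile(r"TATA[AT]A[AT]")

    # Report every position where the known pattern occurs in the longer sequence.
    for match in motif.finditer(sequence):
        print(f"motif {match.group()} found at position {match.start()}")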
Regression, clustering and decision tree algorithms come from mathematics and statistics. Ashtawy et al. (2012) compared different machine learning techniques, ranging from K-Nearest Neighbour clustering to regression, decision trees and SVM, which gave the best results. J. Cohen (2004) mentions the use of microarrays: "a powerful new tool available in biology is microarrays. They allow determining simultaneously the amount of mRNA production of thousands of genes" (J. Cohen, 2004, p129). In other words, microarrays are slides carrying genetic probes with known properties; they are exposed to the test substances and the results are loaded into a computer using laser scanners for analysis. There are various ways to use microarrays; in one of them "data from many separate microarray experiments are collected into a single matrix, indexed by gene (row) and experiment (column)" (W. Noble, 2003, p15).

Binding affinity can be estimated with a "scoring function SF, a mathematical or predictive model that produces a score representing the binding free energy of a binding pose" (M. Ashtawy et al., 2012, p1302). Because biotechnologists need a way to evaluate or compare potential drug interactions with targets, there are various scoring functions, and "most SFs in use today can be categorized as either force-fieldbased, empirical, or knowledge-based SFs, but none were based on sophisticated machine-learning (ML) algorithms" (M. Ashtawy et al., 2012, p1302). Nevertheless, scoring functions are now also built on machine learning algorithms, and there are "steady gains in performance of ML based SFs, in particular for those based on RF and BRT, as the training set size and type and number of features were increased" (M. Ashtawy et al., 2012, p1311). Scoring functions sit close to atomic-level physics and chemistry, sometimes even quantum mechanics. Atomic-level simulations are called molecular dynamics, "a technique for computer simulation of complex systems, modelled at the atomic level" (J. Meller, 2001, p1). Molecular dynamics is usually applied to macromolecules, because of their very high complexity and the need to understand how they work: "biologically important macromolecules and their environments are routinely studied using molecular dynamics simulations" (J. Meller, 2001, p1). Ligand and receptor interaction can be measured by SFs, but the whole process can also be simulated as molecular docking, which "simulates the binding of drug molecules to their target protein molecules" (S. Smith et al., 2014, p119). Non-computational approaches also exist: "experimental techniques, such as X-ray diffraction or nuclear magnetic resonance (NMR), allow determination of the structure and elucidation of the function of large molecules of biological interest" (J. Meller, 2001, p1). They are not as popular as computational methods, which allow some level of protein structure prediction without determining the structure experimentally or spending as many resources: "X-ray crystallography and Nuclear Magnetic Resonance (NMR) being expensive and not likely to cope with the growing amount of data and necessitates the development of novel computational techniques" (M. Smitha et al., 2008, p2).

Computational simulations or not, these are still very complex problems, because of how many physical laws and particles are involved in the process. "Many problems of interpreting data from molecular biology experiments have combinatorial complexity" (J. Fox et al., 1994, p294), meaning that there are multiple related variables, each with its own set of possible values. These could be solved by a constraint-based system, which "represents the dependencies among all the objects in a problem as constraints, and uses a problem solver that will prune illegal solutions and their consequents (constraint propagation) when a constraint is violated" (J. Fox et al., 1994, p294); a toy sketch of this pruning idea is given at the end of this subsection. J. Fox et al. (1994) point out the weaknesses of machine learning but do not lose hope, saying that "AI has the potential to make a significant contribution to computational molecular biology" (J. Fox et al., 1994, p298). Older techniques are still used, but in combination with more advanced ones, which are discussed in the following paragraphs.
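The promised sketch of constraint-based pruning follows; it is not Fox et al.'s actual system, and the fragment names, candidate positions and no-overlap constraint are invented for illustration.

    # Hypothetical toy problem: place three sequence fragments at candidate
    # start positions so that no two fragments overlap.
    domains = {
        "A": [0, 5, 10],
        "B": [8, 12, 20],
        "C": [15, 25, 30],
    }
    FRAGMENT_LENGTH = 8  # assumed equal length for every fragment

    def violates(assignment):
        """Constraint check: no two placed fragments may overlap."""
        placed = list(assignment.values())
        for i in range(len(placed)):
            for j in range(i + 1, len(placed)):
                if abs(placed[i] - placed[j]) < FRAGMENT_LENGTH:
                    return True
        return False

    def solve(variables, assignment=None):
        """Backtracking search that prunes a branch as soon as a constraint is violated."""
        assignment = assignment or {}
        if violates(assignment):
            return  # constraint violated: prune this partial solution and its consequents
        if len(assignment) == len(variables):
            yield dict(assignment)
            return
        var = variables[len(assignment)]
        for value in domains[var]:
            assignment[var] = value
            yield from solve(variables, assignment)
            del assignment[var]

    for solution in solve(["A", "B", "C"]):
        print(solution)

Real constraint solvers add smarter propagation, but the pruning principle is the same.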

Abstracting the data to reduce the problem

The rise of big data has led to better results from machine learning, because it allows researchers "to make fewer initial assumptions about the underlying model and let the data determine which model is the most appropriate" (Y. Wang et al., 2015, p508). Because of the sheer size of big data, even the most efficient algorithms need it approximated, reduced, modelled, discretised, normalised or abstracted to some level, so that results can be produced with a limited amount of resources. "It is intuitively clear that less accurate approximations become inevitable with growing complexity" (J. Meller, 2001, p2) and "in order to develop efficient protein structure prediction algorithm, there is a need for restricting the conformational search space" (M. Smitha et al., 2008, p3). Such simplification could be called simulation, or "reduction of complexities in living things for achieving much smaller systems that can express the whole or a part of living cellular properties" (K. Tsumoto et al., 2007, p102). Mathematical abstractions arise, some of which rely on vector representations, where each attribute of a protein is converted into a numeric value occupying a separate, unique vector dimension. These attributes can be called features and are together represented as feature vectors: "Much attention paid on statistical or machine learning techniques to classify the proteins using feature vector representations of available knowledge" (A. Chinnasamy et al., 2003, p2). Since amino acids can be represented by letters, Noble (2003) mentions that you can have "each protein characterized by a simple vector of letter frequencies" (W. Noble, 2003, p5). Proteins can also be represented through sequences of scoring function results, where the "resulting scores are used to map each protein into a 10,000-dimensional space" (W. Noble, 2003, p6).
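As a minimal sketch of that letter-frequency representation (with made-up protein sequences), each sequence below is mapped to a fixed 20-dimensional vector of amino-acid frequencies that a downstream classifier could consume.

    from collections import Counter

    # The 20 standard amino acids, one letter each.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def letter_frequency_vector(sequence):
        """Map a protein sequence to a 20-dimensional vector of letter frequencies."""
        counts = Counter(sequence)
        total = len(sequence)
        return [counts[aa] / total for aa in AMINO_ACIDS]

    # Made-up sequences standing in for real proteins from a database.
    proteins = {
        "toy_protein_1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "toy_protein_2": "GAVLIMGAVLIMPFWSTCYNQDEKRH",
    }

    for name, sequence in proteins.items():
        vector = letter_frequency_vector(sequence)
        print(name, [round(value, 3) for value in vector])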
Blaszczyk et al. (2015) describe an algorithm called CABS-dock, which uses the 3D structure of the receptor and a peptide sequence to predict how the peptide best fits that receptor. 2D representations of biochemical pathways and their simulations "can be interpreted as graphical programming languages, and compiler theory can be applied to compile/interpret the graphical language into another one that is supported by runtime simulators" (R. Gostner et al., 2014, p16:2). Data can also be simplified by introducing limits, such as "neither the residues in the protein binding pocket nor the ligand are allowed to have organic elements other than C, N, O, P, S, F, Cl, Br, I, and H" (M. Ashtawy et al., 2012, p1303). Even averaging can be done in some cases, because "according to statistical mechanics, physical quantities are represented by averages over microscopic states" (J. Meller, 2001, p4). Another way to identify a protein is to compare its similarity to others, as sketched below, but "it is important to keep in mind that sequence similarity does not always imply similarity in structure and vice-versa" (J. Cohen, 2004, p131). Abstractions make the job easier for computers, but the side effect is reduced precision: "in practice the quality of sampling and the accuracy of the interatomic potentials used in simulations are always limited" (J. Meller, 2001, p4). Even though precision has to be sacrificed when data is reduced for efficiency, ever more powerful computers are appearing and less abstraction may be needed; with the rise of big data, precision becomes less of a problem. "With synchrotrons and fast computers, drug designers can visualize ligands bound to their target providing a wealth of details concerning the non-bonded interactions that control the binding process" (V. Lounnas et al., 2013, p1).
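At its crudest, the similarity comparison mentioned above might look like the following sketch, which uses Python's standard-library difflib purely as a stand-in for proper alignment tools such as BLAST; the sequences are invented, and the score says nothing about structural similarity.

    from difflib import SequenceMatcher

    # Invented sequences; real comparisons would use alignment tools such as BLAST.
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    candidates = {
        "candidate_1": "MKTAYIAKQRQISFVKSHFARQLEERLGIIEVQ",
        "candidate_2": "GAVLIMGAVLIMPFWSTCYNQDEKRH",
    }

    for name, sequence in candidates.items():
        # Ratio in [0, 1]: a rough sequence-level similarity, not structural similarity.
        score = SequenceMatcher(None, query, sequence).ratio()
        print(f"{name}: similarity {score:.2f}")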
