Introduction
Automating or simulating parts of drug discovery with machine learning involves many of its branches, each with its own techniques, because the problem is too large for any single field to solve. “The interaction between drugs and the human body is a bidirectional process: drugs affect the human body, resulting in receptor inhibition, activation, and signal pathway blocking, and the human body disposes of drug by absorption, distribution, metabolism, and excretion” (Y. Wang et al., 2015, p495). This chapter reviews how drug discovery problems such as protein folding and genome sequencing are tackled with machine learning techniques, drawing on papers by various authors.
Ethical, legal and social issues
Genome sequencing has always been one of the most contested areas of biotechnology, because it sits on the boundary between the high cost private companies bear to develop it and the view that biological data belongs to everyone in the world and should therefore be public. Public data could also be appropriated if left unprotected and resold as a private product. To prevent that, “in Europe legal protection is given to databases by the European Directive on The Legal Protection of Databases” (C. McCubbin, 2003, p250). More research into the legal issues might be needed, but the focus of this project is on education.
Large private projects that sequence DNA and create new molecules aim to patent their results: “many patent applications have been filed in the field of genomics/functional genomics claiming the output from largescale DNA sequencing projects” (C. McCubbin, 2003, p252). Software created for biotechnology can also be patentable, because it is usually unique and not easily reproduced without copying it or knowing something about how it works. From a private company owner’s point of view it would seem unfair if a drug developed, and sequences discovered, over many costly years could not be patented, since that would make it difficult to earn a profit. From another point of view, patents can be unethical: if one company discovers a cure for cancer and a government body discovers it later, people could not use the free cure, because they would have to buy the previously patented one. That leads to further questions: if the most efficient molecule has been found, are others forbidden from using or reproducing it, even in their own way, without paying the first inventor? Unique software, patented without difficulty, could also produce results that are not supposed to be private. If such software creates a cure while being patented itself, would its results also be patented? There are also socially contentious examples such as genetically modified foods and experiments on animals. Other problems include “social views on such issues as whether knowledge is being commoditised, whether it is acceptable to patent living organisms, innovations derived from traditional local knowledge and active ingredients from plants considered sacred, and whether the industry should share benefits with local communities” (N. Rigand, 2008, p20).
Earlier and non-machine-learning techniques
The simplest and earliest use of computing in biotechnology was knowledge-based systems, which “are being used to represent, manage and maintain, over time, knowledge obtained by interpreting the raw data stored in databases” (J. Fox et al., 1994, p288); earlier, such interpretation was done by experts using statistics and rule-based systems. Fox et al. (1994) mention the use of regular expressions, which extract subsequences from long genetic sequences and can help identify the known parts. It is an old technique, widely used in scripting languages from Bash to JavaScript; a small sketch of the idea follows.
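As a minimal illustration (the motif and the sequence below are invented for the example, not taken from the cited papers), a regular expression can pick out a known pattern, such as a TATA-box-like motif, from a longer nucleotide string:

import re

# Hypothetical example sequence; real data would come from a FASTA file.
sequence = "GGCCTATAAAAGGCGCGATATAAATGCCG"

# A TATA-box-like motif: "TATA" followed by three A/T bases.
motif = re.compile(r"TATA[AT]{3}")

for match in motif.finditer(sequence):
    print(match.start(), match.group())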
From mathematics and statistics come regression, clustering and decision-tree algorithms. Ashtawy et al. (2012) compared different machine learning techniques, ranging from k-nearest neighbours and clustering to regression, decision trees and SVMs, with SVMs giving the best results; a generic sketch of such a comparison is given below.
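A generic sketch of comparing such model families with cross-validation, using scikit-learn and synthetic stand-in data (this is not the data or evaluation protocol of the cited study):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic stand-in data: 200 "compounds" with 20 numeric descriptors each.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")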
J. Cohen (2004) mentions the use of microarrays: “a powerful new tool available in biology is microarrays. They allow determining simultaneously the amount of mRNA production of thousands of genes” (J. Cohen, 2004, p129). In other words, microarrays are arrays of probes with known properties; they are exposed to test samples and read into a computer with laser scanners so the results can be analysed. There are various ways to use microarrays; in one of them, “data from many separate microarray experiments are collected into a single matrix, indexed by gene (row) and experiment (column)” (W. Noble, 2003, p15), as sketched below.
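A minimal sketch of that gene-by-experiment matrix, using pandas and made-up expression values (the gene and experiment names are purely illustrative):

import pandas as pd

# Rows are genes, columns are separate microarray experiments;
# the numbers stand in for measured expression levels.
expression = pd.DataFrame(
    {
        "experiment_1": [2.1, 0.4, 1.7],
        "experiment_2": [1.9, 0.6, 1.5],
        "experiment_3": [0.2, 3.1, 1.6],
    },
    index=["gene_A", "gene_B", "gene_C"],
)

# Each row is then a profile describing one gene across experiments.
print(expression.loc["gene_A"])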
Binding affinity can be estimated with a “scoring function SF, a mathematical or predictive model that produces a score representing the binding free energy of a binding pose” (M. Ashtawy et al., 2012, p1302). Because biotechnologists need a way to evaluate and compare potential drug interactions with targets, there are various scoring functions, and “most SFs in use today can be categorized as either force-fieldbased, empirical, or knowledge-based SFs, but none were based on sophisticated machine-learning (ML) algorithms” (M. Ashtawy et al., 2012, p1302). Nevertheless, when machine learning is applied to the same task there are “steady gains in performance of ML based SFs, in particular for those based on RF and BRT, as the training set size and type and number of features were increased” (M. Ashtawy et al., 2012, p1311). A sketch of a random-forest scoring function follows.
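A minimal sketch of an ML-based scoring function in the spirit of the RF approach discussed by Ashtawy et al.; the descriptors, affinity values and model settings here are hypothetical placeholders, not the features or data of the cited work:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: each row describes one protein-ligand complex
# (e.g. counts of contacting atom pairs); y is its measured binding affinity.
rng = np.random.default_rng(0)
X_train = rng.random((500, 36))        # 36 placeholder intermolecular features
y_train = rng.normal(6.0, 1.5, 500)    # placeholder pKd-like affinities

scoring_function = RandomForestRegressor(n_estimators=200, random_state=0)
scoring_function.fit(X_train, y_train)

# Score a new (hypothetical) pose described by the same 36 features.
new_pose = rng.random((1, 36))
print("predicted affinity:", scoring_function.predict(new_pose)[0])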
Scoring functions sit close to atomic-level physics and chemistry, sometimes even quantum mechanics. Atomic-level simulations are called molecular dynamics, “a technique for computer simulation of complex systems, modelled at the atomic level” (J. Meller, 2001, p1). Molecular dynamics is usually applied to macromolecules, because of their very high complexity and the need to understand how they work: “Biologically important macromolecules and their environments are routinely studied using molecular dynamics simulations” (J. Meller, 2001, p1). A minimal sketch of the underlying idea follows.
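At its core, molecular dynamics repeatedly integrates Newton’s equations of motion for the particles in small time steps. As a deliberately tiny sketch of that idea, assuming a single particle in a one-dimensional harmonic potential and the velocity Verlet integrator (the mass, spring constant and step size are arbitrary illustrative values, nothing like a real biomolecular force field):

# Velocity Verlet integration of one particle in a potential U(x) = 0.5*k*x^2.
mass, k, dt = 1.0, 1.0, 0.01   # arbitrary units
x, v = 1.0, 0.0                # initial position and velocity

def force(x):
    return -k * x              # F = -dU/dx

a = force(x) / mass
for step in range(1000):
    x += v * dt + 0.5 * a * dt * dt     # update position
    a_new = force(x) / mass
    v += 0.5 * (a + a_new) * dt         # update velocity with averaged acceleration
    a = a_new

print("position and velocity after 1000 steps:", x, v)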
Ligand–receptor interaction can be measured by SFs, but the whole binding event can also be simulated with molecular docking, which “simulates the binding of drug molecules to their target protein molecules” (S. Smith et al., 2014, p119). There are also non-computational methods: “experimental techniques, such as X-ray diffraction or nuclear magnetic resonance (NMR), allow determination of the structure and elucidation of the function of large molecules of biological interest” (J. Meller, 2001, p1). They are not as popular as computational methods, because the latter allow some level of protein structure prediction without knowing the structure in advance or spending as many resources: “X-ray crystallography and Nuclear Magnetic Resonance (NMR) being expensive and not likely to cope with the growing amount of data and necessitates the development of novel computational techniques” (M. Smitha et al., 2008, p2).
Computational or not, these simulations remain very complex problems because of how many physical laws and particles are involved. “Many problems of interpreting data from molecular biology experiments have combinatorial complexity” (J. Fox et al., 1994, p294), meaning there are many related variables, each with its own set of possible values. Such problems can be tackled by a constraint-based system, which “represents the dependencies among all the objects in a problem as constraints, and uses a problem solver that will prune illegal solutions and their consequents (constraint propagation) when a constraint is violated” (J. Fox et al., 1994, p294). A small sketch of that pruning idea follows.
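As a minimal sketch of constraint-based pruning (the variables, domains and constraints are invented for illustration, not taken from Fox et al.), a tiny backtracking solver can discard a partial assignment as soon as it violates a constraint:

# Each variable has a finite domain; constraints are checks over partial assignments.
domains = {"x": [1, 2, 3], "y": [1, 2, 3], "z": [1, 2, 3]}
constraints = [
    lambda a: a.get("x") is None or a.get("y") is None or a["x"] < a["y"],
    lambda a: a.get("y") is None or a.get("z") is None or a["y"] != a["z"],
]

def consistent(assignment):
    return all(check(assignment) for check in constraints)

def solve(assignment, variables):
    if not variables:                      # every variable assigned and consistent
        return assignment
    var, rest = variables[0], variables[1:]
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):         # prune: abandon illegal partial solutions early
            result = solve(assignment, rest)
            if result is not None:
                return result
        del assignment[var]
    return None

print(solve({}, list(domains)))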
J. Fox et al. (1994) point out weaknesses of machine learning, but do not lose hope, stating that “AI has the potential to make a significant contribution to computational molecular biology” (J. Fox et al., 1994, p298). Older techniques are still used, but in combination with more advanced ones, which are discussed in the following paragraphs.
Abstracting the data to reduce the problem
The rise of big data led to better results from machine learning, because it allows one “to make fewer initial assumptions about the underlying model and let the data determine which model is the most appropriate” (Y. Wang et al., 2015, p508). Because of its sheer size, even the most efficient algorithms need the data approximated, reduced, modelled, discretized, normalized or otherwise abstracted to some level, so that a limited amount of resources can still produce results. “It is intuitively clear that less accurate approximations become inevitable with growing complexity” (J. Meller, 2001, p2), and “in order to develop efficient protein structure prediction algorithm, there is a need for restricting the conformational search space” (M. Smitha et al., 2008, p3). Such simplification could be called simulation, or “reduction of complexities in living things for achieving much smaller systems that can express the whole or a part of living cellular properties” (K. Tsumoto et al., 2007, p102).
Mathematical abstractions arise; some of them rely on vector representations, where each attribute of a protein is converted into a numeric value occupying a separate and unique vector dimension. These attributes can be called features and together form feature vectors. “Much attention paid on statistical or machine learning techniques to classify the proteins using feature vector representations of available knowledge” (A. Chinnasamy et al., 2003, p2). Noble (2003) mentions that one can have “each protein characterized by a simple vector of letter frequencies” (W. Noble, 2003, p5), since amino acids can be represented by letters. Proteins can also be represented by the results of scoring functions: “resulting scores are used to map each protein into a 10,000-dimensional space” (W. Noble, 2003, p6). A small sketch of the simpler letter-frequency representation is given below.
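As a minimal sketch of the letter-frequency idea (the sequence is a made-up fragment, and the ordering of the 20 standard amino-acid letters is just one possible convention):

from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes

def letter_frequencies(sequence):
    """Return a 20-dimensional feature vector of amino-acid frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]

# Hypothetical protein fragment; a real one would come from a sequence database.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
vector = letter_frequencies(protein)
print(len(vector), vector[:5])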
Blaszczyk et al. (2015) describe an algorithm called CABS-dock, which uses the 3D structure of the receptor and a peptide sequence to predict the best-fitting peptide for that receptor. 2D representations of biochemical pathways and their simulations “can be interpreted as graphical programming languages, and compiler theory can be applied to compile/interpret the graphical language into another one that is supported by runtime simulators” (R. Gostner et al., 2014, p16:2). Data can also be simplified by introducing limits, such as “neither the residues in the protein binding pocket nor the ligand are allowed to have organic elements other than C, N, O, P, S, F, Cl, Br, I, and H” (M. Ashtawy et al., 2012, p1303). Even averaging can be done in some cases, because “according to statistical mechanics, physical quantities are represented by averages over microscopic states” (J. Meller, 2001, p4).
Another way to identify a protein is to compare its similarity to others, but “it is important to keep in mind that sequence similarity does not always imply similarity in structure and vice-versa” (J. Cohen, 2004, p131); a minimal similarity sketch is given below.
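As a minimal sketch of sequence comparison (the two fragments are invented, and plain percent identity over aligned positions stands in for the more sophisticated alignment scores used in practice):

def percent_identity(seq_a, seq_b):
    """Fraction of positions with identical residues, over the shorter length."""
    length = min(len(seq_a), len(seq_b))
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / length

# Hypothetical fragments of two related proteins.
seq1 = "MKTAYIAKQRQISFVK"
seq2 = "MKTAYLAKQRHISFVK"
print(f"identity: {percent_identity(seq1, seq2):.0%}")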
With abstractions making the job easier for computers comes the side effect of reduced precision: “in practice the quality of sampling and the accuracy of the interatomic potentials used in simulations are always limited” (J. Meller, 2001, p4). Even though precision must be traded away when data is reduced for efficiency, computers keep getting more powerful, so less abstraction may be needed over time. Moreover, with the rise of big data, the loss of precision becomes less of an issue. “With synchrotrons and fast computers, drug designers can visualize ligands bound to their target providing a wealth of details concerning the non-bonded interactions that control the binding process” (V. Lounnas et al., 2013, p1).