Bayes' Theorem and Probability-Based Networks
From statistics we know that probability is a useful tool for prediction, but it becomes difficult to work with when many related probabilities are involved. Bayes' theorem addresses this with the famous equation P(A|B) = P(B|A) * P(A) / P(B): it relates the probability of something that has already happened to the probability of something that might happen. Conditional probabilities can be chained into a cascading tree, treating each group of variables as "the probability of A given the rest" until only one is left, and that is what makes the theorem usable in machine learning on multidimensional data. Yoruk et al. (2011) used Bayesian networks to model cell protein signalling and noted that "biological data are inherently probabilistic and generally display hierarchical relationships" (E. Yoruk et al., 2011, p592). You don't usually have hundreds of thousands of experiments on a biological target, so the amount of data for each unique event is small, while many machine learning algorithms need a lot of data to reach high accuracy. "Bayesian neural nets are extremely useful at drug discovery, but are slow to train and difficult to scale to very large network sizes" (N. Srivastava et al., 2014, p1941). But even if they are slow, they can be more precise because of the use of inter-related probabilities, and pre-calculating probabilities for smaller data sets is not very difficult. A "Bayesian classifier requires a prior knowledge of many probabilities" (A. Chinnasamy et al., 2003, p2). This classifier is also more popular among biologists because the results are clear and easy to understand, slow training is not a problem when data sets are small, and "the biologists need to know the confident level of the resultant classes outputted by the classifiers for further analysis" (A. Chinnasamy et al., 2003, p11). Chinnasamy et al. (2003) managed to produce better results with Bayesian networks than with SVM. Even though Bayes' theorem seems simple, when it is used with multivariate attributes and many related probabilities it can start to look like the formulas in the pictures below, which are not intuitive to read; to make them easier, you can think of them simply as nested "P(A | the rest)".
Screenshots 1, 2, 3: Possible Bayesian network formulas (Source: E. Yoruk et al., 2011, p604)
Geometry-Based Support Vector Machine
The way machine learning is used could be explained like this: algorithms use data identified or labelled by experts to learn the best separation between different identities or class labels, so that when new unidentified or unlabelled data comes in, the machine can tell on which side of the separated space it falls and therefore which label or identity it has. Many different algorithms separate data in various ways, but unlike SVM, they do not try to find the best separation. The best separation in SVM is called the widest margin. It leads to higher precision, and that is why the "support vector machine learning algorithm has been extensively applied within the field of computational biology" (W. Noble, 2003, p1). SVM starts with very simple ideas from 2D/3D vector geometry. Imagine data rows with four variables as points or vectors in 3D space, with the fourth value being the class label, one of two possible labels or colours. We could draw a plane between the differently coloured points and then adjust the plane's location and orientation so that it is as far away from each colour as possible. When new, uncoloured points come in, we can determine what colour they should be by measuring which side of the plane they are on, and because we pushed the plane as far from each colour as possible, even points on the very edge, which are difficult to identify, can simply be measured by how far and in which direction from the plane they lie. "There are three key ideas needed to understand SVM: maximizing margins, the dual formulation, and kernels" (K. Bennett et al., 2000, p1). One of the problems with SVM is overfitting, but it can be reduced by lowering "the input dimensionality" (N. Srivastava et al., 2014, p1942).
Figure 1. SVM and hyperplane separation (Source: Ubaby, 2016)
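As a rough sketch of the widest-margin idea (using scikit-learn on made-up toy points, not data from any of the cited studies), the snippet below fits a linear SVM to a handful of labelled 3D points and then checks on which side of the hyperplane, and how far from it, a new point lies:

```python
# A minimal sketch of linear "widest margin" separation with scikit-learn.
# All points and labels are invented toy values.
import numpy as np
from sklearn.svm import SVC

# Toy data: rows with 3 measurements ("points in 3D space")
# and a fourth value acting as the class label / colour.
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.4],
              [0.9, 2.1, 0.6],
              [4.0, 5.0, 3.5],
              [4.2, 4.8, 3.6],
              [3.9, 5.1, 3.4]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel fits a separating hyperplane with the maximum margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# New, "uncoloured" point: which side of the hyperplane is it on, and how far?
new_point = np.array([[2.0, 3.0, 1.5]])
print(clf.predict(new_point))            # predicted label
print(clf.decision_function(new_point))  # signed distance from the hyperplane
print(clf.support_vectors_)              # the support vectors defining the margin
```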
Data is rarely 2D or 3D; it is usually multi-dimensional, multi-class, noisy and non-linearly separable, which means the points of different class labels or colours cannot be separated simply by drawing a line or a plane. Instead they may be separable by some curve or curved hyper-surface in some dimension. Our data rows can be called vectors or, as in some formulas, transposed vectors, which are one-row matrices. In higher dimensions the hyperplane is no longer a nice, hyper-flat geometrical object; it is converted into a formula or function, which is created by measuring distances between vectors while taking into account each vector's colour or class label. In SVM the geometric functions are transformed into a Lagrange multiplier problem, which needs two parts: the main function determining the rule, or the hyperplane, and the constraints relying on the support vectors, which help determine how the hyperplane is oriented. Lagrange multipliers allow finding minimum or maximum values, in this case the function representing the hyperplane that is furthest from each class. In the non-linearly separable case, the function used to measure the distance between points may not be the same as in the linearly separable case. In SVM the distance formulas are called kernels. Kernels allow linear separation by plotting the points into higher dimensions. The "SVM kernel framework accommodates in a straightforward fashion many different types of data—vectors, strings, trees, graphs" (W. Noble, 2003, p24). Higher-dimension kernels can simply be thought of as measurements of how similar two vectors are.
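To illustrate kernels as similarity measurements (again with invented points, not data from the cited papers), the sketch below computes an RBF kernel by hand and shows that switching SVC to the RBF kernel can separate an XOR-like pattern that no straight line can split:

```python
# A small sketch of a kernel as a similarity measure between two vectors,
# and of switching SVC to a non-linear (RBF) kernel. Toy values only.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: close to 1 for nearby vectors, near 0 for distant ones."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

u = np.array([1.0, 2.0])
v = np.array([1.1, 2.1])
w = np.array([5.0, 9.0])
print(rbf_kernel(u, v))  # high similarity
print(rbf_kernel(u, w))  # very low similarity

# Non-linearly separable toy data (XOR-like pattern): no straight line splits
# the classes, but the RBF kernel implicitly maps the points into a higher
# dimension where a linear separation exists.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))  # should recover the XOR-style labelling [0 0 1 1]
```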
SVM results by various authors
Studies have been done comparing multiple machine learning techniques on biological data, and SVM has been shown to be an intuitive, high-precision algorithm. Chinnasamy et al. (2003) discussed other studies in which SVM performs better but produces a high number of false positives, saying that "from all these studies it is evident that among all the prediction methods, SVM performs better" (A. Chinnasamy et al., 2003, p3) and explaining that "in SVM, as the number of classifiers is high, reading the distances between hyperplane and the classes are very difficult" (A. Chinnasamy et al., 2003, p11). Noble (2003) said that SVM can be used "to learn to differentiate true from false positives" (W. Noble, 2003, p23). Furlanello et al. (2005) used SVM for molecular profiling, outlier detection and classification of samples to reduce the computation required. Ding et al. (2000) compared SVM and neural networks for predicting protein folds, and the results indicate much higher accuracy and faster processing with SVM than with NN. SVM is one of the most precise techniques and is a "very robust methodology for inference with minimal parameter choice" (K. Bennett et al., 2000, p1), because you only have to choose the kernel and the slack variable value. Choosing the correct kernel might not be easy, because it is not always clear which dimension or measurement of similarity is the most appropriate for the data.
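As a hedged sketch of those two choices, the kernel and the slack value, the snippet below runs a cross-validated grid search over both on scikit-learn's built-in breast cancer dataset; the grid values are illustrative, not settings used in any of the cited studies:

```python
# Choosing the kernel and the slack (regularisation) value C by grid search.
# Dataset and parameter grid are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "kernel": ["linear", "rbf"],  # which similarity measure to use
    "C": [0.1, 1.0, 10.0],        # slack: how much misclassification to tolerate
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best kernel / C combination found
print(search.best_score_)   # mean cross-validated accuracy
```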