Literature Review (Further Machine Learning Techniques)

Bayes' Theorem and Probability-Based Networks

From statistics we know that probability is a useful tool for prediction, but it becomes difficult to work with when many related probabilities are involved. Bayes' theorem helps to solve this problem with the famous equation P(A|B) = P(B|A)P(A) / P(B). It takes into account the probability of something that has already happened and relates it to something that might happen. This can be used as a cascading tree of probabilities given conditions, treating each group of probabilities as the probability of A given the rest, until only one is left. That way it can be applied in machine learning to multidimensional data. Yoruk et al. (2011) used Bayesian networks to model cell protein signalling and noted that "biological data are inherently probabilistic and generally display hierarchical relationships" (E. Yoruk et al., 2011, p592). You do not usually have hundreds of thousands of experiments on a given biological target, which means the amount of data for each unique event is not very large, and many machine learning algorithms need a lot of data to reach higher accuracy. "Bayesian neural nets are extremely useful at drug discovery, but are slow to train and difficult to scale to very large network sizes" (N. Srivastava et al., 2014, p1941). Even if they are slow, they can be more precise because of the use of inter-related probabilities, and because it is not very difficult to pre-calculate probabilities for smaller data sets. "Bayesian classifier requires a prior knowledge of many probabilities" (A. Chinnasamy et al., 2003, p2). This classifier is also more popular among biologists because the results are clear and easy to understand, slow training is not a problem when data sets are small, and "the biologists need to know the confident level of the resultant classes outputted by the classifiers for further analysis" (A. Chinnasamy et al., 2003, p11). Chinnasamy et al. (2003) managed to produce better results with Bayesian networks than with SVM. Even though Bayes' theorem seems simple, when it is used with multivariate attributes and many related probabilities it can start to look like the formulas in the screenshots below, making it harder to follow; to make it easier, you can think of them simply as nested P(A | the rest).
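As a small illustration of the probability-based idea, here is a minimal sketch using a simple naive Bayes classifier rather than the full Bayesian networks of the cited studies, assuming scikit-learn is available; the toy measurements and labels are invented purely for the example:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row is one observation with two measured variables; the labels
# mark two classes (e.g. active / inactive). All values are made up.
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.4]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()
model.fit(X, y)

# predict_proba returns P(class | observation) for each class, i.e. the
# kind of confidence level biologists can inspect for further analysis.
print(model.predict_proba(np.array([[1.1, 2.0]])))

Because the output is a probability per class rather than just a label, even a small model like this gives the kind of interpretable, confidence-carrying result that makes probability-based classifiers popular with biologists.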

Screenshots 1, 2, 3: Possible Bayesian Network Formulas (Source: E. Yoruk et al., 2011, p604)
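The screenshots are not reproduced here, but the general shape such formulas take (not the exact expressions from Yoruk et al.) is the factorisation of the joint probability into one "P(A | the rest)" term per variable, each conditioned only on that variable's parents in the network:

P(X1, X2, ..., Xn) = P(X1 | parents(X1)) × P(X2 | parents(X2)) × ... × P(Xn | parents(Xn))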


Geometry-Based Support Vector Machine

The way machine learning is used could be explained by saying that algorithms use data identified or labelled by experts to learn, or to find the best separation point between different identities or class labels, so that when new unidentified or unlabelled data comes in, the machine can tell on which side of the separated space it falls and therefore which label or identity that data has. There are many different algorithms which separate data in various ways, but unlike SVM, they do not try to find the best separation. The best separation in SVM is called the widest margin. It leads to higher precision, which is why the "support vector machine learning algorithm has been extensively applied within the field of computational biology" (W. Noble, 2003, p1). SVM starts with very simple ideas from 2D/3D vector geometry. If we imagine data rows with four variables as points or vectors in 3D space, with the fourth value being the class label taking one of two possible labels or colours, then we could draw a plane between the differently coloured points and, finally, try to modify the plane's location and orientation so that it is as far away from each colour as possible. When new uncoloured points come in, we can determine what colour they should be by measuring on which side of the plane they are; and because we modified the plane to be as far away as possible from each colour, even the points on the very edge, which are difficult to identify, can simply be measured by how far from the plane and in which direction they lie. "There are three key ideas needed to understand SVM: maximizing margins, the dual formulation, and kernels" (K. Bennett et al., 2000, p1). One of SVM's problems is overfitting, but it can be improved by reducing "the input dimensionality" (N. Srivastava et al., 2014, p1942).

Figure 1. SVM and hyperplane separation (Source: Ubaby, 2016)
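A minimal sketch of the widest-margin idea, assuming scikit-learn is available (the 2D points and labels are invented for illustration): a linear SVM is fitted on two "colours" of points, and decision_function gives the signed distance from the separating hyperplane, i.e. which side a new point falls on and how far from the boundary it is.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.5], [1.0, 1.0], [0.5, 0.0],   # one "colour"
              [3.0, 3.5], [4.0, 3.0], [3.5, 4.5]])  # the other "colour"
y = np.array([0, 0, 0, 1, 1, 1])

# C controls the slack: large C = less slack (harder margin),
# small C = more slack (softer, wider margin).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Signed distance from the separating hyperplane: the sign tells us the
# predicted colour, the magnitude tells us how far from the boundary it is.
new_points = np.array([[2.0, 2.0], [0.2, 0.2]])
print(clf.decision_function(new_points))
print(clf.predict(new_points))

# The margin is defined only by the points closest to it, the support vectors:
print(clf.support_vectors_)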

Data is rarely 2D or 3D; it is usually multi-dimensional, multi-class, noisy and non-linearly separable, which means that the points with different class labels or colours cannot be separated simply by drawing a line or a plane. Instead they may be separable by some curve or curved hyper-surface in some dimension. Our data rows can be called vectors or, as in some formulas, transposed vectors, which are one-row matrices. In higher dimensions the hyperplane is no longer a nice flat geometrical object; it is expressed as a formula or a function, which is created by measuring distances between vectors while taking into account each vector's colour or class label. In SVM the geometric problem is transformed into a Lagrangian formulation, which needs two parts: the main function determining the rule, i.e. the hyperplane, and the constraints relying on the support vectors, which help determine how the hyperplane is oriented. Lagrange multipliers allow finding minimum or maximum values, and in this case the function that represents the hyperplane furthest from each class. In the non-linearly separable case, the function used to measure the distance between points might not be the same as in the linearly separable case. In SVM these distance formulas are called kernels. Kernels allow linear separation by mapping points into higher dimensions. "SVM kernel framework accommodates in a straightforward fashion many different types of data—vectors, strings, trees, graphs" (W. Noble, 2003, p24). Higher-dimensional kernels can simply be thought of as measurements of how similar two vectors are.
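To make the "kernel as a similarity measurement" point concrete, here is a small sketch assuming scikit-learn; the vectors and the ring-shaped toy data are invented for illustration. The same RBF kernel that a non-linear SVM uses internally can be evaluated directly between two vectors to get their similarity score.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# A kernel is just a function scoring how similar two vectors are;
# the RBF kernel approaches 1 for nearby vectors and 0 for distant ones.
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[1.1, 2.1, 2.9]])
print(rbf_kernel(a, b, gamma=0.5))

# The same kernel lets an SVM separate data that no straight line could:
# here, points inside a circle of radius 1 versus points outside it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))  # typically fits this non-linear boundary well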

SVM results by various authors

Studies have been done comparing multiple machine learning techniques on biological data, and SVM has been shown to be an intuitive-to-use, high-precision algorithm. Chinnasamy et al. (2003) discussed other studies where SVM performs better but has a high number of false positives, saying that "from all these studies it is evident that among all the prediction methods, SVM performs better" (A. Chinnasamy et al., 2003, p3) and explaining that "in SVM, as the number of classifiers is high, reading the distances between hyperplane and the classes are very difficult" (A. Chinnasamy et al., 2003, p11). Noble (2003) said that SVM can be used "to learn to differentiate true from false positives" (W. Noble, 2003, p23). Furlanello et al. (2005) used SVM for molecular profiling, outlier detection and classification of samples to reduce the computation required. Ding et al. (2000) compared SVM and neural networks for predicting protein folds, and their results indicate much higher accuracy and faster processing with SVM than with NN. SVM is one of the most precise techniques and is a "very robust methodology for inference with minimal parameter choice" (K. Bennett et al., 2000, p1), because you only have to choose the kernel and the slack variable value. Choosing the correct kernel might not be easy, because it is not always clear what dimension or measurement of similarity is the most appropriate for the data.
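Since the kernel and the slack value (C) are effectively the only parameters to choose, one common approach is simply to cross-validate over a small grid of candidates. A hedged sketch with scikit-learn follows; the candidate grid and the toy data are arbitrary and only stand in for real biological features.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # stand-in for real features
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)

param_grid = {
    "kernel": ["linear", "rbf", "poly"],  # which notion of similarity to use
    "C": [0.1, 1.0, 10.0],                # how much slack (misclassification) to allow
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Cross-validation like this does not remove the need for judgement about which similarity measure suits the data, but it at least makes the kernel and slack choices systematic rather than arbitrary.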
