Biotechnology and Machine Learning with SVM and LSS: Evaluation and Conclusion

Evaluation

Because the background research for this project was limited in scope, the assumption that there is a knowledge gap between experts and beginners in machine learning and biotechnology might not be as true as proposed. There are already plenty of books and journals on the topic, and for a reader with a strong desire to learn, lack of knowledge is not a big problem, because the gaps can be filled by reading them carefully. Even so, more intermediate guides to the general topic would allow shorter introductions and might attract more interest. The original idea was to implement an existing machine learning technique, SVM, in Java so that it could be understood more easily at the programming level. Instead, a distinct interpretation emerged, together with unexpectedly good results. The failure to implement SVM in Java might therefore have led to a successful new way of explaining it. From another point of view, implementing something that already exists is a waste of time, but the focus of this project is to educate about the problem and explain it with simple examples. From the author's perspective, different views and interpretations of a problem are useful at least because they expand the ways we look at it, and sometimes they even help discover solutions.

As seen in the results section, Weka-SVM had the best results on all datasets. Surprisingly, LSS had better overall results than R-SVM, but that does not necessarily mean that it works. On the other hand, in two cases LSS produced exactly the same predictions as R-SVM on the same data, which hopefully indicates that it actually works. Looking back at the diagrams and ideas proposed for LSS, they suggest that the approach should work. However, the predictions differ each time the data is loaded, which might indicate that the Java implementation of LSS does not learn independently of how the data is sorted.
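One way to test the suspicion about sort order is to train the same learner on the same rows in several different orders and compare the predictions. The following is a minimal sketch in Java, not the LSS code itself: the `Learner` interface, the two toy learners, and the convention that the last column of a row holds the class label are all assumptions made for illustration. Rotations are used instead of random shuffles so that the check is reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class OrderSensitivityCheck {

    /** Hypothetical learner: takes rows whose last column is the class label, returns a predictor. */
    interface Learner {
        Function<double[], Integer> train(List<double[]> rows);
    }

    /** Fraction of rotated training orders whose test predictions differ from the original order. */
    static double fractionOfOrdersThatDiffer(Learner learner, List<double[]> trainRows, List<double[]> testRows) {
        int[] baseline = predictAll(learner.train(trainRows), testRows);
        int orders = trainRows.size() - 1;
        int differing = 0;
        for (int shift = 1; shift <= orders; shift++) {
            List<double[]> rotated = new ArrayList<>(trainRows);
            Collections.rotate(rotated, shift); // same rows, different order
            int[] predictions = predictAll(learner.train(rotated), testRows);
            if (!Arrays.equals(baseline, predictions)) differing++;
        }
        return orders <= 0 ? 0.0 : (double) differing / orders;
    }

    static int[] predictAll(Function<double[], Integer> model, List<double[]> rows) {
        int[] out = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) out[i] = model.apply(rows.get(i));
        return out;
    }

    static int labelOf(double[] row) {
        return (int) row[row.length - 1];
    }

    /** Toy learner that ignores order: always predicts the most frequent label (ties go to the smaller label). */
    static Learner majorityLearner() {
        return rows -> {
            Map<Integer, Integer> counts = new TreeMap<>();
            for (double[] row : rows) counts.merge(labelOf(row), 1, Integer::sum);
            int best = -1;
            int bestCount = -1;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            final int majority = best;
            return query -> majority;
        };
    }

    /** Toy learner that depends on order: always predicts the label of the first row it saw. */
    static Learner firstRowLearner() {
        return rows -> {
            final int first = labelOf(rows.get(0));
            return query -> first;
        };
    }

    public static void main(String[] args) {
        // Made-up rows: one feature followed by the class label.
        List<double[]> train = Arrays.asList(
                new double[]{0.1, 0}, new double[]{0.2, 0}, new double[]{0.3, 0},
                new double[]{0.8, 1}, new double[]{0.9, 1});
        List<double[]> test = Arrays.asList(new double[]{0.15, 0}, new double[]{0.85, 1});
        System.out.println("majority learner:  " + fractionOfOrdersThatDiffer(majorityLearner(), train, test));
        System.out.println("first-row learner: " + fractionOfOrdersThatDiffer(firstRowLearner(), train, test));
    }
}
```

If the real LSS training routine were wrapped behind `Learner`, any result above 0.0 would confirm that it depends on row order. A result of 0.0 on rotations alone would not prove independence, so a fuller check might also try random shuffles.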

Conclusion

The research started with a basic introduction to biology and machine learning, then expanded on machine learning techniques and the tools used for them in biotechnology. Various papers and their results with different algorithms were mentioned; in most of these papers SVM gave the best results. The explanation of SVM was then gradually reduced to the author's own interpretation, which was tested with biological data and compared with the existing SVM implementations in Weka and R. The results were unexpectedly good, but it is unclear exactly why. Either the idea is good, or the implementation is buggy and the diagrams would not work as expected in higher dimensions.

One of the biggest problems in drug discovery is the lack of time. Some of it could be compensated for if more people worked on the problem, and more people could work on it if they understood it. The proposed solution was to educate about biotechnology and machine learning. One way to do this is to introduce the topic, guide readers through the main ideas, and show examples and possible interpretations of existing solutions. In the future, interpretations of previous work might even become full solutions to unsolved problems.

The LSS results were unexpectedly good, possibly because of how the data is loaded and how learning depends on the way it is sorted. A different implementation of the LSS idea could therefore produce different results.

In terms of efficiency, LSS could be improved by rewriting it in a language better suited to machine learning and by using hash maps of hash codes instead of full copies of array lists and some of the for loops. It could also be improved by making it run from a command-line interface. In terms of understandability, the code could be simplified further by giving methods more descriptive names and adding more comments. It could also be written in JavaScript to make it more web based and simpler to learn, because JavaScript is known even by web developers who do not know a lot about programming.

Another point, or discovery, of LSS was to show that mathematical ideas can be interpreted differently while still being used as they are. Examples in LSS include the distance measure renamed to similarity, support vectors renamed to "closest", and spheres with radii instead of spaces separated by hyperplanes.
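To make the sphere picture more concrete, here is one possible reading of it as a small Java sketch. It is not the LSS implementation from this project, and the rules it uses are assumptions chosen for illustration: every training point becomes the center of a sphere, the radius is the distance to the closest point of another class, and a query takes the label of the sphere it sits deepest inside (smallest distance divided by radius). The class and method names are hypothetical.

```java
class SphereSketch {
    private final double[][] centers;
    private final int[] labels;
    private final double[] radii;

    private SphereSketch(double[][] centers, int[] labels, double[] radii) {
        this.centers = centers;
        this.labels = labels;
        this.radii = radii;
    }

    /** Euclidean distance; a smaller distance is read as a higher similarity. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Assumed training rule: radius = distance to the closest point of a different class. */
    static SphereSketch train(double[][] points, int[] labels) {
        if (points.length == 0 || points.length != labels.length) {
            throw new IllegalArgumentException("need at least one point and one label per point");
        }
        double[] radii = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            double closest = Double.POSITIVE_INFINITY;
            for (int j = 0; j < points.length; j++) {
                if (labels[j] != labels[i]) {
                    closest = Math.min(closest, distance(points[i], points[j]));
                }
            }
            radii[i] = closest;
        }
        return new SphereSketch(points, labels, radii);
    }

    /** Assumed prediction rule: label of the sphere with the smallest distance-to-radius ratio. */
    int predict(double[] query) {
        int best = 0;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance(centers[i], query);
            // A zero radius means two classes share a point; such a sphere is skipped.
            double score = radii[i] > 0 ? d / radii[i] : Double.POSITIVE_INFINITY;
            if (score < bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Made-up two-dimensional points: class 0 on the left, class 1 on the right.
        double[][] points = {{0, 0}, {1, 0}, {4, 0}, {5, 0}};
        int[] labels = {0, 0, 1, 1};
        SphereSketch model = train(points, labels);
        System.out.println(model.predict(new double[]{0.5, 0}));
        System.out.println(model.predict(new double[]{4.5, 0}));
    }
}
```

Ties in this sketch are resolved in favor of the earlier training point, so it is order dependent in exactly those cases. That is the kind of detail that could produce different predictions when the data is loaded in a different order.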

The literature review showed that SVM is one of the best techniques, especially when datasets are not very large, but it also showed that biologists still want higher precision, so even the best algorithms do not seem very useful at first sight. Even though the precision might be as low as 80-90%, it can still help reduce the search space of candidates worth testing experimentally. The reduction could be improved further by treating it as data in its own right, with attributes such as how precise the machine was compared with the real experiment and how useful the reduced search space turned out to be. Reducing the search space works as follows: at the beginning a scientist has thousands of experiments to do, but uses previously completed experiments with known results to train the machine. The machine can then tell the scientist which of these thousands of experiments are worth investigating, and later the scientist can pick random samples and test how useful the machine's suggestions were.
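The workflow in this paragraph can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the scoring function stands in for a trained model, the `experimentConfirms` predicate stands in for a real laboratory experiment, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

class SearchSpaceReduction {

    /** Keeps the `keep` candidates with the highest model score. */
    static <T> List<T> shortlist(List<T> candidates, ToDoubleFunction<T> score, int keep) {
        List<T> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(keep, sorted.size())));
    }

    /** Fraction of a random sample of the shortlist that a real experiment confirms. */
    static <T> double sampledHitRate(List<T> shortlist, Predicate<T> experimentConfirms,
                                     int sampleSize, long seed) {
        List<T> shuffled = new ArrayList<>(shortlist);
        Collections.shuffle(shuffled, new Random(seed));
        int n = Math.min(sampleSize, shuffled.size());
        if (n == 0) return 0.0;
        int hits = 0;
        for (T candidate : shuffled.subList(0, n)) {
            if (experimentConfirms.test(candidate)) hits++;
        }
        return (double) hits / n;
    }

    public static void main(String[] args) {
        // Made-up model scores for 1000 candidate experiments.
        List<Double> predictedActivity = new ArrayList<>();
        for (int i = 0; i < 1000; i++) predictedActivity.add(i / 1000.0);

        List<Double> top = shortlist(predictedActivity, Double::doubleValue, 50);

        // Stand-in for a real experiment: here anything scored above 0.96 "works".
        double hitRate = sampledHitRate(top, s -> s > 0.96, 10, 42L);
        System.out.println("shortlist size: " + top.size() + ", sampled hit rate: " + hitRate);
    }
}
```

The sampled hit rate is the kind of attribute the paragraph suggests recording: stored alongside each shortlist, it would show over time how useful the machine's suggestions actually were.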

From the implementation and explanation of this SVM interpretation, it could be concluded that it does not really matter whether the data is biological or not. All the algorithm might care about is how big, how noisy, and how complex or simple the data is, not where it came from or what it means to those who use it.

The purpose of this project is to educate about how machine learning works and how it could help in biotechnology. It could be developed further, even into an introductory online guide. LSS itself could become more than an educational tool for showing how machine learning techniques can be interpreted and used, because it may have the potential to become a useful machine learning technique.
