Biotechnology and Machine Learning with SVM and LSS: Methodology (Programming Language, Data, Comparison)

Programming Language

When choosing a language for implementing Least Similar Sphere (LSS), the obvious candidates include "AI programming languages such as Lisp, Prolog" (J. Fox et al., 1994, p. 288) and the more mathematically oriented MATLAB. Because these languages emphasise efficiency and optimal solutions, they tend to produce functional or procedural code that is very difficult to read. The author's aim is to educate about machine learning, so the code needs to be easy to read and understand. That narrowed the choice to object-oriented languages such as C++ and Java. Java was chosen because the author had previous experience with it and lacked the time to learn C++.

Data for machine

Since the main topic is ligand and receptor interaction, or bioassay experiments, the data had to be as relevant as possible. Searching online led to one large bioassay dataset by A. Schierz (2009). No others were found, possibly because most such data is still being used in research and is not readily available in bulk. Further attempts to obtain biological data led to discussions with specialists in this area, which ended in the conclusion that the author would have to collect his own data manually from databases like PubMed, using protein docking software like PyRx. The author attempted to use PyRx, but the results were poor, so he went with plan B for the dataset: searching PubMed for a few proteins or receptors that have many known ligands binding to them. After reviewing a few hundred compounds, the author found proteins with up to 10 known ligands, but did not use all of the ligands as data, because some of them were just common ions like magnesium.
The protein ID was chosen as the class label, and each unique ligand was represented as a vector. The author draws on information theory and interprets a ligand's molecular formula as information that already contains its 3D structure without being folded. From the author's point of view, if a protein can fold from a sequence, then that sequence contains the information on how it should fold and therefore indirectly describes its 3D structure. That led the author to a naïve way of representing molecules using their mass in g/mol, hydrogen count, carbon count, carbon count upwards, carbon count downwards, oxygen count, oxygen bond count, oxygen count upwards, oxygen count downwards, nitrogen count, nitrogen bond count, nitrogen count upwards, nitrogen count downwards, sulphur count, sulphur bond count, phosphorus count and phosphorus bond count. In the dataset these are named "MassGmol, HCount, CCount, CUp, CDown, OCount, OBonds, OUp, ODown, NCount, NBonds, NUp, NDown, SCount, SBonds, PCount, PBonds, BondsToProtein".
Because these biological datasets might not show clearly how well the machine works, a second dataset, the Iris flower dataset by Marshall M. (2016), was used to further test the precision of the machine.
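To make the representation concrete, here is a minimal Java sketch of how one ligand could be held in memory, with the protein ID as the class label and the named columns as a numeric vector. The class name LigandVector, the parse and get methods, and the comma-separated row layout (features first, protein ID last) are assumptions made for illustration, not the author's actual code. The sample row uses made-up placeholder values, not a real ligand.

```java
/** Hypothetical container for one ligand row; names and layout are illustrative assumptions. */
final class LigandVector {
    // Column names exactly as they are listed for the dataset.
    static final String[] FEATURE_NAMES = {
        "MassGmol", "HCount", "CCount", "CUp", "CDown",
        "OCount", "OBonds", "OUp", "ODown",
        "NCount", "NBonds", "NUp", "NDown",
        "SCount", "SBonds", "PCount", "PBonds", "BondsToProtein"
    };

    final String proteinId;   // class label
    final double[] features;  // one value per entry in FEATURE_NAMES

    LigandVector(String proteinId, double[] features) {
        if (features.length != FEATURE_NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + FEATURE_NAMES.length + " features, got " + features.length);
        }
        this.proteinId = proteinId;
        this.features = features.clone(); // defensive copy
    }

    /** Parses one comma-separated row: all features, then the protein ID (assumed layout). */
    static LigandVector parse(String csvRow) {
        String[] cells = csvRow.trim().split("\\s*,\\s*");
        if (cells.length != FEATURE_NAMES.length + 1) {
            throw new IllegalArgumentException("unexpected number of columns: " + cells.length);
        }
        double[] values = new double[FEATURE_NAMES.length];
        for (int i = 0; i < values.length; i++) {
            values[i] = Double.parseDouble(cells[i]);
        }
        return new LigandVector(cells[FEATURE_NAMES.length], values);
    }

    /** Looks up a feature value by its column name. */
    double get(String name) {
        for (int i = 0; i < FEATURE_NAMES.length; i++) {
            if (FEATURE_NAMES[i].equals(name)) {
                return features[i];
            }
        }
        throw new IllegalArgumentException("unknown feature: " + name);
    }

    public static void main(String[] args) {
        // Placeholder values only; "P1" stands in for a protein ID.
        LigandVector v = parse("100.0, 8, 6, 1, 2, 3, 4, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, P1");
        System.out.println(v.proteinId + " OBonds=" + v.get("OBonds"));
    }
}
```

Keeping the column names in one array means the vector length is checked against the dataset header in a single place, which helps when rows are collected by hand.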

Prediction, results and comparison

While the machine could produce some results, they would be difficult to evaluate without something to compare them against. That is why the author chose popular open-source machine learning software, Weka and R, to see how an established SVM performs on the collected data and how it compares with the author's own implementation.
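One simple way to put the two sets of predictions on a common footing is classification accuracy, sketched below in Java. The choice of metric is an assumption, since the text does not say which measure is used. The sketch also assumes that the predictions from the reference tool and from the author's machine have already been exported as arrays of class labels. The class name AccuracyComparison and the labels in main are placeholders, not results.

```java
/** Hypothetical helper for comparing two classifiers on the same test rows. */
final class AccuracyComparison {

    /** Fraction of positions where the predicted label equals the actual label. */
    static double accuracy(String[] actual, String[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        // Placeholder labels only; these are not measured results.
        String[] actual        = {"A", "A", "B", "B"};
        String[] fromReference = {"A", "A", "B", "A"}; // e.g. exported from Weka or R
        String[] fromOwnCode   = {"A", "B", "B", "A"}; // e.g. produced by the LSS machine

        System.out.println("reference accuracy: " + accuracy(actual, fromReference));
        System.out.println("own accuracy:       " + accuracy(actual, fromOwnCode));
    }
}
```

Both classifiers have to be scored on the same held-out rows for the two numbers to be comparable.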
