Programming Language
Questions arise about which type of language to use for implementing Least Similar
Sphere (LSS), given the existence of “AI
programming languages such as Lisp, Prolog” (J. Fox et al., 1994, p. 288) and
the more mathematical MATLAB. Because these languages focus on effectiveness and
optimal solutions, they usually result in functional or procedural code that is
very difficult to read. The author's focus is on educating about machine learning,
so the code needs to be easy to read and understand. That narrows the choice to
object-oriented languages such as C++ and Java. Java was chosen
because of the author’s previous experience with it and the lack of time to learn
C++.
Data for machine
Since the main topic is ligand and receptor interaction, or bioassay experiments,
the data chosen had to be as relevant as possible. Searching online led to
one large bioassay dataset by A. Schierz (2009). No others could be found, possibly
because most such data is still being used in research and is not easily available
in bulk. Further attempts to obtain biological data led to discussions
with specialists in this area, which ended in the conclusion that the author would
have to collect his own data manually from databases like PubMed, using protein
docking software like PyRx. The author attempted to use PyRx but got poor results,
and decided to go with plan B for the dataset. That meant searching PubMed for a few
proteins or receptors that have many known ligands binding to them. After reviewing
a few hundred compounds, the author found proteins with up to 10 known ligands, but
did not use all of the ligands as data, because some of them were just common ions
like magnesium. The protein ID was then chosen to be the class label, while different
unique ligands were represented as vectors. The author draws on the idea of information
theory and interprets a ligand's molecular formula as information that already
contains its 3D structure without being folded. From the author’s point of view, if
a protein can fold from a sequence, then that sequence contains the information on how
it should fold and therefore indirectly describes its 3D structure. That led
the author to a naïve way of representing molecules, using their mass in g/mol,
hydrogen count, carbon count, carbon count upwards, carbon count downwards,
oxygen count, oxygen bond count, oxygen count upwards, oxygen count downwards,
nitrogen count, nitrogen bond count, nitrogen count upwards, nitrogen count
downwards, sulphur count, sulphur bond count, phosphorus count and phosphorus bond
count, which in the dataset are named “MassGmol,
HCount, CCount, CUp, CDown, OCount, OBonds, OUp, ODown, NCount, NBonds, NUp, NDown,
SCount, SBonds, PCount, PBonds, BondsToProtein”. Given that these
biological datasets might not necessarily produce results that describe well
how the machine performs, another dataset, the Iris flower dataset by Marshall M. (2016),
was used to further test the precision of the machine.
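To make the ligand representation above more concrete, here is a minimal Java sketch of how one dataset row could be held in memory. It is only an illustration under stated assumptions: the class name `LigandSample`, the method `parseCsvRow` and the row layout (18 numeric columns followed by the protein ID) are hypothetical and are not taken from the author's code. Only the column names come from the dataset header quoted above.

```java
import java.util.List;

/** Hypothetical container for one ligand row; not the author's actual class. */
final class LigandSample {
    // Column names in the order quoted from the dataset header.
    static final List<String> FEATURE_NAMES = List.of(
            "MassGmol", "HCount", "CCount", "CUp", "CDown",
            "OCount", "OBonds", "OUp", "ODown",
            "NCount", "NBonds", "NUp", "NDown",
            "SCount", "SBonds", "PCount", "PBonds", "BondsToProtein");

    final String proteinId;  // class label
    final double[] features; // one value per entry in FEATURE_NAMES

    LigandSample(String proteinId, double[] features) {
        if (features.length != FEATURE_NAMES.size()) {
            throw new IllegalArgumentException(
                    "Expected " + FEATURE_NAMES.size() + " features, got " + features.length);
        }
        this.proteinId = proteinId;
        this.features = features.clone(); // defensive copy
    }

    /**
     * Parses one comma-separated row.
     * Assumed layout: the 18 feature values first, the protein ID last.
     */
    static LigandSample parseCsvRow(String row) {
        String[] cells = row.trim().split("\\s*,\\s*");
        if (cells.length != FEATURE_NAMES.size() + 1) {
            throw new IllegalArgumentException("Unexpected column count: " + cells.length);
        }
        double[] values = new double[FEATURE_NAMES.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = Double.parseDouble(cells[i]);
        }
        return new LigandSample(cells[values.length], values);
    }

    /** Looks up a feature value by its column name. */
    double get(String name) {
        int index = FEATURE_NAMES.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown feature: " + name);
        }
        return features[index];
    }
}
```

With a class like this, a call such as `LigandSample.parseCsvRow(line).get("MassGmol")` would return the mass column of a row, and `proteinId` would serve as the class label passed to the classifier.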
Prediction, results and comparison
While the machine could produce some results, they might be difficult to evaluate
without something to compare them to. That is why the author chose popular open source
machine learning software, Weka and R, to see how a real SVM works with the
collected data and how it compares to the author's interpreted implementation.
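One simple way to put such a comparison on the same footing is to score both sets of predictions against the same true labels. The sketch below assumes the predictions from each tool have already been exported as plain arrays of class labels. The class and method names are hypothetical, and this is not necessarily the measure the author used.

```java
/** Hypothetical helper for comparing classifier outputs; names are illustrative. */
final class PredictionComparison {

    /** Fraction of positions at which the two label arrays are equal. */
    private static double matchingFraction(String[] a, String[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i])) {
                matches++;
            }
        }
        return (double) matches / a.length;
    }

    /** Accuracy: share of predictions that equal the true class label. */
    static double accuracy(String[] actual, String[] predicted) {
        return matchingFraction(actual, predicted);
    }

    /** Agreement: share of samples on which two classifiers predict the same label. */
    static double agreement(String[] predictedA, String[] predictedB) {
        return matchingFraction(predictedA, predictedB);
    }
}
```

Calling `accuracy` once with the custom implementation's predictions and once with the reference SVM's predictions gives two numbers that can be compared directly, while `agreement` shows how often the two classifiers make the same decision regardless of whether it is correct.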