Biotechnology and Machine Learning with SVM and LSS: The Introduction to Biotechnology and Machine Learning

Introduction

Curiosity and the desire to discover new things are basic human traits. What we set out to discover can be framed as problems in need of solutions. The more problems are solved, the more can be achieved, and technology helps by saving time and effort. Preventing sickness and disease would allow more minds to join the progress of discovery. “Drugs are essential for the prevention and treatment of disease” (S. Mandal et al., 2009, p90), and their development is one of the most time- and resource-consuming problems there is. As more sciences became involved in this problem, new mixed fields arose, such as biophysics, biochemistry, biotechnology and, more recently, bioinformatics, which emerged because “the enormous amount of data gathered by biologists—and the need to interpret it—requires tools that are in the realm of computer science” (J. Cohen, 2004, p123). Bioinformatics is one of the fields in which different sciences collaborate to handle a growing amount of data. “This area has arisen from the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research—and its more recent counterparts, proteomics and functional genomics” (J. Cohen, 2004, p122). The field uses computer science and technology to automate parts of biological research and to reduce the time and resources that discovery requires. The rise of this science and of big data led to statistics that grew ever more sophisticated and ever more complicated. Data mining and machine learning emerged as a response to the difficulties that statistics faces in multi-dimensional spaces created by large numbers of attributes.
Machine learning grew out of combining programming with a range of mathematical ideas: regression (fitting a function that describes points on a graph), clustering (finding distinct groups), ideas taken from the way humans think (decision trees, the Apriori algorithm) and ideas taken from how thinking works in biology (neural networks). Further improvements and combinations of these ideas led to Support Vector Machines and Bayesian networks. The most recent methods combine several techniques at once. In machine learning “each area involves one or more reasoning problems for which significant expertise exists in the AI community, such as simulation, planning, redesign, diagnosis, and learning” (D. Karp et al., 1994, p8). Machine learning tools are used in bioinformatics, and their use and improvement “will have numerous benefits such as efficiency, cost effectiveness, time saving” (S. Mandal et al., 2009, p90).
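
As a rough illustration of the regression idea mentioned above (fitting a function that describes points on a graph), the following minimal Python sketch fits a straight line by ordinary least squares. The function names fit_line and predict_y and the sample points are made up for illustration and are not taken from any of the cited works.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means a vertical line, which has no slope.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_y(model, x):
    """Evaluate the fitted line at a new x value."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical sample points that lie exactly on y = 2x + 1.
    model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(model, predict_y(model, 10))
```

The fitted line is the "learned" part; calling predict_y on a new x value is the prediction step that more elaborate methods share in spirit.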

Drug Development

Drug engineering, like software engineering, has its own methodologies. Its top-level stages are called the “‘Discovery’, ‘Development’ and ‘Registration’ phases. The ‘Discovery’ phase, routinely three to four years, involves identification of new therapeutic targets, lead finding and prioritisation, lead optimisation and nomination of new chemical entities (NCEs)” (J. Wang et al., 2004, p73). Discovering new drugs is a very large and costly process, and “to address this issue, several multidisciplinary approaches are required for the process of drug development, including structural biology, computational chemistry, and information technology, which collectively form the basis of rational drug design” (Y. Wang et al., 2015, p489). The discovery phase is where biotechnology and bioinformatics are used most. Within this phase, candidate drugs are assessed against a set of criteria known as ADME, which stands for “absorption, distribution, metabolism, and excretion” (S. K. Balani et al., 2005, p1). Another technique in drug discovery focuses on the 3D structures of biomolecules and is called structure-based drug design, or SBDD. “SBDD provides insight in the interaction of a specific protein-ligand pair, allowing medicinal chemists to devise highly accurate chemical modifications around the ligand scaffold” (V. Lounnas et al., 2013, p1). Together these criteria and techniques improve drug discovery, because it is not enough for a drug to be effective; it must also have low toxicity (the T in ADME/T stands for toxicity). In practice, a drug can be effective yet fail to satisfy all the ADME/T requirements, in which case it is not released for medical use. “Investigation of terminated projects revealed that the primary cause for drug failure in the development phase was the poor pharmacokinetic and ADMET (ADME+Toxicity) properties rather than unsatisfactory efficacy” (J. Wang et al., 2004, p73).
There are various drug design techniques, including structure-based drug design, target-based drug design and the more recent rational drug design, which involves multiple disciplines and techniques. It “can be applied to develop drugs to treat a wide variety of diseases and can also be used for designing drugs for disease prevention” (S. Mandal et al., 2009, p90). Drug discovery is the beginning of drug development and involves identifying the problem, or drug target, which “is a biomolecule which is involved in signaling or metabolic pathways that are specific to a disease process” (S. Mandal et al., 2009, p90). Identifying which molecules, or ligands, can bind to the receptors of the target is one of the most complicated parts of drug discovery.

Cell Biology

Life forms are made of cells, and “cells are complex molecular machines contained within phospholipid membranes that isolate a unique chemical environment” (E. Yoruk et al., 2011, p1). These machines are built from molecules of different sizes: large proteins, medium-sized peptides, smaller amino acids and very small molecules such as H2O. Each large molecule is made of atoms held together by interatomic forces. “The atomic force field model describes physical systems as collections of atoms kept together by interatomic forces” (J. Meller, 2001, p2). Proteins are machines too, and “are made of long chains of amino acids which in their natural environment (in solution) fold up into simple "secondary" structures, like helices, and then by further folding into higher-order structures” (J. Fox et al., 1994, p290). While chains of more than 50 amino acids are called proteins, shorter chains are the “short strings of amino acids, called peptides” (W. Noble, 2003, p23). “Peptides can be considered to be up to 50 amino acids in length, with proteins being larger than this” (C. Walle, 2011, p4). And “there are twenty common amino acids” (F. Altschul, 2011, p8). A cell's communication can be divided into two parts, the inside and the outside. Both involve sequences of bio-reactions. “The metabolism of a cell is the set of bioreactions that its enzymes can catalyse and such a sequence of reactions is called a pathway” (D. Karp et al., 1994, p7). These pathways resemble constantly moving, floating queues of large and small molecules interacting with each other. They are called biochemical pathways “and, in reality, metabolic, signaling, and regulatory pathways interact and intersect in the course of cellular growth and activity” (R. Gostner et al., 2014, p16:2). The inside consists of biochemical pathways, which allow communication between the smaller inner organelles. “Living cells are complex systems whose growth and existence depends on thousands of biochemical reactions” (D. Karp et al., 1994, p2).
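
The length-based convention quoted above can be written down as a small Python sketch. It assumes the standard one-letter codes for the twenty common amino acids and treats chains of up to 50 residues as peptides, following the C. Walle quotation. The cutoff is a convention rather than a hard biological boundary, and the names classify_chain and PEPTIDE_MAX_LENGTH are hypothetical.

```python
# Standard one-letter codes for the twenty common amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Assumed cutoff, taken from the quotation "up to 50 amino acids in length".
PEPTIDE_MAX_LENGTH = 50


def classify_chain(sequence):
    """Label a one-letter amino acid sequence by its length alone."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    # Reject characters that are not one of the twenty common residues.
    unknown = sorted(set(seq) - AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unknown residue codes: {unknown}")
    if len(seq) == 1:
        return "amino acid"
    if len(seq) <= PEPTIDE_MAX_LENGTH:
        return "peptide"
    return "protein"


if __name__ == "__main__":
    print(classify_chain("G"))         # a single residue
    print(classify_chain("ACDEFGHIK")) # a short chain
    print(classify_chain("A" * 120))   # a long chain
```

Length is only the crudest possible descriptor; it says nothing about folding or function, which is where the structural questions discussed in this section come in.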
Many biochemical reactions need help from catalysts in order to happen, and these catalysts are called enzymes. Enzymes “facilitate the association of several molecules to form a complex, and it lowers the energy barrier required for the bond rearrangements that constitute a reaction” (D. Karp et al., 1994, p5). On the outside, receptors on the membrane’s surface and ligands are used to communicate with other cells. How strongly a ligand molecule binds to a receptor is called binding affinity. “Binding affinity represents the strength of association between the ligand and its receptor protein” (M. Ashtawy et al., 2012, p1301), which is why “the recognition of signal peptides is important for the development of new drugs” (W. Noble, 2003, p12) and “finding the structure and the fold of a protein is very important since it helps to understand the functions” (A. Chinnasamy et al., 2003, p1).

All of this, and more, has to be taken into consideration when trying to locate a disease or a problem within cells. The complexity is so great that even today’s computing power is not enough. Finding similar protein structures can reduce the size of the problem, and machine learning can help recognize them.

Machine Learning

A multi-disciplinary approach may increase the reliability and effectiveness of problem solving, but it also makes the whole harder to grasp. The growing amount of data raises the question of how much more must be done in the same amount of time in order to expand knowledge and reach solutions. “In the era of big data, a necessary goal is the ability to use rapidly accumulating data to pinpoint potential ADME/T issues before entering late-stage development” (Y. Wang et al., 2015, p508). Big data also gave rise to new techniques for managing it, because mathematical statistics was becoming too complicated. That led to various machine learning approaches, which learn from large amounts of data and then predict the outcome for new, unseen data. “Computational analysis of biological data obtained in genome sequencing and other projects is essential for understanding cellular function and the discovery of new drugs and therapies” (C. Ding et al., 2000, p349). Hope is not lost, given how successful machine learning has been and still is. In the early years scientists already trusted that computers would help with big problems, saying that “it is now generally accepted that modern molecular biology research needs many different types of software to support the management, analysis and interpretation of data” (J. Fox et al., 1994, p287), and in later years they still hoped that “progress in reliable computational methods is greatly anticipated” (M. Blaszczyk et al., 2015).
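
The learn-then-predict pattern described above can be sketched in a few lines of Python. This is a nearest-centroid classifier, one of the simplest possible learners, chosen here for brevity. It is not one of the methods used by the cited authors, and the two numeric features and the labels "binder" and "non-binder" are invented toy data.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """Learn one centroid (mean feature vector) per label.

    samples is a list of (feature_vector, label) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict_label(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


if __name__ == "__main__":
    # Hypothetical training data: two made-up descriptors per molecule.
    training = [
        ([1.0, 1.0], "binder"),
        ([1.2, 0.8], "binder"),
        ([4.0, 4.2], "non-binder"),
        ([3.8, 4.0], "non-binder"),
    ]
    model = train_centroids(training)
    print(predict_label(model, [1.1, 0.9]))
    print(predict_label(model, [4.1, 3.9]))
```

Real biological data has far more attributes than two, which is exactly the multi-dimensional difficulty that motivates the more capable methods named in the introduction.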
