Last Updated on August 12, 2023 by Oluwajuwon Alvina
Are you an international student? Are you interested in learning more about Bioinformatics? Do you get overwhelmed by the amount of conflicting information you see online? If so, you need not search further because you will find the answer to that question in the article below.
An unprecedented wealth of biological data has been generated by the human genome project and sequencing projects in other organisms. The huge demand for analysis and interpretation of these data is being managed by the evolving science of bioinformatics. Bioinformatics is defined as the application of tools of computation and analysis to the capture and interpretation of biological data. It is an interdisciplinary field, which harnesses computer science, mathematics, physics, and biology. Bioinformatics is essential for management of data in modern biology and medicine. This paper describes the main tools of the bioinformatician and discusses how they are being used to interpret biological data and to further understanding of disease. The potential clinical applications of these data in drug discovery and development are also discussed.
- Bioinformatics is the application of tools of computation and analysis to the capture and interpretation of biological data
- Bioinformatics is essential for management of data in modern biology and medicine
- The bioinformatics toolbox includes computer software programs such as BLAST and Ensembl, which depend on the availability of the internet
- Analysis of genome sequence data, particularly the analysis of the human genome project, is one of the main achievements of bioinformatics to date
- Prospects in the field of bioinformatics include its future contribution to functional understanding of the human genome, leading to enhanced discovery of drug targets and individualised therapy
Interaction of disciplines that have contributed to the formation of bioinformatics
This article is based on personal experience in bioinformatics and on selected articles in recent issues of Nature Genetics, Nature Reviews Genetics, Nature Medicine, and Science. Key terms including bioinformatics, comparative and functional genomics, proteomics, microarray, disease, and medicine were used to search for relevant articles in the peer reviewed scientific literature.
Bioinformatics and its impact on genomics
Last year it was announced that the entire human genome had been mapped as a result of the efforts of the worldwide human genome project and a private genomic company. In recent years, the scientific world has also witnessed the completion of whole genome sequences of many other organisms. The analysis of the emerging genomic sequence data and the human genome project is a landmark achievement for bioinformatics.
A novel strategy for random sequencing of the whole genome (the so called “shot gun” technique) was used to sequence the genome of Haemophilus influenzae in 1995. This was the very first complete genome of any free living organism to be sequenced. Other bacterial genomes, such as those of Mycoplasma genitalium and Mycobacterium tuberculosis, were sequenced soon after, and the sequence of the plague bacterium Yersinia pestis was recently completed. The sequence and annotation of the first eukaryotic genome, that of Saccharomyces cerevisiae (a yeast), was followed by those of other eukaryotic species such as Caenorhabditis elegans (a worm), Drosophila melanogaster (fruit fly), and Arabidopsis thaliana (mustard weed) (see fig A on bmj.com). Sequencing of several other species, including zebrafish, pufferfish, mouse, rat, and non-human primates, is either under way or nearing completion by both private and public sequencing initiatives. The knowledge obtained from these sequence data will have considerable implications for our understanding of biology and medicine. As a result of comparative genomic and proteomic research, we will soon be able to not only locate each human gene but also fully understand its function.
The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web. Anyone, from clinicians to molecular biologists, with access to the internet and relevant websites can now freely discover the composition of biological molecules such as nucleic acids and proteins by using basic bioinformatic tools. This does not imply that handling and analysis of raw genomic data can easily be carried out by all. Bioinformatics is an evolving discipline, and expert bioinformaticians now use complex software programs for retrieving, sorting out, analysing, predicting, and storing DNA and protein sequence data.
Large commercial enterprises such as pharmaceutical companies employ bioinformaticians to perform and maintain the large scale and complicated bioinformatic needs of these industries. With an ever-increasing need for constant input from bioinformatic experts, most biomedical laboratories may soon have their own in-house bioinformatician. The individual researcher, beyond a basic acquisition and analysis of simple data, would certainly need external bioinformatic advice for any complex analysis.
The growth of bioinformatics has been a global venture, creating computer networks that have allowed easy access to biological data and enabled the development of software programs for effortless analysis. Multiple international projects aimed at providing gene and protein databases are available freely to the whole scientific community via the internet.
The escalating amount of data from the genome projects has necessitated computer databases that feature rapid assimilation, usable formats and algorithm software programs for efficient management of biological data. Because of the diverse nature of emerging data, no single comprehensive database exists for accessing all this information. However, a growing number of databases that contain helpful information for clinicians and researchers are available. Information provided by most of these databases is free of charge to academics, although some sites require subscription and industrial users pay a licence fee for particular sites. Examples range from sites providing comprehensive descriptions of clinical disorders, listing disease susceptibility genetic mutations and polymorphisms, to those enabling a search for disease genes given a DNA sequence (box).
These databases include both “public” repositories of gene data as well as those developed by private companies. The easiest way to identify databases is by searching for bioinformatic tools and databases in any one of the commonly used search engines. Another way to identify bioinformatic sources is through database links and searchable indexes provided by one of the major public databases. For example, the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) provides the Entrez browser, which is an integrated database retrieval system that allows integration of DNA and protein sequence databases. The European Bioinformatics Institute archives gene and protein data from genome studies of all organisms, whereas Ensembl produces and maintains automatic annotation on eukaryotic genomes. The quality and reliability of databases vary; certainly some of the better known and more established ones, such as those above, are superior to others.
One of the simplest and better known search tools is called BLAST (basic local alignment search tool, at www.ncbi.nlm.nih.gov/BLAST/). This algorithm software is capable of searching databases for genes with similar nucleotide structure and allows comparison of an unknown DNA or amino acid sequence with hundreds or thousands of sequences from human or other organisms until a match is found. Databases of known sequences are thus used to identify similar sequences, which may be homologues of the query sequence. Homology implies that sequences may be related by divergence from a common ancestor or share common functional aspects. When a database is searched with a newly determined sequence (the query sequence), local alignment occurs between the query sequence and any similar sequence in the database. The results of the search are sorted in order of priority on the basis of maximum similarity. The sequence with the highest score in the database of known genes is the most likely homologue. If homologues or related molecules exist for a query sequence, then a newly discovered protein may be modelled and the gene product may be predicted without the need for further laboratory experiments.
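The idea of scoring local alignments and ranking database entries by similarity can be illustrated with a minimal Smith-Waterman sketch. This is not BLAST itself, which uses a faster heuristic; the function names, the scoring values (match +2, mismatch -1, gap -2), and the toy database are all assumptions chosen for illustration.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    cols = len(subject) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query, database):
    """Sort database entries by descending local alignment score."""
    scored = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


# Hypothetical database of known sequences.
TOY_DATABASE = {
    "seq_a": "TTGACGTACGTTAG",
    "seq_b": "CCCCCCCC",
    "seq_c": "ACGTTCGT",
}

hits = rank_hits("ACGTACGT", TOY_DATABASE)
```

The top entry in `hits` is the candidate homologue. A real search would also report how likely it is that a score of that size arises by chance.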
Since the completion of the first draft of the human genome, the emphasis has been changing from genes themselves to gene products. Functional genomics assigns functional relevance to genomic information. It is the study of genes, their resulting proteins, and the role played by the proteins.
Analysis and interpretation of biological data considers information not only at the level of the genome but also at the level of the proteome and the transcriptome. Proteomics is the analysis of the total amount of proteins (proteome) expressed by a cell, and transcriptomics refers to the analysis of the messenger RNA transcripts produced by a cell (transcriptome). DNA microarray technology determines the expression level of genes, and its applications also include genotyping and DNA sequencing. Gene expression arrays allow simultaneous analysis of the messenger RNA expression levels of thousands of genes in benign and malignant tumours, such as keloid and melanoma. Expression profiles classify tumours and provide potential therapeutic targets.
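One common way to compare expression levels between two samples is the log2 fold change. The sketch below is a simplified illustration only: the gene names, the expression values, the pseudocount of 1, and the threshold of 1.0 are assumptions, and real microarray analysis also involves normalisation and statistical testing.

```python
import math


def log2_fold_change(tumour, normal, pseudocount=1.0):
    """Log2 ratio of two expression levels; the pseudocount avoids division by zero."""
    return math.log2((tumour + pseudocount) / (normal + pseudocount))


def differentially_expressed(tumour_profile, normal_profile, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the threshold."""
    changes = {}
    for gene, tumour_level in tumour_profile.items():
        lfc = log2_fold_change(tumour_level, normal_profile[gene])
        if abs(lfc) >= threshold:
            changes[gene] = lfc
    return changes


# Hypothetical expression levels for three made-up genes.
tumour = {"gene1": 7.0, "gene2": 3.0, "gene3": 1.0}
normal = {"gene1": 1.0, "gene2": 3.0, "gene3": 7.0}

changed = differentially_expressed(tumour, normal)
```

Here `gene1` comes out as up-regulated and `gene3` as down-regulated in the tumour sample, while `gene2` is unchanged and is filtered out.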
Schematic diagram representing complexity of genomic data processing. Analysis and interpretation of biological data considers information at every level from the genome (total genetic content) to the proteome (total protein content) and transcriptome (total messenger RNA content) of the cell. The images numbered I-IV to the right of the diagram represent relevant examples of DNA (image I is base pair nucleotides); RNA (image II is a microarray showing levels of gene expression); and protein (image III is a structure of a single protein; image IV is a two dimensional gel electrophoresis showing separation of all proteins of a cell—each spot corresponds to a different protein chain)
Bioinformatic protein research draws on annotated protein and two dimensional electrophoresis databases. After separation, identification, and characterisation of a protein, the next challenge in bioinformatics is the prediction of its structure. Structural biologists also use bioinformatics to handle the vast and complex data from x ray crystallography, nuclear magnetic resonance, and electron microscopy investigations to create three dimensional models of molecules.
Bioinformatics is the best known and probably the largest application of computational biology. The field developed to complement the growing area of genetics in the biological sciences.
History: What led to the onset of bioinformatics?
In 1953 James Watson and Francis Crick correctly identified the structure of DNA, proposing that it is a winding double helix held together by pairs of bases (adenine paired with thymine and guanine paired with cytosine). Since then, the study of DNA has become a predominant area of biology. Scientists believed that investigating genes (segments of DNA) would lead to a much broader understanding of the workings of the human body and mind. Specifically, investigation of genes and their proteins provides information about cellular growth, communication, and organization, leading to an understanding of the complex biological signals and pathways within each cell. One tremendous benefit of such studies occurs in the area of medicine. The identification of genes that mark inherited diseases could enable doctors to warn their patients before the disease's onset, allowing individuals at risk for such ailments to take precautionary measures early in their lives. Eventually a deeper understanding of the workings of genes and proteins could allow scientists to re-engineer and replace such defective genes.
For most of the time since the investigation of DNA began, discovery was aimed at identifying one gene at a time. Within the last few years, however, scientists have mapped or sequenced entire small organisms such as yeast or bacteria. The entire genetic makeup of an organism is called the genome. Many more complete genome sequences are soon to be available, including the human genome, which is discussed later. Alongside the sequencing of such large quantities of genes, a significant increase in three-dimensional protein structure data has become available. Computational techniques and software development are required to access this information and allow researchers to use databases as research tools. Hence, the field of bioinformatics developed from the need to analyze large amounts of DNA sequences and protein structures.
What exactly is Bioinformatics?
The field involves development of new database methods to store genetic information, computational software methods to process it, applications that enable evaluation of experimental data, the improvement of molecular biological techniques to investigate genetic information, high-throughput techniques to gather genetic information, and combinatorial chemistry. Information from molecules, protein sequencing, and X-ray crystallography is entered into specific databases. This information is organized to build an information infrastructure (a large database).
Bioinformatics is a “scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation” that combines the tools of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data (NIH Publication No. 90-1590, April 1990). Powerful innovative software is combined with sophisticated database systems and automated biological research methods to create the science of bioinformatics. Another aspect of bioinformatics is the maintenance of such a large database.
The computational foundations of bioinformatics employ statistical analysis, algorithms, and database management systems. Searching a database of DNA or protein sequences relies on search algorithms to identify entries which share similarities with the query. The immense size and network of biological databases provides a resource to answer biological questions about mapping, gene patterns, molecular modeling, molecular evolution, and assistance in developing drugs targeted at fixing specific genes.
Thus, interaction with a biological database can lead to discovery through the analysis of existing data. Database interaction is defined as browsing, asking complex computational questions, submitting experimental data to public data banks, finding information about specific portions of a gene, executing complex queries, and organizing vast networks of information (Karp, 1996). Currently, there are four types of databases in both the public and private domain. A primary database contains one type of information such as DNA sequence data. Secondary databases contain information that is exclusively derived from other databases. Specialist databases, also called knowledge databases, contain information from expert input, other databases, and literature. Integrated databases are clustered or merged primary or secondary databases (Baker and Brass, 1998). Considerable time and effort is currently being devoted to the updating and engineering of databases.
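To make the idea of searching a sequence database for entries that share similarities with a query more concrete, here is a minimal sketch of a k-mer (short word) index. It is a simplified illustration rather than the method of any particular database; the function names, the word length of 3, and the toy sequences are assumptions.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(database, k=3):
    """Map each k-mer to the names of the database entries that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for word in kmers(seq, k):
            index[word].add(name)
    return index


def search(query, index, k=3):
    """Rank entries by the number of distinct k-mers shared with the query."""
    counts = defaultdict(int)
    for word in kmers(query, k):
        for name in index.get(word, ()):
            counts[name] += 1
    # Highest shared count first; ties are broken alphabetically by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical primary database of DNA sequences.
toy_db = {"x": "ACGTAC", "y": "GGGGGG", "z": "TTACGT"}
toy_index = build_index(toy_db)
results = search("ACGT", toy_index)
```

Building the index once and looking up words is what makes this kind of search fast on large databases; candidate hits found this way would normally be refined with a full alignment.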
Other applications of bioinformatics
Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of other important tasks, including analysis of gene variation and expression, analysis and prediction of gene and protein structure and function, prediction and detection of gene regulation networks, simulation environments for whole cell modelling, complex modelling of gene regulatory dynamics and networks, and presentation and analysis of molecular pathways in order to understand gene-disease interactions. Although on a smaller scale, simpler bioinformatic tasks valuable to the clinical researcher can vary from designing primers (short oligonucleotide sequences needed for DNA amplification in polymerase chain reaction experiments) to predicting the function of gene products.
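As a small illustration of the primer design task mentioned above, the sketch below computes two quantities often checked for a candidate primer: GC content and a rough melting temperature using the Wallace rule, Tm = 2(A+T) + 4(G+C) degrees Celsius, which is only an approximation for short oligonucleotides. The length and GC thresholds and the example primer are assumptions for illustration, not recommendations.

```python
def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_checks(primer, min_len=18, max_len=25, min_gc=0.4, max_gc=0.6):
    """Apply illustrative length and GC-content filters (assumed thresholds)."""
    return min_len <= len(primer) <= max_len and min_gc <= gc_content(primer) <= max_gc


candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-base primer
summary = (gc_content(candidate), wallace_tm(candidate), passes_basic_checks(candidate))
```

Dedicated primer design tools apply many more criteria, such as avoiding self-complementarity and matching the melting temperatures of a primer pair.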
Clinical application of bioinformatics
The clinical applications of bioinformatics can be viewed in the immediate, short, and long term. The human genome project plans to complete the human sequence by 2003, producing a database of all the variations in sequence that distinguish us all. The project could have considerable impact on people living in 2020: for example, a complete list of human gene products may provide new drugs, and gene therapy for single gene diseases may become routine.
Bioinformatics is an interdisciplinary field that develops and improves on methods for storing, retrieving, organizing and analyzing biological data. A major activity in bioinformatics is to develop software tools to generate useful biological knowledge.
Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and annotating genomes and their observed mutations. It plays a role in the textual mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools aid in the comparison of genetic and genomic data and more generally in the understanding of evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA, RNA, and protein structures as well as molecular interactions. Researchers affiliated with our program conduct research in systems biology, genomics, and proteomics.
Systems biology is an emerging approach applied to biomedical and biological scientific research. It is a biology-based interdisciplinary field of study that focuses on complex interactions within biological systems, taking a holistic rather than the more traditional reductionist approach to biological and biomedical research. Particularly from the year 2000 onwards, the concept has been used widely in the biosciences in a variety of contexts. One of the overarching aims of systems biology is to model and discover emergent properties: properties of cells, tissues and organisms functioning as a system whose theoretical description is only possible using techniques which fall under the remit of systems biology.
- Robert Zeller (SDSU Biology)
- Alan Calvitti (VA Hospital)
Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism). The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome’s networks.
Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. While traditional microbiology and microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods. Recent studies use “shotgun” Sanger sequencing or massively parallel pyrosequencing to get largely unbiased samples of all genes from all the members of the sampled communities. Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world.
- Forest Rohwer (SDSU Biology)
- Scott Kelley (SDSU Biology)
- Anca Segall (SDSU Biology)
- David Lipson (SDSU Biology)
Population genomics is the study of allele frequency distribution and change under the influence of the four main evolutionary processes: natural selection, genetic drift, mutation and gene flow on a genome-wide level. It also takes into account the factors of recombination, population subdivision and population structure. It attempts to explain such phenomena as adaptation and speciation.
- Andrew Bohonak
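The allele frequencies that population genomics studies can be computed directly from genotype data. The sketch below is a minimal illustration for a single locus with two alleles; the genotype encoding (two-letter strings such as "Aa"), the function names, and the sample data are assumptions. The Hardy-Weinberg proportions shown are the expected genotype frequencies when none of the evolutionary processes listed above is acting.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the frequency of each allele across diploid genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def hardy_weinberg_expected(p):
    """Expected genotype frequencies given the frequency p of allele A."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# Hypothetical sample of four individuals at one locus.
sample = ["AA", "Aa", "Aa", "aa"]
freqs = allele_frequencies(sample)
expected = hardy_weinberg_expected(freqs["A"])
```

Comparing observed genotype frequencies against these expectations, locus by locus across the genome, is one simple way to look for signs of selection or population structure.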
Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteins are vital parts of living organisms, as they are the main components of the physiological metabolic pathways of cells. Proteomics, which grew out of the research and development of the Human Genome Project, is also an emerging field of scientific research that explores the proteome at the overall level of intracellular protein composition, structure, and characteristic activity patterns. It is an important component of functional genomics.
Basic bioinformatic tools are already accessed in certain clinical situations to aid in diagnosis and treatment plans. For example, PubMed is accessed freely for biomedical journals cited in Medline, and OMIM (Online Mendelian Inheritance in Man), a search tool for human genes and genetic disorders, is used by clinicians to obtain information on genetic disorders in the clinic or hospital setting. An example of the application of bioinformatics in new therapeutic advances is the development of novel designer targeted drugs such as imatinib mesylate (Gleevec), which interferes with the abnormal protein made in chronic myeloid leukaemia [17]. (Imatinib mesylate was synthesised at Novartis Pharmaceuticals by identifying a lead in a high throughput screen for tyrosine kinase inhibitors and optimising its activity for specific kinases.) The ability to identify and target specific genetic markers by using bioinformatic tools facilitated the discovery of this drug.
In the short term, as a result of the emerging bioinformatic analysis of the human genome project, more disease genes will be identified and new drug targets will be simultaneously discovered. Bioinformatics will serve to identify susceptibility genes and illuminate the pathogenic pathways involved in illness, and will therefore provide an opportunity for development of targeted therapy. Recently, potential targets in cancers were identified from gene expression profiles [18].
In the longer term, integrative bioinformatic analysis of genomic, pathological, and clinical data in clinical trials will reveal potential adverse drug reactions in individuals by use of simple genetic tests. Ultimately, pharmacogenomics (using genetic information to individualise drug treatment) is likely to bring about a new age of personalised medicine; patients will carry gene cards with their own unique genetic profile for certain drugs aimed at individualised therapy and targeted medicine free from side effects.
7 Popular Bioinformatics Careers
Opportunities for high-paying and rewarding bioinformatics careers are growing. According to the Bureau of Labor Statistics, jobs in computer-based analysis are projected to grow 15 percent by 2029 (nearly four times the national average), with the healthcare, pharmaceutical, and biotechnology fields leading the way. Other fields where bioinformatics skills play a critical role, such as microbiology or zoology and wildlife biology, are projected to keep pace with the national average.
But the prospects for bioinformatics careers extend beyond a handful of job titles.
“Bioinformatics is at the intersection of computer programming, big data, and biology,” says Stefan Kaluziak, an assistant professor of bioinformatics at Northeastern University. “Anything that happens in biology in the future will have some component of bioinformatics in it. Where you can go with a master of science in bioinformatics is almost limitless.”
Keep reading to learn the day-to-day responsibilities and most important job skills for seven common bioinformatics careers.
Key Skills and Job Responsibilities for Bioinformatics Careers
Generally, a bioinformatics specialist is responsible for managing and analyzing large sets of genomic data. The setting where that data analysis takes place can vary quite a bit, however.
- In the pharmaceutical and biotechnology industries, work typically involves exploring the human genome to detect how drugs may impact certain proteins within the body’s cells. Work often happens in a laboratory setting.
- In biology and zoology, the analysis focuses on animal and plant genomes and terrestrial data such as elevation and water availability, with an overall focus on environmental health and conservation. Professionals may work in a lab or out in the field.
- In academia, bioinformatics careers are related to experiments that may not have an immediate commercial benefit but could be used to drive future investigations in medical or biological research.
Typical job responsibilities for a bioinformatics career include the following:
- Oversee a laboratory information management system (LIMS).
- Design strategies for DNA, RNA, and protein sequence analysis.
- Develop algorithms to support next-generation sequencing.
- Conduct quantitative analysis of biological images.
- Evaluate drug candidates for their value as targeted therapies.
- Assist in developing more efficient methods of food production.
- Develop systems for analyzing terrain using remote sensor data.
- Create data visualization for use in reports.
Kaluziak said familiarity with Linux is a plus, especially the ability to manage clusters of Linux servers or nodes that are used to store and process data. Familiarity with programming languages such as Python, R, and Perl is also valuable, as these languages are commonly used with large data sets.
Top Bioinformatics Careers and Salaries
Salaries for bioinformatics careers vary by geography as well as industry. Roles in pharma and biotech tend to pay more than roles in healthcare or academia, for example. Some jobs at colleges and universities may also require a PhD, while many industry roles will only require a master’s in bioinformatics along with experience.
In addition, many companies post bioinformatics roles with similar job descriptions and salaries but slightly different titles. For example, a bioinformatics scientist and data scientist may have similar responsibilities, while a bioinformatician and a bioinformatics developer may have similar salaries.
Here’s a look at seven common bioinformatics careers, along with their salaries and considerations for anyone seeking these types of roles.
1. Bioinformatics Scientist
Average base salary: $95,967
A bioinformatics scientist’s key responsibility is to develop software applications and databases to analyze biological data. It is the role most closely associated with the data scientist in industries outside of biology and biotech.
Coordinating with other scientists and professionals within the organization is a valuable skill for bioinformatics scientists. Maintaining extensive records of experiments and analyses is also an important part of the job, as it allows for easy review of past work to identify future research priorities.
Other elements of the day-to-day job may differ depending on where a bioinformatics scientist works. At a large firm such as a pharmaceutical company, there may be standardized workflows for data collection and analysis as well as templates for creating reports, Kaluziak notes. On the other hand, at a startup or smaller firm, the role may involve running experiments as well as performing tasks such as supporting the IT department.
2. Research Scientist
Average base salary: $81,185
Most often found in an academic setting or research organization, the research scientist typically conducts research in a laboratory setting that includes both control and experimental groups. Experiments can be done only after a proposal is peer-reviewed and approved for its methodology and validity to the scientific field. A background in bioinformatics is helpful for interpreting the results of an experiment, such as data collected from tests conducted on living specimens, and for summarizing observations in scientific papers.
Since the research scientist often conducts work that is not directly or immediately applicable to the business world, grant-writing and fundraising are useful skills for this bioinformatics role.
3. Biostatistician
Average base salary: $76,936
A biostatistician focuses on statistical design and analysis as it relates to research work. This role is also responsible for preparing reports that summarize the analysis and refer to any pertinent research methodologies. The work environment may be a hospital or research lab, and areas of focus may be epidemiology, genetics, or ecology.
The ability to create easy-to-read reports is a critical skill for a biostatistician, as the analysis is often presented to clients or other external stakeholders in addition to internal teams. The role is typically part of a larger research team, so interpersonal management and collaboration are also valuable skills.
4. Microbiologist
Average base salary: $75,650
Microbiologists study organisms such as bacteria and viruses that cannot be seen with the naked eye. Increasingly, this role focuses on how these organisms interact with their environment and impact human and animal health. Microbiologists typically work in an academic setting, and an advanced degree is often required to conduct independent research or supervise students or other laboratory professionals.
A background in bioinformatics can help microbiologists interpret anomalies and other patterns in their research data, which in turn can be used to identify areas of additional research or discover new ways that an organism may impact our health. In addition, since publishing scientific articles or papers is an important part of a microbiologist’s work, understanding how to display data and analyze it in an approachable way is a key part of the role.
5. Bioinformatician
Average base salary: $71,027
Among the many bioinformatics careers, a bioinformatician is most closely associated with managing data itself. This role is responsible for managing large databases, developing data frameworks, and creating and modifying algorithms. Their analysis is typically used for classifying components of a biological system such as DNA sequences or documenting protein expression.
Bioinformaticians should be well-versed in programming statistical models, combining data sets, and maintaining data integrity and security. Individuals in this role often collaborate with scientists and researchers to interpret and present data sets.
6. Zoologist or Wildlife Biologist
Average base salary: $63,270
These scientists study animals that are visible to the naked eye, whether it’s a single species or a larger group of animals. Work is often done in the field to observe how animals interact with each other, their surrounding environment, and with human populations. Zoologists or wildlife biologists are often employed by public agencies, universities, or private firms focused on conservation or farm protection. Non-standard work hours are common.
Data collection and analysis play a big part in this work, whether gathering specimens, studying migration patterns, or examining satellite and climate data. Knowledge of bioinformatics supports zoologists and wildlife biologists in conducting more in-depth analyses of population models, genetics, or the impact of habitat change.
7. Molecular Biologist
Average base salary: $60,135
A molecular biologist looks at individual cells and molecules in plants, animals, and humans, paying particular attention to how they interact. Individuals in this role typically work at a hospital, university, laboratory, or government agency, though some may work in a corporate setting. This research is often used to diagnose and treat diseases at a molecular level, particularly those caused by inherited genetic mutations.
Bioinformatics skills are helpful to the molecular biologist because of the size of the human genome (more than 3 billion pairs of nucleotide bases) and the human proteome, or set of proteins in the human body (estimated at more than 20,000 proteins, each built from combinations of up to 20 different amino acids). In addition, the ability to analyze text, images, or sound files can help molecular biologists generate results from large sets of raw data.
Relation to other fields
Bioinformatics is a field of science that is similar to but distinct from biological computation, and it is often considered synonymous with computational biology. Biological computation uses bioengineering and biology to build biological computers, whereas bioinformatics uses computation to better understand biology. Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.
Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence, soft computing, data mining, image processing, and computer simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics.
Since the phage Φ-X174 was sequenced in 1977, the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Computer programs such as BLAST are used routinely to search sequences; as of 2008, these sequences came from more than 260,000 organisms and contained over 190 billion nucleotides.
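To give a feel for why sequence search has to be automated, here is a minimal Python sketch of a k-mer lookup, loosely in the spirit of the word-matching step that heuristic search tools rely on. It is not BLAST's actual algorithm: the function names, the toy database, and the choice of k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def seed_hits(query, index, k):
    """Count, per database sequence, how many k-mer matches the query produces."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _offset in index.get(query[i:i + k], []):
            hits[name] += 1
    return dict(hits)

# Hypothetical toy database; real databases hold billions of nucleotides.
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
}
index = build_kmer_index(database, k=4)
print(seed_hits("GCGTACGT", index, k=4))  # {'seqA': 5}
```

A real search tool would go on to extend and score such seed matches. The sketch only shows the indexing idea that makes searching large collections feasible.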
Before sequences can be analyzed, they have to be obtained from a data storage bank, for example GenBank. DNA sequencing is still a non-trivial problem, as the raw data may be noisy or afflicted by weak signals. Algorithms have been developed for base calling for the various experimental approaches to DNA sequencing.
Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to obtain complete gene or genome sequences. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research (TIGR) to sequence the first bacterial genome, Haemophilus influenzae) generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
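The idea of rebuilding a sequence from overlapping fragment ends can be sketched with a simple greedy merge. This is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats), not how production assemblers work. The function names and the minimum overlap of 3 are hypothetical choices.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the result stays in several contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Toy reads drawn from the made-up sequence ATGGCGTACGTTAGC
print(greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]))  # ['ATGGCGTACGTTAGC']
```

Every round compares all pairs of reads, so the cost grows quickly with the number of fragments. That hints at why assembling a large genome takes days of CPU time and why assembly algorithms remain an active research area.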
In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. This process needs to be automated because most genomes are too large to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes have recognisable start and stop regions, although the exact sequence found in these regions can vary between genes.
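As a rough illustration of how recognisable start and stop signals make automation possible, the sketch below scans the three forward reading frames for open reading frames that begin with ATG and end at a stop codon. It assumes the standard start and stop codons, ignores the reverse strand, and uses a hypothetical minimum length. Real gene finders such as those used in annotation pipelines are statistical and considerably more sophisticated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is the 0-based index of the ATG; end is one past the stop codon.
    min_codons counts codons before the stop and is an illustrative cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs

# Made-up example sequence
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```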
The first description of a comprehensive genome annotation system was published in 1995 by the team at The Institute for Genomic Research that performed the first complete sequencing and analysis of the genome of a free-living organism, the bacterium Haemophilus influenzae. Owen White designed and built a software system to identify the genes encoding all proteins, transfer RNAs, ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.
To pursue the goals that remained after the Human Genome Project closed in 2003, the National Human Genome Research Institute in the U.S. launched a new project. This ENCODE project is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays, technologies able to automatically generate large amounts of data at a dramatically reduced per-base cost but with the same accuracy (base call error) and fidelity (assembly error).
Gene function prediction
While genome annotation is primarily based on sequence similarity (and thus homology), other properties of sequences can be used to predict the function of genes. In fact, most gene function prediction methods focus on protein sequences as they are more informative and more feature-rich. For instance, the distribution of hydrophobic amino acids predicts transmembrane segments in proteins. However, protein function prediction can also use external information such as gene (or protein) expression data, protein structure, or protein-protein interactions.
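The link between hydrophobic residues and transmembrane segments can be sketched with a sliding-window hydropathy average. The scale values below are the Kyte-Doolittle hydropathy indices. The 19-residue window and the 1.6 cutoff are a commonly quoted rule of thumb, used here as assumptions rather than as a validated predictor. The function names and the example sequence are made up.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in protein.upper()]  # KeyError on non-standard residues
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= threshold]

# Made-up sequence: charged ends around a 19-residue hydrophobic stretch
example = "MKKDE" + "LLIVALLVFLAIGLLIVAL" + "RRKDE"
print(candidate_tm_windows(example))
```

Windows that overlap the hydrophobic stretch score highly, while a purely charged sequence yields no candidates. Dedicated prediction tools combine many more signals than this single average.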
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists by enabling researchers to:
- trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
- compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation,
- build complex computational population genetics models to predict the outcome of the system over time,
- track and share information on an increasingly large number of species and organisms.
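Measuring change in DNA, the first point in the list above, can be illustrated with two simple distance measures between aligned sequences: the proportion of differing sites (p-distance) and the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which accounts for multiple substitutions at the same site. The sketch assumes the sequences are already aligned, of equal length, and free of gaps. The example sequences are invented.

```python
import math

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined (infinite) once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

# Invented aligned sequences that differ at 2 of 10 sites
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, round(jukes_cantor(p), 4))  # 0.2 0.2326
```

Pairwise distances like these are one common input to tree-building methods. Model-based approaches use richer substitution models than the one sketched here.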
The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.
The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on the detection of sequence homology to assign sequences to protein families.
Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took root in bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic group: although initially applied to closely related strains of a species, it can be applied to a larger context such as a genus or phylum. It is divided into two parts: the core genome, the set of genes common to all the genomes under study (these are often housekeeping genes vital for survival), and the dispensable (or flexible) genome, the set of genes present in only one or some of the genomes under study. The bioinformatics tool BPGA can be used to characterize the pan genome of bacterial species.
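The core and dispensable parts of a pan genome map naturally onto set operations. The sketch below assumes that genes have already been clustered into families with shared identifiers, which is the hard part in practice. The strain names and gene lists are hypothetical, and this is not how BPGA itself is implemented.

```python
def pan_genome(genomes):
    """Split gene families into pan, core, and dispensable sets.

    genomes maps a strain name to an iterable of gene family identifiers.
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pan = set().union(*gene_sets)            # every gene seen in any genome
    core = set.intersection(*gene_sets)      # genes shared by all genomes
    dispensable = pan - core                 # genes missing from at least one genome
    return pan, core, dispensable

# Hypothetical strains and gene families
genomes = {
    "strain1": ["dnaA", "gyrB", "recA", "blaZ"],
    "strain2": ["dnaA", "gyrB", "recA", "mecA"],
    "strain3": ["dnaA", "gyrB", "recA"],
}
pan, core, dispensable = pan_genome(genomes)
print(sorted(core))         # ['dnaA', 'gyrB', 'recA']
print(sorted(dispensable))  # ['blaZ', 'mecA']
```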
Genetics of disease
With the advent of next-generation sequencing, we are obtaining enough sequence data to map the genes of complex diseases such as infertility, breast cancer, or Alzheimer’s disease. Genome-wide association studies are a useful approach to pinpoint the mutations responsible for such complex diseases. Through these studies, thousands of DNA variants have been identified that are associated with similar diseases and traits. Furthermore, the possibility of using genes in prognosis, diagnosis, or treatment is one of the most essential applications. Many studies discuss both promising ways to choose the genes to be used and the problems and pitfalls of using genes to predict disease presence or prognosis.
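At its simplest, the association test behind such studies asks whether an allele is more frequent in cases than in controls. The sketch below runs a chi-square test with one degree of freedom on a 2x2 table of allele counts for a single variant. The counts are invented, the function name is hypothetical, and a real study would repeat the test across many variants and correct for multiple testing and confounding.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 allele table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom the survival function reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented counts: the alternate allele is seen 60 times in cases, 40 in controls
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value)  # chi2 is 8.0; the p-value is below 0.01
```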
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again, the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise, and thus hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.
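The change-point idea can be sketched in its simplest form: given noisy measurements along a chromosome, find the single position that best splits them into two segments with different means. This least-squares version is a deliberately minimal stand-in for the hidden Markov model and change-point methods mentioned above. The function name and the log-ratio values are invented.

```python
def best_change_point(values):
    """Return (index, cost) for the split that minimises within-segment squared error.

    The index k means the segments are values[:k] and values[k:].
    """
    if len(values) < 2:
        raise ValueError("need at least two measurements")

    def squared_error(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_k, best_cost = None, float("inf")
    for k in range(1, len(values)):
        cost = squared_error(values[:k]) + squared_error(values[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Invented log-ratio values: a copy number gain starts at the sixth probe
log_ratios = [0.1, -0.1, 0.0, 0.05, -0.05, 0.9, 1.1, 1.0, 0.95, 1.05]
k, cost = best_change_point(log_ratios)
print(k)  # 5
```

Real copy number data contain many change points and heavier noise, which is why segmentation methods that handle multiple breakpoints are used in practice.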
Two important principles guide the bioinformatic analysis of cancer genomes when identifying mutations in the exome. First, cancer is a disease of accumulated somatic mutations in genes. Second, cancer contains driver mutations, which need to be distinguished from passenger mutations.
With the breakthroughs that next-generation sequencing technology is providing to the field of bioinformatics, cancer genomics could change drastically. These new methods and software allow bioinformaticians to sequence many cancer genomes quickly and affordably. This could create a more flexible process for classifying types of cancer by analysis of cancer-driving mutations in the genome. Furthermore, tracking patients as the disease progresses may become possible in the future by sequencing cancer samples.
Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.