Bioinformatics Frequently Asked Questions
Mail your questions to me, Damian Counsell, and I'll try to bring you
answers. Alternatively, mail your answers and I'll incorporate them.
Please note that I cannot answer individual specific queries---I
am not a careers adviser. I am, however, happy to tackle questions
of general interest to all visitors to the site.
I consider bioinformatics to be a special kind of engineering
discipline---it certainly isn't a "pure" science. It has been
enormously successful in its short existence and I
think its successes have been the result of a practical and rigorous
approach which I hope to encourage in anyone interested in entering
the field.
This document is not a scientific paper or textbook (yet). You
will find blunt opinions here. If you disagree with me about any of the
following please tell me. I hope to learn a lot from your inevitable and
welcome criticisms.
There is certainly one sense in which I consider myself a pure
scientist: I'm open to rational persuasion.
Overview
General
What is Bioinformatics?
Roughly, bioinformatics describes any use of computers to handle
biological information. In practice the definition used by most
people is narrower; bioinformatics to them is a synonym for
"computational molecular biology"---the use of computers to
characterise the molecular components of living things.
The Tight definition
"Classical" bioinformatics
Fredj Tekaia at the Institut Pasteur offers
this definition of bioinformatics:
"The mathematical, statistical and computing methods
that aim to solve biological problems using DNA and amino acid
sequences and related information."
I would say most biologists talk about "doing bioinformatics"
when they use computers to store,
retrieve, analyse or predict the
composition or the structure of biomolecules. As
computers become more powerful you could probably add
"simulate" to this list of bioinformatics verbs.
"Biomolecules" include your genetic material---nucleic acids---and
the products of your genes: proteins. These activities are
what I would refer to as "classical" bioinformatics, dealing
primarily with sequence analysis.
It is a mathematically interesting property of most large
biological molecules that they are polymers;
ordered chains of simpler molecular modules called
monomers. Think of them as beads or building blocks
which, despite having different colours and shapes, all have the
same thickness and the same way of connecting to one another. Each
monomer molecule is of the same general class, but each kind of
monomer has its own well-defined set of characteristics. Many
monomer molecules can be joined together to form a single, far
larger, macromolecule which has exquisitely
specific informational content and/or chemical properties.
According to this scheme, the monomers in a given macromolecule
of DNA or protein can be treated computationally as letters
of an alphabet, put together in pre-programmed arrangements
to carry messages or do work in a cell.
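To make the alphabet idea concrete, here is a minimal Python sketch that treats a sequence as a plain string, checks it against an alphabet and counts its monomers. The names `DNA_ALPHABET`, `PROTEIN_ALPHABET`, `is_valid` and `composition` are illustrative choices, and the example sequence is made up.

```python
from collections import Counter

# One letter per monomer: four nucleotides for DNA,
# twenty standard amino acids for proteins.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence, alphabet):
    """Return True if every letter of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def composition(sequence):
    """Count how often each monomer letter occurs in the sequence."""
    return Counter(sequence.upper())


if __name__ == "__main__":
    dna = "ATGGCCATTGTAATGGGCCGC"  # made-up example sequence
    print(is_valid(dna, DNA_ALPHABET))
    print(composition(dna))
```

Note that a string of A, C, G and T also passes the protein check, since those four letters are amino acid codes too. The alphabet alone cannot tell you what kind of molecule you are looking at.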
"New"
bioinformatics
The greatest achievement of bioinformatics methods, the Human Genome
Project, is currently being completed. Because of this the
nature and priorities of bioinformatics research and applications
are changing. People often talk portentously of our living in the
"post-genomic"
era. My personal view is that this will affect bioinformatics in
several ways:
- Now that we possess multiple whole genomes we can look for
differences and similarities between all the genes of multiple
species. From such studies we can draw particular conclusions
about species and general ones about evolution. This kind of
science is often referred to as comparative
genomics.
- There are now technologies designed to measure the relative
number of copies of a genetic message (levels of gene expression)
at different stages in development or disease or in different
tissues. Technologies such as DNA microarrays will grow in
importance.
- Other, more direct, large-scale ways of identifying gene
functions and associations (for example yeast
two-hybrid methods) will grow in significance and with them
the accompanying bioinformatics of functional
genomics.
- There will be a general shift in emphasis (of sequence
analysis especially) from genes themselves to gene
products. This will lead to:
- attempts to catalogue the activities and characterize
interactions between all gene products (in humans):
proteomics.
- attempts to crystallize and/or predict the structures of all
proteins (in humans): structural genomics.
- fewer DNA double-helices in bad sci-fi movies.
- What some people refer to as research or
medical informatics, the management of all
biomedical experimental data associated with particular molecules
or patients---from mass spectroscopy, to in vitro assays to
clinical side-effects---will move from the concern of those
working in drug company and hospital I.T. (information technology)
into the mainstream of cell and molecular biology and migrate from
the commercial and clinical to academic sectors.
This FAQ
concentrates on classical bioinformatics, but will, I hope, grow to
cover more of the "post-genomic" aspects of the field. It is worth
noting that all of the above non-classical areas of research depend
upon established sequence analysis techniques.
What is the difference between bioinformatics...
...and medical informatics or...
The definition provided by the Medical
Informatics FAQ (no relation) is:
"Biomedical Informatics is an emerging discipline that
has been defined as the study, invention, and implementation of
structures and algorithms to improve communication, understanding
and management of medical information." Aamir Zakaria,
the author of the FAQ, emphasises that medical informatics is more
concerned with structures and algorithms for the manipulation of
medical data, rather than with the data itself.
This suggests that one difference between the two fields lies
with their approaches to the data; there are bioinformaticists
interested in the theory behind the manipulation of that data
and there are bioinformatics scientists concerned with the
data itself and its biological implications. I think a good
bioinformatics researcher should be interested in both of these
aspects of the field.
Another distinction seems obvious to me. Medical informatics
generally deals with "gross" data, that is information from
super-cellular systems, right up to the population level, while
bioinformatics tends to be concerned with information about cellular
and biomolecular structures and systems.
On both of these points I'd be happy for any medical informatics
specialists to correct
me.
...computational biology?
Computational biologists might object (please
do), but I find that people use the term "computational
biology" when discussing that subset of bioinformatics (in the broad
sense) closest to the field of classical general biology.
Computational biologists interest themselves more in
evolutionary, population and theoretical biology than in cell
and molecular biomedicine. It is inevitable that molecular biology
is profoundly important in computational biology. I get the
impression that computational biologists have tended to prefer
statistical models for biological phenomena over physico-chemical
ones. This is often very wise on their part...
Overview of most common bioinformatics programs
Everyday bioinformatics is done with sequence search programs
like BLAST,
sequence analysis programs, like the EMBOSS and Staden packages,
structure prediction programs like THREADER
or PHD
or molecular imaging/modelling programs like RasMol and WHATIF.
Overview of most common bioinformatics technology
Currently, a lot of bioinformatics work is concerned with the
technology of databases. These databases include both "public"
repositories of gene data like GenBank
or the Protein DataBank (the
PDB), and private databases like those used by research groups
involved in gene mapping projects or those held by biotech
companies. Making such databases accessible via open standards like
the Web is very important since consumers of bioinformatics data use
a range of computer platforms: from the more powerful and forbidding
UNIX boxes favoured by the developers and curators to the far
friendlier Macs often found populating the labs of computer-wary
biologists.
Databases of existing sequencing data can be used to identify
homologues of new molecules that have been amplified and
sequenced in the lab. The property of sharing a common ancestor,
homology, can be a very powerful indicator in
bioinformatics (see below).
Acquisition of sequence data
Bioinformatics tools can be used to obtain sequences of genes or
proteins of interest, either from material obtained, labelled,
prepared and examined in electric fields by individual
researchers/groups or from repositories of sequences from previously
investigated material.
Analysis of data
Both types of sequence can then be analysed in many ways with
bioinformatics tools.
They can be assembled. Note that this is one of the
occasions when the meaning of a biological term differs markedly
from a computational one (see the amusing confusion
over the issue at Web-based geek forum Slashdot). Computer scientists,
banish from your mind any thought of assembly language. Sequencing
can only be performed for relatively short stretches of a
biomolecule and finished sequences are therefore prepared by
arranging overlapping "reads" of monomers (single beads on
a molecular chain) into a single continuous passage of "code".
This is the bioinformatic sense of assembly.
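As a rough illustration of assembly in this sense, the Python sketch below merges reads greedily by their longest suffix-to-prefix overlap. It is a toy, assuming error-free reads all taken from the same strand; real assemblers have to cope with sequencing errors, repeats and both strands. The function names and the `min_length` parameter are hypothetical.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: keep the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three made-up overlapping reads
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

Greedy merging is the simplest strategy to explain, not the best one: it can be fooled by repeated stretches of sequence, which is one reason assembly is a research topic in its own right.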
They can be mapped (see note)---that
is, their sequences can be parsed to find sites where so-called
"restriction enzymes" will cut them.
They can be compared, usually by aligning corresponding
segments and looking for matching and mismatching letters in their
sequences. Genes or proteins which are sufficiently similar are
likely to be related and are therefore said to be "homologous" to
each other---the whole truth is rather more complicated than this.
Such cousins are called "homologues".
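The comparison step can be sketched in Python as follows. The first function computes a Needleman-Wunsch global alignment score by dynamic programming; the second reports percent identity for two sequences that have already been aligned. The scoring values (match +1, mismatch -1, gap -2) are arbitrary illustrative defaults, not those of any particular program, and real protein comparisons use substitution matrices instead of a single match/mismatch score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score for the best global alignment of a and b."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + pair,  # align two letters
                               previous[j] + gap,       # gap in b
                               current[j - 1] + gap))   # gap in a
        previous = current
    return previous[-1]


def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns holding the same letter (gaps are '-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(percent_identity("ACGT", "AC-T"))
```

A high score or identity only tells you that two sequences are similar. Whether they are homologous is an inference from that similarity, which is why the caveat above matters.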
If a homologue (a related molecule) exists then a newly
discovered protein may be modelled---that is, the three-dimensional
structure of the gene product can be predicted without doing
laboratory experiments.
Bioinformatics is used in primer design. Primers are
short sequences needed to make many copies of (amplify) a piece of
DNA as used in PCR (the Polymerase
Chain Reaction).
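A very simplified view of the calculations behind primer design can be sketched in Python. The sketch picks the two ends of a template as primers, takes the reverse complement for the reverse primer, and estimates a melting temperature with the Wallace rule of thumb (2 degrees C per A or T, 4 degrees C per G or C), which is only a rough guide for short primers. The function names and the fixed-length strategy are assumptions made for illustration; real primer design tools also check for hairpins, primer dimers and specificity.

```python
# Translation table mapping each DNA base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(sequence):
    """Fraction of the bases that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def simple_primer_pair(template, length=20):
    """Forward primer from the start, reverse primer from the end."""
    forward = template[:length].upper()
    reverse = reverse_complement(template[-length:])
    return forward, reverse


if __name__ == "__main__":
    forward, reverse = simple_primer_pair("ATGCCCGGGTTTAAC", length=3)
    print(forward, reverse, wallace_tm(forward), gc_fraction(forward))
```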
Bioinformatics is used to attempt to predict the
function of actual gene products.
Information about the similarity, and, by implication, the
relatedness of proteins is used to trace the "family trees"
of different molecules through evolutionary time.
There are various other applications of computer analysis to
sequence data, but with so much raw data being generated by the
Human Genome Project and other initiatives in biology, computers are
presently essential for many biologists just to manage their
day-to-day results.
Molecular modelling / structural biology is a growing field which
can be considered part of bioinformatics. There are, for example,
tools which allow you (often via the Net) to make pretty good
predictions of the secondary structure of proteins arising
from a given amino acid sequence, often based on known "solved"
structures and other sequenced molecules acquired by structural
biologists.
Structural biologists use "bioinformatics" to handle the vast and
complex data from X-ray crystallography, nuclear magnetic resonance
(NMR) and electron microscopy investigations and create the 3-D
models of molecules that seem to be everywhere in the media.
note
Unfortunately the word "map" is used in several
different ways in biology/genetics/bioinformatics. The definition
given above is the one most frequently used in this context, but a
gene can be said to be "mapped" when its parent chromosome has been
identified, when its physical or genetic distance from other genes
is established and---less frequently---when the structure and
locations of its various coding components (its "exons") are
established.
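To make the restriction-site sense of "mapping" concrete, here is a minimal Python sketch that scans a DNA string for the recognition sequences of three well-known restriction enzymes. The dictionary name, the function name and the example sequence are illustrative; the sketch reports where each site starts and does not model the exact cut position within the site.

```python
# Recognition sequences for three well-known restriction enzymes
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, sites=RECOGNITION_SITES):
    """Map each enzyme name to the 0-based positions where its site starts."""
    sequence = sequence.upper()
    found = {}
    for enzyme, site in sites.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            start = sequence.find(site, start + 1)
        found[enzyme] = positions
    return found


if __name__ == "__main__":
    # Made-up sequence containing two EcoRI sites and one BamHI site
    print(find_sites("TTGAATTCAAGGATCCTTGAATTC"))
```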
The Loose definition
There are other fields---for example medical imaging / image
analysis which might be considered part of bioinformatics. There is
also a whole other discipline of biologically-inspired computation:
genetic algorithms, AI, neural networks. Often these areas interact in
strange ways. Neural networks, inspired by crude models of the
functioning of nerve cells in the brain, are used in a program
called PHD to predict, surprisingly accurately, the secondary
structures of proteins from their primary sequences.
What almost all bioinformatics has in common is the processing of
large amounts of biologically-derived information, whether DNA
sequences or breast X-rays.
How old is the discipline?
"How old is bioinformatics?" The answer to this one depends on
which source you choose to read.
From T K Attwood and D J Parry-Smith's "Introduction to
Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN
0582327881]:
"The term bioinformatics is used to encompass almost all
computer applications in biological sciences, but was originally
coined in the mid-1980s for the analysis of biological sequence
data."
From Mark S. Boguski's article in the "Trends Guide to
Bioinformatics" Elsevier, Trends Supplement 1998 p1:
"The term `bioinformatics' is a relatively recent invention,
not appearing in the literature until 1991 and then only in the
context of the emergence of electronic publishing...
"...However, some of my role models when I was a graduate
student (Margaret O. Dayhoff, Russell F. Doolittle, Walter M.
Fitch and Andrew D. McLachlan) had been building databases,
developing algorithms and making biological discoveries by
sequence analysis since the 1960s---long before anyone thought to
label this activity with a special term (if anything it was called
`molecular evolution'). Even a relatively new kid on the block,
the National Center for Biotechnology Information (NCBI), is
celebrating its 10th anniversary this year, having been written
into existence by US Congressman Claude Pepper and President
Ronald Reagan in 1988. So bioinformatics has, in fact, been in
existence for more than 30 years and is now middle-aged."
Resources
Can you recommend any bioinformatics books?
It's notoriously difficult to find any books on bioinformatics
itself that cater well for all of those coming from computing, from
mathematics and from biology backgrounds. The few textbooks
available in the field tend to be eyewateringly expensive as well.
I've divided suggested reading into books of general
interest, those best suited to people coming from a
computational/mathematical background and books for
biologists interested in bioinformatics. After my suggestions
are some links to other lists of bioinformatics books.
General introductions
Many people are curious about the Human Genome (Project). The
completion of the first draft probably represents bioinformatics'
coming of age as a discipline. The first couple of books are aimed
at the intelligent layperson.
A gossipy and insightful account of the race to sequence the
genome can be found in "The Sequence" by Kevin Davies
[Weidenfeld; ISBN 0297646982]. Matt Ridley's "Genome"
[Fourth Estate; ISBN 185702835X] is both an interesting layperson's
introduction to the issues raised by the bioinformatic revolution
and an overview of its biology and enormous scope. If I remember
rightly, Ridley's book received a slightly snooty review from Walter
Bodmer. This is understandable, since his and Robin McKie's
excellent "pre-genomic" guide to the Human Genome Mapping Project,
"The Book of Life" [Oxford Paperbacks; ISBN 0195114876] was
undeservedly in a remainders bin when I bought my copy a couple of
years ago.
If you are a non-biological scientist (or a non-scientist) and
are hooked by these, why not go back to the "real beginning" of the
race and read James Watson's entertaining and indiscreet memoir of
his and Francis Crick's determination of the structure of DNA,
"The Double Helix" [Penguin; ISBN 0140268774]---now
updated with an introduction by media don Steve Jones.
Nigel Barber at Peterborough Regional College in the UK
recommends Gary Zweiger's "Transducing the Genome" [McGraw-Hill
Professional Publishing: ISBN 0071369805]. The summary
at Amazon makes it sound a tad pretentious, but all the reviews seem
pretty positive so it might be worth a read.
Computational/Mathematical aspects
If you are a hardcore maths/computing person Michael Waterman's
"Introduction to Computational Biology" [Chapman &
Hall/CRC Statistics and Mathematics; ISBN 0412993910] and Pavel
Pevzner's "Computational Molecular Biology - An Algorithmic
Approach" [The MIT Press (A Bradford Book); ISBN 0262161974]
will give you all the discrete maths you can shake a stick at, but
perfunctory introductions to the biology.
Bioinformatics.org's very own Jeff Bizzarro recommends Dan
Gusfield's "Algorithms on Strings, Trees and Sequences"
[Cambridge, 1997 ISBN 0-52158-519-8], Richard Durbin, S. Eddy, A.
Krogh, G. Mitchison "Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids"
[Cambridge, 1997 ISBN 0-52162-971-3] (which I think is one of the
clearest and most comprehensive guides to alignment algorithms)
and---for that full "computers-to-biology conversion"--- Geoffrey M.
Cooper "The Cell: A Molecular Approach" [ASM Press,
1996 ISBN 0-87893-119-8]. Jeff Ames writes that a second edition of
this book is now available [Sinauer Associates, Incorporated, 2000
ISBN 0-87893-106-6] and that this version---if you can find it in
the shops---comes with a CD.
Applying bioinformatics to biological research
One outstanding new comprehensive text for the biologist is David
W. Mount's "Bioinformatics" [Cold Spring Harbor Press;
ISBN 0879696087]. It's not cheap, but it's the best I've seen if you
are studying bioinformatics itself.
If you're coming to the subject as a computer user with a
biological background, looking to exploit the many tools available,
you might want to try Terry Attwood and David Parry-Smith's
"Introduction to Bioinformatics" [Longman Higher
Education; ISBN 0582327881], or Des Higgins and Willie Taylor's
"Bioinformatics: Sequence Structure and Databanks"
[Oxford University Press; ISBN 0199637903]. Bioinformatics.org also
recommends Cynthia Gibas and Per Jambeck's "Developing
Bioinformatics Skills" [O'Reilly, 2001 ISBN 1-56592-664-1].
Stuart Brown recommends his own book "Bioinformatics: A
Biologist's Guide to Biocomputing and the Internet" [Eaton
Pub Co; ISBN: 188129918X]. If he sends me a review copy I might
recommend it too ;-) .
Further suggestions for this section are welcome.
Other lists of bioinformatics books
See also compbiology.org's
list and Steve
Brenner's list.
What bioinformatics sites are there?
Tutorials
A great place to start, whether you come from a biological,
physical or computational background is at Martin
Vingron's superb online bioinformatics tutorial. (Begin by
choosing a section from the left-hand-side menu bar.)
I recently stumbled upon a promising set of online
lecture notes currently under construction by B. Steipe at the
Genzentrum (Gene
Center) at the Ludwig-Maximilians-Universität
München (University of Munich).
Chemistry for all
A defiantly frames-free chemistry
tutorial site.
Mathematics for biologists
First of all, an almost completely
painless introduction to the horrors of the quadratic equation
by Peter Whalen, James Walker, and Drew Marticorena.
C. J. Schwarz of
the Department of Statistics and
Actuarial Science, Simon Fraser
University has produced a course in "Statistics for the Life
Sciences" which is accompanied by a set of sound, online
HTML handouts. They aren't the prettiest, but they're some of the
best. (Though his "paradigm of statistics" mnemonic "TRRGET" is
completely inconsistent with his explanation of what the letters
stand for... If anyone can enlighten
me I'd be pleased to know what I'm failing to understand.)
Here is a great guide
to a whole array of statistical learning/teaching resources prepared
by Juha Puranen of
the University of Helsinki (English).
Computers for biologists
Programming for biologists
General introduction to biology for
computer scientists
Estrella Mountain
Community College in the States offers this excellent short
introduction to biology (actually "The Nature of Science and
Biology"). It's a great place for keyboard jockeys to start their
journey to enlightenment.
Molecular
biology for computer scientists
The Institute of Arable Crop Research Beginner's
Guide to Molecular Biology
Protein chemistry for computer
scientists
Unilever Education Advanced Series tutorial
on proteins.
Cell
biology for computer scientists
The University of Arizona
has made available a high-quality tutorial in cell biology. Not only
does it cover the facts, but it also attempts to introduce some of
the philosophy of the field---recommended. Even better, it's also
available en
Español.
Once you've worked your way through that you might like to see
some scanning electron microscope images of
some of the structures you've read about taken by members of John
Heuser's lab.
Evolution for computer
scientists
Bob Patterson maintains his "Darwiniana" with
amazing diligence.
Practical bioinformatics
Other lists of bioinformatics
tutorials
Societies
Humberto Ortiz Zuazaga kindly introduced me to The International Society for Computational
Biology which he points out "has links to programs of study and
online courses in computational biology and to job postings".
Collections of Tools
I cannot recommend strongly enough the Human Genome Mapping
Project Resource Centre's "GenomeWeb".
Of historical interest only now, I guess, is the legendary "Pedro's
Molecular Biology Search and Analysis Tools".
Portals
CCP11 (Collaborative Computational Project 11) is another great
product of the UK's Genome Campus. To quote their Web site, it
was...
"...established to foster the broad bioinformatics
community and the UK research community in particular. Its purpose
is to facilitate the transfer of knowledge and expertise through
conferences, workshops, a newsletter and the use of the world wide
web. CCP11 is funded by the BBSRC and is hosted at the MRC
Human Genome Mapping Project Resource Centre HGMP-RC located on the
Wellcome Trust Genome Campus,
Cambridge."
Jennifer Steinbachs runs compbiology.org which is a
general computational biology site as well as being a portal to her
own work.
BioPlanet is well worth
visiting, though I have to say I have no idea who runs it or what
its precise status (commercial, personal, for-fun) as a Web site is.
Careers
How can I get involved?
If you want to get involved in bioinformatics, now is an exciting
time. I can honestly say this is one area of science where demand
for skilled practitioners (and salaries) can be very high.
This section is opinionated, partly because there are people in
the field, both computer scientists and biologists, who I would love
to provoke (or convert). If you are a newcomer, and especially if
you come from one of bioinformatics' component pure disciplines, I
hope my ranted warnings will help you to avoid the mistakes of your
predecessors---and I write as one of the mistaken. David
S. Roos put it well in his recent review
in the journal Science:
"Lack of familiarity with the intellectual questions
that motivate each side can also lead to misunderstandings. For
example, writing a computer program that assembles overlapping
expressed sequence tags (EST) sequences may be of great importance
to the biologist without breaking any new ground in computer
science. Similarly, proving that it is impossible to determine a
globally optimal phylogenetic tree under certain conditions may
constitute a significant finding in computer science, while being
of little practical use to the biologist."
How can I get involved?---I am a "newbie"
If you are a high school student / sixth former, think about
taking an interdisciplinary computational biology or bioinformatics
bachelor's degree of the sort offered at, for example, Manchester
University in the UK or UPenn in the States. Don't worry if you
can't find a place on such a course or there isn't one nearby;
perhaps the best way to approach this subject is from two sides. Do
a bachelor's degree in one area while taking a healthy interest in
the other---or (if you can afford to) complement a first degree in
one part of the discipline with a second degree in the other.
If you already have a degree in a biological discipline there are
similar Master's courses---both interdisciplinary (e.g.
Birkbeck's in London) and conversion type courses---for biologists
or others to learn computer science, for example.
If you are currently doing a computer science or biology PhD, try
to take advantage of the opportunity to take courses in the "other"
discipline.
How can I get involved?---I am a biologist
To a biologist I would say: take as many real computing
courses as you can. It's important not just to learn a programming
language, but also to learn the discipline of computing; to
structure and document your work in a rigorous way. What courses you
take might be directed by the kind of work you are interested in
doing when you graduate---whether you see yourself supporting
bioinformatics applications or building them. For the former you
need all-round familiarity with the programs themselves and the
hardware and software needed to run them---plus your existing
understanding of biology. For the latter you need to learn a
structured programming language and the principles of good program
design---plus the ability to talk to and understand biologists.
Courses biologists might consider taking:
- UNIX
-
Of all the computing courses available it is most important
that you have a proper introduction to the UNIX operating system.
Most current bioinformatics software (especially the free stuff)
runs on "open" platforms like UNIX and the Web. UNIX is elegant,
powerful and frustrating. Master it and you will save a lot of
time.
- Mathematics
-
Learn some maths. Basic statistics, logic/set theory and a
little calculus would be my recommendation. Many practising
biologists have little or no grasp of elementary concepts like
statistical significance, permutations and combinations and the
principles of good experimental design. Logic will come in handy
at the very least if you want to query databases in an intelligent
way.
- Programming
-
If you're interested in development, learn a real programming
language: Pascal, C(++), Java or Fortran.
Perl and HTML are the stuff that holds the Web together. A
grasp of these is essential for a lot of the Web/database work
being done by many bioinformaticists at the moment.
Good old BASIC can be very useful as an introduction to
programming or as a tool in its own right, but none of these
latter languages is built to crunch numbers and tackle real world
biological problems---which isn't to say people don't
try...
How can I get involved?---I am a
computational/quantitative scientist
One thing that I will emphasise repeatedly in this section is the
simple value of doing some "proper" biological laboratory science. I
have sat through talk after talk where a bioinformatics "scientist"
describes in great detail how his (it's usually "his") whizzy new
application of a trendy mathematical tool offers a supposed insight
into a (sometimes supposed) biological problem. Nine times out of
ten I know that this "solution" will never be so much as sneezed on
by a practising biologist.
Quantitative scientists talk about their interest in studying
some aspect of "God's mind". Biologists are interested in "Mother
Nature's body". If you want to win Nature over you are going to have
to meet her in the flesh. Working in isolation at the keyboard, you
are as likely to be useful to biologists as you are to
conceive with your clothes on. Desk-bound bioinformaticists
have written code that has turned out to be popular with
biologists, but almost always because they have collaborated with
biologists.
Courses quantitative scientists might consider taking:
- Molecular biology
-
"MoBi" was the bioinformatics of its day; desperately
fashionable, the province of new, higher-paid practitioners and
considered with slight suspicion by more traditional biologists.
It was once a great achievement to sequence a modest stretch of
DNA; now it's a job for robots. Today the technology is very
well established. Scientists can buy molecular biology kits to
perform the sort of genetic manipulations that would make your
parents' jaws drop. Some of the kits are so simple your
parents' parents could use them (with a modest amount of
training and supervision).
Despite the profusion of commercial kits, there is still a
requirement for real skill in molecular biology and the general
level of scientific understanding required to be a good biological
scientist---rather than just completing a practical
class---doesn't come easy. Living matter, the stuff you have to
work with, is unpredictable and responds slowly---except when it's
dying. Even supposedly fast-growing bacteria can take a long time
to yield up their secrets.
Even now, as the focus of biomedical research shifts from
molecular biology back to cell biology and protein biochemistry,
it's well worth offering yourself up as a volunteer for some
vacation work in a molecular biology lab. The term is now more
often used to refer to the technological tools it provides biology
in general rather than to fundamental research in the field
itself. Those tools are common to a vast array of different kinds
of research, from archaeology to zoology.
- Protein (bio)chemistry
-
Protein (bio)chemistry is experiencing a revival. Proteins are
still more delicate and fussy than nucleic acids. The same advice
that applies to molecular biology applies to protein biochemistry.
That stuff bioinformatics people refer to as "wet lab science" is
much harder than it looks.
You might find it more difficult to get access to a good
protein lab than a good molecular biology lab and do protein
science with real wizards, but the very least you can do is read
about the theoretical aspects of the subject.
For insights into the principles of protein structure, try,
for example, Carl Branden and John Tooze's "Introduction to
Protein Structure" [Garland ISBN 0-8153-2305-0]. Physicists in
particular might find the lack of general unifying principles in
this area overwhelming. Unfortunately there's no substitute for
acquiring a "feel" for the subject by examining a lot of
examples. Still the most critical stages in the successful
prediction of protein structure from sequence are those requiring
human intervention.
Thomas E. Creighton has been responsible for a range of
standard texts on protein chemistry. If you are working in a
protein lab you are likely to come across his "Protein Function :
A Practical Approach" [ISBN 019963615X] and the rather more
expensive and theoretical "Proteins : Structures and Molecular
Properties" [ISBN 071677030X].
- Evolutionary biology
-
It's a worn quote, but worth repeating:
"The mechanisms that bring evolution about certainly need
study and clarification. There are no alternatives to evolution
as history that can withstand critical examination. Yet we are
constantly learning new and important facts about evolutionary
mechanisms. Nothing in biology makes sense except in the
light of evolution."
Theodosius Dobzhansky in "American Biology Teacher" vol.35
Darwin's theory is one of the simplest and most misunderstood
in science. Start with a good layperson's introduction, Richard
Dawkins's "The Selfish Gene" (and remember: it's a
metaphor, stupid) or Steve Jones' paraphrasing of
Darwin's original "The Origin of Species", "Almost Like a
Whale". All biologists agree on the underlying principles, but
they are nearly ready to kill one another over the details. After
reading a decent book on evolutionary biology you should have at
least a handful of good questions. Now you are ready to take a
class in the subject. Take your questions with you. You'll
probably start an argument---or a fight.
You might also like to peruse Cynthia Gibas's answers
to similar questions from computational scientists on the O'Reilly Web site.
These damned biologists are making me use Word instead of LaTeX
to write up---what can I do?
Try this.
More
general advice
Use the software
Get access to an installation of EMBOSS and/or Staden and get
someone to lead you through the tools available. RasMol is a simple,
but powerful and elegant molecular imaging program which can teach
you a great deal about biological macromolecules; try a tutorial.
Get out on the Web and do some productive surfing for a
change :-) . The best starting point is the Human Genome Mapping
Project Resource Centre's "GenomeWeb". There's
so much stuff out there -- and most of it is free to
academics.
Where can I study
Bioinformatics?
I am gradually building this section up. Its focus is on
complete, full-time degree programmes rather than on individual
study modules. Curating a list of the latter would be a full-time
job. You can go to other places, however, if you are looking for
short courses. Thanks to various contributors,
including Wentian Li, who pointed me to this list at
Rockefeller, which is mirrored at various other sites, and
Humberto Ortiz Zuazaga, who mailed me a link to the ICSB, where you
can find this list. In
the UK the wonderful CCP11 project maintains
(among many other resources) lists of (mainly) British Masters
and PhDs
in bioinformatics. If you have any suggestions or updates please contact
me with them. You can publicize your course and offer a public
service at the same time.
Africa
South African National Bioinformatics Institute (SANBI) Honours
Bioinformatics Course at the University of the Western Cape.
Next year the same institute will be offering a Master's in
bioinformatics---thanks to Cathal Seoighe.
If you know of any other bioinformatics courses on the African
continent please feel free to mail
me about them.
The Americas
Canada
The University of
Waterloo, Department of
Computer Science offers undergraduate
and graduate
courses in bioinformatics. More information is here.
California
In apparent contradiction to the URL, the Keck Graduate Institute claims that
computational
biology is a core element of the curriculum in its Master of
Bioscience degree.
Stanford University M.S./PhD. in
BioMedical Informatics
Thanks to Momchil Georgiev for the information that the University of California at San
Diego offers a Bioinformatics
graduate programme.
University of California,
Irvine Informatics in
Biology and Medicine
David Delong wrote to me to point out that the College of Natural and Agricultural
Sciences at the University of
California, Riverside is developing a "Center in
Genomics and Bioinformatics" which will offer a PhD curriculum
in genomics and bioinformatics from academic year 2001-2002 onwards.
Catherine Velazquez says that the University of California, Santa Cruz
will start
a new undergraduate BS course in
bioinformatics
in the fall of 2001. They have also made public their proposal for
an MS
in Bioinformatics.
Georgia
Georgia Institute of
Technology Masters
of Science in Bioinformatics
Iowa
Iowa State University
offers an Interdisciplinary Ph.D. Program in Bioinformatics
and Computational Biology (BCB).
Maine
The Jackson Lab, a world centre
of mouse genome informatics, offers a graduate
training program.
Massachusetts
Boston University and Northeastern University offer a graduate programme in
bioinformatics.
Mexico
At the National Autonomous University of Mexico a doctoral
program in biomedical sciences is available. Their Computational
Molecular Biology Group is here.
Minnesota
The University of Minnesota
offers a graduate programme in bioinformatics.
New York State
Rochester Institute of
Technology Bachelor's
and Masters of Science in Bioinformatics
If you know of any other bioinformatics courses on the American
continent please feel free to mail
me about them.
North Carolina
The North Carolina State
University Genomic Sciences
program offers Masters and PhDs in
Bioinformatics.
Virginia
The Virginia State University's
Bioinformatics Institute offers
graduate
options in Bioinformatics.
Asia
India
According to Rahul Agrawal, the Indian Institute of Technology
Delhi, New Delhi provides courses in Biochemical
Engineering and Biotechnology. He adds that another branch of
the Institute, IIT
Kharagpur also provides various courses
in this area.
There is an Advanced
(Graduate) Diploma in Bioinformatics in the Bioinformatics Centre at the Jawaharlal Nehru University.
Madurai Kamaraj University
in Madurai, India claims to have been the first in the country to
initiate a bioinformatics programme and advanced diploma in
bioinformatics at its School of
Biotechnology.
The University of Pune, Maharashtra offers an Advanced Diploma in
Bioinformatics at the Bioinformatics Centre, India.
Singapore
The Bioinformatics
Centre of the National
University of Singapore offers Undergraduate and
PhD programmes in conjunction with the life sciences departments
and research institutions at NUS.
If you know of any other bioinformatics courses in Asia please
feel free to mail
me about them.
Australasia
Australia
As of 2001 Flinders
University in Adelaide offers a Bachelor
of Science in Bioinformatics.
The Research School
of Biological Sciences, at the Australian National University in
Canberra offers PhD., MSc. and
Honours programs in Bioinformatics.
The University of New South
Wales in Sydney offers an undergraduate
program in Bioinformatics.
The Biochemistry Department of La Trobe University in
Victoria also offers an undergraduate
course in Bioinformatics.
If you know of any other bioinformatics courses in Australasia
please feel free to mail
me about them.
Europe
Belgium
The Department of
Engineering at the Katholieke Universiteit
Leuven offers a Master of
Bioinformatics degree.
Denmark
The Technical University of
Denmark, Center for Biological
Sequence Analysis offers MSc.-level and
PhD.-level
courses in bioinformatics.
Finland
The Finnish Graduate
School in Computational Biology, Bioinformatics, and Biometry or
"ComBi"
is a joint venture of the University of Helsinki (English), the University of Turku (English) and the University of Tampere (English).
Eire
(Ireland)
According to James O. McInerney there will be details of the
National University of Ireland's undergraduate degree course in
bioinformatics and computational biology here. He
also organizes a bioinformatics summer school.
Germany
The Technische
Fakultät (Faculty of
Technology) at Universität Bielefeld (Bielefeld
University), offers a graduate programme in Bioinformatik
(bioinformatics).
The Universität
Tübingen (University
of Tübingen) also offers Bioinformatik.
Here are their own Frequently
Asked Questions (in German only) about studying bioinformatics
there.
Sweden
Bjorn Olsson writes that, as well as a 4-year Master's Degree in
Bioinformatics, the University of
Skövde offers a number of short courses and allows computer
science master's students to include bioinformatics in their degree.
There is more information here.
Apart from this, adds Daniel Nilsson, there is only one other
"pure" bioinformatics course in Sweden: the MSc in Bioinformatics
Engineering in Uppsala. There are also opportunities to study
bioinformatics on the "normal" biotech courses in Gothenburg,
Linköping
and Umeå. In
Gothenburg, the School of
Mathematical and Computing Sciences at Chalmers offers an MSc.
programme in bioinformatics---thanks to Samuel Hargestam.
United Kingdom
In the UK, there are only two dedicated undergraduate courses in
bioinformatics---one
at the University of Birmingham
and another
at UMIST. A major problem
is the desperate skills shortage in the area. Experts in the field
can earn considerably more in high-status commercial or government
research jobs than in universities---without having to dedicate time
to teaching. Bioinformatics is the ideal postgraduate scientific
subject, best suited to those who are already trained in one of its
constituent disciplines.
Two pioneering university institutions are Birkbeck College in the University
of London, a British centre with a proud tradition in educating
working and/or mature students to the highest academic standards and
a superb X-ray crystallography group, and York University, whose Department
of Biology offers Masters
courses and PhDs
in both computational biology and biomolecular science. Other
universities have bioinformatics groups actively involved in the
teaching of their biology/molecular biology undergraduate courses,
including, for example, courses at Leeds University where there are
also MRes
studentships available. Manchester University also teaches bioinformatics to
its undergraduates as well as offering a taught MSc.
course in the subject. University College London (UCL) also
offers a final year undergraduate course: "Bioinformatics: Genes,
Proteins and Computers".
Imperial College recently
displaced Oxford (at least temporarily)
from second place in various "charts" of the "best" universities in
the UK. [Disclaimer: I was a graduate student at Imperial and teach
on two graduate courses there.] From next year the Department of
Biochemistry at Imperial is offering a new MSc
in Computational Genetics and Bioinformatics. (Oxford itself
hasn't yet deigned to recognize the field with a degree course.
[Disclaimer: I was an undergraduate there.])
Thank you to David
Parkinson for pointing out to me that for the past two years
Sheffield Hallam University has offered an MSc/PGDip
in Bioinformatics at its Graduate School
in Science, Engineering and Technology.
Other UK Bioinformatics courses include:
University of Exeter MSc/MRes in
Bioinformatics.
University of Liverpool M.Sc.,
Postgraduate Diploma and Postgraduate Certificate in Biosystems
& Informatics
University of
Nottingham Master
of Philosophy in Molecular Biology with Bioinformatics
In April 2002 City
University's Bioinformatics
group is moving---along with its PhDs---to the University of Glasgow Department of Computer Science.
Thanks to Will Bachelor for alerting me to the existence of
this group.
If you know of any other bioinformatics courses in Europe please
feel free to mail
me about them.
Where can I find Bioinformatics
jobs?
Start with the appointments / careers sections of the major
scientific journals, or, better, search their Web jobs pages with
"bioinformatics":
Appropriately for a Web-dependent discipline, there are a variety
of specialist commercial Web sites which carry bioinformatics jobs:
There are also a number of companies actively recruiting in the
area:
Practical tips
This section includes some simple rules-of-thumb to apply when
performing common bioinformatics tasks. I try to give a reference to
a more detailed source of guidance where I know of one.
How do I find a sequence?
The most common task in bioinformatics must be the acquisition of
some bioinformatics data on which to operate. Usually this is in the
form of a nucleic acid or protein sequence, stored as characters in
the appropriate alphabet together with a header of related
information: for example some kind of unique identifying number, the
species from which the original biological substrate was obtained,
the names of any authors who published the sequence and so on.
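As a rough illustration of "characters plus a header", here is a minimal Python sketch that reads records in FASTA format. FASTA is my choice for the example (it is one common layout, not the only one), and the record itself, including its identifier, species and residues, is made up.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one string of letters.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record: identifier, species and residues are invented.
EXAMPLE = """>demo0001 Example species, hypothetical protein
MKVAIL
GSGK
"""

records = list(parse_fasta(EXAMPLE))
for header, sequence in records:
    print(header, len(sequence))
```

Real database entries carry far richer annotation than a single header line, so treat this as a picture of the idea rather than a parser you should rely on.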
You may have already generated your own sequence data
experimentally. In this case you are likely to want to find
sequences which are identical or similar (and therefore possibly
related) to yours. The task is then one of similarity
search.
...I have
a description.
A paradoxical problem generated by the success of the
bioinformatics revolution is the increasing difficulty of navigating
the huge amount of data available. Once you could print out most of
the existing sequence databases onto paper and cram them into a
single binder. Now a search for "actin" alone will pull out hundreds
and hundreds of sequences. The key to finding what you want is to
develop your own discriminatory skills rather than rely on computers
to figure out what it is you're really after.
Use PubMed
Make sure you are clear about your aim first. If you are looking
for a sequence for a specific scientific purpose then you might do
best to start with a relevant human-generated publication. For
example, if you have cloned a gene which is part of a
well-characterised biochemical pathway and you want to find
sequences of the same functional gene product in other species
(orthologues), PubMed is your
friend. [XXXX CONTINUE DETAILED ADVICE HERE]
Use Swiss Prot
[XXXX INSERT DETAILED ADVICE HERE]
Use Boolean logic
[XXXX INSERT DETAILED ADVICE HERE]
Use cunning
[XXXX INSERT DETAILED ADVICE HERE]
...I have an
accession number.
[XXXX INSERT DETAILED SEQUENCE ADVICE HERE]
...I have
another sequence.
This section will be expanded---and there will be a more basic
and detailed explanation for novice searchers, but, in the meantime,
here are the top tips cribbed from the excellent paper
by Hugh B. Nicholas Jr., David W Deerfield II and Alexander J.
Ropelewski in BioTechniques.
- Use a local favourite program on the Web server of your
choice.
- Use at least two and preferably three similarity tables.
- If using Smith-Waterman or FASTA algorithms ensure that the
gap opening penalty is high enough.
- If the initial search finds no or insufficient matches repeat
it with a highly diverged matrix and/or with a
Smith-Waterman-based server.
- If this doesn't work try switching from a PAM matrix to a
BLOSUM matrix.
...I'm not sure
whether or not to use the defaults.
Hugh, David and Alexander again on when not to use the default
search parameters provided by a server.
- ...when the homologues you are looking for to match your query
are highly diverged.
- ...when the query or matches are short.
- ...when you are only interested in a specific (in the sense of
"species") subset of database matches with a particular
evolutionary relationship to your sequence of interest---a
relationship not implied by the default settings.
How can I align two
sequences?
This section will also be expanded for newbies; until then, here
are Hugh, David and Alexander's tips for alignment:
- Use an appropriately divergent matrix (I'll be adding a table
soon to explain this).
- Reduce your gap penalty relative to that you used for your
database search.
- Use the MaxSegs/Waterman-Eggert version of the dynamic
programming algorithm to provide the best local alignment and also
to search for repeats.
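To make the gap penalty advice concrete, here is a minimal Python sketch of local alignment scoring by dynamic programming. It is plain Smith-Waterman, not the MaxSegs/Waterman-Eggert variant recommended above. It also simplifies in two ways that are my assumptions: a flat match/mismatch score stands in for a real similarity table, and a single linear gap penalty stands in for separate opening and extension penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair   # align a[i-1] with b[j-1]
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            # A local alignment may restart anywhere, so never drop below 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


# Same pair of sequences, scored with a cheap and an expensive gap.
cheap_gap = smith_waterman_score("VCGMD", "VCGD", gap=-1)
costly_gap = smith_waterman_score("VCGMD", "VCGD", gap=-10)
print(cheap_gap, costly_gap)
```

With the cheap gap the best local alignment can bridge the missing residue and pick up the final D; with the expensive gap it stops at the ungapped VCG. That is the trade-off behind using a stiffer penalty for searching and a gentler one for the alignment itself.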
How can I
predict the function of a gene (product)?
[XXXX INSERT FUNCTION PREDICTION ADVICE HERE]
How can I
predict the structure of a sequence?
[XXXX INSERT STRUCTURE PREDICTION ADVICE HERE]
How can I write up?
Go here to
download some detailed advice. Go here for
more links.
Glossary of bioinformatics
terms
Here I attempt to define some common terms in bioinformatics. I
have tried to balance clarity, brevity and rigour. Let me know if I
let one of these priorities over-ride the others.
What is an
alignment?
When two symbolic representations of DNA or protein sequences are
arranged next to one another so that their most similar elements are
juxtaposed they are said to be aligned. Many
bioinformatics tasks depend upon successful alignments. Alignments
are conventionally shown as traces.
In a symbolic sequence each base or residue monomer in each
sequence is represented by a letter. The convention is to print the
single-letter codes for the constituent monomers in order in a fixed
font (from the N-most to C-most end of the protein sequence in
question or from 5' to 3' of a nucleic acid molecule). This is based
on the assumption that the combined monomers are evenly spaced along the
single dimension of the molecule's primary structure. From now on I
shall refer to an alignment of two protein sequences.
Every element in a trace is either a match or a
gap. Where a residue in one of two aligned
sequences is identical to its counterpart in the other the
corresponding amino-acid letter codes in the two sequences are
vertically aligned in the trace: a match. When a residue in one
sequence seems to have been deleted since the assumed divergence of
the sequence from its counterpart, its "absence" is labelled by a
dash in the derived sequence. When a residue appears to have been
inserted to produce a longer sequence a dash appears opposite in the
unaugmented sequence. Since these dashes represent "gaps" in one
or other sequence, the action of inserting such spacers is known as
gapping.
A deletion in one sequence is symmetric with an insertion in the
other. When one sequence is gapped relative to another a deletion in
sequence a can be seen as an insertion in sequence b.
Indeed, the two types of mutation are referred to together as
indels. If we imagine that at some point one of the
sequences was identical to its primitive homologue, then a trace can
represent the three ways divergence could occur (at that point).
Biological interpretation of an alignment
A trace can represent a substitution:
AKVAIL
AKIAIL
A trace can represent a deletion:
VCGMD
VCG-D
A trace can represent an insertion:
GS-K
GSGK
For obvious reasons I do not represent a silent mutation.
Traces may represent recent genetic changes which obscure older
changes. Here I have only represented point mutations for
simplicity. Actual mutations often insert or delete several
residues.
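The three interpretations above can be sketched in a few lines of Python. The function name and labels are mine, and the sketch assumes the first row is read as the earlier form of the sequence and the second as the derived one. Swap the rows and, by the symmetry described above, every deletion becomes an insertion and vice versa.

```python
from typing import List


def classify_trace(first: str, second: str) -> List[str]:
    """Label each column of a two-row trace, reading `first` as the earlier form."""
    if len(first) != len(second):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for x, y in zip(first, second):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-":
            labels.append("insertion")      # residue only in the derived row
        elif y == "-":
            labels.append("deletion")       # residue lost from the derived row
        elif x == y:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


# The three traces from the text above.
print(classify_trace("AKVAIL", "AKIAIL"))
print(classify_trace("VCGMD", "VCG-D"))
print(classify_trace("GS-K", "GSGK"))
```

Like the traces themselves, this only describes what the alignment shows; it cannot tell you whether an apparent substitution hides several older changes at the same position.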
What is a DNA
array?
[INSERT FULL DEFINITION HERE.]
What is a
homologue?
[INSERT FULL DEFINITION HERE.]
What
is a scoring matrix?
[INSERT FULL DEFINITION HERE.]
Acknowledgements
Questions
Thanks to the following people for questions:
- Jonathan Després
- Salma B. Rafi
- "Ritu"
- Michael Wentzel
Links
Thanks to the following people for corrections, links and
sources:
- Anuradha Acharya
- Rahul Agrawal
- Jeff Ames
- Will Bachelor
- Justin Baker
- Nigel Barber
- David Delong
- Steffen Durinck
- Momchil Georgiev
- Samuel Hargestam
- Jim
- Darren Lee
- Wentian Li
- Steve Masticola
- James McInerney
- Markus Montigel
- Daniel Nilsson
- Bjorn Olsson
- David Parkinson
- G. Deepak Reddy
- John Rowland
- Vishal Rupani
- Cathal Seoighe
- Jennifer Steinbachs
- James Thompson
- Junaid A. Mehta
- Humberto Ortiz Zuazaga
- Catherine Velazquez
- Kathy Wiederin
- Zuthur Yew
- Michael Zuker
Answers
Thanks to the following people for suggesting answers:
- Paul Boardman
- Sangeeta Sawant
- Fredj Tekaia
Small Print:
Author and
licensing
This resource is maintained by and © Damian Counsell, UK Medical
Research Council Human Genome Mapping Project Resource Centre (the
HGMP-RC) 1998-2002. It is made available under a modified version of
the Open Publication
Licence. It is currently mirrored at The
Bioinformatics Resource and at eBioinfogen.
This resource has also been mirrored, without credit or any
attempt to link to the Open Content Licence, at the so-called "National Bioinformatics
Institute". If you are thinking of handing over money for their
"certification" you can draw your own conclusions about their
standing from this fact.
The first version of this resource was prepared when I was
responsible for bioinformatics in the Section for Cell and Molecular
Biology at the Institute of
Cancer Research (the ICR) in London.
I am now a bioinformatics specialist at the HGMP-RC, part of the
Proteomics
Group and am supported by the Medical Research Council. This page
does not represent their views, but I will happily read
your criticisms. Although I may act on your advice I take no
responsibility for anything that might happen if you browse here.
Version control information
$Revision: 1.69 $ $Date: 2001/12/02 21:24:51 $ $Author: counsell $