Clustering and Classification methods for Biologists



Introduction and organisation


Fair use of these pages

You are free to use these pages for educational purposes. If you wish, you can make a local copy (and avoid the advertisements) by downloading the course as a 4584Kb zip file. However, it would be polite to acknowledge the original source and author. If you do decide to use them, please let me know. You can make changes to the content, but please flag these changes in your text. You can send comments directly to me or leave a comment (see the links in the horizontal menu bar above).



Intended Learning Outcomes

After completing this unit you should be able to:

 

It is assumed that you have some prior understanding of the basics of statistical analyses and interpretation.



Unit organisation

This unit is based on material developed for Masters students from the School of Biology, Chemistry and Health Science at Manchester Metropolitan University. Hopefully, it will also be a useful resource for other postgraduate, and final year undergraduate, students.

The unit examines the background to, and applications of, a range of clustering and classification techniques across biology.


Studying online

Studying online is not the same as being in a classroom. Although there are some similarities, there are important differences in how you should approach your work. The following sections provide some brief advice and links to more detailed information.

When to study?

The 24/7 availability of online material is an obvious advantage, since it means that you can study when and, within reason, where you like. However, you must find the time to study, and you are likely to be more successful if you set times, in advance, that are suitable for you. Do not hope that some time will 'turn up': almost certainly other things will be more urgent or attractive! It is generally advisable to set aside reasonable blocks of time, which means periods of between 30 minutes and two hours. Shorter or longer periods are likely to be less effective. Try to ensure that your family and friends recognise what you are doing and do not disturb you. After all, you would not expect a friend to wander into a lecture to chat with you.

Communication

Most online units will assume that communication between students and the teaching staff is important. Although you could work completely independently, it is not generally advisable. Instead, you should make use of all of the available communication tools, such as discussion boards and email. Indeed, many units use evidence of effective electronic communication as one of the assessment elements.

Using electronic communication methods whilst studying online is different from chatting to your friends through a chat room such as MSN. It is important that you recognise and obey the 'rules' of online communication. These rules are generally referred to as 'netiquette', and more information can be found from these links: the Core Rules of Netiquette from the book Netiquette by Virginia Shea (available online), the Wikipedia Netiquette entry and a State of Victoria (Dept. Education and Training) document.

It is also important that you organise your ideas before sending them off to others. You need to ensure that your communications are clear, structured and relevant to the topic. Do not make unsubstantiated claims or statements: provide evidence to support them or, if seeking clarification, ask specific questions. A comment such as 'I don't understand any of this' is very difficult to respond to. Instead, be specific and begin with the first component that you are having difficulty with.

Studying

As teachers we will have failed if all that you can do at the end of a unit is to remember a list of names, dates, definitions, etc. Learning is much more than remembering. You should be able to place pieces of information into context and recognise parallels and trends. This is achieved by developing critical thinking skills.

One of the best known and most widely accepted models of types of learning is Bloom's taxonomy, or Bloom's classification of cognitive skills. Although you are more likely to come across references to Bloom's taxonomy in educational research materials, the concepts can help you to recognise the different levels of learning. You should be aiming to achieve the higher skills, particularly those that combine comprehension and application, and the final three, which could be classed as problem solving skills.

Classification of Cognitive Skills (based on information from http://web.bsu.edu/IRAA/AA/WB/chapter2.htm)

| Category | Definition | Related Behaviours |
|---|---|---|
| Knowledge | Recalling or remembering without necessarily understanding, using, or changing it | define, describe, identify, label, list, match, memorize, point to, recall, select, state |
| Comprehension | Understanding something that has been communicated without necessarily relating it to anything else | alter, account for, annotate, calculate, change, convert, group, explain, generalize, give examples, infer, interpret, paraphrase, predict, review, summarize, translate |
| Application | Using a general concept to solve problems in a particular situation; using learned material in new and concrete situations | apply, adopt, collect, construct, demonstrate, discover, illustrate, interview, make use of, manipulate, relate, show, solve, use |
| Analysis | Breaking something down into its component parts to focus on identification of parts or analysis of relationships between parts, or recognition of organizational principles | analyze, compare, contrast, diagram, differentiate, dissect, distinguish, identify, illustrate, infer, outline, point out, select, separate, sort, subdivide |
| Synthesis | Creating something new by putting parts of different ideas together to make a whole | blend, build, change, combine, compile, compose, conceive, create, design, formulate, generate, hypothesize, plan, predict, produce, reorder, revise, tell, write |
| Evaluation | Judging the value of material or methods as they might be applied in a particular situation; judging with the use of definite criteria | accept, appraise, assess, arbitrate, award, choose, conclude, criticize, defend, evaluate, grade, judge, prioritize, recommend, referee, reject, select, support |

Finding information: separating the good from the bad.

Although it is very easy to search the web for information, there are two important skills that will vastly improve the quality of your search.
1. Refining your search criteria to exclude irrelevant material. You can do this if you learn how to search efficiently. Two useful sources for online searching are: UC Berkeley - Teaching Library Internet Workshops and Search Engine Tutorials.
2. Only using reliable sources. As a general guideline, material that comes from government and higher education sources will be more reliable, but beware that you may be reading a piece of coursework submitted by a student. The most reliable sources are peer-reviewed academic journals. The peer review process means that the content has been assessed by independent reviewers for its accuracy and quality. One way of helping to ensure the quality of the material is to use a search engine such as Google Scholar rather than Google. You can also find peer-reviewed, open-access material by searching the Directory of Open Access Journals.

Using Information

Having found some information, you may want to use it in one of your reports or assessments. This is fine as long as you follow certain rules about ownership. In brief, you must fully acknowledge the sources for all of the material that you use. You must not engage in plagiarism! If you do not know what this is, we advise looking at the Turnitin site; Turnitin is the resource used by many universities to detect plagiarism in student work.

The Internet Detective is a novel web resource that will help you to get the best out of your web searches.



Topics

The material is arranged in six blocks.

The following two documents provide a broad context for the unit.

The analyses are not based on a particular statistical program. Examples are presented for a range of commercial and free packages. The data sets are provided in a range of formats that should enable them to be used with most packages.



Suggested bibliography

My book, "Cluster and Classification Techniques for the Biosciences" (ISBN 0-521-61800-2), is due to be published by Cambridge University Press in late 2006. It provides more detailed and broader coverage than is possible in these pages.

  1. Chatfield, C. and Collins, A. J. 1980. Introduction to multivariate analysis. Science Paperbacks.
  2. Field, A. 2000. Discovering Statistics using SPSS for Windows. Sage Publications, London. - an excellent comprehensive text about a wide range of 'difficult analyses.'
  3. Flury, B. and Riedwyl, H. 1988. Multivariate statistics: a practical approach. Chapman and Hall.
  4. Jongman, R. H. et al. 1995. Data analysis in community and landscape ecology. Pudoc Wageningen.
  5. Kinnear, P. R. and Gray, C. D. 2000. SPSS for Windows made simple. Psychology Press, Andover - £14.95 - an excellent and very clear book. (www.psypress.co.uk)
  6. Legendre, P. and Legendre, L. 1998. Numerical Ecology (2nd English Edition). Elsevier, Amsterdam.
  7. Tabachnick, B. G. and Fidell, L. S. 1996. Using multivariate statistics. 3rd edition. Harper.

Do not treat this list as comprehensive. It is wise to search out other texts that you may find more suitable to your needs.



General web resources

The following Web sites contain links to free or shareware software, most of which are relevant to multivariate analyses.

  1. The makers of STATISTICA (a commercial software package) have a very useful set of notes about many statistical methods, including some that are only briefly covered in this course.
  2. Pierre Legendre's (Université de Montréal) site has links to many useful programs (particularly those involving spatial analyses). Much of this software is written for Apple Mac computers, but there are also some Windows versions.
  3. PopTools is a very versatile Excel add-in from CSIRO. In addition to Mantel tests, it also incorporates a range of matrix methods and resampling techniques.
  4. The ADE-4 site is an online multivariate statistical package. You submit your data, it does the analyses and returns your results. You can also download the entire package to run on your own computer.
  5. The ordination methods for ecologists web site has links to many multivariate statistical techniques.
  6. PAST is a free data analysis package which, although aimed at paleontologists, has great potential for ecological analyses. Among many other techniques, PAST offers: regression (linear with standard and reduced major axis, lin-log (exponential), log-log (allometric) and logistic); diversity statistics and rarefaction; Dice, Jaccard and Raup-Crick similarity indices; principal components (with minimal spanning tree), principal coordinates and correspondence analysis with detrending; cluster analysis (three algorithms, nine distance measures); discriminant analysis; time series and spectral analysis; directional statistics, rose plots and point distribution statistics.
  7. R is a free, open-source implementation of the S language, on which the very powerful commercial S-Plus package is also based. Although it is very powerful, it is not for the faint-hearted: using it reveals its Unix heritage. If you wish to find a version for the Mac or PC, follow the download link and choose the nearest site. Note that this is a completely different package from the 'R' statistics package distributed from Pierre Legendre's site!
  8. Warren Kovach's MVSP software does most common multivariate analyses, including cluster analysis, PCA and PCO. The Windows version also does CA and CCA. This is shareware software, but you can try it before you buy it.
  9. Bill Miller has been developing a comprehensive and free statistics package called OpenStat that offers a number of multivariate analyses, including multiple regression, discriminant analysis, cluster analysis, principal components (factor) analysis and logistic regression.
  10. WinIDAMS is a software package for the validation, manipulation and statistical analysis of data, developed by the UNESCO Secretariat. It has a wide range of techniques including regression analysis, one-way analysis of variance, discriminant analysis, cluster analysis, principal components factor analysis and correspondence analysis. It is distributed free-of-charge upon request.

Do not treat this list as comprehensive. If you discover another interesting site please let me know.
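
Several of the packages listed above compute similarity indices, such as the Dice and Jaccard indices mentioned in the PAST entry, as a first step towards cluster analysis. As a rough illustration of what these two indices measure, the following Python sketch compares two sets of species. It is not taken from any of the packages above: the function names and the species lists are invented for illustration, and the treatment of two empty samples as identical is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity: shared items divided by items found in either sample."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty samples are treated as identical
    return len(a & b) / len(union)


def dice(a, b):
    """Dice similarity: twice the shared items divided by the sum of both sample sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty samples are treated as identical
    return 2 * len(a & b) / total


# Hypothetical presence-only species lists for two sites (invented data)
site_1 = {"oak", "ash", "birch", "hazel"}
site_2 = {"oak", "birch", "rowan"}

print("Jaccard:", jaccard(site_1, site_2))  # 2 shared / 5 in total
print("Dice:", dice(site_1, site_2))        # 2 * 2 shared / (4 + 3)
```

Both indices ignore species that are absent from both sites, and Dice gives more weight to shared species than Jaccard does, so the two can rank pairs of sites differently.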
