Clustering and Classification methods for Biologists



Background to clustering and classification


Data Pre-processing

Depending on the analysis, it may be necessary or desirable to pre-process the data before starting. Pre-processing may be needed because of algorithmic constraints: for example, all variables may need to be of the same data type. It may also be desirable to use transformations to speed up processing. Pre-processing can include the following.

Now try the following self-assessment question. If you need help with it, look at the resources listed on the section menu page.

1

Data Pre-processing

The following statements all refer to possible data-preprocessing actions. Some are correct, others are incorrect. Identify the correct ones.

a) Cosine(X) is a monotonic transformation.
b) The natural logarithm of X is a monotonic transformation.
c) It can be beneficial to use LOG(X+1) rather than LOG(X) as a transformation.
d) Rank(X) is a non-monotonic transformation (replace each value of x with its position in an ordered list).
e) Discretization of a continuous variable is necessary to construct a histogram of frequencies.
f) Information is lost if a variable, such as height, is transformed into two categories: above and below the mean.
a) Incorrect. Cosine(X) is not monotonic: a plot of Cosine(X) against X is an oscillating curve (it may be necessary to first transform your data to radians).
b) Correct. The natural logarithm of X is a monotonic transformation.
c) Correct. LOG(0) is not defined and will create a problem with your software. If you add 1 to every number, LOG(0) becomes LOG(1), which is 0.
d) Incorrect. Rank(X) is a monotonic transformation.
e) Correct. The values have to be placed in 'bins' to enable the frequency distribution to be calculated.
f) Correct. Information must be lost because the original values, which could be very diverse, are replaced by only one of two possible values.
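The points made in the answers can be checked numerically. The following is a minimal Python sketch, not part of the original course material: the data values and the helper names is_monotonic, rank and histogram are hypothetical, chosen only for illustration, and the rank helper assumes ties share the lowest position.

```python
import math

def is_monotonic(xs, ys):
    """Return True if ys never changes direction as xs increases."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

def rank(values):
    """Replace each value with its 1-based position in the sorted list
    (assumption: tied values share the lowest position)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def histogram(values, start, width, n_bins):
    """Count values per bin; assumes every value is >= start.
    Values beyond the last bin edge are placed in the last bin."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - start) // width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical count data containing a zero, where LOG(X) would fail.
counts = [0, 1, 3, 7, 20, 55]

log_counts = [math.log1p(x) for x in counts]  # log1p(x) computes LOG(X + 1)
cos_counts = [math.cos(x) for x in counts]    # x is interpreted as radians
ranks = rank(counts)

print(is_monotonic(counts, log_counts))  # LOG(X + 1) preserves the ordering
print(is_monotonic(counts, cos_counts))  # Cosine(X) oscillates
print(is_monotonic(counts, ranks))       # Rank(X) preserves the ordering

# Hypothetical heights in cm: discretize into 10 cm bins for a histogram.
heights = [150.0, 162.5, 171.0, 168.2, 180.4, 190.1]
bin_counts = histogram(heights, start=150.0, width=10.0, n_bins=5)
print(bin_counts)

# Two categories, above or below the mean: six distinct values collapse to two.
mean_height = sum(heights) / len(heights)
above_mean = [h > mean_height for h in heights]
print(len(set(heights)), "distinct values become", len(set(above_mean)))
```

Calling math.log(0) directly raises a ValueError in Python, which is the kind of software problem answer c) refers to; math.log1p avoids it by evaluating the logarithm of X + 1.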