Data Pre-processing
Depending on the analysis, it may be necessary or desirable to pre-process the data before starting. Pre-processing may be needed because of algorithmic constraints: for example, variables may need to be the same data type. It may also be desirable to use transformations to speed up processing. Pre-processing can include the following.
- Monotonic transformations such as square root or logarithmic. Monotonicity can be demonstrated graphically: a plot of the original values against the transformed values will not have any peaks or troughs if the transformation is monotonic. The square transformation (change x to x²) is not monotonic if both negative and positive numbers are included: the plot dips down to 0 at x = 0 and increases as x gets increasingly negative or positive.
- Change the data type by degrading it to a less informative type. Information may be lost in a transformation but it is never added. For example, a continuous score can be converted to a binary format using an above/below threshold rule. Similarly, a continuous variable could be reduced to a small number of ordered values. This process is called discretization. Although conceptually simple, the decisions on the placement of class boundaries for the new ordinal variable are potentially complex. How many classes should there be? How are class boundaries defined: equal interval, equal frequency, natural breaks?
- It might be important to reduce the number of variables using a selection/rejection routine. There are automated, stepwise methods that use statistical criteria. Alternatively, Huberty (1994) suggested the use of logical screening (on theoretical, reliability and practical grounds) to screen variables. This is possible if some initial research identifies variables that may have some theoretical link. We may also wish to take into account the reliability of the data, and we should not ignore the practical problems of obtaining data, including cost (time or financial) factors.
- Data reduction using a projection method such as principal components analysis (PCA) or Sammon mapping. (Sammon mapping uses an iterative process to produce a two-dimensional representation of a data matrix with more than two variables.)
Now try the following self-assessment question. If you need help with any of these, look at the resources listed on the section menu page.
1. Data Pre-processing
The following statements all refer to possible data pre-processing actions. Some are correct, others are incorrect. Identify the correct ones.