Data Processing

  1. Missing data
  2. Size of the data-frame and characteristics
  3. Wide to long format
  4. Factors and some useful commands
  5. Functions in R: a simple example

Missing data

Avoid missing data by having a good design

First, we should try to avoid having missing data. Of course, this is not always possible.
We can reduce missing values by having a focused design. A simple but often forgotten point: when planning a survey, make sure it is short and focused. Respondents may not finish an online survey if it is not short and attractive. Respondents are busy, so ask only the questions you really need.
Have clear protocols for data capture. In field work, write clearly so that the data are correct and accurate and use valid measurements. Check the data entry, and check for typos in the spreadsheet.

Then we have to work out why it is missing.

NA indicates a missing value in R. There are different reasons for an NA. Data could be NA ("not available") because the data were lost (survey forms left in the back of the car).

NA may also mean "not applicable". In that case, it shows you have a question that does not apply to every respondent (children cannot be married, school children cannot be retired).

It may indicate a subset question: for example, only families with children are asked questions about their children, so the NA values belong to couples without children. Remember that variables asked only of a subset give you a smaller sample size for that subset.
For how to deal with missing data, see Data Management.
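
As a minimal sketch of how these cases look in R, the code below counts the missing values in a small made-up survey. The data frame survey, its column names and all of its values are hypothetical and serve only as an illustration.

```r
# Hypothetical survey: youngest_age is "not applicable" for couples
# without children, and one income value is "not available"
survey <- data.frame(
  household    = c("A", "B", "C", "D"),
  children     = c(2, 0, 1, 0),
  youngest_age = c(7, NA, 3, NA),
  income       = c(52000, NA, 61000, 48000)
)

# Number of missing values in each column
n_missing <- colSums(is.na(survey))
n_missing

# The subset of families with children has a smaller sample size
families <- survey[survey$children > 0, ]
nrow(survey)
nrow(families)
```

Comparing nrow(survey) with nrow(families) shows how much smaller the sample becomes for the subset questions.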

Size of the data-frame and characteristics

To get the dimensions of a data set (the number of rows, then the number of columns)
dim(dataset)

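We can ask questions of our data set by using the structure command
str(dataset)
or we can use logical tests

is.data.frame(dataset)  # TRUE if dataset is a data frame
is.list(dataset)  # TRUE if dataset is a list (a data frame is also a list)
is.na(dataset)  # TRUE or FALSE for every value: TRUE where the value is missing
is.matrix(dataset)  # TRUE if dataset is a matrix
is.vector(dataset)  # TRUE if dataset is a vector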
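
The commands above can be tried on a small example. This is only a sketch: the data frame dataset below and its columns id, group and score are invented for illustration.

```r
# Hypothetical data set with 4 rows and 3 columns
dataset <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  score = c(10.5, NA, 8.2, 9.9)
)

dim(dataset)      # number of rows, then number of columns
nrow(dataset)     # number of rows only
ncol(dataset)     # number of columns only
length(dataset)   # for a data frame, this is the number of columns
str(dataset)      # type and first values of every column

is.data.frame(dataset)   # TRUE
is.list(dataset)         # TRUE, a data frame is a list of columns
is.matrix(dataset)       # FALSE
is.vector(dataset)       # FALSE

any(is.na(dataset))      # TRUE, at least one value is missing
sum(is.na(dataset))      # total number of missing values
```

Note that length() on a data frame counts columns, not rows, which is why dim() or nrow() is the safer way to ask for the size.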

Wide to long format

The way data is entered is important. Mostly we use the long format, in which there are many rows of individual values, and columns of variables.

This figure shows the wide format. The columns are variables such as "bdi.pre", "bdi.2m", etc. These columns are measurements of the same quantity, in the same units, taken at different times, so they can be rearranged into the long format (see below).

[Figure c8-1widedat: example data in wide format]

The long format keeps "bdi.pre" as one column, but the other "bdi" columns have been rearranged into two columns: a "time" column and the associated "bdi" value.

In R, the command head() gives the first six rows by default and tail() gives the last six rows.

[Figure c8-2longdat: the same data in long format]
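
One way to do this rearrangement is the base R function reshape(). The sketch below is an illustration under assumptions, not the code used to produce the figures: the data frame wide, the id column, the column bdi.4m and all the values are made up.

```r
# Hypothetical wide data: one row per subject, one bdi column per time point
wide <- data.frame(
  id      = 1:3,
  bdi.pre = c(29, 32, 25),
  bdi.2m  = c(12, 16, 20),
  bdi.4m  = c(10, 24, NA)
)

# Stack the follow-up columns into a "time" column and a "bdi" column,
# keeping bdi.pre as a column of its own
long <- reshape(
  wide,
  direction = "long",
  varying   = c("bdi.2m", "bdi.4m"),  # columns to stack
  v.names   = "bdi",                  # name of the stacked value column
  timevar   = "time",                 # name of the new time column
  times     = c("2m", "4m"),          # label for each stacked column
  idvar     = "id"                    # identifies the subject
)

# Sort so that the rows of each subject sit together
long <- long[order(long$id, long$time), ]

head(long)   # first six rows
tail(long)   # last six rows
```

Each subject now has one row per follow-up time, so three subjects with two follow-ups give six rows.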

Factors

Making a factor
var1 = as.factor(var)  # this converts the variable var into a factor called var1
Checking a factor
is.factor(var1)  # this will give a logical answer, TRUE or FALSE
Ordered factors
var1 = ordered(var)  # this will make an ordered factor, whose levels have a ranking

In a regression with an ordered factor, R uses polynomial contrasts by default, so it is easy to get the linear, quadratic and cubic (polynomial) tests for that factor.
Factor labels
var2 = factor(data$v1, levels = c(1,2,3), labels = c("low", "med", "high"))  # this will label a factor
Releveling a factor
var4 = relevel(x, ref, ...)  # x is an unordered factor, ref is the level to use as the new reference level
To check the class of an object (numeric, matrix, data frame, etc.)
class(var)

The result will appear in the Console.
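
The factor commands above can be put together in a short sketch. The vector v1 and its values are hypothetical; the labels follow the "low", "med", "high" example above.

```r
# Hypothetical numeric codes for a three-level variable
v1 <- c(1, 3, 2, 1, 2)

# Make a labelled factor and check it
f <- factor(v1, levels = c(1, 2, 3), labels = c("low", "med", "high"))
is.factor(f)
levels(f)

# Relevel so that "high" becomes the reference (first) level
f2 <- relevel(f, ref = "high")
levels(f2)

# Make an ordered factor: the levels now have a ranking
o <- ordered(f)
class(o)
o < "high"     # comparisons are allowed for ordered factors

# Ordered factors get polynomial contrasts by default:
# .L is the linear term and .Q the quadratic term
contrasts(o)
```

With three levels there are two contrasts, linear and quadratic; a cubic term appears only when the ordered factor has at least four levels.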

Class of an object: class()        Mode of an object: mode()

data frame                         numeric
factor                             logical
ordered factor                     NULL (empty)
matrix                             character
list                               complex

Ask by typing is.<class>(x),       Ask by typing is.<mode>(x),
for example is.factor(x)           for example is.numeric(x)
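
The difference between class and mode is easiest to see by asking both questions of the same object. The objects below are made up for illustration.

```r
# Hypothetical objects of different kinds
x_num <- c(1.5, 2, 3)
x_chr <- c("a", "b")
x_lgl <- c(TRUE, FALSE)
x_fac <- factor(c("low", "high", "low"))
x_df  <- data.frame(a = 1:2)

class(x_num); mode(x_num)   # "numeric" and "numeric"
class(x_fac); mode(x_fac)   # "factor" and "numeric" (stored as integer codes)
class(x_df);  mode(x_df)    # "data.frame" and "list"

mode(x_chr)                 # "character"
mode(x_lgl)                 # "logical"
mode(NULL)                  # "NULL"

is.numeric(x_num)           # TRUE
is.numeric(x_fac)           # FALSE, a factor is not treated as numeric
is.data.frame(x_df)         # TRUE
```

The class describes what kind of object R treats something as, while the mode describes how its values are stored.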

Functions in R: a simple example