Data Management

Objectives of this section

If you are starting out it in research it is an appropriate time to plan your electronic filing system. Whether you are collecting data for a three year PhD or within a research organisation, careful and thoughtful attention to data management will streamline your project. There are some simple tips and ideas here to help.

Contents

  1. Management of computer folders (directories)
  2. Organising your analysis
  3. Naming of files and types of files
  4. Dealing with data
  5. Final outputs and reporting

Management of computer folders

Good habits lead to efficiency

The management of computer folders can help save time while performing research. While this seems like a trivial task when you are performing your research, when you look back after a number of years you will be glad of a good file and folder naming strategy. Also your colleagues that have to find files after you have moved on to bigger and better things will thank you for a clear and easy to understand file system.

Folders

Computer folders are the first line of organisation that we can use. Think of your storage system as a filing cabinet or book shelf. You may designate a drawer in your filing cabinet for a particular project and then separate the data collection from your literature review and your reports may all have a separate hanging file. You should treat your computer storage in a similar way. It is advisable to have a folder which contains all the information relating to each project (although this may be a hard disk depending on the size of the project). Within this folder you will want to have a folder for all the separate parts of the project. An example of a file structure is shown in the picture below.
File_organisation_example
Some important points on folders

  • Try to give folders descriptive names. It is a lot easier to search for a file if your folders given a general idea to where it could be
  • Do not over complicate folders. While it is a good idea to have a few levels of folders, it can also be harder to find information if there are too many. If we think back to our filing cabinet example, you generally would only go to about three levels of folders: The drawer, the hanging files and folders inside the hanging files. You can of course have more levels in your project, but be warned about the complexity.
  • It can help to also collect some files together in their own folders in the project, such as data and figures.
File naming

The name you give to a file can also be a useful way to organise your work as well as start to use version control. When working in a team it is very important to have version control so changes can be tracked and combined to produce the final documents. The easiest way of doing this is to include the date, a description of the file and then the initals of the author of the file. While all of this data is recorded within the file, when searching it is faster to use the filename rather than opening every file. So a suggested file name structure is

20160210_filename_ANO.ext

Here the first numbers give the year, month and day that the file was created, then we have the filename, then the author’s initials and then the extension for the file. Also notice that our file name has no spaces in it. While Windows and other operating systems can cope with spaces in filenames, other programs do like to not work with them (mostly UNIX based programs). Also as you do not know how long the file will be stored, it is a good idea to go for compatibility now.

In the initial meetings of a project you should discuss the file naming strategy, especially is working on a shared storage system.

Here is a video of common mistakes

“Grandparent-Parent-Current” Backup Method

When working with files of any type it is good practice to keep a number of backups of the files. In the unlikely event that a file is corrupted or deleted, time can be saved by returning to an earlier backup instead of having to start from scratch. Of course your information technology set up may already include an automatic backup system, but it is still a reasonable idea to have a simple backup as well. The “Grandparent-Parent-Current” method is made up of the following backups

  • The current versions of the files.
  • A recent (e.g. 0<t<1 week old) copy of the files known as the “Parent”
  • An older copy of the files (e.g. 1<t<2 weeks old) copy of the files known as the “Grandparent”

Obviously each week in this example at the end of each week the current files are copied to the Grandparent storage, thus becoming the Parent. The previous Parent storage then becomes the Grandparent until the end of the next week. With this method the maximum amount of work you would lose is 2 weeks.

Back to top

Organising your analysis

Keep raw data

It is very important that before any work is performed on a data set that a copy of the file is made. You should then work on the copy of the file, rather than the original file. Sometimes when performing analysis and removing data because of things such as missing data or poor measurements it can be necessary to change the method of dealing with these items, or you may want to analyse a different variable that has now had data removed due to missing values in other variables. With the original raw data you can quickly go back to a situation before you started analysing the data and proceed with the alternate method. It is also useful when trying to understand how certain data was recorded, or if there is a problem with certain data collection.

Importing Data

After you have organised your files so that they are in an sensible location and format it is time to import the data into your chosen statistical package. As R is used throughout the BeST course, we will show how to get data from various types of files read into the R environment so that it can then be analysed.

Text files

As text files are the simplest of the files used to store data, it makes them a logical starting point. R can handle text files in one of two formats:

  • Comma separated variables (as known as comma separated value) files, *.csv
    These files contain data that looks as follows
Soil,InP,OrgP,AvailP
1,0.4,53,64
2,0.4,23,60
3,3.1,19,71
4,0.6,34,61
5,4.7,24,54
6,1.7,65,77
...

As we can see each of the values for the observations are separated by a comma (this may be a full stop/period if your system uses commas to show decimals) with the variable names given at the top of the file. To read this type of data into R we use the command data.frame.name = read.csv("filename.csv") (if your data set uses commas for decimals use the command data.frame.name = read.csv2("filename.csv") so that the decimals are read in correctly).

  • (Tab) Delimited variables, *.txt
    The files contain data that looks as follows

 

Soil  InP  OrgP  AvailP
1     0.4  53    64
2     0.4  23    60
3     3.1  19    71
4     0.6  34    61
5     4.7  24    54
6     1.7  65    77

Here the values for each observation are separated by a space (normally a tab), again with the variables names given at the top of the file. To read this type of data into R we use the command data.frame.name = read.delim("filename.txt"). (again if your data set uses commas for decimals

use the command data.frame.name = read.delim2("filename.txt") again so the data is read in correctly).

It is advisable not to use the clipboard to read in data to R as this can lead to missing data and errors in variables. Also to allow the analysis to be repeatable, it is better to start with a data set in a file.

Excel Files

Until recently the best method for importing data from an Excel workbook was to convert the data sheet to a comma separated variable file and then import this file using the method given above. However a R package has been developed recently to allow the data to be imported directly to R without having to do this conversion first. The package is called XLConnect.

  1. The first time you want to import data from an excel file you will need to install the XLConnect package. This can be done using the command install.packages("XLConnect"). This only has to been done once on each machine as it will install to the library of functions
  2. Next you need to load the package, as you do with all packages, using library(XLConnect)
  3. After loading the package, you need to load the whole Excel workbook into R. You may not be using all of the data contained in the workbook, but this access lets you select various parts. You load the workbook using the command wb = loadWorkbook(filename), where wb is the name of the handle you will use to reference the workbook in R, and filename is the name of the file you wish to import. You can of course call your workbook handle anything you desire
  4. You can now read in the data from this workbook in two ways
    • To read in an entire worksheet (if you have designed your data collection sheet as above this should be the option used) you read in your data using the command data = readWorksheet(wb,sheet1), where wb is again the name of the workbook handle, sheet1 is the sheet you wish to import, sheet1 is the worksheet you wish to import and data> is the name of the data frame (again you can and should change this to something more descriptive).
    • If your workbook is set up with named regions the command to use is data = readNamedRegion(wb, name1), where again wb and data are defined as before and is the name of the region you wish to import.

The data is now imported, and can be used as with all data frames in R. There are more advanced commands and options for only importing a number of cells from a worksheet, but these will not be given here. Please refer to the package documentation for more information.

Initial checks

After you have imported your data the next step is to make sure that it has imported correctly. While most times this will feel like a pointless exercise, many times correcting an error at the beginning of your analysis will save countless headaches. Data can import without errors, but due to formatting errors or an additional space that you didn’t see, your values can end up with the wrong variable heading.

RStudio Method

As RStudio is a graphical user interface for R, it is very easy to quickly check the data frame that you have imported your data to. In the “Environments” window of RStudio you will see your data frame along with how many variables and observations you have. If you click on the arrow in the circle next to this, a list is expanded to show the type, name and example values of each variable. An example is shown below
Enviroment Window 1
To find out if all the values have been read in correctly you can simply click on this line and it will show the entire data set where the script window is located. Continuing our example, like this:
RStudio2
As you can see it is quite easy to check your data using RStudio.

Command Method

Alternatively you can use commands in R to perform the same tasks. To start with we should like to know the structure of data set PData, so we use the following command:

?View Code RSPLUS
 str(Pdata)
'data.frame':	17 obs. of  4 variables:
 $ Soil  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ InP   : num  0.4 0.4 3.1 0.6 4.7 1.7 9.4 10.1 11.6 12.6 ...
 $ OrgP  : int  53 23 19 34 24 65 44 31 29 58 ...
 $ AvailP: int  64 60 71 61 54 77 81 93 93 51 ...

The command str is used to print out the structure of the dataset.
As we can see we get a similar output to that in RStudio, with the names of the variables listed, their respective types and some sample values of each variable. We can also see how many observations and variables we have. To check the data has been read in correctly we can look at the first 6 rows of the data using the following command:

?View Code RSPLUS
head(Pdata)
  Soil InP OrgP AvailP
1    1 0.4   53     64
2    2 0.4   23     60
3    3 3.1   19     71
4    4 0.6   34     61
5    5 4.7   24     54
6    6 1.7   65     77

There is a similar command, tail that allows you to look at the last 6 rows of the data. To output the entire data set, we either use:

?View Code RSPLUS
view(Pdata)

or to display in the console you can simple type the name of the data frame

?View Code RSPLUS
 Pdata
   Soil  InP OrgP AvailP
1     1  0.4   53     64
2     2  0.4   23     60
3     3  3.1   19     71
4     4  0.6   34     61
5     5  4.7   24     54
6     6  1.7   65     77
7     7  9.4   44     81
8     8 10.1   31     93
9     9 11.6   29     93
10   10 12.6   58     51
11   11 10.9   37     76
12   12 23.1   46     96
13   13 23.1   50     77
14   14 21.6   44     93
15   15 23.1   56     95
16   16  1.9   36     54
17   17 29.9   51     99

Back to top

Dealing with Data

Having now correctly imported the data, we need to deal with some of the most common data processing tasks.

Missing data and types of missing data

The first of the data processing tasks is how to deal with Missing data. With real world data, as opposed to generated data that you may have used during academic courses, dealing with missing data is an important stage of processing the data before your analysis. Unfortunately there is a not a “one size fits all” method to deal with missing data, and the method used will depend on both the type of missing data as well as the type of analysis you will be conducted. Lets look at the type of missing data.

Missing completely at random

When data is missing completely at random the events leading to a data item being missing are independent of all parameters both observable and unobservable. In other words, the probability of any particular observation being missing does not depend on any measurements either taken or not taken during the study. An example of this is missing the recording of the level of rain because the investigator broke their leg.

Missing at random (MAR)

Missing at random data is not missing due to randomness but where the reason for it being missing can be fully accounted for by variables where there  complete information. This is an assumption which is impossible to verify but relies upon reasonableness.

Missing not at random (MNAR)

This is missing data which does not fit into the previous two categories. This means the reason behind the data being missing is related to the value of the variable that is missing.

Incorrect recording

Incorrect recording is not technically  missing data  but incorrectly recorded data, for example recording a length in millimetres rather than centimetres, can lead to similar issues when the analysis is being conducted. In some cases the units can be changed so millimeters are divided by 10 to get centimeters. It must constitute a verifiable mistake in this case. (e.g when a research assistance tells they used the  wrong unit).

Back to top

Dealing with missing data

Imputation/ Interpolation

The use of imputation, or using the existing data to fill in the blank, it a vast topic in itself.  In general it is not simple, and also the value obtained may not be correct for that individual subject as the imputed value is necessarily calculated from all other variables.  In general it is preferable to use an analysis  that can cope with  a  missing value.

Partial imputation

Partial deletion

Full analysis

The use of all the data, with  a few missing allows a full analysis. Some packages in R cannot handle unbalanced data, thus an alternate package will be selected. For example the aov() function requires balanced data. Balanced data is where there are equal replicates across the factors.

Back to top

Looking at Factors

Green is our  comment next to the  R code to assist you

Checking whether a variable is a factor

A factor or categorical treatment,may be  read into the dataset as a numeric (1 through 4).

The factor has a name eg   Genotype  and there are 4 level. (1,2,3,4,) . However they can be converted to a factor and have  a label assigned which makes  plotting and tabling of data easier to interpret.

  • Check whether  variable is a factor

is.factor(Genotype)   # you get the output [1]  FALSE if it is not  a factor

is.numeric(Genotype)   # you get the output [1]  TRUE  if it is numeric

  • Making a numeric variable into a factor

as.factor(Genotype)      # you get the numeric converted to 4 categories

  • Checking actual values saved in the variable and add labels

unclass(datset$Genotype)  # gives the output as a list of what is stored as integers

[1]    1  2   3    2    3    1    4  # example of what you might see if is saved as integers

levels(dataset$genotypes = c(“VarA”,”VarB”,”VarC”,”VarD”)  # applies the labels to this factor

  • Check the data

head(dataset)

  • Renaming a variable

dataset$new    =  dataset$old         # this gives you two variables old and new, where they have the  same data

  • Splitting a numeric variable into categories (use the cut  function)

Split a variable age into a new age- groups variable and make 3 bins of the data, and we have labelled them

dataset$age-cat   = cut(dataset$age, 3, include.lowest=TRUE, labels = c(“young” , “mid”, “older”))

  • Checking the data using cross tabulations

 

  • Means summarized over other variables

Back to top

Final outputs and Reporting

Consider the data coding for your study  and consider the potential alignment with national or international reporting

In some studies, particularly when there is  a link to a national or regional report, it may be worth looking at  the coding or categories used in these related report. This is particularly relevant for demographic variables. Age categories could align with another data set- allowing comparisons to be more easily examined.  Income may be useful to match a national household survey, and definitions examined.

  • Consider the types of output files

With the analysis we may need long complete files that contain all our analysis. However for reporting and for publishing we may only need summary tables and graphs of key findings.  We can save R into Word, HTML, PDF and Markdown documents to contain R code and the outputs from the analysis. We can also annotate these to add comments and interpretation. Note our demo files actually contain R code and the outputs. So you can try the code and also see whether it ran and more importantly the interpretation.  (add a downloadable file  types  in R)

  • Consider the format of files for the report

Does the executive  or your research collaborators  prefer  a Word, a pdf or other format?

When publishing for  a journal check on format and the style required. Various formats are available. R is very flexible and can produce very fine graphics, and tables with some extra coding (this can be expanded elsewhere in this site)

  • Output for the web

If you wish to put reports onto the web, then consider the best format- HTML

  • Reproducible data

By using the advanced and ever increasing flexibility and functionality of R we can provide reports that are reproducible, and ideal as  Reports per se or as Supplementary materials for Theses and journal articles

Back to top

 

2 Responses

  1. rumbimatemba

    January 25, 2017 8:27 pm

    Dear Miranda,
    Thanks for the insightful material under data management. Kindly note that:
    1. When I use as.factor(Pedigree) it will not work for me. I get an error message in RStudio.
    2. When I use as.factor(EIHYB$Pedigree) I do not get the error message, but my variable still remains a character.
    3. When I use EIHYB$Pedigree=as.factor(EIHYB$Pedigree) it works and my variable is converted into a factor.

    Can we discuss this?
    kind regards,
    Rumbi

    Reply
    • Vincent

      January 26, 2017 5:42 pm

      Hi Rumbi,

      Thanks for the comment! I’ll go through your comment item by item:
      1. as.factor(Pedigree) would indeed produce an error as R (either on its own or through RStudio) does not know which data frame (where the data is read into) contains the variable Pedigree.

      2. as.factor(EIHYB$Pedigree) does work as should be expected. This time the data frame is stated and the variable Pedigree selected. However this only temporarily converts the variable to a factor and so in the original data frame it stays as a character variable. This is because the output variable has not be stated and so the output of the command is only stored for that execution. In other words, this is a good option if you only want to have Pedigree as a factor for some commands during the analysis.

      3. EIHYB$Pedigree=as.factor(EIHYB$Pedigree) outputs the factor levels back to the original variable and, in doing so permanently changes the variable from a character to a factor level. EIHYB$Pedigree=factor(EIHYB$Pedigree) would work equally well. This would be good if you wanted to make Pedigree a factor for the entire analysis rather than on a command by command basis. An alternative to point 2 is to create a new variable using EIHYB$PedigreeF=as.factor(EIHYB$Pedigree) which then allows the original variable to remain as well a copy which will be treated as a factor.

      Hopefully this helps clear up the responses you were getting in RStudio. Feel free to ask further questions if my explanation is not clear.

      Thank you for using our training programme, and I hope you are enjoying it.

      Kind regards

      Vincent.

      Reply

Leave a Reply