Importing
multiple Excel files, each with multiple sheets, into R.
Excel is extremely
popular as a tool for organizing data and it has fairly easy-to-use
functions for rudimentary statistics and data displays (i.e. graphs
& charts). However, it is not a statistical software package
and therefore, it is often necessary to import Excel data structures
into other, more statistically oriented software. For this reason, DSA
personnel do not recommend using Excel; for data storage, data display,
or data analysis. An often quoted phrase1 is the
following; the only thing worse than using SPSS, is using Excel. For
more information on the known problems with Excel and other spread
sheet based software, see Burns (2013). DSA recommends storing data in
plain text (.txt) files with comma delimiters; also known as a comma
separated values (.csv) file type. The reason DSA recommends text
(.txt) or comma separated values (.csv) file types is because those
file types can be easily opened or imported into all the statistical
software packages. However, if you feel you must use Excel, then this
article should help you with the inevitable task of getting data from
Excel into a more worthy software package for statistical data
analysis; and there really is no more worthy software for that purpose
than R.
Context of the
Example
An example has been
created to illustrate a procedure for importing several Excel files,
each with multiple sheets, into the R workspace and merging them
together as a single data frame. The premise of our example is a
research design with 10 participants, 3 lighting conditions, and 5 time
series (chin movements, left eye [pupil] movements, right eye [pupil]
movements, left wrist movements, right wrist movements). Each
participant was exposed to each lighting condition and their movements
were measured throughout a 10 minute typing task -- all three body-part
measuring apparatus' took samples 100 times per minute to measure
positional changes (in millimeters) from an enforced baseline / start
position, while each eye's pupil movement reflects the movement (in
millimeters distance) from looking at the center of the screen. In
other words, the eye (pupil) movement refers to changes of movement in
gazing at the center of the screen to gazing at the edges of the
screen, or the keyboard. Again, the time series data was sampled at 100
times per minute for the full 10 minutes of typing (n = 1000,
per time series). Motion capture software exported the resulting data
into 10 Excel files. Each Excel file corresponds to each participant
(participant.1.xls, participant.2.xls…etc.) and each Excel file
contains 3 sheets; one sheet per lighting condition (Off, Dim, Bright).
Each sheet contains five time series corresponding to the five measured
variables (Chin, R_eye, L_eye, R_wrist, L_wrist). The resulting
simulated data is available below so that the reader can download the
data and replicate what is illustrated in this tutorial. Our goal was
to import all the data and merge it into a single data frame.
Example Data:
Illustrative
Example
First, ‘set’ the working
directory (wd) to the (path) location on your computer where the files
are located; in this example, we have the 10 Excel files on our
desktop. Below, and throughout the example, we are using black, Times
New Roman, font for text and we are using Courier New font for R script
(in red) and R output (in blue).
setwd("C:/Users/jds0282/Desktop/")
Next, load the packages
which will allow us to import Excel data files; the XLConnect package
is the package we want and it requires the rJava package.
library(rJava)
library(XLConnect, pos = 4)
XLConnect 0.2-5 by Mirai Solutions GmbH
http://www.mirai-solutions.com ,
http://miraisolutions.wordpress.com
Next, create an object
with the file names. Here, we are using the paste function to create
sequential character string names.
pre1 <- "participant"
pre2 <- seq(1:10)
suf <- "xls"
file.names <- paste(pre1, paste(pre2, suf,
sep = "."), sep = ".")
rm(pre1, pre2, suf)
file.names
[1] "participant.1.xls"
"participant.2.xls" "participant.3.xls"
[2] "participant.4.xls"
"participant.5.xls" "participant.6.xls"
[3] "participant.7.xls"
"participant.8.xls" "participant.9.xls"
[4] "participant.10.xls"
Next, create an object
with the sheet names. Recall, each file contains 3 sheets; each sheet
corresponds to a lighting condition.
sheet.names <- c("Off","Dim","Bright")
sheet.names
[1] "Off"
"Dim" "Bright"
Next, we create a vector
of names which will be the column names for the final data frame. The
data frame must include columns (factor level variables) which contain
coding information which identifies each row’s data. In this example,
we need three such factors; one for the participant, one of the
condition, and one for the sampling frame (1 to 1000) which represents
each of 100 samples per minute (for 10 minutes). The other five names
(and columns) represent the five motion capture time series distance
measures.
e.names <-
c("participant.id","condition","sampling.frame",
+
"Chin","R_eye","L_eye","R_wrist","L_wrist")
e.names
[1] "participant.id"
"condition"
"sampling.frame" "Chin"
[2]
"R_eye"
"L_eye"
"R_wrist"
"L_wrist"
The last step in
preparation is to create the final data frame (data.1), keep in mind,
this data frame only has one row (for now) and that row includes only
'NA' values. However, some simple mathematics allows us to compute the
size of the final data frame. It will have 8 columns and 30,000 rows
(10 participants * 3 conditions each * 1000 rows per condition). It is
important to remember the first row is made up of ‘NA’ values and
represents a place holder (it will be deleted after all the data is
imported).
data.1 <-
data.frame(matrix(rep(NA,length(e.names)), ncol = length(e.names)))
names(data.1) <- e.names
data.1
participant.id condition sampling.frame
Chin R_eye L_eye R_wrist L_wrist
1
NA
NA
NA NA
NA
NA
NA NA
Now, we're ready to use
two 'for-loops' to import each sheet of each file and row bind (rbind)
them to the original / final data frame. However, it may be beneficial
to elaborate on what each line of each ‘for-loop’ is doing. Line
numbers have been added to the script below in order to help facilitate
explanation of each line. Obviously, these line numbers are not
functional R script (red, Courier New) or R output (blue, Courier New)
and therefore are printed in black (Times New Roman) font.
1: for (i in 1:length(file.names)){
2: wb <-
loadWorkbook(file.names[i])
3: for (j in
1:length(sheet.names)){
4: ss
<- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
5: condition
<- rep(sheet.names[j],nrow(ss))
6: sub.id
<- rep(file.names[i],nrow(ss))
7: s.frame
<- seq(1:nrow(ss))
8: df.1
<- data.frame(sub.id,condition,s.frame,ss)
9:
names(df.1) <- e.names
10: data.1 <-
rbind(data.1, df.1)
11: rm(ss, condition,
s.frame, sub.id, df.1)
12: }
13: rm(wb)
14: }; rm(e.names, file.names, i, j,
sheet.names)
Line 1 above simply
initiates a ‘for-loop’; which is nothing more than a way to tell the
computer to read all the lines between the curly braces { }
and before proceeding, it should read those lines again, and again, and
again…until ‘i’ equals the length of the ‘file.names’ object. The
length of the ‘file.names’ object is 10 because we specified earlier 10
file names. So, line 1 is essentially instructions which say; read the
following lines, or iterate through the following lines, 10 times. The
character ‘i’ is assigned a zero until the first iteration is complete,
at which time it is assigned a 1; next iteration i = 2, and so on until
i = 10. The closing curly brace is on line 14 and the script after that
curly brace will only be read when all 10 iterations have completed.
So, lines 2 through 13 will each be read, or processed, 10 times in
sequence (i.e. read lines 2 through 13, then read lines 2 through 13,
then…).
Line 2 above simply
imports an Excel workbook (file) and assigns it to ‘wb’ (an arbitrary
or temporary name of the workbook). We are telling the software the
file name to look for by passing the file.names object to the
loadWorkbook function and because the file.names object contains all 10
names, we specify the one which corresponds to the iteration number
(i). So, for the first iteration, the loadWorkbook function looks for
“participant.1.xls” because that is the first object of the file.names
object.
Line 3 initiates a second
‘for-loop’ but instead of labeling each iteration ‘i’ we are labeling
each iteration in this loop ‘j’ – which differentiates the iterations
of the two loops. The ‘j’ loop will iterate from 1 until the length of
the sheet.names object. Recall, we specified 3 sheet names;
corresponding to the 3 lighting conditions (Off, Dim, Bright). Keep in
mind, the closing curly brace for the ‘j’ loop is on line 12; which
means, there will be 3 iterations of loop ‘j’ occurring inside each
single iteration of the ‘i’ loop. Another way to think about this is;
we read in an Excel file with the ‘i’ loop and that file contains 3
sheets, each of which must be imported before going to the next Excel
file.
Line 4 imports or reads
the jth sheet and assigns it as an object of
‘ss’. The ‘ss’ is simply an arbitrary or temporary name for the sheet.
Each sheet contains the data from the five measurements (chin, right
eye, left eye, right wrist, left wrist) – this includes 1000 time
series data points for each of the five measures or columns. Take note
of the arguments of the readWorksheet function. First, we pass the wb
object (the workbook) to the readWorksheet function, then we specify
which sheet to import using the vector of sheet names (here, the jth
sheet, with j = to the iteration number of the ‘j’ loop). Subsequent
arguments allow us to specify the particular column and row (startCol;
startRow; Header = TRUE or FALSE) of the sheet which contains the data.
We could (although not shown) use other arguments (endCol; endRow) to
specify specific places in the sheet to stop reading or importing data.
Line 5 simply creates a
vector containing the sheet name (of the sheet just imported)
replicated the same number of times as the number of rows of that sheet
(n = 1000) and assigns that vector the name
‘condition’. Line 6 does the same thing for the workbook name or Excel
file name which corresponds to the participant whose data is being
imported. Line 7 creates a vector of sequential values from 1 to the
number of rows of the sheet being imported. These values simply number
each sample from the motion capture software (1000 samples = 100
samples per minute of the 10 minute task). Line 8 simply creates a
temporary data frame (df.1) which has 1000 rows and 8 columns. The
columns correspond to the participant identification (participant.id),
the sheet name or condition (1 of three lighting conditions), the
sequential sampling frame numbers (1 to 1000) and then the five motion
capture measures (chin, right eye, left eye, right wrist, left wrist).
Line 9 assigns the proper names to these columns, which are the same
names and will match the columns of the final data frame (data.1). Line
10 ‘row binds’ (rbind) the newly imported data (df.1) to the bottom of
the final data frame (data.1) – simply adding rows to the final data
frame.
Line 11 removes (rm) all
the no longer needed objects. Line 12 ends the ‘j’ loop. Line 13
removes (rm) the no longer needed workbook (wb). And finally, line 14
ends the ‘i’ loop and then removes objects no longer needed. Line 11
and line 13 are not strictly necessary because each iteration of each
loop will re-write or over-write the objects contained in those lines.
However, programming has some best practices which can be described as
similar to some rules learned in kindergarten…always share and always
cleanup after yourself.
Now, to point out one of
the benefits of using R: after having read the above section and having
studied the R script it describes; it is plain to see that an object
oriented programming language, such as the R programming language, is
much more efficient than written American English. It took several
paragraphs to explain only 14 lines of programming.
Once the looping
functions have completed (it should take less than 10 seconds), you can
run a summary of the final data frame. You’ll notice there are some
oddities associated with the data frame, which are revealed in the
summary output.
summary(data.1)
participant.id
condition
sampling.frame
Chin
Length:30001
Length:30001
Min.
: 1.0 Min.
:-501.606
Class :character Class
:character 1st Qu.: 250.8 1st
Qu.:-249.776
Mode :character
Mode :character Median :
500.5 Median : 0.044
Mean : 500.5
Mean :
-1.056
3rd Qu.: 750.2 3rd Qu.: 247.143
Max. :1000.0
Max. : 501.578
NA's
:1
NA's
:1
R_eye
L_eye
R_wrist
L_wrist
Min.
:-5.022822
Min. :-5.018528
Min. :-502.5246
Min. :-502.9264
1st Qu.:-2.504122 1st
Qu.:-2.508357 1st Qu.:-249.2208
1st Qu.:-249.9485
Median : 0.000448 Median
:-0.001666 Median :
-0.3302 Median :
0.2345
Mean :
0.008412 Mean
:-0.000738 Mean :
-0.9942 Mean :
-0.7184
3rd Qu.: 2.500890 3rd Qu.:
2.521117 3rd Qu.: 248.6416 3rd
Qu.: 248.3726
Max. :
5.013923 Max. :
5.028788 Max. :
503.3902 Max. :
502.5689
NA's
:1
NA's
:1
NA's
:1
NA's :1
The first thing to notice
is the participant identification (participant.id) and condition
columns contain character string information instead of factor level
data. Also, notice the number of rows (for all columns) is 30001
instead of 30000. The extra row is the first row of the data frame
which contains all NA as a result of how we created the data frame
prior to importing the data. So, we need to remove the first row and we
need to convert the first two columns to factors.
data.1 <- data.1[-1,]
data.1[,1] <- factor(data.1[,1])
data.1[,2] <- factor(data.1[,2])
summary(data.1)
participant.id
condition
sampling.frame
Chin
participant.1.xls : 3000
Bright:10000 Min.
: 1.0 Min.
:-501.606
participant.10.xls: 3000
Dim :10000
1st Qu.: 250.8 1st Qu.:-249.776
participant.2.xls : 3000
Off :10000 Median :
500.5 Median : 0.044
participant.3.xls
:
3000
Mean : 500.5
Mean :
-1.056
participant.4.xls
:
3000
3rd Qu.: 750.2 3rd Qu.: 247.143
participant.5.xls
:
3000
Max. :1000.0
Max. : 501.578
(Other)
:12000
R_eye
L_eye
R_wrist
L_wrist
Min.
:-5.022822 Min.
:-5.018528 Min.
:-502.5246 Min.
:-502.9264
1st Qu.:-2.504122 1st
Qu.:-2.508357 1st Qu.:-249.2208
1st Qu.:-249.9485
Median : 0.000448 Median
:-0.001666 Median :
-0.3302 Median :
0.2345
Mean :
0.008412 Mean
:-0.000738 Mean :
-0.9942 Mean :
-0.7184
3rd Qu.: 2.500890 3rd Qu.:
2.521117 3rd Qu.: 248.6416 3rd
Qu.: 248.3726
Max. :
5.013923 Max. :
5.028788 Max. :
503.3902 Max. : 502.5689
Now that we have the data
imported and merged into a single data frame, we can then export that
data frame by writing it to our working director, which was set at the
beginning of the script (‘setwd’) to our desktop. The file which is
saved to the desktop will be named “typing_experiment_data.txt” and it
will contain comma delimited (or comma separated) values, without row
names but with column names. Any missing data (there is none in this
example) will be recognized as ‘NA’ and decimals will be represented
with periods (‘.’).
write.table(data.1, file =
"typing_experiment_data.txt",
sep = ",", na = "NA", dec =
".", row.names = FALSE,
col.names = TRUE)
Conclusions
Keep in mind, there are a
variety of different ways of accomplishing what was accomplished in
this article. The example here merged all the data into one data frame.
Different situational needs might dictate keeping the data separated by
participant (i.e. workbook or file) or separated by condition (i.e.
sheet); in those instances it may be preferable to import the data
structures to multiple list objects or multiple data frames. That is
another benefit of using R, the flexibility it affords the analyst in
deciding what to do and how to do it. An R script file with the same
information as contained in this tutorial is available
here. Lastly, for those interested in seeing how the example
data was created in R, and how it was exported from R into Excel.xls
files; please take a look at
this script which was used. An Adobe.pdf version of this
article can be found here.
Footnote1:
The phrase is believed to have originated with respected statistician
and prominent R user Frank Harrell of Vanderbilt University at the 5th
annual Bayesian Biostatistics Conference.
References / Resources
Burns,
P. (2013). Spreadsheet Addiction. Available at: http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/
Mirai Solutions GmbH
[package maintainer: Martin Studer]. (2013). Package XLConnect.
http://cran.r-project.org/web/packages/XLConnect/index.html
Urbanek, S. (2013).
Package rJava.
http://cran.r-project.org/web/packages/rJava/index.html
A note on the
tutorials of subsequent Modules of this website.
In future tutorial notes,
we will be using R console and script files; but remember all scripts
can be copied and pasted into the R Console. The script files can also
be downloaded and then opened with the R Console or in R Commander
using ‘File’, ‘Open script file…’ in the Console or Rcmdr top task bar.
When reading the script
files, you'll notice the common convention of using # to start a
comment line (which is not working code), while lines without # are
working code.