E-Z-R: an introduction to R for typologists
This guide is meant for people who would like to use R for statistics and graphics but don't want to spend hours on learning the R programming language. (I wrote this guide because I had a frustrating, time-consuming experience learning R.) There are many other resources and introductions to R, all available from www.r-project.org, and they may suit your needs better. Note though that the official introductory texts are difficult for beginners with no experience in command lines or programming. However, there are two textbooks on linguistic statistical analysis in R coming out soon, an introductory one by Keith Johnson at UC Berkeley, and an advanced one by Harald Baayen at the MPI for Psycholinguistics in Nijmegen.
I use R mainly on typological linguistic data, and this biases the kinds of tests and graphs mentioned. For some help with DOBES corpora, see CPDP Corpus help.
Since I use R on a Mac (OS X) I sometimes refer to the specifics of how R runs on a Mac, but the choice of platform only matters for menu commands such as 'change directory' and the way packages are installed.
Getting started
- Launch X11.app (in Utilities) and keep it running in the background, then launch R.app
- Load at least the following packages because they are useful for typologists (and are required by many of the commands explained in this document): "MASS", "pscl", "coin", "car", "gplots", "gmodels", "gdata", "gtools", "catspec", "vcd", "spatstat", "reshape", "cluster", "survival", "epitools", "exactLoglinTest", "Design", "FactoMineR", "languageR", "beanplot", "trotter". You can install them one-by-one using the Package Installer and then load the packages you need with the Package Manager. Or simply install all packages, especially if you work on a laptop and might not have constant internet access.
- Load additional source files you need, e.g. from my own collection
source('http://www.uni-leipzig.de/~bickel/scree.plot.r') # loads a convenience function for producing scree plots described below
source('http://www.uni-leipzig.de/~autotyp/gsample3.r') # loads the g-sampling algorithm described in Bickel 2008.
Hint: create a little *.r file where you list all source files and packages you usually need, e.g. "mystuff.r":
library(vcd)
source('http://www.uni-leipzig.de/~autotyp/gsample3.r')
Save this as a plain text file with 'UTF-8, no BOM' encoding and unix (LF) linebreaks.
Then, when you need the stuff, use cmd-E ('source document') or type:
source('path/to/my/mystuff.r') # the file can also be on the internet, for easy access from different computers; then, the path is a URL.
Another hint: create this applescript:
tell application "R"
activate
cmd "source('path/to/my/mystuff.r')"
end tell
Then your stuff is at your fingertips; and with Spark you can even create a handy shortcut for this...
Note: if you are addicted to window clicking and menu browsing, load the package "Rcmdr", which gives you a graphical user interface. The good thing about it is that when you select a command you also see how you would write it if you were to use plain R -- a good method for weaning off window clicking and menu browsing!
What is there in R
- the console -- here you type commands and get results, or error messages ;) A useful aspect of the console is that you can use the UP-arrow to browse through all your previous commands (or look at the command history sidebar). This is standard UNIX but a revelation to Mac users.
- the object = workspace browser -- this lists all objects that you loaded (from an R data file) or created (e.g. by importing new data, cf. below); it also contains functions you built, models you created, etc. -- about anything. It is good to repeatedly save your workspace into datafiles (using "save" in the menu "workspace", with suffixes *.rda or *.RData). When you start R anew, it is useful to first clear your workspace and load the one that you want to work with, using the 'load' command in the menu "workspace". (Otherwise you get a cluttered space because R -- at least on the Mac -- tends to remember your objects even when you say 'don't save workspace image' when quitting R.) You can also save individual objects for later re-use by the following command:
save(object_name, file="file_name.rda") # e.g.
save(my_model, file="my_model.rda") # or
save(my_data, file="my_data.rda")
Such a file can then be loaded again into R by simply double-clicking it.
- the editor -- to edit plain text documents that contain source code, individual commands or a plain text version of what you are looking at right now (simply save the web page to your disk and open it in R!). Use 'New',
'Open', 'Save' from the menu 'File'; you can also double-click r-documents. (You might have to tell your Mac that *.r is for R and not for Xcode, though. For this, use 'get info' and specify that all *.r files should be read by R.) I personally prefer an external plain-text editor (e.g. the free TextWrangler) to the built-in one. (You can set this preference in R's preference panel.) If you do this as well, you may want to enrich TextWrangler by R syntax coloring (add the file http://www.smalltime.com/gene/R.plist to ~/Library/Application Support/TextWrangler/Language Modules/) and by an AppleScript for automatic code execution from within TextWrangler:
set selectedText to selection as string
tell application "R"
activate
cmd selectedText
end tell
And in TextWrangler under Window:Palettes:Scripts you can add a key combination (e.g. cmd-return) that executes this script. Then, when you select a piece of code and hit your key combination, the code is executed directly in R.
- help -- on any command, use ?commandname to get help, e.g. ?chisq.test; for help on a topic, use
help.search("my topic") # e.g. help.search("linear models")
Much more help is available at http://www.r-project.org/search.html. This is also the place where you can find additional functions or packages.
Useful work habits in R
It is good practice to formulate all commands and all analyses in an R document of your own (using R.app's editor or TextWrangler, cf. above) and run them from there rather than in the console. If you do this you can save the R document, add comments on why you did what you did (e.g. why you recoded some variables, why you chose a particular way of setting up models, etc.) -- best preceded by "#" so that they are ignored by R -- and then you can always check what you did with the data in five or ten years. This is THE big advantage of command-line over menu-driven stats software. (I once spent several days developing various analyses for some data in SPSS and saved the outputs -- but when I wanted to work on similar issues six years later, I had no idea anymore how precisely I arrived at the results I got, i.e. how exactly I defined subset conditions, how I entered terms into a model, etc.) If you save your analysis while doing it, it is always easy to track down what you did; in addition, sometimes you find clever ways of doing something, and if they are saved to a text document, you will be able to retrieve such a solution and adapt it to the new problem at hand.
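A minimal sketch of what such a self-documenting script might look like (the file and variable names are hypothetical):
# alignment_study.r -- case alignment by area, March 2009
df = read.csv("alignment.csv", na.strings="") # raw export from the database
df = df[!is.na(df$area),] # drop languages not coded for area
levels(df$alignment)[levels(df$alignment)=="tripartite"] = "ergative" # merged 'tripartite' into 'ergative' because of low counts
chisq.test(table(df$area, df$alignment)) # main association test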
Examples used in this document
In the following, 'dv' and 'iv' are place-holders for the names you gave to your dependent and independent variables, 'var' stands for any variable name. 'df' is a place-holder for the name of your dataframe (dataset). The symbol '#' is used after commands in R to indicate that the following is not part of the command: e.g. my_command(var) # blabla means that 'blabla' is a comment and not part of the command.
Data import
-- Data with variables in columns and cases in rows (spreadsheet)
- get your data (from your database or wherever) into a csv
(comma-separated value) format with headers. (Hint for FileMakerPro users: the export format 'merge' is the only one that saves headers and produces a clean csv structure, but you need to replace the extension .mer by .csv.) Make sure the file is
encoded as UTF-8 (unicode). (If it is not, you might run into problems
when the file contains, e.g., language names with special characters.)
- change your work directory (cmd-D, or menu item 'Misc') to where
your csv file is. Then:
df = read.csv("filename.csv", na.string="") # The NA string specification is important; without it, empty cells might show up as space character values.
Note: if your database has a field (column) with character strings (e.g. language names, or some typological feature, e.g. "ergative"), R converts this into factors when using read.csv(). Often this is good, because you will indeed use this field as a factor; but sometimes it is not good, e.g. for language names, which you may want to use for sorting purposes etc. If you don't want R to convert character strings into factors, set the "stringsAsFactors" option of read.csv() to "FALSE":
df = read.csv("filename.csv", na.string="", stringsAsFactors=FALSE)
-- Ready-made contingency tables:
- The table should be in a plain text file, columns separated by tabs, e.g.
TAB TAB area1 TAB area2
typeA TAB 85 TAB 63
typeB TAB 31 TAB 88
- Change your work directory (cmd-D, or menu item 'Misc') to where your table file is located
- Then:
mytable = read.table("filename.txt")
-- Type up a contingency table directly in R:
mytable = matrix(c(1,2, 11,12), nrow = 2, ncol=2, dimnames = list(c("row1", "row2"), c("column1", "column2"))) # "row1" etc. are the row names; "column1" etc. the column names; nrow= number of rows, ncol = number of columns.
Add variable names:
names(dimnames(mytable))=c("row_variable", "column_variable")
Sometimes it helps to expand a contingency table into a dataframe so that it can be analyzed using the same formula methods (cf. below) as any other dataframe. There are many ways of doing this, but by far the most convenient one is provided by the epitools package:
mydf=expand.table(mytable)
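For instance, expanding the 2x2 area table typed up above (a sketch; expand.table() is from the epitools package):
library(epitools)
mytable = matrix(c(85,31, 63,88), nrow=2, ncol=2, dimnames=list(c("typeA", "typeB"), c("area1", "area2")))
names(dimnames(mytable)) = c("type", "area")
mydf = expand.table(mytable) # one row per datapoint: 85+31+63+88 = 267 rows
table(mydf$type, mydf$area) # cross-tabulating recovers the original counts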
Data export
write.table(my_stuff,file="mytable.txt", sep="\t", row.names=F) # "\t" creates a tab-separated table
where my_stuff can, e.g., be a contingency table or an ANOVA table, produced by summary(anova(lm(dv~iv1*iv2,df))) (see below)
A convenient alternative is:
HTML(my_stuff, file="output.html") # HTML() is from {R2HTML}; then open output.html in Word or whatever text processor you use.
Or better: via the Mac OS X clipboard:
write(my_vector, pipe('pbcopy')) # for simple stuff (single elements or vectors)
write.table(df, pipe('pbcopy'), sep='\t', row.names=F, quote=F) # for tables, lists and dataframes
Then just select 'paste' in the application in which you want to insert the stuff (e.g. paste it into a table field in Keynote, Pages or Numbers, and you're done!)
And for LaTeX (unless you Sweave anyway):
latex(df, file='', booktabs=T) # then copy & paste into your *.tex document; booktabs=T requires \usepackage{booktabs} for nice tables in tex
Data manipulation
-- View, edit or create a dataframe in spreadsheet form
invisible(edit(df)) # invisible() suppresses console output; the dataframe opens in a separate spreadsheet-like window.
Create a new dataframe:
df = invisible(edit(data.frame()))
Edit a dataframe:
df = invisible(edit(df)) # or assign it to a new name:
df2 = invisible(edit(df))
In the spreadsheet window, you can edit entries.
-- Inspect the structure of a dataframe (names, kinds of entries)
str(df) # more informative and faster than invisible(edit(df))!
df[1:10,] # show the first ten lines
Easier:
head(df)
df[1:10,1:5] # show the first ten lines coded for the first five variables.
-- Sorting data:
df$var1[order(df$var1)] # sorting var1. Same as
sort(df$var1) # but order() is more flexible than sort() because you can also do things like
df[order(df$var1),] # order the entire df based on var1, or
df[order(df$var1, df$var2),] # by var 1 and within that, by var2, or
df$var1[order(df$var2)] # sort var1 by values of var2 and display the result,
or use this in order to order factor levels for nice displays e.g. inside plot()
factor(iv, levels=unique(iv[order(dv)])) # orders factor "iv" according to values in dv
Use decreasing=T inside order() for decreasing ordering; or a minus before a variable:
df[order(df$var1, -df$var2),] # order by var1 in ascending, and then, within that by var2 in descending order.
Note: if var2 is a factor, "-df$var2" does not work. Use "-rank(df$var2)" instead:
df[order(df$var1, -rank(df$var2)),]
A convenience function for displaying a sorted dataframe in spreadsheet form can be found here (source the code or copy-paste it into the console first).
-- Extract levels (e.g. the names of language families in a table of languages)
levels(df$var) # but this may include levels with empty cells. To get rid of them, use:
levels(df$var[,drop=T]) # or
unique(df$var)
length(unique(df$var)) # gives you the number of levels (technically, the length of the vector)
-- Recoding
The following are the operators used for stating conditions: < 'smaller than', <= 'smaller/equal', > 'bigger than', >= 'bigger/equal', != 'not equal to'; c1 & c2 is the intersection ("and"), c1 | c2 is the union ("or"), and !c1 is the negation of condition c1. Always use quotes for the values (e.g. var1 != "Europe", or var1 != "1"). Use these operators in the function replace():
df$var1=replace(factor(df$var1, levels=c(...)), df$var2=="language1", "new_value") # assigns "new_value" to var1 for each entry that is "language1" in var2 of df.
df$var1=replace(factor(df$var1, levels=c(...)), df$var1=="old_value", "new_value") # replaces "old_value" in var1 by "new_value"
NOTE: replace() does not allow assignment of values that are not first introduced as levels. This is why one needs to specify levels=c(...): inside c(), write each level that you want in the end, e.g. levels=c("old_value2", "old_value3", "new_value") (assuming "old_value1" was the one that was replaced by "new_value"); it may be easier to first define the new factor by
df$var_new=factor(NA, levels=c(....)) # order of level names will define contrasts
and then do the replacements of the NA with the values needed.
df$var_new=replace(df$var_new, df$var1=="value1", "A") # write "A" into the new variable wherever var1 has value "value1"
Shortcut if you simply want to replace the name of a level in a factor:
levels(df$var) # will show you the levels as characters, e.g.
> [1] "level1" "level2"
Check which one you want to replace. If it's "level2", type:
levels(df$var)[2] = "my_new_label"
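A worked example of the recode-via-NA pattern sketched above (variable and value names are hypothetical):
df$aligncat = factor(NA, levels=c("acc", "erg", "other")) # new variable, all NA at first
df$aligncat = replace(df$aligncat, df$alignment %in% "accusative", "acc")
df$aligncat = replace(df$aligncat, df$alignment %in% "ergative", "erg")
df$aligncat = replace(df$aligncat, is.na(df$aligncat), "other") # everything else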
-- Rescaling and transformation
of a vector into a [0...1] range, or into ranks, or through sd division etc. can be conveniently done with:
rescaler(df$var, type="range")
from library(reshape)
Other transformations:
log(df$dv + 0.5) # log for counts, including a continuity correction for zero counts
boxcox() # Box-Cox transformations, from {MASS}; see ?boxcox
-- Parametrization (factor coding) for linear models
Factors (i.e. variables stored as factors or coerced into factors by as.factor(var)) inside linear models are automatically parametrized. The method of parametrization can be chosen prior to setting up a model with the command
options(contrasts=c("MY.UNORDERED.FACTOR", "MY.ORDERED.FACTOR")) # where as MY.UNORDERED.FACTOR you can indicate what kind of parametrization you want for unorderded factors.
The possibilities are:
- sum-to-zero parametrization, a.k.a. contrast parametrization or contrast coding or contrast vectors, and typically used for orthogonal (uncorrelated) pairwise comparisons in ANOVA; α (the intercept) gives the grand mean of the dv:
options(contrasts=c("contr.sum", "contr.poly")) # here, with ordered factors parametrized polynomially
- dummy parametrization, with the alphanumerically first level set to 0 (baseline), as typically used in multiple regression analysis, where α then gives the value of the dv when all coefficients β[1]...β[k] are zero.
options(contrasts=c("contr.treatment", "contr.poly")) # with the first level set as baseline (i.e. as 0). This is the default!
options(contrasts=c("contr.SAS", "contr.poly")) # with the last level set as baseline (i.e. as 0)
In order to set a chosen level as the baseline, use:
df$var_new = relevel(df$var, ref="my_baseline_level")
A handy alternative to this is by simply re-ordering the levels:
df$var_new = factor(df$var, levels=c("my_baseline_level", "my_second_level", "my_third_level"))
User-defined contrasts:
contrasts(df$var)=matrix(c(0,1,0,0,0,1), nrow=3, ncol=2) # same as the default: change the 0 and 1 distribution as long as it makes sense...;-)
- Helmert parametrization, which contrasts the second level with the first, the third with the average of the first two, and so on (as typically used for ordered factors):
options(contrasts=c("contr.helmert", "contr.poly"))
The type of parametrization can be made visible by:
my_model$contrasts
The actual coding matrix can be inspected by using the command:
contrasts(df$var) # and the real coding, language by language:
invisible(edit(model.matrix(~var, df))) # shows it in a spreadsheet browser
or, if you set up your own matrix:
contrasts(my_model_matrix)
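To see what the schemes do on a concrete factor, compare the coding matrices directly (a sketch with a hypothetical three-level factor):
f = factor(c("acc", "erg", "neut"))
contrasts(f) # default contr.treatment: 'acc' (alphanumerically first) is the baseline
contrasts(f) = contr.sum(3) # switch this one factor to sum-to-zero coding
contrasts(f) # compare the two coding matrices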
-- Transposition
Transpose (rows become columns, columns become rows):
data=as.data.frame(t(as.matrix(data.frame(data))))
-- Renaming and adding variables
df$var_new = df$var_old # adds var_new with same values as var_old
-- Selecting cases = creating subsets
dfsubset = subset(df, condition) # puts the subset into a new dataframe called "dfsubset" (choose your own name)
A more compact alternative, especially useful when used inside other commands is:
df[df$condition,] # for a df;
df$var1[df$condition] # for a variable within a df, e.g.:
hist(df$var1[df$var2=="x"]) # a histogram of var1 for all languages where var2 has value "x"
Be careful with variables containing NAs!: NAs are always matched, e.g.
df[df$var=='a',] # matches not only rows with 'a' but also those with 'NA' in df$var. To avoid this, use:
df[df$var=='a' & !is.na(df$var),] # or
subset(df, var=='a') # or
df[df$var %in% 'a',] # this is the most general solution, as it also covers the case when you want
multiple types:
df[df$var1 %in% c("A", "B"), ] # df with all rows with var1 either A or B
df$var1[df$var1 %in% c("A", "B")] # var1 with values either A or B.
Draw random subsets:
df[sample(rownames(df), 10),] # where '10' is the size of the random samples. See ?sample
Subsetting by substrings, using grep:
df[grep('regex', df$var),] # all rows in which var has a string matching regex. For multiple conditions use grepl() (introduced in May 2009, with R version 2.9):
df[grepl('regex1', df$var1) & grepl('regex2', df$var2),] # or combine this with logical conditions, e.g.
df[grepl('regex1', df$var1) & df$var2 %in% 'value2', ] # find all records/rows that contain regex1 in var1 and match value2 in var2
df[, grep('regex', colnames(df))] # all columns whose names contain 'regex'
NOTE: especially when subsetting for plots and tables, it happens that factor levels are displayed even if they are outside of the subset. This results from the fact that factor levels are stored independently of the actual values in a vector and therefore appear even if they contain no value. The problem can be solved by specifying my_factor[,drop=T], e.g.
plot(dv~iv[,drop=T], subset(df, iv %in% c("language1", "language2")))
In order to remove all unused ('orphaned') levels of all factors in a dataframe, use:
drop.levels(df) # from package {gdata}
As a condition, use the same conventions as for recoding (cf. above), e.g. var!=1, which means 'exclude all rows with value 1 in var'; or var=="Europe", which means 'select only languages coded as "Europe" in var'. Also note that 'a==b' is true also if b is NA; to avoid this, either embed the condition in which() or use %in% instead of '=='. (Also, don't confuse "==" with "="! In R, "=" normally means 'assign to' (equivalent: "<-"), not 'has value'. And any() means '∃', i.e. 'at least one'.)
In many functions, an alternative method is to write:
plot(...., df, subset=var %in% "value")
Matrix subsetting is a bit more complex as matrices have a special structure which is very different from that of dataframes, but the following will return the part of the matrix that satisfies a specific condition, e.g. x<5
do.call(rbind, apply(my_matrix, 1, function(x) x[x<5])) # effectively, apply(...1...) returns the rows that satisfy the condition and the "do.call" to rbind binds them together and spits them out.
Subsetting by condition on multiple or all columns
df[apply(df, 1, function(x){any(x %in% 'value')}), ] # select all rows where at least one column has 'value'
df$code <- character(nrow(df)) # set up a new character vector
df$code[apply(df, 1, function(x) any(x %in% 'value'))] <- 'value' # assigns 'value' to 'code' in those rows that contain 'value' somewhere
-- How to deal with missing values (blanks, NA)
Throwing out NAs is not needed for DVs and contingency tables, but it is sometimes needed for IVs with string values (e.g. if a language is not coded for a geographical factor level like "Europe"); if needed, exclude by using
df$var = df$var[!is.na(df$var)] # for an individual variable
df = df[!is.na(df$var),] # removing rows from df which have NA in var
df = subset(df, !is.na(var)) # same
Removing rows where all or any variables are NA:
df = df[-which(apply(df, 1, function(x) all(is.na(x)))),]
df = df[-which(apply(df, 1, function(x) any(is.na(x)))),]
Removing columns where all or any variables are NA:
df = df[,-which(apply(df, 2, function(x) all(is.na(x))))]
df = df[,-which(apply(df, 2, function(x) any(is.na(x))))]
Radical:
df_without_NA = na.omit(df) # removes all rows with NAs in any column
If anything goes wrong here, a typical reason is that something went wrong when reading in the data. A quick fix is to add na.strings="" to the read.csv() (or whatever) command.
--- Aggregating data per language
R has a very convenient way of dealing with 'within-language' databases where languages have multiple records (e.g. alignment types) and we want to get aggregate measures per language (e.g. the modal alignment type):
x_by_lg=by(df$x, df$language, function(x) length(unique(x))) # number of unique x-values in each language ('by' language); use any function you want for summarizing data, e.g. for getting the number of languages that have value "a" in df$x:
function(x) sum(x=='a') # note how sum() can be used to count the number of cases in which x=='a' evaluates to TRUE! (Convenient, once you know it!)
The result is a numerical vector with names. Subsetting by:
x_by_lg[names(x_by_lg)=='Belhare'] # selecting the value of a given language
x_by_lg[x_by_lg==3] # selecting a given value
Omitting NAs:
x_lg=na.omit(as.integer(x_by_lg))
Summarizing counts:
table(x_by_lg)
barplot(table(x_by_lg))
Equivalent of by():
x_by_lg=tapply(df$x, df$language, function(x) length(unique(x)))
The same can be achieved by aggregate(), but this creates a dataframe and is therefore especially useful when the aggregation is not defined by a single factor (e.g. df$language), but by the interaction of two or more factors, e.g. by stock*area:
x_by_lg=aggregate(df$x, list(stock=df$stock, area=df$area), function(x) length(unique(x))) # the command "list(stock=df$stock)" makes sure that the grouping variable will have a good name in the result; just writing "list(df$stock)" will name the variable "Group.1".
Total sum of a vector:
sum(df$var, na.rm=T)
With multiple grouping arguments, or 'how to do this all by hand': split the dataframe into list elements per group, process each list element, and tie the results back together:
sapply(split(df, list(df$var1, df$var2)), function(x) {...some function operating on, e.g., x$var1 and x$var2...}) # outputs the results of the function per level combination of var1 and var2
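For instance, the modal (most frequent) value per language mentioned at the beginning of this section can be sketched as:
modal_x_by_lg = tapply(df$x, df$language, function(x) names(which.max(table(x)))) # most frequent x-value per language; which.max() picks the first maximum in case of ties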
--- Change type of specific columns (vectors) in dataframes
df[, mycolumns] <- lapply(df[, mycolumns], function(x) as.factor(as.character(x))) # where 'mycolumns' is the index of the columns, e.g. 33:56 or c('A', 'D', 'XX') etc.
--- Advanced stuff:
See the packages gsubfn and doBy
Categorical data
-- Contingency tables
ftable(iv~dv, df) # simple table
or:
table(df$iv, df$dv) # note the reversed sequence of dv (columns) and iv (rows).
or:
with(df, table(iv, dv)) # with() specifies the df and can be used with all other commands as well.
addmargins(table(df$iv, df$dv)) # with margin totals
Get the expected values:
chisq.test(ftable(iv~dv, df))$expected
Get cell proportions:
prop.table(ftable(iv~dv,df)) # use the margin argument for individual per-row (margin=1) or per-column (margin=2) proportions
prop.table(table(df$var)) # proportion of values in var
Get counts on how many datapoints there are in a given area coded in variable "var" (e.g. how many languages in a given area). This is done by one-dimensional tables, using table()
table(subset(df, condition)$var) # for condition, see on subsetting above.
or:
table(df$var[df$ ...]) # where ... is the condition, e.g.
table(df$vp[df$area=="Eurasia"]) # Frequencies of VP order in Eurasia
Alternatives:
structable() # with useful subsetting facilities, e.g.
my_table=structable(dv~iv, df)
my_table[2,3] # show cell of 2nd dv level and 3rd iv level
xtabs(~dv+iv,df) # useful for n-way tables:
xtabs(~dv+iv1+iv2,df) # etc.
Combining data from two variables: suppose we have a binary variable 'agr' for A and one for O, and we want to combine them into a single table to compare proportions of agr for A vs. for O:
rbind(table(df$agrA), table(df$agrO))
which can then be input to spineplot() for a nice display
-- Grouping data
Create a frequency table with a count variable "Freq" for use in loglinear analysis, on which see below:
mytable = as.data.frame(xtabs(~iv1+iv2+iv3, df))
Create a success/total frequency table, as input to, e.g., glm() or elrm():
temp1=as.data.frame(xtabs(~iv1+iv2,df, subset=dv=='my_success_level', drop.unused.levels=T))
temp2=as.data.frame(xtabs(~iv1+iv2,df)) # total=success + failure
my_freqtab=cbind(temp1, temp2$Freq)
my_freqtab$success=as.integer(my_freqtab$Freq)
my_freqtab$n=as.integer(my_freqtab$'temp2$Freq')
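The resulting columns can then be fed into glm() (a sketch; cf. the GLM section below for the cbind(success, failure) interface):
glm(cbind(success, n-success)~iv1+iv2, my_freqtab, family=binomial())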
-- Association and goodness of fit tests
fisher.test(ftable(dv~iv, df)) # available (within reasonable limits) also for multinomial variables, using Mehta & Patel's (1986) algorithm
chisq.test(ftable(dv~iv, df)) # based on the Chi-Square Distribution; for randomization-based p estimates, use:
chisq.test(ftable(dv~iv, df), simulate.p.value=T, B=10000) # with 10000 randomizations
or, from library(coin):
chisq_test(ftable(dv~iv, df), distribution='approximate') # and many more randomization tests in that library.
An alternative to this is the trotter routine described in Janssen, Bickel & Zúñiga 2006:
trotter.chisq.wrapper(dv~iv, df, B= 10000)
Add a graphical output:
trotter.chisq.wrapper(dv~iv, df, B= 10000, plot=T)
Chi-square test with given probabilities:
chisq.test(table(df$var), p=c(.4,.6)) # calculates deviation from an expected 40%-60% distribution
chisq.test(table(df$var), p=c(.4,.6))$statistic # extracts the resulting chi-square value (with $p you extract the p value etc; see ?chisq.test)
Association statistics:
assocstats(ftable(dv~iv, df)) # gives the Pearson coefficient φ, the Contingency Coefficient C, and Cramér's V
Matched pairs/n-tuples (e.g. variables measured in pairs/sets of languages):
McNemar's test:
The first argument must be a contingency table where cells [i,i] and [j,j] represent agreement and [i,j], [j,i] disagreement, e.g.
         hm    no hm
hm      146       74
no hm    12       59
If the data are in two columns of a df tied to LIDs, the table can be easily generated by
table(df$column1, df$column2)
If the result doesn't put the agreement counts into the right corner, relevel() is your friend (e.g. table(df$column1, relevel(df$column2, ref="hm"))). The resulting table can then be fed into
mcnemar.test() # from base {stats}
Randomization version:
mh_test(mytable, distribution='approximate')
-- Reliability tests
Following Janssen, Bickel & Zúñiga 2006 and available in the trotter package:
plot(trotter.fisher.landscape(dv~iv, df)) # use radius= to specify the number of cells covered (default is 4).
-- Generalized linear models (GLM)
Note: multinomial variables are automatically parametrized into dummy variables if they are factors or coerced into factors by as.factor(). See above
Logistic Regression:
my_model = glm(dv~iv1*iv2, df, family=binomial()) # assuming ungrouped (one row=one datapoint) data, and all variables declared as factors. Use '*' for interactions, '+' for an additive model, i.e. removing interactions:
my_model = glm(dv~iv1+iv2, df, family=binomial())
Baseline multinomial regression:
multinom(dv~iv1*iv2, df)
If the dv is parametrized using the default setting ('treatment'), the alphanumerically first level will be the baseline against which the odds are computed for each other level of the dv. Another level can be chosen as the baseline via contrasts() or factor(..., levels=c(...)). See the section on parametrization above.
If the data is in grouped form, i.e. in a frequency table format with a column for successes and one for totals, use
my_model = glm(cbind(success, total-success)~iv1+iv2, df, family=binomial())
View coefficients:
summary(my_model, cor=T)
Alternative, with more information:
my_model = lrm(dv~iv1*iv2, df, x=T, y=T)
see the {Design} package for more about lrm()
Stepwise AIC-based model selection:
stepAIC(saturated_model) # The overall output is a table with all models, ranked by AIC; "-" means "if you drop this term, you get the following AIC"; "none" means "if you don't drop anything, you get the following AIC".
Analysis of deviance / Likelihood Ratio tests
anova(my_smaller_model1, my_bigger_model2, test="Chisq") # testing whether models differ significantly, using the χ2-criterion.
Test against null model:
anova(my_model, test="Chisq") # performs a sequence of tests, e.g. if the model is dv ~ iv1 + iv2 + iv1:iv2, we first get iv1 against null, then iv2 against null plus iv1, then the full model (with interaction) against the additive model. The disadvantage is that we will miss the test of iv2 against null. To get it, we need to re-order the model first. Or, better, we do separate LR tests, each comparing a smaller with the next bigger model, from the outset. Then we can test exactly what we want.
anova(modelsteps, test="Chisq") # summary anova table of selected model from step()
NOTE: anova(glm(...), test="Chisq") produces sequential (Type I) analysis of deviance tables, while anova(lrm()) from the {Design} package produces partial (Type III) analysis of deviance tables. See the notes on factorial analysis below for the difference. Also note that anova(lrm()) does not need the specification test="Chisq", while anova(glm()) does. If you want to perform sequential or pairwise model testing on lrm models, don't use the anova(lrm()) function, but instead
lrtest(my_smaller_model1, my_bigger_model2)
Exact (binary) conditional logistic regression for sparse datasets, using Markov Chain Monte Carlo randomization: Package {elrm}. The data need to be in a success/total format
Loglinear analysis (again same model selection methods as above):
Count regression, where each observation has as dependent variable a count (e.g. number of consonants) -- any positive-only integer, to be regressed on some predictors:
glm(count.dv~iv1*iv2, df, family='poisson')
For counts in contingency tables, first extract the counts by:
my_table = as.data.frame(xtabs(~var1+var2+var3, df))
then
glm(Freq~iv1*iv2, my_table, family='poisson')
or
loglm(Freq~iv1*iv2, my_table) # loglm(), from {MASS}, needs no family specification
Exact loglinear analysis, for sparse datasets, using randomization: package {exactLoglinTest}
mcexact(Freq ~ iv1*iv2, my_table)
Residuals Diagnostics:
plot(my_model) # so simple!!
or plotting the deviance residuals against the predictors, checking the homogenous residuals assumption:
plot(residuals(my_model)~predict(my_model, type='link')) # should increase under the Poisson assumption; stable under the binomial assumption; use type='response' for plotting against the fitted values
Plotting the response residuals, checking the homogenous variance assumption:
plot(residuals(my_model, type='response')~predict(my_model, type='link'))
plot(residuals(my_model, type='pearson')~predict(my_model, type='link'))
Overdispersion: sometimes data show more variance than the Poisson or binomial distributions allow (i.e. non-constant response probabilities), perhaps as the result of unmodeled dependence between datapoints (a situation common in typology) or of nonlinearity in some predictors. For such situations (often detectable through deviances much larger than the degrees of freedom), use quasi* in the family specification:
glm(...., family='quasipoisson') # or 'quasibinomial'
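A rough rule-of-thumb check (a sketch): divide the residual deviance by its degrees of freedom; values much larger than 1 suggest overdispersion:
deviance(my_model)/df.residual(my_model) # close to 1: fine; much larger: consider the quasi* families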
For more help on doing GLMs in R, see Laura Thompson's R manual to accompany Agresti's Categorical Data Analysis
-- Plots, graphs, and charts
Scaled bar chart, i.e. the dv gets scaled to 100% on the y-axis and the iv gets columns with a width proportional to the data (highly recommended!)
plot(as.factor(dv)~as.factor(iv),df, col=c("red","grey")) # If variables are factors (check this using str(df)), as.factor() is not needed. Alternatives:
with(df, plot(iv, dv)) # or
plot(df$iv, df$dv)
Ad-hoc labels:
plot(as.factor(dv)~as.factor(iv), df, xaxlabels=c("Type1", "Type2"), yaxlabels=c("Group1", "Group2"), col=c("red","grey")) # set xaxlabels="" if you want to have no labels.
If you want to revert the order of levels, use reorder.factor(var, decreasing=T):
plot(reorder.factor(dv, decreasing=T)~as.factor(iv),df, col=c("red","grey"))
Reordering of univariate counts:
with(df, plot(reorder(dv, dv, FUN=length))) #where FUN can be any function computing on what the order should be based on.
NOTE: if your table is a matrix (e.g. created by matrix(), as described above), do not use plain plot(), but instead:
spineplot(mytable)
If you want to get scaled 1-dimensional bar charts, the only solution I know of is to create a dummy factor filled with a constant, i.e.:
df$dummy <- factor("blabla")
with(df, plot(dummy, dv, ylab="", xlab="", xaxlabels=""))
Association and mosaic plots with information on residuals
assoc(dv~iv1+iv2, df) # association plot, with bars proportional to Pearson residuals (i.e. standardized χ2 contributions) in each cell
mosaic(~var1+var2+var3, df, gp=shading_max) # with shading indicating the Pearson residuals: blue=positive, red=negative; grey if not individually significant (cut-off based on significance levels derived from the fact that Pearson residuals approximate a normal distribution)
cotabplot(~ var1 + var2 | var3, df, panel = cotab_coindep, n=5000, type = "assoc") # one panel for each level of var3, association plots, significance tested on 5000 permutations
For customizing and on how to highlight or edit specific parts of a plot, see the vcd vignette
Classical non-scaled bar charts:
barplot(ftable(dv~iv, df), beside=F, col=c("red", "grey"), axisnames=T, names.arg=c("IV1", "IV2")) # plots stacked bars; use "beside=T" if you want to see them next to each other
Plot only a subset (e.g. only one area):
barplot(ftable(iv~dv, df, exclude="name_of_area"), col=c("grey", "red"), legend=c("label1", "label2"))
Pie charts
pie(ftable(iv~dv, df), col=c("red", "grey", "blue")) # colors are listed counterclockwise; see the section on coloring options below for more
Create two pie charts for two areas "area 1" and "area 2" defined in a variable named "iv":
pie(ftable(dv~iv, subset(df, iv=="area 1"))) # first pie
pie(ftable(dv~iv, subset(df, iv=="area 2"))) # second pie
NB: pie() requires numeric input (coerce with as.numeric() if needed)!
To remove ticks and labels, use labels="" within the brackets of pie(), e.g.
pie(ftable(dv~iv, subset(df, iv=="area 1")), labels="")
Scalar (interval and rank) data
-- Summaries and quantiles
summary(df$var) # produces mean, median and quartiles
quantile(df$var, c(.1,.25,.5,.75,.9)) # gives the quantiles at the lowest 10%, 25%, 50%, 75%, 90%, or any other set of your choice
The mode can be computed via table() and max():
xtab = table(df$var)
modal.value = names(which(xtab == max(xtab)))
A convenience function is here
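A minimal sketch of such a convenience function, wrapping the idiom above:
modal.value = function(x) {xtab = table(x); names(which(xtab == max(xtab)))} # returns all modes if there are ties
modal.value(df$var)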
-- Regression and correlation
summary(lm(var1~var2, df)) # linear regression
spearman_test(var1~var2, df, distribution="approximate") # Spearman correlation coefficient (non-parametric), from {coin}
-- t-test, ANOVA and its alternatives
t.test(dv~iv,df)
anova(lm(dv~iv1*iv2*iv3..., df)) # with all interactions! (Use "iv1+iv2" instead of "iv1*iv2" for a linear model without interaction, i.e. if you are interested in the contribution of each iv but not in the contribution of their interaction.)
summary(lm(dv~iv1*iv2*iv3..., df)) # outputs the coefficients, as in a multiple regression approach
Factorial analysis: Note that anova() computes sequential sums of squares (a.k.a. Type I), where β(X)≠0 is tested in a model including only those factors that were listed earlier in the formula (iv1 before iv2, etc.). For partial (a.k.a. adjusted or Type III) sums of squares, where we test whether β(X)≠0 given that all other factors and their interactions are in the model, you can use:
Anova(lm(dv~iv..., df), type="III") # with capital "A"; see ?Anova for more information. And note especially, that under Type III testing, contrast coding (on which see above) must sum to zero -- which is not the default in R, and so you should use:
options(contrasts=c("contr.sum", "contr.poly"))
Also note that Type III testing presupposes balanced designs (equal number of datapoints per cell.)
When using sequential SS computation, which does not presuppose balanced designs, one normally follows the same principles of model selection as in any other linear model approach, including regression (which is one reason sequential SSs are advocated in R): with sequential SSs, only the last term in the model is tested in relation to a model that includes all other terms. So you first test whether the highest interactions (e.g. the three 2-way interactions in a 3-way design) are significant, by testing a model with the first 2-way interaction as the last term, then with the second 2-way interaction as the last term, then with the third as the last term. If they are not significant, this suggests they don't belong to the best-fitting model, and you then test models where each of the main effects is the last term in turn. This sounds cumbersome, but it lets you concentrate on precise hypotheses, stay clear of hypotheses which are not theoretically motivated (in which case you simply leave out that particular test), and avoid the danger of interpreting a single-factor effect when there is an interaction (something that Type III tables invite). There is also an easy shortcut (actually, there are many shortcuts, see ?drop1, ?step etc.), e.g.
full=lm(dv~iv1+iv2+iv1:iv2, df)
small=lm(dv~iv1+iv2, df)
anova(full, small) # testing only the contribution of the iv1:iv2 interaction (cf. the syntax used for GLMs above). This is equivalent to reading the last row in the output from:
anova(full) # which is the same as anova(lm(dv~iv1*iv2, df))
More examples:
anova(lm(dv~iv1+iv2, df)) # testing iv2
anova(lm(dv~iv2+iv1, df)) # testing iv1
Results for iv1 and iv2 of these two tests are the same ones summarized by:
Anova(lm(dv~iv1+iv2, df), type="III") # with Type III, order does not matter.
Covariance: if an iv is numeric, you will automatically get an ANCOVA!
Exploring interactions:
interaction.plot(df$iv1, df$iv2, df$dv, xlab="iv1", ylab="dv (iv2)", trace.label="Trace Factor")
Much more informative with lattice plots, e.g.
densityplot(~dv | iv1, df, groups=iv2, auto.key=T, strip=strip.custom(bg="lightgrey"), scales=list(alternating=1, tck=c(1,0)), layout=c(3,1), plot.points=F, ylab="Title of y-axis", xlab="Title of x-axis", par.settings=list(superpose.line=list(col=c("blue", "red"), lwd=1.2)))
Planned contrasts: first, create a contrast matrix (example with 4 levels in the relevant IV):
my_contrast_matrix = rbind("Group One vs. All others" = c(-3,1,1,1), "Ignore Group One and Contrast Groups 2 vs. 3-4 together" = c(0,-2,1,1))
Then:
my.aov = aov(dv ~ iv, df) # and now:
fit.contrast(my.aov, "iv", my_contrast_matrix) # name the iv in quotes! (fit.contrast() is from {gmodels})
See above for more on parametrization (factor coding)
Nonparametric and distribution-free alternatives:
kruskal.test(dv~iv, df) # Kruskal-Wallis Test (two or more iv levels)
wilcox.test(dv~iv, df) # Mann-Whitney-Wilcoxon Test (two iv levels)
Randomization-based ANOVA (without interactions so far) from the trotter routine described in Janssen, Bickel & Zúñiga 2006:
trotter.aov.wrapper(dv~iv1+iv2..., df, plot=T)
NOTE: the current version of trotter does not allow interactions directly in the model, but a preliminary shortcut is to add an interaction of interest by introducing a new factor that combines the levels of the interacting factors (i.e. a new factor with 4 levels, combining the levels of two binary factors). This can be done easily by:
df$interaction_var = interaction(df$iv1, df$iv2) # combines iv1 and iv2
The trotter test is based on sequential SSs, so you should follow the same model selection procedure as with linear models in general; cf. above on factorial analysis.
-- Reliability tests
Following Janssen, Bickel & Zúñiga 2006 and available in the trotter package:
plot(trotter.aov.landscape(dv~iv, df)) # use radius= to specify the number of cells covered (default is 4).
-- Homogeneity test:
fligner.test() # Fligner-Killeen Test (said to be more robust than Levene's Test, which is the default offered by SPSS)
-- Plots, graphs, and charts
Histogram:
hist(df$var) # add colors, e.g. 7 levels of heat:
hist(df$var, col=heat.colors(7)) # see below for more on coloring.
Smoothed histogram, i.e. density plots:
plot(density(df$var)) # everything in this can be customized, as in the following example:
plot(density(df$var1[df$var2=="language1"]), lwd=2.5, main="", xlab="Name of var1", ylab="Probability density", cex.lab=1.2, lty=4, xaxp=c(0,1,10), xlim=c(0,1), ylim=c(0,5)) # subset for the values of language 1, make lines thicker (lwd), remove main title (main), add x-axis title (xlab), adjust font size (cex), line type (lty), make sure there is a tick every 0.1 interval (xaxp), predefine extent of x-axis (otherwise only that part is printed which has observations) (xlim), adjust height of graph (ylim) (for this, get the maximum value on the y axis from:
density(df$var1[df$var2=="language1"]) # prints a nice summary
polygon(density(df$var1[df$var2=="language1"]), col="lightgrey") # fills the curves!
abline(v=mean(df$var1[df$var2=="language1"]), col="red") # adds a vertical line showing the mean:
rug(df$var1[df$var2=="language1"]) # adds small ticks for each observation to the x-axis (looking like a rug)
Add a legend:
legend("topleft", bty="n", lty=4, lwd=c2.5, legend="language1", cex=1.2)
More on this, below
Boxplot:
boxplot(dv~iv, df, col="red") # or c("red", "blue") etc. if you want each box with a different color (first red, then blue, then red again etc. -- specify as many colors as you want and note that if you have less colors than groups, they will be recycled.)
Add useful options like:
- Sorting alphanumerically by iv level/group is the default (boxes follow the order of the factor levels).
- Sorting by median:
bymedian <- sort(with(df, tapply(dv, iv, median))) # and then:
boxplot(dv ~ factor(iv, levels = names(bymedian)), df)
- Adding custom group names:
boxplot(dv~iv, df, names=c("Group1 (N=23)", "Group2 (N=45)", "Group3 (N=32)"))
- Adding a custom y-axis label:
boxplot(dv~iv, df, ylab="label")
- Removing tick marks:
boxplot(dv~iv, df, par(tck=FALSE))
- Want to change the extent of the y-axis, or do any more fanciful adjustments? -- use bxp(boxplot(...)) instead of plain boxplot(), e.g.
bxp(boxplot(dv~iv, df, plot=F), ylim=c(0,1)) # the y-axis now extends strictly from 0 to 1, regardless of how much of this range is actually filled by datapoints
NOTE: see below for adjusting font sizes
Violinplot:
library(beanplot)
beanplot(dv~iv, df, what=c(0,1,1,0)) # 'what' is a yes/no vector for (in this order): the total average line, the beans, the bean averages, and the beanlines
sorted by means:
bymean <- sort(with(df, tapply(dv, iv, mean)))
beanplot(dv ~ factor(iv, levels = names(bymean)), df)
Add point estimates with means etc.:
means <- tapply(df$dv, df$iv, mean)
points(sort(means), pch=19)
Scatterplot:
plot(var1~var2, df, col="red", pch=19, cex=5, xlab=c("var1", cex=5),ylab=c("var2"), font=2) # where pch= 19 means solid, 1 means empty circle, and cex = size of dot
See ?plot for more on this.
Ordering languages by value of dv:
plot(dv~factor(language, levels=unique(language[order(dv)])), df, par(las=2, mar=c(12,4,4,4)), xlab="") # orders languages according to their values on the variable "dv" and plots them in such a way that language names appear perpendicular (las=2) to the x-axis, with enough space for them (mar=c(12,...)).
Add a fitted line:
abline(lm(var1~var2,df), lwd=2) # where lwd specifies the line width.
Distance/dissimilarity analysis
-- Calculate distance matrices
Make sure you have enough pairs of rows with non-empty cells
df2=subset(df, var1!="" & var2!="") # and so on for all var on which dissimilarities will be measured
my_distance_matrix = daisy(df2[,c("var1","var2")]) # again, add all var which are used.
Or:
my_distance_matrix = dist(df2[,c("var1","var2")]) # see ?dist and ?daisy for differences and options, especially about how to choose distance calculation methods
For useful plots, add the names of each object by:
names(my_distance_matrix) = df2$my_objects # where my_objects are useful identifiers of the objects whose similarities are measured by var1 ... var2
NOTE: Both dist() and daisy() require at least one column/variable in which all pairs of rows/objects/taxa have non-NA values. See above for methods of removing NA-rows or NA-colums.
For dissimilarity computation on categorical variables, see ?daisy for helpful information, but especially note the following:
For multinomial variables, daisy() computes dissimilarity using Gower's coefficient, which is the sum of different values divided by the total number of (non-empty) variables. (This is essentially the same as the Hamming distance relative to the number of variables.) For binary variables, the Gower method can be enforced by setting the evaluation type as "symmetrical":
daisy(..., type=list(symm=c("var1", "var2")))
If only 1-1 should increase similarity; and 0-0 should be ignored by the calculation (i.e. treated like missing values), set the evaluation type as "asymmetrical", which is equivalent to Jaccard coefficients:
daisy(..., type=list(asymm=c("var1", "var2")))
For presence/absence variables (e.g. has tone vs lacks tone), asymmetrical (Jaccard) evaluation is sometimes preferred; for others (e.g. OV vs VO), symmetrical (Gower) evaluation is more appropriate.
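A toy illustration of the difference, with hypothetical presence/absence data for three languages:
library(cluster)
toy = data.frame(tone=c(1,0,0), clicks=c(1,0,1), row.names=c("lg1", "lg2", "lg3"))
daisy(toy, type=list(symm=c("tone", "clicks"))) # Gower: the shared absence of tone makes lg2 and lg3 more similar (distance .5)
daisy(toy, type=list(asymm=c("tone", "clicks"))) # Jaccard: the 0-0 pair is ignored, so lg2 and lg3 are maximally distant (distance 1)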
Note that nominal variables need to have at least two levels defined (even if they happen to contain only one value). Enforce this for all variables in a dataframe by:
df = as.data.frame(lapply(df, function(x) factor(x, levels=c('level1','level2'))))
-- Display and explore distances
boxplot(my_distance_matrix) # one way of checking for outliers (if there are many, you may want to calculate distances using the Manhattan method)
plot(im(as.matrix(my_distance_matrix))) # displays distances/correlations by a heat color gradient; im() is from {spatstat}
-- Cluster analysis:
plot(agnes(my_distance_matrix, method="average"), which.plots=2, main="main title", axes=T, sub = c("Clustering based on average dissimilarities"), xlab="", ylab="")
Adding the Agglomerative Coefficient:
plot(agnes(my_distance_matrix, method="average"), which.plots=2, main="main title", axes=T, sub = paste("Clustering based on average dissimilarities; ", "Agglomerative Coefficient = ",round(agnes(my_distance_matrix, method="average")$ac, digits=2)), xlab="", ylab="")
Terminology:
single linkage = nearest neighbor
complete linkage = furthest neighbor
average = UPGMA
weighted = WPGMA
-- Multidimensional scaling:
my_mds=isoMDS(my_distance_matrix) # produces non-metrical (i.e. rank-based) scaling; add "k=3" if 3-dim is wanted; default is k=2
or:
my_mds=cmdscale(my_distance_matrix, eig=T) # produces metrical scaling (i.e. based on actual numerical values)
Other tools: nmds {ecodist}, metaMDS {vegan}
NB: isoMDS() does not allow off-diagonal zero distances. A way out is to add a small constant: 1. find the smallest distance:
min(my_distance_matrix[which(my_distance_matrix>0)])
2. replace zeros by a constant between zero and the smallest distance, e.g.:
my_distance_matrix[which(my_distance_matrix==0)] <- 0.01
Also, isoMDS() does not allow NA distances. Here is a way to remove them without destroying the structure of the object:
my_distance_matrix <- do.call(rbind, apply(as.matrix(my_distance_matrix), 1, function(x) x[!is.na(x)]))
Plotting:
plot(my_mds$points, pch="") # chose invisible points so that labels appear nicely after the following labeling command:
text(my_mds$points[,1], my_mds$points[,2], rownames(my_mds$points))
with output from cmdscale(), use labels(my_mds) instead of rownames(); also, in case cmdscale was used without "eig=T", there is no "points" attribute and the coordinates need to be addressed directly:
plot(my_mds, pch="")
text(my_mds[,1],my_mds[,2], labels(my_mds))
for moving text a bit off from the surrounding box, use the following inside plot():
xlim=1.2*range(my_mds$points[,1]), ylim=1.2*range(my_mds$points[,2])
plotting with colored symbols and a legend, first define good colors for each level of the grouping variable you want to use, e.g.:
df$my_colors=df$my_objects # e.g. language families, or areas, or structural types.
levels(df$my_colors) = rainbow(length(levels(df$my_objects))) # rainbow() for exploiting the full spectrum; could also use a handmade vector of colors c('red', 'blue'...), equal length as the levels of the grouping variable ('my_objects')
Then:
plot(my_mds$points, pch=19, col=as.character(df$my_colors))
legend("bottomright", legend=sort(unique(df$my_objects)), col=as.character(sort(unique(df$my_colors))), pch=19, ncol=2) # 2-column listing.
For other dimensions, e.g. dimension 1 against dimension 4, simply use ...[,1] and ...[,4]
Plotting all pairs of dimensions, e.g. with 5 dimensions:
splom(my_mds$points[,1:5], pch=19, xlab='', sub='my_title', varnames=c('Dimension 1', 'Dimension 2', 'Dimension 3', 'Dimension 4', 'Dimension 5'), col=as.character(df$my_colors), key=list(text=list(as.character(sort(unique(df$my_objects)))), points=list(pch=19,col=as.character(sort(unique(df$my_colors)))),columns=4)) # see lattice graphics for details.
Here, columns is for arranging the items in the legend into 4 columns. For coloring see above.
Static 3D plotting:
cloud(my_mds$points[,1]~my_mds$points[,2]*my_mds$points[,3], pch=19, col=as.character(df$my_colors), xlab='', ylab='', zlab='', sub='', key=list(text=list(as.character(sort(unique(df$my_objects)))), points=list(pch=19, col=as.character(sort(unique(df$my_colors)))))) # same legend ('key') details as above for splom(); for more, see the lattice graphics notes.
Dynamic 3D plotting, allowing interactive rotation and zooming can be done with library(rgl) or library(ggobi):
my_mds=isoMDS(my_distance_matrix, k=3)
plot3d(my_mds$points[,1], my_mds$points[,2], my_mds$points[,3], pch="", xlab="", ylab="", zlab="", axes=F)
texts3d(my_mds$points[,1], my_mds$points[,2], my_mds$points[,3], substr(df$my_objects, 1,6), size=.8, color=as.character(df$my_colors))
or:
texts3d(my_mds$points[,1],my_mds$points[,2],my_mds$points[,3], rownames(my_mds$points))
or using ggobi instead:
ggobi(my_mds)
If you want to code colors in R rather than in ggobi:
ggobiparam <-ggobi_get()$my_mds
glyph_colour(ggobiparam) <- df$my_colors
Goodness of fit:
cmdscale gives the eigenvalues ($eig) of each dimension and a goodness of fit measure ($GOF) of the k-dimensional solution, based on dividing the sum of the k chosen eigenvalues by the sum of all eigenvalues (also known as "P"). The higher the better. Kruskal stress (also known as "Φ"), which compares the sum of squared deviations of observed distances from the reproduced distances, can be calculated by:
sum((my_distance_matrix-dist(my_mds$points))^2)/sum(my_distance_matrix^2)
There are various variants of this, including transformations of dist(my_mds$points) or weighting methods. A popular one is built into isoMDS, where stress can be directly called by my_mds$stress. A simple scree plot can be generated by plotting stress values by number of dimensions (k), e.g.
plot(c(1,2,3,...,k), c(stress1D, stress2D, stress3D, ..., stresskD)) # schematic: plot the stress of each k-dimensional solution; or use the scree.plot() convenience function sourced at the beginning of this guide.
-- Export distance matrices for use in SplitsTree4
write.table(as.matrix(my_distance_matrix), file="my_distance_matrix.tab", sep="\t", quote=F)
Then replace the first line by one containing exactly 5 spaces and a number specifying the number of taxa (objects) in your matrix; and remove all special characters from the row names, including " ", "-", "&", etc.
-- (Multiple) Correspondence Analysis:
Make sure that that levels are uniquely identified across variables by labeling them uniquely, e.g. in form of a concatenation of the level name with the variable name:
df$var=factor(df$var, labels=paste(levels(df$var), "var"))
Then, use MCA() from the package "FactoMineR" for multiple and CA() for simple correspondence analysis:
my_ca=CA(df[,c("var1","var2")])
my_mca=MCA(df[,c("var1","var2", "var3"....)])
plot(my_mca)
The relative inertia (λ) of dimension k -- i.e. percentage π of the total χ2/N of the table that is covered by k -- can be extracted by
my_ca$eig[k,2]
Maps
Given a dataframe with coordinates and a typological variable, the first step is to prepare symbols, e.g. various shapes or colors according to the levels/values of the variable:
df$my_symbols = numeric(length(df$my_variable)) # define new vector
df$my_symbols[df$my_variable=='level1'] <- 19 # circle for level 1
df$my_symbols[df$my_variable=='level2'] <- 23 # diamond for level 2
df$my_symbols[df$my_variable=='level3'] <- 24 # triangle for level 3
See ?points for the numerical codes of each shape
Then after:
library(maps); library(mapdata); library(mapproj) # library() loads one package at a time
plot the map, e.g.:
map('worldHires', interior=F, xlim=c(-10,180), ylim=c(3,80), projection='mollweide')
xlim and ylim define the borders of the mapping area; for the names of available maps apart from 'worldHires' (Atlantic-centered) and 'world2Hires' (Pacific-centered), see the index of package 'maps'. For the choice of projections, see ?mapproject
and add the symbols
points(mapproject(df$longitude, df$latitude), pch=df$my_symbols, bg='black',cex=2)
NEW: mapproject() no longer seems to accept two vectors!? -- Use this instead:
mapproject(list(x=df$longitude, y=df$latitude), ....)
and/or language names:
text(mapproject(df$longitude, df$latitude), pos=3, offset=.8, df$language, cex=.6, font=2, col='black') # adds language names above (pos=3) each symbol, offset by 80% of a character width; font=2 makes the names bold and cex=.6 makes them small.
or plot symbols of choice:
symbols(mapproject(df$longitude, df$latitude), rectangles=cbind(rep(1, length(df$my_variable)), df$my_frequency_variable), bg='blue') # plots 'my frequencies' as bars whose lengths are the given frequencies.
A quick code for coloring, e.g of families, is:
df$familycolors=df$families
levels(df$familycolors) <- rainbow(length(levels(df$familycolors))) # or sample(rainbow....), or rev(sample(rainbow...))
points(mapproject(df$longitude, df$latitude), pch=21, bg=as.character(df$familycolors)) # 'as.character' is important!
Add a coordinate grid:
map.grid(nx=20,ny=20,cex=.5,col='black',lty=3) # a 20x20 dotted (lty=3) grid, labeled black
Transforming AUTOTYP colors
df$color=paste('#', as.character(df$AUTOTYP_COLOR), sep='')
but before doing this, make sure there are no empty (colorless) rows, because the previous command will assign them a color specification '#', which will throw an error.
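One way to guard against this (a sketch):
df = df[!is.na(df$AUTOTYP_COLOR) & df$AUTOTYP_COLOR != '',] # keep only rows that have a color specification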
Note: The way color symbols are placed on top of each other is determined by the order in the dataframe. Consider
relevel()
order()
if it doesn't look the way it should.
Quick code for daily use with AUTOTYP coordinates: autotyp.map(), e.g. (with 'autotyp.*.g' standing for any geographically and genealogically enriched AUTOTYP database table)
autotyp.map(data = autotyp.*.g, var = 'myvariable', col = "'type1'='red'; 'type2'='blue'; else='white'")
or (just a few languages with the same color)
autotyp.map(data = subset(autotyp.*.g, grepl('Chintang', language) | grepl('Puma', language)), col = 'red')
Exporting data for plotting in WALS' map tool:
First add shape and color variables, e.g.
df$colors='004848' # a hexadecimal RGB color code (here without the leading '#')
df$shapes = 'circle'
Then export with exactly this specification (order of column names matters!):
write.table(df[,c('latitude', 'longitude', 'LID', 'shapes','colors','language', 'LID')], file='df.wals.tab', sep='\t', row.names=F, col.names=F, quote=F)
This can be opened in WALS (Menu: 'Tools')
General plotting issues
-- Creating graphs with transparent background:
par(bg="transparent") # then plot something
NOTE: inserting this into slides (Keynote) only works if you save the Quartz output to a file and then insert this file (by drag&drop or via the 'insert' menu). The transparent background is replaced by a grey background if you copy the Quartz output directly.
NOTE 2: none of this is needed if you plot from within TextMate!
-- Adjusting fonts:
On the Mac, the font family can be chosen via:
par(family='My Font') # e.g. "Marker Felt" or "DejaVu Serif" or "Gentium" etc. NB: saving the quartz output via 'save' won't create a PDF with the chosen font family; use instead
quartz.save('my.file.name.pdf', type='pdf')
More specific settings, including choices like 'italic' or 'bold' are done via par() specifications before or inside a plot() command, e.g.
par(cex=1.5) # 1.5 times bigger than before (the initial size can be specified by quartz(pointsize=12), where 12 is the default size in points)
par(font=2) # 1=plain, 2=bold, 3=italic, 4=bold italic
Without further specification, this changes all text in a graph.
For changing specific parts, use:
par(cex.axis=1.5) # size of the value names
par(cex.lab=1.5) # size of the axis names, and so on:
par(font.axis=2) # bold value names (font takes the integer codes above, not sizes)
par(col.axis="red") # red letters
etc. (see ?par)
Example
plot(dv~iv, df, ylab="name of y-axis",par(cex.lab=1.5, col.lab="darkred"))
If you want to specify something for a specific axis only, use list():
plot(dv~iv, df, ylab=list("name of y-axis",cex=1.5, col="darkred")) # only affects the y-axis title
When labels don't appear, this can either be due to the fact that there is not enough margin space (solution: adjust them as described below) or because of font encoding issues (simple solution: remove diacritics and special characters; complex solution: try another text encoding).
-- Side-by-side display of two graphs:
par(mfrow=c(1,2)) # sets up a 1 row, 2 column display
-- Remove tick marks and annotations:
To remove axis titles or labels, set them to the empty string (e.g. ylab="").
But in some cases it makes more sense to suppress the tick marks and their labels altogether:
plot(..., par(xaxt="n", yaxt="n"))
-- Print on top of each other:
Generate a graph, and then type:
par(new=T) # and generate the next graph, making sure to now set axes=FALSE, and ann=FALSE in order not to print them twice
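A minimal sketch with made-up data; note that both plots need identical xlim and ylim, otherwise the two overlaid coordinate systems won't match:
x1 <- rnorm(100); x2 <- rnorm(100, mean=1)
plot(density(x1), col='red', xlim=c(-4,5), ylim=c(0,.6))
par(new=T)
plot(density(x2), col='blue', xlim=c(-4,5), ylim=c(0,.6), axes=FALSE, ann=FALSE)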
-- Add text, legends, lines, and arrows
Locate the place where you want to add it
z=locator(1) # now click in the graph where you want the text to be placed, and then:
text(z, "my text")
text(z, "my text", font=3) # in italics (1=plain, 2=bold, 3=italics, 4= bold italic, as above, under ?par
For typesetting subscripts, greek letters, mathematical expressions etc., see
?plotmath
An example:
text(z, expression(beta[italic(i)]))
Add a legend:
legend("topright", legend=c("Group1", "Group2"),col=c("red", "blue"), lwd=2) # adds a legend for a red and a blue line in a density chart (use ?legend to find out about other options)
Add a line:
locator(2, type="l") # click beginning and end of the line
Add an arrow:
arrows(.78, 1, .78, 20/(58+20), col="red", lwd=5) # adjust position with the x=.78,y=1 coordinates of the beginning and the x=.78, y=20/(58+20) coordinates at the end, color with col= and weight with lwd
-- Adjust margins (e.g. when labels don’t fit)
I find this easier to do before creating the graph, but the following can also be included inside the brackets of e.g. plot(…):
par(mar=c(5,5,4,4)) # The numbers define the margins at (in this sequence): bottom, left, top, right
Example
plot(dv~iv, df, ylab="name of y-axis",par(mar=c(6,6,4,4)))
For adding space between axis title and axis (e.g. if title contains a fraction), use the following inside plot():
mgp=c(4,3,0) # note the parameter is 'mgp': margin lines for the axis title, tick labels, and axis line; see ?par
-- Coloring options apart from c("red", "grey", "blue") etc.:
colors() # list all color names that R knows.
Colors can also be specified in hexadecimal RGB code, with a preceding "#", and optionally followed by a two-digit specification of the opacity/transparency degree, e.g.:
col="#ff000022" for a translucent red
col="#ff0000ff" for a fully opaque red
View colors and their names online at the Chart of R Colors site. A great offline tool for picking colors on the Mac is the LaTeXColorSelector.
rainbow(n) # n rainbow colors
heat.colors(n) # n heat colors
terrain.colors(n) # n cartographic terrain colors
topo.colors(n) # n topographic colors
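A quick way to preview any of these palettes (sketch):
barplot(rep(1,10), col=rainbow(10), axes=FALSE, border=NA) # ten color swatches side by side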
-- Lattice and grid plotting:
?xyplot # gives an overview of lattice (aka trellis) plotting, which has the distinct advantage that it can arrange several plots in what is called 'panels' -- e.g. the densities or boxplots of some variables across a number of levels of a control variable, e.g.:
densityplot(~ dv | iv, df) # densities in each level of iv in separate panels
densityplot(~ dv | iv1, groups=iv2, df) # densities in each level of iv2 in different colors across all levels of iv1 in separate panels
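A self-contained sketch using the built-in iris data (substitute your own dv and iv):
library(lattice)
densityplot(~ Sepal.Length | Species, iris) # one panel per species
densityplot(~ Sepal.Length, groups=Species, iris, auto.key=T) # one panel, species in different colors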
Categorical data (package "vcd", but based on grid, just like lattice):
doubledecker(samecoding~macrocontinent, identical.or.not.g) # which is roughly the same as
mosaic(~samecoding|macrocontinent, identical.or.not.g, split_vertical=c(T,F)) # where we can fine-tune more things, e.g. boxes=c(T,F) for putting boxes around level names, as in doubledecker
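A self-contained sketch of the same two commands with the built-in Titanic data:
library(vcd)
doubledecker(Survived ~ Class, data=Titanic)
mosaic(~ Survived | Class, data=Titanic, split_vertical=c(T,F))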
Some useful options that go into the brackets of densityplot() and other lattice functions (options marked as belonging to mosaic or doubledecker go into those vcd functions instead); a combined example follows after this list:
- gp=gpar(fill=c('grey','blue','red')) # fill colors in mosaic and doubledecker
- labeling_args=list(set_varnames=c(var='Custom variable name'),
   varnames=c(F,F,T), # whether to print variable names
   labels=c(T,F,F)) # whether to print label names in mosaic
- labeling_args=list(boxes=c(T,F,F), gp_labels=gpar(cex=1, lineheight=3), varnames=F, dep_varname="Custom variable name"), set_labels=list(final_merged=c('custom label 1','custom label 2')) # customizing in doubledecker
- gp_labels=gpar(cex=1.5, lineheight=1.5) # size of label text and height of box (if present)
- gp_varnames=gpar(cex=1.5)
- tl_labels=F # labels on bottom rather than top
- margins=c(1,10,3,1)
- offset_labels=.4
- offset_varnames=c(0,.5,1,0)
- as.table=T # plots panels left-to-right, top-down; =F is the reverse.
- auto.key=T # adds legend for iv2 levels ("groups"). Could instead use:
- auto.key=list(cex=1.2) # size of legend text
- key=list(text=list(as.character(sort(unique(df$var)))), points=list(pch=19, col=as.character(sort(unique(df$my_colors)))), columns=4) # create your own legend, with the levels of df$var keyed to colors. Note that inside key=list(), you add text, points, etc. all as lists, not directly.
- strip=strip.custom(bg="mycolor") # for the background color of titles of each panel
- par.strip.text=list(cex=1.2, lines=2) # text size (cex) and height (lines) of panel title
- scales=list(alternating=1, tck=c(1,0)) # axis only on left and bottom in each panel
- layout=c(2,1) # panels in 2 columns, 1 row
- plot.points=F # no rug-like dots
- ylab="Title of y axis" # adjust color, font etc like in other plots.
- xlab="Title of x axis"
- For changing all font sizes at once use
trellis.par.set(list(fontsize=list(text=14))) # before plotting
- Detailed settings by adding specifications inside the brackets of
par.settings=list(...) # e.g.:
- superpose.line=list(col=c("blue", "red", "black"), lwd=1.8) # colors and line width for three levels of iv2
- par.xlab.text=list(cex=1.2) # xlab font size
- axis.text=list(cex=1.2) # x axis values, e.g. group names
- box.rectangle=list(fill="grey", col="black") # grey-filled box in boxplot, black lines around it
- box.umbrella=list(col="black") # black 'whiskers'
- trellis.par.get() # to see all possibilities
- select one panel and add text to it:
trellis.focus("panel", 2, 1, highlight=F) # chooses the panel in row 1 and column 2; use highlight=T for a red line around the panel
panel.text(0.5, 3.5, "F=.02, n.s.", cex=1, font=3) # adds text at coordinates 0.5/3.5 inside the selected panel.
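To see several of the lattice options working together, here is a sketch with the built-in iris data (all option values are merely illustrative):
library(lattice)
densityplot(~ Sepal.Length | Species, iris,
   layout=c(3,1), as.table=T, plot.points=F,
   xlab='Sepal length', scales=list(alternating=1, tck=c(1,0)),
   strip=strip.custom(bg='lightgrey'), par.strip.text=list(cex=1.2))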
-- Save a graph:
On a Mac at least, simply select the graph (=click on its window) and choose "save" from the "File" menu. This produces a high-quality (Quartz-quality) scalable PDF file. For use in Microsoft Office products, use *.png instead of *.pdf, either by copying the Quartz PDF into Preview and then saving it as PNG, or by:
dev.print(png, filename="/Users/myaccount/Desktop/myfilename.png", width=500, height=500) # then drag the PNG file wherever you want (e.g. a ppt slide)
-- Advanced graph techniques:
see http://addictedtor.free.fr/graphiques
Batch Mode (Processing R non-interactively from the shell)
For running R non-interactively, write a script my.script.R containing your code and also some way of storing results, e.g. with a final line like
save(my.object, file='myresults.rda')
Note that it is essential that the script is self-contained, i.e. if you want to process existing data, the script must load them (and find them on the machine it is being run on). The same goes for packages and other scripts: they need to be loaded via library() or source().
Once all is in place you can run the script as follows:
R CMD BATCH --slave --save my.script.R &
And then collect the resulting myresults.rda file when it's done. There will also be a file "my.script.Rout" which contains a summary of the run (processing time plus whatever you chose to print out as part of the code in the script via cat() or print()).
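A minimal self-contained script sketch (file and object names are hypothetical):
## my.script.R -- loads everything it needs itself
library(MASS)
load('mydata.rda') # hypothetical file containing 'my.data.list'
my.object <- lapply(my.data.list, summary) # placeholder computation
save(my.object, file='myresults.rda')
cat('processed', length(my.object), 'elements\n') # this ends up in the .Rout file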
A very fine thing is to be able to pass arguments to the script from the command line, e.g. when running the same script for different subsets of a large dataset on different processor cores in parallel. Here is how to do this with arguments 'from' and 'to':
1. Inside the script, add the following two lines (which make the arguments accessible inside the script):
args <- commandArgs(TRUE)
for(i in 1:length(args)) {eval(parse(text=args[[i]]))}
and use the arguments, e.g.
my.results <- lapply(my.data.list[from:to], function(l) {... do something with l ...})
save(my.results, file=paste('my.results', from, to, 'rda', sep='.'))
2. Then you can run the script as follows, for, say, elements 1 through 10 in my.data.list:
R CMD BATCH --slave --save '--args from=1 to=10' my.script.R
and you will get a file 'my.results.1.10.rda', etc.
3. Read the results back in and combine them into (e.g.) one big dataframe:
all.results <- do.call(rbind, lapply(dir('path/to/rda/files', pattern='\\.rda$', full.names=T), function(f) {load(f); my.results})) # load() restores the object 'my.results' inside the function; its value is returned and the pieces are rbind-ed together