## Monday, March 27, 2017

### Fitting Bayesian Linear Mixed Models for continuous and binary data using Stan: A quick tutorial

I want to give a quick tutorial on fitting linear mixed models (hierarchical models) with a full variance-covariance matrix for the random effects (what Barr et al. 2013 call a maximal model) using Stan.

For a longer version of this tutorial, see: Sorensen, Hohenstein, Vasishth, 2016.

Prerequisites: You need to have R installed; RStudio is optional but convenient. You also need to have rstan installed; see here. I am also assuming you have fit lmer models like this one before:
lmer(log(rt) ~ 1+RCType+dist+int+(1+RCType+dist+int|subj) + (1+RCType+dist+int|item), dat)

If you don't know what the above code means, first read chapter 4 of my lecture notes.

## The code and data format needed to fit LMMs in Stan

### The data

I assume you have a 2x2 repeated measures design with a continuous dependent measure like reading time (rt), and that you want main-effects and interaction contrast coding. Let's say your main effects are RCType and dist, and the interaction is coded as int. All these contrasts are coded as $\pm 1$. If you don't know what contrast coding is, see these notes and read section 4.3 (although it's best to read the whole chapter). I am using an excerpt of an example data-set from Husain et al. 2014.
"subj" "item" "rt" "RCType" "dist" "int"
1       14    438  -1        -1      1
1       16    531   1        -1     -1
1       15    422   1         1      1
1       18   1000  -1        -1      1
...

Assume that these data are stored in R as a data-frame with name rDat.
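If you want to follow along with your own data, here is a minimal sketch of how the data frame might be set up. The file name is hypothetical, and I assume the contrast columns already exist in the file:

```r
# hypothetical file name; replace with the path to your own data
rDat <- read.table("husain_etal_excerpt.txt", header = TRUE)
# subj and item must be factors for nlevels()/as.integer() below to work
rDat$subj <- factor(rDat$subj)
rDat$item <- factor(rDat$item)
# sanity check: int should be the product of the two main-effect contrasts
stopifnot(all(rDat$int == rDat$RCType * rDat$dist))
```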

### The Stan code

Copy the following Stan code into a text file and save it as the file matrixModel.stan. For continuous data like reading times or EEG, you never need to touch this file again. You will only ever specify the design matrix X and the structure of the data. The rest is all taken care of.
data {
int N;               //no trials
int P;               //no fixefs
int J;               //no subjects
int n_u;             //no subj ranefs
int K;               //no items
int n_w;             //no item ranefs
int subj[N]; //subject indicator
int item[N]; //item indicator
row_vector[P] X[N];           //fixef design matrix
row_vector[n_u] Z_u[N];       //subj ranef design matrix
row_vector[n_w] Z_w[N];       //item ranef design matrix
vector[N] rt;        //reading time (dependent measure)
}

parameters {
vector[P] beta;               //fixef coefs
cholesky_factor_corr[n_u] L_u;  //cholesky factor of subj ranef corr matrix
cholesky_factor_corr[n_w] L_w;  //cholesky factor of item ranef corr matrix
vector<lower=0>[n_u] sigma_u; //subj ranef std
vector<lower=0>[n_w] sigma_w; //item ranef std
real<lower=0> sigma_e;        //residual std
vector[n_u] z_u[J];           //spherical subj ranef
vector[n_w] z_w[K];           //spherical item ranef
}

transformed parameters {
vector[n_u] u[J];             //subj ranefs
vector[n_w] w[K];             //item ranefs
{
matrix[n_u,n_u] Sigma_u;    //subj ranef cov matrix
matrix[n_w,n_w] Sigma_w;    //item ranef cov matrix
Sigma_u = diag_pre_multiply(sigma_u,L_u);
Sigma_w = diag_pre_multiply(sigma_w,L_w);
for(j in 1:J)
u[j] = Sigma_u * z_u[j];
for(k in 1:K)
w[k] = Sigma_w * z_w[k];
}
}

model {
//priors
L_u ~ lkj_corr_cholesky(2.0);
L_w ~ lkj_corr_cholesky(2.0);
for (j in 1:J)
z_u[j] ~ normal(0,1);
for (k in 1:K)
z_w[k] ~ normal(0,1);
//likelihood
for (i in 1:N)
rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);
}


### Define the design matrix

Since we want to test the two main effects and their interaction, coded as the columns RCType, dist, and int, our design matrix will look like this:
# Make design matrix
X <- unname(model.matrix(~ 1 + RCType + dist + int, rDat))
attr(X, "assign") <- NULL
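As an optional sanity check, you can confirm that X has one row per trial and four columns, in the order intercept, RCType, dist, int:

```r
dim(X)   # rows: number of trials; columns: 4 (intercept, RCType, dist, int)
head(X)
```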


#### Prepare data for Stan

Stan expects the data as a list, not as a data frame (unlike lmer). So we set it up as follows:
# Make Stan data
stanDat <- list(N = nrow(X),
P = ncol(X),
n_u = ncol(X),
n_w = ncol(X),
X = X,
Z_u = X,
Z_w = X,
J = nlevels(rDat$subj), K = nlevels(rDat$item),
rt = rDat$rt, subj = as.integer(rDat$subj),
item = as.integer(rDat$item))

### Load library rstan and fit the Stan model
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# Fit the model
matrixFit <- stan(file = "matrixModel.stan", data = stanDat,
                  iter = 2000, chains = 4)

#### Examine posteriors
print(matrixFit)

This print output is overly verbose. I wrote a simple function to get the essential information quickly:
stan_results <- function(m, params){
  m_extr <- extract(m, pars = params)
  means <- lapply(m_extr, mean)
  quantiles <- lapply(m_extr,
                      function(x) quantile(x, probs = c(0.025, 0.975)))
  means <- data.frame(means)
  quants <- data.frame(quantiles)
  summry <- t(rbind(means, quants))
  colnames(summry) <- c("mean", "lower", "upper")
  summry
}

For example, if I want to see only the posteriors of the four beta parameters, I can write:
stan_results(matrixFit, params = c("beta[1]", "beta[2]", "beta[3]", "beta[4]"))

For more details, such as interpreting the results and computing things like Bayes factors, see Nicenboim and Vasishth 2016.

#### FAQ: What if I don't want to fit a lognormal?

In the Stan code above, I assume a lognormal likelihood for the reading times:

rt[i] ~ lognormal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

If this upsets you deeply and you want to use a normal distribution (and in fact, for EEG data this makes sense), go right ahead and change the lognormal to normal:

rt[i] ~ normal(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]], sigma_e);

#### FAQ: What if my dependent measure is binary (0,1) responses?

Use this Stan code instead of the one shown above. Here, I assume that you have a column called response in the data, with values 0 or 1: these are the trial-level binary responses.
data {
int N;                  //no trials
int P;                  //no fixefs
int J;                  //no subjects
int n_u;                //no subj ranefs
int K;                  //no items
int n_w;                //no item ranefs
int subj[N];            //subject indicator
int item[N];            //item indicator
row_vector[P] X[N];     //fixef design matrix
row_vector[n_u] Z_u[N]; //subj ranef design matrix
row_vector[n_w] Z_w[N]; //item ranef design matrix
int response[N];        //binary response
}

parameters {
vector[P] beta;                 //fixef coefs
cholesky_factor_corr[n_u] L_u;  //cholesky factor of subj ranef corr matrix
cholesky_factor_corr[n_w] L_w;  //cholesky factor of item ranef corr matrix
vector<lower=0>[n_u] sigma_u;   //subj ranef std
vector<lower=0>[n_w] sigma_w;   //item ranef std
vector[n_u] z_u[J];             //spherical subj ranef
vector[n_w] z_w[K];             //spherical item ranef
}

transformed parameters {
vector[n_u] u[J];             //subj ranefs
vector[n_w] w[K];             //item ranefs
{
matrix[n_u,n_u] Sigma_u;    //subj ranef cov matrix
matrix[n_w,n_w] Sigma_w;    //item ranef cov matrix
Sigma_u = diag_pre_multiply(sigma_u,L_u);
Sigma_w = diag_pre_multiply(sigma_w,L_w);
for(j in 1:J)
u[j] = Sigma_u * z_u[j];
for(k in 1:K)
w[k] = Sigma_w * z_w[k];
}
}

model {
//priors
beta ~ cauchy(0,2.5);
sigma_u ~ cauchy(0,2.5);
sigma_w ~ cauchy(0,2.5);
L_u ~ lkj_corr_cholesky(2.0);
L_w ~ lkj_corr_cholesky(2.0);
for (j in 1:J)
z_u[j] ~ normal(0,1);
for (k in 1:K)
z_w[k] ~ normal(0,1);
//likelihood
for (i in 1:N)
response[i] ~ bernoulli_logit(X[i] * beta + Z_u[i] * u[subj[i]] + Z_w[i] * w[item[i]]);
}

#### For reproducible example code

See here.

## Tuesday, November 22, 2016

### Statistics textbooks written by non-statisticians: Generally a Bad Idea

A methodologist from psychology called Russell Warne writes on Twitter:

It is of course correct that you can usually increase power by increasing sample size. But a lot of the other stuff in this paragraph is wrong or misleading.
If this is an introductory statistics textbook for psychologists, it will cause a lot of harm: a whole new generation of psychologists will emerge with an incorrect understanding of the frequentist approach to inference. Here are some comments on his text:

1. "When a study has low statistical power, it raises the possibility that any rejection of the null hypothesis is just a fluke, i.e., a Type I error": A fluke rejection of a null hypothesis, isn't that the definition of Type I error? So, low power raises the possibility that a rejection is a Type I error? There is so much wrong here. First of all, Type I error is associated with hypothetical replications of the experiment. It is a statement about the long-run repetitions of the procedure, not about the specific experiment you did. You cannot talk of a particular result being a "Type I error" or not. Second, the above sentence says that if power is low, you could end up with an incorrect rejection; the implication is that if power is high, you are unlikely to end up with an incorrect rejection! What the author should have said is that when power is low, by definition the probability of correctly detecting the effect is low. Period. Furthermore, the much more alarming consequence of low power is Type S and M errors (see my next point below). I'm surprised that psychologists haven't picked this up yet.

2. When power is low, "...the study should most likely not have been able to reject the null hypothesis at all. So, when it does reject the null hypothesis, it does not seem like a reliable result": I think that one word that should be banned in psych* is "reliable"; it gives people the illusion that they found out something that is true. It is never going to be the case that you can say with 100% certainty that you found out the truth. If reliable means "true, reflecting reality correctly", you will *never* know that you have a reliable result.
The trouble with using words like reliable is that when people read a sentence like the one above, they try to construct its meaning by considering the converse situation, when power is high. The implication is that when power is high, the rejection of the null is "reliable". I have lost count of how many times I have heard psych* people telling me that a result is "reliable", implying that they found something that is true of nature. Even when power is high, you still have a Type I error rate of whatever your $\alpha$ is. So any individual result you get could be an incorrect rejection; it doesn't matter what you think the power is.

A further important point is: how do you *know* what power you have? Due to Type S and M errors, you are most likely doing your calculation based on previous, underpowered studies. You are therefore going to be getting gross overestimates of power anyway. Power is a function, and typically you will have a lot of uncertainty associated with your estimate of the plausible values of power under different assumptions (after all, you don't *know* what the true effect is, right? If you knew already, why would you be doing the study?). Giving a student the false security of saying "oh, I have high power, so my result is reliable" is pretty irresponsible and is part of the reason why we keep messing up again and again and again.

## Tuesday, August 02, 2016

### Two papers, with code: Statistical Methods for Linguistic Research (Parts 1 and 2)

Here are two papers that may be useful for researchers in psychology, linguistics, and cognitive science:

Shravan Vasishth and Bruno Nicenboim. Statistical methods for linguistic research: Foundational Ideas - Part I. Language and Linguistics Compass, 2016. In Press.

Bruno Nicenboim and Shravan Vasishth. Statistical methods for linguistic research: Foundational Ideas - Part II. Language and Linguistics Compass, 2016. In Press.
## Wednesday, April 27, 2016

### A simple proof that the p-value distribution is uniform when the null hypothesis is true

[Scroll to graphic below if math doesn't render for you]

Thanks to Mark Andrews for correcting some crucial typos (I hope I got it right this time!). Thanks also to Andrew Gelman for pointing out that the proof below holds only when the null hypothesis is a point null $H_0: \mu = 0$, and the dependent measure is continuous, such as reading time in milliseconds, or EEG responses.

Someone asked this question in my linear modeling class: why is it that the p-value has a uniform distribution when the null hypothesis is true? The proof is remarkably simple (and is called the probability integral transform).

First, notice that when a random variable Z comes from a $Uniform(0,1)$ distribution, then the probability that $Z$ is less than (or equal to) some value $z$ is exactly $z$: $P(Z\leq z)=z$.

Next, we prove the following proposition:

Proposition: If a random variable $Z=F(T)$, where $F$ is the (continuous, strictly increasing) CDF of the random variable $T$, then $Z \sim Uniform(0,1)$.

Note here that the p-value is a random variable, call it $Z$. The p-value is computed by calculating the probability of seeing a t-statistic or something more extreme under the null hypothesis. The t-statistic comes from a random variable $T$ that is a transformation of the random variable $\bar{X}$: $T=(\bar{X}-\mu)/(\sigma/\sqrt{n})$. This random variable $T$ has a CDF $F$. So, if we can prove the above proposition, we have shown that the p-value's distribution under the null hypothesis is $Uniform(0,1)$.

Proof: Let $Z=F(T)$. Then

$P(Z\leq z) = P(F(T)\leq z) = P(F^{-1}(F(T)) \leq F^{-1}(z)) = P(T \leq F^{-1}(z)) = F(F^{-1}(z)) = z.$

Since $P(Z\leq z)=z$ for all $z$ in $(0,1)$, it follows that $Z \sim Uniform(0,1)$, which completes the proof.
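The proposition can also be checked by simulation: if we repeatedly run a t-test on data generated under a point null, the resulting p-values should look uniform. A minimal sketch in R (the sample size and number of simulations are arbitrary choices):

```r
set.seed(1)
nsim <- 10000
# simulate nsim one-sample t-tests on data generated under H0: mu = 0
pvals <- replicate(nsim, t.test(rnorm(30, mean = 0, sd = 1))$p.value)
# under the null, the histogram should be approximately flat on (0,1)
hist(pvals, freq = FALSE, main = "p-values under the null", xlab = "p")
```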

## Saturday, December 19, 2015

### Best statistics-related comment ever from a reviewer

This is the most interesting comment I have ever received from CUNY conference reviewing. It nicely illustrates the state of our understanding of statistical theory in psycholinguistics:

"I had no idea how many subjects each study used. Were just one or two people
used? ... Generally, I wasn't given enough data to determine my confidence in the provided t-values (which depends on the degrees of freedom involved)."