Chapter 15 Bayes factors

This chapter is based on a longer manuscript available on arXiv: Schad, Nicenboim, Bürkner, Betancourt, et al. (2022b); the terminology used here is based on the conventions used in that paper. A published version of the arXiv article appears in Schad, Nicenboim, Bürkner, Betancourt, et al. (2022a).

Bayesian approaches provide tools for different aspects of data analysis. A key contribution of Bayesian data analysis to cognitive science is that it provides probabilistic ways to quantify the evidence that data offer in support of one model or another. Models provide ways to implement scientific hypotheses; as a consequence, model comparison and hypothesis testing are closely related. Bayesian hypothesis testing comparing any number of hypotheses is implemented using Bayes factors (Rouder, Haaf, and Vandekerckhove 2018; Schönbrodt and Wagenmakers 2018; Wagenmakers et al. 2010; Kass and Raftery 1995; Gronau et al. 2017; Jeffreys 1939), which quantify evidence in favor of one statistical (or computational) model over another. This chapter will focus on Bayes factors as the way to compare models and to obtain evidence for (general) hypotheses.

There are subtleties associated with Bayes factors that are not widely appreciated. For example, the outcomes of Bayes factor analyses are highly sensitive to and crucially dependent on prior assumptions about model parameters (we will illustrate this below), which can vary between experiments/research problems and even differ subjectively between different researchers. Various researchers use or recommend so-called default prior distributions, where the prior parameters are fixed, and are independent of the scientific problem in question (Hammerly, Staub, and Dillon 2019; Navarro 2015). However, default priors can result in an overly simplistic perspective on Bayesian hypothesis testing, and can be misleading. For this reason, even though leading experts in the use of Bayes factors, such as Rouder et al. (2009), often provide default priors for computing Bayes factors, they also make it clear that: "simply put, principled inference is a thoughtful process which cannot be performed by rigid adherence to defaults" (Rouder et al. 2009, 235). However, this observation does not seem to have had much impact on how Bayes factors are used in fields like psychology and psycholinguistics; the use of default priors when computing Bayes factors seems to remain widespread.

Given the strong influence of priors on Bayes factors, defining priors becomes a central issue when employing Bayes factors. The priors determine which models will be compared.

In this chapter, we demonstrate how Bayes factors should be used in practical settings in cognitive science. In doing so, we demonstrate the power of this tool and some major pitfalls that researchers should be aware of.

15.1 Hypothesis testing using the Bayes factor

15.1.1 Marginal likelihood

Bayes' rule can be written with reference to a specific statistical model \(\mathcal{M}_1\).

\[\begin{equation} p(\boldsymbol{\Theta} \mid \boldsymbol{y}, \mathcal{M}_1) = \frac{p(\boldsymbol{y} \mid \boldsymbol{\Theta}, \mathcal{M}_1) p(\boldsymbol{\Theta} \mid \mathcal{M}_1)}{p(\boldsymbol{y} \mid \mathcal{M}_1)} \end{equation}\]

Here, \(\boldsymbol{y}\) refers to the data and \(\boldsymbol{\Theta}\) is a vector of parameters; for example, this vector could include the intercept, slope, and variance component in a linear regression model.

The term \(p(\boldsymbol{y} \mid \mathcal{M}_1)\) is the marginal likelihood, and is a single number that gives us the likelihood of the observed data \(\boldsymbol{y}\) given the model \(\mathcal{M}_1\) (and only in the discrete case, it gives us the probability of the observed data \(\boldsymbol{y}\) given the model; see section 1.7). As in general it's not a probability, it should be interpreted relative to another marginal likelihood (evaluated at the same \(\boldsymbol{y}\)).

In frequentist statistics, it's also common to quantify evidence for a model by determining the maximum likelihood, that is, the likelihood of the data given the best-fitting model parameters. Thus, the data is used twice: once for fitting the parameters, and then for evaluating the likelihood. Importantly, this inference completely depends on the best-fitting parameters being meaningful values that represent precisely what we know about the parameters, and it doesn't take the uncertainty of the estimates into account. Bayesian inference quantifies the uncertainty that is associated with a parameter; that is, one accepts that the knowledge about the parameter value is uncertain. Computing the marginal likelihood entails computing the likelihood given all plausible values of the model parameters.

One difficulty with the above equation showing Bayes' rule is that the marginal likelihood \(p(\boldsymbol{y} \mid \mathcal{M}_1)\) in the denominator cannot be easily computed:

\[\begin{equation} p(\boldsymbol{\Theta} \mid \boldsymbol{y}, \mathcal{M}_1) = \frac{p(\boldsymbol{y} \mid \boldsymbol{\Theta}, \mathcal{M}_1) p(\boldsymbol{\Theta} \mid \mathcal{M}_1)}{p(\boldsymbol{y} \mid \mathcal{M}_1)} \end{equation}\]

The marginal likelihood does not depend on the model parameters \(\boldsymbol{\Theta}\); the parameters are "marginalized" or integrated out:

\[\begin{equation} p(\boldsymbol{y} \mid \mathcal{M}_1) = \int p(\boldsymbol{y} \mid \boldsymbol{\Theta}, \mathcal{M}_1) p(\boldsymbol{\Theta} \mid \mathcal{M}_1) d \boldsymbol{\Theta} \tag{15.1} \end{equation}\]

The likelihood is evaluated for every possible parameter value, weighted by the prior plausibility of the parameter values. The product \(p(\boldsymbol{y} \mid \boldsymbol{\Theta}, \mathcal{M}_1) p(\boldsymbol{\Theta} \mid \mathcal{M}_1)\) is then summed up (that is what the integral does).

For this reason, the prior is as important as the likelihood. Equation (15.1) also looks almost identical to the prior predictive distribution from section 3.3 (that is, the predictions that the model makes before seeing any data). The prior predictive distribution is repeated below for convenience:

\[\begin{equation} \begin{aligned} p(\boldsymbol{y_{pred}}) &= p(y_{pred_1},\dots,y_{pred_N})\\ &= \int_{\boldsymbol{\Theta}} p(y_{pred_1}|\boldsymbol{\Theta})\cdot p(y_{pred_2}|\boldsymbol{\Theta})\cdots p(y_{pred_N}|\boldsymbol{\Theta}) p(\boldsymbol{\Theta}) \, d\boldsymbol{\Theta} \end{aligned} \end{equation}\]

However, while the prior predictive distribution describes possible observations, the marginal likelihood is evaluated on the actually observed data.

Let's compute the Bayes factor for a very simple example case. We assume a study where we assess the number of "successes" observed in a fixed number of trials. For example, suppose that we have 80 "successes" out of 100 trials. A simple model of these data can be constructed by assuming, as we did in section 1.4, that the data are distributed according to a binomial distribution. In a binomial distribution, \(n\) independent experiments are performed, where the result of each experiment is either a "success" or "no success" with probability \(\theta\). The binomial distribution is the probability distribution of the number of successes \(k\) (the number of "success" responses) for a given number of trials \(n\).

Suppose now that we have prior information about the probability parameter \(\theta\). As we explained in section 2.2, a typical prior distribution for \(\theta\) is a beta distribution. The beta distribution defines a probability distribution on the interval \([0, 1]\), which is the interval over which the probability \(\theta\) is defined. It has two parameters \(a\) and \(b\), which determine the shape of the distribution. The prior parameters \(a\) and \(b\) can be interpreted as the a priori number of "successes" versus "failures." These could be based on previous evidence, or on the researcher's beliefs, drawing on domain knowledge (O'Hagan et al. 2006).

Here, to illustrate the calculation of the Bayes factor, we assume that the parameters of the beta distribution are \(a=4\) and \(b=2\). As mentioned above, these parameters can be interpreted as representing "success" (\(4\) prior observations representing success) and "no success" (\(2\) prior observations representing "no success"). The resulting prior distribution is visualized in Figure 15.1. A \(\mathit{Beta}(a=4,b=2)\) prior on \(\theta\) amounts to a regularizing prior with some, but no strong, prior evidence for more than 50% probability of success.
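The claim that this prior expresses some, but not overwhelming, prior evidence for above-chance success can be checked directly in R (a small check we add here for illustration):

```r
# Prior mean of Beta(a = 4, b = 2) is a / (a + b):
4 / (4 + 2)
## [1] 0.6666667
# Prior probability that theta exceeds 0.5:
pbeta(0.5, shape1 = 4, shape2 = 2, lower.tail = FALSE)
## [1] 0.8125
```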


FIGURE 15.1: Beta distribution with parameters a = 4 and b = 2.

To compute the marginal likelihood, equation (15.1) shows that we need to multiply the likelihood by the prior. The marginal likelihood is then the area under the curve, that is, the likelihood averaged across all possible values of the model parameter (the probability of success).

Based on the data, likelihood, and prior, we can calculate the marginal likelihood, that is, the area under the curve, in the following way using R:

# First, multiply the likelihood with the prior
plik1 <- function(theta) {
  dbinom(x = 80, size = 100, prob = theta) *
    dbeta(x = theta, shape1 = 4, shape2 = 2)
}
# Then integrate (compute the area under the curve):
(MargLik1 <- integrate(f = plik1, lower = 0, upper = 1)$value)
## [1] 0.02
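For this conjugate case the integral also has a closed form: a binomial likelihood with a \(\mathit{Beta}(a, b)\) prior yields the marginal likelihood \(\binom{n}{k} B(k+a, n-k+b)/B(a,b)\). This can be used as a sanity check on the numerical integration (a check we add here; it is not part of the chapter's main development):

```r
# Closed-form marginal likelihood for k = 80 successes in n = 100 trials
# under a Beta(4, 2) prior; should match integrate()'s result (~0.02)
choose(100, 80) * beta(80 + 4, 100 - 80 + 2) / beta(4, 2)
```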

One would prefer a model that provides a higher marginal likelihood, i.e., a higher likelihood of observing the data after integrating out the influence of the model parameter(s) (here: \(\theta\)). A model will produce a high marginal likelihood if it makes a large proportion of good predictions (i.e., model 2 in Figure 15.2; the figure is adapted from Bishop 2006). Model predictions are normalized, that is, the total probability that models assign to different possible data patterns is the same for all models. Models that are too flexible (model 3 in Figure 15.2) will spread their prior predictive probability mass across all of their predictions. Such models can predict many different outcomes. Thus, they likely can also predict the actually observed outcome. However, due to the normalization, they cannot predict it with high probability, since they also predict all kinds of other outcomes. This is true both for models with priors that are too wide and for models with too many parameters. Bayesian model comparison automatically penalizes such complex models, which is called the "Occam factor" (MacKay 2003).


FIGURE 15.2: Shown are the schematic marginal likelihoods that each of three models assigns to different possible data sets. The total probability each model assigns to the data is equal to one, i.e., the areas under the curves of all three models are the same. Model 1 (black), the low complexity model, assigns all the probability to a narrow range of possible data, and can predict these possible data sets with high likelihood. Model 3 (light grey) assigns its probability to a wide range of different possible outcomes, but predicts each individual observed data set with low likelihood (high complexity model). Model 2 (dark grey) takes an intermediate position (intermediate complexity). The vertical dashed line (dark grey) illustrates where the actual empirically observed data fall. The data most support model 2, since this model predicts the data with the highest likelihood. The figure is closely based on Figure 3.13 in Bishop (2006).

By contrast, good models (Figure 15.2, model 2) will make very specific predictions, where the specific predictions are consistent with the observed data. Here, all the predictive probability density is located at the "location" where the observed data fall, and little probability density is located at other locations, yielding good support for the model. Of course, specific predictions can also be wrong, when predictions differ from what the observed data actually look like (Figure 15.2, model 1).

Having a natural Occam factor is good for posterior inference, i.e., for evaluating how much (continuous) evidence there is for one model or another. However, it doesn't necessarily imply good decision making or hypothesis testing, i.e., making discrete decisions about which model explains the data best, or on which model to base further actions.

Here, we provide two examples of more flexible models. First, the model below assumes the same likelihood and the same distribution function for the prior. However, we assume a flat, uninformative prior, with prior parameters \(a = 1\) and \(b = 1\) (i.e., only one prior "success" and one prior "failure"), which provides more prior spread than the first model. Again, we can express our model as multiplying the likelihood with the prior, and integrate out the influence of the parameter \(\theta\):

# Likelihood times the flat Beta(1, 1) prior:
plik2 <- function(theta) {
  dbinom(x = 80, size = 100, prob = theta) *
    dbeta(x = theta, shape1 = 1, shape2 = 1)
}
(MargLik2 <- integrate(f = plik2, lower = 0, upper = 1)$value)
## [1] 0.0099

We can check that this second model is more flexible: due to the more spread-out prior, it is compatible with a larger range of possible observed data patterns. However, when we integrate out the \(\theta\) parameter to obtain the marginal likelihood, we can see that the flexibility also comes at a cost: the model has a smaller marginal likelihood (\(0.0099\)) than the first model (\(0.02\)). Thus, on average (averaged across all possible values of \(\theta\)) the second model performs worse at explaining the specific data that we observed compared to the first model, and has less support from the data.
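Incidentally, the flat \(\mathit{Beta}(1,1)\) prior admits a particularly simple closed form: under a uniform prior on \(\theta\), all \(n+1\) possible outcomes \(k = 0, \dots, n\) are a priori equally likely, so the marginal likelihood is \(1/(n+1)\). A one-line check (an aside we add here):

```r
# Marginal likelihood under the flat prior: 1 / (n + 1)
1 / (100 + 1)
## [1] 0.00990099
```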

A model might be more "complex" because it has a more spread-out prior, or alternatively because it has a more flexible likelihood function, which uses a larger number of parameters to explain the same data. Here we implement a third model, which assumes a more complex likelihood by using a beta-binomial distribution. The beta-binomial distribution is similar to the binomial distribution, with one important difference: In the binomial distribution the probability of success \(\theta\) is fixed across trials. In the beta-binomial distribution, the probability of success is fixed for each trial, but is drawn from a beta distribution across trials. Thus, \(\theta\) can differ between trials. In the beta-binomial distribution, we thus assume that the likelihood function is a combination of a binomial distribution and a beta distribution of the probability \(\theta\), which yields:

\[\begin{equation} p(X = k \mid a, b) = \binom{n}{k} \frac{B(k+a, n-k+b)}{B(a,b)} \end{equation}\]
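In R, this density is available as dbbinom() in the extraDistr package. As a quick sketch (assuming extraDistr is installed), it agrees with the beta-binomial probability mass function written out with beta functions:

```r
library(extraDistr) # provides dbbinom()
# dbbinom() agrees with the closed-form beta-binomial pmf:
dbbinom(x = 80, size = 100, alpha = 4, beta = 2)
choose(100, 80) * beta(80 + 4, 100 - 80 + 2) / beta(4, 2)
```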

What is important here is that this more complex distribution has two parameters (\(a\) and \(b\); rather than one, \(\theta\)) to explain the same data. We assume log-normally distributed priors on the \(a\) and \(b\) parameters, with location zero and scale \(100\). The likelihood of this combined beta-binomial distribution is given by the R function dbbinom() in the package extraDistr. We can now write down the likelihood times the priors (given as log-normal densities, dlnorm()), and integrate out the influence of the two model parameters \(a\) and \(b\) with numerical integration (applying integrate twice):

library(extraDistr) # provides dbbinom()
plik3 <- function(a, b) {
  dbbinom(x = 80, size = 100, alpha = a, beta = b) *
    dlnorm(x = a, meanlog = 0, sdlog = 100) *
    dlnorm(x = b, meanlog = 0, sdlog = 100)
}
# Compute the marginal likelihood by applying integrate twice
f <- function(b) {
  integrate(function(a) plik3(a, b), lower = 0, upper = Inf)$value
}
# integrate needs a vectorized function:
(MargLik3 <- integrate(Vectorize(f), lower = 0, upper = Inf)$value)
## [1] 0.00000707

The results show that this third model has an even smaller marginal likelihood compared to the first two (\(0.00000707\)). With its two parameters \(a\) and \(b\), this third model has a lot of flexibility to predict a lot of different patterns of observed empirical results. However, again, this increased flexibility comes at a cost, and the simple pattern of observed data does not seem to require such a complicated model. The small value of the marginal likelihood indicates that this complex model has less support from the data.

That is, for our present simple example case, we would prefer model 1 over the other two, since it has the largest marginal likelihood (\(0.02\)), and we would prefer model 2 over model 3, since the marginal likelihood of model 2 (\(0.0099\)) is larger than that of model 3 (\(0.00000707\)). The decision about which model is preferred is based on comparing the marginal likelihoods.

15.1.2 Bayes factor

The Bayes factor is a measure of relative evidence, the comparison of the predictive performance of one model against another. This comparison is a ratio of marginal likelihoods:

\[\begin{equation} BF_{12} = \frac{P(\boldsymbol{y} \mid \mathcal{M}_1)}{P(\boldsymbol{y} \mid \mathcal{M}_2)} \end{equation}\]

\(BF_{12}\) indicates the extent to which the data are more likely under \(\mathcal{M}_1\) than \(\mathcal{M}_2\), or in other words, the relative evidence that we have for \(\mathcal{M}_1\) over \(\mathcal{M}_2\). Values larger than one indicate evidence in favor of \(\mathcal{M}_1\), values smaller than one indicate evidence in favor of \(\mathcal{M}_2\), and values close to one indicate that the evidence is inconclusive. This model comparison does not rely on a specific parameter value. Instead, all possible prior parameter values are taken into account simultaneously. This is in contrast with the likelihood ratio test, as described in Box 15.1.

Box 15.1 The likelihood ratio test vs. the Bayes factor.

The likelihood ratio test is a very similar, but frequentist, approach to model comparison and hypothesis testing, which also compares the likelihood of the data given two different models. We show it here to highlight the similarities and differences between frequentist and Bayesian hypothesis testing. In contrast to the Bayes factor, the likelihood ratio test depends on the "best" (i.e., the maximum likelihood) estimate for the model parameter(s), that is, the model parameter \(\theta\) occurs on the right side of the semi-colon in the equation for each likelihood. (An aside: we do not use a conditional statement, i.e., the vertical bar, when talking about likelihood in the frequentist context; instead, we use a semi-colon. This is because the statement \(f(y\mid \theta)\) is a conditional statement, implying that \(\theta\) has a probability density function associated with it; in the frequentist framework, parameters cannot have a pdf associated with them, they are assumed to have fixed, point values.)

\[\begin{equation} LikRat = \frac{P(\boldsymbol{y} ; \boldsymbol{\hat{\Theta}_1}, \mathcal{M}_1)}{P(\boldsymbol{y} ; \boldsymbol{\hat{\Theta}_2}, \mathcal{M}_2)} \end{equation}\]

This means that in the likelihood ratio test, each model is tested on its ability to explain the data using the "best" estimate for the model parameter (here, the maximum likelihood estimate \(\hat{\theta}\)). That is, the likelihood ratio test reduces the full range of possible parameter values to a point value, leading to overfitting the model to the maximum likelihood estimate (MLE). If the MLE badly misestimates the true value of the parameter, due to Type M/S error (Gelman and Carlin 2014), we could end up with a "significant" effect that is just a consequence of this misestimation (it will not be stably replicable; see Vasishth et al. (2018) for an example). By comparison, the Bayes factor involves range hypotheses, which are implemented via integrals over the model parameters; that is, it uses marginal likelihoods that are averaged across all possible prior values of the model parameter(s). Thus, if, due to Type M error, the best point estimate (the MLE) of the model parameter(s) is not very representative of the possible values of the model parameter(s), then Bayes factors will be superior to the frequentist likelihood ratio test (see exercise 15.2). An additional difference, of course, is that Bayes factors rely on priors for estimating each model's parameter(s), whereas the frequentist likelihood ratio test does not (and cannot) consider priors in the estimation of the best-fitting model parameter(s). As we show in this chapter, this has far-reaching consequences for Bayes factor-based model comparison; for a more extensive exposition, see Schad, Nicenboim, Bürkner, Betancourt, et al. (2022a) and Vasishth, Yadav, et al. (2022).
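The contrast can be made concrete with the running binomial example (an illustration we add here): the maximized likelihood evaluates the data only at the MLE \(\hat{\theta} = 80/100 = 0.8\), and is therefore larger than the marginal likelihoods computed above, which average the likelihood over the whole prior range of \(\theta\):

```r
# Maximized (frequentist) likelihood at the MLE theta-hat = 0.8:
dbinom(x = 80, size = 100, prob = 0.8) # ~0.099
# Compare: the marginal likelihood under the Beta(4, 2) prior was ~0.02;
# averaging over parameter uncertainty necessarily lowers the value
```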

For the Bayes factor, a scale (see Table 15.1) has been proposed to interpret Bayes factors according to the strength of evidence in favor of one model (corresponding to some hypothesis) over another (Jeffreys 1939); but this scale should not be regarded as a hard and fast rule with clear boundaries.

TABLE 15.1: The Bayes factor scale as proposed by Jeffreys (1939). This scale should not be regarded as a hard and fast rule.
\(BF_{12}\) Interpretation
\(>100\) Extreme evidence for \(\mathcal{M}_1\).
\(30-100\) Very strong evidence for \(\mathcal{M}_1\).
\(10-30\) Strong evidence for \(\mathcal{M}_1\).
\(3-10\) Moderate evidence for \(\mathcal{M}_1\).
\(1-3\) Anecdotal evidence for \(\mathcal{M}_1\).
\(1\) No evidence.
\(1-\frac{1}{3}\) Anecdotal evidence for \(\mathcal{M}_2\).
\(\frac{1}{3}-\frac{1}{10}\) Moderate evidence for \(\mathcal{M}_2\).
\(\frac{1}{10}-\frac{1}{30}\) Strong evidence for \(\mathcal{M}_2\).
\(\frac{1}{30}-\frac{1}{100}\) Very strong evidence for \(\mathcal{M}_2\).
\(<\frac{1}{100}\) Extreme evidence for \(\mathcal{M}_2\).

So if we go back to our previous example, we can calculate \(BF_{12}\), \(BF_{13}\), and \(BF_{23}\). The subscript represents the order in which the models are compared; for example, \(BF_{21}\) is simply \(\frac{1}{BF_{12}}\).

\[\begin{equation} BF_{12} = \frac{marginal \; likelihood \; model \; 1}{marginal \; likelihood \; model \; 2} = \frac{MargLik1}{MargLik2} = 2 \end{equation}\]

\[\begin{equation} BF_{13} = \frac{MargLik1}{MargLik3}= 2825.4 \end{equation}\]

\[\begin{equation} BF_{32} = \frac{MargLik3}{MargLik2} = 0.0007 = \frac{1}{BF_{23}} = \frac{1}{1399.9} \end{equation}\]
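In R, these Bayes factors are just ratios of the marginal likelihoods computed earlier. The sketch below uses the rounded values reported in the text, so the resulting ratios differ slightly from the exact ones:

```r
# Rounded marginal likelihoods from the text:
MargLik1 <- 0.02
MargLik2 <- 0.0099
MargLik3 <- 0.00000707
(BF12 <- MargLik1 / MargLik2) # ~2: weak evidence for model 1 over model 2
(BF13 <- MargLik1 / MargLik3) # ~2800: extreme evidence for model 1 over model 3
(BF32 <- MargLik3 / MargLik2) # ~0.0007; note that BF32 = 1 / BF23
```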

However, if we want to know, given the data \(y\), what the probability of model \(\mathcal{M}_1\) is, or how much more probable model \(\mathcal{M}_1\) is than model \(\mathcal{M}_2\), then we need the prior odds, that is, we need to specify how probable \(\mathcal{M}_1\) is compared to \(\mathcal{M}_2\) a priori.

\[\begin{align} \frac{p(\mathcal{M}_1 \mid y)}{p(\mathcal{M}_2 \mid y)} =& \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)} \times \frac{P(y \mid \mathcal{M}_1)}{P(y \mid \mathcal{M}_2)} \end{align}\]

\[\begin{align} \text{Posterior odds}_{12} = & \text{Prior odds}_{12} \times BF_{12} \end{align}\]

The Bayes factor indicates the amount by which we need to update our relative belief between the two models in light of the data and priors. However, the Bayes factor alone cannot tell us which one of the models is the most probable. Given our priors for the models and the Bayes factor, we can calculate the posterior odds between the models.
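For example, if a priori we consider both models equally likely (prior odds of 1), the posterior odds simply equal the Bayes factor; with prior odds stacked against \(\mathcal{M}_1\), the same Bayes factor yields weaker posterior odds. A trivial sketch using the rounded \(BF_{12} = 2\) from the example above:

```r
BF12 <- 2
# Equal prior odds: posterior odds = Bayes factor
1 * BF12
## [1] 2
# Prior odds of 1/4 against model 1: posterior odds still favor model 2
(1 / 4) * BF12
## [1] 0.5
```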

Here we calculate posterior model probabilities for the case where we compare two models against each other. However, posterior model odds can also be calculated for the more general case, where more than two models are considered:

\[\begin{equation} p(\mathcal{M}_1 \mid \boldsymbol{y}) = \frac{p(\boldsymbol{y} \mid \mathcal{M}_1) p(\mathcal{M}_1)}{\sum_n p(\boldsymbol{y} \mid \mathcal{M}_n) p(\mathcal{M}_n)} \end{equation}\]

For simplicity, we mostly restrict ourselves to two models. (However, the sensitivity analyses we carry out below compare more than two models.)
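Using the rounded marginal likelihoods from the running binomial example and equal prior model probabilities, the formula above can be applied to all three models at once (a sketch we add for illustration):

```r
MargLik <- c(0.02, 0.0099, 0.00000707) # rounded values from the text
prior <- rep(1 / 3, 3) # equal prior probabilities for the three models
(post <- MargLik * prior / sum(MargLik * prior))
# model 1: ~0.669, model 2: ~0.331, model 3: ~0.0002
```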

Bayes factors (and posterior model probabilities) tell us how much evidence the data (and priors) provide in favor of one model or another. Put differently, they enable us to draw conclusions about the model space, that is, to determine the degree to which each hypothesis agrees with the existing data.

A completely different issue, though, is the question of how to perform (discrete) decisions based on continuous evidence. The question here is: which hypothesis should one choose to maximize utility? While Bayes factors have a clear rationale and interpretation in terms of the (continuous) evidence they provide, there is no clear and direct mapping from inference to how to perform decisions based on it. To derive decisions based on posterior model probabilities, utility functions are needed. Indeed, the utility of different possible actions (i.e., to accept and act based on one hypothesis or another) can differ quite dramatically in different situations. Erroneously rejecting a novel therapy could have essential negative consequences for a researcher attempting to implement life-saving healthcare, but adopting the therapy incorrectly might have less of an impact. By contrast, erroneously claiming a new discovery in fundamental research may have bad implications (low utility), whereas erroneously missing a new breakthrough claim may be less problematic if further evidence can be accumulated. Thus, Bayesian evidence (in the form of Bayes factors or posterior model probabilities) must be combined with utility functions in order to perform decisions based on them. For example, this could imply specifying the utility of a true discovery and the utility of a false discovery. Calibration (i.e., simulations) can then be used to derive decisions which maximize overall utility (see Schad, Nicenboim, Bürkner, Betancourt, et al. 2022a).

The question now is: how do we extend this method to models that we care about, i.e., to models that represent more realistic data analysis situations? In cognitive science, we typically fit fairly complex hierarchical models with many variance components. One major problem is that we won't be able to compute the marginal likelihood for hierarchical models (or any other complex model) either analytically or just using the R functions shown above. There are two very useful methods for calculating the Bayes factor for complex models: the Savage–Dickey density ratio method (Dickey and Lientz 1970; Wagenmakers et al. 2010) and bridge sampling (Bennett 1976; Meng and Wong 1996). The Savage–Dickey density ratio method is a straightforward way to compute the Bayes factor, but it is limited to nested models. The current implementation of the Savage–Dickey method in brms can be unstable, especially in cases where the posterior is far away from zero. Bridge sampling is a much more general method, but it requires many more effective samples than what is normally required for parameter estimation. We will use bridge sampling from the bridgesampling package (Gronau et al. 2017; Gronau, Singmann, and Wagenmakers 2017) with the function bayes_factor() to calculate the Bayes factor in the following examples.
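To give a flavor of the Savage–Dickey idea before turning to brms: for nested models, the Bayes factor in favor of a point null equals the ratio of posterior to prior density at the null value. In the conjugate binomial example from earlier (a toy illustration we add here, not part of the original text), the posterior under a \(\mathit{Beta}(4, 2)\) prior with 80 successes in 100 trials is \(\mathit{Beta}(4 + 80, 2 + 20)\), so for the point null \(\theta = 0.5\):

```r
# Savage–Dickey density ratio for the point null theta = 0.5:
# BF01 = posterior density at 0.5 / prior density at 0.5
prior_at_null <- dbeta(0.5, shape1 = 4, shape2 = 2)
post_at_null <- dbeta(0.5, shape1 = 4 + 80, shape2 = 2 + 20)
(BF01 <- post_at_null / prior_at_null)
# BF01 is tiny here: with 80/100 successes the data strongly disfavor
# theta = 0.5, i.e., BF10 = 1 / BF01 is huge
```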

15.2 Examining the N400 effect with Bayes factors

In section 5.2 we estimated the effect of cloze probability on the N400 average signal. This yielded a posterior credible interval for the effect of cloze probability. It is certainly possible to check whether, e.g., the 95% posterior credible interval overlaps with zero or not. However, such estimation cannot really answer the following question: How much evidence do we have in support of an effect? A 95% credible interval that doesn't overlap with zero, or a high probability mass away from zero, may hint that the predictor may be required to explain the data, but it does not really answer how much evidence we have in favor of an effect [for discussion, see Royall (1997); Wagenmakers et al. (2020); Rouder, Haaf, and Vandekerckhove (2018); see also Box 14.1]. The Bayes factor answers this question about the evidence in favor of an effect by explicitly carrying out a model comparison. We will compare a model that assumes the presence of an effect with a null model that assumes no effect.

As we saw before, the Bayes factor is highly sensitive to the priors. In the example presented below, both models are identical except for the effect of interest, \(\beta\), and so the prior on this parameter will play an important role in the calculation of the Bayes factor.

Next, we will fit a hierarchical model which includes random intercepts and slopes by items and by subjects. We will use regularizing priors on all the parameters; this speeds up computation and encodes realistic expectations about the parameters. However, the prior on \(\beta\) will be crucial for the calculation of the Bayes factor.
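Anticipating the workflow, a brms-based Bayes factor computation via bridge sampling might look like the following sketch (not run here; the data frame and variable names are placeholders, and the priors shown are illustrative rather than the chapter's actual choices). Two details matter in practice: save_pars = save_pars(all = TRUE) is required so that bridge sampling can be applied to the fitted model, and many more posterior draws are needed than for estimation alone:

```r
# Sketch only: df_eeg, n400, subj, item, and c_cloze are placeholder names
fit_full <- brm(
  n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
  data = df_eeg,
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(normal(0, 1), class = b) # the crucial prior on beta
  ),
  save_pars = save_pars(all = TRUE), # required for bridge sampling
  iter = 10000 # bridge sampling needs many effective samples
)
# Null model: same structure, but without the population-level effect
fit_null <- update(fit_full, formula. = ~ . - c_cloze)
# Bayes factor of the full over the null model via bridge sampling:
bayes_factor(fit_full, fit_null)
```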

One possible way to build a good prior for the parameter \(\beta\), which estimates the influence of cloze probability here, is the following (see chapter 6 for an extended discussion of prior selection). The reasoning below is based on domain knowledge, but there is room for differences of opinion here. In a realistic data analysis situation, we would carry out a sensitivity analysis using a range of priors to determine the extent of the influence of the priors.

  1. One may want to be agnostic regarding the direction of the effect; this means that we should center the prior of \(\beta\) on zero by specifying that the mean of the prior distribution is zero. However, we are still not sure about the variability of the prior on \(\beta\).
  2. One would need to know a bit about the variation in the dependent variable that we are analyzing. After re-analyzing the data from a couple of EEG experiments available from osf.io, we can say that for N400 averages, the standard deviation of the signal is between 8-15 microvolts (Nicenboim, Vasishth, and Rösler 2020).
  3. Based on published estimates of effects in psycholinguistics, we can conclude that they are generally rather small, often representing between 5%-30% of the standard deviation of the dependent variable.
  4. The effect of noun predictability on the N400 is one of the most reliable and strongest effects in neurolinguistics (together with the P600, which might even be stronger), and the slope \(\beta\) represents the average change in voltage when moving from a cloze probability of zero to one, that is, the strongest prediction effect.

An additional way to obtain good priors is to perform prior predictive checks (Schad, Betancourt, and Vasishth 2020; also see chapter 7, which presents a principled Bayesian workflow). Here, the idea is to simulate data from the model and the priors, and then to analyze the simulated data using summary statistics. For example, it would be possible to compute the summary statistic of the difference in the N400 between high versus low cloze probability. The simulations would yield a distribution of differences. Arguably, this distribution of differences, that is, the data analysis for the simulated data, is much easier to judge for plausibility than the parameters specifying the prior distributions. That is, we might find it easier to judge whether a difference in voltage between high and low cloze probability is plausible than to judge the parameters of the model. For reasons of brevity, we skip this step here.

Here, we will start with the prior \(\beta \sim \mathit{Normal}(0,5)\) (since 5 microvolts is roughly 30% of 15, which is the upper bound of the expected standard deviation of the EEG signal).

priors1 <- c(
  prior(normal(2, 5), class = Intercept),
  prior(normal(0, 5), class = b),
  prior(normal(10, 5), class = sigma),
  prior(normal(0, 2), class = sd),
  prior(lkj(4), class = cor)
)

We load the data set of N400 amplitudes, which has data on cloze probabilities (Nieuwland et al. 2018). The cloze probability measure is mean-centered to make the intercept and the random intercepts easier to interpret (i.e., after centering, they represent the grand mean and the variability around the grand mean across subjects and items).

data(df_eeg)
df_eeg <- df_eeg %>% mutate(c_cloze = cloze - mean(cloze))

A large number of effective samples is needed to obtain stable estimates of the Bayes factor from bridge sampling. For this reason, a large number of sampling iterations (n = 20000) is specified. The parameter adapt_delta is set to \(0.9\) to ensure that the posterior sampler works correctly. For Bayes factor analyses, it's necessary to set the argument save_pars = save_pars(all = TRUE). This option is a prerequisite for running bridge sampling to compute the Bayes factor.

fit_N400_h_linear <- brm(n400 ~ c_cloze +
  (c_cloze | subj) + (c_cloze | item),
  prior = priors1,
  warmup = 2000,
  iter = 20000,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_pars = save_pars(all = TRUE),
  data = df_eeg)

Next, take a look at the population-level (or fixed) effects from the Bayesian model.

fixef(fit_N400_h_linear)
##           Estimate Est.Error Q2.5 Q97.5
## Intercept     3.65      0.45 2.77  4.54
## c_cloze       2.33      0.64 1.05  3.58

We can now take a look at the estimates and at the credible intervals. The effect of cloze probability (c_cloze) is \(2.33\) with a 95% credible interval ranging from \(1.05\) to \(3.58\). While this provides an initial hint that highly probable words may elicit a greater N400 compared to low-probability words, by just looking at the posterior there is no way to quantify the evidence for the question of whether this effect is different from zero. Model comparison is needed to answer this question.

To this end, we run the model again, now without the parameter of interest, i.e., the null model. This is a model where our prior for \(\beta\) is that it is exactly zero.

fit_N400_h_null <- brm(n400 ~ 1 +
  (c_cloze | subj) + (c_cloze | item),
prior = priors1[priors1$class != "b", ],
warmup = 2000,
iter = 20000,
cores = 4,
control = list(adapt_delta = 0.9),
save_pars = save_pars(all = TRUE),
data = df_eeg
)

Now everything is prepared to compute the log marginal likelihood, that is, the likelihood of the data given the model, after integrating out the model parameters. For the toy example shown above, we had used the R function integrate() to perform this integration. This is not possible for the more realistic and more complex models considered here, because the integrals that have to be solved are too high-dimensional and complex for such simple functions to do their work. Instead, the standard approach for dealing with realistically complex models is to use bridge sampling (Gronau et al. 2017; Gronau, Singmann, and Wagenmakers 2017). We perform this integration using the function bridge_sampler() with each of the two models:

margLogLik_linear <- bridge_sampler(fit_N400_h_linear, silent = TRUE)
margLogLik_null <- bridge_sampler(fit_N400_h_null, silent = TRUE)

This gives us the marginal log likelihoods for each of the models. From these, we can compute the Bayes factor. The function bayes_factor() is a convenient way to calculate the Bayes factor.

(BF_ln <- bayes_factor(margLogLik_linear, margLogLik_null))
## Estimated Bayes factor in favor of x1 over x2: 50.96782

Alternatively, the Bayes factor can be computed manually as well. First, compute the difference in marginal log likelihoods; then transform this difference in log likelihoods to the likelihood scale (using exp()). A difference on the log scale is a ratio on the original scale: \(exp(a-b) = exp(a)/exp(b)\). This computation yields the Bayes factor. However, the terms exp(ml1) and exp(ml2) are too small to be represented accurately by R. Therefore, for numerical reasons, it is essential to take the difference first and only then compute the exponential \(exp(a-b)\), i.e., exp(margLogLik_linear$logml - margLogLik_null$logml), which yields the same result as the bayes_factor() command.

exp(margLogLik_linear$logml - margLogLik_null$logml)
## [1] 51
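The underflow problem is easy to reproduce. The following sketch (in Python for self-containedness; the two log marginal likelihoods are made up, chosen only to be of a realistic magnitude) shows why the difference must be taken on the log scale first:

```python
import math

# Hypothetical log marginal likelihoods (invented values of roughly the
# magnitude such models produce).
logml_linear = -800.0
logml_null = -804.0

# Exponentiating each term first underflows to zero; 0/0 is undefined.
print(math.exp(logml_linear))  # 0.0
print(math.exp(logml_null))    # 0.0

# Taking the difference on the log scale first is numerically safe.
bf = math.exp(logml_linear - logml_null)
print(bf)  # exp(4), about 54.6
```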

The Bayes factor is quite large in this example, and provides strong evidence for the alternative model, which includes a slope representing the effect of cloze probability. That is, by the criteria shown in Table 15.1, the Bayes factor furnishes strong evidence for an effect of cloze probability.

In the above example, there was good prior information about the model parameter \(\beta\). What happens, though, if we are unsure about the prior for the model parameter? If our prior for \(\beta\) is inappropriate, it is possible that we will compare the null model with an extremely “bad” alternative model.

For example, assuming that we do not know much about N400 effects, or that we do not want to make strong assumptions, we might be inclined to use an uninformative prior. This could look as follows (where all the priors except for b remain unchanged):

priors_vague <- c(
  prior(normal(2, 5), class = Intercept),
  prior(normal(0, 500), class = b),
  prior(normal(10, 5), class = sigma),
  prior(normal(0, 2), class = sd),
  prior(lkj(4), class = cor)
)

We can use these uninformative priors in the Bayesian model:

fit_N400_h_linear_vague <- brm(n400 ~ c_cloze +
  (c_cloze | subj) + (c_cloze | item),
prior = priors_vague,
warmup = 2000,
iter = 20000,
cores = 4,
control = list(adapt_delta = 0.9),
save_pars = save_pars(all = TRUE),
data = df_eeg
)

Interestingly, we can still estimate the effect of cloze probability fairly well:

posterior_summary(fit_N400_h_linear_vague, variable = "b_c_cloze")
##           Estimate Est.Error Q2.5 Q97.5
## b_c_cloze     2.37     0.646 1.08  3.63

Next, we again perform the bridge sampling for the alternative model.

margLogLik_linear_vague <- bridge_sampler(fit_N400_h_linear_vague,
                                          silent = TRUE)

We compute the Bayes factor for the alternative over the null model, \(BF_{10}\):

(BF_lnVague <-
   bayes_factor(margLogLik_linear_vague, margLogLik_null))
## Estimated Bayes factor in favor of x1 over x2: 0.56000

This is easier to read as the evidence for the null model over the alternative:

1 / BF_lnVague[[1]]
## [1] 1.79

The result is inconclusive: there is no evidence in favor of or against an effect of cloze probability. The reason for this is that priors are never uninformative when it comes to Bayes factors. The wide prior specifies that both very small and very large effect sizes are possible (with some considerable probability), but there is relatively little evidence in the data for such large effect sizes.
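A quick calculation makes this concrete. Using the normal CDF (available in Python's standard library via `math.erf`), we can ask how much prior probability each prior allocates to effect sizes of a plausible magnitude, say within \(\pm 5\) microvolts (this bound is only illustrative):

```python
import math

def normal_cdf(x, mu=0.0, sd=1.0):
    """CDF of a Normal(mu, sd) distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

def prob_within(bound, sd):
    """Prior probability that a Normal(0, sd) effect lies in [-bound, bound]."""
    return normal_cdf(bound, 0, sd) - normal_cdf(-bound, 0, sd)

print(prob_within(5, 5))    # informative prior Normal(0, 5): about 0.68
print(prob_within(5, 500))  # "uninformative" prior Normal(0, 500): about 0.008
```

Under the Normal(0, 500) prior, less than 1% of the prior mass falls on effects of realistic size; almost all of its mass goes to effect sizes the data cannot support, which is what drags the Bayes factor toward the null.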

The above example is related to a criticism of Bayes factors by Uri Simonsohn, namely that Bayes factors can provide evidence in favor of the null and against a very specific alternative model, when the researchers simply know the direction of the effect (see https://datacolada.org/78a). This can happen when an uninformative prior is used.

One way to overcome this problem is to actually try to learn about the effect size that we are investigating. This can be done by first running an exploratory experiment and analysis without computing any Bayes factor, and then using the posterior distribution derived from this first experiment to calibrate the priors for the next confirmatory experiment, where we do use the Bayes factor (see Verhagen and Wagenmakers 2014 for a Bayes factor test calibrated to investigate replication success).

Another possibility is to examine a lot of different alternative models, where each model uses different prior assumptions. In this manner, the degree to which the Bayes factor results rely on, or are sensitive to, the prior assumptions can be examined. This is an instance of a sensitivity analysis. Recall that the model is the likelihood and the priors. We can accordingly compare models that differ only in the prior (for an example involving EEG and predictability effects, see Nicenboim, Vasishth, and Rösler 2020).

15.2.1 Sensitivity analysis

Here, we conduct a sensitivity analysis by examining Bayes factors for several models. Each model has the same likelihood but a different prior for \(\beta\). For all of the priors we assume a normal distribution with a mean of zero. Assuming a mean of zero asserts that we do not make any assumption a priori that the effect differs from zero. If the effect should differ from zero, we want the data to tell us that. The standard deviations of the various priors differ from one another. That is, what differs is the amount of uncertainty about the effect size that we allow for in the prior. While a small standard deviation indicates that we expect the effect to be not very large, a large standard deviation permits very large effect sizes. Although a model with a wide prior (i.e., a large standard deviation) also allocates prior probability to small effect sizes, it allocates much less probability to small effect sizes compared to a model with a tight prior. Therefore, if the effect size is in reality small, then a model with a narrow prior (small standard deviation) will have a better chance of detecting the effect.
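This point about probability allocation can be checked directly by comparing the prior densities that a narrow and a wide prior assign to a small effect size (the specific numbers below are illustrative):

```python
import math

def normal_pdf(x, mu=0.0, sd=1.0):
    """Density of a Normal(mu, sd) distribution."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Prior density at a small effect size of 2 microvolts:
d_narrow = normal_pdf(2, 0, 2.5)  # about 0.116
d_wide = normal_pdf(2, 0, 100)    # about 0.004
print(d_narrow, d_wide)
print(d_narrow / d_wide)  # the narrow prior puts ~29 times more density here
```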

Next, we try out a number of prior standard deviations, ranging from 1 to a much wider prior that has a standard deviation of 100. In practice, for the experimental method we are discussing here, it would not be a good idea to define very large standard deviations such as 100 microvolts, since they imply unrealistically large effect sizes. Nonetheless, we include such a large value here just for illustration. Such a sensitivity analysis takes a very long time: here, we are running 11 models, where each model involves a lot of iterations to obtain stable Bayes factor estimates.

prior_sd <- c(1, 1.5, 2, 2.5, 5, 8, 10, 20, 40, 50, 100)
BF <- c()
for (i in 1:length(prior_sd)) {
  psd <- prior_sd[i]
  # for each prior we fit the model
  fit <- brm(n400 ~ c_cloze + (c_cloze | subj) + (c_cloze | item),
    prior =
      c(
        prior(normal(2, 5), class = Intercept),
        set_prior(paste0("normal(0,", psd, ")"), class = "b"),
        prior(normal(10, 5), class = sigma),
        prior(normal(0, 2), class = sd),
        prior(lkj(4), class = cor)
      ),
    warmup = 2000,
    iter = 20000,
    cores = 4,
    control = list(adapt_delta = 0.9),
    save_pars = save_pars(all = TRUE),
    data = df_eeg
  )
  # run bridge sampling for each model
  lml_linear_beta <- bridge_sampler(fit, silent = TRUE)
  # store the Bayes factor against the null model fit earlier
  BF <- c(BF, bayes_factor(lml_linear_beta, margLogLik_null)$bf)
}
BFs <- tibble(beta_sd = prior_sd, BF)

For each model, we run bridge sampling and compute the Bayes factor of the model against our original null model, which does not contain a population-level effect of cloze probability (\(BF_{10}\)). Next, we need a way to visualize all the Bayes factors. We plot them in Figure 15.3 as a function of the prior standard deviation.


FIGURE 15.3: Prior sensitivity analysis for the Bayes factor.

This figure clearly shows that the Bayes factor provides evidence for the alternative model; that is, it provides evidence that a population-level (or fixed) effect of cloze probability is needed to explain the data. This can be seen in that the Bayes factor is fairly large for a range of different values of the prior standard deviation. The Bayes factor is largest for a prior standard deviation of \(2.5\), suggesting a rather small size of the effect of cloze probability. If we assume gigantic effect sizes a priori (e.g., standard deviations of 50 or 100), then the evidence for the alternative model is weaker. Conceptually, the data do not fully support such huge effect sizes, and start to favor the null model relatively more when such big effect sizes are tested against the null. Overall, we can conclude that the data provide evidence for a not too large but robust influence of cloze probability on the N400 amplitude.

15.2.2 Non-nested models

An important advantage of Bayes factors is that they can be applied to compare models that are not nested. In nested models, the simpler model is a special case of the more complex and general model. For example, our previous model of cloze probability was a general model, allowing different influences of cloze probability on the N400. We compared this to a simpler, more specific null model, where the influence of cloze probability was not included, which means that the regression coefficient (population-level or fixed effect) for cloze probability was assumed to be set to zero. Such nested models can also be compared using frequentist procedures such as the likelihood ratio test (ANOVA).

By contrast, the Bayes factor also makes it possible to compare non-nested models. An example of a non-nested model would be a case where we log-transform the cloze probability variable before using it as a predictor. A model with log cloze probability as a predictor is not a special case of a model with linear cloze probability as a predictor. These are just different, alternative models. With Bayes factors, we can compare these non-nested models with each other to determine which receives more evidence from the data.

To do so, we first log-transform the cloze probability variable. Some cloze probabilities in the data set are equal to zero. This creates a problem when taking logs, since the log of zero is minus infinity, a value that we cannot use. We are going to overcome this problem by “smoothing” the cloze probability in this example. We use additive smoothing (also called Laplace or Lidstone smoothing; Lidstone 1920; Chen and Goodman 1999) with pseudocounts set to one, which means that the smoothed probability is calculated as the number of responses with a given answer plus one, divided by the total number of responses plus two.

df_eeg <- df_eeg %>%
  mutate(
    scloze = (cloze_ans + 1) / (N + 2),
    c_logscloze = log(scloze) - mean(log(scloze))
  )
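As a quick sanity check of the smoothing formula (a Python sketch; the counts are invented), zero counts map to a small but nonzero probability, and perfect counts map to a probability just below one:

```python
def smooth_cloze(k, n):
    """Additive (Laplace) smoothing with pseudocounts of one:
    (successes + 1) / (total responses + 2)."""
    return (k + 1) / (n + 2)

# With 40 respondents: a word produced by nobody vs. by everybody.
print(smooth_cloze(0, 40))   # 1/42, about 0.024 (no longer exactly 0)
print(smooth_cloze(40, 40))  # 41/42, about 0.976 (no longer exactly 1)
```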

Next, we center the predictor variable, and we scale it to the same standard deviation as the linear cloze probability. To implement this scaling, first divide the centered smoothed log cloze probability variable by its standard deviation (effectively creating \(z\)-scaled values). As a second step, multiply the \(z\)-scaled values by the standard deviation of the non-transformed cloze probability variable. This way, both predictors (log cloze and cloze) have the same standard deviation. We therefore expect them to have a similar impact on the N400. As a result of this transformation, the same priors can be used for both variables (given that we currently have no specific information about the effect of log cloze probability versus linear cloze probability):

df_eeg <- df_eeg %>%
  mutate(c_logscloze = scale(c_logscloze) * sd(c_cloze))
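This two-step rescaling can be verified numerically. The sketch below (in Python for self-containedness, with invented smoothed cloze values) checks that the rescaled log predictor ends up with the same standard deviation as the linear predictor:

```python
import math
from statistics import mean, stdev

# Invented smoothed cloze probabilities, strictly between 0 and 1.
scloze = [0.05, 0.2, 0.35, 0.5, 0.7, 0.9, 0.97]
c_cloze = [p - mean(scloze) for p in scloze]

# Center the log-transformed values, z-scale them, then multiply by
# the standard deviation of the linear predictor.
log_scloze = [math.log(p) for p in scloze]
c_log = [x - mean(log_scloze) for x in log_scloze]
c_logscloze = [x / stdev(c_log) * stdev(c_cloze) for x in c_log]

# Both predictors now have the same standard deviation.
print(round(stdev(c_cloze), 6), round(stdev(c_logscloze), 6))
```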

Then, run a linear mixed-effects model with log cloze probability instead of linear cloze probability, and again carry out bridge sampling.

fit_N400_h_log <- brm(n400 ~ c_logscloze +
  (c_logscloze | subj) + (c_logscloze | item),
prior = priors1,
warmup = 2000,
iter = 20000,
cores = 4,
control = list(adapt_delta = 0.9),
save_pars = save_pars(all = TRUE),
data = df_eeg
)
margLogLik_log <- bridge_sampler(fit_N400_h_log, silent = TRUE)

Next, compare the linear and the log model with each other using Bayes factors.

(BF_log_lin <- bayes_factor(margLogLik_log, margLogLik_linear))
## Estimated Bayes factor in favor of x1 over x2: 6.04762

The result shows a Bayes factor of \(6\) for the log model over the linear model. This is some evidence that log cloze probability is a better predictor of N400 amplitudes than linear cloze probability. This analysis demonstrates that model comparisons with Bayes factors are not limited to nested models, but can also be used for non-nested models.

15.3 The influence of priors on Bayes factors: beyond the effect of interest

We saw above that the width (or standard deviation) of the prior distribution for the effect of interest had a strong impact on the results of Bayes factor analyses. Thus, one question is whether only the prior for the effect of interest is important, or whether priors on other model parameters can also impact the resulting Bayes factors in an analysis. It turns out that priors for other model parameters can also be important and impact Bayes factors, especially when there are non-linear components in the model, such as in generalized linear mixed-effects models. We investigate this topic using a simulated data set with a dependent variable that has a Bernoulli distribution; in each trial, subjects can perform either successfully (pDV = 1) on a task, or not (pDV = 0). The simulated data are from a factorial experimental design with a between-subject factor \(F\) with 2 levels (\(F1\) and \(F2\)), and Table 15.2 shows success probabilities for each of the experimental conditions.

data("df_BF")
str(df_BF)
## tibble [100 × 3] (S3: tbl_df/tbl/data.frame)
##  $ F  : Factor w/ 2 levels "F1","F2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pDV: int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##  $ id : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
TABLE 15.2: Summary statistics per condition for the simulated data.
Factor F  N data  Means
F1  \(50\)  \(0.98\)
F2  \(50\)  \(0.70\)

Our question now is whether there is evidence for a difference in success probabilities between groups \(F1\) and \(F2\). As the contrast for the factor \(F\), we use scaled sum coding \((-0.5, +0.5)\).

contrasts(df_BF$F) <- c(-0.5, +0.5)
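Under this \(\pm 0.5\) coding, the intercept estimates the grand mean and the slope estimates the \(F2 - F1\) difference directly. A small check of this property (in Python, using plain ordinary least squares rather than the Bayesian model, with invented group outcomes) looks like this:

```python
# Two balanced groups coded -0.5 (F1) and +0.5 (F2); invented outcomes.
y_f1 = [0.9, 1.0, 1.1]  # group F1, mean 1.0
y_f2 = [0.6, 0.7, 0.8]  # group F2, mean 0.7
x = [-0.5] * len(y_f1) + [0.5] * len(y_f2)
y = y_f1 + y_f2

n = len(y)
mx, my = sum(x) / n, sum(y) / n
# OLS slope and intercept via the covariance / variance formulas.
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

print(intercept)  # 0.85: the grand mean
print(slope)      # -0.3: mean(F2) - mean(F1)
```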

Next, we proceed to specify our priors. For the difference between groups (\(F1\) versus \(F2\)), we define a normally distributed prior with a mean of \(0\) and a standard deviation of \(0.5\). Thus, we do not specify a direction of the difference a priori, and we assume not too large effect sizes. We then run two Bayesian brms models, one with the group factor \(F\) included, and one without the factor \(F\), and compute Bayes factors using bridge sampling to obtain the evidence that the data offer for the alternative hypothesis that a group difference exists between levels \(F1\) and \(F2\).

So far, we have only specified the prior for the effect size. The question we are asking now is whether priors on other model parameters can impact the Bayes factor computations for testing the group effect. Specifically, can the prior for the intercept influence the Bayes factor for the group effect? The results show that yes, such an influence can take place in some situations. Let's have a look at this in more detail. Let's assume that we specify two different priors for the intercept. We define each as a normal distribution with a standard deviation of \(0.1\), thus specifying relatively high certainty a priori about where the intercept of the data will fall. The only difference is that in one case, the prior mean (on the latent logistic scale) is set to \(0\), corresponding to a prior mean probability of \(0.5\). In the other case, we specify a prior mean of \(2\), corresponding to a prior mean probability of \(0.88\). When we look at the data (see Table 15.2), we see that the prior mean of \(0\) (i.e., a prior probability for the intercept of \(0.5\)) is not very compatible with the data, whereas the prior mean of \(2\) (i.e., a prior probability for the intercept of \(0.88\)) is quite closely aligned with the actual data.
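The correspondence between prior means on the latent scale and probabilities can be verified with the inverse logit (logistic) function; a minimal sketch in Python:

```python
import math

def inv_logit(x):
    """Map a value on the latent logistic scale to a probability."""
    return 1 / (1 + math.exp(-x))

print(inv_logit(0))  # 0.5: an intercept prior centered at 0
print(inv_logit(2))  # about 0.88: an intercept prior centered at 2
```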

We now compute Bayes factors for the group difference (\(F1\) versus \(F2\)) using these different priors on the intercept. Thus, we first fit a null (\(M0\)) and an alternative (\(M1\)) model under a prior that is incompatible with the data (a narrow distribution centered at zero), and perform bridge sampling for these models:

# set prior
priors_logit1 <- c(
  prior(normal(0, 0.1), class = Intercept),
  prior(normal(0, 0.5), class = b)
)
# Bayesian GLM: M0
fit_pDV_H0 <- brm(pDV ~ 1,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit1[-2, ],
  save_pars = save_pars(all = TRUE)
)
# Bayesian GLM: M1
fit_pDV_H1 <- brm(pDV ~ 1 + F,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit1,
  save_pars = save_pars(all = TRUE)
)
# bridge sampling
mLL_binom_H0 <- bridge_sampler(fit_pDV_H0, silent = TRUE)
mLL_binom_H1 <- bridge_sampler(fit_pDV_H1, silent = TRUE)

Next, we compute Bayes factors by again running the null (\(M0\)) and the alternative (\(M1\)) model, now assuming a more realistic prior for the intercept (prior mean \(= 2\)).

priors_logit2 <- c(
  prior(normal(2, 0.1), class = Intercept),
  prior(normal(0, 0.5), class = b)
)
# Bayesian GLM: M0
fit_pDV_H0_2 <- brm(pDV ~ 1,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit2[-2, ],
  save_pars = save_pars(all = TRUE)
)
# Bayesian GLM: M1
fit_pDV_H1_2 <- brm(pDV ~ 1 + F,
  data = df_BF,
  family = bernoulli(link = "logit"),
  prior = priors_logit2,
  save_pars = save_pars(all = TRUE)
)
# bridge sampling
mLL_binom_H0_2 <- bridge_sampler(fit_pDV_H0_2, silent = TRUE)
mLL_binom_H1_2 <- bridge_sampler(fit_pDV_H1_2, silent = TRUE)

Based on these models and bridge samples, we can now compute the Bayes factors in support of \(M1\) (i.e., in support of a group difference between \(F1\) and \(F2\)). We do so both for the unrealistic prior for the intercept (prior mean of \(0\)) and for the more realistic prior for the intercept (prior mean of \(2\)).

(BF_binom_H1_H0 <- bayes_factor(mLL_binom_H1, mLL_binom_H0))
## Estimated Bayes factor in favor of x1 over x2: 7.18805
(BF_binom_H1_H0_2 <- bayes_factor(mLL_binom_H1_2, mLL_binom_H0_2))
## Estimated Bayes factor in favor of x1 over x2: 29.58150

The results show that with the realistic prior for the intercept (prior mean \(= 2\)), the evidence for \(M1\) is rather strong, with a Bayes factor of \(BF_{10} =\) 29.6. With the unrealistic prior for the intercept (i.e., prior mean \(= 0\)), by contrast, the evidence for \(M1\) is much reduced, \(BF_{10} =\) 7.2, and now only moderate.

Thus, when performing Bayes factor analyses, not only can the priors for the effect of interest (here, the group difference) affect the results; in certain circumstances priors for other model parameters can too, such as the prior mean for the intercept here. Such an influence will not always be strong, and can sometimes be negligible. There may be many situations where the precise specification of the intercept does not have much of an effect on the Bayes factor for a group difference. However, such influences can in principle emerge, especially in models with non-linear components. Therefore, it is very important to be diligent in specifying realistic priors for all model parameters, also including the intercept. A good way to judge whether priors are realistic and plausible is prior predictive checks, where we simulate data based on the priors and the model, and judge whether the simulated data are plausible and realistic.

15.4 Bayes factors in Stan

The package bridgesampling allows for a straightforward computation of Bayes factors for Stan models as well. All the limitations and caveats of Bayes factors discussed in this chapter apply to Stan code as much as they apply to brms code. The sampling notation (~) should not be used; see Box 10.2.

An advantage of using Stan in comparison with brms is Stan's flexibility. We revisit the model implemented earlier in section 10.4.2. We want to assess the evidence for a positive effect of attentional load on pupil size against a similar model that assumes no effect. To do this, assume the following likelihood:

\[\begin{equation} p\_size_n \sim \mathit{Normal}(\alpha + c\_load_n \cdot \beta_1 + c\_trial_n \cdot \beta_2 + c\_load_n \cdot c\_trial_n \cdot \beta_3, \sigma) \end{equation}\]

Define priors for all the \(\beta\)s as before, with the difference that \(\beta_1\) can only have positive values:

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(1000, 500) \\ \beta_1 &\sim \mathit{Normal}_+(0, 100) \\ \beta_2 &\sim \mathit{Normal}(0, 100) \\ \beta_3 &\sim \mathit{Normal}(0, 100) \\ \sigma &\sim \mathit{Normal}_+(0, 1000) \end{aligned} \end{equation}\]

The following Stan model is a direct translation of these new priors and likelihood.

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real<lower = 0> beta1;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta1 | 0, 100) -
    normal_lccdf(0 | 0, 100);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_load * beta1 +
                                 c_trial * beta2 +
                                 c_load .* c_trial * beta3, sigma);
}
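The `- normal_lccdf(0 | 0, 100)` term in the model block is the normalizing correction for truncating the prior for \(\beta_1\) at zero: a normal centered at zero loses half its mass when restricted to positive values, so its log density must be shifted up by \(-\log(0.5) = \log(2)\). A small numeric check of this (in Python, informally mirroring the Stan functions) is:

```python
import math

def normal_lpdf(x, mu, sd):
    """Log density of Normal(mu, sd), analogous to Stan's normal_lpdf."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def normal_lccdf(x, mu, sd):
    """Log of P(X > x) for Normal(mu, sd), analogous to Stan's normal_lccdf."""
    return math.log(0.5 * (1 - math.erf((x - mu) / (sd * math.sqrt(2)))))

# For a normal truncated at its own mean, the correction is log(2).
correction = -normal_lccdf(0, 0, 100)
print(correction, math.log(2))  # both about 0.6931

# The corrected density integrates to 1 over the positive half-line
# (checked with a crude midpoint Riemann sum over 0..1000).
width = 0.01
total = 0.0
for i in range(100_000):
    x = (i + 0.5) * width
    total += math.exp(normal_lpdf(x, 0, 100) - normal_lccdf(0, 0, 100)) * width
print(round(total, 3))  # 1.0
```

Without this correction, the prior would not be a proper density on the constrained space, which would distort the marginal likelihood and hence the Bayes factor.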

Fit the model with 20000 iterations to ensure that the Bayes factor is robust, and increase the adapt_delta parameter to avoid divergent transitions:

data("df_pupil")
df_pupil <- df_pupil %>%
  mutate(
    c_load = load - mean(load),
    c_trial = trial - mean(trial)
  )
ls_pupil <- list(
  p_size = df_pupil$p_size,
  c_load = df_pupil$c_load,
  c_trial = df_pupil$c_trial,
  N = nrow(df_pupil)
)
pupil_pos <- system.file("stan_models",
                         "pupil_pos.stan",
                         package = "bcogsci")
fit_pupil_int_pos <- stan(
  file = pupil_pos,
  data = ls_pupil,
  warmup = 1000,
  iter = 20000,
  control = list(adapt_delta = .95))

The null model that we defined has \(\beta_1 = 0\) and is written in Stan as follows:

data {
  int<lower = 1> N;
  vector[N] c_load;
  vector[N] c_trial;
  vector[N] p_size;
}
parameters {
  real alpha;
  real beta2;
  real beta3;
  real<lower = 0> sigma;
}
model {
  target += normal_lpdf(alpha | 1000, 500);
  target += normal_lpdf(beta2 | 0, 100);
  target += normal_lpdf(beta3 | 0, 100);
  target += normal_lpdf(sigma | 0, 1000)
    - normal_lccdf(0 | 0, 1000);
  target += normal_lpdf(p_size | alpha + c_trial * beta2 +
                                 c_load .* c_trial * beta3, sigma);
}
pupil_null <- system.file("stan_models",
  "pupil_null.stan",
  package = "bcogsci"
)
fit_pupil_int_null <- stan(
  file = pupil_null,
  data = ls_pupil,
  warmup = 1000,
  iter = 20000
)

Compare the models with bridge sampling:

lml_pupil <- bridge_sampler(fit_pupil_int_pos, silent = TRUE)
lml_pupil_null <- bridge_sampler(fit_pupil_int_null, silent = TRUE)
BF_att <- bridgesampling::bf(lml_pupil, lml_pupil_null)
BF_att
## Estimated Bayes factor in favor of lml_pupil over lml_pupil_null: 25.28019

We find that the data are 25.28 times more likely under a model that assumes a positive effect of load than under a model that assumes no effect.

15.5 Bayes factors in theory and in practice

15.5.1 Bayes factors in theory: Stability and accuracy

One question that we can ask here is how stable and accurate the estimates of Bayes factors are. The bridge sampling algorithm needs a lot of posterior samples to obtain stable estimates of the Bayes factor. Running bridge sampling based on too small an effective sample size will yield unstable estimates of the Bayes factor, such that repeated runs will yield radically different Bayes factor values. Moreover, even if the Bayes factor is approximated in a stable way, it is unclear whether this approximate Bayes factor is equal to the true Bayes factor, or whether there is bias in the computation, such that the approximate Bayes factor has a wrong value. We show this below.

15.5.1.1 Instability due to the effective number of posterior samples

The number of iterations, which in turn affects the total number of posterior samples, can have a strong impact on the stability of the results of the bridge sampling algorithm (i.e., on the resulting Bayes factor), and there are no good theoretical guarantees that bridge sampling will yield accurate values of Bayes factors. In the analyses presented above, we set the number of iterations to a very large number of \(n = 20000\). The sensitivity analysis therefore took a considerable amount of time. Indeed, the results from this analysis were stable, as shown below.

Running the same analysis with fewer iterations will induce some instability in the Bayes factor estimates based on bridge sampling, such that running the same analysis twice could yield different results for the Bayes factor. Furthermore, due to variation in starting values, bridge sampling itself may be unstable and yield inconsistent results for successive runs on the same posterior samples. This is highly relevant because if the effective sample size is not large enough, the results published by a lab may not be stable. Indeed, the default number of iterations in brms is set as iter = 2000 (and the default number of warmup iterations is warmup = 1000). These defaults were not selected to support bridge sampling, i.e., they were not designed for the computation of marginal likelihoods to support Bayes factors. Instead, they are valid for posterior inference on expectations (e.g., posterior means) for models that are not overly complex. However, when using these defaults for the estimation of marginal likelihoods and the computation of Bayes factors, instabilities can arise.

As an illustration, we perform the same sensitivity analysis again, now using the default number of \(2000\) iterations in brms. The posterior sampling process now runs much more quickly. Moreover, we check the stability of the Bayes factors from the sensitivity analyses by repeating both sensitivity analyses (with \(n = 20000\) iterations and with the default number of \(n = 2000\) iterations) a second time, to see whether the results for the Bayes factors are stable.


FIGURE 15.4: The effect of the number of samples on a prior sensitivity analysis with the Bayes factor. Black lines show 2 runs with 20,000 iterations. Grey lines show 20 runs with the default number of iterations (2000).

The results shown in Figure 15.4 demonstrate that the resulting Bayes factors are highly unstable when the number of iterations is low. They clearly deviate from the Bayes factors estimated with \(20000\) iterations, resulting in very unstable estimates. By contrast, the analyses using \(20000\) iterations provide nearly the same results in both analyses. The two lines lie virtually directly on top of each other; the points are jittered horizontally for better visibility.

This demonstrates that it is necessary to use a large number of iterations when computing Bayes factors using brms and bridge_sampler(). In practice, one should compute the sensitivity analysis (or at least one of the models or priors) twice (as we did here) to make sure that the results are robust and sufficiently similar, in order to provide a good foundation for reporting results.
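Why more posterior samples help can be illustrated with a deliberately simple toy: a naive Monte Carlo estimator of a marginal likelihood. This is not bridge sampling itself, but it shares the relevant property that the estimate is a sample average whose Monte Carlo error shrinks with the number of draws. The model and all numbers below are invented for illustration:

```python
import math
import random
import statistics

random.seed(1)

# Toy model: one observation y ~ Normal(theta, 1) with prior theta ~ Normal(0, 1).
# The marginal likelihood has the closed form Normal(y; 0, sqrt(2)), which lets
# us check the Monte Carlo estimates against the truth.
y = 1.0
true_ml = math.exp(-y ** 2 / 4) / math.sqrt(4 * math.pi)

def mc_marginal_likelihood(n_draws):
    """Naive Monte Carlo estimate: average the likelihood over prior draws."""
    total = 0.0
    for _ in range(n_draws):
        theta = random.gauss(0, 1)  # draw from the prior
        total += math.exp(-(y - theta) ** 2 / 2) / math.sqrt(2 * math.pi)
    return total / n_draws

# Repeat the estimator with few vs. many draws: estimates based on few draws
# scatter widely around the true value; with many draws they stabilize.
few = [mc_marginal_likelihood(50) for _ in range(20)]
many = [mc_marginal_likelihood(5000) for _ in range(20)]
print(statistics.stdev(few), statistics.stdev(many))
```

The spread of the repeated estimates shrinks roughly with the square root of the number of draws, which is the same qualitative reason that repeated bridge sampling runs agree only when the effective sample size is large.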

By contrast, Bayes factors based on the Savage-Dickey method (as implemented in brms) can be unstable even when using a large number of posterior samples. This problem can arise especially when the posterior is very far from zero, and thus very large or very small Bayes factors are obtained. Because of this instability of the Savage-Dickey method in brms, it is a good idea to use bridge sampling, and to check the stability of the estimates.

15.5.1.2 Inaccuracy of Bayes factor estimates: Does the estimate approximate the true Bayes factor well?

An important point about approximate estimates of Bayes factors using bridge sampling is that there are no strong guarantees for their accuracy. That is, even if we can show that the approximate Bayes factor estimate using bridge sampling is stable (i.e., when employing sufficiently many samples, see the analyses above), even then it remains unclear whether the Bayes factor estimate actually is close to the true Bayes factor. The stably estimated Bayes factors based on bridge sampling may, in theory, be biased, meaning that they may not be very close to the true (correct) Bayes factor. The technique of simulation-based calibration (SBC; Talts et al. 2018; Schad, Betancourt, and Vasishth 2020) can be used to investigate this question (SBC is also discussed in section 12.2 in chapter 12). We investigate this question next (for details, see Schad, Nicenboim, Bürkner, Betancourt, et al. 2022a).

In the SBC approach, the priors are used to simulate data. Then, posterior inference is performed on the simulated data, and the posterior can be compared to the prior. If the posteriors are equal to the priors, then this supports accurate computation. Applied to Bayes factor analyses, one defines a prior on the hypothesis space, i.e., one defines the prior probabilities for a null and an alternative model, specifying how likely each model is a priori. With these priors, one can randomly draw one hypothesis (model), e.g., \(nsim = 500\) times. Thus, in each of \(500\) draws one randomly chooses one model (either \(M0\) or \(M1\)), with the probabilities given by the model priors. For each draw, one first samples model parameters from their prior distributions, and then uses these sampled model parameters to simulate data. For each simulated data set, one can then compute marginal likelihoods and Bayes factor estimates using posterior samples and bridge sampling, and one can then compute the posterior probabilities for each hypothesis (i.e., how likely each model is a posteriori). As the next, and critical, step in SBC, one can then compare the posterior model probabilities to the prior model probabilities. A key result in SBC is that if the computation of marginal likelihoods and posterior model probabilities is performed accurately (without bias) by the bridge sampling procedure, that is, if the Bayes factor estimate is close to the true Bayes factor, then the posterior model probabilities should on average be the same as the prior model probabilities.
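The logic of this calibration check can be sketched with a toy example in which the marginal likelihoods are exact, so the check should pass by construction. The two models and all numbers below are invented for illustration and are not the chapter's brms models:

```python
import math
import random

random.seed(7)

def norm_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Two toy models for a single observation y with unit noise:
#   M0: theta fixed at 0           -> exact marginal likelihood Normal(y; 0, 1)
#   M1: theta ~ Normal(0, 1) prior -> exact marginal likelihood Normal(y; 0, sqrt(2))
prior_m0 = 0.5   # prior probability of the null model
n_sim = 2000
post_m0 = []
for _ in range(n_sim):
    # Draw a model from the model prior, then parameters, then data.
    theta = 0.0 if random.random() < prior_m0 else random.gauss(0, 1)
    y = random.gauss(theta, 1)
    ml0 = norm_pdf(y, 0, 1)
    ml1 = norm_pdf(y, 0, math.sqrt(2))
    post_m0.append(ml0 * prior_m0 / (ml0 * prior_m0 + ml1 * (1 - prior_m0)))

# With exact (unbiased) marginal likelihoods, the average posterior model
# probability recovers the prior model probability (here 0.5), up to
# simulation error.
print(sum(post_m0) / n_sim)
```

If the marginal likelihoods were computed by a biased procedure instead, the average posterior model probability would drift away from the prior probability, which is the signature of miscalibration that SBC detects.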

Here, we perform this SBC approach. Across the \(500\) simulations, we systematically vary the prior model probability from zero to one. For each of the \(500\) simulations we sample a model (hypothesis) from the model prior, then sample parameters from the priors over parameters, use the sampled parameters to simulate artificial data, fit the null and the alternative model to the simulated data, perform bridge sampling for each model, compute the Bayes factor estimate between them, and compute posterior model probabilities. If the bridge sampling works accurately, then the posterior model probabilities should be the same as the prior model probabilities. Given that we varied the prior model probabilities from zero to one, the posterior model probabilities should also vary from zero to one. In Figure 15.5, we plot the posterior model probabilities as a function of the prior model probabilities. If the posterior probabilities are the same as the priors, then the local regression line and the data points should lie on the diagonal.


FIGURE 15.5: The posterior probabilities for M0 are plotted as a function of the prior probabilities for M0. If the approximation of the Bayes factor using bridge sampling is unbiased, then the data should be aligned along the diagonal (see dashed black line). The thick black line is a prediction from a local regression analysis. The points are average posterior probabilities as a function of the a priori selected hypotheses for 50 simulation runs each. Error bars represent 95 percent confidence intervals.

The results of this analysis in Figure 15.5 show that the local regression line is very close to the diagonal, and that the data points (each summarizing results from 50 simulations, with means and confidence intervals) also lie close to the diagonal. This importantly demonstrates that the estimated posterior model probabilities are close to their a priori values. This result shows that posterior model probabilities, which are based on the Bayes factor estimates from bridge sampling, are unbiased for a large range of varied a priori model probabilities.

This result is very important, as it shows for this example case that the Bayes factor approximation is accurate. Importantly, however, this result is valid only for this one particular application case, i.e., for a particular data set, particular models, specific priors over the parameters, and a specific comparison between nested models. Strictly speaking, if one wants to be sure that the Bayes factor estimate is accurate for a particular data analysis, then such an SBC validation analysis would have to be computed for every data analysis. For details, including code, on how to perform such an SBC, see Schad, Nicenboim, Bürkner, Betancourt, et al. (2022a). However, the fact that the SBC yields such promising results for this first application case also gives some hope that the bridge sampling may be accurate for other comparable data analysis situations as well.

Based on these results on the average theoretical performance of Bayes factor estimation, we next turn to a different issue: how Bayes factors depend on and vary with varying data, potentially leading to bad performance in individual instances despite good average performance.

15.5.2 Bayes factors in practice: Variability with the data

15.5.2.1 Variability associated with the data (subjects, items, and residual noise)

A second, and very different, source limiting the robustness of Bayes factor estimates derives from the variability that is observed in the data, i.e., among subjects, items, and residual noise. Thus, repeating an experiment a second time in a replication study, with different subjects and items, will lead to different outcomes of the statistical analysis every time a new replication run is conducted. The “dance of \(p\)-values” (Cumming 2014) illustrates this well-known limit to robustness in frequentist analyses, where \(p\)-values are not consistently significant across repeated replication attempts. Instead, every time a study is repeated, the outcomes produce wildly different \(p\)-values. This can also be observed when simulating data from some known truth and re-running analyses on the simulated data sets.

Moreover, Bayesian analyses are subject to this same kind of variability (also see https://daniellakens.blogspot.com/2016/07/dance-of-bayes-factors.html). Here we show this type of variability in Bayes factor analyses by looking at a new example data analysis: we look at research on sentence comprehension, and specifically on effects of cue-based retrieval interference (Lewis and Vasishth 2005; Van Dyke and McElree 2011).

15.5.2.2 An example: The facilitatory interference effect

In the following, we examine experiments that looked at the cognitive processes behind a well-researched phenomenon in sentence comprehension. The agreement attraction configuration below serves as the example for this discussion. In it, the grammatically incorrect sentence (2) appears more grammatical than the equally grammatically incorrect sentence (1):

  1. The key to the cabinet are in the kitchen.
  2. The key to the cabinets are in the kitchen.

Both sentences are ungrammatical because the subject (“key”) does not agree with the verb in number (“are”). When compared to (1), sentences like (2) are frequently found to have shorter reading times at (or immediately after) the verb (“are”); for a meta-analysis, see Jäger, Engelmann, and Vasishth (2017). These shorter reading times are sometimes called “facilitatory interference” (Dillon 2011); facilitatory in this context simply refers to the fact that reading times at the relevant word are shorter in (2) compared to (1), without necessarily suggesting that processing is easier. One explanation for the shorter reading times is that there is an illusion of grammaticality because the attractor noun (cabinets in this case) agrees locally in number with the verb. This is an interesting phenomenon because the plural versus singular feature of the attractor noun (“cabinet/s”) is not on the subject, and so, under the rules of English grammar, is not supposed to agree with the number marking on the verb. That agreement attraction effects are consistently observed shows that some non-compositional processes are taking place.

Using a computational implementation (formulated in the ACT-R framework, Anderson et al. 2004), an account of agreement attraction effects in language processing explains how retrieval-based working memory mechanisms lead to such agreement attraction effects for ungrammatical sentences (Engelmann, Jäger, and Vasishth 2020; see also Hammerly, Staub, and Dillon 2019, and Yadav et al. 2022). Numerous studies have examined agreement attraction in grammatically incorrect sentences using comparable experimental setups and various dependent measures, including eye tracking and self-paced reading. It is generally believed to be a robust empirical phenomenon, and we choose it for analysis for this reason.

Here, we look at a self-paced reading study on agreement attraction in Spanish by Lago et al. (2015). For the experimental condition agreement attraction (x; sentence type), we estimate a population-level effect against a null model in which the sentence type population-level effect is not included. For the agreement attraction effect of sentence type, we use sum contrast coding (i.e., -1 and +1). We run a hierarchical model with the following formula in brms: rt ~ 1 + x + (1 + x | subj) + (1 + x | item), where rt is reading time, we have random variation associated with subjects and with items, and we assume that reading times follow a log-normal distribution: family = lognormal().

First, load the data:

data("df_lagoE1")
head(df_lagoE1)
##     subj item  rt int  x   expt
## 2     S1   I1 588  low -1 lagoE1
## 22    S1  I10 682 high  1 lagoE1
## 77    S1  I13 226  low -1 lagoE1
## 92    S1  I14 580 high  1 lagoE1
## 136   S1  I17 549  low -1 lagoE1
## 153   S1  I18 458 high  1 lagoE1

As a next step, we determine priors for the analysis of these data.

15.5.2.3 Determine priors using a meta-analysis

One good way to obtain priors for Bayesian analyses, and specifically for Bayes factor analyses, is to use results from meta-analyses on the topic. Here, we take the prior for the experimental manipulation of agreement attraction from a published meta-analysis (Jäger, Engelmann, and Vasishth 2017).51

The mean effect size (difference in reading time between the two experimental conditions) in the meta-analysis is \(-22\) milliseconds (ms), with \(95\% \;CI = [-36, -9]\) (Jäger, Engelmann, and Vasishth 2017, Table 4). This means that the target word (i.e., the verb) in sentences such as (2) is on average read \(22\) ms faster than in sentences such as (1). The magnitude of the effect is measured on the millisecond scale, assuming a normal distribution of effect sizes across studies.

However, individual reading times usually do not follow a normal distribution. Instead, a better assumption about the distribution of reading times is a log-normal distribution. This is what we will assume in the brms model. Therefore, to use the prior from the meta-analysis in the Bayesian analysis, we have to transform the prior values from the millisecond scale to the log millisecond scale.

We have performed this transformation in Schad, Nicenboim, Bürkner, Betancourt, et al. (2022a). Based on these computations, the prior for the experimental factor of interference effects is set to a normal distribution with mean \(= -0.03\) and standard deviation \(= 0.009\). For the other model parameters, we use principled priors.
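One way to sketch such a transformation (the exact computation is in Schad et al. 2022a; the version below makes the simplifying assumptions that the baseline reading time is exp(6), roughly 403 ms, matching the intercept prior used next, and that with +-1 sum coding the condition difference on the log scale equals twice the coefficient b):

```python
import math

# Assumed baseline reading time on the ms scale (our simplifying assumption):
base = math.exp(6)

def to_log_effect(delta_ms):
    """Coefficient b on the log-ms scale implied by a difference in ms,
    given +-1 sum coding (condition difference on the log scale = 2 * b)."""
    return (math.log(base + delta_ms) - math.log(base)) / 2

b_mean = to_log_effect(-22)                                # meta-analytic mean
b_lower, b_upper = to_log_effect(-36), to_log_effect(-9)   # 95% CI bounds
# Recover a normal prior's sd from the CI width: (upper - lower) / (2 * 1.96).
b_sd = (b_upper - b_lower) / (2 * 1.96)
print(round(b_mean, 3), round(b_sd, 3))  # roughly -0.028 and 0.009
```

Under these assumptions the computation lands close to the Normal(-0.03, 0.009) prior used in the text, which suggests the ballpark is right even though the published derivation may differ in detail.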

priors <- c(
  prior(normal(6, 0.5), class = Intercept),
  prior(normal(-0.03, 0.009), class = b),
  prior(normal(0, 0.5), class = sd),
  prior(normal(0, 1), class = sigma),
  prior(lkj(2), class = cor)
)

15.5.2.4 Run a hierarchical Bayesian analysis

Next, we run a brms model on the data. We use a large number of iterations (iter = 10000) for bridge sampling to estimate the Bayes factor of the “full” model, which contains a population-level effect for the experimental condition agreement attraction (x; i.e., sentence type). As mentioned above, for the agreement attraction effect of sentence type, we use sum contrast coding (i.e., \(-1\) and \(+1\)).

We first show the population-level effects from the posterior analysis:

fixef(m1_lagoE1)
##           Estimate Est.Error  Q2.5 Q97.5
## Intercept     6.02      0.06  5.90  6.13
## x            -0.03      0.01 -0.04 -0.01

The results show that for the population-level effect x, capturing the agreement attraction effect, the 95% credible interval does not overlap with zero. This indicates that there is some hint that the effect may have the expected negative direction, reflecting shorter reading times in the plural condition. As mentioned earlier, this does not provide a direct test of the hypothesis that the effect exists and is not zero. This is not tested here, because we did not specify the null hypothesis of a zero effect explicitly. We can, however, draw inferences about this null hypothesis by using the Bayes factor.

Compute Bayes factors between a full model, where the effect of agreement attraction is included, and a null model, where the effect of agreement attraction is absent, using the command bayes_factor(lml_m1_lagoE1, lml_m0_lagoE1). The Bayes factor \(BF_{10}\), i.e., the evidence for the alternative over the null, is computed by the function.

h_lagoE1$bf
## [1] 6.29

With a Bayes factor of about \(6\), the output indicates that the alternative model - which includes the population-level effect of agreement attraction - may have some merit. That is, this provides evidence for the alternative hypothesis that there is a difference between the experimental conditions, i.e., a facilitatory effect in the plural condition of the size derived from the meta-analysis.

As discussed earlier, the bayes_factor command should be run multiple times to check the stability of the Bayes factor estimate.

15.5.2.5 Variability of the Bayes factor: Posterior simulations

One way to study how variable the result of a Bayes factor analysis can be (given that the Bayes factor is computed in a stable and accurate way) is to run prior predictive simulations. A key question then is how to set priors that yield realistic simulated data sets. Here, we choose the priors based on the posterior of a previous, real, empirical data set. That is, one can use the posterior from the model above, and use this as a prior in subsequent prior predictive simulations. Computing the Bayes factor analysis repeatedly on the simulated data can provide some insight into how variable the Bayes factor will be.

We can take the Bayesian hierarchical model fitted to the data from Lago et al. (2015), and run posterior predictive simulations, effectively implementing prior predictive simulations with the prior informed by a previous posterior. In these simulations, one takes posterior samples for the model parameters (i.e., \(p(\boldsymbol{\Theta} \mid \boldsymbol{y})\)), and for each posterior sample of the model parameters, one can simulate new data \(\tilde{\boldsymbol{y}}\) from the model \(p(\tilde{\boldsymbol{y}} \mid \boldsymbol{\Theta})\).

pred_lagoE1 <- posterior_predict(m1_lagoE1)

The question that we are interested in here is how much information is contained in the simulated data. That is, we can run Bayesian models on the simulated data and compute Bayes factors to test whether in the simulated data there is evidence for agreement attraction effects. The interesting question is how variable the outcomes of these Bayes factor analyses will be among different simulated replications of the same study.

Now, for \(50\) different data sets simulated from the posterior/prior predictive distribution, we carry out this analysis. For each of these data sets, we can proceed in exactly the same way as we did for the original observed experimental data. That is, we again fit the same brms model \(50\) times, now to the simulated data, and using the same priors as before. For each simulated data set, we use bridge sampling to compute the Bayes factor of the alternative model compared to a null model where the agreement attraction effect (population-level effect parameter of sentence type, x) is set to \(0\). For each simulated predictive data set, we store the resulting Bayes factor. We again use the prior from the meta-analysis.

15.5.2.6 Visualize the distribution of Bayes factors

We can now visualize the distribution of Bayes factors (\(BF_{10}\)) across prior predictive simulations by plotting a histogram. In this histogram, values greater than one support the alternative model (M1), that there are agreement attraction effects (i.e., the sentence type effect differs from zero), and values of the Bayes factor less than one support the null model (M0), which states that there is no agreement attraction effect (i.e., there is no difference in reading times among experimental conditions).


FIGURE 15.6: Estimates of the retrieval interference facilitatory effect and the 95% credible intervals for all simulations (solid lines) and the empirically observed data (dashed line) are shown in the left panel. An illustration of the alternative model’s Bayes factors (BF10) over the null model in 50 simulated data sets is displayed in the right panel. The horizontal error bar displays 95% of all Bayes factors, the dashed line displays the Bayes factor computed from the empirical data, and the vertical solid black line displays equal evidence for both hypotheses.

The results show that the Bayes factors are quite variable. The Bayes factor results differ in that they either provide strong evidence for the alternative model (\(BF_{10} > 10\)) or moderate evidence for the null model (\(BF_{10} < 1/3\)), despite the fact that all data sets are simulated from the same posterior predictive distribution. The majority of the simulated data sets support the alternative model with moderate to weak evidence. In other words, this analysis reveals a “dance of the Bayes factors” for simulated repetitions of the same study, similar to the “dance of \(p\)-values” (Cumming 2014). The variability in these findings demonstrates a very important point that is not widely appreciated: the evidence we obtain from a particular Bayes factor analysis may not hold up if the same study is repeated. Just obtaining a large Bayes factor alone is not necessarily informative; the variability in the Bayes factor under (hypothetical or actual) repeated sampling needs to be considered as well.
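The qualitative pattern, wildly varying Bayes factors across replications simulated from the same truth, can be reproduced in a minimal toy model where the Bayes factor is available in closed form. All numbers below (effect size, noise, prior) are invented for illustration; note that, as in the chapter, the prior under M1 has a non-zero mean:

```python
import math
import random

random.seed(3)

def norm_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# 50 simulated replications of the same study: n = 30 observations, unit noise,
# true effect 0.3. M1 uses an informative prior centered on a non-zero value,
# Normal(0.3, 0.3); M0 fixes the effect at 0. For a normal model with known
# noise, the Bayes factor depends on the data only through the sample mean,
# so each replication reduces to one draw of that mean.
n, true_effect = 30, 0.3
se = 1.0 / math.sqrt(n)
bfs = []
for _ in range(50):
    ybar = random.gauss(true_effect, se)
    ml1 = norm_pdf(ybar, 0.3, math.sqrt(se ** 2 + 0.3 ** 2))  # marginal lik., M1
    ml0 = norm_pdf(ybar, 0.0, se)                             # likelihood, M0
    bfs.append(ml1 / ml0)

# Identical true state of the world, yet the evidence "dances" widely
# across replications.
print(min(bfs), max(bfs))
```

Even in this best-case setting with exact Bayes factors and a fixed true effect, sampling noise alone is enough to spread the evidence over more than an order of magnitude.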

Why are there variations in the Bayes factors among the simulated data sets? The difference between the two sentence types’ reading times, and thus the experimental effect from which we want to draw conclusions, could vary depending on the noise and uncertainty in the posterior predictive simulations. This is one obvious explanation for why the results could be so different. Plotting the Bayes factors from the simulated data sets against the estimated difference in simulated reading time between the two sentence types (as determined by the Bayesian model) hence provides an interesting perspective. In other words, we take the population-level effects of the Bayesian model and extract the estimated mean differences in reading times at the verb between plural and singular attractor conditions. Then, we plot the Bayes factor as a function of this difference (along with 95% credible intervals).


FIGURE 15.7: The Bayes factor (BF10) as a function of the estimate (with 95 percent credible intervals) of the facilitatory effect of retrieval interference across 50 simulated data sets. The prior is from a meta-analysis.

The findings (illustrated in Figure 15.7) demonstrate that there is considerable variability in the average difference in reading times between experimental conditions among posterior predictive simulations. This suggests that there is little information about the effect of interest in the experimental data and design. Of course, if the data are noisy, Bayes factor analyses based on the simulated data cannot be stable across simulations either. Indeed, as evident from Figure 15.7, one of the main factors influencing the Bayes factor computations is, in fact, the variation in mean reading times between experimental conditions (other model parameters do not show such a close association; Schad, Nicenboim, Bürkner, Betancourt, et al. 2022a).

The Bayes factor BF10 in Figure 15.7 increases with the difference in reading times, meaning that the more quickly the plural noun condition (in this case, “cabinets” in example sentence 2) is read in comparison to the singular noun condition (i.e., “cabinet”; example sentence 1), the stronger the evidence is in favor of the alternative model. Conversely, when the difference in reading times becomes less negative, that is, if the plural condition (sentence 2) is not read noticeably faster than the singular condition (sentence 1), the Bayes factor BF10 drops to values smaller than 1. Crucially, this behavior arises from the fact that we are using informative priors from the meta-analysis, where the agreement attraction effect’s prior mean has a negative value (i.e., a prior mean of \(-0.03\)) instead of being centered at a mean of zero. A null model of no effect can therefore be more consistent with reading time differences that are much more negative or more positive than the prior mean. This also leads to the surprising result that, compared to the much more variable Bayes factor results, the 95% credible intervals are fairly consistent and do not all overlap with zero. This should alert researchers who use the 95% credible interval to decide whether an effect is present or not, i.e., to make a discovery claim.

The precise question of whether the data provide more support for the effect size found in the meta-analysis than for the absence of any effect is addressed by computing Bayes factors with such a prior with a non-zero mean.

An important lesson to learn from this analysis is that Bayes factors can be quite variable for different data sets examining the same phenomenon. Individual data sets in the cognitive sciences often do not contain a lot of information about the phenomenon of interest, even when, as is the case here for agreement attraction, the phenomenon is thought to be relatively robust. For a more extensive investigation of how Bayes factors can vary with data, in both simulated and real replication studies, we refer the reader to Schad, Nicenboim, Bürkner, Betancourt, et al. (2022a) and Vasishth, Yadav, et al. (2022).

15.5.3 A cautionary note about Bayes factors

Just like frequentist \(p\)-values (Wasserstein and Lazar 2016), Bayes factors are easy to misuse and misinterpret, and have the potential to mislead the scientist if used in an automated manner. A recent article (Tendeiro et al. 2023) reviews many of the misuses of Bayes factor analyses in psychology and related areas. As discussed in this chapter, Bayes factors (and Bayesian analyses in general) require a great deal of care; there is no substitute for sensitivity analyses, and for the development of sensible priors. Using default priors and deriving black and white conclusions from Bayes factor analyses is never a good idea.

15.6 Sample size determination using Bayes factors

This part contains text adapted from Vasishth, Yadav, et al. (2022).

When planning a new experiment, it is possible to take what is essentially a frequentist approach to work out what sample size one would need in order to cross a certain Bayes factor threshold of evidence.

It may sound surprising to Bayesian modelers that sample size planning is even something to plan for: one of the many advantages of Bayesian modeling is that it is straightforward to run an experiment without necessarily specifying the sample size in advance (e.g., Spiegelhalter, Abrams, and Myles 2004). Indeed, in our own research, running an experiment until some precision criterion on the posterior distribution is reached (Freedman, Lowe, and Macaskill 1984; Spiegelhalter, Freedman, and Parmar 1994; Kruschke 2014; Kruschke and Liddell 2018) is our method of choice (Jäger et al. 2020; Vasishth et al. 2018; Stone et al. 2023). This approach is easy to implement if one has sufficient financial resources (and time) to keep running an experiment until a particular precision criterion is achieved.

However, even when planning a Bayesian analysis, there can be situations where one needs to determine the sample size in advance. One important case where this becomes necessary is when one applies for research funding. In a funding proposal, one naturally has to specify the sample size in advance in order to ask for the funds required for conducting the study. Other situations where sample size planning is needed are in the design of clinical trials, the design of replication studies, and when pre-registering experiments and/or preparing registered reports.

There exist good proposals on how to work out sample sizes in advance, specifically in the case of Bayesian analyses. For example, Wang and Gelfand (2002) aim to ensure that the researcher obtains strong evidence for the effect being estimated.

Here, we unpack the approach taken in Wang and Gelfand (2002). The approach is important because it provides an easy-to-implement workflow for performing sample size calculations with complex hierarchical models of the type we discuss in the present book.

The Wang and Gelfand (2002) approach is as follows. We have adapted the procedure described below slightly for our purposes, but the essential ideas are due to these authors.

  1. Decide on a distribution for the effect sizes you aim to detect.
  2. Choose a criterion that counts as a threshold for a decision. This can be a Bayes factor of, say, 10 (Jeffreys 1939).52
  3. Then do the following for increasing sample sizes \(n\):
    1. Simulate prior predictive data \(niter\) times (say, \(niter=100\)) for sample size \(n\); use informative priors (these are referred to as sampling priors in Wang and Gelfand 2002).
    2. Fit the model to the simulated data using uninformative priors (these are called fitting priors in Wang and Gelfand 2002), and derive the posterior distribution each time, or compute the Bayes factor using a null model that assumes a zero effect for the parameter of interest.
    3. Display, in a plot, the \(niter\) posterior distributions and the Bayes factors. If the chosen decision criterion is met reasonably well under repeated sampling for a given sample size, choose that sample size.
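The loop described above can be sketched in a few lines of R. The sketch below is illustrative only: instead of fitting a hierarchical model with brms and computing the Bayes factor by bridge sampling, it uses a simple normal model with known standard deviation, for which the Bayes factor against the point null \(\beta = 0\) is available in closed form. The sampling (design) prior, the fitting prior, and the decision threshold are all assumptions chosen for the example.

```r
# Sketch of the Wang & Gelfand (2002) loop for a one-sample normal design
# with known sigma; BF10 is computed analytically from the sample mean.
set.seed(123)
sigma <- 1       # known residual standard deviation
prior_sd <- 0.5  # fitting prior: beta ~ Normal(0, prior_sd)
bf10 <- function(y, sigma, prior_sd) {
  n <- length(y)
  # marginal density of the sample mean under H1: Normal(0, prior_sd^2 + sigma^2/n)
  m1 <- dnorm(mean(y), 0, sqrt(prior_sd^2 + sigma^2 / n))
  # marginal density of the sample mean under H0: Normal(0, sigma^2/n)
  m0 <- dnorm(mean(y), 0, sigma / sqrt(n))
  m1 / m0
}
for (n in c(50, 200, 800)) {
  bfs <- replicate(100, {
    beta <- rnorm(1, 0.25, 0.05)  # sampling (design) prior on the effect size
    y <- rnorm(n, beta, sigma)    # prior predictive data for sample size n
    bf10(y, sigma, prior_sd)
  })
  cat("n =", n, ": proportion BF10 > 10 =", mean(bfs > 10), "\n")
}
```

In a realistic application, the analytic `bf10` would be replaced by a brms model fit plus bridge sampling, but the structure of the loop, and the logic of choosing the smallest \(n\) at which the threshold is met reasonably often, stays the same.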

In Vasishth, Yadav, et al. (2022), we show how this approach can be adapted for the types of models we discuss in the present chapter.

15.7 Summary

Bayes factors are a very important tool in Bayesian data analysis. They allow the researcher to quantify the evidence in favor of certain effects in the data by comparing a full model, which contains a parameter corresponding to the effect of interest, with a null model, which does not contain that parameter. We saw that Bayes factor analyses can be highly sensitive to the priors specified for the parameters; this is true for the parameter corresponding to the effect of interest, but possibly also for the priors on other parameters in the model, such as the intercept. It is therefore very important to perform prior predictive checks to select good and plausible priors. Moreover, sensitivity analyses, where Bayes factors are investigated under differing prior assumptions, should be standardly reported in any analysis involving Bayes factors. We studied computational aspects of Bayes factors and saw that bridge sampling requires a very large effective sample size in order to obtain stable estimates of approximate Bayes factors. Therefore, one should always perform a Bayes factor analysis at least twice to ensure that the results are stable. Bridge sampling comes with no strong guarantees concerning its accuracy, and we saw that simulation-based calibration can be used to evaluate the accuracy of Bayes factor estimates. Last, we learned that Bayes factors can strongly vary with the data. In the cognitive sciences, the data are, even for relatively robust effects, often noisy due to small effect sizes and limited sample sizes. Therefore, the resulting Bayes factors can also strongly vary with the data. As a consequence, only large effect sizes, large sample sizes, and/or replication analyses can lead to reliable inferences from empirical data in the cognitive sciences.
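As an illustration of the kind of prior sensitivity analysis recommended above, the following sketch computes the Bayes factor \(BF_{10}\) analytically for a normal mean with known standard deviation, under several widths of the prior on the mean. The data and all prior settings are invented for this example; in practice the same loop would wrap a brms fit with bridge sampling.

```r
# Illustrative prior sensitivity analysis (all numbers are made up):
# BF10 for a normal mean with known sigma, computed analytically
# from the sample mean, for several widths of the prior on the mean.
set.seed(1)
n <- 40
sigma <- 1
y <- rnorm(n, mean = 0.2, sd = sigma)  # simulated data with a small effect
for (prior_sd in c(1, 0.1, 0.01)) {
  # marginal density of the sample mean under H1 and under H0
  m1 <- dnorm(mean(y), 0, sqrt(prior_sd^2 + sigma^2 / n))
  m0 <- dnorm(mean(y), 0, sigma / sqrt(n))
  cat("prior sd =", prior_sd, " BF10 =", round(m1 / m0, 3), "\n")
}
```

The point is not the specific numbers but that \(BF_{10}\) changes as the prior on the effect is widened or narrowed; this is why results for a range of priors should be reported.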

One topic that was not discussed in detail in this chapter is data aggregation. In repeated measures data, null hypothesis Bayes factor analyses can be performed on the raw data, i.e., without aggregation, by using Bayesian hierarchical models. In an alternative approach, the data are first aggregated by taking the mean per subject and condition, before running null hypothesis Bayes factor analyses on the aggregated data. Inferences/Bayes factors based on aggregated data can be biased when either (i) item variability is present in addition to subject variability, or (ii) the sphericity assumption (inherent in repeated measures ANOVA) is violated (Schad, Nicenboim, and Vasishth 2023). In these cases, aggregated analyses can lead to biased results and should not be used. By contrast, non-aggregated analyses are robust even in these cases and yield accurate Bayes factor estimates.
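To make the aggregation step concrete, here is a minimal sketch with simulated data (base R is used to keep it self-contained); the column names and numbers are invented for illustration.

```r
# Simulated repeated measures data: 4 subjects x 6 items x 2 conditions
set.seed(42)
dat <- expand.grid(subj = 1:4, item = 1:6, condition = c("a", "b"))
dat$RT <- rnorm(nrow(dat), mean = 400, sd = 50)
# Aggregation: one mean RT per subject-condition cell. Item-level
# variability is averaged away here, which is exactly what can bias
# the resulting Bayes factors when item variability is in fact present.
agg <- aggregate(RT ~ subj + condition, data = dat, FUN = mean)
nrow(dat)  # 48 raw observations
nrow(agg)  # 8 aggregated subject-condition cells
```

A hierarchical model fit to `dat` directly, with by-subject and by-item effects, avoids this loss of information.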

15.8 Further reading

A detailed explanation of how bridge sampling works can be found in Gronau et al. (2017), and more details about the bridgesampling package can be found in Gronau, Singmann, and Wagenmakers (2017). Wagenmakers et al. (2010) provide a comprehensive tutorial and the mathematical proof of the Savage-Dickey method; also see O’Hagan and Forster (2004). The package bayestestR (Makowski, Ben-Shachar, and Lüdecke 2019) can also be used for Bayes factor computations using the Savage-Dickey method. For a Bayes factor test calibrated to investigate replication success, see Verhagen and Wagenmakers (2014). A special issue on hierarchical modeling and Bayes factors appeared in the journal Computational Brain and Behavior in response to an article by van Doorn et al. (2021). Kruschke and Liddell (2018) discuss alternatives to Bayes factors for hypothesis testing. An argument against null hypothesis testing with Bayes factors appears in this blog post by Andrew Gelman: https://statmodeling.stat.columbia.edu/2019/09/10/i-hate-bayes-factors-when-theyre-used-for-null-hypothesis-significance-testing/. An argument in favor of null hypothesis testing with Bayes factors in principle (but assuming realistic effects) appears in: https://statmodeling.stat.columbia.edu/2018/03/10/incorporating-bayes-factor-understanding-scientific-information-replication-crisis/. A visualization of the distinction between Bayes factors and k-fold cross-validation appears in a blog post by Fabian Dablander, https://tinyurl.com/47n5cte4. Decision theory, which was only mentioned in passing in this chapter, is discussed in Parmigiani and Inoue (2009). Hypothesis testing in its different flavors is discussed in Robert (2022).
When planning studies, Bayes factor-based power calculations can be carried out; an example of such power computations for the setting of psycholinguistics, using the software tools discussed in the present book, is Vasishth, Yadav, et al. (2022) (also see the references cited there).

15.9 Exercises

Exercise 15.1 Is there evidence for differences in the effect of cloze probability among the subjects?

Use Bayes factors to compare the log cloze probability model that we examined in section 15.2.2 with a similar model that incorporates the assumption of no difference between subjects for the effect of cloze probability (\(\tau_{u_2}=0\)).

Exercise 15.2 Is there evidence for the claim that English subject relative clauses are easier to process than object relative clauses?

Consider again the reading time data from Experiment 1 of Grodner and Gibson (2005), presented in exercise 5.2:

data("df_gg05_rc")
df_gg05_rc
## # A tibble: 672 × 7
##    subj  item condition    RT residRT qcorrect experiment
##   <int> <int> <chr>     <int>   <dbl>    <int> <chr>     
## 1     1     1 objgap      320   -21.4        0 tedrg3    
## 2     1     2 subjgap     424    74.7        1 tedrg2    
## 3     1     3 objgap      309   -40.3        0 tedrg3    
## # ℹ 669 more rows

As in exercise 5.2, you should use sum coding for the predictor. Here, object relative clauses ("objgap") are coded \(+1\), and subject relative clauses ("subjgap") as \(-1\).

df_gg05_rc <- df_gg05_rc %>%
  mutate(c_cond = if_else(condition == "objgap", 1, -1))

Using the Bayes factor function shown in this chapter, quantify the evidence against the null model (no population-level reading time difference between SRC and ORC) relative to the following alternative models:

  1. \(\beta \sim \mathit{Normal}(0, 1)\)
  2. \(\beta \sim \mathit{Normal}(0, .1)\)
  3. \(\beta \sim \mathit{Normal}(0, .01)\)
  4. \(\beta \sim \mathit{Normal}_+(0, 1)\)
  5. \(\beta \sim \mathit{Normal}_+(0, .1)\)
  6. \(\beta \sim \mathit{Normal}_+(0, .01)\)

(A \(\mathit{Normal}_+(.)\) prior can be set in brms by defining a lower boundary of \(0\), with the argument lb = 0.)
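For instance, assuming brms is loaded, a \(\mathit{Normal}_+(0, .1)\) prior on the population-level slope could be specified as follows (the class name `b` is brms's convention for population-level effects); this is a configuration sketch, not a full model fit:

```r
library(brms)
# Normal_+(0, 0.1) prior on the population-level slope:
# the lower bound lb = 0 truncates the normal distribution at zero
prior_pos <- set_prior("normal(0, 0.1)", class = "b", lb = 0)
```

The resulting prior object can then be passed to `brm()` via its `prior` argument.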

What are the Bayes factors in favor of the alternative models 1-6 above, compared to the null model?

Now carry out a standard frequentist likelihood ratio test using the anova function that works with models fit with the lmer function (from the package lme4). The commands for carrying out this comparison would be:

library(lme4)
m_full <- lmer(log(RT) ~ c_cond +
                 (c_cond || subj) + (c_cond || item),
               df_gg05_rc)
m_null <- lmer(log(RT) ~ 1 + (c_cond || subj) + (c_cond || item),
               df_gg05_rc)
anova(m_null, m_full)

How do the conclusions from the Bayes factor analyses compare with the conclusion we obtain from the frequentist model comparison?

Exercise 15.3 Is there evidence for the claim that sentences with subject relative clauses are easier to comprehend?

Consider now the question-response accuracy data of Experiment 1 of Grodner and Gibson (2005).

  1. Compare a model that assumes that RC type affects question accuracy at the population level, with the effect varying by subjects and by items, with a null model that assumes that there is no population-level effect present.
  2. Compare a model that assumes that RC type affects question accuracy at the population level, with the effect varying by subjects and by items, with another null model that assumes that there is no population-level or group-level effect present, that is, no by-subject or by-item effects. What is the meaning of the results of the Bayes factor analysis?

Assume that for the effect of RC type on question accuracy, \(\beta \sim \mathit{Normal}(0, .1)\) is a reasonable prior, and that for all the variance components, the same prior, \(\tau \sim \mathit{Normal}_{+}(0, 1)\), is a reasonable prior.

Exercise 15.4 Bayes factor and bounded parameters using Stan.

Re-fit the data of a single subject pressing a button repeatedly (data("df_spacebar") from section 4.2), coding the model in Stan.

Start by assuming the following likelihood and priors:

\[\begin{equation} rt_n \sim \mathit{LogNormal}(\alpha + c\_trial_n \cdot \beta,\sigma) \end{equation}\]

\[\begin{equation} \begin{aligned} \alpha &\sim \mathit{Normal}(6, 1.5) \\ \beta &\sim \mathit{Normal}_+(0, .1)\\ \sigma &\sim \mathit{Normal}_+(0, 1) \end{aligned} \end{equation}\]

Use Bayes factors to answer the following questions:

  1. Is there evidence for an effect of trial number in comparison with no effect?
  2. Is there evidence for a positive effect of trial number (as the subject advances in the experiment, they slow down) in comparison with no effect?
  3. Is there evidence for a negative effect of trial number (as the subject advances in the experiment, they speed up) in comparison with no effect?
  4. Is there evidence for a positive effect of trial number in comparison with a negative effect?

(Expect very large Bayes factors in this exercise.)

References

Anderson, John R., Dan Bothell, Michael D. Byrne, Scott Douglass, Christian Lebiere, and Yulin Qin. 2004. “An Integrated Theory of the Mind.” Psychological Review 111 (4): 1036–60.

Bennett, Charles H. 1976. “Efficient Estimation of Free Energy Differences from Monte Carlo Data.” Journal of Computational Physics 22 (2): 245–68. https://doi.org/10.1016/0021-9991(76)90078-4.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Chen, Stanley F., and Joshua Goodman. 1999. “An Empirical Study of Smoothing Techniques for Language Modeling.” Computer Speech & Language 13 (4): 359–94. https://doi.org/10.1006/csla.1999.0128.

Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29.

Dickey, James M., and B. P. Lientz. 1970. “The Weighted Likelihood Ratio, Sharp Hypotheses About Chances, the Order of a Markov Chain.” The Annals of Mathematical Statistics 41 (1): 214–26. https://www.jstor.org/stable/2239734.

Dillon, Brian W. 2011. “Structured Access in Sentence Comprehension.” PhD thesis, University of Maryland.

Engelmann, Felix, Lena A. Jäger, and Shravan Vasishth. 2020. “The Effect of Prominence and Cue Association in Retrieval Processes: A Computational Account.” Cognitive Science 43 (12): e12800. https://doi.org/10.1111/cogs.12800.

Freedman, Laurence S., D. Lowe, and P. Macaskill. 1984. “Stopping Rules for Clinical Trials Incorporating Clinical Opinion.” Biometrics 40 (3): 575–86.

Gelman, Andrew, and John B. Carlin. 2014. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” Perspectives on Psychological Science 9 (6): 641–51. https://doi.org/10.1177/1745691614551642.

Grodner, Daniel, and Edward Gibson. 2005. “Consequences of the Serial Nature of Linguistic Input for Sentential Complexity.” Cognitive Science 29: 261–90. https://doi.org/10.1207/s15516709cog0000_7.

Gronau, Quentin F., Alexandra Sarafoglou, Dora Matzke, Alexander Ly, Udo Boehm, Maarten Marsman, David S. Leslie, Jonathan J. Forster, Eric-Jan Wagenmakers, and Helen Steingroever. 2017. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81: 80–97. https://doi.org/10.1016/j.jmp.2017.09.005.

Gronau, Quentin F., Henrik Singmann, and Eric-Jan Wagenmakers. 2017. “bridgesampling: An R Package for Estimating Normalizing Constants.” arXiv preprint. http://arxiv.org/abs/1710.08162.

Hammerly, Christopher, Adrian Staub, and Brian W. Dillon. 2019. “The Grammaticality Asymmetry in Agreement Attraction Reflects Response Bias: Experimental and Modeling Evidence.” Cognitive Psychology 110: 70–104. https://doi.org/10.1016/j.cogpsych.2019.01.001.

Jäger, Lena A., Felix Engelmann, and Shravan Vasishth. 2017. “Similarity-Based Interference in Sentence Comprehension: Literature Review and Bayesian Meta-Analysis.” Journal of Memory and Language 94: 316–39. https://doi.org/10.1016/j.jml.2017.01.004.

Jäger, Lena A., Daniela Mertzen, Julie A. Van Dyke, and Shravan Vasishth. 2020. “Interference Patterns in Subject-Verb Agreement and Reflexives Revisited: A Large-Sample Study.” Journal of Memory and Language 111. https://doi.org/10.1016/j.jml.2019.104063.

Jeffreys, Harold. 1939. Theory of Probability. Oxford: Clarendon Press.

Kass, Robert E., and Adrian E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95. https://doi.org/10.1080/01621459.1995.10476572.

Kruschke, John K. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press.

Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.

Lago, Sol, Diego Shalom, Mariano Sigman, Ellen F. Lau, and Colin Phillips. 2015. “Agreement Attraction in Spanish Comprehension.” Journal of Memory and Language 82: 133–49. https://doi.org/10.1016/j.jml.2015.02.002.

Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval.” Cognitive Science 29: 1–45. https://doi.org/10.1207/s15516709cog0000_25.

Lidstone, George James. 1920. “Note on the General Case of the Bayes-Laplace Formula for Inductive or a Posteriori Probabilities.” Transactions of the Faculty of Actuaries 8 (182-192): 13.

MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge University Press.

Makowski, Dominique, Mattan S. Ben-Shachar, and Daniel Lüdecke. 2019. “bayestestR: Describing Effects and Their Uncertainty, Existence and Significance Within the Bayesian Framework.” Journal of Open Source Software 4 (40): 1541. https://doi.org/10.21105/joss.01541.

Meng, Xiao-Li, and Wing Hung Wong. 1996. “Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration.” Statistica Sinica, 831–60. http://www.jstor.org/stable/24306045.

Navarro, Danielle J. 2015. Learning Statistics with R. https://learningstatisticswithr.com.

Nicenboim, Bruno, Shravan Vasishth, and Frank Rösler. 2020. “Are Words Pre-Activated Probabilistically During Sentence Comprehension? Evidence from New Data and a Bayesian Random-Effects Meta-Analysis Using Publicly Available Data.” Neuropsychologia 142. https://doi.org/10.1016/j.neuropsychologia.2020.107427.

Nieuwland, Mante S., Stephen Politzer-Ahles, Evelien Heyselaar, Katrien Segaert, Emily Darley, Nina Kazanina, Sarah Von Grebmer Zu Wolfsthurn, et al. 2018. “Large-Scale Replication Study Reveals a Limit on Probabilistic Prediction in Language Comprehension.” eLife 7. https://doi.org/10.7554/eLife.33468.

O’Hagan, Anthony, Caitlin E. Buck, Alireza Daneshkhah, J. Richard Eiser, Paul H. Garthwaite, David J. Jenkinson, Jeremy E. Oakley, and Tim Rakow. 2006. Uncertain Judgements: Eliciting Experts’ Probabilities. John Wiley & Sons.

O’Hagan, Anthony, and Jonathan J. Forster. 2004. Kendall’s Advanced Theory of Statistics, Vol. 2B: Bayesian Inference. Wiley.

Parmigiani, Giovanni, and Lurdes Inoue. 2009. Decision Theory: Principles and Approaches. John Wiley & Sons.

Robert, Christian P. 2022. “50 Shades of Bayesian Testing of Hypotheses.” arXiv preprint arXiv:2206.06659.

Rouder, Jeffrey N., Julia M. Haaf, and Joachim Vandekerckhove. 2018. “Bayesian Inference for Psychology, Part IV: Parameter Estimation and Bayes Factors.” Psychonomic Bulletin & Review 25 (1): 102–13. https://doi.org/10.3758/s13423-017-1420-7.

Rouder, Jeffrey N., Paul L. Speckman, Dongchu Sun, Richard D. Morey, and Geoffrey Iverson. 2009. “Bayesian t Tests for Accepting and Rejecting the Null Hypothesis.” Psychonomic Bulletin & Review 16 (2): 225–37. https://doi.org/10.3758/PBR.16.2.225.

Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman and Hall, CRC Press.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2019. “Toward a Principled Bayesian Workflow in Cognitive Science.” arXiv preprint. https://doi.org/10.48550/ARXIV.1904.12765.

Schad, Daniel J., Michael J. Betancourt, and Shravan Vasishth. 2020. “Toward a Principled Bayesian Workflow in Cognitive Science.” Psychological Methods 26 (1): 103–26. https://doi.org/10.1037/met0000275.

Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael J. Betancourt, and Shravan Vasishth. 2022a. “Workflow Techniques for the Robust Use of Bayes Factors.” Psychological Methods. https://doi.org/10.1037/met0000472.

Schad, Daniel J., Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, and Shravan Vasishth. 2022b. “Workflow Techniques for the Robust Use of Bayes Factors.” Psychological Methods. https://doi.org/10.1037/met0000472.

Schad, Daniel J., Bruno Nicenboim, and Shravan Vasishth. 2023. “Data Aggregation Can Lead to Biased Inferences in Bayesian Linear Mixed Models and Bayesian Analysis of Variance.” Psychological Methods. https://doi.org/10.1037/met0000621.

Schönbrodt, Felix D., and Eric-Jan Wagenmakers. 2018. “Bayes Factor Design Analysis: Planning for Compelling Evidence.” Psychonomic Bulletin & Review 25 (1): 128–42.

Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons.

Spiegelhalter, David J., Laurence S. Freedman, and Mahesh K. B. Parmar. 1994. “Bayesian Approaches to Randomized Trials.” Journal of the Royal Statistical Society. Series A (Statistics in Society) 157 (3): 357–416.

Stone, Kate, Bruno Nicenboim, Shravan Vasishth, and Frank Rösler. 2023. “Understanding the Effects of Constraint and Predictability in ERP.” Neurobiology of Language. https://doi.org/10.1162/nol_a_00094.

Talts, Sean, Michael J. Betancourt, Daniel P. Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” arXiv preprint arXiv:1804.06788.

Tendeiro, Jorge N., Henk A. L. Kiers, Rink Hoekstra, Tsz Keung Wong, and Richard D. Morey. 2023. “Diagnosing the Misuse of the Bayes Factor in Applied Research.” Advances in Methods and Practices in Psychological Science.

van Doorn, Johnny, Frederik Aust, Julia M. Haaf, Angelika Stefan, and Eric-Jan Wagenmakers. 2021. “Bayes Factors for Mixed Models.” Computational Brain and Behavior. https://doi.org/10.1007/s42113-021-00113-2.

Van Dyke, Julie A., and Brian McElree. 2011. “Cue-Dependent Interference in Comprehension.” Journal of Memory and Language 65 (3): 247–63.

Vasishth, Shravan, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman. 2018. “The Statistical Significance Filter Leads to Overoptimistic Expectations of Replicability.” Journal of Memory and Language 103: 151–75. https://doi.org/10.1016/j.jml.2018.07.004.

Vasishth, Shravan, Himanshu Yadav, Daniel J. Schad, and Bruno Nicenboim. 2022. “Sample Size Determination for Bayesian Hierarchical Models Commonly Used in Psycholinguistics.” Computational Brain and Behavior.

Verhagen, Josine, and Eric-Jan Wagenmakers. 2014. “Bayesian Tests to Quantify the Result of a Replication Attempt.” Journal of Experimental Psychology: General 143 (4): 1457–75. https://doi.org/10.1037/a0036731.

Wagenmakers, Eric-Jan, Michael D. Lee, Jeffrey N. Rouder, and Richard D. Morey. 2020. “The Principle of Predictive Irrelevance or Why Intervals Should Not Be Used for Model Comparison Featuring a Point Null Hypothesis.” In The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings, edited by Craig W. Gruber, 111–29. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-48043-1_8.

Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul P. P. P. Grasman. 2010. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60 (3): 158–89.

Wang, Fei, and Alan E. Gelfand. 2002. “A Simulation-Based Approach to Bayesian Sample Size Determination for Performance Under a Given Model and for Separating Models.” Statistical Science, 193–208.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.


  1. Given that the posterior is analytically available for beta-distributed priors with the binomial distribution, we could alternatively compute the posterior first, and then integrate out the probability \(\theta\).↩︎

  2. This meta-analysis already contains the data that we want to make inferences about; thus, the meta-analysis estimate is not really the right estimate to use, since it involves using the data twice. We ignore this detail here because our goal is simply to illustrate the approach.↩︎

  3. The Bayes factor is just one of many possible performance criteria; see Wang and Gelfand (2002) for some other alternatives.↩︎