Mixed Models and Smoothing

The LMMsolver package

The aim of the LMMsolver package is to provide an efficient and flexible system to estimate variance components using restricted maximum likelihood or REML (Patterson and Thompson 1971), for models where the mixed model equations are sparse. An important feature of the package is smoothing with P-splines (Eilers and Marx 1996). The sparse mixed model P-splines formulation (Boer 2023) is used, which makes the computations fast.

A Linear Mixed Model (LMM) has the form

y = Xβ + Zu + e,  u ∼ N(0, G),  e ∼ N(0, R) , where y is a vector of observations, β is a vector with the fixed effects, u is a vector with the random effects, and e a vector of random residuals. X and Z are design matrices.

The LMMsolver package can fit models where the matrices G−1 and R−1 are a linear combination of precision matrices QG, i and QR, i: G−1 = ∑iψiQG, i ,  R−1 = ∑iϕiQR, i , where the precision parameters ψi and ϕi are estimated using REML. For most standard mixed models 1/ψi are the variance components and 1/ϕi the residual variances. We use a formulation in terms of precision parameters to allow for non-standard mixed models using tensor product splines introduced in Rodríguez-Álvarez et al. (2015).

If the matrices G−1 and R−1 are sparse, the mixed model equations can be solved using efficient sparse matrix algebra implemented in the spam package (Furrer and Sain 2010). To calculate the derivatives of the log-likelihood in an efficient way, the automatic differentiation of the Cholesky matrix (Smith 1995; Boer 2023) was implemented in C++ using the Rcpp package (Eddelbuettel and Balamuta 2018).

Introduction

The purpose of this section is to give users an easy introduction, starting from simple linear regression. Based on simulations we will explain the main functions, the input and the output. First we load the LMMsolver and ggplot2 packages:

library(LMMsolver)
library(ggplot2)

Linear Regression

We will start with a simple example where the true function is linear in variable x:

  f1 <- function(x) { 0.6 + 0.7*x}

Using this function we simulate data and add normal distributed noise:

set.seed(2016)
n <- 25
x <- seq(0, 1, length = n)
sigma2e <- 0.04
y <- f1(x) + rnorm(n, sd = sqrt(sigma2e))
dat1 <- data.frame(x = x, y = y)

We can fit the data using the LMMsolve function:

obj1 <- LMMsolve(fixed = y ~ x, data = dat1)

We can make predictions using the predict() function:

newdat <- data.frame(x = seq(0, 1, length = 300))
pred1 <- predict(obj1, newdata = newdat, se.fit = TRUE)
# adding the true values for comparison
pred1$y_true <- f1(pred1$x)

Note that for this linear model we could have used the standard lm() function, which will give the same result.

The following plot gives the simulated data with the predictions, and pointwise standard-error bands. The true value is plotted as dashed red line.

ggplot(data = dat1, aes(x = x, y = y)) +
  geom_point(col = "black", size = 1.5) +
  geom_line(data = pred1, aes(y=y_true), color = "red", 
            linewidth = 1, linetype = "dashed") +
  geom_line(data = pred1, aes(y = ypred), color = "blue", linewidth = 1) +
  geom_ribbon(data = pred1, aes(x=x,ymin = ypred-2*se, ymax = ypred+2*se),
              alpha = 0.2, inherit.aes = FALSE) + 
  theme_bw()

Fitting a non-linear function

In this section we will use the following non-linear function for the simulations:

f2 <- function(x) { 0.3 + 0.4*x + 0.2*sin(20*x) }

The simulated data is generated by the following code

set.seed(12)
n <- 150
x <- seq(0, 1, length = n)
sigma2e <- 0.04
y <- f2(x) + rnorm(n, sd = sqrt(sigma2e))
dat2 <- data.frame(x, y)

We can use the spline argument to fit the non-linear trend:

obj2 <- LMMsolve(fixed = y ~ 1, 
                 spline = ~spl1D(x, nseg = 50), 
                 data = dat2)

where spl1D(x, nseg = 50) defines a mixed model P-splines with 50 segments.

The model fit can be summarized in terms of effective dimensions:

summary(obj2)
#> Table with effective dimensions and penalties: 
#> 
#>         Term Effective Model Nominal Ratio Penalty
#>  (Intercept)      1.00     1       1  1.00     0.0
#>       lin(x)      1.00     1       1  1.00     0.0
#>         s(x)     11.28    53      51  0.22     0.0
#>     residual    136.72   150     148  0.92    30.3
#> 
#>  Total Effective Dimension: 150

The intercept and the slope lin(x) define the linear (or fixed) part of the model, the non-linear (or random) part is defined by s(x), with effective dimension 11.28.

Making predictions on the interval [0, 1] and plotting can be done in the same way as for the linear regression example:

newdat <- data.frame(x = seq(0, 1, length = 300))
pred2 <- predict(obj2, newdata = newdat, se.fit = TRUE)
pred2$y_true <- f2(pred2$x)

ggplot(data = dat2, aes(x = x, y = y)) +
  geom_point(col = "black", size = 1.5) +
  geom_line(data = pred2, aes(y = y_true), color = "red", 
            linewidth = 1, linetype ="dashed") +
  geom_line(data = pred2, aes(y = ypred), color = "blue", linewidth = 1) +
  geom_ribbon(data= pred2, aes(x=x, ymin = ypred-2*se, ymax = ypred+2*se),
              alpha=0.2, inherit.aes = FALSE) +
  theme_bw() 

Non-Gaussian data

The LMMsolver package can also be used for non-gaussian data, using the family argument, with default family = gaussian. As an example we use count data using the Poisson distribution, defined by $$ Pr(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \;, $$ with parameter λ > 0 and k is the number of occurrences. More general, the value of the parameter lambda can depend on another variable time x, for example time. Here we will assume that x is defined on the interval [0, 1] and defined by:

λ(x) = 4 + 3x + 4sin (7x) Using this function we simulate the following data

set.seed(1234)
n <- 150
x <- seq(0, 1, length=n)
fun_lambda <- function(x) { 4 + 3*x + 4*sin(7*x) }
x <- seq(0, 1, length = n)
y <- rpois(n = n, lambda = fun_lambda(x)) 
dat3 <- data.frame(x = x, y = y)

Now we fit the data with the argument family = poisson():

obj3 <- LMMsolve(fixed = y ~ 1, 
                spline = ~spl1D(x, nseg = 50), 
                family = poisson(), 
                data = dat3)
summary(obj3)
#> Table with effective dimensions and penalties: 
#> 
#>         Term Effective Model Nominal Ratio Penalty
#>  (Intercept)      1.00     1       1  1.00       0
#>       lin(x)      1.00     1       1  1.00       0
#>         s(x)      6.54    53      51  0.13       0
#>     residual    141.46   150     148  0.96       1
#> 
#>  Total Effective Dimension: 150

Making predictions and plotting the data is similar to the Gaussian data we showed before:

newdat <- data.frame(x = seq(0, 1, length = 300))
pred3 <- predict(obj3, newdata = newdat, se.fit = TRUE)
pred3$y_true <- fun_lambda(pred3$x)

ggplot(data = dat3, aes(x = x, y = y)) +
  geom_point(col = "black", size = 1.5) +
  geom_line(data = pred3, aes(y = y_true), color = "red", 
            linewidth = 1, linetype ="dashed") +
  geom_line(data = pred3, aes(y = ypred), color = "blue", linewidth = 1) +
  geom_ribbon(data= pred3, aes(x=x, ymin = ypred-2*se, ymax = ypred+2*se),
              alpha=0.2, inherit.aes = FALSE) +
  theme_bw() 

Smoothing combining two experiments

In this section we will give a bit more complicated example, to show some further options of LMMsolver. Suppose there are two experiments, A and B, with the same true unobserved non-linear function f2(x) as defined before.

The simulated data is given by the following code:

set.seed(1234)
nA <-  50
nB <- 100
mu_A <-  0.10
mu_B <- -0.10
sigma2e_A <- 0.04
sigma2e_B <- 0.10

x1 <- runif(n = nA)
x2 <- runif(n = nB)
y1 <- f2(x1) + rnorm(nA, sd = sqrt(sigma2e_A)) + mu_A
y2 <- f2(x2) + rnorm(nB, sd = sqrt(sigma2e_B)) + mu_B
Experiment <- as.factor(c(rep("A", nA), rep("B", nB)))
dat4 <- data.frame(x = c(x1, x2), y = c(y1,y2), Experiment = Experiment)

Before analyzing the data in further detail a boxplot gives some insight:

ggplot(dat4, aes(x = Experiment, y = y, fill = Experiment)) +  
  geom_boxplot() + 
  geom_point(position = position_jitterdodge(), alpha = 0.3) 

Comparing the two experiments we can see that:

  1. There is a clear difference in mean/median between the two experiments. This can be corrected for by adding the argument random = ~Experiment.
  2. The variance in experiment A is smaller than in B. This implies that is important to allow for heterogeneous variances which can be modelled by defining residual = ~Experiment.

The model in LMMsolve() is given by:

obj4 <- LMMsolve(fixed= y ~ 1, 
                 spline = ~spl1D(x, nseg = 50),
                 random = ~Experiment,
                 residual = ~Experiment,
                 data = dat4)

The table of effective dimensions is given by:

summary(obj4)
#> Table with effective dimensions and penalties: 
#> 
#>            Term Effective Model Nominal Ratio Penalty
#>     (Intercept)      1.00     1       1  1.00    0.00
#>          lin(x)      1.00     1       1  1.00    0.00
#>      Experiment      0.93     2       1  0.93   77.98
#>            s(x)      7.89    53      51  0.15    0.00
#>  Experiment_A!R     43.65    50      50  0.87   32.15
#>  Experiment_B!R     95.52   100     100  0.96    9.01
#> 
#>  Total Effective Dimension: 150

And making the predictions:

newdat <- data.frame(x=seq(0, 1, length = 300))
pred4 <- predict(obj4, newdata = newdat, se.fit = TRUE)
pred4$y_true <- f2(pred4$x)
ggplot(data = dat4, aes(x = x, y = y, colour = Experiment)) +
  geom_point(size = 1.5) +
  geom_line(data = pred4, aes(y = y_true), color="red", 
            linewidth = 1, linetype = "dashed") +
  geom_line(data = pred4, aes(y = ypred), color = "blue", linewidth = 1) +
  geom_ribbon(data = pred4, aes(x = x,ymin = ypred-2*se, ymax = ypred+2*se),
              alpha = 0.2, inherit.aes = FALSE) + 
  theme_bw()

The estimated random effects for Experiment can be obtained using the coef() function:

coef(obj4)$Experiment
#> Experiment_A Experiment_B 
#>   0.07719336  -0.07719336

The sum of the effects is equal to zero, as expected for a standard random term.

Examples from Quantitative Genetics

In this section we will show some examples from quantitative genetics, to illustrate some further options of the package.

Oats field trial

As a first example we will use an oats field trial from the agridat package. There were 24 varieties in 3 replicates, each consisting of 6 incomplete blocks of 4 plots. The plots were laid out in a single row.

## Load data.
data(john.alpha, package = "agridat")
head(john.alpha)
#>   plot rep block gen  yield row col
#> 1    1  R1    B1 G11 4.1172   1   1
#> 2    2  R1    B1 G04 4.4461   2   1
#> 3    3  R1    B1 G05 5.8757   3   1
#> 4    4  R1    B1 G22 4.5784   4   1
#> 5    5  R1    B2 G21 4.6540   5   1
#> 6    6  R1    B2 G10 4.1736   6   1

We will use the Linear Variance (LV) model, which is closely connected to the P-splines model (Boer, Piepho, and Williams 2020). First we need to define the precision matrix for the LV model, see Appendix in Boer, Piepho, and Williams (2020) for details:

## Add plot as factor.
john.alpha$plotF <- as.factor(john.alpha$plot)
## Define the precision matrix, see eqn (A2) in Boer et al (2020).
N <- nrow(john.alpha)
cN <- c(1 / sqrt(N - 1), rep(0, N - 2), 1 / sqrt(N - 1))
D <- diff(diag(N), diff = 1)
Delta <- 0.5 * crossprod(D)
LVinv <- 0.5 * (2 * Delta + cN %*% t(cN))
## Add LVinv to list, with name corresponding to random term.
lGinv <- list(plotF = LVinv)

Given the precision matrix for the LV model we can define the model in LMMsolve using the random and ginverse arguments:

obj7 <- LMMsolve(fixed = yield ~ rep + gen,
                 random = ~plotF, 
                 ginverse = lGinv, 
                 data = john.alpha)

The absolute deviance (−2 * logL) and variances for the LV-model are

round(deviance(obj7, relative = FALSE), 2)
#> [1] 54.49
summary(obj7, which = "variances")
#> Table with variances: 
#> 
#>   VarComp Variance
#>     plotF     0.01
#>  residual     0.06

as reported in Boer, Piepho, and Williams (2020), Table 1.

Model biomass as function of time.

In this section we show an example of mixed model P-splines to fit biomass as function of time. As an example we use wheat data simulated with the crop growth model APSIM. This data set is included in the package. For details on this simulated data see Bustos-Korts et al. (2019).

data(APSIMdat)
head(APSIMdat)
#>            env geno das   biomass
#> 1 Emerald_1993 g001  20  65.57075
#> 2 Emerald_1993 g001  21  60.70499
#> 3 Emerald_1993 g001  22  74.06247
#> 4 Emerald_1993 g001  23  63.73951
#> 5 Emerald_1993 g001  24 101.88005
#> 6 Emerald_1993 g001  25  96.84971

The first column is the environment, Emerald in 1993, the second column the simulated genotype (g001), the third column is days after sowing (das), and the last column is the simulated biomass with medium measurement error.

The model can be fitted with

obj8 <- LMMsolve(fixed = biomass ~ 1,
                 spline = ~spl1D(x = das, nseg = 50), 
                 data = APSIMdat)

The effective dimensions are:

summary(obj8)
#> Table with effective dimensions and penalties: 
#> 
#>         Term Effective Model Nominal Ratio Penalty
#>  (Intercept)      1.00     1       1  1.00    0.00
#>     lin(das)      1.00     1       1  1.00    0.00
#>       s(das)      6.46    53      51  0.13    0.01
#>     residual    112.54   121     119  0.95    0.00
#> 
#>  Total Effective Dimension: 121

The fitted smooth trend can be obtained as explained before:

das_range <- range(APSIMdat$das)
newdat <- data.frame(das=seq(das_range[1], das_range[2], length = 300))
pred8 <- predict(obj8, newdata = newdat, se.fit = TRUE)
ggplot(data = APSIMdat, aes(x = das, y = biomass)) +
  geom_point(size = 1.0) +
  geom_line(data = pred8, aes(y = ypred), color = "blue", linewidth = 1) +
  geom_ribbon(data = pred8, aes(x = das,ymin = ypred-2*se, ymax = ypred+2*se),
              alpha = 0.2, inherit.aes = FALSE) + 
    labs(title = "APSIM biomass as function of time", 
       x = "days after sowing", y = "biomass (kg)") +
  theme_bw()

The growth rate (first derivative) as function of time can be obtained using deriv = 1 in function obtainSmoothTrend:

plotDatDt <- obtainSmoothTrend(obj8, grid = 1000, deriv = 1)

ggplot(data = plotDatDt, aes(x = das, y = ypred)) +
  geom_line(color = "red", linewidth = 1) +
  labs(title = "APSIM growth rate as function of time", 
       x = "days after sowing", y = "growth rate (kg/day)") +
  theme_bw()

QTL mapping with IBD probabilities.

In QTL-mapping for multiparental populations the Identity-By-Descent (IBD) probabilities are used as genetic predictors in the mixed model (Li et al. 2021). The following simulated example is for illustration. It consists of three parents (A, B, and C), and two crosses AxB, and AxC. AxB is a population of 100 Doubled Haploids (DH), AxC of 80 DHs. The probabilities, pA, pB, and pC, are for a position on the genome close to a simulated QTL. This simulated data is included in the package.

## Load data for multiparental population.
data(multipop)
head(multipop)
#>   cross     ind         pA         pB pC    pheno
#> 1   AxB AxB0001 0.17258816 0.82741184  0 9.890637
#> 2   AxB AxB0002 0.82170793 0.17829207  0 6.546568
#> 3   AxB AxB0003 0.95968439 0.04031561  0 7.899249
#> 4   AxB AxB0004 0.96564081 0.03435919  0 4.462866
#> 5   AxB AxB0005 0.04838734 0.95161266  0 5.207757
#> 6   AxB AxB0006 0.95968439 0.04031561  0 5.265580

The residual (genetic) variances for the two populations can be different. Therefore we need to allow for heterogeneous residual variances, which can be defined by using the residual argument in LMMsolve:

## Fit null model.
obj9 <- LMMsolve(fixed = pheno ~ cross, 
                 residual = ~cross, 
                 data = multipop)
dev0 <- deviance(obj9, relative = FALSE)

The QTL-probabilities are defined by the columns pA, pB, pC, and can be included in the random part of the mixed model by using the group argument:

## Fit alternative model - include QTL with probabilities defined in columns 3:5 
lGrp <- list(QTL = 3:5)
obj10 <- LMMsolve(fixed = pheno ~ cross, 
                 group = lGrp,
                 random = ~grp(QTL),
                 residual = ~cross,
                 data = multipop) 
dev1 <- deviance(obj10, relative = FALSE)

The approximate log10(p) value is given by

## Deviance difference between null and alternative model.
dev <- dev0 - dev1
## Calculate approximate p-value. 
minlog10p <- -log10(0.5 * pchisq(dev, 1, lower.tail = FALSE))
round(minlog10p, 2)
#> [1] 8.76

The estimated QTL effects of the parents A, B, and C are given by:

coef(obj10)$QTL
#>     QTL_pA     QTL_pB     QTL_pC 
#> -1.2676362  0.6829275  0.5847088

References

Boer, Martin P. 2023. “Tensor Product P-Splines Using a Sparse Mixed Model Formulation.” Statistical Modelling 23 (5-6): 465–79. https://doi.org/10.1177/1471082X231178591.
Boer, Martin P., Hans-Peter Piepho, and Emlyn R. Williams. 2020. Linear Variance, P-splines and Neighbour Differences for Spatial Adjustment in Field Trials: How are they Related? J. Agric. Biol. Environ. Stat. 25 (4): 676–98. https://doi.org/10.1007/S13253-020-00412-4.
Bustos-Korts, Daniela, Martin P. Boer, Marcos Malosetti, Scott Chapman, Karine Chenu, Bangyou Zheng, and Fred A. van Eeuwijk. 2019. Combining Crop Growth Modeling and Statistical Genetic Modeling to Evaluate Phenotyping Strategies.” Front. Plant Sci. 10 (November). https://doi.org/10.3389/fpls.2019.01491.
Cressie, Noel, Matthew Sainsbury-Dale, and Andrew Zammit-Mangion. 2022. “Basis-Function Models in Spatial Statistics.” Annual Review of Statistics and Its Application 9 (1): 373–400. https://doi.org/10.1146/annurev-statistics-040120-020733.
Eddelbuettel, Dirk, and James Joseph Balamuta. 2018. “Extending R with C++: A Brief Introduction to Rcpp.” The American Statistician 72 (1): 28–36. https://doi.org/10.1080/00031305.2017.1375990.
Eilers, PHC, and BD Marx. 1996. Flexible smoothing with B-splines and penalties.” Stat. Sci. https://www.jstor.org/stable/2246049.
Furrer, R, and SR Sain. 2010. spam: A sparse matrix R package with emphasis on MCMC methods for Gaussian Markov random fields.” J. Stat. Softw. https://core.ac.uk/download/pdf/6340272.pdf.
Li, Wenhao, Martin P. Boer, Chaozhi Zheng, Ronny V. L. Joosen, and Fred A. van Eeuwijk. 2021. An IBD-based mixed model approach for QTL mapping in multiparental populations.” Theor. Appl. Genet. 2021 1 (August): 1–18. https://doi.org/10.1007/S00122-021-03919-7.
Patterson, HD, and R Thompson. 1971. Recovery of inter-block information when block sizes are unequal.” Biometrika. https://doi.org/10.1093/biomet/58.3.545.
Rodríguez-Álvarez, María Xosé, Dae Jin Lee, Thomas Kneib, María Durbán, and Paul Eilers. 2015. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm.” Stat. Comput. 25 (5): 941–57. https://doi.org/10.1007/S11222-014-9464-2.
Smith, S. P. 1995. Differentiation of the Cholesky Algorithm.” J. Comput. Graph. Stat. 4 (2): 134. https://doi.org/10.2307/1390762.