7 Multiple linear regression in R-INLA

本章では、時空間相関を持つモデルを実行する方法について学ぶ前に、よりシンプルなモデル(重回帰分析)をINLAパッケージで実行する方法やそこから予測値や残差を抽出したり、モデル診断やモデル選択を行う方法を学ぶ。

7.1 Introduction

本章では、チンパンジーの道具使用を調べた Hopkins et al. (2015) のデータを用いる。リサーチクエスチョンは、アリ釣りの技術が性別、年齢、生育環境で変わるのかである。

データでは、アリ釣りの技術は1回成功するまでの時間(潜時: Latency)で測定されている。計243個体の平均潜時を標準化した値がデータとして用いられている。また、説明変数は年齢Age、性別Sex、生育環境(rear; 1: 母親に育てられた、2: 人間に育てられた、3: 野生由来)、事件が実施された研究施設(Colony; 1か2)である。データは以下の通り。

chimp <- read_delim("data/Chimps.txt") 

datatable(chimp,
          options = list(scrollX = 20),
          filter = "top")

7.2 Data exploration

まずデータの確認を行う。図7.1は全ての変数のdotplot(Zuur 2012)を示したものである。平均潜時が大きい個体が2頭いる点に注意が必要である。 Hopkins et al. (2015) ではこれらは取り除かれたが、本章では入れて分析を行う。年齢は均等にばらついているように見える。

chimp %>% 
  select(Sex, Age, Z_Latency, Colony, rear) %>% 
  mutate(N = 1:n()) %>% 
  pivot_longer(1:5) %>% 
  ggplot(aes(x = value, y = N))+
  geom_point()+
  facet_rep_wrap(~name, repeat.tick.labels = TRUE, scales = "free_x")+
  theme_bw()+
  theme(aspect.ratio = 1)

図7.1: Cleveland dotplots of all the variables.

このデータには疑似反復はないように思える。各個体のデータは1つずつしかないし、時間的・空間的な相関が生じる要素もない。遺伝的な関連を考慮する必要があるかもしれないがここではひとまず扱わない。モデルに遺伝的な相関を取り入れる方法については、 Ga and Fox などを参照。

##Model formulation
以下のモデルを考える。なお、rearは3水準あるので2つの回帰係数のパラメータが推定される点は注意。

\[ \begin{aligned} &Latency_i \sim N(\mu_i, \sigma^2)\\ &E(Latency_i) = \mu_i \; \rm{and} \; var(Latency_i) = \sigma^2 \\ &\mu_i = \beta_1 + \beta_2 \times Age_i + \beta_3 \times Sex_i + \beta_4 \times Colony_i + \beta_5(\rm{and \; \beta_6}) \times rear_i \end{aligned} \]

7.3 Linear regression result

7.3.1 Executing the model in R=INLA

まず、Sex、Colony、rearについては変数を因子型にする。

chimp <- chimp %>% 
  mutate(fSex = factor(Sex),
         fColony = factor(Colony),
         frear = factor(rear))

次に、INLAでモデリングを行う。

m8_1 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.predictor = list(compute = TRUE, 
                                      quantiles = c(0.025, 0.975)))

7.3.2 Output for betas

7.3.2.1 Numerical output for the betas

$\sigma$以外のパラメータの事後分布の情報は以下のとおりである。年齢の95%確信区間は0を含まないので、年齢は平均潜時に影響を与えているといえそう。そのほかの切片以外のパラメータは全て95%確信区間に0を含んでいる。

m8_1$summary.fixed

7.3.2.2 Graphical output for the betas

$\beta_2$の事後分布は以下のように図示できる。

beta2 <- m8_1$marginals.fixed$Age

beta2 %>% 
  data.frame() %>% 
  ggplot(aes(x = x, y = y))+
  geom_area(fill = "lightblue")+
  geom_line()+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = expression(beta[2]),
       y = expression(paste("P( ", beta[2] ," | Data)")))

7.3.3 Output for hyperparameters

7.3.3.1 Numerical output for hyper parameters

ハイパーパラメータ$\tau = 1/\sigma^2$の事後分布の要約統計量は以下のように出せる。

m8_1$summary.hyperpar

しかしこれらは$\tau$についてのものであって、私たちが知りたいのは$\sigma = 1/\sqrt\tau$の事後分布である。$\tau$の期待値は以下のように書ける(第6.3節参照)。

\[ E(\tau) = \int_{-\infty}^{\infty} \tau \times p(\tau) d\tau \tag{7.1} \]

一方で、$= h() = 1/ $とするとき、$$の期待値は以下のようになる。
\[ E(\sigma) = \int_{-\infty}^{\infty} h(\tau) \times p(\tau) d\tau = \int_{0}^{\infty} \frac{1}{\sqrt{\tau}} \times p(\tau) d\tau \tag{7.2} \]

これは、単純に$\tau$の事後分布の期待値を$1/\sqrt{\tau}$で変換しても、それは$\sigma$の期待値ではないことを示している。幸いなことに、INLAではこれを計算してくれる関数が用意されている。

tau <- m8_1$marginals.hyperpar$`Precision for the Gaussian observations`

sigma <- inla.emarginal(function(x) 1/sqrt(x), tau)

sigma

## [1] 0.9767155

確かにこの値は単純に$\tau$の事後分布の期待値を$1/\sqrt{E(\tau)}$で変換したものとは違う。

etau <- m8_1$summary.hyperpar[,"mean"]
1/sqrt(etau)

## [1] 0.9736711

他の要約統計量が知りたい場合は、inla.tmarginalを用いればよい。

pmtau <- m8_1$marginals.hyperpar$`Precision for the Gaussian observations`
pm.sigma <- inla.tmarginal(function(x) sqrt(1/x), pmtau)

inla.zmarginal(pm.sigma)

## Mean            0.9767 
## Stdev           0.0445711 
## Quantile  0.025 0.8939 
## Quantile  0.25  0.945629 
## Quantile  0.5   0.974946 
## Quantile  0.75  1.00581 
## Quantile  0.975 1.06894

7.3.3.2 Graphical output for the hyperparameters

$\tau$の事後分布は以下のように図示できる(図7.2のA)。ただし、これにはわずかに45ポイントのデータしか使用されていないので、少しカクカクしている。inla.smarginal関数を用いると、スプライン回帰によってよりスムーズにしてくれる(図7.2のB)。

pmtau %>% 
  data.frame() %>% 
  ggplot(aes(x = x, y = y))+
  geom_area(fill = "lightblue")+
  geom_line()+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = expression(tau),
       y = expression(paste("P( ", tau ," | Data)")))+
  labs(title = "A") -> p1

tau.smooth <- inla.smarginal(pmtau)

tau.smooth %>% 
  data.frame() %>% 
  ggplot(aes(x = x, y = y))+
  geom_area(fill = "lightblue")+
  geom_line()+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = expression(tau),
       y = expression(paste("P( ", tau ," | Data)")))+
  labs(title = "B") -> p2

p1 + p2

図7.2: Pusterior distribution of tau

7.3.4 Fitted model

推定されたパラメータの事後分布から、モデル式は以下のように書ける。

\[ \begin{aligned} &Latency_i \sim N(\mu_i, 0.97^2)\\ &E(Latency_i) = \mu_i \; \rm{and} \; var(Latency_i) = 0.97^2 \\ \\ &\rm{for \; chimpanzee \; of \; sex = 1, colony = 1, and \; rear = 1}\\ &\mu_i = -0.45 + 0.02 \times Age_i\\ \\ &\rm{for \; chimpanzee \; of \; sex = 2, colony = 2, and \; rear = 2}\\ &\mu_i = -0.45 + 0.02 \times Age_i -0.22 +0.09 + 0.05\\ & \;\;\; = -0.52 + 0.02 \times Age_i \end{aligned} \]

7.4 Model validation

前章で計算したように事後平均を用いた$\mu_i$の予測値を手動で計算することもできるが(6.8.1)、以下のように求めることもできる。fit8_1には予測値の95%確信区間も入っている。

fit8_1 <- m8_1$summary.fitted.values
resid <- chimp$Z_Latency - fit8_1$mean

予測値と残差の関係を示したのが図7.3である。明らかに残差が大きい点が2つある。

data.frame(fitted = fit8_1$mean,
           resid = resid) %>% 
  ggplot(aes(x = fitted, y = resid))+
  geom_point(shape = 1)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  geom_hline(yintercept = 0,
             linetype = "dashed")+
  labs(x = "Fitted values", y = "Residuals")

図7.3: Fitted vs residuals

予測値と実測値の関係を見ても、あまり当てはまりがよいように見えない。

data.frame(Latency = chimp$Z_Latency,
           fitted = fit8_1$mean) %>% 
  ggplot(aes(x = fitted, y = Latency))+
  geom_point(shape = 1)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "Fitted values", y = "Latency")

図7.4: Fitted vs residuals

予測値を手動で計算する方法は、リンク関数が恒等関数のときのみ正確な値になる。なぜなら、もしリンク関数で線形予測子を変換した後の期待値は線形予測子の期待値を変換するものと一致しないからである(式(7.1)と式(7.2)も参照)。

\[ E(h(x)) \neq h(E(x)) \]

7.5 Model selection

7.5.1 What should we do?

ベイズ統計でモデル選択を行うとき、話は通常の頻度論的な場合よりも複雑になる。以下では、ベイズモデリング(特にINLA)で用いることのできるモデル選択方法について議論する。

7.5.2 Usind the DIC

AIC(赤池情報量規準)はモデル選択の際に最も一般的な指標である。なお、$L$はパラメータが与えられたときのデータの尤度(e.g., $P(D|\beta_1, \beta_2, \sigma)$)、$k$はパラメータ数である。$log(L)$は対数尤度、$-2 \times log(L)$は逸脱度(deviance)といわれる。共変量をたくさん入れればモデルの当てはまりはよくなるが($Lが高くなる$)、パラメータ数が増える($p$が大きくなる)という関係が成り立っている。AICが低いほど「良い」モデルということになる。

\[ AIC = -2 \times log(L) + 2\times p \]

もし事前分布が無情報なのであればパラメータ数は知ることができるが、事前分布が情報を持っている場合、回帰係数がとりうる範囲は限定されるので、モデルの自由度やパラメータ数は変わってくる。すなわち、このような場合にはAICは適していない。

このようなときに使えるのがDICである。DICは$\theta$を全てのパラメータを含むベクトル、$f(y|\theta)$を尤度、$D(\bar{\theta}) = -2 \times p(y|\bar{\theta})$をパラメータの期待値⁷が与えられた時の逸脱度とするとき、以下のように定義される。

\[ DIC = D(\bar{\theta}) + 2 \times p_D \]

なお、$p_D$は有効パラメータ数と呼ばれるもので、事前分布が無情報に近づくとパラメータ数$p$と一致する。また、事後分布が無情報に近づくほど頻度論での最尤推定値がベイズ統計の事後分布の期待値と一致する。よって、事前分布が無情報であればAICとDICはほとんど一致する。

これを実際にRで確かめよう。

まず、AICは以下のようになる。

m8_2 <- lm(Z_Latency ~ Age + fSex + fColony + frear,
           data = chimp)

## 対数尤度
logLik(m8_2)

## 'log Lik.' -336.2983 (df=7)

## AIC
AIC(m8_2)

## [1] 686.5965

続いて、INLAでDICも求める。control.computeオプションで、dic = TRUEとすればよい。INLAはデフォルトでは無情報事前分布が用いられている。

m8_1 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

m8_1$dic$dic

## [1] 686.6328

AICとDICはほとんど一致する。

7.5.2.1 Effective number of parameters

有効パラメータ数の求め方には2つある。1つ目は以下の通り。なお、$\bar{D}$は逸脱度の平均を表す。

\[ p_D = \bar{D} - D(\bar{\theta}) \]

このとき、DICは以下のように書き直せる。
\[ \begin{aligned} DIC &= D(\bar{\theta}) + 2 \times p_D \\ &= D(\bar{\theta}) + 2 \times (\bar{D} - D(\bar{\theta}))\\ &= \bar{D} + \bar{D} -D(\bar{\theta})\\ &= \bar{D} + p_D \end{aligned} \]

7.5.2.2 DIC related output

DICは実際のINLAの出力から計算することができる。

pD = m8_1$dic$mean.deviance - m8_1$dic$deviance.mean
DIC <- m8_1$dic$mean.deviance + pD

DIC

## [1] 686.6328

7.5.2.3 Model selection using DIC

それでは、実際にDICを比較してみる。先ほどのモデルから年齢以外の説明変数を1つずつ除いた以下のモデルを比較する。

m8_1b <- inla(Z_Latency ~  fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

m8_1c <- inla(Z_Latency ~ Age + fSex +  frear,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

m8_1d <- inla(Z_Latency ~ Age + fSex + fColony,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

DICの結果、frearがないモデルが最もDICが低いことが分かった。

c(m8_1$dic$dic, m8_1b$dic$dic, m8_1c$dic$dic, m8_1d$dic$dic)

## [1] 686.6328 692.3986 685.0378 682.7590

ここからは、さらにm8_1dから1つずつ説明変数を除いたモデルを作成し、DICが減少しなくなるまで同様の比較を続ける。

m8_1e <- m8_1 <- inla(Z_Latency ~ Age  + fColony,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

m8_1f <- m8_1 <- inla(Z_Latency ~ Age  + fSex,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

m8_1dより、そこからfColonyを除いたモデルのDICの方が低い。

c(m8_1d$dic$dic, m8_1e$dic$dic, m8_1f$dic$dic)

## [1] 682.7590 683.7640 681.4553

さらにm8_1fからfSexを除いたモデルと比較する。

m8_1g <- m8_1 <- inla(Z_Latency ~ Age,
             family = "gaussian",
             data = chimp,
             control.compute = list(dic = TRUE))

最終的に、m8_1fが最もDICが低いことが分かった。

c(m8_1f$dic$dic, m8_1g$dic$dic)

## [1] 681.4553 682.0625

7.5.3 WAIC

DICのほかには、WAIC(widely applicable information criterion)を使うこともできる。WAICはDICをより発展させたものととらえられる。WAICの解説については、 McElreath (2020) も参照。

INLAでは以下のようにして計算できる。

m8_1 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.compute = list(waic = TRUE))

m8_1$waic$waic

## [1] 696.295

7.5.4 Out of sample prediction

最後に、DICやWAICのような情報量規準ではなく、交差検証(cross validation)と呼ばれる方法を用いることもできる。この方法では、例えばデータをらなダムに2つ($D_{fit}$と$D_{pred}$)に分ける。その後$D_{fit}$に対してあるモデルを当てはめ、その結果をもとに$D_{fit}$の予測を行う。以上を何度も繰り返し、それらを総合してそのモデルの精度を評価する。これを複数のモデルに対して行い、そのモデルが最も良いかを調べるのが交差検証である。

INLAでも交差検証を行う。INLAでは、$D_{fit}$を1つだけの観測値を除いたデータとし、これを全ての観測値に対して行う交差検証を行う。これは”leave one out” 交差検証と呼ばれる。INLAは交差検証によってCPO(conditional predictive ordinate)やPIT(probability integral transform)と呼ばれる値を算出する。

CPOは、除かれた1つの観測値が、その他の観測値が与えられたときに得られる確率として定義される。$CPO_i$が高いほど他のデータを用いて推定されたモデルに従うことを示す。
\[ CPO_i = Pr(除かれた観測値_i|そのほかの観測値) \]

PITはCPOの代替的な指標である。説明はこちらや Blangiardo and Cameletti (2015) に詳しい。

INLAでは$CPO_i$を以下のように算出できる。

m8_3 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.compute = list(cpo = TRUE))

以下のm8_3$cpoには3つのリストが含まれる。m8_3$cpo$cpoは$CPO_i$が、m8_3$cpo$pitは$PIT_i$がデータ数だけ含まれる。3つめのm3_8$cpo$failureはもし1であればその観測値のCPOやPITが信頼できないことを表す。

m8_3$cpo %>% 
  data.frame() %>% 
  datatable()

CPOのdotplotを図示すると以下のようになる。それも低い値をとっており、モデルの当てはまりがよくないことが分かる。

m8_3$cpo %>% 
  data.frame() %>% 
  mutate(n = 1:n()) %>% 
  ggplot(aes(x = cpo, y = n))+
  geom_point()+
  coord_cartesian(xlim = c(0,1))+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "CPO", y = "Observation number")

モデル選択では、様々なモデルに対して同様にCPOを算出し、どのモデルがよりよく当てはまっているかを比べる。しかし実際にこれを基に判断するのは難しい。他の方法としては、モデルごとに$CPO_i$を1つの値にまとめ、それをモデル間で比較するというものである。例えば、$log(CPO_i)$の合計をそのモデルの当てはまりの良さとして使用することができる。

sum(log(m8_3$cpo$cpo))

## [1] -347.8294

7.5.5 Posterior predictive check

モデルの当てはまりをチェックする方法としては、モデルから事後予測分布を算出するというものである。事後予測分布は、モデルから推定された事後分布をもとに新たにデータを生成したときに、新たに得られるデータの予測分布のことである。INLAでは以下のようにして事後予測分布を計算できる。

7.5.5.1 Zuur (2017)の方法

どうやら、以下は平均$\mu_i$の事後周辺分布を算出しているに過ぎないよう。事後予測分布ではない点に注意。よって、事後予測p値も間違っている。

m8_4 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp,
             control.predictor = list(compute = TRUE),
             control.compute = list(return.marginals.predictor=TRUE))

例えば、1頭目のチンパンジーの平均潜時の事後予測分布は以下のようになる。実測値を黒い点で示した。

m8_4$marginals.fitted.values[[1]] %>% 
  data.frame() %>% 
  ggplot(aes(x = x, y = y))+
  geom_area(fill = "lightblue")+
  geom_line()+
  geom_point(data = chimp %>% .[1,],
             aes(x = Z_Latency, y = 0),
             size = 4.5)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "Latency scores of chimp 1",
       y = "Density")

全観測値について95%予測区間(事後予測分布の95%区間)と観測値の関係を示したのが以下の図である。実測値が95%予測区間に入ってないものがかなり多くあることが分かる。このことは、このモデルから実際のデータが得られたとは言いにくいということを示している。

postpre8_4 <- m8_4$marginals.fitted.values

postpre_all <- data.frame()

for(i in 1:nrow(chimp)){
summary_i <- inla.qmarginal(c(0.025, 0.5, 0.975), postpre8_4[[i]]) %>% 
  data.frame() %>% 
  rename(value = 1) %>% 
  mutate(col = c("q2.5","q50", "q97.5")) %>% 
  pivot_wider(names_from = col, values_from = value) %>% 
  mutate(id = i)

postpre_all <- bind_rows(postpre_all, summary_i)
}

postpre_all %>% 
  mutate(id2 = c(rep("1~61",61),rep("62~122",61), rep("123~183",61), rep("184~243",60))) %>% 
  ggplot(aes(x = id))+
  geom_errorbar(aes(ymin = q2.5, ymax = q97.5))+
  geom_point(data = chimp %>% mutate(id = 1:n(),
                                     id2 = c(rep("1~61",61),rep("62~122",61), rep("123~183",61), rep("184~243",60))),
             aes(x = id, y = Z_Latency),
             size = 1)+
  theme(aspect.ratio = 0.5)+
  facet_rep_wrap(~id2, repeat.tick.labels = TRUE,
                 scales = "free")

事後予測分布をもとにモデルの当てはまりを評価するために、事後予測p値という値が考えられているようだ。これは、事後予測分布が実測値よりも小さい値をとる確率$Pr(Latency_i^* \le Latency_i|D)$で定義される。なお、$Latency_i^*$は事後予測分布の値である。事後予測p値が0や1に近い値が多いならば、このモデルの当てはまりは悪いということになる。

では、実際に算出してみる。図示すると、やはり0や1に近い値が多くてモデルの当てはまりが悪いことが分かる。

pval <- rep(NA, nrow(chimp))

for(i in 1:nrow(chimp)){
  pval[i] <- inla.pmarginal(q = chimp$Z_Latency[[i]],
                            marginal = postpre8_4[[i]])
}

data.frame(pval = pval) %>% 
  ggplot()+
  geom_histogram(aes(x = pval),
                 bins = 20,
                 alpha = 0,
                 color = "black")+
  theme_bw()+
  theme(aspect.ratio = 1)+
  scale_x_continuous(breaks = seq(0,1,0.1))

7.5.5.2 正しい方法

以下、事後予測分布をシミュレーションによってサンプリングする方法を解説する。参考にしたのはこちらとこちら。

INLAパッケージでは、事後同時分布からランダムに値を抽出できるinla.posterior.sample関数が存在する。これを使用するためには、モデルを実行する際にcontrol.computeオプションでconfig = TRUEとする必要がある。

m8_4b <- inla(Z_Latency ~ Age + fSex + fColony + frear,
              family = "gaussian",
              data = chimp,
              control.compute = list(config = TRUE))

以下のようにして、事後同時分布からサンプルを抽出できる。

sim_jointpost <- inla.posterior.sample(n = 10000, m8_4b)

例えば、1つ目のサンプルは以下のようになる。ここには、事後同時分布からサンプルされた各パラメータの値と、そのときの各個体の期待値(予測値、$\mu_i$)の値が格納されている。

sim_jointpost[[1]]

## $hyperpar
## Precision for the Gaussian observations 
##                                 0.73403 
## 
## $latent
##                       [,1]
## Predictor:1   -0.558618428
## Predictor:2    0.219764665
## Predictor:3    0.310161225
## Predictor:4    0.100380575
## Predictor:5   -0.199812090
## Predictor:6   -0.473460892
## Predictor:7   -0.417944955
## Predictor:8   -0.367013972
## Predictor:9    0.070738977
## Predictor:10  -0.027355729
## Predictor:11   0.249406264
## Predictor:12   0.267582457
## Predictor:13  -0.098604195
## Predictor:14   0.034371134
## Predictor:15   0.007842726
## Predictor:16  -0.665065349
## Predictor:17  -0.260567052
## Predictor:18  -0.048645113
## Predictor:19  -0.093365170
## Predictor:20  -0.665065349
## Predictor:21  -0.579907813
## Predictor:22  -0.217988284
## Predictor:23  -0.622486581
## Predictor:24  -0.157233322
## Predictor:25  -0.072075786
## Predictor:26  -0.112513265
## Predictor:27   0.006870825
## Predictor:28  -0.388303356
## Predictor:29  -0.239277668
## Predictor:30  -0.473460892
## Predictor:31  -0.027355729
## Predictor:32  -0.112513265
## Predictor:33  -0.072075786
## Predictor:34   0.185538111
## Predictor:35   0.164248727
## Predictor:36  -0.473460892
## Predictor:37  -0.388303356
## Predictor:38  -0.155092033
## Predictor:39   0.352739993
## Predictor:40  -0.162472347
## Predictor:41  -0.034736043
## Predictor:42   0.411369121
## Predictor:43   0.036512423
## Predictor:44  -0.263680242
## Predictor:45   0.560394809
## Predictor:46   0.079091191
## Predictor:47  -0.601197197
## Predictor:48   0.093000262
## Predictor:49  -0.141182963
## Predictor:50   0.225003689
## Predictor:51  -0.643775965
## Predictor:52  -0.027355729
## Predictor:53   0.291985032
## Predictor:54   0.206827495
## Predictor:55  -0.345724588
## Predictor:56  -0.006066345
## Predictor:57  -0.537329044
## Predictor:58  -0.226340499
## Predictor:59  -0.622486581
## Predictor:60   0.347500969
## Predictor:61  -0.601197197
## Predictor:62   0.374029377
## Predictor:63  -0.141182963
## Predictor:64  -0.098604195
## Predictor:65  -0.558618428
## Predictor:66   0.121669959
## Predictor:67  -0.516039660
## Predictor:68  -0.014418559
## Predictor:69  -0.391416547
## Predictor:70   0.185538111
## Predictor:71  -0.162472347
## Predictor:72   0.203714305
## Predictor:73  -0.388303356
## Predictor:74   0.119528670
## Predictor:75  -0.048645113
## Predictor:76   0.164248727
## Predictor:77   0.483589488
## Predictor:78   0.441010720
## Predictor:79  -0.155092033
## Predictor:80  -0.409592740
## Predictor:81  -0.133802649
## Predictor:82  -0.077314811
## Predictor:83   0.164248727
## Predictor:84  -0.452171508
## Predictor:85   0.121669959
## Predictor:86  -0.050786402
## Predictor:87  -0.430882124
## Predictor:88  -0.141182963
## Predictor:89  -0.048645113
## Predictor:90  -0.119893579
## Predictor:91  -0.566970643
## Predictor:92   0.347500969
## Predictor:93  -0.430882124
## Predictor:94  -0.324435204
## Predictor:95  -0.388303356
## Predictor:96  -0.622486581
## Predictor:97   0.390079737
## Predictor:98  -0.409592740
## Predictor:99  -0.409592740
## Predictor:100 -0.609549411
## Predictor:101  0.228116880
## Predictor:102  0.036512423
## Predictor:103 -0.473460892
## Predictor:104  0.441010720
## Predictor:105 -0.324435204
## Predictor:106  0.225003689
## Predictor:107  0.390079737
## Predictor:108 -0.135943938
## Predictor:109 -0.034736043
## Predictor:110  0.057801807
## Predictor:111 -0.162472347
## Predictor:112  0.539105425
## Predictor:113  0.560394809
## Predictor:114  0.100380575
## Predictor:115  0.057801807
## Predictor:116 -0.119893579
## Predictor:117 -0.162472347
## Predictor:118  0.185538111
## Predictor:119 -0.034736043
## Predictor:120  0.036512423
## Predictor:121  0.164248727
## Predictor:122  0.007842726
## Predictor:123 -0.077314811
## Predictor:124 -0.069934497
## Predictor:125 -0.013446658
## Predictor:126 -0.452171508
## Predictor:127  0.632615176
## Predictor:128  0.100380575
## Predictor:129 -0.247629883
## Predictor:130  0.142959343
## Predictor:131  0.006870825
## Predictor:132  0.475237273
## Predictor:133  0.432658505
## Predictor:134 -0.750222885
## Predictor:135 -0.537329044
## Predictor:136 -0.048645113
## Predictor:137 -0.345724588
## Predictor:138 -0.027355729
## Predictor:139 -0.750222885
## Predictor:140 -0.409592740
## Predictor:141 -0.396655571
## Predictor:142 -0.056025427
## Predictor:143 -0.643775965
## Predictor:144 -0.494750276
## Predictor:145 -0.077314811
## Predictor:146  0.164248727
## Predictor:147  0.249406264
## Predictor:148 -0.239277668
## Predictor:149 -0.098604195
## Predictor:150 -0.303145820
## Predictor:151 -0.750222885
## Predictor:152  0.347500969
## Predictor:153 -0.750222885
## Predictor:154 -0.598688794
## Predictor:155 -0.556110026
## Predictor:156 -0.282461224
## Predictor:157  0.294493434
## Predictor:158 -0.236769266
## Predictor:159  0.333959012
## Predictor:160  0.312669628
## Predictor:161 -0.641267562
## Predictor:162 -0.109032962
## Predictor:163 -0.194190498
## Predictor:164  0.060310210
## Predictor:165  0.103860879
## Predictor:166 -0.385794954
## Predictor:167 -0.133435536
## Predictor:168 -0.282461224
## Predictor:169  0.419116548
## Predictor:170  0.251914666
## Predictor:171 -0.428373722
## Predictor:172  0.180666200
## Predictor:173 -0.598688794
## Predictor:174 -0.194190498
## Predictor:175 -0.279348034
## Predictor:176 -0.470952490
## Predictor:177 -0.449663106
## Predictor:178  0.610721004
## Predictor:179 -0.449663106
## Predictor:180  0.270090860
## Predictor:181  0.355248396
## Predictor:182 -0.282461224
## Predictor:183 -0.449663106
## Predictor:184  0.333959012
## Predictor:185 -0.159963944
## Predictor:186 -0.282461224
## Predictor:187 -0.303750608
## Predictor:188 -0.577399410
## Predictor:189  0.228483993
## Predictor:190 -0.303750608
## Predictor:191  0.376537780
## Predictor:192  0.504274084
## Predictor:193 -0.236769266
## Predictor:194  0.207194608
## Predictor:195  0.036879536
## Predictor:196 -0.598688794
## Predictor:197  0.397827164
## Predictor:198 -0.534820642
## Predictor:199  0.039020826
## Predictor:200  0.568142236
## Predictor:201 -0.598688794
## Predictor:202 -0.346329376
## Predictor:203  0.419116548
## Predictor:204 -0.197303688
## Predictor:205 -0.218593072
## Predictor:206 -0.598688794
## Predictor:207  0.419116548
## Predictor:208  0.036879536
## Predictor:209 -0.096095792
## Predictor:210 -0.074806408
## Predictor:211 -0.577399410
## Predictor:212  0.270090860
## Predictor:213  0.355248396
## Predictor:214  0.273204050
## Predictor:215  0.074219280
## Predictor:216  0.249773377
## Predictor:217  0.180666200
## Predictor:218 -0.321926802
## Predictor:219 -0.069567384
## Predictor:220 -0.513531258
## Predictor:221  0.270090860
## Predictor:222 -0.010938256
## Predictor:223 -0.131294247
## Predictor:224 -0.385794954
## Predictor:225 -0.282461224
## Predictor:226 -0.385794954
## Predictor:227  0.549966042
## Predictor:228 -0.449663106
## Predictor:229  0.015590152
## Predictor:230 -0.641267562
## Predictor:231 -0.026988616
## Predictor:232  0.397827164
## Predictor:233 -0.154724920
## Predictor:234  0.166757130
## Predictor:235  0.039020826
## Predictor:236 -0.385794954
## Predictor:237 -0.556110026
## Predictor:238 -0.032227640
## Predictor:239 -0.556110026
## Predictor:240  0.124178362
## Predictor:241 -0.513531258
## Predictor:242  0.079458304
## Predictor:243  0.397827164
## (Intercept):1 -0.346696489
## Age:1          0.021289384
## fSex2:1       -0.595130852
## fColony2:1     0.108955323
## frear2:1       0.231070034
## frear3:1      -0.029641599
## 
## $logdens
## $logdens$hyperpar
## [1] -4.42645
## 
## $logdens$latent
## [1] 3.895123
## 
## $logdens$joint
## [1] -0.5313266

例えば、1つ目のサンプルにおける243番目の個体の予測値とその時使用されたパラメータの値は以下のようになる。

tail(sim_jointpost[[1]]$latent, n = 7)

##                      [,1]
## Predictor:243  0.39782716
## (Intercept):1 -0.34669649
## Age:1          0.02128938
## fSex2:1       -0.59513085
## fColony2:1     0.10895532
## frear2:1       0.23107003
## frear3:1      -0.02964160

全個体について、予測値(期待値)とハイパーパラメータ$\tau$を10000サンプル分まとめる。

post_samples <- matrix(ncol = nrow(chimp),
                       nrow = 10000)
tau <- rep(NA, 10000)

for(i in 1:10000){
  for(j in 1:nrow(chimp)){
  post_samples[i,j] <- sim_jointpost[[i]]$latent[j,1]
  tau[i] <- sim_jointpost[[i]]$hyperpar[1]
  }
}

例えば、個体21について得られた予測値のサンプルの分布を描くと、図7.5Aのようになる。これは、m8_4$marginal.fitted.valuesで取得した分布(図7.5B)とほぼ一致する。このことからも、m8_4$marginal.fitted.valuesで取得されたのはやはり期待値$\mu_i$の事後分布であり、事後予測分布ではなかったことが分かる。

data.frame(x21 = post_samples[,21]) %>% 
  ggplot(aes(x = x21))+
  geom_density()+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "Linear predictor of chimp 1",
       y = "Density")+
  coord_cartesian(xlim = c(-0.75, 0.15))+
  labs(title = "Simulation based") -> p1

m8_4$marginals.fitted.values[[1]] %>% 
  data.frame() %>% 
  ggplot(aes(x = x, y = y))+
  geom_line()+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "Linear predictor of chimp 1",
       y = "Density")+
  coord_cartesian(xlim = c(-0.75, 0.15))+
  labs(title = "Using `marginal.fitted.values`") -> p2

p1 + p2

図7.5: Posterior distribution of the linear predictor of sample 21. A: siulation based, B: using marginal.fitted.values

事後予測分布を得るためには、ここからさらに$\sigma$($\tau$)も入れたうえで正規分布からサンプリングを行う必要がある。以下のようにして、平均$\mu_i$、標準偏差$\sigma$の正規分布からサンプリングを行う。

y.sim <- matrix(ncol = nrow(chimp),
                       nrow = 10000)

for(j in 1:nrow(chimp)){
  y.sim[,j] <- rnorm(10000, mean = post_samples[,j], sd = 1/sqrt(tau))
}

得られた値は以下の通り(最初の10サンプルのみ表示)。

y.sim %>% 
  data.frame() %>% 
  head(10) %>% 
  datatable(options = list(scrollX = 243))

例えば、21番目の個体の事後予測分布は以下のようになる。当然だが、やはり期待値$\mu_i$の事後分布(図7.5)よりも広い範囲の値をとる。

data.frame(x = y.sim[,21]) %>% 
  ggplot(aes(x = x))+
  geom_histogram(aes(y = ..density..),
                 color = "black", alpha = 0)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(x = "Posterior predictive samples of chimp 1",
       y = "Density")

全個体の95%予測区間(事後予測分布の95%区間)を算出し、それと実測値の関係を見てみよう。図示すると(図7.6)、概ね実測値が95%予測区間の中に納まっているが、その範囲に入らないデータが6つあることが分かる。

int_pre <- data.frame(id = 1:nrow(chimp),
                      pi.lower = NA,
                      pi.upper = NA)

for(j in 1:nrow(chimp)){
  int_pre[j,2] <- quantile(y.sim[,j], probs = 0.025)
  int_pre[j,3] <- quantile(y.sim[,j], probs = 0.975)
}

int_pre %>% 
  mutate(id2 = c(rep("1~61",61),rep("62~122",61), rep("123~183",61), rep("184~243",60))) %>% 
  ggplot(aes(x = id))+
  geom_errorbar(aes(ymin = pi.lower, ymax = pi.upper))+
  geom_point(data = chimp %>% mutate(id = 1:n(),
                                     id2 = c(rep("1~61",61),rep("62~122",61), rep("123~183",61), rep("184~243",60))),
             aes(x = id, y = Z_Latency),
             size = 1)+
  theme(aspect.ratio = 0.5)+
  facet_rep_wrap(~id2, repeat.tick.labels = TRUE,
                 scales = "free")

図7.6: 95% predictive interval and observed values.

最後に事後予測p値を算出してヒストグラムを書く。わずかながら1にかなり近い値をとるデータがあり、当てはまりがあまり良いわけではないことが分かった。

pval <- NA

for(j in 1:nrow(chimp)){
pval[j] <- sum(y.sim[,j] <= chimp$Z_Latency[j])/nrow(y.sim) 
}

data.frame(pval = pval) %>% 
  ggplot()+
  geom_histogram(aes(x = pval),
                 bins = 40,
                 alpha = 0,
                 color = "black")+
  theme_bw()+
  theme(aspect.ratio = 1)+
  scale_x_continuous(breaks = seq(0,1,0.1))+
  scale_y_continuous(breaks = seq(0,16,1))+
  coord_cartesian(xlim = c(0,1))

7.6 Visualizing the model

最後に、モデルの予測値とその95%確信区間をデータ上に可視化する。

lm関数を用いた重回帰分析ではこれがかなり簡単に行える。例えば、fColony = 1、frear = 1のときの予測値とその95%信頼区間は以下のように描ける。

fit8_2 <- ggpredict(m8_2,
                    terms = c("Age[4:50,by = 0.1]","fSex"),
                    condition = c(fColony = "1", frear = "1"))

fit8_2 %>% 
  data.frame() %>% 
  rename(Age = x, fSex = group) %>% 
  mutate(fSex = str_c("fSex = ",fSex)) %>% 
  ggplot(aes(x = Age, y = predicted))+
  geom_line()+
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              alpha = 0.4)+
  geom_point(data = chimp %>% mutate(fSex = str_c("fSex = ",fSex)),
             aes(y = Z_Latency),
             shape = 1)+
  facet_rep_wrap(~fSex)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(y = "Latency")

INLAで同様のグラフを95%確信区間で書く方法は2つある。そのうち1つをここで紹介する(もう一つは第8章で紹介する)。　　

この方法は少しトリッキーだがうまくいく。まず、予測値を求めたい範囲のデータを格納し、かつZ_Latency = NAのデータフレームを作成する。

newdata <- crossing(Z_Latency = NA,
                    Age = seq(4,50,length = 100),
                    fSex = c("1","2"),
                    fColony = "1",
                    frear = "1")

これをもとのデータにくっつけてモデルを実行すると、パラメータの推定自体に影響はないが、先ほどnewdataで指定した範囲についても予測値を算出することができる。

chimp2 <- bind_rows(chimp, newdata)

m8_5 <- inla(Z_Latency ~ Age + fSex + fColony + frear,
             family = "gaussian",
             data = chimp2,
             control.predictor = list(compute = TRUE),
             control.compute = list(return.marginals.predictor=TRUE))

確信区間等を算出し、244番目以降のデータについて抽出すればnewdataで指定したデータについての予測値と95%確信区間が得られる。図示した結果は頻度論の結果とほぼ変わらない。

fit8_5 <- m8_5$summary.fitted.values[244:443,] %>% 
  bind_cols(newdata)

fit8_5 %>% 
  mutate(fSex = str_c("fSex = ",fSex)) %>% 
  ggplot(aes(x = Age, y = mean))+
  geom_line()+
  geom_ribbon(aes(ymin = `0.025quant`, ymax = `0.975quant`),
              alpha = 0.4)+
  geom_point(data = chimp %>% mutate(fSex = str_c("fSex = ",fSex)),
             aes(y = Z_Latency),
             shape = 1)+
  facet_rep_wrap(~fSex)+
  theme_bw()+
  theme(aspect.ratio = 1)+
  labs(y = "Latency")

References

Blangiardo M, Cameletti M (2015) Spatial and spatio-temporal bayesian models with R - INLA. John Wiley & Sons

Hopkins WD, Reamer L, Mareno MC, Schapiro SJ (2015) Genetic basis in motor skill and hand preference for tool use in chimpanzees (pan troglodytes). Proc Biol Sci 282:20141223

McElreath R (2020) Statistical rethinking : A bayesian course with examples in R and STAN, 2nd Edition. Chapman; Hall/CRC

Zuur AF (2012) A beginner’s guide to generalized additive models with R. Highland Statistics, Newburgh, Scotland

INLAでは、回帰係数$\beta$については事後分布の期待値(平均)、ハイパーパラメータについては事後分布の最頻値(モード)がDICの計算に使用される。↩︎