R Data Analysis: A Beginner's Primer on Multiple Imputation

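Many students have asked me: when SPSS produces multiply imputed datasets, which one should you pick, and how should you use them? Can you simply choose one and analyze it? Today's post addresses that question.

When to use multiple imputation

First, a quick review of the three missing-data mechanisms (also called missingness types):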

[Figure: the three missing-data mechanisms]
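I have covered the material above before, so I won't translate it here. When data are missing completely at random (MCAR) and only a small amount is missing, you can drop the incomplete cases or impute however you like; the results will barely change. Data missing at random (MAR) are handled well by multiple imputation. For data missing not at random (MNAR), no method can rescue the analysis, and the right course is to run sensitivity analyses after the main analysis. I believe I explained the reasons in an earlier post.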
When it is plausible that data are missing at random, but not completely at random, analyses based on complete cases may be biased. Such biases can be overcome using methods such as multiple imputation that allow individuals with incomplete data to be included in analyses. Unfortunately, it is not possible to distinguish between missing at random and missing not at random using observed data. Therefore, biases caused by data that are missing not at random can be addressed only by sensitivity analyses examining the effect of different assumptions about the missing data mechanism.
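To make the difference between the mechanisms concrete, here is a minimal simulation sketch. The variables `x` and `y`, the 30% missingness rate and the `plogis(2 * x)` missingness model are all made up for illustration and do not come from the article's data. It only shows that a complete-case mean stays close to the truth under MCAR but drifts under MAR.

```r
set.seed(2024)
n <- 10000
x <- rnorm(n)               # fully observed covariate (hypothetical)
y <- 2 + x + rnorm(n)       # outcome; its true mean is 2

# MCAR: every y has the same 30% chance of being missing
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

# MAR: the chance that y is missing depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(2 * x), NA, y)

mean_full <- mean(y)
mean_mcar <- mean(y_mcar, na.rm = TRUE)   # stays close to the full-data mean
mean_mar  <- mean(y_mar, na.rm = TRUE)    # pulled downward: high-x cases are lost
round(c(full = mean_full, mcar = mean_mcar, mar = mean_mar), 2)
```

Because the MAR missingness here is driven by the observed `x`, a method that uses `x` when filling in `y` (such as multiple imputation) has the information it needs to correct the drift.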
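The idea behind multiple imputation

Before turning to multiple imputation, recall simple imputation, also called single imputation: each missing value is filled in with exactly one value, whether the mean, the median, the mode or something else. You pick one value and end up with a single completed dataset.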
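This raises a problem: with only one value filled in, how can you be confident it is the right one? The risk of bias is fairly high, and the analysis then treats the imputed value as if it had actually been observed.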
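The weakness of single imputation can be seen in a small sketch. The vector `x`, the sample size of 200 and the 60 deleted values are invented for illustration. Filling every gap with the observed mean leaves the mean unchanged but shrinks the spread, so any standard error computed afterwards is too optimistic.

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)   # hypothetical complete variable

x_miss <- x
x_miss[sample(200, 60)] <- NA         # delete 60 values at random

# Single (mean) imputation: every gap receives the same value
x_imputed <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

sd_observed <- sd(x_miss, na.rm = TRUE)   # spread of the observed values
sd_imputed  <- sd(x_imputed)              # artificially smaller spread
c(observed = sd_observed, imputed = sd_imputed)
```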
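Multiple imputation is different. Each missing value is imputed with several plausible values, which yields several completed datasets (hence "multiple"). If each missing cell is imputed 5 times, for example, you get 5 datasets. In all 5 the originally missing values have been filled in by the algorithm, but the imputed values differ from one dataset to the next. The essence of multiple imputation is to run the target analysis on every imputed dataset and then pool the estimates into a single combined effect whose standard error reflects the uncertainty about the missing values.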
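The pooling step is usually done with Rubin's rules. The article does not name the rule, so treat this as an assumption. The sketch below uses a hypothetical helper `pool_rubin()` and invented log odds ratios and standard errors from m = 5 imputed datasets. It leaves out the degrees-of-freedom adjustment needed for confidence intervals.

```r
# Pool m estimates and their standard errors with Rubin's rules.
# pool_rubin() is a hypothetical helper written for this illustration.
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  u_bar <- mean(se^2)                 # within-imputation variance
  b     <- var(est)                   # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Invented results from 5 imputed datasets (log odds ratio scale)
log_or <- c(0.10, 0.14, 0.08, 0.12, 0.11)
se     <- c(0.05, 0.06, 0.05, 0.05, 0.06)

pooled <- pool_rubin(log_or, se)
pooled
exp(pooled["estimate"])   # back-transform to the odds ratio scale
```

The between-imputation term is what a single imputed dataset cannot provide: it measures how much the answer moves when the missing values are filled in differently.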
Here is the definition of multiple imputation (from the BMJ):
Multiple imputation is a general approach to the problem of missing data that is available in several commonly used statistical packages. It aims to allow for the uncertainty about the missing data by creating several different plausible imputed data sets and appropriately combining results obtained from each of them.
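Concretely, there are three steps. First, create several imputed datasets, so that each missing value is imputed several times, each time with a value predicted from the distribution of the observed data. Second, with several completed datasets in hand, run the analysis of interest on each one to obtain several sets of results. Third, pool those results.
That is the whole workflow.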
In the first step, the dataset with missing values (i.e. the incomplete dataset) is copied several times. Then in the next step, the missing values are replaced with imputed values in each copy of the dataset. In each copy, slightly different values are imputed due to random variation. This results in multiple imputed datasets. In the third step, the imputed datasets are each analyzed and the study results are then pooled into the final study result.
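The three steps map onto the mice package roughly as follows. This is a sketch under assumptions, not necessarily the workflow this article uses: it assumes mice is installed, and it uses the `nhanes2` example data bundled with mice as a stand-in because that dataset also has `hyp` and `bmi` columns with missing values.

```r
# Assumption: the mice package is installed, e.g. install.packages("mice")
library(mice)

# Assumption: nhanes2 (bundled with mice) stands in for the article's data
data <- nhanes2

# Step 1: create m = 5 imputed datasets
imp <- mice(data, m = 5, seed = 123, printFlag = FALSE)

# Step 2: fit the target logistic regression in each imputed dataset
fit <- with(imp, glm(hyp ~ bmi, family = binomial(link = "logit")))

# Step 3: pool the 5 sets of estimates into one result
pooled <- pool(fit)

# Pooled odds ratios with 95% confidence intervals
pooled_or <- summary(pooled, conf.int = TRUE, exponentiate = TRUE)
pooled_or
```

Note that the analysis is never run on a single imputed dataset: `with()` fits the model m times and `pool()` combines the fits.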

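So if you handle missing data with multiple imputation but then analyze only one of the imputed datasets, the analysis is simply wrong. Please stop asking which dataset to pick: whichever one you pick, it is wrong.
With the idea covered, let's move on to practice.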
Worked example

For a hands-on walkthrough of multiple imputation in SPSS, read the link below; it is very detailed:
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html#:~:text=After%20multiple%20imputation%2C%20the%20multiple%20imputed%20datasets%20are,that%20separates%20the%20original%20from%20the%20imputed%20datasets.
Today we will cover how to do multiple imputation in R.
I have the following data:
[Figure: preview of the dataset]
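The data are simple, and you can see that they contain many missing values. My target analysis is a logistic regression with hyp as the dependent variable. Without any imputation, the code would be:

    # glm() drops rows with a missing hyp or bmi by default (complete-case analysis)
    model <- glm(hyp ~ bmi, family = binomial(link = 'logit'), data = data)
    # Exponentiate the coefficients and confidence limits to get odds ratios
    model_or <- exp(cbind(OR = coef(model), confint(model)))

Running this gives the odds ratios and confidence intervals we want: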
