Ying-Qiu Zheng committed Jul 01, 2021

## Fusion of high-quality and low-quality classification models

### Graphical models

![diagram1](/figs/2021JUL01/diagram-20210630-3.png)

##### An alternative formulation

![diagram2](/figs/2021JUL01/diagram-20210702-01.png)

### Panel A - (the most basic) model formulation (with classical ARD priors)

The model for high-quality data classification follows a regression form with ARD priors. The low-quality model is trained by marginalising over the posterior distribution of the high-quality coefficients $\mathbf{w}^{H}$ to give (the distribution of) a set of low-quality coefficients (likewise with ARD priors).

#### On the high-quality data.

- Suppose $\mathbf{X}^{H}$ is the $v\times d$ feature matrix (e.g. connectivity profiles of $v$ voxels), $\mathbf{t}$ is the $v\times 1$ vector of 0-1 labels, $\mathbf{w}$ is the $d\times 1$ coefficient vector, and $\mathbf{y}=\sigma(\mathbf{X}^{H}\mathbf{w})$ gives the probability of each class.
- Here we adopt the Relevance Vector Machine (RVM) with an ARD prior to find $\mathbf{w}$: suppose $\mathbf{w}$ has the prior distribution $\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_{i}^{-1}))$. We hope $\alpha_{i}$ is driven to infinity when the associated feature is useless for prediction.
- The posterior distribution $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d)$ can be found by maximising
```math
P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d) \propto P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w})P(\mathbf{w}|\alpha_1,\ldots,\alpha_d)
```
  using the Newton-Raphson algorithm. Suppose the resulting Laplace approximation is $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d)\approx\mathcal{N}(\mathbf{w}^{*}, \mathbf{H}^{-1})$, where $\mathbf{w}^{*}$ is the posterior mode and $\mathbf{H}$ is the Hessian of the negative log-posterior at the mode.
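The Newton-Raphson/Laplace step above can be sketched as follows, for fixed $\alpha_1,\ldots,\alpha_d$ (one inner loop of the RVM; the $\alpha$ re-estimation itself is omitted). Function and variable names here are illustrative, not from the original code.

```julia
using LinearAlgebra

logistic(a) = 1 / (1 + exp(-a))

# Returns the MAP weights w* and the Hessian H of the negative log-posterior,
# so that P(w | X, t, α) ≈ N(w*, H⁻¹) (the Laplace approximation).
function map_laplace(X, t, α; iters = 25)
    d = size(X, 2)
    w = zeros(d)
    H = Matrix{Float64}(I, d, d)
    for _ in 1:iters
        y = logistic.(X * w)
        g = X' * (y .- t) .+ α .* w                   # ∇(-log posterior)
        H = X' * (X .* (y .* (1 .- y))) + Diagonal(α)  # X'RX + diag(α)
        w -= H \ g                                     # Newton-Raphson update
    end
    return w, H
end
```

The ARD term `Diagonal(α)` keeps the Hessian positive definite, so the Newton step is well defined even when the classes are linearly separable.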
By marginalising over the posterior of $\mathbf{w}$ (using the Laplace approximation), we can then find $\alpha_1,\ldots,\alpha_d$ by maximising the type-II likelihood (evidence)
```math
P(\mathbf{t}|\mathbf{X}^{H}, \alpha_1,\ldots,\alpha_d) = \int P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w})P(\mathbf{w}|\alpha_1,\ldots,\alpha_d) \,\text{d}\mathbf{w}
```

#### On the low-quality data.

- Suppose $\mathbf{X}^{L}$ holds the connectivity profiles of the voxels in the low-quality image. Here we seek to predict $\mathbf{t}$ using $\mathbf{X}^{L}$, aided by the high-quality training. We assume both $\mathbf{X}^{L}$ and $\mathbf{X}^{H}$ share the same set of $\mathbf{t}$ and $\mathbf{y}$.
- Unlike the high-quality case, we assume $\mathbf{y}=\sigma(\mathbf{X}^{L}\mathbf{w} + \mathbf{X}^{L}\Delta\mathbf{w})$, where the posterior distribution of $\mathbf{w}$ is derived from training on the high-quality data, i.e. $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d)$, and the additional weights $\Delta\mathbf{w}$ have the prior distribution $\mathcal{N}(\mathbf{0}, \text{diag}(\beta_{i}^{-1}))$. These additional weights are introduced to adapt the model to the low-quality features.
- Similarly, we find the posterior of $\Delta\mathbf{w}$ by maximising
```math
P(\Delta\mathbf{w}|\mathbf{t}, \mathbf{X}^{L}, \mathbf{w}, \beta_1,\ldots,\beta_d) \propto P(\mathbf{t}|\mathbf{X}^{L},\Delta\mathbf{w}, \mathbf{w})P(\Delta\mathbf{w} | \beta_1,\ldots,\beta_d)
```
  so the posterior of $\Delta\mathbf{w}$ depends on $\mathbf{w}$.
- To estimate $\beta_1,\ldots,\beta_d$, we evaluate the type-II likelihood by marginalising over $\Delta\mathbf{w}$ and $\mathbf{w}$
```math
P(\mathbf{t} | \mathbf{X}^{L},\beta_1,\ldots,\beta_d,\alpha_1,\ldots,\alpha_d)=\int P(\mathbf{t} | \mathbf{X}^{L}, \Delta\mathbf{w},\mathbf{w})P(\Delta\mathbf{w}|\beta_1,\ldots,\beta_d)P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d)\,\text{d}\Delta\mathbf{w}\,\text{d}\mathbf{w}
```
  which is intractable.
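A Monte Carlo treatment of such an intractable integral can be sketched as below: draw joint samples of $(\mathbf{w}, \Delta\mathbf{w})$ from their Gaussian (Laplace-approximated) posteriors and average the logistic likelihood. The moment names (`wstar`, `Sw`, `dwstar`, `Sdw`) are hypothetical placeholders for the fitted posterior means and covariances.

```julia
using LinearAlgebra

logistic(a) = 1 / (1 + exp(-a))

# Monte Carlo average of σ(xᴸ(w + Δw)) over Gaussian posteriors
# w ~ N(wstar, Sw) and Δw ~ N(dwstar, Sdw).
function mc_predictive(xL, wstar, Sw, dwstar, Sdw; S = 2000)
    Lw  = cholesky(Symmetric(Sw)).L    # for sampling w
    Ldw = cholesky(Symmetric(Sdw)).L   # for sampling Δw
    acc = 0.0
    for _ in 1:S
        w  = wstar  .+ Lw  * randn(length(wstar))
        dw = dwstar .+ Ldw * randn(length(dwstar))
        acc += logistic(dot(xL, w .+ dw))
    end
    return acc / S   # ≈ P(t = 1 | xL, D)
end
```

As the posterior covariances shrink towards zero, the average collapses onto the plug-in prediction $\sigma(\mathbf{x}^{L}(\mathbf{w}^{*}+\Delta\mathbf{w}^{*}))$, which is a useful sanity check.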
We thus draw samples from $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\ldots,\alpha_d)$ to approximate the integral, and use a Laplace approximation for the posterior of $\Delta\mathbf{w}$.

#### Prediction on the low-quality data.

- With $\alpha_1,\ldots,\alpha_d,\beta_1,\ldots,\beta_d$ estimated, we can calculate the posteriors of $\Delta\mathbf{w}$ and $\mathbf{w}$.
- For a new voxel with low-quality features $\mathbf{x}^{L}$, the predictive probability is
```math
P(t=1|\mathbf{x}^{L},\mathcal{D})=\int\sigma(\mathbf{x}^{L}\mathbf{w}+\mathbf{x}^{L}\Delta\mathbf{w})P(\Delta\mathbf{w}|\mathcal{D},\mathbf{w}, \beta_1,\ldots,\beta_d)P(\mathbf{w}|\mathcal{D}, \alpha_1,\ldots,\alpha_d)\,\text{d}\Delta\mathbf{w}\,\text{d}\mathbf{w}
```
  where $\mathcal{D}$ is the previous training data.

----

## Simulation results

We generated $\mathbf{X}^{H}$ of varying sizes (1000×500, 1000×1000, 1000×1500), together with three low-quality variants, as follows:

```julia
using StatsFuns: logistic

n = 1000  # number of samples
d = 1500  # number of features

# generate feature matrices - high quality
Xtrain, Xtest = (randn(n, d) for _ ∈ 1:2)

# low-quality variant 1 -- some columns are noisier
noise_col = rand(1:d, Int(d * 0.2))  # ~20% of the columns of XL are noisier (sampled with replacement)
XLtrain = copy(Xtrain)
XLtrain[:, noise_col] .+= randn(n, Int(d * 0.2))
XLtest = copy(Xtest)
XLtest[:, noise_col] .+= randn(n, Int(d * 0.2))
[x .= exp.(x) for x ∈ [Xtrain, Xtest, XLtrain, XLtest]]

# generate high-quality coefficients - 60% of the coefficients w are zeros
w = randn(d); w[rand(1:d, Int(d * 0.6))] .= 0.
ytrain = logistic.(Xtrain * w); ytest = logistic.(Xtest * w)
ttrain = [x > 0.5 ? 1 : 0 for x in ytrain]
ttest = [x > 0.5 ? 1 : 0 for x in ytest]

# low-quality variant 2 -- some columns are zero
# (regenerate Xtrain/Xtest as above before running this variant)
zero_col = rand(1:d, Int(d * 0.05))  # ~5% of the columns in XL are zero
[x .= exp.(x) for x ∈ [Xtrain, Xtest]]
XLtrain = copy(Xtrain)
XLtrain[:, zero_col] .= 0.
XLtest = copy(Xtest)
XLtest[:, zero_col] .= 0.

# low-quality variant 3 -- more outliers
# (regenerate Xtrain/Xtest as above before running this variant)
outlier_row = rand(1:n, Int(n * 0.01))  # ~1% of the rows are outliers
XLtrain = copy(Xtrain)
XLtrain[outlier_row, :] .+= randn(Int(n * 0.01), d)
XLtest = copy(Xtest)
XLtest[outlier_row, :] .+= randn(Int(n * 0.01), d)
[x .= exp.(x) for x ∈ [Xtrain, Xtest, XLtrain, XLtest]]
```

And we compared four methods:

- Blue: using the posterior of $\mathbf{w}$ (trained on the high-quality data) as the prior for the low-quality model.
- Red: trained on the low-quality data only.
- Orange: lasso logistic regression.
- Pale blue: trained on the low-quality data, marginalising over the posterior of $\mathbf{w}$ from the high-quality model.

accuracy
![accuracy](/figs/2021JUL01/acc.svg)

dice
![dice](/figs/2021JUL01/dice.svg)

When $d > n$, the lasso appears superior to the others.

### Panel B - with structured ARD priors (in progress).
#### On the high-quality data

Instead of using ARD priors $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_1^{-1},\ldots,\alpha_d^{-1}))$, we assume the hyperparameters have an underlying structure, e.g. $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\exp(\mathbf{u})))$, where $\mathbf{u}$ is drawn from a Gaussian process, $\mathbf{u}\sim\mathcal{N}(\mathbf{0}, \mathbf{C}_{\Theta})$, such that neighbouring features (i.e. adjoining/co-activating voxels) share similar sparsity. [(Ref)](https://proceedings.neurips.cc/paper/2014/file/f9a40a4780f5e1306c46f1c8daecee3b-Paper.pdf)

#### On the low-quality data

The low-quality coefficients have similar structured ARD priors (the exponential of a Gaussian process), which need not share hyperparameters with the priors on the high-quality coefficients. We seek to solve for the hyperparameters of the low-quality classification model, marginalising over the posteriors of the high-quality model.

### Panel C - with structured spike-and-slab priors (in progress).

#### On the high-quality data

Similarly, instead of using ARD priors, we assume the coefficients have spike-and-slab priors with latent variables $\gamma_i^{H}, i=1,2,\ldots,d$, where $\gamma_i\sim\text{Bernoulli}(\sigma(\theta))$. The hyperparameter $\theta$ can itself be a Gaussian process. [(ref)](https://ohbm.sparklespace.net/srh-2591/)

#### On the low-quality data

The low-quality coefficients have similar spike-and-slab priors to enforce sparsity.

## Assume low-quality data is a rotated (noisier) version of the high-quality data

```math
P(\mathbf{U}|\mathbf{t}, \mathbf{X}^{L}, \mathbf{w}, \beta_1,\ldots,\beta_d) \propto P(\mathbf{t}|\mathbf{X}^{L},\mathbf{U}, \mathbf{w})P(\mathbf{U} | \mathbf{w}, \beta_1,\ldots,\beta_d)
```
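The generative assumption behind this section can be illustrated as below: the low-quality data is modelled as $\mathbf{X}^{L} = \mathbf{X}^{H}\mathbf{U} + \boldsymbol{\varepsilon}$, with $\mathbf{U}$ an orthogonal (rotation) matrix. This is only a sketch of the assumed forward model, not of the inference over $\mathbf{U}$; the function name and the choice of a uniformly random rotation are my own for illustration.

```julia
using LinearAlgebra

# Generate a synthetic low-quality matrix as a random rotation of XH plus noise.
function rotated_low_quality(XH; noise = 0.1)
    d = size(XH, 2)
    F = qr(randn(d, d))
    U = Matrix(F.Q) * Diagonal(sign.(diag(F.R)))  # Haar-distributed orthogonal U
    return XH * U .+ noise .* randn(size(XH))
end
```

Because $\mathbf{U}$ is orthogonal, the noise-free low-quality data preserves the geometry (inner products and norms) of the high-quality data; only the coordinate axes of the feature space differ, which is exactly what the additional posterior over $\mathbf{U}$ has to recover.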