## Fusion of high-quality and low-quality classification models

### Graphical models

![diagram1](/figs/2021JUL01/diagram-20210630-2.png)

### Panel A - (the most basic) model formulation (with classical ARD priors)

The model for high-quality data classification follows a regression form with ARD priors. The low-quality model is trained by marginalising over the posterior distribution of the high-quality coefficients $\mathbf{w}^{H}$ to give (the distribution of) a set of low-quality coefficients, likewise with ARD priors.

#### On the high-quality data

- Suppose $\mathbf{X}^{H}$ is the $v\times d$ feature matrix (e.g. connectivity profiles of $v$ voxels), $\mathbf{t}$ is the $v\times 1$ vector of labels (0-1 variables), $\mathbf{w}$ is the $d\times 1$ vector of coefficients, and $\mathbf{y}=\sigma(\mathbf{X}^{H}\mathbf{w})$ gives the probability of class 1 for each voxel.
- We adopt the Relevance Vector Machine (RVM) with an ARD prior to find $\mathbf{w}$: we place the prior $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_{i}^{-1}))$ and hope that $\alpha_{i}$ is driven to infinity whenever the associated feature is useless for prediction.
- The mode of the posterior distribution $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d)$ can be found with the Newton-Raphson algorithm by maximising

  ```math
  P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d) \propto P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w})\,P(\mathbf{w}|\alpha_1,\dots,\alpha_d).
  ```

  The Laplace approximation at the mode gives $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d)\approx\mathcal{N}(\mathbf{w}^{*}, \mathbf{H}^{-1})$. We can then find $\alpha_1,\dots,\alpha_d$ by maximising the type-II likelihood (evidence), obtained by marginalising over $\mathbf{w}$ (again via the Laplace approximation):

  ```math
  P(\mathbf{t}|\mathbf{X}^{H}, \alpha_1,\dots,\alpha_d) = \int P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w})\,P(\mathbf{w}|\alpha_1,\dots,\alpha_d)\,\text{d}\mathbf{w}.
  ```

#### On the low-quality data

- Suppose $\mathbf{X}^{L}$ holds the connectivity profiles of the voxels in the low-quality image. Here we seek to predict $\mathbf{t}$ from $\mathbf{X}^{L}$, aided by the high-quality training. We assume $\mathbf{X}^{L}$ and $\mathbf{X}^{H}$ share the same $\mathbf{t}$ and $\mathbf{y}$.
- Unlike the high-quality model, we assume $\mathbf{y}=\sigma(\mathbf{X}^{L}\mathbf{w} + \mathbf{X}^{L}\Delta\mathbf{w})$, where the distribution of $\mathbf{w}$ is the posterior derived from training on the high-quality data, $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d)$, and the additional weights $\Delta\mathbf{w}$ have the prior $\mathcal{N}(\mathbf{0}, \text{diag}(\beta_{i}^{-1}))$. These additional weights adapt the model to the low-quality features.
- Similarly, we find the posterior of $\Delta\mathbf{w}$ by maximising

  ```math
  P(\Delta\mathbf{w}|\mathbf{t}, \mathbf{X}^{L}, \mathbf{w}, \beta_1,\dots,\beta_d) \propto P(\mathbf{t}|\mathbf{X}^{L},\Delta\mathbf{w}, \mathbf{w})\,P(\Delta\mathbf{w} | \beta_1,\dots,\beta_d),
  ```

  so the posterior of $\Delta\mathbf{w}$ depends on $\mathbf{w}$; see the Newton-Raphson/Laplace sketch below.
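To make the Laplace step concrete, here is a minimal Julia sketch of the Newton-Raphson (IRLS) update used for both $\mathbf{w}$ and $\Delta\mathbf{w}$. The function name `laplace_map`, its interface, and the `offset` keyword (which carries the fixed $\mathbf{X}^{L}\mathbf{w}$ term when fitting $\Delta\mathbf{w}$) are illustrative assumptions, not the code used for the experiments below.

```julia
using LinearAlgebra
using StatsFuns: logistic

# Newton-Raphson (IRLS) search for the mode of the log posterior of logistic
# regression coefficients under a Gaussian prior N(0, diag(a .^ -1)). The
# returned Hessian H of the negative log posterior defines the Laplace
# approximation N(w*, inv(H)). For the high-quality model, a = α and
# offset = 0; for Δw, a = β and offset = X^L * w, with w drawn from its
# high-quality posterior.
function laplace_map(X, t, a; offset = zeros(size(X, 1)), iters = 100, tol = 1e-8)
    d = size(X, 2)
    w = zeros(d)
    H = Matrix(Diagonal(a))                 # overwritten in the loop
    for _ in 1:iters
        y = logistic.(X * w .+ offset)
        g = X' * (y .- t) .+ a .* w         # gradient of -log posterior
        H = X' * Diagonal(y .* (1 .- y)) * X + Diagonal(a)
        Δ = H \ g
        w -= Δ
        norm(Δ) < tol && break
    end
    return w, H                             # Laplace approximation: N(w, inv(H))
end
```

Under these assumptions, fitting $\Delta\mathbf{w}$ for one posterior sample of $\mathbf{w}$ amounts to `laplace_map(XL, t, β; offset = XL * w)`.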
- To estimate $\beta_1,\dots,\beta_d$, we maximise the type-II likelihood obtained by marginalising over both $\Delta\mathbf{w}$ and $\mathbf{w}$:

  ```math
  P(\mathbf{t} | \mathbf{X}^{L},\beta_1,\dots,\beta_d,\alpha_1,\dots,\alpha_d)=\iint P(\mathbf{t} | \mathbf{X}^{L}, \Delta\mathbf{w},\mathbf{w})\,P(\Delta\mathbf{w}|\beta_1,\dots,\beta_d)\,P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d)\,\text{d}\Delta\mathbf{w}\,\text{d}\mathbf{w},
  ```

  which is intractable. We therefore draw samples from $P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,\dots,\alpha_d)$ to approximate the integral over $\mathbf{w}$, and use a Laplace approximation for the posterior of $\Delta\mathbf{w}$.

#### Prediction on the low-quality data

- With $\alpha_1,\dots,\alpha_d$ and $\beta_1,\dots,\beta_d$ estimated, we can compute the posteriors of $\Delta\mathbf{w}$ and $\mathbf{w}$.
- For a new voxel with low-quality features $\mathbf{x}^{L}$, the predictive probability is

  ```math
  P(t=1|\mathbf{x}^{L},\mathcal{D})=\iint\sigma(\mathbf{x}^{L}\mathbf{w}+\mathbf{x}^{L}\Delta\mathbf{w})\,P(\Delta\mathbf{w}|\mathcal{D},\mathbf{w}, \beta_1,\dots,\beta_d)\,P(\mathbf{w}|\mathcal{D}, \alpha_1,\dots,\alpha_d)\,\text{d}\Delta\mathbf{w}\,\text{d}\mathbf{w},
  ```

  where $\mathcal{D}$ denotes the training data.

----

## Simulation results

We generated $\mathbf{X}^{H}$ of varying sizes (1000×100, 1000×500, 1000×1000, 1000×1500) using the script below. The three low-quality corruption scenarios are alternatives, applied one at a time.

```julia
using StatsFuns: logistic   # import missing from the original snippet

n = 1000   # number of samples
d = 1500   # number of features (also run with d = 100, 500, 1000)

# generate feature matrices - high quality
Xtrain, Xtest = (randn(n, d) for _ ∈ 1:2)

# --- low quality, scenario 1: 20% of the columns are noisier ---
noise_col = rand(1:d, Int(d * 0.2))
XLtrain = copy(Xtrain); XLtrain[:, noise_col] .+= randn(n, Int(d * 0.2))
XLtest  = copy(Xtest);  XLtest[:, noise_col]  .+= randn(n, Int(d * 0.2))
[x .= exp.(x) for x ∈ [Xtrain, Xtest, XLtrain, XLtest]]

# generate high-quality coefficients - ~60% of the coefficients are zero
# (indices sampled with replacement, so slightly fewer unique zeros)
w = randn(d); w[rand(1:d, Int(d * 0.6))] .= 0.0
ytrain = logistic.(Xtrain * w); ytest = logistic.(Xtest * w)
ttrain = [x > 0.5 ? 1 : 0 for x in ytrain]
ttest  = [x > 0.5 ? 1 : 0 for x in ytest]

# --- low quality, scenario 2: 5% of the columns are zero ---
zero_col = rand(1:d, Int(d * 0.05))
[x .= exp.(x) for x ∈ [Xtrain, Xtest]]
XLtrain = copy(Xtrain); XLtrain[:, zero_col] .= 0.0
XLtest  = copy(Xtest);  XLtest[:, zero_col]  .= 0.0

# --- low quality, scenario 3: 1% of the rows are outliers ---
outlier_row = rand(1:n, Int(n * 0.01))   # row indices must lie in 1:n, not 1:d
XLtrain = copy(Xtrain); XLtrain[outlier_row, :] .+= randn(Int(n * 0.01), d)
XLtest  = copy(Xtest);  XLtest[outlier_row, :]  .+= randn(Int(n * 0.01), d)
[x .= exp.(x) for x ∈ [Xtrain, Xtest, XLtrain, XLtest]]
```

We compared three methods:

- Blue: high-quality + low-quality training
- Red: trained on low-quality data only
- Orange: Lasso logistic regression

![accuracy](/figs/accuracy.svg)

![dice](/figs/dice.svg)

When $d \gg n$, Lasso appears superior to the others.

### Panel B - with structured ARD priors (in progress)

#### On the high-quality data

Instead of using ARD priors $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_{i}^{-1}))$, we assume the hyperparameters have an underlying structure, e.g. $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\exp(\mathbf{u})))$, where $\mathbf{u}$ is a Gaussian process, $\mathbf{u}\sim\mathcal{N}(\mathbf{0}, \mathbf{C}_{\Theta})$, so that neighbouring features (i.e. adjoining/co-activating voxels) share similar sparsity [(Ref)](https://proceedings.neurips.cc/paper/2014/file/f9a40a4780f5e1306c46f1c8daecee3b-Paper.pdf).
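To make the structured prior concrete, the following sketch draws $\mathbf{u}\sim\mathcal{N}(\mathbf{0},\mathbf{C}_{\Theta})$ and then $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\text{diag}(\exp(\mathbf{u})))$. Placing the features on a 1-D lattice and using an RBF kernel are purely illustrative choices; in the real model the positions would reflect voxel adjacency or co-activation.

```julia
using LinearAlgebra

# Sample from the structured ARD prior: u ~ N(0, C_Θ) is a GP over feature
# positions, and w_i ~ N(0, exp(u_i)), so neighbouring features get similar
# prior variances (similar sparsity).
function structured_ard_sample(d; ρ = 5.0, σ² = 1.0, jitter = 1e-6)
    pos = 1:d                                              # illustrative 1-D feature positions
    C = [σ² * exp(-(pos[i] - pos[j])^2 / (2ρ^2)) for i in 1:d, j in 1:d]
    u = cholesky(Symmetric(C + jitter * I)).L * randn(d)   # u ~ N(0, C_Θ)
    w = sqrt.(exp.(u)) .* randn(d)                         # w ~ N(0, diag(exp(u)))
    return u, w
end
```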
#### On the low-quality data

The low-quality coefficients have similar structured ARD priors (the exponential of a Gaussian process), which need not share the same hyperparameters as the priors on the high-quality coefficients. We seek to solve for the hyperparameters of the low-quality classification model, marginalising over the posteriors of the high-quality model.

### Panel C - with structured spike-and-slab priors (in progress)

#### On the high-quality data

Similarly, instead of using ARD priors, we assume the coefficients have spike-and-slab priors with latent variables $\gamma_i^{H}, i=1,2,\dots,d$, where $\gamma_i\sim\text{Bernoulli}(\sigma(\theta_i))$. The hyperparameter $\boldsymbol{\theta}$ can itself be a Gaussian process [(ref)](https://ohbm.sparklespace.net/srh-2591/).

#### On the low-quality data

The low-quality coefficients have similar spike-and-slab priors to enforce sparsity; a sampling sketch is given below.
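A minimal sketch of drawing coefficients from such a prior, assuming a per-feature $\theta_i$ (e.g. one GP draw, as in Panel B) and a unit-variance slab; the function name and defaults are illustrative.

```julia
using StatsFuns: logistic

# Draw w from a spike-and-slab prior: γ_i ~ Bernoulli(σ(θ_i)) selects between
# an exact zero (spike) and a Gaussian slab N(0, σ_slab^2). θ can itself come
# from a Gaussian process, as in Panel B.
function spike_slab_sample(θ; σ_slab = 1.0)
    γ = rand(length(θ)) .< logistic.(θ)     # inclusion indicators
    w = γ .* (σ_slab .* randn(length(θ)))   # exactly zero where γ_i = 0
    return γ, w
end
```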