## Fusion of high-quality and low-quality classification models
### Graphical models
![diagram1](/figs/2021JUL01/diagram-20210630-3.png)
##### An alternative formulation
![diagram2](/figs/2021JUL01/diagram-20210702-01.png)
### Panel A - (the most basic) model formulation (with classical ARD priors)
The high-quality classification model is a (logistic) regression with ARD priors. The low-quality model is trained by marginalising over the posterior distribution of the high-quality coefficients $`\mathbf{w}^{H}`$ to obtain (the distribution of) a set of low-quality coefficients, likewise with ARD priors.
#### On the high quality data.
- Suppose $`\mathbf{X}^{H}`$ is the $`v\times d`$ feature matrix (e.g. connectivity profiles of $`v`$ voxels), $`\mathbf{t}`$ is the $`v\times 1`$ vector of labels (0-1 variables), $`\mathbf{w}`$ is the $`d\times 1`$ coefficient vector, and $`\mathbf{y}=\sigma(\mathbf{X}^{H}\mathbf{w})`$ gives the probability for each class.
- Here we adopt the Relevance Vector Machine (RVM) with an ARD prior to find $`\mathbf{w}`$. Suppose $`\mathbf{w}`$ has the prior distribution $`\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_{i}^{-1}))`$. We hope $`\alpha_{i}`$ is driven to infinity when the associated feature is useless for prediction, effectively pruning that feature.
- The posterior distribution $`P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d)`$ can be found by maximising 
```math
P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d) \propto P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w}, \alpha_1,...\alpha_d)P(\mathbf{w}|\alpha_1,...\alpha_d)
```
using the Newton-Raphson algorithm. Suppose the resulting Laplace approximation is $`P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d)\approx\mathcal{N}(\mathbf{w}^{*}, \mathbf{H}^{-1})`$, where $`\mathbf{w}^{*}`$ is the posterior mode and $`\mathbf{H}`$ is the Hessian of the negative log-posterior at $`\mathbf{w}^{*}`$. By marginalising over $`\mathbf{w}`$, we can then find $`\alpha_1,...\alpha_d`$ by maximising the type-II likelihood (evidence)
```math
P(\mathbf{t}|\mathbf{X}^{H}, \alpha_1,...\alpha_d) = \int P(\mathbf{t}|\mathbf{X}^{H}, \mathbf{w})P(\mathbf{w}|\alpha_1,...\alpha_d) \text{d}\mathbf{w}
```
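A minimal sketch of this training loop in Julia (assuming a Newton-Raphson inner loop for the posterior mode and MacKay-style evidence updates for the $`\alpha_i`$; function and parameter names are illustrative, not the implementation used here):
```julia
using LinearAlgebra
using StatsFuns: logistic

# Sketch: RVM-style logistic regression with ARD priors on the high-quality data.
function fit_rvm_ard(X, t; n_outer=50, n_newton=25, α_max=1e6)
    v, d = size(X)
    α = ones(d)                                   # ARD precisions
    w = zeros(d)
    Σ = Matrix{Float64}(I, d, d)
    for _ in 1:n_outer
        # Newton-Raphson for the posterior mode w* given the current α
        for _ in 1:n_newton
            y = logistic.(X * w)
            g = X' * (t .- y) .- α .* w                      # gradient of log-posterior
            H = X' * ((y .* (1 .- y)) .* X) + Diagonal(α)    # negative Hessian
            w += H \ g
        end
        # Laplace approximation: posterior ≈ N(w*, H⁻¹)
        y = logistic.(X * w)
        H = X' * ((y .* (1 .- y)) .* X) + Diagonal(α)
        Σ = inv(Symmetric(H))
        # MacKay evidence update of each α_i; a large α_i effectively prunes feature i
        γ = 1 .- α .* diag(Σ)
        α = clamp.(γ ./ (w .^ 2 .+ eps()), 0.0, α_max)
    end
    return w, Σ, α   # posterior mode, posterior covariance, ARD precisions
end
```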
#### On the low quality data.
- Suppose $`\mathbf{X}^{L}`$ holds the connectivity profiles of the voxels in the low-quality image. Here we seek to predict $`\mathbf{t}`$ using $`\mathbf{X}^{L}`$, aided by the high-quality training. We assume $`\mathbf{X}^{L}`$ and $`\mathbf{X}^{H}`$ share the same set of $`\mathbf{t}`$ and $`\mathbf{y}`$.
- Unlike the high-quality model, we assume $`\mathbf{y}=\sigma(\mathbf{X}^{L}\mathbf{w} + \mathbf{X}^{L}\Delta\mathbf{w})`$, where the posterior distribution of $`\mathbf{w}`$ is derived from training on the high-quality data, i.e. $`P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d)`$, and the additional weights $`\Delta\mathbf{w}`$ have a prior distribution $`\mathcal{N}(\mathbf{0}, \text{diag}(\beta_{i}^{-1}))`$. These additional weights adapt the model towards the low-quality features.
- Similarly, we find the posterior mode of $`\Delta\mathbf{w}`$ by maximising
```math
P(\Delta\mathbf{w}|\mathbf{t}, \mathbf{X}^{L}, \mathbf{w}, \beta_1,...\beta_d) \propto P(\mathbf{t}|\mathbf{X}^{L},\Delta\mathbf{w}, \mathbf{w})P(\Delta\mathbf{w} | \beta_1,...\beta_d)
```
Thus the posterior of $`\Delta\mathbf{w}`$ depends on $`\mathbf{w}`$. 
- To estimate $`\beta_1, ...,\beta_d`$, we maximise the type-II likelihood obtained by marginalising over $`\Delta\mathbf{w}`$ and $`\mathbf{w}`$
```math
P(\mathbf{t} | \mathbf{X}^{L},\beta_1,...\beta_d,\alpha_1,...\alpha_d)=\int P(\mathbf{t} | \mathbf{X}^{L}, \Delta\mathbf{w},\mathbf{w})P(\Delta\mathbf{w}|\beta_1,...\beta_d)P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d)\text{d}\Delta\mathbf{w}\text{d}\mathbf{w}
```
which is intractable. We therefore draw samples from $`P(\mathbf{w}|\mathbf{X}^{H}, \mathbf{t}, \alpha_1,...\alpha_d)`$ to approximate the integral over $`\mathbf{w}`$, and use a Laplace approximation for the posterior of $`\Delta\mathbf{w}`$.
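A hedged sketch of this sampling step, assuming `wstar`/`Σw` are the posterior mode and covariance from a high-quality fit such as the sketch above, and keeping the $`\beta_i`$ fixed for simplicity (their evidence update would wrap around this loop):
```julia
using LinearAlgebra
using Distributions: MvNormal
using StatsFuns: logistic

# Sketch: approximate the Δw posterior by sampling w from the high-quality
# posterior N(wstar, Σw) and Laplace-approximating Δw for each sample.
function fit_delta_w(XL, t, wstar, Σw; β=ones(length(wstar)), n_samples=20, n_newton=25)
    d = length(wstar)
    wdist = MvNormal(wstar, Symmetric(Σw))
    w_samples  = zeros(d, n_samples)
    Δw_samples = zeros(d, n_samples)
    for s in 1:n_samples
        w = rand(wdist)                            # w ~ P(w | X^H, t, α)
        Δw = zeros(d)
        for _ in 1:n_newton                        # Newton-Raphson for the MAP Δw
            y = logistic.(XL * (w .+ Δw))
            g = XL' * (t .- y) .- β .* Δw
            H = XL' * ((y .* (1 .- y)) .* XL) + Diagonal(β)
            Δw += H \ g
        end
        w_samples[:, s]  = w
        Δw_samples[:, s] = Δw
    end
    return w_samples, Δw_samples   # Monte-Carlo representation of the posteriors
end
```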

#### Prediction on the low quality data.
- With $`\alpha_1,...\alpha_d,\beta_1,...\beta_d`$ estimated, we can compute the (approximate) posteriors of $`\Delta\mathbf{w}`$ and $`\mathbf{w}`$.
- For a new voxel with low-quality features $`\mathbf{x}^{L}`$, the predictive probability is
```math
P(t=1|\mathbf{x}^{L}...)=\int\sigma(\mathbf{x}^{L}\mathbf{w}+\mathbf{x}^{L}\Delta\mathbf{w})P(\Delta\mathbf{w}|\mathcal{D},\mathbf{w}, \beta_1,...\beta_d)P(\mathbf{w}|\mathcal{D}, \alpha_1,...\alpha_d)\text{d}\Delta\mathbf{w}\text{d}\mathbf{w}
```
where $`\mathcal{D}`$ denotes the training data.
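A minimal sketch of this Monte-Carlo predictive average, reusing the hypothetical `w_samples` and `Δw_samples` from the sketch above:
```julia
using Statistics: mean
using LinearAlgebra: dot
using StatsFuns: logistic

# Sketch: P(t = 1 | x^L) ≈ average of σ(x'w + x'Δw) over posterior samples.
predict_prob(x, w_samples, Δw_samples) =
    mean(logistic(dot(x, w) + dot(x, Δw))
         for (w, Δw) in zip(eachcol(w_samples), eachcol(Δw_samples)))
```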

----
## Simulation results
We generated $`\mathbf{X}^{H}`$ of varying sizes (1000×500, 1000×1000, 1000×1500), as follows:
```julia
using StatsFuns: logistic   # for logistic.() below

n = 1000 # number of samples
d = 1500 # number of features

# generate feature matrices - high quality
Xtrain, Xtest = (randn(n, d) for _ in 1:2)
# low quality, scenario 1 -- some columns are noisier
noise_col = rand(1:d, Int(d * 0.2)) # 20% of the columns of XL are noisier
XLtrain = copy(Xtrain)
XLtrain[:, noise_col] .+= randn(n, Int(d * 0.2))
XLtest = copy(Xtest)
XLtest[:, noise_col] .+= randn(n, Int(d * 0.2))
[x .= exp.(x) for x in [Xtrain, Xtest, XLtrain, XLtest]]

# generate high-quality coefficients - roughly 60% of the entries of w are set to zero
w = randn(d); w[rand(1:d, Int(d * 0.6))] .= 0.0

ytrain = logistic.(Xtrain * w); ytest = logistic.(Xtest * w)
ttrain = [x > 0.5 ? 1 : 0 for x in ytrain]
ttest = [x > 0.5 ? 1 : 0 for x in ytest]

# low quality, scenario 2 -- some columns are zero
zero_col = rand(1:d, Int(d * 0.05)) # 5% of the columns in XL are zero
[x .= exp.(x) for x in [Xtrain, Xtest]]
XLtrain = copy(Xtrain)
XLtrain[:, zero_col] .= 0.
XLtest = copy(Xtest)
XLtest[:, zero_col] .= 0.

# low quality, scenario 3 -- more outlier rows
outlier_row = rand(1:n, Int(n * 0.01)) # 1% of the rows are outliers
XLtrain = copy(Xtrain)
XLtrain[outlier_row, :] .+= randn(Int(n * 0.01), d)
XLtest = copy(Xtest)
XLtest[outlier_row, :] .+= randn(Int(n * 0.01), d)
[x .= exp.(x) for x in [Xtrain, Xtest, XLtrain, XLtest]]
```
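For reference, a short sketch of how the two reported metrics could be computed from 0-1 predictions (Dice here is the standard binary overlap $`2TP/(2TP+FP+FN)`$; the metric code below is illustrative, not the exact scripts used):
```julia
using Statistics: mean

accuracy(pred, truth) = mean(pred .== truth)

function dice(pred, truth)
    tp = sum((pred .== 1) .& (truth .== 1))
    fp = sum((pred .== 1) .& (truth .== 0))
    fn = sum((pred .== 0) .& (truth .== 1))
    return 2tp / (2tp + fp + fn)
end
```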

We compared the following methods:
- Blue: using the posterior of $`\mathbf{w}`$ (trained on the high-quality data) as the prior for the low-quality model.
- Red: trained on low-quality data only.
- Orange: Lasso logistic regression.
- Pale blue: trained on low-quality data, marginalising over the posterior of $`\mathbf{w}`$ from the high-quality model (the formulation above).

Accuracy:
![accuracy](/figs/2021JUL01/acc.svg)

Dice:
![dice](/figs/2021JUL01/dice.svg)

When $`d > n`$, Lasso appears superior to the others.

### Panel B - with structured ARD priors (in progress).
#### On the high quality data
Instead of using the ARD priors $`\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\alpha_1^{-1},...\alpha_d^{-1}))`$, we assume the hyperparameters have an underlying structure, e.g. $`\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \text{diag}(\exp(\mathbf{u})))`$, where $`\mathbf{u}`$ is drawn from a Gaussian Process, $`\mathbf{u}\sim\mathcal{N}(\mathbf{0}, \mathbf{C}_{\Theta})`$, so that neighbouring features (i.e., adjoining/co-activating voxels) share similar sparsity. [(Ref)](https://proceedings.neurips.cc/paper/2014/file/f9a40a4780f5e1306c46f1c8daecee3b-Paper.pdf)
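A hedged sketch of what drawing from such a structured prior could look like, assuming hypothetical feature coordinates `coords` (one row per feature) and a squared-exponential kernel for $`\mathbf{C}_{\Theta}`$:
```julia
using LinearAlgebra

# Sketch: draw w from the structured ARD prior w ~ N(0, diag(exp(u))),
# where the log-variances u follow a GP over feature locations.
function sample_structured_ard_w(coords; ℓ=2.0, σu=1.0, jitter=1e-6)
    d = size(coords, 1)
    C = [σu^2 * exp(-norm(coords[i, :] .- coords[j, :])^2 / (2ℓ^2))
         for i in 1:d, j in 1:d] + jitter * I        # GP covariance C_Θ
    u = cholesky(Symmetric(C)).L * randn(d)          # u ~ N(0, C_Θ)
    w = sqrt.(exp.(u)) .* randn(d)                   # w_i ~ N(0, exp(u_i))
    return w, u
end
```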
#### On the low quality data
The low-quality coefficients have similarly structured ARD priors (the exponential of a Gaussian Process) that need not share the same hyperparameters as the high-quality coefficients' priors. We seek to estimate the hyperparameters of the low-quality classification model while marginalising over the posteriors of the high-quality model.

### Panel C - with structured spike-and-slab priors (in progress).
#### On the high quality data
Similarly, instead of using ARD priors, we assume the coefficients have spike-and-slab priors with latent variables $`\gamma_i^{H}, i=1,2,...d`$, where $`\gamma_i^{H}\sim\text{Bernoulli}(\sigma(\theta_i))`$. The hyperparameters $`\theta_i`$ can themselves be given a Gaussian Process prior. [(ref)](https://ohbm.sparklespace.net/srh-2591/)
#### On the low quality data
The low-quality coefficients have similar spike-and-slab priors to enforce sparsity.
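A minimal sketch of drawing coefficients from such a spike-and-slab prior, where `θ` stands for the (possibly GP-distributed) logits; names are illustrative:
```julia
using StatsFuns: logistic

# Sketch: spike-and-slab draw — γ_i ~ Bernoulli(σ(θ_i)), w_i = 0 when γ_i = 0,
# otherwise w_i is drawn from the Gaussian slab.
function sample_spike_slab_w(θ; slab_sd=1.0)
    γ = rand(length(θ)) .< logistic.(θ)
    w = γ .* (slab_sd .* randn(length(θ)))
    return w, γ
end
```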


## Assuming low-quality data is a rotated (noisier) version of the high-quality data
Analogously to $`\Delta\mathbf{w}`$ above, the posterior of the transformation $`\mathbf{U}`$ is
```math
P(\mathbf{U}|\mathbf{t}, \mathbf{X}^{L}, \mathbf{w}, \beta_1,...\beta_d) \propto P(\mathbf{t}|\mathbf{X}^{L},\mathbf{U}, \mathbf{w})P(\mathbf{U} | \mathbf{w}, \beta_1,...\beta_d)
```