Hamm, J., Cao, Y., and Belkin, M. Learning privately from multiparty data. In International Conference on Machine Learning, pp. 555–563, 2016.
Abstract
In a multiparty setting, how can one train an accurate classifier satisfying DP by combining locally trained classifiers, without accessing the other parties' private data? The paper proposes to
transfer the 'knowledge' of the local classifier ensemble by first creating labeled data from auxiliary unlabeled data, and then train a global $\epsilon$-differentially private classifier.
The paper argues that most voting schemes are too sensitive, and therefore proposes a new risk weighted by class probabilities estimated from the ensemble. Relative to the non-private solution, the excess error is bounded by $O(\epsilon^{-2} M^{-2})$, where $M$ is the number of parties.
Introduction
[1] proposed obtaining a global classifier by averaging the parameters of the local classifiers, with a DP mechanism guarding against the privacy leakage caused by releasing the average. Parameter averaging is a simple and practical step and can be realized via Yao's SMC. However, it does not apply to non-numerical parameters, e.g. decision trees, or when aggregating classifiers of different types.
This paper combines the local classifiers into a global classifier in two steps. First, the locally trained classifiers are collected by a trusted entity; since directly using DP-perturbed parameters is impractical,
we use the classifier ensemble to generate (pseudo)labels for auxiliary unlabeled data, thus transferring the knowledge of the ensemble to the auxiliary data.
Second, an empirical risk minimizer is found on the labeled auxiliary data, and a DP classifier is released via output perturbation [2].
When the ensemble of classifiers generates labels for the auxiliary data, the simplest approach is majority voting. The paper shows quantitatively that this kind of training is highly sensitive to the vote of a single local classifier (does this mean the DP sensitivity is high?), which causes a large performance loss. The paper therefore proposes a new risk that is insensitive to individual votes, weighting each sample by the ensemble's confidence.
One of our main results is in Theorem 4: we can achieve $\epsilon$-differential privacy with a generalization error of $O(\epsilon^{-2} M^{-2})$ and $O(N^{-1})$ terms, relative to the expected risk of a non-private solution, where $M$ is the number of parties and $N$ is the number of samples in the auxiliary data. This result is especially useful in a scenario where there are a large number of parties with weak local classifiers, such as a group of connected smart devices with limited computing capability.
The paper's three advantages:
- flexible: local classifiers of different types can be combined
- the excess error converges quickly
- provides $\epsilon$-differential privacy to all samples of a party and not just a single sample.
Preliminary
The noise-on-parameters approach is taken directly from [2]; the difference:
One important difference of our setting to previous work is that we consider $\epsilon$-differential privacy of all samples of a party, which is much stronger than $\epsilon$-differential privacy of a single sample.
Some assumptions similar to those in [2] follow, e.g. $\Vert x \Vert \leq 1$ and $\vert l'(\cdot) \vert \leq 1$ (used in the proof of Theorem 1 below).
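Concretely, the neighbor relation here is at the party level, as also used in the proof of Theorem 1 below: $D = (S^{(1)}, \dots, S^{(M)})$ and $D' = (S'^{(1)}, S^{(2)}, \dots, S^{(M)})$ differ in one party's entire sample set, and the density of the released $w_p$ must satisfy
\[\sup_{w_p} \frac{P(w_p \mid D)}{P(w_p \mid D')} \leq e^{\epsilon}.\]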
Transferring knowledge of ensemble
Local classifiers
Assume the local classifiers are $M$ black boxes (possibly classifiers of different types), each trained on its party's local training set, with i.i.d. data. The partition of the data across parties is treated as sampling without replacement [3].
Privacy issues of direct release
In the first step, the local classifiers are collected by a trusted entity. The simplest approach would be to release all $M$ classifiers' parameters after DP perturbation. But this operation incurs a constant, $O(1)$ sensitivity, unlike releasing the average of a statistic, whose sensitivity is $O(M^{-1})$. Moreover, effective DP mechanisms are available only for specific types of classifiers [4].
Another approach is to offer the ensemble as a service that predicts on test data. If majority voting is used to predict a test sample, Report Noisy Max [5] can be used to sanitize the votes for a single query. But as the number of queries grows, the noise scales linearly with the number of queries, which is impractical.
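For concreteness, a minimal sketch of that per-query option, assuming the curator holds the per-class vote counts for one test point (the function and argument names are this sketch's, not the paper's); the conservative $\mathrm{Lap}(2/\epsilon)$ scale accounts for one party's change moving a vote between two classes.

```python
import numpy as np

def report_noisy_max_vote(class_votes, eps, rng):
    # Report Noisy Max [5]: add Laplace noise to each class's vote count
    # and release only the argmax class. Changing one party's data moves
    # at most one vote from one class to another, hence scale 2/eps.
    noisy = np.asarray(class_votes, dtype=float) + rng.laplace(scale=2.0 / eps, size=len(class_votes))
    return int(np.argmax(noisy))
```

Answering $T$ test queries this way costs $T\epsilon$ by composition, which is the linear blow-up (for a fixed total budget) noted above.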
Leveraging auxiliary data
To address these problems, the paper proposes to transfer the knowledge of the ensemble to a global classifier using auxiliary unlabeled data. Concretely, the ensemble generates labels for the auxiliary data, which is then used to train the global classifier. The amount of auxiliary data does not affect privacy (just because it is unlabeled? Presumably yes: the auxiliary features contain no party's private data, and the private data influences the release only through the sanitized classifier). On the contrary, the more auxiliary data, the closer the global classifier gets to the original ensemble, with an $O(N^{-1})$ bound. In contrast to querying the ensemble, the sanitized global classifier can answer arbitrarily many queries without further privacy loss.
Finding a global private classifier
First attempt: ERM with majority voting
The labels for the auxiliary data are generated by majority voting over the $M$ local classifiers: the label predicted by more than half of the classifiers becomes the sample's label, i.e. $v(x) = \operatorname{sign}\big(\sum_{m=1}^{M} h_m(x)\big)$ for $\pm 1$ labels.
Algorithm 1 describes this method. Eq. (6) is the majority vote; Eq. (8) is the ERM objective:
\[R_{S}^{\lambda} (w) = \frac{1}{N} \sum_{(x,v) \in S} l(h(x;w),v) + \frac{\lambda}{2} \Vert w \Vert ^2 \tag{8}\] (The paper also lists the expectation of this objective with and without the regularization term; padding? I have not yet seen where it is used...)
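To make the two steps concrete, here is a minimal sketch of Algorithm 1 as described above, not the paper's exact pseudocode: it assumes binary labels in $\{-1,+1\}$, linear classifiers with the logistic loss, $M$ odd (so the vote has no ties), and `local_classifiers` being a hypothetical list of callables that return $\pm 1$ predictions; the noise density is the one from Theorem 1 below.

```python
import numpy as np
from scipy.optimize import minimize

def sample_perturbation(d, beta, rng):
    # Draw eta with density p(eta) ∝ exp(-beta * ||eta||):
    # the norm follows Gamma(shape=d, scale=1/beta), the direction is uniform.
    r = rng.gamma(shape=d, scale=1.0 / beta)
    u = rng.normal(size=d)
    return r * u / np.linalg.norm(u)

def algorithm1_majority_vote_erm(local_classifiers, X_aux, lam, eps, rng):
    # Step 1 (Eq. 6): pseudo-label the auxiliary data by majority vote.
    # With M odd the sum is never zero, so sign() gives labels in {-1, +1}.
    votes = np.sign(sum(h(X_aux) for h in local_classifiers))
    # Step 2 (Eq. 8): regularized ERM with the logistic loss on (X_aux, votes).
    d = X_aux.shape[1]
    def risk(w):
        margins = votes * (X_aux @ w)
        return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)
    w_s = minimize(risk, np.zeros(d)).x
    # Step 3 (Theorem 1): output perturbation, p(eta) ∝ exp(-(lam*eps/2)||eta||).
    return w_s + sample_perturbation(d, lam * eps / 2.0, rng)
```

Note that the noise scale is independent of $N$; the $2/\lambda$ sensitivity derived in the proof below is exactly what the next subsection identifies as the performance problem.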
Theorem 1. The perturbed output $w_p = w_s + \eta$ from Algorithm 1 with $p(\eta) \propto e^{-\frac{\lambda \epsilon}{2} \Vert \eta \Vert}$ is $\epsilon$-differentially private.
Proof. Suppose $D = (S^{(1)}, \dots, S^{(M)})$ is the collection of the $M$ training sets and $D' = (S'^{(1)}, \dots, S^{(M)})$ is a neighboring collection. From $D$ and $D'$ compute the corresponding local classifiers $H = (h_1, \dots, h_M)$ and $H' = (h'_1, \dots, h_M)$, and generate by voting the auxiliary data $S = \{(x_i, v(x_i))\}$ and $S' = \{(x_i, v'(x_i))\}$: the same features, but possibly different labels. Let $R_S^{\lambda}(w)$ and $R_{S'}^{\lambda}(w)$ be the regularized empirical risks on $S$ and $S'$, and let $w_s$ and $w_{s'}$ be the corresponding minimizers. By Corollaries 7 and 8 of [2],
\[\Vert w_s - w_{s'} \Vert \leq \frac{1}{\lambda} \max_w \Vert \nabla g(w) \Vert,\] where $g(w)$ is the risk difference $R_S^{\lambda}(w) - R_{S'}^{\lambda}(w)$. Here,
\[\Vert \nabla g(w) \Vert \leq \frac{1}{N} \sum_{i=1}^{N} \Vert v(x_i)\, x_i\, l'(v(x_i) w^T x_i) - v'(x_i)\, x_i\, l'(v'(x_i) w^T x_i) \Vert\] \[\leq \frac{1}{N} \sum_{i=1}^{N} \Vert x_i \Vert \times \vert l'(w^Tx_i) + l'(-w^Tx_i) \vert , \tag{*}\] recalling the assumptions $\Vert x \Vert \leq 1$ and $\vert l'(\cdot) \vert \leq 1$. The worst case is that every pair $v(x_i)$, $v'(x_i)$ differs, so the RHS of (*) is bounded by 2. Hence the $L_2$ sensitivity of $w_s$ is
\[\max_{S,S'} \Vert w_s - w_{s'} \Vert \leq \frac{2}{\lambda},\] which completes the proof.
Performance issues of majority voting
Now analyze the error of majority-voted ERM. It is bounded mainly by two terms: the added noise, and the gap between the expected and the empirical risk. The sensitivity of majority-voted ERM is $\frac{2}{\lambda}$, whereas for standard ERM it is $\frac{2}{N \lambda}$; the corresponding error bound is
\[R(w_p) \leq R(w_0) + O(\epsilon^{-2}) + O(N^{-1})\] with high probability. When $\epsilon$ is small, however, the constant gap $O(\epsilon^{-2})$ becomes large and successful learning cannot be guaranteed. The cause is the worst case of multiparty voting: if $M-1$ of the votes are split evenly ($M$ odd), then in the worst case the remaining local classifier flips, every label of the auxiliary data flips with it, and the resulting global classifier is completely different.
Better yet: weighted ERM with soft labels
Let $\alpha(x)$ be the fraction of the $M$ classifiers that classify sample $x$ as positive, and introduce the weighted loss:
\[l^{\alpha} (\cdot) = \alpha(x)\, l(w^T x) + (1-\alpha(x))\, l(-w^Tx).\] If all $M$ votes are positive or all are negative, this loss coincides with the original loss; when the positive and negative votes are evenly split, the loss is no longer as sensitive to a single vote as before. In particular,
a single vote can change $l^{\alpha}(\cdot)$ only by a factor of $\frac{1}{M}$.
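A one-line check of that claim, directly from the definition of $l^{\alpha}$ above: one flipped vote changes $\alpha(x)$ by exactly $\frac{1}{M}$, so
\[l^{\alpha + \frac{1}{M}} (\cdot) - l^{\alpha} (\cdot) = \frac{1}{M} \left( l(w^T x) - l(-w^T x) \right),\] whereas under majority voting a single vote can flip the label and swap $l(w^T x)$ for $l(-w^T x)$ entirely.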
Lemma 2. For any $w$, the expectation of the weighted loss is asymptotically the expectation of the unweighted loss:
\[\lim_{M \rightarrow \infty} E_x[l^{\alpha}(w)] = E_{x,v}[l(w^Txv)]\]
Lemma 2 shows that the expectation of the weighted loss asymptotically equals the expectation of the standard loss, when the target $v$ is a probabilistic concept drawn from $P(h(x) = v)$ of a random hypothesis, as opposed to the deterministic concept $v(x)$ from majority voting.
Output perturbation is then used, as before, to learn a classifier satisfying DP; see Algorithm 2 (a sketch follows Theorem 3 below).
Privacy and performance
Theorem 3. The perturbed output $w_p = w_s + \eta$ from Algorithm 2 with $p(\eta) \propto e^{-\frac{M \lambda \epsilon}{2} \Vert \eta \Vert}$ is $\epsilon$-differentially private.
In other words, the same $\epsilon$-DP guarantee is achieved with only $1/M$ of the noise magnitude.
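For comparison, a minimal sketch of Algorithm 2 in the same hypothetical setting, reusing `sample_perturbation` and the conventions of the Algorithm 1 sketch above; the two changes are the soft labels $\alpha(x)$ and the noise scale from Theorem 3.

```python
def algorithm2_weighted_erm(local_classifiers, X_aux, lam, eps, rng):
    # Soft labels: alpha(x) = fraction of the M local classifiers voting +1.
    M = len(local_classifiers)
    alpha = np.mean([h(X_aux) > 0 for h in local_classifiers], axis=0)
    d = X_aux.shape[1]
    def weighted_risk(w):
        z = X_aux @ w
        # l_alpha = alpha * l(w^T x) + (1 - alpha) * l(-w^T x), logistic l.
        loss = alpha * np.logaddexp(0.0, -z) + (1.0 - alpha) * np.logaddexp(0.0, z)
        return np.mean(loss) + 0.5 * lam * (w @ w)
    w_s = minimize(weighted_risk, np.zeros(d)).x
    # Theorem 3: p(eta) ∝ exp(-(M*lam*eps/2)||eta||), 1/M the noise of Algorithm 1.
    return w_s + sample_perturbation(d, M * lam * eps / 2.0, rng)
```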
Related work
Related: parameter averaging through SMC [1]; private exchange of gradient information to minimize empirical risk incrementally [6,7]. This work is motivated by [1] but aggregates the local classifiers in an entirely different way.
We use an ensemble approach and average the classifier decisions[8] instead of parameters, which makes our approach applicable to arbitrary and mixed classifier types.
(Side note: the "ensemble" here is just the collection $\{h_1, h_2, \cdots, h_M\}$, which is why members of different types are fine; it is used mainly to vote on unlabeled data, in a semi-supervised fashion [4,9].)
Experiments
Three datasets are used: activity recognition, network intrusion detection, and malicious URL detection. Algorithms 1 and 2 are compared against the parameter averaging of [1], the individual local classifiers, and a non-private global classifier. Algorithm 1 performs rather poorly; Algorithm 2 outperforms the method of [1].
Reference
[1] Pathak, Manas, Rane, Shantanu, and Raj, Bhiksha. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, pp. 1876–1884, 2010.
[2] Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. The Journal of Machine Learning Research, 12:1069–1109, 2011.
[3] Politis, D. N., Romano, J. P., and Wolf, M. Subsampling. Springer Verlag, 1999.
[4] Ji, Zhanglong, Jiang, Xiaoqian, Wang, Shuang, Xiong, Li, and Ohno-Machado, Lucila. Differentially private distributed logistic regression using private and public data. BMC medical genomics, 7(Suppl 1):S14, 2014.
[5] Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[6] Rajkumar, Arun and Agarwal, Shivani. A differentially private stochastic gradient descent algorithm for multiparty classification. In International Conference on Artificial Intelligence and Statistics, pp. 933–941, 2012.
[7] Hamm, Jihun, Champion, Adam, Chen, Guoxing, Belkin, Mikhail, and Xuan, Dong. Crowd-ML: A privacy-preserving learning framework for a crowd of smart devices. In Proceedings of the 35th IEEE International Conference on Distributed Computing Systems (ICDCS). IEEE, 2015.
[8] Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[9] Jagannathan, Geetha, Monteleoni, Claire, and Pillaipakkamnatt, Krishnan. A semi-supervised learning approach to differential privacy. In Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, pp. 841–848. IEEE, 2013.