, , , , , ,

[SOLVED] Csci 4260/6260 assignment 3: anonymization and differential privacy

$25

File Name: Csci_4260_6260_assignment_3__anonymization_and_differential_privacy.zip
File Size: 631.14 KB

5/5 - (1 vote)

1 Sensitivity Calculation
Problem 1. (30 pts) Consider a dataset D = {x1, x2, . . . , xn} consisting of n elements, each of
which is drawn from the universe U = {1, 2, . . . , 100}. Compute the sensitivity of the following query
functions and provide an example neighboring dataset pair that yields the highest sensitivity, i.e.,
give an example of (D, D′
).
1
(1) q(D) = |{x : x > p , x ∈ D}|, i.e., the number of elements greater than p. Compute the
sensitivity of q under two different definitions of neighboring datasets we learned in class (i.e.,
one under the proper subset definition and the other under the replacement definition). You
need to describe your logic or reasoning in detail.
(2) q(D) = Pn
i=1 xi
Compute the sensitivity of q under two different definitions of neighboring datasets.
(3) q(D) = median({x1, x2, . . . , xn})
For this problem, use the replacement definition and assume n is an odd number.
2 Variance of Noisy Answer
Problem 2. (10 pts) Let’s quickly check your understanding on the Laplace random variable. Let
q : X
n → R be a query function whose sensitivity is 1 (for example, the count query). We know
that the Laplace mechanism M generates its private answer as shown below satisfies ϵ-differential
privacy.
M(D) = q(D) + Y
noise
,
where Y ∼ Lap
0, λ =
1
ϵ

.
(1) Show that M(D) is an unbiased estimate of q(D).
(2) Suppose that we split the total privacy budget ϵ into two chunks, [ ϵ
2
,
ϵ
2
], and obtain
R1 = M(D) = q(D) + Lap 
0, λ =
2
ϵ

R2 = M(D) = q(D) + Lap 
0, λ =
2
ϵ

What is the variance of R1 + R2, i.e., Var(R1 + R2)?
(3) Now consider splitting the privacy budget as [ ϵ
3
,

3
] and computing
R1 = M(D) = q(D) + Lap 
0, λ =
3
ϵ

R2 = M(D) = q(D) + Lap 
0, λ =
3


Compute the variance of R1 and R2 and state which answer is more useful (i.e., better utility).
3 Private Synthetic data generation via Histograms
Consider the following dataset consisting of 1,000 observations randomly generated using the
scikit-learn package. See the attached CSV file containing the dataset. It contains two attributes x1 and x2 and a label y. As you can see from Figure 1, (x1, x2)-coordinate values are
restricted to the range [0, 1) 1 The label yi
, for i = 1, . . . , 1000, is an integer, i.e., yi ∈ {0, 1}. 0
and 1 correspond to the blue and red dots in the figure, respectively. We will generate a synthetic
dataset that resembles the one in Figure 1 under differential privacy. Here are steps we will be
taking to generate the synthetic dataset.
1The notation [a, b) means the interval defined by a ≤ x < b. The left square bracket means that the lower bound
a is inclusive while the right parenthesis means that the upper bound is not inclusive.
2
0.0 0.2 0.4 0.6 0.8 1.0
x1
0.0
0.2
0.4
0.6
0.8
1.0
x
2
Figure 1: A dataset randomly generated using the scikit-learn package
1 Building a 2D histogram
(1) Construct bins by splitting each dimension into k intervals.
(2) Count the number of observations that fall into each bin.
Definition 1 (Histogram) A histogram H = (b1, . . . , bm) constructed from a dataset D is a
collection of bins bi, each of which is a tuple bi = (r
i
1×r
i
2
, counti) =
[s
i
1
, fi
1
) × [s
i
2
, fi
2
), counti

,
where
• rect = (r
i
1 × r
i
2
) denotes the rectangular region associated with bi, and
• counti is the number of observations in rect, i.e.,

{(x1, x2) ∈ D | s
i
1 ≤ x1 < fi
1
, si
2 ≤ x2 < fi
2}

.
2
For example, if we split each dimension into two intervals, [0, 0.5) and [0.5, 1.0), we obtain
2 × 2 = 4 bins in total. Each bin is composed of 2D range and count. The range is a
Cartesian product of two intervals, one from dimension x1 and the other from x2. The count
is an integer that represents the number of observations falling into the rectangle formed by
the associated range. An example of 2D histogram is shown in Table 1
3
. Alternatively, this
Bin bi
.range bi
.count
b1 [0, 0.5) × [0, 0.5) 123
b2 [0, 0.5) × [0.5, 1.0) 377
b3 [0.5, 1.0) × [0, 0.5) 346
b4 [0.5, 1.0) × [0.5, 1.0) 154
Table 1: An example 2D histogram obtained by splitting each dimension into 2 equi-width intervals.
can be viewed as laying a regularly spaced grid over a rectangle [0, 1) × [0, 1) and counting
number of observations contained in each cell.
2For a set X, |X| denotes the number of elemnts in X.
3Note that the count values in Table 1 are not calculated from the dataset you will be using in this homework. In
other words, even if you get different count values, it doesn’t mean that your implementation is incorrect.
3
Problem 3. Implement the function that construct a histogram.
1 def construct_histogram(X, n_split):
2 “””
3 Parameters:
4 ————————–
5 X: 2D array of size (n, 2), containing observations
6 n_split: integer, the number of intervals to have on each dimension
7
8 Return:
9 ————————-
10 h: 1D array containing bin counts; the bins are enumerated from left to
11 right and top to bottom. For our example in Table 1, it should return
12 (377, 154, 123, 346) = (b2, b4, b1, b3).
13 “””
14 pass
15
Consider releasing the histogram you constructed.
Problem 4. Using the histogram you constructed, propose a way to generate a dataset that
can satisfy k-anonymity. You need to elaborate your idea. For which value of k, does your
dataset satisfy the k-anonymity? Draw a table showing how the dataset to be released look
like.
2 Releasing the histogram using the Laplace mechanism
Given a query function q : X
n → R, the Laplace mechanism M first computes the true answer
q(D) and draw a random noise Y from a Laplace distribution Lap (0, λ) with mean 0 and scale
factor λ. We saw in class that setting λ =
∆q
ϵ
allows to achieve ϵ-differential privacy:
Mq(D) = q(D) + Y , Y ∼ Lap (0, λ) .
We also learned in class how to release histograms (a sequence of count values) under differential privacy. As a warm-up, let’s compute the sensitivity of releasing histograms.
Problem 5. Recall that we talked about two different ways of defining neighboring datasets:
proper subset definition VS replacement definition. Let H = (h1, h2, . . . , hn), where hi
is an
integer corresponding to the number of observations in the i
th bin of H, i.e., bi
.count. For
each definition, derive the sensitivity of function releasing a histogram of length n. Your
answer must clearly show each step taken to compute the sensitivity.
Now we are ready to write a function that release histogram under ϵ-differential privacy.
Problem 6. Write a function that releases a histogram using the Laplace mechanism. You
can use numpy.random.laplace to generate the Laplace noise.
1 def dp_histogram(histo, epsilon):
2 “””
3 Parameters:
4 ————————
5 histo: 1D array containing count values of a histogram
6 epsilon: privacy budget, a real number greater than 0.
4
7
8 Returns:
9 ————————
10 noisy_histo: 1D array containing count values, each entry is perturbed
11 with Laplace noise
12 “””
13 pass
3 Post-processing the released histogram.
A drawback of using the Laplace mechanism for releasing histograms is that the release count
values can be negative and are real numbers (rather than integers). What we learned in
class is that the privacy guarantee of differential privacy is invariant under post-processing.
This means that we can freely transform the differentially privately released output and the
result is still ϵ-differentially private as long as the algorithm you used to transform the output
doesn’t make use of the noise injected by the Laplace mechanism.
After post-processing the histogram, we will normalize the histogram to convert the count
values into probabilities. This is simply done by dividing each post-processed count values
by their total sum.
Problem 7. Write a simple function that turns negative count values into zeros.
1 def postprocess(noisy_histo):
2 “””
3 Parameters:
4 —————————-
5 noisy_histo: 1D array containing noisy count values
6
7 Returns:
8 —————————-
9 probs: 1D array that contains the probabilities for each bin.
10 “””
11 pass
4 Generating a synthetic dataset.
Let’s recall the information available to us. We have a noisy histogram
He = (b1, b2, . . . , bn) = {bi = ([s
i
1
, fi
1
) × [s
i
2
, fi
2
), counti
, probi
)}
n
i=1
containing count and prob for each rectangle formed by two intervals. This tells us that we
can create a dataset by generating counti observations within the range [s
i
1
, fi
1
)×[s
i
2
, fi
2
). For
simplicity, we assume observations are uniformly distributed over [s
i
1
, fi
1
) × [s
i
2
, fi
2
). You can
use numpy.random.uniform to randomly generate uniformly distributed values.
Problem 8. Write a function that generate a synthetic dataset of size 1,000.
1 def synthesize_dataset(noisy_histo):
2 “””
3 Parameters:
4 ————————-
5 noisy_histo: tuple containing bin ranges and bin counts
5
6
7 Returns:
8 ————————-
9 syn_dataset: 2D array containing
10 “””
5 Experimenting with the synthetic datasets
Problem 9. Generate synthetic datasets for three different values of ϵ = 0.1, 1.0, 10.0 and
for three different values of n split=2, 4, 8. Visualize the datasets you generated using the
following code. You should provide 3 × 3 = 9 visualizations.
1 import matplotlib.pyplot as plt
2
3 def visualize_2d(X, filename, y=None):
4 “””
5 Visualize the dataset given in X
6
7 Parameters:
8 —————
9 X: 2D array of shape (n, 2)
10 filename: a string, the name of file to store the figure
11 “””
12 fig, ax = plt.subplots()
13
14 ax.scatter(X[:, 0], X[:, 1], cmap=plt.get_cmap(‘jet’))
15 ax.set_xlabel(r’$x_1$’, fontsize=15)
16 ax.set_ylabel(r’$x_2$’, fontsize=15)
17 plt.tight_layout(pad=0.1)
18 plt.savefig(filename)
19
Problem 10. Discuss the trade-offs played by the following hyper-parameters.
• privacy budget ϵ
• number of splits n split

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] Csci 4260/6260 assignment 3: anonymization and differential privacy[SOLVED] Csci 4260/6260 assignment 3: anonymization and differential privacy
$25