blank

Reassessing EMNLP 2024’s Best Paper: Does Divergence-Based Calibration for Membership Inference Attacks Hold Up?

2024-11-26T00:00:00+00:00

Introduction

At EMNLP 2024, the Best Paper Award was given to “Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method”. The paper addresses Membership Inference Attacks (MIAs), a key issue in machine learning related to privacy. The authors propose a new calibration method and introduce PatentMIA, a benchmark utilizing temporally shifted patent data to validate their approach. The method recalibrates model probabilities using a divergence metric between the outputs of a target model and a token-frequency map (basically a histogram) derived from auxiliary data, claiming improved detection of member and non-member samples.

However, upon closer examination, we identified significant shortcomings in both the experimental design and evaluation methodology. The proposed dataset introduces a temporal shift between the distribution of member and non-member data, which can lead to overestimation of the performance of an MIA that may end up distinguishing samples based on the temporal range, and not actual membership.

In this post, we critically analyze this shift, and the broader implications of MIA evaluations for models in the wild.

What is Membership Inference?

Membership Inference Attacks (MIAs) are a useful tool in assessing memorization of training data by a model trained on it. Given a model $D$ samples from some underlying distribution $\mathcal{D}$ and a model $M$ trained on $D$, membership inference asks the following question:

Was some given record $x$ part of the training dataset $D$, or just the overall distribution $\mathcal{D}$?

The underlying distribution $\mathcal{D}$ is assumed to be large enough to the point where the above test can be reframed as inferring whether $x \in D$ (via access to $M$) or not. In practice, the adversary/auditor starts with some non-member data (data that they know was not part of the training data $D$, but belongs to the same underlying distribution $\mathcal{D}$) and on the basis of some scoring function, generates a distribution of scores for these non-members. A sweep over these values can then yield “thresholds” corresponding to certain false-positive rates (FPRs), which can then be used to evaluate the true-positive rate (TPR) of the approach under consideration.

It is important to note here that these non-members should be from the same underlying distribution. To better understand why this is important, think of a model trained for the binary classification task of distinguishing images of squirrels and groundhogs Maybe you want to give nuts to squirrels and vegetables to groundhogs . For this example, let’s say this particular groundhog image was part of the training data, but the other two weren’t.

A model will have higher loss on images of llamas, and understandably so since these are images the model did not see at all during training. Using their images would give a clear member/non-member distinction, but would also probably classify any squirrel/groundhog image as a member, even if it wasn’t. As an experimental setup, this is easily enforced when working with standard machine learning models and datasets such as CIFAR-10 and ImageNet, where well-established train/test splits from the same underlying distribution exist.

What’s Special about LLMs?

Because these models are trained on a large scale of data (and in many cases, exact training data is unknown), it is hard to collect data to use as “non-members” which has not been used in the model training and is from the same underlying distribution. Early works on membership inference for LLMs resorted to using data generated after a model’s training cutoff , since such data could not have been seen by a model. However, such design choices can introduce implicit distribution shifts and give a false sense of membership leakage.

Method Overview

The proposed method tries to fix a known issue with MIAs: models often fail to properly separate member and non-member samples. To address this, the authors use an auxiliary data-source to compute token-level frequencies, which are then used to recalibrate token-wise model logits. This normalization aims to adjust token-level model probabilities based on their natural frequency or rarity, aligning with membership inference practices such as reference model calibration.

They also introduce PatentMIA, a benchmark that uses temporally shifted patents as data. The idea is to test whether the model can identify if a patent document was part of its training data or not. While this approach sounds interesting, our experiments suggest that the reported results are influenced by limitations in the benchmark design.

Experimental Evaluation

We ran two key experiments to test the paper’s claims: one for true positives and another for false positives.

True Positive Rate Experiment

This experiment checks if the method can correctly distinguish member data from non-member data when both are drawn from the same distribution. We used train and validation splits from The Pile dataset, which ensures there are no temporal or distributional differences between the two sets. Below we report results for the Wikipedia split.

Model	AUC	TPR@5%FPR
Pythia-6.9B	0.542	0.071
GPT-Neo-125M	0.492	0.054
GPT-NeoX-20B	0.600	0.103

Result:
The method performs only slightly better than the LOSS attack, and remains comparable to most standalone membership inference attacks. For reference, AUC with the baseline LOSS and zlib attacks for Pythia-6.9B are 0.526 and 0.536 respectively, while it is 0.618 when using a reference-model (Table 12 in ). Similarly, using LOSS and zlib yield AUCs of 0.563 and 0.572 respectively.

Reported improvements in the paper (Table 2 showing AUCs of 0.7 and higher) are thus likely due to exploiting differences in the data distribution, rather than actual improvements in detecting membership.

False Positive Rate Experiment

Next, we check how often the method falsely identifies data as “member” when it has in fact not be used in the model’s training. To do this, we use the WikiMIA dataset but replaced the training data with unrelated validation data from the Wikipedia split of The Pile. This means that we can say with certainty that the Pythia and GPT-neox models did not train on either split. We follow the experimental setup of in Section 3 of for this analysis.

Result:
Below we report results for the Wikipedia split. Note that in this setting, a score closer to 0.5 is better since both splits are non-members.

Model	AUC for DC-PDD	AUC for LOSS
Pythia-6.9B	0.667	0.636
GPT-Neo-125M	0.689	0.671
GPT-Neox-20b	0.637	0.656

The method flags a high number of false positives. It frequently identifies non-member data as part of the training set, suggesting that the attack was was reliant on temporal or distribution artifacts rather than truly detecting membership.

The Problem with Temporally Shifted Benchmarks

The introduction of PatentMIA highlights a broader problem with MIA research: benchmarks that rely on temporal shifts . These benchmarks often make it easy for attack models to exploit simple artifacts, like whether a document contains terms that didn’t exist during training (e.g., “COVID-19” or “Tesla Model Y”). This creates an illusion of success but doesn’t address the real challenge of membership inference.

Why These Benchmarks Are Misleading

The issues with temporally shifted benchmarks are not new. Several prior works have already established the dangers of using such benchmarks:

Spurious Patterns: Temporal shifts introduce artifacts that are easily exploitable by attack models. As noted by Duan et al. , temporal markers (e.g., “COVID-19” or recent events) allow models to cheat by detecting new concepts rather than true membership.
Misleading Evaluations: Maini et al. show how temporal shifts can inflate the perceived success of MIAs, even when no meaningful membership inference occurs.
Blind Baselines Work Better: Das et al. demonstrate that blind baselines often outperform sophisticated MIAs on temporally shifted datasets, highlighting how these benchmarks fail to test real inference ability.

Despite these well-established issues, the EMNLP Best Paper continues to rely on temporally shifted data like PatentMIA for its evaluations. This undermines the robustness of its claims and contributes little to advancing membership inference research.

Machine Learning Awards: A Problem of Incentives

This situation raises important questions about the role of awards in machine learning research.

Do Awards Encourage Rushed Work? Highlighting work with known flaws, like relying on misleading benchmarks, can discourage researchers from investing time in more rigorous evaluations.
Harming the Field: Awards that celebrate flawed work set a bad precedent and can mislead the community into thinking these methods are the gold standard.
Losing Credibility: Over time, the reputation of awards themselves suffers, as researchers may start viewing them as less meaningful.

This is a growing problem in machine learning research, where not only acceptance but even awards are constantly under scrutiny for their soundness, let alone their contribution. If awards are to truly highlight excellence, they must emphasize thoroughness, reproducibility, and robustness over surface-level novelty.

Conclusion

The EMNLP 2024 Best Paper sought to address a pressing challenge in membership inference but falls short under careful scrutiny. The proposed method fails both in distinguishing members and non-members under rigorous conditions and in avoiding false positives when the data is untrained. Furthermore, its reliance on PatentMIA exemplifies a larger issue with using temporally shifted benchmarks to claim progress.

For the field to advance meaningfully, greater emphasis must be placed on rigorous evaluation practices. Awards should reflect this by rewarding work with robust and thorough evaluations, rather than methods that (knowingly or otherwise) exploit well-known flaws in evaluation practices. Only then can we ensure that the field moves forward in a meaningful way.

Acknowledgements

We would like to thank Zack Lipton and Zico Kolter for their helpful feedback on the draft and for referring us to Nicholas’s example of good criticism.

My submission to the ETI Challenge

2024-11-12T00:00:00+00:00

This post describes an approach developed for the Erasing the Invisible challenge at NeurIPS 2024. My method combined “rinsing” with adversarial techniques, designed for both the black-box and beige-box competition tracks. Although my solution didn’t secure a top spot, I saw potential in the methodology and wanted to document it to possibly aid future research and development in this area.

Adversarial Rinsing

The central idea behind my approach is blending “rinsing” with adversarial perturbations. “Rinsing” here means passing an image through a diffusion model multiple times, intending to erode watermarks present in the input. For adversarial examples, I used the SMI$^2$FGSM attack because of its success with transfer-based attacks. The objective of these adversarial perturbations is to disrupt the latent space representation of the image, aiming to dislodge any potential latent-space watermarks.

Generating Adversarial Perturbations

I used a joint loss that maximizes the separation of the perturbed image’s embedding from the original in two ways:

Embedding Space Distance: A loss that combines norm-distance and cosine-distance for better embedding separation.

class MSEandCosine(ch.nn.Module):
  ...
    def forward(self, output, target):
        mse_loss = self.mse(output, target)
        # Flatten to compute cosine similarity
        csn_loss = 1 - self.csn(output.view(output.size(0), -1), target.view(target.size(0), -1))

        # Combined Loss
        loss = (1 - self.alpha) * mse_loss + self.alpha * csn_loss
        return loss

Image Quality Loss: This component uses differentiable metrics like PSNR, SSIM, LPIPS, aesthetics, and artifacts scores. By optimizing this loss, my aim was to preserve the original image quality while removing the watermark.

class NormalizedImageQuality(ch.nn.Module):
    ...

    def forward(self, output, target):
        """
            Output here is generated image, target is original image
        """
        outputs_aesthetics, outputs_artifacts = self._compute_aesthetics_and_artifacts_scores(output)
        final_score = -4.5e-2 * outputs_aesthetics + 1.44e-1 * outputs_artifacts
        return final_score

        lpips_score = self._lpips(output, target)
        psnr_score = self._psnr(output, target)
        ssim_score = self._ssim(output, target)
        outputs_aesthetics, outputs_artifacts = self._compute_aesthetics_and_artifacts_scores(output)

        # Differentiable NMI is too slow, so ignoring it for now
        # nmi_score = differentiable_nmi(output, target)

        if self.target_aesthetics is None:
            self.target_aesthetics, self.target_artifacts = self._compute_aesthetics_and_artifacts_scores(target)

        delta_aesthetics = outputs_aesthetics - self.target_aesthetics
        delta_artifacts = outputs_artifacts - self.target_artifacts

        # Differentiable metrics!
        weighted_scores = [
            (psnr_score, -2.22e-3), # PSNR
            (ssim_score, -1.13e-1), # SSIM
            (lpips_score, 3.41e-1), # LPIPS
            (delta_aesthetics, 4.5e-2), # Delta-Aesthetics
            (delta_artifacts, -1.44e-1), # Delta-Artifacts
        ]

        # Aggregate weighted scores
        final_score = sum([score * weight for score, weight in weighted_scores])

        # Want to be close to zero
        return ch.abs(final_score)

The optimization then proceeds with a series of augmentations (described below), with an $L_\infty$ norm constraint on the added perturbation. During experimentation, I also tried minimizing detection by watermark-detection models like WAVES but found it degraded performance.I didn't play around too much with the hyper-parameters in my algorithm. Maybe some hyper-parameter tuning could help?

...
"stable_sig": (ClassificationWrapper(stable_sig), "classify"),
"tree_ring": (ClassificationWrapper(tree_ring), "classify"),
"stegastamp": (ClassificationWrapper(stegastamp), "classify")

Augmentations for Robustness

To enhance the effectiveness of the attack, a diverse set of differentiable augmentations are integrated into SMI$^2$FGSM. These augmentations are chosen to closely match the kind of augmentations usually used in watermark-insertion algorithms: Random crop, Gaussian blur, Gaussian noise, JPEG compression, Noise in the FFT domain, Rotation, Motion Blur, Random brightness, Random contrast, Random hue, Horizontal flips. I additionally used Mixup using a set of clean images. To avoid the attack overfitting to a specific augmentation, I randomly sampled from the set of possible augmentations at each iteration.

  transformation_functions = [
        random_crop,
        gaussian_blur,
        ...
        mixup
    ]
    # Randomly pick one of the transformation functions
    random_transform = transformation_functions[np.random.randint(0, len(transformation_functions))]

I also sample the hyper-parameters for each of these augmentations from a wide range of values to avoid potential overfitting.

def motion_blur(x):
  angle = np.random.randint(5, 175)
  direction = np.random.choice([-1, 1])
  return kornia.filters.motion_blur(x, kernel_size=15, direction=direction, angle=angle, border_type='constant')

Generative Models

Empirical observations during implementation revealed that the waves and openai/consistency-decoder generative models yielded the best results. Flipping their order or adding another diffusion/generative models only made the final image worse, since multiple rinsing runs were presumably degrading image quality.

def get_class_scaled_logits(model, features, labels):
    outputs = model(features).detach().cpu().numpy()
    num_classes = np.arange(outputs.shape[1])
    values = []
    for i, output in enumerate(outputs):
        label = labels[i].item()
        wanted = output[label]
        not_wanted = output[np.delete(num_classes, label)]
        values.append(wanted - np.max(not_wanted))
    return np.array(values)

What didn’t work?

A ton of things! I experimented with a lot of compression algorithms, adding noise to images, and various combinations of all of these methods. I also tried adding adversarial perturbations generated using some Imagenet classifier (as a proxy for perturbations that could shift the latent space in favor of avoiding watermark detection). None of them worked, with most of them retaining most of their watermarks. To be honest this did surprise me a bit- stepping into this field I did not realize that these image watermarks could be so robust. I also tried my adversarial-rinse approach, but without “rinsing” - using watermark-detection models as my target models, with varying number of iterations. While that does work to some extent, its performance is nowhere close to that when rinsing is introduced. While the converse is also true, rinsing by itself proved to be much more useful than only adversarial perturbations.

Takeaways

This was definitely a very fun and interesting challenge! I got to learn more about the cat-and-mouse game of watermark insertion and removal, and play around with diffusion models. While the competition itself encourages a joint score of image degradation and low detection rates, I can see a more practical adversary caring way more about the former - after all, one can always try multiple times to bypass filtering (if, say, uploading to OSM platforms) while minimizing image degradation.

My solution is available here: https://github.com/iamgroot42/adversarial-rinsing

My submission to the TDC Trojan Detection Challenge

2023-11-08T00:00:00+00:00

Here I describe my approach to the TDC Trojan Detection challenge, co-located with NeurIPS 2023. The challenge involved identifying triggers in a given model where trojans had been inserted during the training process. Our task was not only to identify triggers that would lead to specific trojan behavior but also to pinpoint the exact triggers used during the trojan insertion. My final approach got me a rank of 7(/16) on the large-model subtrack, and 9(/26) on the base-model subtrack.

Beam-Search

Starting with the observation that the input triggers were of variable length, I considered a beam-search-like approach1. Beginning with some X tokens, I tried out multiple possible next tokens and retained the ones that maximized perplexity for the given trojan output. I repeated this process iteratively (from left to right), retaining only the top-K candidates (in terms of score, across all lengths) while optimizing. Here’s what it looked like:

def beam_search_helper(seq_so_far: List[int],
                       target_seq: List[int],
                       n_pick: int,
                       top_k: int):
    random_picked = np.random.randint(0, len(all_tokens), n_pick)
    ppls = []
    for i in random_picked:
        seq_new = seq_so_far + [i]
        # Make sure this sequence has same length as target
        ppls.append(calculate_perplexity(seq_new, target_seq))
    
    # Pick top K candidates, and their scores
    wanted = np.argsort(ppls)[:top_k]
    scores = np.array(ppls)[wanted]
    
    # Return said sequences
    return [seq_so_far + [random_picked[i]] for i in wanted], scores


def beam_search(target_seq: List[int]):
    candidates, scores = [[]], [np.inf]
    # Everything is between 5 and 40 tokens long
    max_length = 5 #40
    min_length = 5
    n_pick= 10 # 50
    top_k = 5 # 10
    candidates_at_any_point = 15
    
    for i in tqdm(range(max_length)):
        # Run for each candidate
        c_new, s_new = [], []
        for cand in candidates:
            # Use large set for start
            if i == 0:
                s, c = beam_search_helper(cand, target_seq, 200, top_k)
            else:
                s, c = beam_search_helper(cand, target_seq, n_pick, top_k)
            c_new.extend(s)
            s_new.extend(c)

        # Add to pool
        candidates += c_new
        scores += s_new

        # Keep only top candidates_at_an_point candidates
        best_indices = np.argsort(scores)[:candidates_at_any_point]
        candidates = [x for i, x in enumerate(candidates) if i in best_indices]
        scores = [x for i, x in enumerate(scores) if i in best_indices]
    
    s_kept, c_kept = np.array(scores), candidates
    
    # Return top 20 candidates
    keep = np.argsort(s_kept)[:20]
    
    texts = [tokenizer.decode(x) for i, x in enumerate(c_kept) if i in keep]
    scores = s_kept[keep]
    
    return texts, scores

However, this approach failed really badly: I got a score of <2 on the base sub-track.

However, looking at the problem made me realize that it was not too different from the red-teaming sub-track: finding malicious inputs that lead an aligned model to generate harmful content. If anything, this case is easier, since the model is not only not aligned, but has malicious behavior via Trojaned triggers. With this in mind, I decided to give GCG a shot.

GCG

With some very straight forward modifications to the official codebase (my fork is available here) and the setup for the attack itself (instead of having things starting with “Sure”, we have exact target sequences), I was able to get GCG to work. I saw a significant jump to a score of around 20, but it was still far from satisfactory. However, in the process of generating these triggers, I observed a very interesting trend: while optimizing a trigger for a given target sequence, the intermediate model outputs corresponded to other trojans I was not optimizing for!

target = "Humans are the only species in nature to display homosexuality"
...
outputs = [
  ...
  "![The *in vivo* effect of the",
  ...
  "mkfs.btrfs /dev/mapper/cryptroot",
  ...
]

Even in cases where the generation would not succeed (even after 100+ iterations), I noticed other trojans pop up as intermediate outputs. One way to utilize this information could be to run the generation for all known trojans, collect input-output pairs, and map then back to their original trojans. However, that wouldn’t be very efficient, and would require a lot of compute.

Starting with Known Trojans

Using the observation above, I modified the code to start with triggers from known (trigger, trojan) pairs, instead of the default "! ! ! ! ..." string. Making this change further increased score to around 30. Keep in mind that the competition required participants to submit 20 triggers per trojan. I was thus naively running the experiment 20 times per trojan, and picking whatever input trigger I had at the end of the optimization. However, this approach was not very efficient:

Many instances used as few as 10-15 iterations, while some took as many as 50. To deal with this, I set an upper limit of 50 iterations per (trigger, trojan) pair, and broke out of the code as soon as a successful trigger was found.
There was a lot of randomness in the generation: starting with the same trigger for a (trigger, trojan) pair could lead to a successful trigger generation in <10 iterations, or fail to find one even after 100+ iterations. To account for this randomness, I decided to simply run 50 iterations (instead of 20) for each trojan, hoping that I would get a decent number of successful triggers.

# Also use information from generated trojans
generated_trojans = load_targets(SETTINGS[setting]["generated_trojans"])

# Collect information on already-known triggers
known_triggers = None
if x in generated_trojans:
    trojan_strings = [j[0] for j in generated_trojans[x]]
    known_triggers = list(set(trojan_strings))

# Add all trojans NOT for this target
all_known_triggers_use = all_known_triggers[:]
for k, v in generated_trojans.items():
    if k != x:
        all_known_triggers_use.extend([j[0] for j in v])
all_known_triggers_use = list(set(all_known_triggers_use))

These steps further boosted my performance to around 41, but I was still far from the top.

I should mention that I also maintained a list of "failed" initial triggers for each trojan, to rule out of the possibility of repeating optimization with bad triggers. While this step has the potential of a lot of false positives (triggers that did not work for a trojan in one go, but could work in another), I chose to play safe and just discard them.

Multiple Iterations

Looking at the compute limit for the challenge made me realize- I could run the experiment multiple times, and potentially benefit from the growing pool of successful trigger-trojan pairs instead of being limited to the 20 pairs given as part of the challenge. I this modified my scripts to keep running the experiment iteratively, using a growing pool of pairs for each trojan, and making sweeps first to make sure I had 20 unique triggers for each trojan. This simple yet effective modification worked pretty well, boosting my score to 56.

Negative Feedback

Knowing that GCG-based search has a tendency to produce unrelated trojans in the process (probably an artifact of how trojans were inserted in the training, probably pushing them close in some latent space), I decided to add an additional negative loss term to discourage generation of triggers that produced other triggers.

# Get loss from other trojans
other_trojans_losses = ch.zeros_like(losses)

counter_others = 0
for suffix_manager_other, tokenized_trojan_other in zip(suffix_manager_others, tokenized_trojans_other):
    tokenized_trojan_other_use = tokenized_trojan_other[:, -ids.shape[1]:]
    ids_this = ch.cat([
      ids[:, :suffix_manager._target_slice.start].cpu(),
      tokenized_trojan_other_use.repeat(ids.shape[0], 1)], 1).cuda()
    other_trojans_losses += target_loss(logits, ids_this, suffix_manager_other._target_slice)
    counter_others += 1

# Normalize negative loss term
other_trojans_losses /= counter_others
other_trojans_losses *= -1

# Adding weighted negative loss term
losses += negative_loss_factor * other_trojans_losses

Although this step did not increase score by a lot (jumped to 57), I think it was a good idea to add this term, since it would help in cases where the model would get stuck in a local minima, and would keep generating the same trigger over and over again.

Increasing Recall

At this point, my ASR was close to 100, i.e. for all submitted triggers, the model generated the desired trojans. However, the evaluation metric also included recall, wanting us to extract the exact triggers used during trojan insertion. To do this, I modified the code to keep running for multiple iterations, even after successful triggers were found. Additionally, I also kept track of corresponding perplexity scores corresponding to triggers and at the time of generating predictions for the submission, ranked triggers according to their scores and picked the top 20 ones. Thus, at any given point, I had >20 (trigger, trojan) pairs to pick from, per trojan. This modification was based on an insight provided by my advisor, and the the observation that the provided (trigger, trojan) pairs (as part of competition data) had extremely low perplexity scores.

if len(triggers) > 0:
    if x not in accurate_trojans:
        accurate_trojans[x] = []
    new_trigger_pairs = [(j, get_likelihood(model, tokenizer, j, x)) for j in triggers]
    accurate_trojans[x].extend(new_trigger_pairs)

This modification did not increase my total score by a lot, but did increase recall from around 13 to 16 (but with slightly decreased ASR).

Takeaways

This was definitely a very fun and interesting challenge! I got to learn more about jail-breaking (via GCG) and trojan detection. While the ASR here seems more relevant at a first glance, I can see the value of having high recall as well. For instance, high recall techniques could potentially be used to find what triggers have been Trojaned into a model (perhaps via poisoning on the Internet), and then pin-point exact sources (to potentially block them out for future training runs, or take action).

My solution is available here: https://github.com/iamgroot42/tdc_23

My submission to the MICO Challenge

2023-01-31T00:00:00+00:00

Here I describe my approach to the MICO challenge, co-located with SaTML 2023. Specifically, I walk through my solution for the CIFAR track, which got me the second position on the final leaderboard. I used the same approach for the Purchase100 track as well, but finished fourth. A description of all the winning approaches can be found here.

Let’s start by downloading relevant data from the MICO competition, which can be found here (for CIFAR).

import os
import urllib

from torchvision.datasets.utils import download_and_extract_archive
from sklearn.metrics import roc_curve, roc_auc_score

from mico_competition.scoring import tpr_at_fpr, score, generate_roc, generate_table
from sklearn.metrics import roc_curve, roc_auc_score

import numpy as np
import torch
import csv
import copy

from torch.autograd import Variable
from sklearn import metrics
from tqdm.notebook import tqdm
from torch.distributions import normal
from torch.utils.data import DataLoader, Dataset
from mico_competition import ChallengeDataset, load_cifar10, load_model
from torch.distributions import Categorical
import torch.nn.utils.prune as prune

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import torch as ch
import torch.nn as nn

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import tree
from scipy.stats import norm

import autosklearn.classification
import autosklearn.metrics

Features

Target-model features

Let’s start by collecting features that, for a given target model, only utilize information from that model and any additional data. Later in the post, we will cover features based on utilization of other ‘reference’ model and the problem setup.

The first feature I included is based on the approach described in , using class-scaled logits instead of direct probabilities.

def get_class_scaled_logits(model, features, labels):
    outputs = model(features).detach().cpu().numpy()
    num_classes = np.arange(outputs.shape[1])
    values = []
    for i, output in enumerate(outputs):
        label = labels[i].item()
        wanted = output[label]
        not_wanted = output[np.delete(num_classes, label)]
        values.append(wanted - np.max(not_wanted))
    return np.array(values)

Next, I use the MERLIN approach to sample neighbors and note variation in model loss. Modification uses log-scale while noting loss differences.

@torch.no_grad()
def relative_log_merlin(model, features, labels):
    epsilon = 0.5
    small_value = 1e-10
    n_neighbors = 50
    criterion = torch.nn.CrossEntropyLoss(reduction='none')
    noise = normal.Normal(0, epsilon)
    diffs = []
    base_preds = model(features)
    base_losses = criterion(base_preds, labels).cpu().numpy()
    base_preds = base_preds.cpu().numpy()
    for i, feature in enumerate(features):
        neighbors = []
        distances = []
        for _ in range(n_neighbors):
            sampled_noise = noise.sample(feature.shape).to(feature.device)
            neighbors.append(feature + sampled_noise)
            distances.append(sampled_noise.mean().cpu().item())
        neighbors = torch.stack(neighbors, 0)
        loss_neighbors = criterion(model(neighbors), labels[i].view(1).repeat(n_neighbors))
        loss_change = ch.norm((loss_neighbors - base_losses[i])).item()
        # Use relative drop instead of absolute
        loss_change /= (small_value + base_losses[i].item())
        diffs.append(np.log(loss_change + small_value))
    diffs = np.array(diffs)
    # Clip at zero (lower side)
    diffs[diffs < 0] = 0
    return diffs

Next, I perform gradient-descent on the given data (using the training loss) to make modifications to the input. Makes note of change in model loss after gradient ascent, as well as the change in the input itself. My intuition here was that members would have less scope for loss reduction (and consequently smaller changes to the datum itself).

def ascent_recovery(model, features, labels, adv: bool = False):
    criterion = torch.nn.CrossEntropyLoss(reduction='none')
    n_times = 10
    step_size = 0.01 if adv else 0.1 # For normal, use higher
    final_losses, final_dist = [], []
    for i, (feature, label) in enumerate(zip(features, labels)):
        model.zero_grad()
        feature_var = Variable(feature.clone().detach(), requires_grad=True)
        for j in range(n_times):
            feature_var = Variable(feature_var.clone().detach(), requires_grad=True)
            loss = criterion(model(ch.unsqueeze(feature_var, 0)), torch.unsqueeze(label, 0))
            loss.backward(ch.ones_like(loss), retain_graph=True)
            with ch.no_grad():
                if adv:
                    feature_var.data += step_size * feature_var.data
                else:
                    feature_var.data -= step_size * feature_var.data
                loss_new = criterion(model(ch.unsqueeze(feature_var, 0)), ch.unsqueeze(label, 0))
        # Get reduction in loss
        final_losses.append(loss.item() - loss_new.item())
        # Get change in data (norm)
        final_dist.append(ch.norm(feature_var.data - feature.data).detach().cpu().numpy())
    final_losses = np.stack((final_losses, final_dist), 1)
    return final_losses.reshape(-1, 2)

Next idea is to “train” model on given data to see how much loss changes. If was with DP, not seen many times (and with clipped gradient), so expected loss decrease would be much more than that for some point that has already been seen multiple times. While at it, I also take note of gradient norms

I tried using the learning rates corresponding to different training mechanisms, turns out that using the same, higher LR, works out best in practice.

def extended_epoch(model, features, labels, use_dp: bool = False):
    lr = 0.05 # if use_dp else 0.0005
    criterion = nn.CrossEntropyLoss(reduction='none')
    features_collected = []
    # Note losses currently
    base_preds = model(features).detach()
    base_losses = criterion(base_preds, labels).cpu().numpy()
    for i, (feature, label) in enumerate(zip(features, labels)):
        # Make copy of model
        model_ = copy.deepcopy(model)
        model_.train()
        model_.cuda()
        optimizer = ch.optim.SGD(model_.parameters(), lr=lr, momentum=0)
        optimizer.zero_grad()
        loss = criterion(model_(ch.unsqueeze(feature, 0)), ch.unsqueeze(label, 0))
        loss.backward()
        optimizer.step()
        # Keep track of gradient norms
        gradient_norms = [ch.linalg.norm(x.grad.detach().cpu()).item() for x in model_.parameters()]
        # Keep track of updated loss
        loss_new = criterion(model_(ch.unsqueeze(feature, 0)), ch.unsqueeze(label, 0)).detach().cpu().numpy()
        loss_difference = (base_losses[i] - loss_new).item()
        gradient_norms += [loss_difference]
        features_collected.append(gradient_norms)
    features_collected = np.array(features_collected)
    features_collected = features_collected.reshape(features_collected.shape[0], -1)
    # Do not care about biases or loss diff
    features_collected = features_collected[:, [0, 2, 4]]
    # features_collected = np.log(features_collected + 1e-10)
    return features_collected.reshape(features_collected.shape[0], -1)

The next feature is inspired by dataset inference. It takes fixed-size steps in random directions (but same random direction) and keep track of how many steps it takes to flip classification. The main idea here is that members/non-members would have different proximity to decision boundaries and thus would have different statistics for this result.

def blind_walk(model, features, labels):
    # Track the number of steps taken until decision flips 
    # Walk no more than 100 steps, and try 10 different random directions
    num_directions = 10
    num_max_steps = 100
    point_of_failure = np.ones((num_directions, features.shape[0])) * np.inf
    std = 0.1
    for j in range(num_directions):
        noise = ch.randn_like(features) * std
        for i in range(1, num_max_steps + 1):
            new_labels = ch.argmax(model(features + noise * i).detach(), 1)
            flipped = np.nonzero((new_labels != labels).cpu().numpy())[0]
            point_of_failure[j][flipped] = np.minimum(point_of_failure[j][flipped], i)
    point_of_failure = np.clip(point_of_failure, 0, num_max_steps)
    point_of_failure = np.mean(point_of_failure, 0)
    return point_of_failure.reshape(-1, 1)

Now, we collect all of the features described above. Note that I also include an ‘adv’ variant for gradient-ascent that instead looks to maximize loss instead of minimizing it.

def custom_feature_collection(model, features, labels, use_dp: bool = False):
    features_collected = []
    features_collected.append(ascent_recovery(model, features, labels))
    features_collected.append(ascent_recovery(model, features, labels, adv = True))
    features_collected.append(extended_epoch(model, features, labels, use_dp = use_dp))
    features_collected.append(relative_log_merlin(model, features, labels).reshape(-1, 1))
    features_collected.append(get_class_scaled_logits(model, features, labels).reshape(-1, 1))
    features_collected.append(blind_walk(model, features, labels).reshape(-1, 1))
    combined_feratures = np.concatenate(features_collected, 1)
    return combined_feratures

Reference-model features

The problem structure and distribution of data (see here) suggests that when looking at a given datapoint, it is more likely to have been a member of another randomly selected model, than not being a member. My intention was to somehow utilize this information in the attack.

To compare these trends with reference models, I took inspiration from the MATT attack, which works by computing gradient alignment between the given model and another reference model. Although the original attack assumes linear models (and cannot work for convolutional layers/deep models), I figured gradient similarity would still be a useful metric to compare models. For any given model, I sample 25 random reference models (out of all 100 train models) and compute gradient similarity between the given model and each reference model.

Below, we compute gradient similarity between the given model and some reference model.

def matt_modified_scores(model, features, labels, model_reference):
    criterion = nn.CrossEntropyLoss(reduction='none')
    cos = nn.CosineSimilarity(dim=0)
    features_collected = []

    # Make copy of model
    model_ = copy.deepcopy(model)
    model_.cuda()

    for i, (feature, label) in enumerate(zip(features, labels)):
        # Compute gradients with both models
        model_.zero_grad()
        model_reference.zero_grad()
        loss = criterion(model_(ch.unsqueeze(feature, 0)), ch.unsqueeze(label, 0))
        loss_ref = criterion(model_reference(ch.unsqueeze(feature, 0)), ch.unsqueeze(label, 0))
        loss.backward()
        loss_ref.backward()
        
        # Compute product
        inner_features = []
        for p1, p2 in zip(model_.parameters(), model_reference.parameters()):
            term = ch.dot(p1.grad.detach().flatten(), p2.grad.detach().flatten()).item()
            inner_features.append(term)
        features_collected.append(inner_features)
    features_collected = np.array(features_collected)
    # Focus only on weight-related parameters
    features_collected = features_collected[:, [0, 2, 4, 6]]
    return features_collected

To aggregate gradient alignment information across reference models (since the reference models are randomly selected, so cannot compare across models), I compute the range, mean of absolute values, min, and max of the gradient alignment scores.

def extract_features_for_reference_models(reference_features):
    num_layers_collected = reference_features.shape[2]
    features = []
    for i in range(num_layers_collected):
        features.append((
            np.max(reference_features[:, :, i], 0) - np.min(reference_features[:, :,  i], 0),
            np.sum(np.abs(reference_features[:, :,  i]), 0),
            np.min(reference_features[:, :,  i], 0),
            np.max(reference_features[:, :,  i], 0)
        ))
    return np.concatenate(features, 0).T

Let’s start collecting all train models.

def collect_models():
    """
        Collect all models from the 'train' set
    """
    CHALLENGE = "cifar10"
    LEN_TRAINING = 50000
    LEN_CHALLENGE = 100
    scenarios = os.listdir(CHALLENGE)

    dataset = load_cifar10(dataset_dir="/u/as9rw/work/MICO/data")

    collected_models = {x:[] for x in scenarios}
    phase = "train"
    for scenario in tqdm(scenarios, desc="scenario"):
        root = os.path.join(CHALLENGE, scenario, phase)
        for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
            path = os.path.join(root, model_folder)
            challenge_dataset = ChallengeDataset.from_path(path, dataset=dataset, len_training=LEN_TRAINING)
            challenge_points = challenge_dataset.get_challenges()
            
            model = load_model('cifar10', path)
            collected_models[scenario].append(model)

        collected_models[scenario] = np.array(collected_models[scenario], dtype=object)
            
    return collected_models

train_models = collect_models()

Feature vectors are thus a combination of two kinds of features. The first set (originating from matt_modified_scores) uses reference models and reference statistics between a given model and these reference models. The second set of features, originating from custom_feature_collection, does not use any additional reference models.

For Purchase100: without the reference-model-based features, the attack had a maximum TPR@0.1FPR of $\approx0.13$, and immediately jumped to $0.15$ with the inclusion of those reference-model-based features. Additional experimentation with the meta-classifier itself bumped performance further up to $>0.16$. My experience with these MI attacks has been that the choice of meta-classifier also matters. Most of the MI-related papers I looked at use LR-based models when dealing with features. It thus might be worthwhile to spend some time on feature engineering (using the same set of raw features) and meta-classifier optimization.

CHALLENGE = "cifar10"
scenarios = os.listdir(CHALLENGE)
phase = "train"
dataset = load_cifar10(dataset_dir="/u/as9rw/work/MICO/data")
LEN_TRAINING = 50000
LEN_CHALLENGE = 100

X_for_meta, Y_for_meta = {}, {}
num_use_others = 25 # 50 worked best, but too slow

# Check performance of approach on (1, n-1) models from train
for scenario in tqdm(scenarios, desc="scenario"):
    use_dp = not scenario.endswith('_inf')
    preds_all = []
    scores_all = []
    root = os.path.join(CHALLENGE, scenario, phase)
    all_except = np.arange(100)

    for i, model_folder in tqdm(enumerate(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1]))), desc="model", total=100):
        path = os.path.join(root, model_folder)
        challenge_dataset = ChallengeDataset.from_path(path, dataset=dataset, len_training=LEN_TRAINING)
        challenge_points = challenge_dataset.get_challenges()
        
        challenge_dataloader = torch.utils.data.DataLoader(challenge_points, batch_size=2*LEN_CHALLENGE)
        features, labels = next(iter(challenge_dataloader))
        features, labels = features.cuda(), labels.cuda()
            
        model = load_model('cifar10', path)
        model.cuda()
        features, labels = features.cuda(), labels.cuda()
        # Look at all models except this one
        other_models = train_models[scenario][np.delete(all_except, i)]
        # Pick random models
        other_models = np.random.choice(other_models, num_use_others, replace=False)
        other_models = [x.cuda() for x in other_models]

        features_collected = np.array([matt_modified_scores(model, features, labels, other_model) for other_model in other_models])
        scores = extract_features_for_reference_models(features_collected)
        other_features = custom_feature_collection(model, features, labels, use_dp = use_dp)
        scores = np.concatenate((scores, other_features), 1)

        mem_labels = challenge_dataset.get_solutions()

        # Store
        preds_all.append(mem_labels)
        scores_all.append(scores)
    
    preds_all = np.concatenate(preds_all)
    scores_all = np.concatenate(scores_all)
    
    X_for_meta[scenario] = scores_all
    Y_for_meta[scenario] = preds_all

Use auto-sklearn (an automl package does does pipeline, optimizer, and hyper-param optimization for you) worked out best, second to using random-forest classifiers.

In all scenarios, training classifiers separately for the different scenarios (no, low, high DP) worked better than training a single classifier, even when the scenario is explicitly provided as an input feature (one-hot encoding).

# Train different meta-classifiers per scenario
CHALLENGE = "cifar10"
scenarios = os.listdir(CHALLENGE)
meta_clfs = {x: autosklearn.classification.AutoSklearnClassifier(memory_limit=64 * 1024, time_left_for_this_task=180, metric=autosklearn.metrics.roc_auc) for ii, x in enumerate(X_for_meta.keys())}

avg = 0
use_all = False
for sc in scenarios:
    train_split_og, test_split_og = train_test_split(np.arange(100), test_size=20)
    train_split = np.concatenate([np.arange(200) + 200 * i for i in train_split_og])
    test_split = np.concatenate([np.arange(200) + 200 * i for i in test_split_og])
    if use_all:
        X_train = X_for_meta[sc]
        X_test  = X_for_meta[sc]
        y_train = Y_for_meta[sc]
        y_test  = Y_for_meta[sc]
    else:
        X_train = X_for_meta[sc][train_split]
        X_test  = X_for_meta[sc][test_split]
        y_train = Y_for_meta[sc][train_split]
        y_test  = Y_for_meta[sc][test_split]

    meta_clfs[sc].fit(X_train, y_train)
    preds = meta_clfs[sc].predict_proba(X_test)[:, 1]
    preds_train = meta_clfs[sc].predict_proba(X_train)[:, 1]
    
    print(f"{sc} AUC (train): {roc_auc_score(y_train, preds_train)}")
    scores = score(y_test, preds)
    scores.pop('fpr', None)
    scores.pop('tpr', None)
    display(pd.DataFrame([scores]))
    avg += scores['TPR_FPR_1000']

print("Average score", avg / 3)

For submissions, I set use_all to True (since the autoML classifier already does train-val splits internally). The train-test split above is just for visualization purposes (and to see how well the classifier performs on val data).

Feature Inspection

A closer look (via permutation importance) at the features shows some interesting trends.

from sklearn.inspection import plot_partial_dependence, permutation_importance

labels = []
for i in range(4):
    labels.append(f"range_{i+1}")
    labels.append(f"abssum_{i+1}")
    labels.append(f"min_{i+1}")
    labels.append(f"max_{i+1}")
labels.append("ascent_loss")
labels.append("ascent_diff")
labels.append("adv_ascent_loss")
labels.append("adv_ascent_diff")
labels.append("ext_epo_1")
labels.append("ext_epo_2")
labels.append("ext_epo_3")
labels.append("merlin")
labels.append("lira")
labels.append("blindwalk")

scenario = "cifar10_inf"
r = permutation_importance(meta_clfs[scenario], X_for_meta[scenario], Y_for_meta[scenario], n_repeats=10, random_state=0)
sort_idx = r.importances_mean.argsort()[::-1]

plt.boxplot(
    r.importances[sort_idx].T, labels=[labels[i] for i in sort_idx]
)
    
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

for i in sort_idx[::-1]:
    print(
        f"[{i}] {labels[i]:10s}: {r.importances_mean[i]:.3f} +/- "
        f"{r.importances_std[i]:.3f}"
    )

The abs-sum and range of gradient alignment values across reference models seem to be most useful. Features like LiRA and gradient norms in the extended-epoch simulation also seem to be important for the case of no DP.

...
[20] ext_epo_1 : 0.010 +/- 0.001
[0] range_1   : 0.013 +/- 0.001
[15] max_4     : 0.016 +/- 0.003
[13] abssum_4  : 0.026 +/- 0.003
[5] abssum_2  : 0.028 +/- 0.003
[24] lira      : 0.045 +/- 0.002
[9] abssum_3  : 0.047 +/- 0.003
[21] ext_epo_2 : 0.054 +/- 0.003
[22] ext_epo_3 : 0.064 +/- 0.003

However, the same trend does not hold when looking at models with DP.

...
[22] ext_epo_3 : 0.018 +/- 0.002
[1] abssum_1  : 0.020 +/- 0.003
[12] range_4   : 0.020 +/- 0.003
[0] range_1   : 0.020 +/- 0.002
[13] abssum_4  : 0.023 +/- 0.003
[5] abssum_2  : 0.024 +/- 0.003
[9] abssum_3  : 0.027 +/- 0.002
[24] lira      : 0.053 +/- 0.003

In this case, most of the classifier’s performance seems to originate from LiRA, followed by features extracted from reference models. However, features like the one extracted from the extended-epoch simulation seem to be less somewhat important - in fact, more than they are in the case of no DP.

Interestingly, for the case of low-$\epsilon$ DP, merlin-based features turn out to be the most important.

...
[15] max_4     : 0.012 +/- 0.002
[1] abssum_1  : 0.014 +/- 0.003
[14] min_4     : 0.014 +/- 0.002
[24] lira      : 0.016 +/- 0.002
[8] range_3   : 0.017 +/- 0.003
[20] ext_epo_1 : 0.027 +/- 0.004
[23] merlin    : 0.028 +/- 0.002

The utility of each of these features, though, is significantly lower than scenarios with higher-$\epsilon$ DP, or no DP at all. This is understandable, since the DP is very much intended to reduce membership inference risk. However, it is interesting that the relative utility of features is not same across the scenarios, further showing why having individual meta-classifiers for the scenarios is useful.

Next, we collect features for all models across scenarios and phases. Helps to re-use features when experiment with different meta-classifiers. No need to re-run for ‘train’ models, since we already have features for those.

CHALLENGE = "cifar10"
LEN_TRAINING = 50000
LEN_CHALLENGE = 100

scenarios = os.listdir(CHALLENGE)
phases = ['dev', 'final']
stored_features = {}

dataset = load_cifar10(dataset_dir="/u/as9rw/work/MICO/data")

for i, scenario in tqdm(enumerate(scenarios), desc="scenario", total=3):
    use_dp = not scenario.endswith('_inf')
    stored_features[scenario] = {}
    for phase in tqdm(phases, desc="phase"):
        stored_features[scenario][phase] = []
        root = os.path.join(CHALLENGE, scenario, phase)
        all_except = np.arange(100)
        for j, model_folder in tqdm(enumerate(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1]))), desc="model", total=len(os.listdir(root))):
            path = os.path.join(root, model_folder)
            challenge_dataset = ChallengeDataset.from_path(path, dataset=dataset, len_training=LEN_TRAINING)
            challenge_points = challenge_dataset.get_challenges()
            
            model = load_model('cifar10', path)
            challenge_dataloader = torch.utils.data.DataLoader(challenge_points, batch_size=2*LEN_CHALLENGE)
            features, labels = next(iter(challenge_dataloader))

            model.cuda()
            features, labels = features.cuda(), labels.cuda()
            # Look at all models except this one
            other_models = train_models[scenario][np.delete(all_except, j)]
            other_models = np.random.choice(other_models, num_use_others, replace=False)
            features_collected = np.array([matt_modified_scores(model, features, labels, other_model.cuda()) for other_model in other_models])
            scores = extract_features_for_reference_models(features_collected)
            other_features = custom_feature_collection(model, features, labels, use_dp=use_dp)
            processed_features = np.concatenate((scores, other_features), 1)

            stored_features[scenario][phase].append(processed_features)

Time to generate predictions!

CHALLENGE = "cifar10"
LEN_TRAINING = 50000
LEN_CHALLENGE = 100

scenarios = os.listdir(CHALLENGE)
phases = ['dev', 'final']

dataset = load_cifar10(dataset_dir="/u/as9rw/work/MICO/data")

for i, scenario in tqdm(enumerate(scenarios), desc="scenario", total=3):
    for phase in tqdm(phases, desc="phase"):
        root = os.path.join(CHALLENGE, scenario, phase)
        j = 0
        for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
            path = os.path.join(root, model_folder)
            features_use = stored_features[scenario][phase][j]
            # Using scenario-wise meta-classifier
            predictions = meta_clfs[scenario].predict_proba(features_use)[:, 1]
            j += 1
            assert np.all((0 <= predictions) & (predictions <= 1))

            with open(os.path.join(path, "prediction.csv"), "w") as f:
                 csv.writer(f).writerow(predictions)

Packaging the submission: now we can store the predictions into a zip file, which you can submit to CodaLab.

import zipfile

phases = ['dev', 'final']
experiment_name = "final_submission"

with zipfile.ZipFile(f"submissions_cifar/{experiment_name}.zip", 'w') as zipf:
    for scenario in tqdm(scenarios, desc="scenario"): 
        for phase in tqdm(phases, desc="phase"):
            root = os.path.join(CHALLENGE, scenario, phase)
            for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
                path = os.path.join(root, model_folder)
                file = os.path.join(path, "prediction.csv")
                if os.path.exists(file):
                    zipf.write(file)
                else:
                    raise FileNotFoundError(f"`prediction.csv` not found in {path}. You need to provide predictions for all challenges")

Takeaways

This was my first stint with actively developing and launching a membership inference attack, and I definitely learned a lot! I can now see that membership inference is hard, even when access to a highly-overlapping dataset (and models trained on similar data) are provided. I also learned that while using observation-based features directly (such as LiRA, scaled to the $[0, 1]$ scale) is commonpractice in MI literature, switching to better classifiers than linear regression (better yet- automating the entire model pipeline) can give non-trivial boosts in performance. I hope this post was helpful to you, and I’d love to hear your thoughts and feedback!

My solutions (for both CIFAR and Purchase100) are available here: https://github.com/iamgroot42/MICO

Dissecting Distribution Inference

2022-12-15T00:00:00+00:00

Distribution inference attacks aims to infer statistical properties of data used to train machine learning models. These attacks are sometimes surprisingly potent, as we demonstrated in previous work.

KL Divergence Attack

Most attacks against distribution inference involve training a meta-classifier, either using model parameters in white-box settings , or using model predictions in black-box scenarios . While other black-box were proposed in our prior work, they are not as accurate as meta-classifier-based methods, and require training shadow models nonetheless .

We propose a new attack: the KL Divergence Attack. Using some sample of data, the adversary computes predictions on local models from both distributions as well as the victim’s model. Then, it uses the prediction probabilities to compute KL divergence between the victim’s models and the local models to make its predictions. Our attack outperforms even the current state-of-the-art white-box attacks.

We observe several interesting trends across our experiments. One striking example is that with varying task-property correlation. While intuition suggests increasing inference leakage with increasing correlation between the classifier’s task and the property being inferred, we observe no such trend:

Distinguishing accuracy for different task-property pairs for Celeb-A dataset for varying correlation. Task-property correlations are: $\approx 0$ (Mouth Slightly Open-Wavy Hair), $\approx 0.14$ (Smiling-Female), $\approx 0.28$ (Female-Young), and $\approx 0.42$ (Mouth Slightly Open-High Cheekbones).

Impact of adversary’s knowledge

We evaluate inference risk while relaxing a variety of implicit assumptions of the adversary;s knowledge in black-box setups. Concretely, we evaluate label-only API access settings, different victim-adversary feature extractors, and different victim-adversary model architectures.

Victim Model	Adversary Model
Victim Model	RF	LR	MLP$_2$	MLP$_3$
Random Forest (RF)	12.0	1.7	5.4	4.9
Linear Regression (LR)	13.5	25.9	3.7	5.4
Two-layer perceptron (MLP$_2$)	0.9	0.3	4.2	4.3
Three-layer perceptron (MLP$_3$)	0.2	0.3	4.0	3.8

Consider inference leakage for the Census19 dataset (table above with mean $n_{leaked}$ values) as an example. Inference risk is significantly higher when the adversary uses models with learning capacity similar to the victim, like both using one of (MLP$_2$, MLP$_3$) or (RF, MLP). Interestingly though, we also observe a sharp increase in inference risk when the victim uses models with low capacity, like LR and RF instead of multi-layer perceptrons.

Defenses

Finally, we evaluate the effectiveness of some empirical defenses, most of which add noise to the training process.

For instance while inference leakage reduces when the victim utilizes DP, most of the drop in effectiveness comes from a mismatch in the victim’s and adversary’s training environments:

Distinguishing accuracy for different for Census19 (Sex). Attack accuracy drops with stronger DP guarantees i.e. decreasing privacy budget $\epsilon$.

Compared to an adversary that does not use DP, there is a clear increase in inference risk (mean $n_{leaked}$ increases to 2.9 for $\epsilon=1.0$, and 4.8 for $\epsilon=0.12$ compared to 4.2 without any DP noise).

Our exploration of potential defenses also reveals a strong connection between model generalization and inference risk (as apparent below, for the case of Celeb-A), suggesting that the defenses that do seem to work are attributable to poor model performance, and not something special about the defense itself (like adversarial training or label noise).

Mean distinguishing accuracy on Celeb-A (Sex), for varying number of training epochs for victim models. Shaded regions correspond to error bars. Distribution inference risk increases as the model trains, and then starts to decrease as the model starts to overfit.

Summary

The general approach to achieve security and privacy for machine-learning models is to add noise, but our evaluations suggest this approach is not a principled or effective defense against distribution inference. The main reductions in inference accuracy that result from these defenses seem to be due to the way they disrupt the model from learning the distribution well.

Paper: Anshuman Suri, Yifu Lu, Yanjin Chen, David Evans. Dissecting Distribution Inference. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 8-10 February 2023.

Code: https://github.com/iamgroot42/dissecting_distribution_inference

Running scripts on Rivanna at UVA

2022-02-03T00:00:00+00:00

Overview

Our department has a nice collection of 16 servers, each with 4 GPUs. While this may sound like a lot, the servers are shared across the department and thus fill out pretty fast. We do have a SLURM cluster too, but it’s not as fast (or big enough to run GPU-based jobs via SLURM) as the servers. I recently looked into Rivanna after my advisor suggested it. I did write some sbatch scripts during my internship at Oracle Research Labs, so that definitely came in handy while trying to write wrapper scripts for the Rivanna cluster.

Settings things up

The structure for these environments here is pretty similar to what we have for the CS servers. You can load up specific modules using module load. Since sbatch files get passed onto in a new bash environment, it’s always a good idea to have all your module load and other related commands (like conda activate) in your .bashrc file so that you don’t have to worry about adding all of them to every sbatch file.

For reference, here’s what I added to my .bashrc:

module load singularity
module load cudatoolkit/11.0.3-py3.8
module load cuda/11.4.2
module load cudnn/8.2.4.15
module load anaconda/2020.11-py3.8

#Identify whether using Rivanna version (load data accordingly)
export ISRIVANNA=1

# Conda-init related stuff that is auto-populated
conda activate phd # phd is the name of my environment

The only downside here is that storage is not shared with the other CS servers: you must either commit to using only Rivanna or only CS servers, or make sure you regularly sync your generated code (which is straightforward, thanks to Git) and data (not so trivial).

The Rivanna cluster has a cronjob of sorts that deletes files that aren't accessed for more than 90 days.

There are two storage directories: /home and /scratch. The former has a 50GB quota limit with weekly snapshots for backup, while the /scratch directory has a quota of 10TB/350,000 files (whichever is more). As mentioned on the Rivanna Storage page:

Slurm jobs run against /home will be slower than those run against /scratch

Thus, it is advisable to have all your scripts and data in the /scratch directory, even your Anaconda environment. You can specify a location for your Conda environment with the --prefix flag while running conda create.

Writing SBATCH scripts

Okay, enough talk! Let’s start with a basic wrapper script- let’s call it test.sbatch. If you are already familiar with these options and/or just want to use the defaults and get started, feel free to use the template here.

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH -A uvasrg
#SBATCH --mem=32G
#SBATCH -p gpu
#SBATCH --gres=gpu:rtx2080:1
#SBATCH --cpus-per-task=10
#SBATCH --time=3-00:00:00
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

# Your command goes here
echo "Parameters were $PARAM1 and $PARAM2"

Wait, wait, wait! Aren’t the #SBATCH commands going to get ignored (starts with #, so it’s got to be a comment, duh?). Well, not really- #SBATCH commands are treated differently.

`#SBATCH --ntasks=1`

This argument specifies the number of tasks you want to run with the given script. In most cases, this will be 1 unless you want parallel execution and will utilize it explicitly via your script.

For instance, you could specify --ntasks=2 to run two tasks in parallel:

 echo "Hello there" & 
 echo "General Kenobi" &
 wait

This can be particularly useful when you have some form of caching, utilize the same file across different scripts, or just like the idea of having all your experiments run on the same physical machine.

`#SBATCH -A uvasrg`

This option specifies which “allocation” you want to use. As a user at UVA (or SLURM clusters in general), you may be part of one or more allocation groups, which all have their compute budgets. This option ensures that your scripts run on the allocation group you specify. If you happen to be in UVASRG, this is the option for you!

If you’re curious about the compute budget used up/left in your allocation, you can run:

allocations

Furthermore, if you’re curious about other members in the same allocation group, you can run:

allocations -a

`#SBATCH --mem=32G`

This is pretty straightforward and lets you specify the total memory (in GBs) you want to allocate to your job. In most cases, 32 or 64 GBs is enough.

Rivanna has an upper limit of 32 GBs of memory per core, so you'll need to be careful with this.

`#SBATCH -p gpu`

This option here specifies which partition you want your scripts to run on. For most users (at least with machine learning), this will be ‘gpu’. You can have a look at all the available partitions here.

`#SBATCH --gres=gpu:rtx2080:1`

This option here (the --gres in general) allows you to specify configurations of the nodes you want to run your scripts on. In this case, the rtx2080 is the name of the GPU card, and 1 is the number of cards you want to use. In terms of speed, you might want to prefer v100 over 2080.

One thing to note here: there may be times when machines with one type of GPU card are free while others are in use. It might be better to specify the other GPU in such cases instead of waiting for the busy GPU machines to clear up. But how exactly would you know which machines are free and which ones are not?

sinfo -o "%20N %10R %10e %25f %25G %t %C" -t IDLE,MIX

This provides information on the status of all machines, their GPU cards (and how many they have), free memory, and free/busy CPU cores. Pretty useful, huh!

`#SBATCH --cpus-per-task=10`

This option lets you specify the number of CPUs you require per task in your script. From what I know, Rivanna has an upper limit of 10 per job for the GPU servers, but I’m not too sure about it.

`#SBATCH --time=3-00:00:00`

This option here lets you specify an upper limit for your job. SLURM will terminate any jobs after this given time limit (set to a default value much lower than 3 days for Rivanna, from what I remember). We have an upper limit of 3 days, so this pretty much tells SLURM to run the job for as long as possible. If you want to run your job for longer, make sure you cache results so that the re-run can pick up from where the previous job left off.

`#SBATCH --output=logs/%x-%j.out`

These two formats (this and #SBATCH --error=logs/%x-%j.err) specify the filenames for error and log files. The %x is the script’s name, and %j is the job ID.

P.S. You might wanna make sure you have the directory `logs` where you submit your job.

Submitting a job

There you go! Now that you know what each flag here means, let’s get to submitting the job:

 sbatch --export=ALL,PARAM1='no',PARAM2=42 --job-name=try test.sbatch

Note how we have two additional parameters, PARAM1 and PARAM2. This is a way for you to pass parameters to your job, which can then be used in your script. Make sure you leave the ALL in there since this passes on your environment’s predefined variables to the job.

Another sidenote (that I found out the weird way) - make sure all your flags and options are specified before the filename of your script. If they aren't, `sbatch` will simply ignore them.

Job Arrays

Sometimes you might want to run the same commands inside but with different parameters- maybe you want to run a grid search or generate results for multiple datasets. Instead of creating new sbatch scripts in this case, you can utilize job arrays. This is a way to run multiple jobs with the same configuration but with different parameters.

For instance, you could run a job array with the following command:

sbatch --job-name=try test.sbatch --array=0-6

Note the additional --array=0-6 option. This specifies that you want to run 7 jobs, starting from 0 until 6 (all-inclusive). You could then modify the sbatch script to use different parameters based on the job index:

...
RATIOS=(0.2 0.3 0.4 0.5 0.6 0.7 0.8)
echo "Running for ratio ${RATIOS[$SLURM_ARRAY_TASK_ID]}"
...

Tips and Tricks

Start-Time Estimate

Submitted a job, but it’s stuck in (Priority) or (Resources)? If you see it’s way too much, you can always default back to some other servers (or change the configuration of your job so that it gets accepted). In cases like these, getting a rough estimate for the start time for your job can be helpful.

 squeue -u  --start

If you skip the --start above, you can view all your jobs and relevant information: how long they’ve been running for, their status, etc.

Opening interactive sessions

Sometimes you may want to test your code out for smaller cases and/or debug, and going to and fro with log files may not be very convenient. You can open an interactive session with your job using the following command:

 ijob -A uvasrg -p gpu --time=0-00:30:00 --gres=gpu:rtx2080:1 --mem=8G 

This command here would open an interactive session (capped at 30 minutes, so that you don’t accidentally leave it on and get charged for it) with your job, run on a machine with the RTX2080 GPU card and an allocation of 8GB memory.

Adjusting resources based on job efficiency

It may so happen that you over-estimate the resources needed for your job, which can lead to higher allocation consumption as well as your scripts running at a later time (because of busy resources). A good starting point is to look at runs of your completed scripts:

you@machine:~$ sacct --starttime=2022-02-04 --endtime=2022-02-11 --state COMPLETED  -u 
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
32259190       job1        gpu              16  COMPLETED      0:0 
32259191       job2        gpu              16  COMPLETED      0:0 

This command here, for instance, will give you a list of all completed jobs that started and finished between 4th and 11th February, 2022. You can then pick any jobID that you like and look at its CPU and memory usage efficiency:

you@machine:~$ seff 32259190
Job ID: 32259190
Cluster: shen
User/Group: /users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 1-20:54:32
CPU Efficiency: 21.72% of 8-14:47:28 core-walltime
Job Wall-clock time: 12:55:28
Memory Utilized: 5.72 GB
Memory Efficiency: 17.86% of 32.00 GB

Running GPU-based scripts

It’s good practice to specify the GPU card you want to use for your job. Most users might be familiar with CUDA_VISIBLE_DEVICES and setting this value before their experiments, or via code (in PyTorch, for instance). However, the machines that SLURM runs scripts on have shared GPUs that are visible to each job for some reason. As a result, even if you request one GPU and your job is run on a machine with 8 GPUs, using export CUDA_VISIBLE_DEVICES=0 will, instead of doing nothng (since you’d expecte the visible compute environment to have only one GPU), the job will run on the 1st GPU, even if it’s not the one that’s free.

I found this out the hard way (not mentioned on the website documentation, but the support staff were nice enough to tell me about it)- I submitted 8-10 jobs and they were all really slow, since they all got assigned the same machine and the CUDA_VISIBLE_DEVICES made them run on the same GPU.

tl;dr Do not set CUDA_VISIBLE_DEVICES (via envionment variable or within code) for your jobs- SLURM manages that for you.

On the Risks of Distribution Inference

2021-06-27T00:00:00+00:00

Inference attacks seek to infer sensitive information about the training process of a revealed machine-learned model, most often about the training data.

Standard inference attacks (which we call “dataset inference attacks”) aim to learn something about a particular record that may have been in that training data. For example, in a membership inference attack , the adversary aims to infer whether or not a particular record was included in the training data.

Differential Privacy provides a theoretical notion of privacy that maps well to membership inference attacks. However, it provides privacy at the dataset level. Thus, it doesn’t capture attacks that violate privacy at the distribution level. This is where property inference comes in. Property inference, a different kind of inference risk, involves an adversary that aims to infer some statistical property of the training distribution.

We illustrate the kind of risks introduced by property inference via a fictional example. Skynet, an (imaginary) organization that handles private data, releases a machine learning model $M$ trained on their network flow graphs to predict faulty nodes in a network of servers. However, an adversary ($\mathcal{A}$) that wishes to launch a bot-net into this cluster of servers sees an opportunity in this model. They seek to infer whether the effective diameter ($90^{th}$ percentile of all pair-wise shortest paths) of the network is below 6 ($\mathcal{D}_0$) or not ($\mathcal{D}_1$).

We picked this property as an example based on useful properties cited in the traffic classification literature (e.g. ). Learning this property might be useful for the adversary in crafting a bot-net that would not be detected by Skynet’s bot-detection software. The main point of the illustration is to convey that an adversary can infer properties of the underlying data distribution that a model producer would not expect and that might be valuable to the adversary.

Formalizing Property Inference

To formalize property inference attacks, we adapt the cryptographic game for membership inference proposed by Yeom et al. :

In this game, the victim samples a dataset S from the distribution $\mathcal{D}$ and trains a model $M$ on it. It then samples some data-point $z$ from either $S$ or $\mathcal{D}$, based on $b \xleftarrow{R}${0,1}. The adversary then tries to infer $b$ using algorithm $H$, given access to ($z$, $\mathcal{D}$, $M$). This cryptographic game captures the intuitive notion of membership inference. It focuses on a particular dataset and sample: inferring whether a given data point was part of training data.

In contrast, property inference focuses on properties of the underlying distribution ($\mathcal{D}$), not the dataset ($S$) itself. To capture property inference, we propose a similar cryptographic game. Instead of differentiating between the sources of a specific data point ($S$ or $\mathcal{D}$), we propose distinguishing between two distributions, $\mathcal{D}_0$ and $\mathcal{D}_1$.

A model trainer $\mathcal{B}$ samples a dataset $D$ from either of the distributions $\mathcal{D}_0$, $\mathcal{D}_1$. These distributions can be obtained from the publicly know distribution $\mathcal{D}$ by applying functions $\mathcal{G}_0$, $\mathcal{G}_1$ respectively, that transform distributions (and represent the “property” an adversary might care about). So, we formalize distribution inference with this question:

Given a model trained on this dataset $D$ drawn from either distribution $\mathcal{D}_0$ or $\mathcal{D}_1$, can an adversary infer from which of $\mathcal{D}_0$, $\mathcal{D}_1$ the dataset was sampled?

Frameworks like Differential Privacy do not apply here: the adversary here cares about statistical properties of the distribution the model was trained on, not details about a particular sampled dataset.

Evaluating Risk of Property Inference

Most often in the literature, the adversary considers the ratio of members in a dataset satisfying a particular Boolean function $f$ as the “property.” It then aims to distinguish between models trained on datasets with different proportions.

However, these experiments often test with arbitrary ratios, making it hard to understand the relative risk of different properties. Some examples are Chase et al. which considers 0.1 v/s 0.25, and Zhang et al. which considers 0.33 v/s 0.67.

To better understand how well an intuitive notion of divergence in properties aligns with observed inference risk, we execute property inference attacks with increasing diverging properties. We fix one property (ratio=0.5) and vary the other ($\alpha$). We perform these experiments for three datasets: focusing on the ratio of females for the US Census and RSNA BoneAge datasets, and the average node-degree for the OGBN arXiv dataset.

The state-of-the-art method for property inference attacks involves meta-classifiers, usually using Permutation Invariant Networks (Ganju et al., ). After training hundreds or thousands of models locally, the adversary trains a meta-classifier on model parameters.

Illustration of the functioning of a Permutation Invariant Network. The process of model-parameter extraction involves constructing permutation-invariant representations of neurons per layer $h_i$ using learnable parameters ($\phi_i$). These representations are then joined together for all layers with another learnable transform $\rho$, yielding the meta-classifier’s predictions.

We use two simple attacks (using only model outputs) as baselines:

Loss Test: predict the property based on its performance on data from the same distribution it was trained, compared to the other distribution being analyzed.
Threshold Test: extends the loss test by calibrating performance trends on a small set of models and arriving at a threshold based on model performance.

Experimental Results

Our results demonstrate how a meta-classifier can differentiate between models with ratios as similar as 0.5 and 0.6:

Figure 1: Differentiating between models trained on datasets trained with 50% females v/s females. Orange crosses are for the Loss Test; green with error bars are the Threshold Test; the blue box-plots are the meta-classifiers.

The meta-classifier attacks provide the best predictions, but the loss-test and threshold-test can serve as valuable baselines — even these simple attacks provide accuracies significantly better than random-guessing.

Inferring Graph Properties

Our proposed definitions allow the property to hold over the whole dataset, not just aggregate statistics like mean ratio. Thus, we focus on node-classification for a graph: differentiating between versions of the graph with varying mean node-degrees as the property. We fix one property (mean node-degree 13) and vary the other ($\alpha$). Inferring the mean node-degree is a novel property inference task since the property here holds over the entirety of training data- no such property has been explored in the literature yet.

Figure 2 (Left): Differentiating between models trained on datasets trained with mean node-degree 13 v/s on the OGBN arXiv dataset. Figure 3 (Right): Predicting the mean node-degree of training data graphs directly with a meta-classifier on the OGBN arXiv dataset.

Our results demonstrate how a meta-classifier can also be trained to directly infer the mean-node degree of graphs (Figure 2). Encouraged by the success of meta-classifiers for this task, we also tried a meta-classifier variant to predict the mean-node degree of training graphs (Figure 3). The resulting meta-classifier even generalizes well, accurately predicting mean node-degrees for distributions ($\alpha$={12.5, 13.5}) that it hasn’t seen.

Summary

Our work on distribution inference formalizes and shows how property inference attacks can indeed infer distribution-level properties. Our ongoing work is focused on quantifying and studying this ‘privacy leakage’ of properties and its implications.

Anshuman Suri, David Evans. Formalizing Distribution Inference Risks - Workshop on Theory and Practice of Differential Privacy, ICML 2021.

Code: https://github.com/iamgroot42/distribution_inference

Reassessing adversarial training with fixed data augmentation

2021-06-24T00:00:00+00:00

Overview

A couple months ago, a post on Reddit highlighted a bug in PyTorch + NumPy that affects how data augmentation works (see image above). Knowing nearly all of my projects use this combination, I read through the linked blog by Tanel Pärnamaa to see what it was all about. I was a bit shocked that it took our community this long to notice a bug this severe! Nearly all data-loaders use more than one worker. Unfortunately, not many people (clearly, since it took us all so long to notice this bug) sit down to debug data augmentation at this level within their ML pipeline.

Reading through this bug, I remembered how (proper) data-augmentation had been proposed as a means to reduce robust overfitting by authors at DeepMind . This paper got me thinking: “Could fixing this augmentation bug and rerunning adversarial training lead to gains in robustness?”. Curious to see the impact of fixing this data augmentation bug, I decided to run some experiments of my own. You can head over to the repository and run them yourself if you want.

Experiments

I chose the CIFAR-10 dataset: small enough to iterate experiments fast and challenging enough to observe performance gains.

Standard Training

Interestingly, standard training with the fixed data-augmentation pipeline hurt performance a bit, compared to using faulty augmentation:

Model	Standard Accuracy (%)	Robust Accuracy (ε = 8/255) (%)
Standard	89.140	0.000
Standard (augmentation)	94.720	0.000
Standard (fixed augmentation)	94.620	0.000

Adversarial Training

Not thinking much about the 0.1% performance drop (probably statistical noise, right?), I ran adversarial training with $L_\infty$ robustness ($\epsilon=\frac{8}{255}$):

Model	Standard Accuracy (%)	Robust Accuracy (ε = 8/255) (%)	Robust Accuracy (ε = 16/255) (%)
Robust	79.520	44.370	15.680
Robust (augmentation)	86.320	51.400	17.480
Robust (fixed augmentation)	86.730	51.880	17.570

As visible here, there’s an absolute 0.4% performance gain for $\epsilon=\frac{8}{255}$, and 0.09% performance gain for $\epsilon=\frac{4}{255}$, when using the fixed augmentation pipeline. Although the 0.09% here is not very significant, the 0.4% improvement seems non-trivial. This improvement is especially significant compared to the kind of performance differences reported on benchmarks for this dataset. Additionally, accuracy on clean data sees an improvement as well: absolute 0.41% change.

Not wanting to make any claims based on experiments on just the $L_\infty$ norm, I reran the same set of experiments for the $L_2$ norm ($\epsilon=1$).

Model	Standard Accuracy (%)	Robust Accuracy (%), ε = 0.5	Robust Accuracy (%), ε = 1
Robust	78.190	61.740	42.830
Robust (augmentation)	80.560	67.200	51.140
Robust (fixed augmentation)	81.070	67.620	51.220

Performance gains appear in this case as well. Accuracy on clean data bumps up by 0.51%, while robustness on $\epsilon=0.5$ and $\epsilon=1.0$ improves by 0.42% and 0.08%, respectively. The fact that this case sees a consistent, albeit small, improvement in both clean and perturbed-data performance hints at how simply fixing this augmentation may provide a nice bump in existing training methods. It is very much possible that these gains are just coincidental fluctuations in the randomness of model training. Regardless, fixing data-loaders is something that should be done anyway. The goal of these experiments was to try and quantify the impact of improper augmentation. It would be great if someone with sufficient resources could run these experiments on a larger scale to rule out statistical noise.

Takeaway

Fixing data augmentation can have a non-trivial (and positive) impact when training for robustness. Anyone training robust models (especially with adversarial training, since that is what I tested on) should fix their data-loaders.