LoneNeuron: A Highly-Effective Feature-Domain Neural Trojan Using Invisible and Polymorphic Watermarks

Highlights

  1. A survey of users who download pre-trained models online.
  1. A questionnaire asking participants to annotate whether a pair of images is visually identical.
  1. The watermark is based on steganography.
  1. They ran a large number of experiments for their method; it is more than enough work for one paper.

Background

A significant number of DNN users obtain pre-trained models from the Internet or other third parties, but they are often unwilling or unable to detect and mitigate the security risks of such models. If an adversary deliberately injects backdoors into the codebase of DL models, the backdoored models could spread widely without being noticed by the majority of users.

Definition

White-box attack: The adversary has full access to the trained model and the original training data.
Grey-box attack: The adversary has access to the model only.

LoneNeuron Attack

Trojan Neuron Insertion

  1. Insert the Trojan neuron, which is essentially a ReLU unit, into the victim DNN.
  1. The watermarked image is fed as the input; behind convolution layer 1 we obtain feature maps that carry the poisoned values.
  1. Extract the outputs of convolution layer 1 at the watermark locations as the input of the Trojan neuron.
  1. Binarize these values.
  1. Compare the binarized values with the preset activation pattern; if they match, the Trojan neuron is activated.
  1. Add the output of the Trojan neuron to the output of convolution layer 1 (a code sketch follows below).
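A minimal sketch of how such a Trojan neuron could be implemented as a PyTorch module. This is my own illustration, not the authors' code; the names `LoneNeuron`, `locations`, `pattern`, and `payload` are all hypothetical.

```python
import torch
import torch.nn as nn


class LoneNeuron(nn.Module):
    """Hypothetical feature-domain Trojan neuron (sketch, not the authors' code).

    It reads the conv1 outputs at fixed watermark locations, binarizes them,
    compares them with a preset N-bit activation pattern, and, on a match,
    adds a trained ReLU payload back onto the conv1 feature maps.
    """

    def __init__(self, locations, pattern, feat_shape):
        super().__init__()
        self.locations = locations                 # list of (channel, row, col) triples
        self.register_buffer("pattern", pattern)   # N-bit pattern k as a 0/1 tensor
        self.payload = nn.Parameter(torch.zeros(feat_shape))  # trainable Trojan output

    def forward(self, feat):
        # feat: conv1 output of shape (batch, channels, H, W)
        bits = torch.stack([feat[:, c, i, j] for (c, i, j) in self.locations], dim=1)
        bits = (bits > 0).float()                            # binarize the extracted features
        match = (bits == self.pattern).all(dim=1).float()    # 1 only if the pattern matches
        # inject the ReLU-gated payload only for watermarked inputs
        return feat + match.view(-1, 1, 1, 1) * torch.relu(self.payload)
```

In use, the module would be spliced in right after the first convolution layer of the victim DNN, e.g. `feat = trojan(conv1(x))`, so benign inputs pass through unchanged.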

Trigger Generation and Embedding

  • Feature-domain Trigger and How to Embed
Each of the r kernels in the first convolution layer captures certain visual features.
Sliding each kernel across the image yields one feature map, so the feature space behind the first convolution layer consists of r such maps.
To select N features to carry the trigger pattern k, they first select a number of feature maps and then identify features at the same locations of each selected map, so that the number of selected maps multiplied by the number of features per map equals N. For example, when only a single feature map is selected, all N features come from that one map.
Features taken from the same feature map should correspond to non-overlapping pixel-domain regions.
In practice, they also use this single-feature-map configuration.
 
The figure below shows an example configuration.
Figure: Example of embedding an N-bit activation pattern k={ki} in the feature domain, and then reconstructing polymorphous pixel-domain watermarks for the same image.
  • Pixel-domain Watermark (reverse engineering)
The maliciously edited feature representation is used to reconstruct the watermarked image in the pixel domain. That is, the goal is to identify a pixel-domain watermark such that the watermarked image, when passed through the first convolution layer, carries the embedded values at the selected feature locations.
Besides, two additional constraints apply:
1) only integer (pixel-value) solutions are accepted, and
2) the distance between the watermarked image and the original image should be small.
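A minimal sketch of how such a pixel-domain watermark could be found, assuming the trigger lives behind a single `conv1` layer and using plain gradient descent with rounding; the paper's actual solver and constraints may differ, and `locations` / `target_vals` are hypothetical names.

```python
import torch
import torch.nn.functional as F


def reverse_engineer_watermark(x, conv1, locations, target_vals,
                               steps=500, lr=0.1, lam=1.0):
    """Search for a small pixel perturbation delta such that conv1(x + delta)
    takes the embedded values at the selected feature locations.
    Sketch only; assumes x is a single image of shape (1, C, H, W) in [0, 255]."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        feat = conv1(x + delta)
        picked = torch.stack([feat[0, c, i, j] for (c, i, j) in locations])
        # (1) hit the embedded feature values, (2) keep the perturbation small
        loss = F.mse_loss(picked, target_vals) + lam * delta.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # integer-only solutions within the valid pixel range
    return (x + delta).round().clamp(0, 255).detach()
```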

Training the Trojan Neuron

To generate the poisoned training data, a white-box attacker could add watermarks to the raw training data, while a grey-box attacker could use completely random images.
Training LoneNeuron through fine-tuning does not require massive effort. They freeze the pre-trained victim DNN except for the Trojan neuron and train it with the watermarked samples (labelled as the target class). The Trojan output is trained to a similar scale as the other outputs from the first convolution layer, so the poisoned model demonstrates seemingly benign behaviours internally. Formally, this is a multi-objective optimization problem: one term drives the watermarked inputs toward the target label, another limits the L1 norm of the Trojan output, and weighting coefficients denote the relative importance of each objective.
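A minimal sketch of this fine-tuning step, assuming a victim `model` into which the hypothetical `LoneNeuron` module above has already been spliced after conv1; the loss terms and weights are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn as nn


def train_trojan(model, trojan, loader, target_label, lam=0.01,
                 epochs=5, lr=1e-3, device="cpu"):
    """Fine-tune only the Trojan neuron: watermarked inputs should be classified
    as the target label while the Trojan payload stays small (L1 penalty),
    so its output remains on a similar scale to the benign conv1 activations."""
    for p in model.parameters():            # freeze the pre-trained victim DNN
        p.requires_grad_(False)
    for p in trojan.parameters():           # ...except the Trojan neuron itself
        p.requires_grad_(True)
    opt = torch.optim.Adam(trojan.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_wm, _ in loader:              # batches of watermarked images
            x_wm = x_wm.to(device)
            y = torch.full((x_wm.size(0),), target_label, device=device)
            logits = model(x_wm)            # forward pass runs through the Trojan
            loss = ce(logits, y) + lam * trojan.payload.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return trojan
```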

Experiments and Attack Evaluation

Training and Attack Effectiveness

In the white-box attack, they insert watermarks into a small number of random images from the original training datasets.
In the grey-box attack, they employ completely random images, insert watermarks into them, and use them to train the LoneNeuron.
 
Only one neuron is added to the model.
 
Single-label Attacks: 1,000 randomly selected images with 1,000 activation patterns for each image, and 100 polymorphous watermarks for each activation pattern.
They do not observe any performance differences between white-box and grey-box attacks.
 
Multi-Label Attacks: In this experiment, they create M ∈ [2, M_D] activation patterns for M random target labels, where M_D is the total number of labels in each dataset. They insert M Trojan neurons into the victim DNN, train them individually, and embed one watermark in each testing image. The attack success rates remain at 100% for all the DNN models. Multiple Trojans interfere neither with each other nor with the benign samples, which is consistent with the theoretical analysis.
LoneNeuron against Vision Transformers.
In all the victim models on all datasets, they achieve 100% attack success rates with 0% main task degradation and the same level of watermark stealthiness as in the CNN attacks.
LoneNeuron against Speech Recognition.
LoneNeuron achieves 0% degradation of the main-task performance, and 73.9% and 75.4% word-level attack success rates on the AN4 and Librispeech datasets.
 

Watermark Polymorphism & Stealthiness

(1) watermark polymorphism;
Generating polymorphous pixel-domain watermarks is very efficient.
(2) numerical similarity analysis;
MSE, SAM, and LPIPS are all close to 0, while SSIMs are close to 1. The results show that the watermarked images are very similar to the original images under different image similarity/quality measurements, including ones that have been shown to be highly consistent with human perceptual similarity judgments (a sketch of these metrics follows after this list).
(3) randomness analysis;
The polymorphic watermarks introduced completely random perturbations to the victim images. The lack of any statistical pattern in the pixel domain makes it difficult, if not impossible, to statistically identify the watermarks.
(4) user study; and
They sent the questionnaire to 300+ undergraduate students to annotate whether a pair of images is visually identical. 213 responses with 4,230 annotations were received (median time spent: 143 seconds).
(5) stealthiness against DNN-based detectors
They generate 10,000 of LoneNeuron's polymorphous watermarked images from the same activation pattern and 10,000 benign images, and attempt to classify them with all four classifiers. None of the training processes converged, and the training accuracy remained at 50%. The results indicate that even complex DNNs could not capture any meaningful information to identify the polymorphous watermarks.
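For reference, a minimal sketch of how the similarity metrics from item (2) could be computed for one image pair, assuming a recent scikit-image. LPIPS needs the separate `lpips` package and is omitted, and the SAM here is a simplified whole-image variant.

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity


def similarity_report(original, watermarked):
    """Compare an original and a watermarked image (H x W x C uint8 arrays)."""
    mse = mean_squared_error(original, watermarked)
    ssim = structural_similarity(original, watermarked, channel_axis=-1)
    # Spectral Angle Mapper, simplified to a single angle over the whole image
    a = original.astype(float).ravel()
    b = watermarked.astype(float).ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sam = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    return {"MSE": mse, "SSIM": ssim, "SAM": sam}
```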

LoneNeuron against SOTA Defenses

None of the existing defense methods is effective against this backdoor attack.

Attack Robustness

Fine-tuning: the earlier layers of a DNN are frozen while only the last layer(s) are retrained.
  • Since the first convolution layer (and thus the Trojan neuron behind it) stays frozen, fine-tuning does not remove LoneNeuron.
Retraining: all the weights in the network are updated.
Pruning:
  • LoneNeuron remains 100% effective even when 80% of the neurons in each layer are pruned.
JPEG Compression:
They designed an effective and reasonably efficient approach to embed watermarks into attack images so that the watermarks survive JPEG compression.

Comparison with Other Neuron Attacks


Conclusion

  1. Present LoneNeuron, a feature-domain neural Trojan using invisible and polymorphic watermarks.
  1. Evaluate LoneNeuron with 8 DNN models and two vision transformers on five popular benchmarking datasets.
  1. LoneNeuron escapes all SOTA neural Trojan detectors and is robust against fine-tuning.
  1. Show that LoneNeuron could be employed to attack DNNs in other application domains, such as speech recognition.

What can we do?

  1. Design defense methods against this kind of backdoor attack, given that even pruning fails against it. For such pruning-resistant backdoor attacks, can we still build effective defenses?
  1. Since few users inspect downloaded models, we could embed a copyright mark into the model. If we can compare the copyright elements between two models, we can determine which one is the benign model and which one is the backdoored model.
  1. Watermark the dataset itself. Even if both the dataset and the model are downloaded, as long as the model's last layer reveals the mark of the watermarked dataset, that indicates the model was trained purely on that dataset. (I am not sure whether this is explained clearly.)
     