This evaluation suite compares GLUE results with Adversarial GLUE (AdvGLUE), a multi-task benchmark that evaluates the robustness of modern large-scale language models against various types of adversarial attacks.
This suite requires installing the IntelAI/evaluate fork of the evaluate library.
After installation, there are two steps: (1) loading the Adversarial GLUE suite; and (2) calculating the metric.
The suite covers these GLUE subsets: sst2, mnli, qnli, rte, and qqp. More information about the different subsets of the GLUE dataset can be found on the GLUE dataset page.
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('intel/adversarial_glue')
results = suite.run("gpt2")

The output of the metric depends on the GLUE subset chosen. It is a dictionary that contains one or several of the following metrics:
accuracy: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see accuracy for more information).
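As a minimal sketch of the accuracy definition above (not the suite's own implementation), the following pure-Python function computes the proportion of predictions that match their reference labels. The function name `accuracy` and the example labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Return the fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty input")
    # Count positions where the predicted label matches the reference label.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical binary labels, e.g. for an sst2-style task: 3 of 4 match.
example_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(example_score)
```

The result always lies between 0 and 1, matching the range stated above.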
The original GLUE paper reported average scores ranging from 58% to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
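The scaling and averaging described above can be sketched as follows. This assumes the suite returns a list of dictionaries, one per task, each holding an `accuracy` value in the range 0 to 1 and a `task_name` string. The helper name `average_score` and every number below are made-up placeholders, not reported results.

```python
from statistics import mean
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def average_score(results: List[Result], metric: str = "accuracy") -> float:
    """Scale each task's metric by 100 and return the unweighted mean."""
    if not results:
        raise ValueError("results must contain at least one task")
    return mean(100.0 * float(r[metric]) for r in results)


# Placeholder values for illustration only; these are not measured scores.
placeholder_results = [
    {"task_name": "task_a", "accuracy": 0.50},
    {"task_name": "task_b", "accuracy": 0.25},
]
print(average_score(placeholder_results))
```

An unweighted mean treats every task equally regardless of its size, which is one common convention; a different weighting may suit other comparisons.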
For more recent model performance, see the dataset leaderboard.
For a full example, see HF Evaluate Adversarial Attacks.ipynb.
This evaluation suite works only with datasets that have the same format as the GLUE dataset.
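One way to check the format requirement before running an evaluation is sketched below. The column names are an assumption, taken to mirror the Hugging Face `glue` dataset, and the helper `missing_columns` is hypothetical rather than part of the suite.

```python
from typing import Dict, Mapping, Tuple

# Text columns per subset, assumed to mirror the Hugging Face `glue` dataset.
EXPECTED_TEXT_COLUMNS: Dict[str, Tuple[str, ...]] = {
    "sst2": ("sentence",),
    "mnli": ("premise", "hypothesis"),
    "qnli": ("question", "sentence"),
    "rte": ("sentence1", "sentence2"),
    "qqp": ("question1", "question2"),
}


def missing_columns(subset: str, record: Mapping[str, object]) -> Tuple[str, ...]:
    """Return the expected columns (text fields plus `label`) absent from record."""
    if subset not in EXPECTED_TEXT_COLUMNS:
        raise KeyError(f"unknown subset: {subset}")
    expected = EXPECTED_TEXT_COLUMNS[subset] + ("label",)
    return tuple(name for name in expected if name not in record)


# A complete rte-style record and an incomplete qqp-style record.
print(missing_columns("rte", {"sentence1": "A.", "sentence2": "B.", "label": 0}))
print(missing_columns("qqp", {"question1": "Why?", "label": 1}))
```

An empty tuple means the record has every expected column; any names returned point to fields that would need to be added or renamed.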
While the GLUE dataset is meant to represent “General Language Understanding”, the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.
@inproceedings{wang2021adversarial,
  title={Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models},
  author={Wang, Boxin and Xu, Chejian and Wang, Shuohang and Gan, Zhe and Cheng, Yu and Gao, Jianfeng and Awadallah, Ahmed Hassan and Li, Bo},
  booktitle={Advances in Neural Information Processing Systems},
  year={2021}
}