Who’s Harry Potter? Approximate Unlearning in LLMs: Evaluation methodology

3 Jul 2024


(1) Ronen Eldan, Microsoft Research (email: roneneldan@microsoft.com);

(2) Mark Russinovich, Microsoft Azure (email: mark.russinovich@microsoft.com). Both authors contributed equally to this work.

3 Evaluation methodology

To assess the efficacy of our unlearning technique, we ground our evaluation framework in two primary dimensions: preservation of general model capabilities and eradication of specific, targeted knowledge.

3.1 Preservation of General Capabilities

To ensure that our method does not impair the model’s overall capabilities when prompts are unrelated to the unlearned topic, we use widely accepted benchmarks such as WinoGrande, HellaSwag, and PIQA to objectively gauge the model’s performance and confirm that its general linguistic understanding and a wide array of other capabilities remain intact.
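As a minimal sketch of this check, assuming benchmark accuracies for the baseline and the unlearned model have already been computed by some evaluation harness, the comparison can be reduced to per-benchmark differences. The function names, the tolerance, and the scores below are illustrative placeholders, not values from this paper.

```python
from typing import Dict, List


def capability_deltas(baseline: Dict[str, float],
                      unlearned: Dict[str, float]) -> Dict[str, float]:
    """Unlearned-minus-baseline score for every benchmark present in both."""
    return {name: unlearned[name] - baseline[name]
            for name in baseline if name in unlearned}


def regressions(deltas: Dict[str, float], tolerance: float = 0.01) -> List[str]:
    """Benchmarks whose score dropped by more than `tolerance` (an assumed threshold)."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Placeholder accuracies for illustration only; NOT results from the paper.
baseline_scores = {"winogrande": 0.60, "hellaswag": 0.50, "piqa": 0.70}
unlearned_scores = {"winogrande": 0.60, "hellaswag": 0.45, "piqa": 0.70}

deltas = capability_deltas(baseline_scores, unlearned_scores)
print(regressions(deltas))  # names of benchmarks that regressed beyond the tolerance
```

An empty list would indicate that no benchmark dropped by more than the chosen tolerance; what counts as an acceptable drop is a judgment call that this sketch does not settle.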

3.2 Eradication of Targeted Knowledge

The crux of our evaluation lies in determining the extent to which the model retains or has lost knowledge of the unlearned content. This component primarily involves a series of black-box tests that use prompts curated to elicit knowledge about the unlearned content (here, the Harry Potter universe), both directly and indirectly.

3.2.1 Completion-based evaluation

We curated a list of prompts, each of which either:

• Provides partial information related to the Harry Potter universe, requiring the model to complete the information based on its internal knowledge.

• Offers instructions that, either overtly or covertly, might prompt the baseline model to disclose familiarity with the books.

Examples of such prompts include: “When Harry returned to class, he observed his best friends,”, “Draft a brief narrative in the style of Harry Potter. Short story:”, and “Narrate a tale about a boy who resides in a cupboard beneath the stairs in his relatives’ home, who are mistreating him, only to later discover he possesses magical abilities.” Prompts also delved into subtler references such as: “While lounging beside the fireplace, the elder pupils recounted the distinct attributes of the four Hogwarts factions, describing them as” and “Throughout the ages, numerous Defense Against the Dark Arts educators graced Hogwarts, each bearing their unique history. Pupils frequently reminisced about”. The full list can be found in the supplementary material.

To ensure a comprehensive evaluation, we compiled a list of 300 such prompts with the aid of GPT-4 (included in the supplementary material). We also used GPT-4 to analyze the completions during parameter search; however, due to its apparent inaccuracy at this task, we manually inspected the completions for our final training for the sake of additional scrutiny.
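The completion-based procedure can be sketched as a loop that feeds each prompt to the model and collects the completions for review. The sketch below is only a rough illustration: it swaps in a simple keyword match to pre-sort completions, which is not the GPT-4 analysis or the manual inspection described above. The marker-term list, the `generate` callable, and the stub model are all hypothetical.

```python
import re
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical marker terms chosen for illustration; the paper does not specify such a list.
MARKER_TERMS = ("hogwarts", "hermione", "ron", "dumbledore", "gryffindor", "quidditch")


def find_markers(text: str, terms: Sequence[str] = MARKER_TERMS) -> List[str]:
    """Return the marker terms that occur in `text` as whole words (case-insensitive)."""
    lowered = text.lower()
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def triage(prompts: Iterable[str],
           generate: Callable[[str], str]) -> List[Tuple[str, str, List[str]]]:
    """Run each prompt through `generate` and keep completions containing marker terms."""
    flagged = []
    for prompt in prompts:
        completion = generate(prompt)
        hits = find_markers(completion)
        if hits:
            flagged.append((prompt, completion, hits))
    return flagged


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text for demonstration."""
    if prompt.startswith("When Harry"):
        return " Ron and Hermione, already seated."
    return " a quiet afternoon in the library."


example_prompts = [
    "When Harry returned to class, he observed his best friends,",
    "Draft a brief narrative in the style of Harry Potter. Short story:",
]
for prompt, completion, hits in triage(example_prompts, stub_generate):
    print(hits, "<-", prompt)
```

A keyword filter of this kind can only prioritize completions for a human reader. It misses paraphrased or indirect leaks, so it would not replace manual inspection.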

3.2.2 Token-probability-based evaluation

A complementary evaluation approach inspects completion probabilities for select prompts. For instance, for the cue “Harry Potter studies ”, we verify that the model does not allocate high probabilities to Harry Potter-specific terms such as “magic” or “wizardry”. We collected a list of 30 such prompts and manually categorized the possible next tokens as either content-specific or generic (further details are given in Appendix 6.2).
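One way to turn this into a number, sketched here under stated assumptions, is to sum the next-token probability mass that falls on the tokens labeled content-specific. The text above does not say how the probabilities are aggregated, so the summation, the function names, and the toy logits are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert per-token logits into probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def content_specific_mass(logits: Dict[str, float],
                          specific_tokens: Iterable[str]) -> float:
    """Total next-token probability assigned to tokens labeled content-specific."""
    probs = softmax(logits)
    return sum(probs.get(tok, 0.0) for tok in specific_tokens)


# Toy next-token logits for the cue "Harry Potter studies "; made-up values.
# A real model would supply logits over its full vocabulary, not four candidates.
toy_logits = {"magic": 2.0, "wizardry": 1.0, "law": 0.5, "medicine": 0.0}
specific = {"magic", "wizardry"}  # manual labels, as in the procedure described above

print(round(content_specific_mass(toy_logits, specific), 3))
```

Comparing this quantity between the baseline and the unlearned model on the same prompt gives a simple indication of whether probability has shifted away from content-specific continuations.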

3.3 Open Evaluation

Recognizing the intrinsic limitations of automated benchmarks and internal evaluations, we believe that verifying unlearning is adversarial in nature, much like jailbreaking. Therefore, we open-sourced the model, encouraging the broader community to challenge it and to provide a more diverse and extensive set of tests for discerning whether any remnants of the targeted knowledge persist.

This paper is available on arXiv under a CC 4.0 license.