LLMao · Sarcasm Transfer
Appendix · Academic Integrity

Declaration on the Use of AI

This page is the team's acknowledgement of generative AI tools used in Project LLMao, filed in the spirit of Section 4.3 of the NUS Policy for Use of AI in Teaching and Learning (7 Aug 2024).

The project itself is a study of small language models. Generative AI is both an object of study (the 14 fine-tuned models) and a tool we used during data preparation, evaluation, and engineering. We separate the two below.

AI tools used, and how

Step 3.5 Flash (stepfun/step-3.5-flash:free, via OpenRouter)
Primary teacher for the training corpus. Generated the non-sarcastic → sarcastic rewrite pairs from NHDSD source headlines, and then in a second pass produced the five missing strategy variants per source, completing the six-strategy set (sarcasm, irony, satire, understatement, overstatement, rhetorical_question).
The two Step 3.5 Flash passes together produce the 89,688-record training pool (14,948 sources × 6 strategies). These pairs were used as-is for training; no downstream LLM filter was applied to them. A sketch of one generation pass follows.
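
As a rough illustration of the second pass, the sketch below issues one rewrite request per (source, strategy) pair through OpenRouter's OpenAI-compatible endpoint. The prompt wording, the generate_variant helper, the placeholder API key, and the toy sources list are assumptions for illustration; only the model identifier and the six strategy names come from the pipeline described above.

```python
# Illustrative sketch of one generation pass over OpenRouter's
# OpenAI-compatible API. Prompt text, helper names, and the example
# `sources` list are hypothetical; the model id and the six strategy
# names are taken from the pipeline described above.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",  # placeholder
)

STRATEGIES = [
    "sarcasm", "irony", "satire",
    "understatement", "overstatement", "rhetorical_question",
]

def generate_variant(headline: str, strategy: str) -> str:
    """Request one strategy-specific sarcastic rewrite from the teacher."""
    resp = client.chat.completions.create(
        model="stepfun/step-3.5-flash:free",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this headline as {strategy}, keeping the original "
                f"meaning recoverable:\n{headline}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

# The real pool iterates over all 14,948 NHDSD sources:
# 14,948 sources x 6 strategies = 89,688 pairs.
sources = ["city council approves new bike lanes"]  # toy example
pairs = [
    {"source": h, "strategy": s, "rewrite": generate_variant(h, s)}
    for h in sources
    for s in STRATEGIES
]
```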
Nemotron Nano 30B (nvidia/nemotron-3-nano-30b-a3b:free, via OpenRouter)
Independent binary sarcasm classifier used to cross-validate source-headline labels in NHDSD where the original NHDSD label and Step 3.5 Flash disagreed.
Ran only on the disagreement subset, as a tiebreaker to estimate the NHDSD mislabel rate (sketched below). This is a QA pass on source-headline labels upstream of pair generation; it does not filter or re-annotate the 89,688 training pairs themselves.
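
The logic of that tiebreaker pass looks roughly like the following. The classify callable stands in for a binary call to nvidia/nemotron-3-nano-30b-a3b:free; the record keys and the exact definition of the rate are assumptions, not the project's actual code.

```python
# Hypothetical sketch of the tiebreaker on the disagreement subset.
from typing import Callable

def estimate_mislabel_rate(
    records: list[dict],
    classify: Callable[[str], bool],  # independent binary sarcasm classifier
) -> float:
    # Only headlines where NHDSD and the teacher model disagree are re-checked.
    disagreements = [
        r for r in records if r["nhdsd_label"] != r["teacher_label"]
    ]
    # NHDSD is counted as mislabelled when the tiebreaker also contradicts it.
    overturned = sum(
        1 for r in disagreements
        if classify(r["headline"]) != r["nhdsd_label"]
    )
    return overturned / len(records)  # mislabels as a fraction of all sources
```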
Google Gemini 2.5 Flash (via API)
Acted as one of the seven evaluation signals — the LLM-as-judge score reported in the dashboard — rating whether each model output preserves meaning while removing sarcasm.
Used as an automated metric alongside six non-LLM metrics. Human evaluation (140 samples × 3 models × 2 annotators, κ > 0.8) is the primary ground truth, not the Gemini score. See /eval.
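
A minimal sketch of what the judge call could look like, using the google-genai SDK, is shown below. The rubric wording and the 1-5 scale are assumptions for illustration, not the exact prompt behind the dashboard score.

```python
# Illustrative sketch of the LLM-as-judge signal via the google-genai SDK.
from google import genai

client = genai.Client(api_key="<GEMINI_API_KEY>")  # placeholder

def judge_score(sarcastic: str, rewrite: str) -> int:
    """Ask Gemini to rate meaning preservation after sarcasm removal."""
    prompt = (
        "Rate 1-5 how well the rewrite preserves the meaning of the "
        "sarcastic headline while removing all sarcasm. Reply with the "
        "number only.\n"
        f"Sarcastic: {sarcastic}\nRewrite: {rewrite}"
    )
    resp = client.models.generate_content(
        model="gemini-2.5-flash", contents=prompt
    )
    return int(resp.text.strip())
```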
Anthropic Claude (Claude Code, Opus / Sonnet)
Pair-programming assistant for implementing the webapp (Next.js frontend, FastAPI backend), refactoring training scripts, and drafting documentation.
All generated code was read, edited, run, and debugged by team members before being committed. Claude did not make architectural decisions autonomously — every recipe, metric, and experiment was specified by the team.
GitHub Copilot
Inline autocomplete during routine coding (loops, boilerplate, type signatures).
Suggestions were accepted or rejected line-by-line by the author.
ChatGPT + Codex
Generated training scripts and other supporting code; answered ML-engineering questions.
Output was reviewed manually before use. The AI did not suggest which experiments to run; team members set the research direction and specified the task and the expected end product (e.g. "I want a customisable SLURM training script").

What AI was not used for

  • Formulating the research question, hypotheses, or experimental design.
  • Selecting the 14 models, four training recipes, or the seven-metric evaluation pipeline.
  • Running training jobs or generating model outputs on the held-out test set.
  • Manually labelling the 140-sample gold human-evaluation set (done by two team members independently).
  • Drawing conclusions from results or deciding which findings to report.

Responsibility

Team 14 is solely responsible for the content of this report, the webapp, the code, the experimental results, and any errors therein. Every AI-assisted output — whether a generated training label, a code suggestion, or a proofreading pass — was reviewed by a team member before being integrated into the final submission.

We have not used AI tools to generate this declaration's substantive content about what the team did or did not do; those statements are authored by the team. Phrasing and formatting passes were AI-assisted and then edited by hand.

CS4248 / AY2025/26 S2 / Team 14 / NUS