Questions & Answers
This page collects the questions and comments raised during the CS4248 interim review by our instructor and project mentor, along with the team's responses.
Additional questions from the poster roadshow and final presentation will be added here as they come in.
Methodology & framing
Responses to interim feedback on how the task is set up and how we justify the two-stage pipeline.
Why use larger LLMs for preprocessing but smaller models (T5, BART, LLaMA) as the actual task target? Isn't that backwards?
The goal of LLMao is to ship a model that is small and cheap at inference time; that is the "Lightweight Language Models" in the project name. The two stages have very different cost profiles:
- Preprocessing is a one-time offline cost. We call the teacher models once to build the training corpus (89,688 strategy-annotated pairs from 28,619 NHDSD headlines). After that, the teachers are never called again.
- The target model is used forever at inference. T5-Joint (220M) and LLaMA 3.2 1B run on commodity hardware (a phone, a laptop, a local LM Studio instance) without an API key.
Concretely, the preprocessing pipeline uses two OpenRouter models on the free tier. Step 3.5 Flash (stepfun/step-3.5-flash:free) is the primary teacher: it generates the rewrite pairs and the six strategy variants per source headline that together make up the 89,688-record training pool. Nemotron Nano 30B (nvidia/nemotron-3-nano-30b-a3b:free) serves separately as an independent binary sarcasm classifier, re-checking NHDSD source-headline labels wherever the original NHDSD label and Step 3.5 Flash disagreed.
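For reference, each teacher call is a plain HTTP request to OpenRouter's OpenAI-compatible chat-completions endpoint. The sketch below shows the shape of one call per teacher; the prompt wording, the strategy name, and the `call_teacher` helper are illustrative assumptions, not the pipeline's exact prompts.

```python
# Minimal sketch of one teacher call against OpenRouter's OpenAI-compatible
# chat-completions endpoint. Prompts here are illustrative placeholders.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

def call_teacher(model: str, prompt: str) -> str:
    """Send a single-turn prompt to an OpenRouter model and return the reply text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Teacher 1: generate one strategy-specific rewrite (hypothetical prompt).
rewrite = call_teacher(
    "stepfun/step-3.5-flash:free",
    "Rewrite this sarcastic headline as a literal one, using the "
    "'tone neutralisation' strategy: <headline>",
)

# Teacher 2: independent binary sarcasm check on a source headline.
verdict = call_teacher(
    "nvidia/nemotron-3-nano-30b-a3b:free",
    "Answer YES or NO only: is this headline sarcastic? <headline>",
)
```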
This is the standard strong-teacher → small-student distillation pattern that Stanford Alpaca, Vicuna, and Orca popularised. The teachers act as a label factory for a supervised training set that would otherwise need thousands of human-hours to produce; the small student then carries the task at inference time.
T5 is a simpler model and may not respond well to structured prompts — are you sure it's the right baseline?
A fair concern going in, but the results flipped it: T5-Joint is the best model in our lineup, beating BART-Base, BART-CE, all BART-RL variants, and both LLaMA LoRA variants on human evaluation (43.6% strict success vs the runner-up at 32.1%).
Two factors explain this. First, joint-task training (predict strategy + rewrite simultaneously from a single prompt) gave T5 an auxiliary signal that regularised the decoder. Second, the encoder-decoder architecture turned out to be a better fit for headline-length span rewriting than either the decoder-only LLaMA or the RL-tuned BART.
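A minimal sketch of the joint-task idea follows, assuming a tagged target serialisation; the `strategy:` / `rewrite:` tag names and the example pair are our illustration, not the project's exact format.

```python
# Sketch of joint-task formatting for T5 fine-tuning: one input sequence,
# one target sequence carrying both the strategy label and the rewrite.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # 220M parameters
model = T5ForConditionalGeneration.from_pretrained("t5-base")

source = "remove sarcasm: oh great, another monday"
target = "strategy: tone neutralisation | rewrite: another monday begins"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Teacher-forced loss over the combined target: the strategy tokens act as
# an auxiliary signal that regularises the decoder alongside the rewrite.
loss = model(**inputs, labels=labels).loss
loss.backward()
```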
The structured-prompt concern was real for BART-CE, which uses a longer context-enhanced prompt format and ended up at 28.6% strict success — worse than the simpler BART-Base at 30.7%.
Evaluation
Clarifying what counts as a successful rewrite.
How do you define a better rewrite? What does "better" actually mean?
A rewrite is "better" if it satisfies both conditions simultaneously:
- Sarcasm removed — the output is no longer read as sarcastic by a human annotator.
- Meaning preserved — the underlying claim or event is the same as the input, not paraphrased into a different statement.
Both criteria are judged by two independent human annotators on a 140-sample gold set, with Cohen's κ ∈ [0.839, 0.884] across the three evaluated models. The headline metric we report is the strict success rate: the fraction of samples where both annotators agreed the rewrite both flipped the sarcasm label and preserved the meaning.
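For concreteness, here is a minimal sketch of the strict-success computation on toy judgments, assuming each annotator records the two binary criteria per sample; the variable names and toy data are illustrative, not the real annotation schema.

```python
from sklearn.metrics import cohen_kappa_score

# Toy judgments: (sarcasm_removed, meaning_preserved) per sample, per annotator.
annotator1 = [(True, True), (True, False), (False, True), (True, True)]
annotator2 = [(True, True), (True, True), (False, True), (True, True)]

def strict_success_rate(a1, a2):
    """Fraction of samples where BOTH annotators mark BOTH criteria true."""
    hits = sum(r1 and m1 and r2 and m2 for (r1, m1), (r2, m2) in zip(a1, a2))
    return hits / len(a1)

print(strict_success_rate(annotator1, annotator2))  # 0.5 on the toy data

# Inter-annotator agreement (Cohen's kappa) on the per-sample strict verdict.
v1 = [r and m for r, m in annotator1]
v2 = [r and m for r, m in annotator2]
print(cohen_kappa_score(v1, v2))
```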
We also report seven automatic metrics (three classifier flip rates, semantic similarity, perplexity, BLEU vs input, edit distance, and an LLM-as-judge score), but we treat the human numbers as ground truth and the automatic metrics as diagnostic signals — see the classifier-vs-human disagreement story below.
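To illustrate the diagnostic role, two of the cheaper automatic metrics can be sketched with standard libraries; the example strings are toy data. High BLEU against the input or a tiny edit distance flags a rewrite that barely changed anything, which is a sanity check rather than a quality score.

```python
# Sketch of two surface-level diagnostics: BLEU vs the *input* and
# token-level edit distance between input and rewrite.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

source = "oh great, another monday".split()
rewrite = "another monday begins".split()

smooth = SmoothingFunction().method1
bleu_vs_input = sentence_bleu([source], rewrite, smoothing_function=smooth)
edit_dist = nltk.edit_distance(source, rewrite)

print(f"BLEU vs input: {bleu_vs_input:.3f}, edit distance: {edit_dist}")
```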
CS4248 / AY2025/26 S2 / Team 14 / NUS