LLMao · Sarcasm Transfer
Methodology

Model Training

Every model in the dashboard traces back to one of four training recipes. This page documents the exact hyperparameters, data splits, and loss formulations we used — read it alongside the data pipeline and the evaluation methodology.

Note: the T5 family was trained on a separate stratified split with its own pipeline (Camille's CS4248-project-AY2526S2 repo). Same learning rate as the BART pipeline, but a different epoch count, batch shape, max length, and best-metric criterion.

Four training recipes

01a · BART SFT (Yang Zhi) · 2 models
BART-Base and BART-CE trained with cross-entropy on the sar-to-non splits. 5 epochs, early stopping on val BLEU, HuggingFace Seq2SeqTrainer.

01b · T5 SFT (Camille) · 9 models
T5-Joint, T5-Control, and the 6 ablations — 4 epochs, effective batch 16, max length 1248, fp16, eval_loss as best metric, SLURM orchestration.

02 · Reinforcement Learning (REINFORCE + KL) · 2 models
Takes an SFT BART checkpoint as policy, uses a sarcasm classifier + ROUGE-L as reward, KL penalty against the frozen reference.

03 · LoRA Instruction Tuning · 2 models
LLaMA 3.2 1B fine-tuned via low-rank adapters on 7 projection layers. Loss masked to the assistant response only.
Section 01a

BART Supervised Fine-Tuning

Yang Zhi's BART pipeline — cross-entropy training with the HuggingFace Seq2SeqTrainer, early stopping on validation BLEU (patience 2), and a cosine learning-rate schedule with warmup. BART doesn't need a task prefix: it's pretrained with its own denoising objective, so the input is the raw sarcastic headline.

BART-Base

facebook/bart-base (140M)

Plain supervised fine-tuning on the 10,868 headline→rewrite pairs from the main sar-to-non split. Input is the raw sarcastic headline with no prefix — BART is pretrained with its own denoising objective and doesn't expect a task token.

Train size: 10,868 pairs
Val size: 1,356 pairs
Epochs: 5 (early stop, patience 2)
Batch size: 16
Learning rate: 3e-4
Max length: 128 tokens
Warmup steps: 500
Weight decay: 0.01
Best metric: BLEU on val
Precision: bf16 (if CUDA)
scripts/train.py
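
For reference, here is a minimal sketch of what this recipe looks like wired into the HuggingFace Seq2SeqTrainer. It mirrors the table above but is not the actual scripts/train.py; the output path, the train_ds/val_ds dataset variables, and the metric wiring are assumptions.

```python
# Hedged sketch of the BART-Base recipe (not the actual scripts/train.py).
# train_ds / val_ds stand in for the tokenized sar-to-non splits.
import numpy as np
import evaluate
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    # Decode generated ids and gold labels, then score corpus BLEU on the val set.
    preds, labels = eval_pred
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    ref_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {"bleu": bleu.compute(predictions=pred_str,
                                 references=[[r] for r in ref_str])["score"]}

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/bart-base-sft",    # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=3e-4,
    warmup_steps=500,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",              # early stopping watches val BLEU
    greater_is_better=True,
    bf16=True,                                 # only when CUDA supports bf16
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```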

BART-CE

Context-enhanced training data

facebook/bart-base (140M)

Same BART architecture and hyperparameters as BART-Base, but trained on the context-enhanced data split: each training pair is conditioned on the scraped article body in addition to the headline. Smaller split (8,258 pairs) because not every headline has a scrape-able article body.

Train size: 8,258 pairs (with body)
Val size: 1,029 pairs
Data split: sar_to_non_context_enhanced
Epochs: 5 (early stop, patience 2)
Batch size: 16
Learning rate: 3e-4
Max length: 128 tokens
Best metric: BLEU on val
scripts/train.py
Section 01b

T5 Supervised Fine-Tuning

Camille's T5 pipeline runs in a separate repo with its own stratified split, SLURM orchestration, and a longer max-sequence length to fit the combined input+target with the strategy token. Same HuggingFace Seq2SeqTrainer as the BART side, but different hyperparameters — the table below is the shared recipe; each model card then notes where it differs.

Shared T5 recipe (joint · control · ablations)

Data prep: prepare_t5_datasets.py (80/10/10 stratified)
Trainer: Seq2SeqTrainer, predict_with_generate=True
Epochs: 4 (no early stopping)
Per-device batch: 8
Grad accum: 2 (effective 16)
Learning rate: 3e-4
Scheduler: cosine, warmup_ratio 0.06
Weight decay: 0.01
Max source/target: 1248 tokens
Best metric: eval_loss
Precision: fp16
Seed: 42
Compute: 1× NVIDIA GPU, 32 GB memory, SLURM gpu-long partition, 5 h limit
Orchestration: slurm_finetune_t5.sh → .py
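
A hedged sketch of the 80/10/10 stratified split in the spirit of prepare_t5_datasets.py; the input path and the "strategy" column name are assumptions, not the actual script.

```python
# Hedged sketch of an 80/10/10 split stratified on the sarcasm strategy label.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_json("data/pairs.jsonl", lines=True)   # placeholder path

# 80% train first, then split the remaining 20% evenly into val and test,
# stratifying on the strategy label at both steps (seed 42, as in the recipe).
train_df, rest_df = train_test_split(
    pairs, test_size=0.20, stratify=pairs["strategy"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["strategy"], random_state=42)
```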

T5-Joint

Best model by human eval

google-t5/t5-base (220M)

T5-base trained to do two things at once on every example: classify the sarcasm strategy AND rewrite the headline. The input is prefixed with 'rewrite to non-sarcastic and predict strategy: ' and the target is 'strategy: <type> rewrite: <headline>'. Forcing the model to emit the strategy token before the rewrite makes it decompose the task (identify what's sarcastic first, then remove it) and is why T5-Joint wins human eval on meaning preservation — 16.4% meaning change vs T5-Control's 25%.

Input prefix"rewrite to non-sarcastic and predict strategy: "
Target format"strategy: {strategy} rewrite: {rewrite}"
Data splitdata/joint_and_ablate_prepared/joint (80/10/10 stratified)
Epochs4
Per-device batch8
Grad accum2 (effective batch 16)
Learning rate3e-4
SchedulerCosine, 6% warmup
Weight decay0.01
Max source/target len1248 tokens
Best metriceval_loss (predict_with_generate=True)
Precisionfp16
Compute1× NV GPU, 32G mem, SLURM gpu-long
camille-readbean/scripts/finetune_T5.py
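
A small sketch of the input/target formatting described above; the helper name and the example headline are invented for illustration.

```python
# Joint-objective formatting: one input string, one target string per pair.
def to_joint_example(sarcastic: str, strategy: str, rewrite: str) -> dict:
    return {
        "input": "rewrite to non-sarcastic and predict strategy: " + sarcastic,
        "target": f"strategy: {strategy} rewrite: {rewrite}",
    }

example = to_joint_example(
    sarcastic="Area man heroically remembers own anniversary",   # invented example
    strategy="irony",
    rewrite="Man remembers his wedding anniversary",
)
# example["target"] == "strategy: irony rewrite: Man remembers his wedding anniversary"
```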

T5-Control

google-t5/t5-base (220M)

Same T5 recipe and split as T5-Joint but the strategy token is stripped from both input and output — the model only sees 'rewrite to non-sarcastic: {sarcastic}' and outputs the plain rewrite. This isolates the contribution of the strategy-prefix trick: any difference on meaning preservation between T5-Joint and T5-Control is attributable to the joint objective, not data or backbone.

Input prefix"rewrite to non-sarcastic: "
Target formatPlain rewrite (no strategy token)
Data splitdata/joint_and_ablate_prepared/control (same as joint)
Epochs4
Per-device batch8
Grad accum2 (effective batch 16)
Learning rate3e-4
SchedulerCosine, 6% warmup
Max source/target len1248 tokens
Precisionfp16
camille-readbean/scripts/slurm_finetune_t5_control.sh

T5-Joint (small)

google-t5/t5-small (60M)

An earlier T5-small variant of the joint model, trained with the default model flag in Camille's SLURM script. Same training recipe as T5-Joint but a smaller backbone — kept in the evaluation to show that the joint model's edge comes from the strategy prefix, not just capacity. Still listed in the dashboard under its original name, 'joint'.

Input prefix"rewrite to non-sarcastic and predict strategy: "
Target format"strategy: {strategy} rewrite: {rewrite}"
Backbonet5-small (60M params)
Everything elseIdentical to T5-Joint
camille-readbean/scripts/finetune_T5.py
Section 02

Reinforcement Learning

REINFORCE with a KL penalty against the frozen SFT reference — the recipe ViSP (arXiv:2507.09482) used for sarcasm generation, inverted here for sarcasm removal. A pure style reward saturates almost immediately because the SFT outputs already score ~1.0 on the style term, so we blend in a ROUGE-L content-preservation term to keep the policy from collapsing to "delete everything".

Loss formulation

Reward: r = α · (1 − P_sarcastic(output)) + (1 − α) · ROUGE-L(output, ref), with α = 0.5
REINFORCE loss: L_RL = −(r − baseline) · Σ_t log π_θ(y_t | x)
Total loss: L = L_RL + β · KL(π_θ ‖ π_ref), with β = 0.2
Baseline: exponential moving average of the batch reward (decay 0.9)
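
A per-batch sketch of this formulation in PyTorch. The function signature, the way token log-probs are gathered, and the sequence-level KL estimate are assumptions; the actual scripts/train_rl.py may differ.

```python
# Hedged sketch of one REINFORCE + KL update, mirroring the formulas above.
import torch

ALPHA, BETA, EMA_DECAY = 0.5, 0.2, 0.9
baseline = 0.0   # exponential moving average of the batch reward

def rl_loss(policy_logprobs, ref_logprobs, p_sarcastic, rouge_l):
    """policy_logprobs / ref_logprobs: (batch, seq_len) token log-probs of the
    sampled outputs under the policy and the frozen SFT reference.
    p_sarcastic, rouge_l: (batch,) reward components for those same outputs."""
    global baseline

    # r = alpha * (1 - P_sarcastic) + (1 - alpha) * ROUGE-L
    reward = ALPHA * (1.0 - p_sarcastic) + (1.0 - ALPHA) * rouge_l

    # REINFORCE: advantage-weighted negative log-likelihood of the sampled sequence.
    seq_logprob = policy_logprobs.sum(dim=-1)
    advantage = (reward - baseline).detach()
    loss_rl = -(advantage * seq_logprob).mean()

    # KL(pi_theta || pi_ref), estimated on the sampled tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1).mean()
    loss = loss_rl + BETA * kl

    # Update the EMA baseline with the mean batch reward (decay 0.9).
    baseline = EMA_DECAY * baseline + (1.0 - EMA_DECAY) * reward.mean().item()
    return loss
```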

BART-RL

BART-Base SFT checkpoint

Takes the BART-Base SFT checkpoint as both policy and frozen reference. Generates outputs via sampling, scores them with a sarcasm classifier, and updates the policy with REINFORCE + a KL penalty to stop it drifting. The reward is a weighted sum of style (classifier) and content preservation (ROUGE-L against the reference).

Policy init: BART-Base SFT checkpoint
Reference: same checkpoint, frozen
Reward (style): 1 − P(sarcastic)
Reward (content): ROUGE-L vs reference
Reward blend: α·style + (1−α)·content, α = 0.5
KL coeff: 0.2
Learning rate: 1e-5 (kept low to limit policy drift)
Epochs: 3
Gradient clip: 1.0
Baseline: EMA (0.9 decay)
scripts/train_rl.py

BART-CE+RL

BART-CE SFT checkpoint

Same RL recipe as BART-RL, but starts from the context-enhanced SFT checkpoint instead of the plain one. Intended to stack the benefits of article-aware supervision with the reward-driven polish.

Policy init: BART-CE SFT checkpoint
Reward formula: same as BART-RL
KL coeff: 0.2
Learning rate: 1e-5
Epochs: 3
scripts/train_rl.py

Known failure mode: reward hacking

Human eval shows BART-RL has a 40.7% meaning-change rate — more than double T5-Joint's 16.4%. The model learned that deleting sarcastic tokens reduces P(sarcastic) while still preserving enough ROUGE-L overlap to satisfy the content reward. Deletion optimizes the composite reward without actually rewriting. We document this in detail on the eval page.

Section 03

LoRA Instruction Tuning

LLaMA 3.2 1B is a decoder-only chat model, so the recipe looks nothing like the seq2seq pipeline. We use PEFT LoRA adapters on every attention + MLP projection, keeping the base weights frozen, and mask the loss to the assistant response. The system prompt and user message are encoded with the Llama 3 chat template so the model stays aligned with its instruction-tuned prior.

System prompt

You are a writing assistant. Rewrite sarcastic news headlines as neutral, factual equivalents that preserve the core meaning without irony or mockery. Respond with only the rewritten headline, no explanation.

LLaMA 3.2 1B

meta-llama/Llama-3.2-1B-Instruct

Instruction-tuned LLaMA fine-tuned via LoRA adapters — only ~6M of the 1.24B parameters are trained. Uses the chat template with a system prompt, and the loss is masked to the assistant response only (label IDs for the prompt tokens are set to −100). The trained adapters are then merged back into the base model and the result exported to GGUF so LMStudio can serve it on consumer hardware.

LoRA rank: r = 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: q, k, v, o, gate, up, down projections
Trainable params: ~6M of 1.24B (0.5%)
Learning rate: 2e-4
Batch × grad accum: 8 × 2 = 16 effective
Epochs: 3
Max length: 256 tokens
Scheduler: cosine, 5% warmup
Precision: bf16 + gradient checkpointing
scripts/train_llama.py
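
A hedged sketch of the LoRA setup and the response-only loss masking. It mirrors the table and prose above, but the model-loading details, the exact chat-template call, and the build_example helper are assumptions rather than the actual scripts/train_llama.py.

```python
# Hedged sketch: PEFT LoRA on the 7 projection layers + response-only labels.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)   # base weights stay frozen, ~6M params train

SYSTEM = ("You are a writing assistant. Rewrite sarcastic news headlines as neutral, "
          "factual equivalents that preserve the core meaning without irony or mockery. "
          "Respond with only the rewritten headline, no explanation.")

def build_example(headline: str, rewrite: str) -> dict:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": headline}]
    # Llama 3 chat template up to and including the assistant header.
    prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    answer_ids = tokenizer(rewrite + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + answer_ids
    # Mask the prompt so the cross-entropy loss covers only the assistant response.
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```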

LLaMA 3.2 1B (context)

Scraped article body in the prompt

meta-llama/Llama-3.2-1B-Instruct

Same LoRA recipe as the base LLaMA variant, but the user message includes the scraped article body alongside the headline. This tests whether the model can use the article to ground its rewrite in the real event. Bumping max_length to 1024 to fit the article forces a smaller batch size.

LoRA config: same as base (r=16, α=32)
Learning rate: 2e-4
Batch × grad accum: 4 × 4 = 16 effective
Epochs: 3
Max length: 1024 tokens (to fit article body)
Article cache: data/processed/intermediate/article_scrape_cache.jsonl
Prompt format: headline + 'Article context:' + body
scripts/train_llama_context.py
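
A tiny sketch of the context-augmented user message, following the headline + 'Article context:' + body format noted above; the exact wording and truncation handling are assumptions.

```python
# Assumed user-message layout for the context variant; real formatting may differ.
def build_context_user_message(headline: str, article_body: str) -> str:
    # The article body is appended after the headline; anything past the
    # 1024-token budget gets truncated downstream by the tokenizer.
    return f"{headline}\n\nArticle context: {article_body}"
```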
Section 04

Ablation Study

Six retrainings of the T5 control recipe (not joint — the ablations use the plain 'rewrite to non-sarcastic: ' prefix and emit the plain rewrite), each with one sarcasm subtype dropped from the training data. To keep the effective dataset size constant across the six variants, every ablation pool is stratified-sampled down to the minimum size across all drops. The finding on the dashboard: the six ablations cluster within 0.005 of each other on similarity — the model learns generic sarcasm patterns that transfer across subtypes, so no single one is load-bearing.

Ablation-specific recipe

Base: t5-small (default in slurm script)
Input prefix: "rewrite to non-sarcastic: "
Target format: plain rewrite (no strategy token)
Train pool: stratified downsample to min-across-drops
Val pool: stratified downsample, same rule
Test set: full held-out split (shared across all six)
Everything else: same as T5 shared recipe above
Held out: sarcasm → ablation_without_sarcasm
Held out: irony → ablation_without_irony
Held out: satire → ablation_without_satire
Held out: overstatement → ablation_without_overstatement
Held out: understatement → ablation_without_understatement
Held out: rhetorical question → ablation_without_rhetorical_question
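
For concreteness, a hedged sketch of the "downsample to the minimum across all drops" rule using pandas; the "strategy" column name and the groupby-based sampler are assumptions, not the actual preparation script.

```python
# Hedged sketch: build six equal-size ablation pools, one per held-out subtype.
import pandas as pd

SUBTYPES = ["sarcasm", "irony", "satire", "overstatement",
            "understatement", "rhetorical question"]

def ablation_pools(train_df: pd.DataFrame) -> dict:
    # Size of each candidate pool once one subtype is removed.
    sizes = {s: len(train_df[train_df["strategy"] != s]) for s in SUBTYPES}
    target = min(sizes.values())   # the min-across-drops budget

    pools = {}
    for s in SUBTYPES:
        pool = train_df[train_df["strategy"] != s]
        frac = target / len(pool)
        # Stratified downsample over the remaining subtypes to the shared budget.
        pools[s] = (pool.groupby("strategy", group_keys=False)
                        .apply(lambda g: g.sample(frac=frac, random_state=42)))
    return pools
```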