LLMao · Sarcasm Transfer
Methodology

Model Training

Every model in the dashboard traces back to one of four training recipes. This page documents the exact hyperparameters, data splits, and loss formulations we used — read it alongside the data pipeline and the evaluation methodology.

Note: the T5 family was trained on a separate stratified split with its own pipeline (Camille's CS4248-project-AY2526S2 repo). Same learning rate as the BART pipeline, but a different epoch count, batch shape, max length, and best-metric criterion.

Four training recipes

01a · BART SFT (Yang Zhi) · 2 models
BART-Base and BART-CE trained with cross-entropy on the sar-to-non splits. 5 epochs, early stopping on val BLEU, HuggingFace Seq2SeqTrainer.

01b · T5 SFT (Camille) · 9 models
T5-Joint, T5-Control, and the 6 ablations — 4 epochs, effective batch 16, max length 1248, fp16, eval_loss as best metric, SLURM orchestration.

02 · Reinforcement Learning (REINFORCE + KL) · 2 models
Takes an SFT BART checkpoint as policy, uses a sarcasm classifier + ROUGE-L as reward, KL penalty against the frozen reference.

03 · LoRA Instruction Tuning · 2 models
LLaMA 3.2 1B fine-tuned via low-rank adapters on 7 projection layers. Loss masked to the assistant response only.
Section 01a

BART Supervised Fine-Tuning

Yang Zhi's BART pipeline — cross-entropy training with the HuggingFace Seq2SeqTrainer, early stopping on validation BLEU (patience 2), and a cosine learning-rate schedule with warmup. BART doesn't need a task prefix: it's pretrained with its own denoising objective, so the input is the raw sarcastic headline.

BART-Base

facebook/bart-base (140M)

Plain supervised fine-tuning on the 10,868 headline→rewrite pairs from the main sar-to-non split. Input is the raw sarcastic headline with no prefix — BART is pretrained with its own denoising objective and doesn't expect a task token.

Train size: 10,868 pairs
Val size: 1,356 pairs
Epochs: 5 (early stop, patience 2)
Batch size: 16
Learning rate: 3e-4
Max length: 128 tokens
Warmup steps: 500
Weight decay: 0.01
Best metric: BLEU on val
Precision: bf16 (if CUDA)
scripts/train.py
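
For reference, here is a minimal sketch of what this recipe looks like wired into the HuggingFace Seq2SeqTrainer. It mirrors the table above but is not the actual scripts/train.py; the output path, the train_ds/val_ds dataset variables, and the metric wiring are assumptions.

```python
# Hedged sketch of the BART-Base recipe (not the actual scripts/train.py).
# train_ds / val_ds stand in for the tokenized sar-to-non splits.
import numpy as np
import evaluate
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_pred):
    # Decode generated ids and gold labels, then score corpus BLEU on the val set.
    preds, labels = eval_pred
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    ref_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {"bleu": bleu.compute(predictions=pred_str,
                                 references=[[r] for r in ref_str])["score"]}

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/bart-base-sft",    # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=3e-4,
    warmup_steps=500,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",              # early stopping watches val BLEU
    greater_is_better=True,
    bf16=True,                                 # only when CUDA supports bf16
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```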

BART-CE

Context-enhanced training data

facebook/bart-base (140M)

Same BART architecture and hyperparameters as BART-Base, but trained on the context-enhanced data split: each training pair is conditioned on the scraped article body in addition to the headline. Smaller split (8,258 pairs) because not every headline has a scrape-able article body.

Train size: 8,258 pairs (with body)
Val size: 1,029 pairs
Data split: sar_to_non_context_enhanced
Epochs: 5 (early stop, patience 2)
Batch size: 16
Learning rate: 3e-4
Max length: 128 tokens
Best metric: BLEU on val
scripts/train.py
Section 01b

T5 Supervised Fine-Tuning

Camille's T5 pipeline runs in a separate repo with its own stratified split, SLURM orchestration, and a longer max-sequence length to fit the combined input+target with the strategy token. Same HuggingFace Seq2SeqTrainer as the BART side, but different hyperparameters — the table below is the shared recipe; each model card then notes where it differs.

Shared T5 recipe (joint · control · ablations)

Data prep: prepare_t5_datasets.py (80/10/10 stratified)
Trainer: Seq2SeqTrainer, predict_with_generate=True
Epochs: 4 (no early stopping)
Per-device batch: 8
Grad accum: 2 (effective 16)
Learning rate: 3e-4
Scheduler: cosine, warmup_ratio 0.06
Weight decay: 0.01
Max source/target: 1248 tokens
Best metric: eval_loss
Precision: fp16
Seed: 42
Compute: 1× NVIDIA GPU, 32 GB memory, SLURM gpu-long partition, 5 h limit
Orchestration: slurm_finetune_t5.sh → .py
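
A hedged sketch of the 80/10/10 stratified split in the spirit of prepare_t5_datasets.py; the input path and the "strategy" column name are assumptions, not the actual script.

```python
# Hedged sketch of an 80/10/10 split stratified on the sarcasm strategy label.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_json("data/pairs.jsonl", lines=True)   # placeholder path

# 80% train first, then split the remaining 20% evenly into val and test,
# stratifying on the strategy label at both steps (seed 42, as in the recipe).
train_df, rest_df = train_test_split(
    pairs, test_size=0.20, stratify=pairs["strategy"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["strategy"], random_state=42)
```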

T5-Joint

Best model by human eval

google-t5/t5-base (220M)

T5-base trained to do two things at once on every example: classify the sarcasm strategy AND rewrite the headline. The input is prefixed with 'rewrite to non-sarcastic and predict strategy: ' and the target is 'strategy: <type> rewrite: <headline>'. Forcing the model to emit the strategy token before the rewrite makes it decompose the task (identify what's sarcastic first, then remove it) and is why T5-Joint wins human eval on meaning preservation — 16.4% meaning change vs T5-Control's 25%.

Input prefix"rewrite to non-sarcastic and predict strategy: "
Target format"strategy: {strategy} rewrite: {rewrite}"
Data splitdata/joint_and_ablate_prepared/joint (80/10/10 stratified)
Epochs4
Per-device batch8
Grad accum2 (effective batch 16)
Learning rate3e-4
SchedulerCosine, 6% warmup
Weight decay0.01
Max source/target len1248 tokens
Best metriceval_loss (predict_with_generate=True)
Precisionfp16
Compute1× NV GPU, 32G mem, SLURM gpu-long
camille-readbean/scripts/finetune_T5.py
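
A small sketch of the input/target formatting described above; the helper name and the example headline are invented for illustration.

```python
# Joint-objective formatting: one input string, one target string per pair.
def to_joint_example(sarcastic: str, strategy: str, rewrite: str) -> dict:
    return {
        "input": "rewrite to non-sarcastic and predict strategy: " + sarcastic,
        "target": f"strategy: {strategy} rewrite: {rewrite}",
    }

example = to_joint_example(
    sarcastic="Area man heroically remembers own anniversary",   # invented example
    strategy="irony",
    rewrite="Man remembers his wedding anniversary",
)
# example["target"] == "strategy: irony rewrite: Man remembers his wedding anniversary"
```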

T5-Control

google-t5/t5-base (220M)

Same T5 recipe and split as T5-Joint but the strategy token is stripped from both input and output — the model only sees 'rewrite to non-sarcastic: {sarcastic}' and outputs the plain rewrite. This isolates the contribution of the strategy-prefix trick: any difference on meaning preservation between T5-Joint and T5-Control is attributable to the joint objective, not data or backbone.

Input prefix"rewrite to non-sarcastic: "
Target formatPlain rewrite (no strategy token)
Data splitdata/joint_and_ablate_prepared/control (same as joint)
Epochs4
Per-device batch8
Grad accum2 (effective batch 16)
Learning rate3e-4
SchedulerCosine, 6% warmup
Max source/target len1248 tokens
Precisionfp16
camille-readbean/scripts/slurm_finetune_t5_control.sh

T5-Joint (small)

google-t5/t5-small (60M)

An earlier T5-small variant of the joint model, trained with the default model flag in Camille's SLURM script. Same training recipe as T5-Joint but a smaller backbone — kept in the evaluation to show that the joint model's edge comes from the strategy prefix, not just capacity. Still listed in the dashboard under its original name, 'joint'.

Input prefix"rewrite to non-sarcastic and predict strategy: "
Target format"strategy: {strategy} rewrite: {rewrite}"
Backbonet5-small (60M params)
Everything elseIdentical to T5-Joint
camille-readbean/scripts/finetune_T5.py
Section 02

Reinforcement Learning

REINFORCE with a KL penalty against the frozen SFT reference — the recipe ViSP (arXiv:2507.09482) used for sarcasm generation, inverted here for sarcasm removal. A pure style reward saturates almost immediately because the SFT outputs already score ~1.0 on the style term, so we blend in a ROUGE-L content-preservation term to keep the policy from collapsing to "delete everything".

Loss formulation

Reward: r = α · (1 − P_sarcastic(output)) + (1 − α) · ROUGE-L(output, ref), with α = 0.5
REINFORCE loss: L_RL = −(r − baseline) · Σ_t log π_θ(y_t | x)
Total loss: L = L_RL + β · KL(π_θ ‖ π_ref), with β = 0.2
Baseline: exponential moving average of the batch reward (decay 0.9)
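
A per-batch sketch of this formulation in PyTorch. The function signature, the way token log-probs are gathered, and the sequence-level KL estimate are assumptions; the actual scripts/train_rl.py may differ.

```python
# Hedged sketch of one REINFORCE + KL update, mirroring the formulas above.
import torch

ALPHA, BETA, EMA_DECAY = 0.5, 0.2, 0.9
baseline = 0.0   # exponential moving average of the batch reward

def rl_loss(policy_logprobs, ref_logprobs, p_sarcastic, rouge_l):
    """policy_logprobs / ref_logprobs: (batch, seq_len) token log-probs of the
    sampled outputs under the policy and the frozen SFT reference.
    p_sarcastic, rouge_l: (batch,) reward components for those same outputs."""
    global baseline

    # r = alpha * (1 - P_sarcastic) + (1 - alpha) * ROUGE-L
    reward = ALPHA * (1.0 - p_sarcastic) + (1.0 - ALPHA) * rouge_l

    # REINFORCE: advantage-weighted negative log-likelihood of the sampled sequence.
    seq_logprob = policy_logprobs.sum(dim=-1)
    advantage = (reward - baseline).detach()
    loss_rl = -(advantage * seq_logprob).mean()

    # KL(pi_theta || pi_ref), estimated on the sampled tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1).mean()
    loss = loss_rl + BETA * kl

    # Update the EMA baseline with the mean batch reward (decay 0.9).
    baseline = EMA_DECAY * baseline + (1.0 - EMA_DECAY) * reward.mean().item()
    return loss
```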

BART-RL

BART-Base SFT checkpoint

Takes the BART-Base SFT checkpoint as both policy and frozen reference. Generates outputs via sampling, scores them with a sarcasm classifier, and updates the policy with REINFORCE + a KL penalty to stop it drifting. The reward is a weighted sum of style (classifier) and content preservation (ROUGE-L against the reference).

Policy init: BART-Base SFT checkpoint
Reference: same checkpoint, frozen
Reward (style): 1 − P(sarcastic)
Reward (content): ROUGE-L vs reference
Reward blend: α·style + (1−α)·content, α = 0.5
KL coeff: 0.2
Learning rate: 1e-5 (kept low to limit policy drift)
Epochs: 3
Gradient clip: 1.0
Baseline: EMA (0.9 decay)
scripts/train_rl.py

BART-CE+RL

BART-CE SFT checkpoint

Same RL recipe as BART-RL, but starts from the context-enhanced SFT checkpoint instead of the plain one. Intended to stack the benefits of article-aware supervision with the reward-driven polish.

Policy init: BART-CE SFT checkpoint
Reward formula: same as BART-RL
KL coeff: 0.2
Learning rate: 1e-5
Epochs: 3
scripts/train_rl.py

Known failure mode: reward hacking

Human eval shows BART-RL has a 40.7% meaning-change rate — more than double T5-Joint's 16.4%. The model learned that deleting sarcastic tokens reduces P(sarcastic) while still preserving enough ROUGE-L overlap to satisfy the content reward. Deletion optimizes the composite reward without actually rewriting. We document this in detail on the eval page.

Section 03

LoRA Instruction Tuning

LLaMA 3.2 1B is a decoder-only chat model, so the recipe looks nothing like the seq2seq pipeline. We use PEFT LoRA adapters on every attention + MLP projection, keeping the base weights frozen, and mask the loss to the assistant response. The system prompt and user message are encoded with the Llama 3 chat template so the model stays aligned with its instruction-tuned prior.

System prompt

You are a writing assistant. Rewrite sarcastic news headlines as neutral, factual equivalents that preserve the core meaning without irony or mockery. Respond with only the rewritten headline, no explanation.

LLaMA 3.2 1B

meta-llama/Llama-3.2-1B-Instruct

Instruction-tuned LLaMA fine-tuned via LoRA adapters — only ~6M of the 1.24B parameters are trained. Uses the chat template with a system prompt, and the loss is masked to the assistant response only (label IDs for the prompt tokens are set to −100). The trained adapters are then merged back into the base model and the result exported to GGUF so LMStudio can serve it on consumer hardware.

LoRA rank: r = 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: q, k, v, o, gate, up, down projections
Trainable params: ~6M of 1.24B (0.5%)
Learning rate: 2e-4
Batch × grad accum: 8 × 2 = 16 effective
Epochs: 3
Max length: 256 tokens
Scheduler: cosine, 5% warmup
Precision: bf16 + gradient checkpointing
scripts/train_llama.py
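
A hedged sketch of the LoRA setup and the response-only loss masking. It mirrors the table and prose above, but the model-loading details, the exact chat-template call, and the build_example helper are assumptions rather than the actual scripts/train_llama.py.

```python
# Hedged sketch: PEFT LoRA on the 7 projection layers + response-only labels.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)   # base weights stay frozen, ~6M params train

SYSTEM = ("You are a writing assistant. Rewrite sarcastic news headlines as neutral, "
          "factual equivalents that preserve the core meaning without irony or mockery. "
          "Respond with only the rewritten headline, no explanation.")

def build_example(headline: str, rewrite: str) -> dict:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": headline}]
    # Llama 3 chat template up to and including the assistant header.
    prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    answer_ids = tokenizer(rewrite + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + answer_ids
    # Mask the prompt so the cross-entropy loss covers only the assistant response.
    labels = [-100] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```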

LLaMA 3.2 1B (context)

Scraped article body in the prompt

meta-llama/Llama-3.2-1B-Instruct

Same LoRA recipe as the base LLaMA variant, but the user message includes the scraped article body alongside the headline. This tests whether the model can use the article to ground its rewrite in the real event. Bumping max_length to 1024 to fit the article forces a smaller batch size.

LoRA config: same as base (r=16, α=32)
Learning rate: 2e-4
Batch × grad accum: 4 × 4 = 16 effective
Epochs: 3
Max length: 1024 tokens (to fit article body)
Article cache: data/processed/intermediate/article_scrape_cache.jsonl
Prompt format: headline + 'Article context:' + body
scripts/train_llama_context.py
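
A tiny sketch of the context-augmented user message, following the headline + 'Article context:' + body format noted above; the exact wording and truncation handling are assumptions.

```python
# Assumed user-message layout for the context variant; real formatting may differ.
def build_context_user_message(headline: str, article_body: str) -> str:
    # The article body is appended after the headline; anything past the
    # 1024-token budget gets truncated downstream by the tokenizer.
    return f"{headline}\n\nArticle context: {article_body}"
```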
Section 04

Ablation Study

Six retrainings of the T5 control recipe (not joint — the ablations use the plain 'rewrite to non-sarcastic: ' prefix and emit the plain rewrite), each with one sarcasm subtype dropped from the training data. To keep the effective dataset size constant across the six variants, every ablation pool is stratified-sampled down to the minimum size across all drops. The finding on the dashboard: the six ablations cluster within 0.005 of each other on similarity — the model learns generic sarcasm patterns that transfer across subtypes, so no single one is load-bearing.

Ablation-specific recipe

Base: t5-small (default in slurm script)
Input prefix: "rewrite to non-sarcastic: "
Target format: plain rewrite (no strategy token)
Train pool: stratified downsample to min-across-drops
Val pool: stratified downsample, same rule
Test set: full held-out split (shared across all six)
Everything else: same as T5 shared recipe above
Held out: sarcasm → ablation_without_sarcasm
Held out: irony → ablation_without_irony
Held out: satire → ablation_without_satire
Held out: overstatement → ablation_without_overstatement
Held out: understatement → ablation_without_understatement
Held out: rhetorical question → ablation_without_rhetorical_question
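
For concreteness, a hedged sketch of the "downsample to the minimum across all drops" rule using pandas; the "strategy" column name and the groupby-based sampler are assumptions, not the actual preparation script.

```python
# Hedged sketch: build six equal-size ablation pools, one per held-out subtype.
import pandas as pd

SUBTYPES = ["sarcasm", "irony", "satire", "overstatement",
            "understatement", "rhetorical question"]

def ablation_pools(train_df: pd.DataFrame) -> dict:
    # Size of each candidate pool once one subtype is removed.
    sizes = {s: len(train_df[train_df["strategy"] != s]) for s in SUBTYPES}
    target = min(sizes.values())   # the min-across-drops budget

    pools = {}
    for s in SUBTYPES:
        pool = train_df[train_df["strategy"] != s]
        frac = target / len(pool)
        # Stratified downsample over the remaining subtypes to the shared budget.
        pools[s] = (pool.groupby("strategy", group_keys=False)
                        .apply(lambda g: g.sample(frac=frac, random_state=42)))
    return pools
```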