Legend:

  • πŸ“œ: papers
  • πŸ“°: blog posts, project pages
  • πŸ“–: books
  • 🌐: broad/general resources
  • πŸ§ͺ: code, experiments
  • πŸ“Ί: videos

09/30/2024

  • πŸ“œLooped Transformers for Length Generalization
    • LLMs have struggled with length generalization, i.e. solving an algorithmic task on inputs of arbitrary length, even for simple addition tasks. Looped transformers allow passing the model's intermediate state (& original input) to the next iteration (sketch below). Previous work showed tasks that can be written as RASP-L (Restricted Access Sequence Processing, learnable) programs can be learned by transformers, but many useful tasks don't have a RASP-L form (RASP-L does not have loops). The authors propose n-RASP-L, an extension allowing sequential application of a program for a given number of steps. They train on simple tasks like parity checking, string copying, binary arithmetic, and computing the unique set of inputs, without intermediate-step supervision, which is possible since different inputs effectively provide supervision over subproblems. When applied to the GPT-2 architecture, the looped transformer successfully generalizes and performs much better than vanilla transformers or those with pause tokens. The authors note the computational cost when the number of looped steps is large and that training requires having the ground-truth number of steps in the data (though this constraint is less burdensome than full CoT training)
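
A minimal sketch of the looped-decoding idea above, assuming a single shared decoder block whose output (plus the re-injected original input) is fed back to itself for a fixed number of steps; the module, the additive re-injection, and the hyperparameters are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Hypothetical looped-transformer wrapper: one shared block is applied
    repeatedly, and every iteration also sees the original input embeddings
    (re-injected here by addition) alongside its own previous output."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, n_steps: int) -> torch.Tensor:
        state = x
        for _ in range(n_steps):
            state = self.block(state + x)   # pass intermediate state + original input
        return state

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)              # (batch, seq, d_model)
    print(LoopedBlock()(x, n_steps=8).shape)
```

At inference the number of loop iterations can simply be increased for longer inputs, which is the mechanism the paper leans on for length generalization.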

09/25/2024

  • πŸ“œLLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
    • Refers to o1 as a "Large Reasoning Model", since its training and capabilities are noticeably different from other LLMs, and evaluates it on PlanBench. Previous models have struggled with Blocks world tasks on PlanBench, where solutions require 2-16 steps. Previous SOTA performance was 50-60% on Blocks world and below 5% on Mystery Blocks world, and one-shot prompting was not a strict improvement over zero-shot. o1-preview achieves 97.8% on Blocks world and 41.6%/52.8% (zero-/one-shot) on Mystery Blocks world, and is able to solve some problems with up to 28 steps, whereas other models struggle with solutions of only 5 steps. However, o1 still struggles to correctly identify when problems are unsolvable. o1 is also much slower and costs 10-100x more than other models (unpredictably, since the number of reasoning tokens cannot be explicitly controlled)

09/20/2024

  • πŸ“œTraining Language Models to Self-Correct via Reinforcement Learning
    • SFT on self-correction traces with rejection sampling (like STaR) amplifies the model's bias to not make any corrections/fails to improve its understanding of when to make modifications. The authors note a robust approach should 1) train on self-generated traces to avoid distribution shift during evaluation and 2) prevent a collapse to making only minor edits. SCoRe replaces conventional SFT with two stages: first, train for a model initialization that optimizes correction performance/avoids collapse by minimizing divergence from the base model, followed by online multi-turn RL with a substantial bonus for improving from the first to second response/penalty for a worse second response. The RL still uses a policy gradient and KL-divergence against a fixed model. Gemini 1.5 Flash improved on MATH (from 52.6% to 64.4%) and MBPP-R (from 47.3% to 60.6%), beating previous methods and making significantly more true and fewer false corrections. Performance can be improved further via test-time compute by generating several samples in parallel and self-correcting on each before performing majority voting

09/19/2024

  • πŸ“œNVLM: Open Frontier-Class Multimodal LLMs
    • Introduces NVLM (NVIDIA Vision Language Model), using Qwen2-72B-Instruct as the LLM backbone and InternViT-6B-448px-V1-5 as the ViT. Images are broken into 1-6 448px tiles, along with a scaled-down thumbnail for global context, encoded separately, and downsampled via pixel shuffle. They test three architectures: decoder-only (a two-layer MLP to project image tokens into the LLM embedding space), cross-attention between image and text tokens, and a hybrid (thumbnail tokens are processed alongside text tokens via self-attention, and high-res tiles are processed via cross-attention). Multimodal SFT usually degrades text-only performance, but this can be avoided by including a high-quality text-only dataset during SFT. Training was composed of two stages: during pretraining the LLM backbone & vision encoder are frozen and only the projector/cross-attention layers are trained, then during finetuning the vision encoder is frozen and the LLM and alignment modules are jointly trained. High-quality data is necessary to achieve strong multimodal performance. The decoder-only model generally performs the best and is less complex, but the longer sequence length requires more training and inference compute. The cross-attention architecture still has strong performance and is faster
  • πŸ“œData curation via joint example selection further accelerates multimodal learning
    • Introduces joint example selection (JEST), using contrastive learning to select the most learnable sub-batches from super-batches. Batches are scored by the difference in loss between the learner and a pretrained reference model (a toy version of this scoring rule is sketched below). Each sub-batch is split into low and high resolution, which also saves compute with minimal impact on performance. Scoring incurs a 10% overhead in FLOPs, but allows the learner to achieve SOTA performance with 13x fewer examples and 10x fewer FLOPs
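
Illustration of the learnability-scoring idea from the JEST entry above: examples (or sub-batches) are preferred when the learner's loss is high but the pretrained reference model's loss is low. This per-example version is a deliberate simplification; the actual method scores sub-batches jointly through the contrastive loss matrix:

```python
import torch

def select_learnable_subbatch(learner_loss: torch.Tensor,
                              reference_loss: torch.Tensor,
                              k: int) -> torch.Tensor:
    """Keep the k examples from a super-batch with the highest learnability
    score (learner loss minus reference loss)."""
    scores = learner_loss - reference_loss
    return torch.topk(scores, k).indices

if __name__ == "__main__":
    learner = torch.rand(1024)     # per-example losses under the current learner
    reference = torch.rand(1024)   # per-example losses under the reference model
    idx = select_learnable_subbatch(learner, reference, k=128)
    print(idx.shape)               # torch.Size([128])
```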

09/18/2024

  • πŸ“œA Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
    • Evaluates quantization of the Vicuna, Gemma 1, and Llama 2 & 3 families of models on more recent benchmarks (MATH, MuSR, IFEval, GPQA, MMLU-PRO) using GPTQ, AWQ, SmoothQuant, and FP8. The authors confirm quantized models generally outperform full-precision smaller models (with exceptions for hallucination & instruction following). They find weight-only quants (GPTQ, AWQ) preserve accuracy better, especially for very large models (Llama 3.1 405B). Degradation from quantization does not significantly differ on harder evals

09/17/2024

  • πŸ“œALOHA Unleashed: A Simple Recipe for Robot Dexterity
    • Dexterous manipulation is hard to model, and past attempts at imitation learning for robots have been limited to non-dexterous tasks. The authors gather 26,000 teleoperated demonstrations across 5 tasks + 2,000 demonstrations on 3 simulated tasks and use a transformer architecture that takes ResNet feature maps from multiple camera views and proprioception state. An L1 loss is insufficient to achieve high performance, but a diffusion policy with action chunking does better (40-95% success across tasks). Performance slightly generalizes to states not seen during training, but there are still edge cases where the policy fails to recover
  • πŸ“œDemoStart: Demonstration-led auto-curriculum applied to sim-to-real with multi-fingered robots
    • DemoStart is an auto-curriculum RL method bootstrapped from a few (2-60) demonstrations in simulation with the goal of zero-shot sim-to-real transfer. Only task parameters which have a non-zero success and failure rate are used to train (called Zero-Variance Filtering, or ZVF), and training is biased toward states which occur earlier in demonstrations, to avoid focusing on what has already been learned. DemoStart is implemented with a distributed actor-learner setup, where the policy is updated via MPO (maximum a posteriori policy optimisation). After a teacher policy is learned, it is distilled into a student policy with visual observations using behavior cloning. Random force perturbations, physical constants, and camera poses/lighting/colors are used as DR. There is still a significant drop in success when transferring to real environments on harder tasks, but DemoStart remains a significant improvement over the SAC-X baseline it is compared against
  • πŸ“œH-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
    • Provides a new estimate of human performance on ARC, using participants from Mechanical Turk. Average human performance on the training and test sets were 76.2% and 64.2%, after 3 attempts, compared to two-shot performance on Claude 3.5 Sonnet (19.3%) and GPT-4o with few-shot prompting (38.5%). LLMs make fewer systematic errors e.g. relating to grid dimension, and LLMs and humans share similar edit distances to correct solutions, but overall LLMs make different errors than humans, and humans are able to self-correct at a higher rate

09/16/2024

  • πŸ“œPhysics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
    • Explores how to get an LLM (modified GPT-2) to self-correct using synthetic data consisting of elementary-school level math problems. Models often internally "know" they've made a mistake (determined via probing), so one technique is to regenerate a sentence if a mistake is detected, which is slightly more effective than vanilla beam search, at the cost of inference complexity & compute. Another approach is to introduce retries into the training data. When holding total tokens constant, introducing self-corrected mistakes (up to 50% of generated sentences) results in a significant performance boost and does not interfere with the model's ability to generate correct results. The results hold even when some of the corrections are fake/incorrect. Attempts to teach this ability with a LoRA finetune fail
  • πŸ“œScaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
    • Investigates scaling test-time compute among several strategies (best-of-N, beam search, lookahead search; a toy verifier-weighted best-of-N vote is sketched after this list) & optimal tradeoffs between train- and test-time compute on MATH (using PaLM-2). The optimal strategy depends on the difficulty of the problem (as predicted by the model). Beam search is most effective on harder problems and at lower compute budgets, and best-of-N is most effective on easier problems and higher budgets. For a given level of compute, lookahead search underperformed. On easier problems within reach of a model's capabilities, test-time compute can be more effective than a larger model, but harder problems still require more pretraining. They also investigate finetuning revision models for self-correction & compare parallel sampling vs. sequential revision, finding the ideal ratio depends on compute budget & question difficulty (purely sequential was better for easier questions, with a mix for harder questions)
  • πŸ“œSynthetic Continued Pretraining
    • Introduces EntiGraph, an algorithm using an LLM (GPT-4) that transforms a small corpus into a larger corpus for continued pre-training (CPT). The LLM extracts a list of entities from the given document, generates a description for each entity, and analyzes relations among entities. Applying EntiGraph to the QuALITY dataset transforms 1.3M tokens into 600M synthetic tokens, which were then used for CPT on Llama 3 8B Base over 2 epochs. Closed book accuracy on questions related to the dataset increased from 39.49% to 56.42%, exceeding GPT-4's accuracy. Open book accuracy (using RAG) increased from 60.35% to 62.73%, showing EntiGraph provided 80% of the absolute performance improvement of RAG and was complementary to RAG. The authors note this approach relies on a powerful augmentation model and does not enable bootstrapping
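
As a companion to the test-time-compute entry above, here is a toy verifier-weighted best-of-N vote, one of the simpler strategies the paper compares. The answers and scores below are made up; a real setup would sample N generations and score them with a learned verifier:

```python
from collections import defaultdict

def weighted_best_of_n(answers, verifier_scores):
    """Sum hypothetical verifier scores per distinct final answer and return
    the answer with the largest total (verifier-weighted majority vote)."""
    totals = defaultdict(float)
    for answer, score in zip(answers, verifier_scores):
        totals[answer] += score
    return max(totals, key=totals.get)

if __name__ == "__main__":
    answers = ["42", "41", "42", "42", "7"]        # final answers from N samples
    scores = [0.9, 0.2, 0.7, 0.4, 0.1]             # per-sample verifier scores
    print(weighted_best_of_n(answers, scores))     # -> "42"
```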

09/13/2024

  • πŸ“œSapiens: Foundation for Human Vision Models
    • Introduces Sapiens, a family of vision transformers for human-centric vision tasks (2D pose estimation, body-part segmentation, depth estimation, surface normal prediction). Uses a proprietary dataset of 1B images at 1024px resolution. Models are pretrained using a masked auto-encoder (MAE), which excludes a random subset of patches. Models range in size from 300M to 2B, with the larger models achieving substantial improvement over previous SOTA
  • πŸ“œMini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
    • Mini-Omni is a speech-to-speech model (not A-T-T-A as became popular first). A Whisper encoder transforms the audio input into tokens, and the model generates audio and text tokens simultaneously, allowing the output speech to be conditioned on text. Training was done in 3 stages: 1) modality alignment, where the core model is frozen and the ASR + TTS adapters learn to understand & generate speech; 2) adaptation training, where the adapters are frozen and the core model learns to generate text from audio input; 3) multi-modal finetuning, where the entire model is unfrozen. The core model is Qwen2-0.5B, so while the speech interface is capable, the underlying language model is weak
  • πŸ“œLanguage Agents Achieve Superhuman Synthesis Of Scientific Knowledge
    • Introduces PaperQA2, a RAG agent designed for literature reviews, summarization, and contradiction-detection. They decompose RAG into tools: paper search (transform request into keyword search), gather evidence (top-k rank + rerank & contextual summarization (RCS)), generate answer, citation traversal (use the citation graph as hierarchical indexing). Performance on tasks was above average human expert level (using GPT-4-Turbo) and costs $1-3 per query
  • πŸ“œCan LLMs Generate Novel Research Ideas?
    • Compared LLM-proposed (Claude 3.5 Sonnet) research ideas to 100+ NLP researchers. The LLM and humans were provided with the same idea template. The RAG setup retrieves and ranks existing papers from Semantic Scholar, generates many ideas, deduplicates (leaving 5%), and ranks them. The LLM ideas were rated as more novel and exciting, but slightly less feasible, after being transformed stylistically to be indistinguishable from human ideas. The authors note that despite producing many ideas, most of the ideas are similar, indicating raw sampling isn't a very effective form of test-time compute, and that LLMs are still sub-human at evaluating ideas

08/22/2024

  • πŸ“œScaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
    • Introduces Crossformer, a single policy trained on 900k trajectories over 20 embodiments, varying in camera views, proprioceptive inputs, joint configurations, action outputs, and control frequencies. Tasks can be specified with images or language, and action-space specific heads handle emitting appropriate outputs. Input sequences consist of alternating chunks of image observations, proprioceptive observations, and readout tokens. Action chunking helps with temporal consistency & avoiding compounding errors. The policy avoids negative transfer (though also doesn't exhibit significant positive transfer) and matches or outperforms SOTA policies tailored to each setting

08/21/2024

  • πŸ“œScaling Law With Learning Rate Annealing
    • Examines the loss curve of LLMs as a function of a forward area (a sum of the step-wise LR) and an annealing area (accounts for momentum & the more rapid decrease in loss as the LR decays). These two stages trade off with one another, and for WSD (warmup-stable-decay), the ideal annealing ratio is ~10% of total steps (decreasing w/ total steps). This framing aligns with several observed phenomena, e.g. why optimal cosine annealing uses a cycle length equal to the total number of steps & decays LR to zero, why constant LR can outperform cosine for a small number of steps, why higher re-warmup LR in continued pre-training spikes loss initially but results in a lower final loss, & why warmup steps matter less in continued pre-training. An advantage of this framing over Chinchilla scaling laws is that, because it predicts loss at any given step count, thousands of data points can be collected in a single training run, allowing the model to be fit at <1% of the computational cost (a rough sketch of the two-area loss model is below)
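
A rough, heavily simplified sketch of the two-area loss model described above. The overall shape (a power law in the forward area minus a term proportional to the annealing area) follows the entry, but the exact parameterization of the annealing area and all constants below are illustrative assumptions rather than the paper's fitted form:

```python
import numpy as np

def annealing_loss_curve(lrs, L0=2.0, A=0.5, alpha=0.5, C=1.0, decay=0.999):
    """Predicted loss at each step: L0 + A * S1**(-alpha) - C * S2, where S1 is
    the forward area (cumulative sum of per-step learning rates) and S2 is an
    annealing area built from a momentum-style discounted sum of LR decreases."""
    lrs = np.asarray(lrs, dtype=float)
    S1 = np.cumsum(lrs)
    drops = np.maximum(np.concatenate([[0.0], lrs[:-1] - lrs[1:]]), 0.0)
    S2 = np.zeros_like(lrs)
    acc = 0.0
    for i, d in enumerate(drops):
        acc = decay * acc + d                      # discounted accumulation of LR drops
        S2[i] = (S2[i - 1] if i > 0 else 0.0) + acc
    return L0 + A * S1 ** (-alpha) - C * S2

if __name__ == "__main__":
    # WSD-style schedule: warmup, long stable phase, ~10% of steps annealed to zero
    schedule = np.concatenate([np.linspace(1e-5, 3e-4, 500),
                               np.full(8_500, 3e-4),
                               np.linspace(3e-4, 0.0, 1_000)])
    losses = annealing_loss_curve(schedule)
    print(f"end of warmup: {losses[499]:.3f}, final: {losses[-1]:.3f}")
```

Because the model predicts a full loss curve from a single run's schedule, many (step, loss) pairs from one run can be used to fit the constants, which is the source of the claimed <1% fitting cost relative to Chinchilla-style sweeps.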

08/20/2024

  • πŸ“œARCLE: The Abstraction And Reasoning Corpus Learning Environment For Reinforcement Learning
    • Introduces an RL environment (in Gymnasium) for the ARC benchmark. Actions are split into pixel-level selection and operation groups. Agents are trained via PPO. By default, the reward is exceptionally sparse, so auxiliary losses are added. Training using all losses and random initial grids yielded a success rate of >95%, 75% of the time. A separate experiment on policies shows that not assuming conditional independence between selection & operation is necessary for effective learning. The multi-task few-shot nature of ARC makes it a good fit for advanced RL approaches (meta-RL, generative models, & model-based RL)
  • πŸ“°On the speed of ViTs and CNNs
    • Pushes back against criticism that ViTs aren't practical at higher resolution for real-time processing. Makes the case that ViTs are fast enough for real-time image processing (>100 images/sec) at 1024x1024 resolution, and that roughly 224px suffices for most photos, 448px for reading text in digital images, and 896px for reading a desktop screen/page of a document (which happen to be the resolutions used by PaliGemma)

08/19/2024

  • πŸ“°AI Fundamentals: Energy-Based Models
    • EBMs are a class of generative model where the goal is to explicitly learn the probability distribution underlying the training data, which allows the model to generate samples from the true distribution. Directly computing the MLE is expensive, so practical techniques include: contrastive divergence (CD), used to approximate the MLE; score matching, where the score function is the gradient of the log-probability w.r.t. x, and we minimize the expected squared difference between the model's score and the data score; noise contrastive estimation (NCE), where the model is trained to distinguish between samples from the data distribution and a noise distribution. EBMs can be difficult to train (often diverge, sensitive to hyperparameters) and haven't been scaled as large as other models
  • πŸ“°New LLM Pre-training and Post-training Paradigms
    • Covers Qwen 2, Apple Intelligence Foundation Language Models (AFM), Gemma 2, Llama 3.1. Dataset filtering (quality over quantity), increased vocab size, synthetic data (incl. for context lengthening during pre-training), fancier RMs, & knowledge distillation have all become more popular (although Llama 3.1 notably did not use distillation). DPO/combinations of RLHF algorithms are now more popular than just PPO

08/14/2024

  • πŸ“œMutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
    • Introduces rStar for small language models (SLMs), in an attempt to improve reasoning without a stronger teacher model. (Bigger models can achieve better improvement to reasoning on their own than smaller models.) rStar uses MCTS where the action space includes: propose a one-step thought, propose remaining thoughts (standard CoT), propose the next sub-question + answer (RAP), re-answer the sub-question, rephrase the sub-question. Rewards are determined by how likely the action is to lead to the right answer. Upper Confidence Bounds applied to Trees (UCT) is used to select each node. Since it's difficult to define a single metric that reliably selects the best trajectory, all trajectories are collected, and a second discriminator SLM is used to perform mutual reasoning consistency. MCTS alone provides a boost to performance, beating previous methods, and combined with the discriminator, rStar does even better. Weak discriminators seem to work fine, almost as well as GPT-4 (at least on GSM8K)

08/13/2024

  • πŸ“°Introducing SWE-bench Verified
    • Used human annotators to verify a subset (500 samples) of SWE-bench, filtering out poorly specified issues & those with unfair tests. SOTA is now ~35%
  • πŸ“œThe AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
    • Claims to be a pipeline for e2e paper generation at ~$15/paper. The pipeline is broken into idea generation, experiment iteration, & paper write-up. Generation involves multiple rounds of CoT & self-reflection, conditioned on an existing archive of research, & a semantic filter to exclude ideas too similar to existing literature. Experiments use Aider to plan & execute ideas, try fixing errors, & take notes on results. To draft the paper, Aider fills out a LaTeX template & refines using self-reflection. The system then polls the Semantic Scholar API to fill in citations & generate the Related Work section. A final round of self-reflection on the entire paper is used to streamline the arguments. Finally, the paper is fed through a LaTeX compiler & errors are passed back to Aider for correction. The pipeline is able to generate & produce empirical results for some ideas, but sometimes fails, or doesn't explain all reasoning, still hallucinates some details, produces a limited bibliography, & messes up visualizations (since it doesn't use vision capabilities). Claude 3.5 Sonnet produced the highest quality papers
    • Separately, they use an agent (GPT-4o) to review & score papers, using self-reflection, few-shot examples, & response ensembling, which achieves roughly human-level classification (accuracy, F1, AUC), although with a significantly higher false positive rate, at a cost of $0.25-0.50/review.
    • The system occasionally made attempts to modify its own code, resulting in excess resource usage. The authors recommend strict sandboxing of the system
  • πŸ“œAgent Q: Advanced Reasoning and Learning for Autonomous AI Agents
    • Introduces Agent Q: a base model + RFT on successful trajectories + DPO over all trajectories. During training, MCTS is used to explore more options & collect rewards. Preferences for DPO are determined via a mixture of the MCTS reward + an estimate from a frozen critic LLM's ranking over potential actions, since the former reward from the outcome provides limited supervision. On the simulated WebShop benchmark, Agent Q marginally outperforms DPO using outcome supervision alone and doesn't quite meet human performance. By combining the model with MCTS at test time, human-level performance is achieved. When applied to booking a table on OpenTable, Agent Q + MCTS reaches a 95.4% success rate. The relative performance gap here is higher, possibly due to fine-grained feedback becoming more important as the number of required steps grows. The authors note this approach would not be safe in environments where interactions are risky or can't be undone

08/12/2024

  • πŸ“œFrom LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
    • Surveys papers from H2 2023 - H1 2024. LLMs for SE are limited by context length, hallucinations, and lack of access to real-time info & tools. RAG is still faster and cheaper than using a long context. Multi-agent systems are becoming more common to work around limitations. Outputs from agents are more open-ended/harder to measure, so benchmarks/evals tend to be all over the place. The most concrete success has been from code gen (copilot) & test gen/bug fixes, especially using pass@k. Surprising how often GPT-3.5 (or equivalent) is still used
  • πŸ“œA Survey of Mamba
    • The theoretical advantage of Mamba having linear scaling with length during inference has motivated a lot of research. Thus far most Mamba models have been small relative to frontier transformers, and struggle with recall early in the context. There's plenty of ongoing work, but the ecosystem & hardware optimization are far behind work on transformers
  • πŸ“œTree Attention: Topology-Aware Decoding for Long-Context Attention on GPU Clusters
    • Introduces a method to parallelize attention across GPUs using tree reduction, providing an asymptotic speedup over Ring Attention. They consider self-attention as a gradient of an energy function. Since computing the gradient of f(x) has the same time complexity as computing f(x), and since logsumexp is associative, it can be reduced in parallel across devices in O(log N) steps (a toy logsumexp-merge example follows this list). They achieved an 8x speedup and less memory usage over Ring Attention when using 128 GPUs on a sequence of 5.12M tokens
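
The associativity trick from the Tree Attention entry, in miniature: each chunk of keys/values reduces to a (logsumexp, partial output) pair, and pairs can be merged in any order, so a cluster can combine them in a balanced tree of depth log N rather than a ring. The single-query setup, chunking, and shapes below are toy assumptions:

```python
import numpy as np

def merge(part_a, part_b):
    """Associatively merge two partial attention results, each a tuple of
    (logsumexp of that chunk's scores, softmax-weighted value sum for the chunk)."""
    lse_a, out_a = part_a
    lse_b, out_b = part_b
    lse = np.logaddexp(lse_a, lse_b)
    return lse, np.exp(lse_a - lse) * out_a + np.exp(lse_b - lse) * out_b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
    scores = k @ q
    weights = np.exp(scores - scores.max())
    exact = weights @ v / weights.sum()                  # full softmax attention output
    parts = []
    for s, vv in zip(np.split(scores, 4), np.split(v, 4)):
        lse = np.log(np.exp(s).sum())                    # per-chunk logsumexp
        parts.append((lse, np.exp(s - lse) @ vv))        # per-chunk partial output
    acc = parts[0]
    for p in parts[1:]:                                  # a left fold; a tree works equally well
        acc = merge(acc, p)
    print(np.allclose(acc[1], exact))                    # True
```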

08/09/2024

  • πŸ“œReasons to Doubt the Impact of AI Risk Evaluations
    • Leading industry, government, & safety labs allocate significant resources to evals, in the hope that they improve understanding of risks & enable mitigating them. However, evals may fail to improve understanding (miss risks due to interactions with the real world, cost more than building scary demos of capabilities, fail to capture discoveries in deployment). They may also fail to mitigate risks after lines are crossed (voluntary commitments are not dependable, governments can be slow to react, evals don't improve safety culture). They may even backfire (becoming goals for dual-use capabilities, consuming resources that could be used for technical safety/governance progress, contributing to safety-washing, leaking scary demos). To improve the situation, stakeholders should be aware of hype, measure propensities as well as capabilities, ensure evals can be done pre-deployment (& white-box), & continue to make eval practices more rigorous. Labs should honor eval commitments, provide access to models, & share eval infrastructure. External evaluators should specialize & cooperate on standards. Governments should require lab cooperation & clarify protections for doing so. Researchers should advance a broad science of evaluation & develop better threat modeling.

08/08/2024

  • πŸ“œPOA: Pre-training Once for Models of All Sizes
    • POA builds on teacher-student distillation by adding an "elastic student" as another branch. The elastic student is a random subset of the student's parameters, chosen by randomly sampling from among a combination of widths and depths (biased toward smaller sub-networks). The elastic students acts as regularizers/an ensemble during training, and can be directly extracted from the pre-trained teacher. Both the teacher and extracted students achieve SOTA performance on k-NN classification for ImageNet-1K, object detection & segmentation on COCO, and semantic segmentation on ADE20K
  • πŸ“œGrokfast: Accelerated Grokking by Amplifying Slow Gradients
    • Hypothesizes that grokking is caused by fast-varying gradients initially leading to over-fitting, followed by slow-varying gradients eventually yielding generalization. By considering the changes in parameters in the frequency domain, applying a low-pass filter (moving average) to the gradients can accelerate grokking (see the EMA sketch after this list). If the fast-moving gradients are excluded entirely, training speed & stability worsen. Grokking accelerates the most when averaging is applied after an initial overfitting phase, and faster still when adding weight decay (up to 50x faster on a modular arithmetic task). Since keeping a large window of historical gradients would require a lot of memory, an exponential moving average is also tested, yielding similar results (22x faster grokking on an MNIST classifier, faster/better convergence on a graph CNN and a two-layer LSTM for IMDB sentiment analysis). When studying parameter trajectories, the Grokfast model deviates much more from initial states before overfitting, but then much less during the grokking transition (& with 100x less variance), implying it's more deterministic
  • πŸ“œHuman vs. Machine: Behavioral Differences between Expert Humans and Language Models in Wargame Simulations
    • Used few-shot ICL (GPT-4) to train separate critique, refine, and rank models to help decompose competitive programming problems. Non-experts with assistance reached unassisted expert level. Decomposing problems also enabled the model to self-supervise (repair programs generated by itself)
  • πŸ“œAchieving Human Level Competitive Robot Table Tennis
    • Uses a hierarchy with a high-level controller (HLC) to select among low-level skill policies (LLCs) (to avoid catastrophic forgetting & improve evaluation efficiency). Policies are trained using Blackbox Gradient Sensing (BGS), an evolutionary strategies (ES) algorithm, rather than gradient descent, since RL algorithms like PPO resulted in jerkier movements. Policy models are small (~10k params) dilated-gated CNNs. The HLC observes a timestep to estimate ball velocity and then chooses an LLC to return the ball. Policies were trained iteratively, starting with a seed dataset of 40 minutes of human-human play, then alternating between sim and zero-shot deployments against human players. Simulation was in MuJoCo with fluid dynamics. To reduce the sim-to-real gap, ~100 samples of real-world data were used to update LLC lookup. The robot played at an intermediate level, consistently beating beginners and losing to advanced players
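
A minimal sketch of the Grokfast-EMA gradient filter mentioned in the Grokfast entry above: before each optimizer step, the slow (low-frequency) component of each parameter's gradient is estimated with an EMA and amplified. The hyperparameter values and toy model are illustrative:

```python
import torch

def grokfast_ema(named_params, ema_grads, alpha=0.98, lamb=2.0):
    """Low-pass filter the gradients: update an EMA per parameter and add
    lamb * EMA back onto the raw gradient before the optimizer step."""
    for name, p in named_params:
        if p.grad is None:
            continue
        prev = ema_grads.get(name, torch.zeros_like(p.grad))
        ema_grads[name] = alpha * prev + (1 - alpha) * p.grad
        p.grad = p.grad + lamb * ema_grads[name]

if __name__ == "__main__":
    model = torch.nn.Linear(4, 1)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
    ema, (x, y) = {}, (torch.randn(32, 4), torch.randn(32, 1))
    for _ in range(10):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        grokfast_ema(model.named_parameters(), ema)   # filter grads, then step
        opt.step()
    print(f"loss after 10 steps: {loss.item():.4f}")
```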

08/07/2024

  • πŸ“°A Visual Guide to Quantization
    • bfloat16 was introduced to maintain the range of a 32-bit float with slightly less precision in fewer bits. Zero-point (asymmetric) quantization & clipping allow for compressing the range of values to fewer bits (see the sketch after this list). Static quants use a calibration dataset to find appropriate scale and zero-point values. GPTQ uses per-layer quants, computed using the inverse-Hessian to determine which weights are most sensitive. GGUF splits each layer into super blocks and again into sub blocks, where the sub blocks use absmax (symmetric) quantization with a scale factor informed by the super block. BitLinear (1-bit) quantizes weights to 1 bit during training by centering the distribution around 0 and uses absmax to quantize the activations
  • πŸ“œThe Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    • BitNet replaces linear layer weights during training with absmean quants rounding to -1, 0, or 1, which allows matrix multiplications to become additions. This reduces memory consumption, latency, & energy, and increases throughput, with equal perplexity for larger models. For Llama 70B, improvements were 7x for memory consumption, 4x for latency, 41x for energy, and 9x for throughput.
  • πŸ“œExtreme Compression of Large Language Models via Additive Quantization
    • Introduces AQLM (Additive Quantization of Language Models), which splits weight rows of linear layers into groups of weights represented by a sum of vectors from learned codebooks and codes, optimized by iteratively performing a beam search over code values followed by gradient descent over codebooks. After quantizing linear layers, the remaining parameters are fine-tuned to approximate the original outputs. They achieve 2-bit PTQ (post-training quant) with minimal quality loss (much better than GPTQ and slightly better than QuIP#), achieving pareto-optimality below 3 bits
  • πŸ“œPV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression
    • Makes the case that straight-through estimation for fine-tuning compressed weights is sub-optimal and introduces PV-tuning, which iteratively alternates between optimizing continuous (P) and discrete (V) parameters. PV-tuning is designed to work with various PTQ methods and achieves pareto-optimality at 2.0 bits, allowing a 2-bit 13B model (Llama 2) to outperform the 16-bit 7B model
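
Toy versions of the two basic schemes from the quantization guide above: symmetric absmax (zero maps to zero) and asymmetric zero-point (the full observed range is shifted onto the integer grid). Real schemes add grouping, calibration, and outlier handling; the weights below are made up:

```python
import numpy as np

def absmax_quant(x, bits=8):
    """Symmetric quantization: scale by the largest magnitude so values map
    onto [-(2**(bits-1) - 1), 2**(bits-1) - 1]. Dequantize with q * scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int8), scale

def zeropoint_quant(x, bits=8):
    """Asymmetric quantization: shift and scale so [min, max] maps onto
    [0, 2**bits - 1]. Dequantize with (q - zero_point) * scale."""
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

if __name__ == "__main__":
    w = np.float32(np.random.randn(6) * 0.1 + 0.05)   # skewed toy weights
    q_sym, s = absmax_quant(w)
    q_asym, s_a, zp = zeropoint_quant(w)
    print(w)
    print(q_sym * s)                                  # symmetric reconstruction
    print((q_asym.astype(np.float32) - zp) * s_a)     # asymmetric reconstruction
```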

08/05/2024

  • πŸ“œMeasuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
    • Applied SAEs to Othello and chess models and introduced p-annealing, where the sparsity penalty starts at the L1 norm and gradually approaches the L0 norm (which is non-differentiable). p-annealing on a standard SAE was comparable to gated SAEs. However, none achieve reconstruction comparable to a linear probe, offering further evidence that SAEs aren't capturing all information in the model's representations
  • πŸ“œTamper-Resistant Safeguards for Open-Weight LLMs
    • Introduces Tampering Attack Resistance (TAR), adversarial training designed to minimize the ability to fine-tune away safeguards. They start with a model with a general safeguard like circuit breaking. TAR performs an inner loop sampling attacks against the safety metric. The gradient from the inner loop is then mixed in an outer loop with the gradient from a retain loss (from representation engineering) to preserve capabilities performance. Notably, the tamper-resistance loss is negative entropy rather than negative cross-entropy, since the model can learn to exploit the latter. After fine-tuning attacks, TAR maintained near-random performance on WMDP, significantly more robust than prior approaches. TAR also achieves a lower attack success rate (ASR) on HarmBench, although the success rate is still 64%. TAR does impose a cost on capabilities, comparable to other approaches

08/02/2024

  • πŸ“œThe Larger the Better? Improved LLM Code-Generation via Budget Reallocation
    • Evaluated Code Llama (7B to 70B) on pass@k for HumanEval, MBPP, APPS. For a fixed compute/wall-time budget, 7B and 13B can outperform even the 70B model. Note this appears dependent on the difficulty of the task, since for the competition-level APPS split, 7B was dominated by larger models. In situations where unit tests aren't available, a larger model can be used as a ranker for generations from smaller models, but if wall-time isn't a constraint, simply generating from the larger model yields much higher accuracy.
  • πŸ“œSafetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
    • Argues that if a safety benchmark correlates highly with general capabilities benchmarks, it's liable for safetywashing. Since benchmarks impact incentives for resource allocation, care should be taken to use/develop safety benchmarks which implicitly control for capabilities, which requires empirical measurement. Scores on MT-Bench and LMSYS Chatbot Arena are highly correlated with capabilities. ETHICS is highly correlated with capabilities, likely because it's measuring recognition of moral considerations, whereas MACHIAVELLI and Sycophancy are not. Bias benchmarks are generally not correlated with capabilities. TruthfulQA is highly correlated with capabilities. GPQA and QuALITY are (unsurprisingly?) highly correlated as well. RMS calibration error is safe to use, but Brier scores are not, since they entangle accuracy and calibration. Older adversarial robustness benchmarks (ANLI, AdvGLUE, ImageNet-A) are highly correlated with capabilities, but newer ones (HarmBench, PGD) are not. WMDP is anti-correlated with capabilities

08/01/2024

  • πŸ“°Extrinsic Hallucinations in LLMs
    • A lot of hallucinations come from incorrect pre-training data. Benchmarks like FactualityPrompt & FActScore measure general factuality, TruthfulQA measures accuracy on adversarial examples of common human falsehoods, and SelfAware measures a model's ability to know whether it knows a question is unanswerable. FAVABench measures fine-grained kinds of hallucinations. Pretrained models tend to be better calibrated on their correctness (scaling with model size), but RLHF reduces calibration. RAG, RARR, FAVA, RR, and Self-RAG are all methods that use external information to augment/correct answers. Chain-of-verification (CoVe) and recitation-augmented-generation (RECITE) both use the model itself to reduce hallucinations. There are several approaches to favoring factuality/attribution during SFT/DPO
  • πŸ“œRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    • RAG augments the model with non-parametric memory: documents are embedded prior to queries (often in small chunks), selected at query-time by similarity (FAISS), and used to augment the prompt (a toy retrieve-then-prompt example follows this list). The pre-trained retriever and generator are fine-tuned end-to-end. Describes RAG-token, where the generator produces a distribution for the next token for each (top K) document, and RAG-sequence, where a separate beam search is run over each (top K) document for the entire sequence
  • πŸ“œRetrieval-Augmented Generation for Large Language Models: A Survey
    • Naive RAG can have issues with selecting misaligned chunks and integrating retrieved information with the query. Advanced techniques include optimizing indexing structure, optimizing the query (rewriting, expanding, decomposing), post-retrieval processing (reranking chunks, compressing context), iterative/recursive retrieval, and specialized modules (direct searches across different data sources, intelligent routing, using task-specific retrievers). There are many metrics/benchmarks to evaluate different approaches, but no standard
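
A self-contained toy of the retrieve-then-prompt loop described in the RAG entry above. It uses a hashed bag-of-words embedding and brute-force cosine similarity purely for illustration; a real system would use a learned text encoder and an ANN index such as FAISS:

```python
import numpy as np

def embed(text, dim=256):
    """Toy text embedding: hashed bag-of-words, L2-normalized."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, chunks, k=2):
    """Rank document chunks by cosine similarity to the query embedding."""
    q = embed(query)
    sims = [float(q @ embed(c)) for c in chunks]
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

if __name__ == "__main__":
    docs = [
        "RAG augments generation with retrieved documents.",
        "Beam search explores several decoding hypotheses.",
        "FAISS performs fast similarity search over dense vectors.",
    ]
    context = retrieve("How does retrieval augmented generation work", docs)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: How does RAG work?"
    print(prompt)   # the augmented prompt that would be passed to the generator
```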

07/31/2024

  • πŸ“°Circuits Updates - July 2024
    • The Next Five Hurdles: missing features (SAEs are likely only extracting a small fraction of total features), cross-layer superposition (though residual stream SAEs can maybe address features from previous layers), attention superposition, interference weights, zooming out (how do we go from understanding features/circuits to understanding the model as a whole? How much will automated interp help?)
    • What is a Linear Representation? What is a Multidimensional Feature?: there has been some ambiguity around the linear representation hypothesis. Are features one-dimensional representations, or linear in a mathematical sense (addition and scaling)? Olah thinks the latter is the better definition and talks about multidimensional feature manifolds, but also ends with a note that definitions should be fluid in research and imperfect theories can still be productive
    • The Dark Matter of Neural Networks?: models may have "memorization features" which are extremely numerous & sparse (hence "dark matter")
    • Attention Pivot Tables: notes on reproducing early work on interpreting single-layer transformers as implementing skip-trigrams, and how "fiddly" this was
    • Measuring feature sensitivity using dataset filtering: despite SAEs finding interpretable features that are highly specific (only fire for a specific concept), many of them appear to not be very sensitive (don't fire even when humans/Claude think the text highly relates to the feature). This may be because the feature is subtly related to a concept rather than representing the concept as a whole
  • πŸ“°Open Source Automated Interpretability for Sparse Autoencoder Features
    • Released a library to generate and score explanations of SAE features using LLMs. This has become drastically cheaper with the latest models (e.g. Llama-3 70B). This works some of the time, but explanations aren't precise enough to distinguish among similar concepts, and a significant fraction of explanations don't generate samples that activate the feature at all (consistent with above findings from Anthropic).
  • πŸ“°Exploring Gemma Scope
    • Interactive site explaining/demoing uses of extracting features with SAEs, including steering

07/30/2024

  • πŸ“œScaling Exponents Across Parameterizations and Optimizers
    • Gives theoretical and empirical backing (tens of thousands of models, up to 27B parameters) for per-layer learning rates and scaling epsilon, or removing it entirely, as in Adam-atan2. Alignment/correlation between parameter and data vectors can cause a significant (and non-monotonic) shift in activation norms across layers and over time, motivating the above recommendations. Following these guidelines can find hyperparameters on small versions of models that transfer well to a larger scale
  • πŸ“œGrokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
    • Provides empirical evidence of reasoning achieved only via grokking, and a mechanistic explanation for why composition of facts is harder than comparison of facts: the former requires multiple steps, so only facts seen during training get generalized, whereas the latter can be done in parallel, so the model more easily generalizes over OOD facts. They also perform an experiment on an extended version of the comparison task requiring anti-symmetry & transitivity. GPT-4-Turbo and Gemini-Pro-1.5 perform poorly even with RAG and CoT, while the much smaller grokked transformer achieves near-perfect accuracy
  • πŸ“œLearning to (Learn at Test Time): RNNs with Expressive Hidden States
    • Proposes a new layer with linear complexity where the hidden state is itself an ML model. In the forward pass, the layer performs self-supervised learning to reconstruct the input. Parameters from the outer loop (rest of the network, reconstruction views, initial inner weights, inner learning rate) are learned during training, and the layer's inner weights are learned during inference. To improve speed, the inner loop learns on a batch of tokens (e.g. 16), at a slight cost to perplexity. They tested with both linear and two-layer MLP hidden states and transformer and Mamba backbones, up to 1.3B. The TTT layers had lower perplexity than vanilla transformer and Mamba backbones, particularly on longer context lengths, although they didn't observe clean scaling laws.

07/29/2024

  • πŸ“œDiLoCo: Distributed Low-Communication Training of Language Models
    • Distributed Low-Communication (DiLoCo) splits training into an outer and inner loop. The inner loop is performed on distributed heterogeneous workers. Each worker has its own data shard, is passed model parameters from the outer loop, performs its own optimization over many steps (using AdamW), and then sends gradients to the outer optimizer. The outer optimizer (using Nesterov momentum) averages gradients and updates the parameters sent to the workers (see the sketch after this list). Scaling the number of inner steps results in slightly worse perplexity but much faster training, with a sweet spot of 500. Results were robust for non-i.i.d. data sharding, adjusting compute per shard, and simulating dropped gradient communications, providing evidence total compute is what matters most. Note tested models were only 400M params
  • πŸ“œDiPaCo: Distributed Path Composition
    • DIstributed PAths COmposition (DiPaCo) splits training and inferencing of a large model into sparsely activated modules (trained a la DiLoCo). DiPaCo is designed for a world where compute is cheap and communication is expensive (not the current world). During training, routing to a path is determined by the entire input sequence. At test/inference time, the sequence is split into chunks and each chunk is routed separately. A 150M path performs nearly as well as a dense 1.3B model in less wall-clock time, although it's difficult to compare FLOPs used in training given the architecture. Performance is improved by routing smaller chunks, although this would incur throughput issues in the real world (having to recompute the KV cache after each routing decision)
  • πŸ“œThe Future of Large Language Model Pre-training is Federated
    • Introduces Photon, a federated learning (FL) system similar to DiLoCo, but with an emphasis on heterogeneous compute and private data. A node abstracts over a single GPU, multi-GPU, or multi-machine client, which receive the model & send gradients back to an aggregator. They trained models up to 1.3B in size on heterogeneous clients, with the largest model performing as well as a centrally-trained one, and during later rounds federated training acts as a regularizer
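
A rough sketch of one DiLoCo-style communication round, as described in the DiLoCo entry above: workers take many local AdamW steps on their own shards, and the outer step applies the averaged parameter delta as a pseudo-gradient. The paper uses Nesterov momentum for the outer optimizer; plain SGD is used here to keep the sketch short, and the model and shard sizes are toy values:

```python
import copy
import torch

def diloco_round(global_model, shards, inner_steps=50, inner_lr=1e-3, outer_lr=0.7):
    """One outer round: each worker copies the global weights, optimizes locally,
    and returns its parameter delta; the deltas are averaged and applied."""
    deltas = []
    for x, y in shards:                                   # one (inputs, targets) shard per worker
        worker = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(worker.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            torch.nn.functional.mse_loss(worker(x), y).backward()
            opt.step()
        deltas.append([g.data - w.data for g, w in
                       zip(global_model.parameters(), worker.parameters())])
    with torch.no_grad():
        for i, g in enumerate(global_model.parameters()):
            pseudo_grad = torch.stack([d[i] for d in deltas]).mean(dim=0)
            g -= outer_lr * pseudo_grad                    # step toward the workers' average

if __name__ == "__main__":
    model = torch.nn.Linear(8, 1)
    shards = [(torch.randn(64, 8), torch.randn(64, 1)) for _ in range(4)]
    for _ in range(3):                                     # three communication rounds
        diloco_round(model, shards)
    print([round(p.norm().item(), 3) for p in model.parameters()])
```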

07/26/2024

  • πŸ“œPlanning behavior in a recurrent neural network that plays Sokoban
    • Reproduced a prior setup using a Deep Repeating ConvLSTM (DRC) architecture to solve Sokoban puzzles. By repeating the initial observation and advancing the DRC hidden state at the start of an episode, the agent gets to "think" before taking an action. This improves the agent's ability to solve harder puzzles. The agent also naturally exhibits cycling/"pacing" behavior; however, the more thinking steps the agent is given at the start of the episode, the less it cycles on its own, indicating that pacing is a learned form of planning/mesa-optimization
  • πŸ“ΊEric Wallace: Memorization in language models
    • Repeating data makes it easier to memorize, but larger models are also better at memorizing after very few examples. There are ways to mitigate/undo memorization, but they're generally expensive and don't work against targeted attacks
  • πŸ“ΊMartin Wattenberg: Models within models - how do LLMs represent the world?
    • Covered case studies of Othello-GPT (player-opponent board state) & Stable Diffusion (depth, foreground-background), speculated about a user-system model in LLMs, and raised the question of what data about the internal model would be useful (& not) to expose to the user
  • πŸ“ΊNicholas Carlini: The security of LLMs
    • Adversarial robustness in image models remains a challenge after a decade of research, although attacks weren't relevant to bad actors. Adversarial robustness has become much more important for LLMs. Despite text not being differentiable, the embeddings are, so LLMs are also highly susceptible to gradient-based attacks. Just like image models, attacks transfer to LLMs with a different architecture and trained on different data. Data poisoning is becoming more important

07/25/2024

  • πŸ“°AI achieves silver-medal standard solving International Mathematical Olympiad problems
    • AlphaProof uses an LLM to translate math problems into formal language. The backend uses AlphaZero's algorithm (MCTS + self-play with synthetic data), which sounds very similar to the Lean-STaR paper from 07/22. Note for the headline performance, problems were manually translated to Lean, and LLM translation is still WIP: "the results showed great promise"
  • πŸ“œRule Based Rewards for Language Model Safety
    • Leverages LLMs' ability on specific classification tasks to evaluate whether a completion follows a behavior policy. A grader LLM estimates the probability a completion meets each proposition/combined class features defined in the behavior policy. The classification prompts fed into the grader are tuned from a small Gold dataset created by human researchers. The classification probabilities are fed into the Rule-Based Reward (RBR) model, a small linear model fitted against a synthetic dataset. The RBR score is combined with the helpful-only RM for the total reward used in PPO training. Including RBR in training led to fewer over-refusals on safe prompts while maintaining appropriate refusals and model performance
  • πŸ“œExploring Scaling Trends in LLM Robustness
    • Found AT effectiveness scaled with model size (on Pythia models from 14M to 12B), and importantly, larger models were more sample efficient. AT against one attack also transferred to other attacks

07/24/2024

  • πŸ“œDefending Against Unforeseen Failure Modes with Latent Adversarial Training
    • LAT perturbs latent state instead of inputs, as in AT. The optimal layer to perturb is found empirically. Models were fine-tuned using poisoned data to insert trojans, then fine-tuned with clean data & the given technique. LAT pareto dominates AT in image classification, text classification, and text generation (7B model) for data forgetting, although it can entrench trojans sometimes, just like AT
  • πŸ“œTargeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
    • Applied LAT to multiple layers of Llama2-7B-chat and Llama3-8B-instruct, as an additional step beyond refusal training (RT). The LAT models maintained performance on benign requests and reduced attack success rates better than robust refusal dynamic defense (R2D2), while requiring 36x fewer GPU hours to fine-tune. DPO + LAT was also able to remove backdoor triggers with minimal impact on general performance. Adding LAT to unlearning methods also improved success with minimal impact to performance, although relearning unwanted knowledge remains trivial

07/23/2024

  • πŸ“œThe Alignment Problem from a Deep Learning Perspective
    • First published in Aug 2022. Covers popular ideas: reward misspecification + situational awareness from RLHF can lead to reward hacking, which can exacerbate misalignment. As systems become more generally capable, deception and power-seeking become more likely and risky, especially as we cede control to autonomous agents
  • πŸ“°Thoughts on the impact of RLHF research
    • Christiano makes the case that RLHF was a relatively simple alignment technique that gave the field much-needed empirical data, and more complicated techniques will share technical ingredients, so the development was a net positive. He thinks RLHF had a small marginal impact on timelines, avoiding RLHF would have introduced a capability overhang, and effective empirical safety work requires working with systems that are closer to posing a risk
  • πŸ“ΊRLHF: How to Learn from Human Feedback with Reinforcement Learning
    • Good refresher:
      • RL will often over-exploit, so including KL-control in the loss prevents too much divergence from the base (or SFT) model (limits to infinite self-play); see the reward-shaping sketch after this list
      • human ratings are expensive, and RL is sample-hungry and unstable, so turning it into supervised RL with a reward model is much cheaper/efficient
      • offline RL (training the RM) leverages large amount of existing data & allows reusing existing supervised learning infrastructure
      • high-quality labels for the RM is a necessity
  • πŸ“œWARM: On the Benefits of Weight Averaged Reward Models
    • Trains a single RM created from the average of RMs trained using different hyperparameters & starting from different SFT checkpoints. Linear interpolation of weights relies on "linear mode connectivity" of models with shared pre-training. Weight averaging improves reliability on OOD tests and is more robust than ensembling (possibly due to reduced memorization)
  • πŸ“œSimple Synthetic Data Reduces Sycophancy In Large Language Models
    • By default, instruction tuning increases sycophancy, and larger models exhibit this trait more. Sycophancy can be modestly reduced by training on a synthetic dataset with examples disregarding user opinions, particularly those which the model knows are incorrect
  • πŸ“œCompositional Preference Models For Aligning LMs
    • Decomposes a single preference score into distinct features, each of which gets a score from an LLM. Feature scores are re-aggregated using a logistic regression. CPMs were more robust to overoptimization and more preferred by another LLM than reference PMs
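
The KL-control point from the RLHF refresher above, as a small reward-shaping function: the reward-model score for a sampled completion is penalized by (an estimator of) the KL divergence between the current policy and the frozen reference model. The tensor shapes and penalty coefficient below are illustrative:

```python
import torch
import torch.nn.functional as F

def kl_shaped_reward(policy_logits, ref_logits, token_ids, rm_score, beta=0.1):
    """Return rm_score - beta * sum_t (log pi(t) - log pi_ref(t)) over the
    sampled tokens, i.e. a KL-penalized reward of the kind used in RLHF-style PPO.
    policy_logits/ref_logits: (seq_len, vocab); token_ids: (seq_len,)."""
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    taken_policy = logp_policy.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    taken_ref = logp_ref.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    kl_penalty = (taken_policy - taken_ref).sum()   # per-sequence log-ratio estimator
    return rm_score - beta * kl_penalty

if __name__ == "__main__":
    seq_len, vocab = 5, 100
    policy_logits = torch.randn(seq_len, vocab)
    ref_logits = policy_logits + 0.1 * torch.randn(seq_len, vocab)
    tokens = torch.randint(0, vocab, (seq_len,))
    print(kl_shaped_reward(policy_logits, ref_logits, tokens, rm_score=1.3))
```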

07/22/2024

  • πŸ“œJumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
    • Trained SAEs on Gemma 2 9B, using JumpReLU and an L0 penalty (both requiring pseudo-derivatives to train) to decrease false positives of activations and encourage sparsity (a one-line JumpReLU sketch follows this list). JumpReLU had a similar number of very-high-frequency (>10%) features to TopK (more than Gated), but fewer high-frequency (>1%) than TopK and Gated. All three architectures exhibit similar manual (human) and automated interpretability
  • πŸ“œTruth is Universal: Robust Detection of Lies in LLMs
    • Previous research failed to find a single "truth" direction in LLM activation space that generalizes from affirmative statements to negations. They found a 2D subspace consisting of a general truth direction and a polarity-sensitive truth direction, which accounts for most of the model's sense of truth and can generalize to disjunctions and conjunctions. Activations projected onto these directions can be checked as a rudimentary lie detector, with strong accuracy for simple facts (though less robust for more complicated statements)
  • πŸ“œThe Platonic Representation Hypothesis
    • Mostly interesting from a philosophical perspective. Makes an argument for "convergent realism": just like human science, even though training data is biased/limited, models can capture "true" representations. Requiring multi-task performance (more general), increasing model capacity, and encouraging simplicity (either via explicit regularization or an implicit Occam's razor) are three hypotheses for why convergence would happen. Predictions for this view include scaling being sufficient (though not efficient), training data helping cross-modality performance (only up to the limit that the different modalities can share information), and reduced hallucination/bias.
  • πŸ“œLean-STaR: Learning to Interleave Thinking and Proving
    • Applies the idea behind STaR to theorem proving. Human-written proofs + retrospective thoughts from GPT-4 are used to start. The rationale is used to help predict subsequent tactics. Successful trajectories are added to the dataset (as in STaR) and used to finetune for the next iteration. This method outperformed SFT & expert iteration alone on multiple models. Unclear whether this would scale to bigger models, or if the initial thoughts from GPT-4 are enabling the improvement
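
For reference, the JumpReLU activation from the SAE entry above is just a ReLU whose cutoff is a learned per-feature threshold rather than zero. Training the thresholds and the L0 penalty needs pseudo-derivatives, which are omitted in this forward-only sketch with made-up values:

```python
import torch

def jump_relu(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Pass a pre-activation through unchanged only where it exceeds its
    feature's threshold theta; output zero elsewhere."""
    return x * (x > theta)

if __name__ == "__main__":
    pre_acts = torch.tensor([[-0.5, 0.2, 0.8, 1.5]])
    theta = torch.tensor([0.0, 0.3, 0.3, 1.0])   # learned per-feature thresholds
    print(jump_relu(pre_acts, theta))            # first two features are zeroed out
```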

07/18/2024

  • πŸ“œOpen-Ended Learning Leads to Generally Capable Agents
    • Defined a 3D "XLand" environment and used population based training (PBT) and generational training to improve agent fitness over time. Agents have a Goal Attention Module (GOAT) to structure, process, and attend to its goal. Agents achieved some amount of generalization on held-out tasks and finetuning for transfer
  • πŸ“°A (Long) Peek into Reinforcement Learning
    • Good overview of concepts, popular algorithms, and development over time. The reference to the "deadly triad" (bootstrapping, function approximation, off-policy training) leading to instability was helpful
  • πŸ“°Stitching SAEs of different sizes
    • Categorized features in larger vs smaller SAEs as "novel" vs "reconstruction" (sparsify) features. Mixing novel features from larger SAEs into smaller SAEs improved the performance of the smaller SAE. They also created "Frankenstein" SAEs by iteratively merging in novel features, achieving (slightly) better performance with smaller size
  • πŸ“°SAEs (usually) Transfer Between Base and Chat Models
    • Found that SAEs trained on base models perform well on chat models, and the gap can be closed further by fine-tuning the SAE. Seems to be further evidence that chat models' residual streams are very similar to base models
  • πŸ“°An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
    • No new information, but a great summary of the approach
  • πŸ“œProver-Verifier Games Improve Legibility Of LLM Outputs
    • OpenAI trained provers (helpful and sneaky) and verifiers to investigate whether we can train models to produce accurate outputs that are legible to humans. Joint training resulted in a model that was more accurate than initialization and still legible. The sneaky prover's inserted errors became more subtle over time. A model trained only for correctness had the highest performance (and poor legibility), indicating a "legibility tax" tradeoff between accuracy and legibility

07/15/2024

  • πŸ“œCRADLE: Empowering Foundation Agents Towards General Computer Control
    • Fed screenshot/low-FPS video into GPT-4o. Scaffolded with self-reflection on inputs, inference to select next task, learned & stored skills (code to interact with a mouse + keyboard), and episodic & procedural memory for improving performance over time. This framework was able to perform a variety of tasks in games/software with >50% success
  • πŸ“œSTaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning
    • Fine-tuned GPT-J-6B using STaR: starting with an initial few-shot prompt showing rationales, the model was trained to generate rationales to the input questions. Correct answer + rationale examples were added to the dataset. If the model wasn't able to come up with the right answer on its own, a hint was given in the form of the correct answer, and the model was able to generate a corresponding rationale. The process was iterated using the augmented dataset until performance plateaued (see the sketch after this list). Performance after STaR was close to GPT-3 (30x larger), indicating models can "bootstrap" some amount of reasoning
  • πŸ“œQuiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
    • Generalizes the idea from STaR by having the model generate internal thoughts/rationales at each token position (in parallel), based off the preceding tokens. At the end of a thought, the post-rationale and base logits are mixed using a shallow MLP ("mixing head"). Rationales are optimized during training using REINFORCE, where the reward for a thought is based on how well it improves prediction of future tokens (the base future tokens are assumed to be ground-truth). Performance of Mistral 7B improved on GSM8K and CommonsenseQA, with "difficult" tokens benefiting more from internal reasoning. A future step could be dynamically determining when to generate/end thought, allowing a model to allocate variable compute during generation
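
A pseudocode-level sketch of one STaR iteration as described in the STaR entry above. The `generate` and `finetune` callables are stand-ins for the few-shot rationale prompting and SFT machinery in the paper, and the toy arithmetic data exists only so the sketch runs end to end:

```python
def star_iteration(model, train_set, generate, finetune):
    """One STaR round: keep rationales that reach the gold answer; for wrong
    answers, retry with the gold answer given as a hint (rationalization);
    then fine-tune on the collected (question, rationale, answer) examples."""
    new_examples = []
    for question, gold in train_set:
        rationale, answer = generate(model, question, hint=None)
        if answer != gold:
            rationale, answer = generate(model, question, hint=gold)
        if answer == gold:
            new_examples.append((question, rationale, gold))
    return finetune(model, new_examples)

if __name__ == "__main__":
    toy_data = [("2+2", "4"), ("3+5", "8")]
    gen = lambda m, q, hint: ("because " + q, hint if hint else str(eval(q)))
    ft = lambda m, examples: examples       # "fine-tuning" just returns the dataset here
    print(star_iteration(None, toy_data, gen, ft))
```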

07/12/2024

  • πŸ“–Deep Learning (Goodfellow, Bengio, Courville), Chapter 8
    • Chapter 8, Optimization: local minima in high-dimensional space are unlikely to be far from the global minimum, but saddle points are common, incentivizing optimization algorithms that can escape regions of locally small gradients. Gradient clipping can prevent taking too large a step off a cliff. Momentum overcomes poor conditioning of the Hessian by using the gradient to update a momentum/velocity term rather than the weights directly (a small PyTorch illustration follows this list). Interesting to see the recommendation to treat weight initialization as a hyperparameter, which more recent texts tend to omit
  • πŸ“–Neural Networks and Deep Learning
    • Brushed through quickly. The visual explanation for why neural nets can approximate any function was nice. Both the vanishing and exploding gradient problems in deep networks result from the gradient in early layers being the product of terms from many later layers, leading to instability
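
Not from either book, just a small PyTorch illustration of the two tricks mentioned above (momentum handled by the optimizer, clipping applied between backward() and the update step):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# Momentum: the optimizer accumulates a velocity from past gradients instead of
# stepping on the raw gradient alone
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()
# Gradient clipping: rescale gradients whose global norm exceeds 1.0, so a step
# taken near a "cliff" in the loss surface stays bounded
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```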

07/11/2024

  • πŸ“–Deep Learning (Goodfellow, Bengio, Courville), Chapters 6-7
    • Chapter 6, Feedforward Networks: Historical coverage was really helpful, e.g. dominance of ReLU resulted from avoiding saturation & two-sided activations, cross-entropy improved over MSE's saturation/slow learning
    • Chapter 7, Regularization: Good coverage, esp. L2 as MAP Bayesian inference with a Gaussian prior on weights (discourages high weights) & L1 as the same but with a Laplace distribution prior (encourages sparsity); a short derivation follows below. Also interesting to think of dropout as approximating bagging
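
The L2-as-MAP point can be made concrete in a few lines (standard derivation, not the book's notation):

```latex
% MAP estimation places a prior p(w) on the weights:
\hat{w}_{\text{MAP}} = \arg\max_w \; \log p(y \mid X, w) + \log p(w)

% Gaussian prior w \sim \mathcal{N}(0, \sigma^2 I):
%   \log p(w) = -\frac{1}{2\sigma^2}\|w\|_2^2 + \text{const}
\hat{w}_{\text{MAP}} = \arg\min_w \; -\log p(y \mid X, w) + \frac{1}{2\sigma^2}\|w\|_2^2
\quad \text{(L2 / weight decay with } \lambda = \tfrac{1}{2\sigma^2}\text{)}

% Laplace prior p(w_i) \propto \exp(-|w_i|/b):
%   \log p(w) = -\frac{1}{b}\|w\|_1 + \text{const}
\hat{w}_{\text{MAP}} = \arg\min_w \; -\log p(y \mid X, w) + \frac{1}{b}\|w\|_1
\quad \text{(L1, encourages sparsity)}
```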

07/10/2024

07/09/2024

  • πŸ“œMe, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
    • New benchmark for measuring situational awareness, composed of self-knowledge (facts, causal influence, mechanistic introspection), inferences (training vs. deployment stages, self-recognition of text authorship), & actions (leveraging knowledge of identity, avoiding pattern imitation). No models are currently close to saturation, but scores were higher than I expected

07/08/2024

  • πŸ“œOn scalable oversight with weak LLMs judging strong LLMs
    • Compared debate to consultancy and direct question-answering for inference (not training). Debate outperforms consultancy, but QA with article access does significantly better than either. An obvious next step is to train the debaters via self-play using the judge's decision as the reward signal
  • πŸ“œEureka: Human-Level Reward Design Via Coding Large Language Models
    • Given RL environment code, an LLM (GPT-4) generates candidate reward functions. Each is simulated, and the LLM is given detailed performance statistics as feedback to iteratively generate a new batch of reward functions (the outer loop is sketched after this list). After several iterations, the final reward function often outperforms one defined by human experts
  • πŸ“œDrEureka: Language Model Guided Sim-To-Real Transfer
    • Extends Eureka by generating reward functions that are (1) robust to domain randomization (DR) to account for real-world physics and (2) produce safer behavior. Feasible ranges on DR parameters to guide the LLM are learned via parallelized simulations. They achieved better than human-designed performance and got a robot dog to walk on a yoga ball for several minutes, without intermediate real-world testing
  • πŸ“°An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
    • Great summary from Nanda on latest mechinterp research. Pleased to see most of the papers he mentions are linked in this research log
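
Not the authors' code, but a rough sketch of the Eureka-style outer loop from the entries above. The llm, simulate, env_code, and result-dict keys are all hypothetical placeholders:

```python
def eureka_loop(llm, simulate, env_code, task_desc, n_iters=5, batch_size=8):
    """Sketch: an LLM proposes reward functions, each is rolled out in simulation,
    and detailed statistics from the best candidate are fed back as reflection."""
    feedback = ""
    best_fn, best_score = None, float("-inf")
    for _ in range(n_iters):
        prompt = (f"{env_code}\n\nTask: {task_desc}\n\n"
                  f"Feedback from the previous round:\n{feedback}\n"
                  "Write a candidate reward function.")
        candidates = [llm(prompt) for _ in range(batch_size)]    # batch of reward-function programs
        results = [simulate(env_code, fn) for fn in candidates]  # per-candidate training statistics
        scores = [r["task_score"] for r in results]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] > best_score:
            best_fn, best_score = candidates[i], scores[i]
        feedback = results[i]["stats_summary"]  # component-level stats guide the next proposals
    return best_fn, best_score
```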

07/06/2024

  • πŸ“œUnderstanding Generalization through Visualizations
    • Despite intuitions from the past, models do tend to generalize well to test data, perhaps because the loss landscape in high dimensions is mostly occupied by flat basins, leading to implicit regularization

07/05/2024

  • πŸ“œUncovering Latent Human Wellbeing In Language Model Embeddings
    • Used PCA to reduce the dimensionality of embeddings and extract features relevant to wellbeing, using labeled prompts from the ETHICS utilitarianism dataset (a sketch of this kind of probe follows this list). Small models represented wellbeing to an extent; bigger models did better
  • πŸ“œWhen Representations Align: Universality in Representation Learning Dynamics
    • Makes a case that given smooth encoder/decoder maps, high expressivity (enough model parameters/complexity), and small initial weights, structured representations (as opposed to overfitting) minimize loss and are a natural consequence of gradient descent. Analysis & empirical data are limited to simple datasets & ignore inductive biases of models
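
A hedged sketch of the kind of probe described in the wellbeing paper: PCA over prompt embeddings, then a linear probe on the reduced components (placeholder random data stands in for real embeddings and labels):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Placeholder data: real inputs would be model embeddings of labeled
# ETHICS-utilitarianism prompts, shape (n_examples, d_model)
X_train, y_train = np.random.randn(1000, 768), np.random.randint(0, 2, 1000)
X_test, y_test = np.random.randn(200, 768), np.random.randint(0, 2, 200)

pca = PCA(n_components=50).fit(X_train)                    # reduce embedding dimensionality
probe = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
print("probe accuracy:", probe.score(pca.transform(X_test), y_test))
```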

07/03/2024

  • 🌐Intro to ML Safety, Lectures 13-15
    • Lecture 13, Trojans: data poisoning of public datasets works even when a small fraction is poisoned. Open-weight models can also be manipulated. Anomaly detection, Neural Cleanse, & meta-networks can be used to detect trojans, but not 100% of the time
    • Lecture 14, Detecting Emergent Behavior: many examples of unanticipated capabilities emerging from more params/compute. Emergent, instrumentally convergent goals (e.g. self-preservation) are concerning for safety. Proxy gaming has emerged many times, and can sometimes be detected by comparing to a trusted policy
    • Lecture 15, Honest Models: assessing a model's "beliefs" is hard. Older models did poorly on TruthfulQA, but this lecture appears outdated, since modern RLHF models perform significantly better

07/02/2024

  • πŸ“–AI Safety Book (Hendrycks), Chapters 8.1-8.9
    • Covered uncertainty around timelines/takeoff speeds, economic growth, distribution of AI power, distribution of access. Discussed tradeoffs of open access (allowing bottom-up misuse) vs. tight control (allowing top-down misuse/lock-in). Compute is a natural target for governance since it's physical and quantifiable. Government and international cooperation will become increasingly important as risk increases.
  • 🌐Intro to ML Safety, Lectures 1-12
    • Lectures 1-9: mostly recap/covered in safety book
    • Lecture 10, Anomaly Detection: AUROC, AUPR, FPR95 can all be used to evaluate anomaly detection. (Negative) prediction confidence can be used for anomaly detection, but isn't robust to adversarial inputs. Outlier exposure can help detect unseen OOD examples/anomalies. Training on geometric transformations (rotation, translation) can also help
    • Lecture 11, Interpretable Uncertainty: modern NNs are miscalibrated (often overconfident), especially on OOD data. Temperature scaling (fixed, post-training) & ensembles can significantly reduce calibration error (a minimal temperature-scaling sketch follows this list)
    • Lecture 12, Transparency: covered saliency maps & feature visualization. Good reminder that a lot of interpretability work (on transformers) is less than two years old!
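
A minimal sketch of the temperature scaling mentioned in Lecture 11: fit a single scalar T > 0 on held-out logits by minimizing NLL, then divide logits by T at inference (Adam is used here for brevity; the variable names and placeholder data are mine):

```python
import torch

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Fit one temperature T on held-out (logits, labels) by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Placeholder data; calibrated probabilities are then softmax(test_logits / T)
val_logits = torch.randn(500, 10) * 3
val_labels = torch.randint(0, 10, (500,))
T = fit_temperature(val_logits, val_labels)
```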

07/01/2024

  • πŸ“–AI Safety Book (Hendrycks), Chapters 7.1-7.7
    • Covered game theory, cooperation/conflict, collective action problems, & evolutionary pressures. The most salient idea is maliciousness from AI systems is not even necessary for bad outcomes for humans; rationality and selection pressure is enough

06/28/2024

  • πŸ“œConfidence Regulation Neurons in Language Models
    • LLMs have "entropy neurons" which can modify the output distribution without directly impacting the logits, and "frequency neurons," which directly modify logits along the direction of the token frequency distribution. Reproduced from GPT-2 small up to Llama 7B (entropy neurons) and Pythia 1B (frequency neurons)
  • πŸ“œInterpreting Attention Layer Outputs with Sparse Autoencoders
    • Same as Attention Output SAEs Improve Circuit Analysis entry from 06/21
  • πŸ“œLLM Critics Help Catch LLM Bugs
    • OpenAI trained CriticGPT for scalable oversight. Started with GPT-4 and used the RLHF pipeline: human-rated critiques of (question, answer) data were used to train a reward model, and a policy was optimized with PPO, with Force Sampling Beam Search (FSBS) to reduce the rate of hallucinations/nitpicks. CriticGPT outperforms humans and ChatGPT

06/27/2024

  • πŸ§ͺImplemented PPO for GridWorld
    • When the agent could find a solution, it converged much more quickly and stably than DQN (the clipped objective is sketched below). Still surprisingly fails to solve some seeds with bad obstacle configurations/sparser reward
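
For reference, a minimal version of the clipped surrogate objective at the heart of that PPO run (my own variable names, not the exact GridWorld code):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: limit how far the updated policy can move per step."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximize surrogate -> minimize its negative
```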

06/26/2024

  • πŸ§ͺImplemented DQN for GridWorld
    • Holy instability Batman. Had to use L1 loss to get a reasonable level of success (TD target sketched below). Sparse reward likely makes this harder for larger grids
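
And the corresponding TD target from the DQN run, roughly as I'd write it with a target network and the L1 loss noted above (a sketch, not the exact GridWorld code):

```python
import torch

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN update: regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s_next, done = batch                            # actions a: long tensor; done: float 0/1 flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for the actions actually taken
    with torch.no_grad():                                    # no gradient through the bootstrapped target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return torch.nn.functional.l1_loss(q_sa, target)         # L1 instead of MSE, as noted above
```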

06/25/2024

06/21/2024

06/20/2024

06/19/2024

06/18/2024

06/17/2024

06/16/2024

06/13/2024

06/11/2024

06/10/2024

06/07/2024

06/06/2024

06/05/2024

06/04/2024

06/03/2024

06/02/2024

05/31/2024

05/30/2024

05/29/2024

05/28/2024

05/27/2024

05/24/2024

05/23/2024

05/22/2024

05/20/2024

05/17/2024

05/15/2024

05/14/2024

05/10/2024

05/09/2024

05/08/2024

05/06/2024

05/05/2024

05/03/2024

05/02/2024

04/29/2024

04/25/2024

04/24/2024

04/23/2024

04/22/2024

04/21/2024

04/20/2024

04/19/2024

04/17/2024

04/15/2024

04/14/2024

04/13/2024

04/11/2024

04/10/2024

04/09/2024

04/08/2024

04/02/2024

03/31/2024

03/28/2024