2025 Technical Reading

2025 Reading List

NLP

LLMs

Foundations of Large Language Models

Broad overview of the following topics: Model architectuers and training, activation functions, learning techniques, data handling and preocessing, optimization and efficiency, concepts in learning, applciations and methods, techniquies and strategies.

Scaling Laws for Precision

Training in lower precision reduces the model’s “effective parameter count,” allowing us to predict the additional loss incurred from training in low precision and post-train quantization. Authors find that lower precision training can be more compute-efficient, but it can also lead to worse performance. They also find that there is a trade-off between the amount of data a model is trained on and the precision at which it is trained. For example, a model that is trained on a lot of data may perform worse if it is quantized to a lower precision after training. In conclusion, this paper shows that the precision of a language model can have a significant impact on its performance. It is important to consider both the training and inference precision when choosing a precision for a language model.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

BitNet b1.58 is a new way to make AI language models use less memory and power by reducing each parameter to just three values: -1, 0, or 1. The model works as well as regular 16-bit models when it reaches 3 billion parameters, but uses about 3.5 times less memory and runs almost 3 times faster. At larger sizes (70 billion parameters), it runs even better - about 4 times faster and uses much less memory than standard models. The model saves a lot of energy too - using 71 times less power for its main calculations compared to regular models. Tests show it performs well on language tasks and can handle long training sessions (2 trillion tokens) with good results.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast,Memory Efficient, and Long Context Finetuning and Inference

ModernBERT is an improved version of the BERT model. It’s designed for tasks like retrieval and classification. The model was trained on 2 trillion tokens and can handle sequences up to 8192 tokens long. It uses modern techniques like rotary positional embeddings and Gated Linear Units. This makes it faster and more memory-efficient. ModernBERT achieves top results in various evaluations, including classification tasks and retrieval in different domains, such as code. It’s also optimized to run efficiently on common GPUs.

STAR: A Simple Training-free Approach for Recommendations using Large Language Models

STAR is a new framework using large language models for recommendations without fine-tuning, featuring a two-stage process: retrieval (using semantic similarity and collaborative information) and ranking. Testing on Amazon Review datasets showed STAR outperformed supervised models on Beauty and Toys categories, and performed nearly as well on Sports items. This suggests LLMs can be effective for recommendation systems without the cost and complexity of fine-tuning.

Agents (LLM-based)

AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data

LLMs are powerful but can struggle with factual consistency and require complex queries. Traditional KG tools are difficult to use and require technical expertise. AGENTiGraph bridges the gap between LLMs and KGs. It uses a multi-agent system where each agent has a specific role, such as interpreting user intent or translating queries into graph operations. This allows for more natural language interaction with KGs and improved accuracy in tasks like question answering.

Agents are not enough

Current AI agents are limited. They can’t handle complex tasks or adapt to different situations.

There are historical challenges with agents, such as limited capabilities and lack of trust.
To improve agents, the authors propose three things:
- A secure version for private tasks.
- A user representation to avoid constant user input.
- A program to manage interactions between user and agents.

The idea is to create an ecosystem with different components working together:

Agents: Focus on specific tasks and can work with each other.
Sims: Represent users with their preferences and privacy settings.
Assistants: Interact with users and manage Sims and Agents to complete tasks.

Agents (Chip Hyuen)

AI agents perceive and act on their environment, with their capabilities defined by available tools and the environment itself. Tools are essential for agents to perceive (read) and act (write), augmenting knowledge, extending capabilities (like math or code execution), and enabling real-world actions. Planning is crucial for complex tasks, requiring plan generation, validation (by heuristics or AI), and execution, ideally decoupled to prevent wasted resources. Foundation models can be used for planning, especially when provided with information about action outcomes, and function calling enables tool use within model APIs. Effective agents require careful consideration of planning granularity, reflection/error correction, tool selection, and robust evaluation to address potential failures in planning, tool usage, or efficiency.

A Survey on LLM-powered Agents for Recommender Systems

This survey reviews LLM-powered agents in recommender systems, categorizing them into three paradigms: recommender-oriented (enhancing core mechanisms), interaction-oriented (improving user dialogue), and simulation-oriented (modeling complex interactions). It analyzes agent architecture (profile, memory, planning, action) and discusses datasets, evaluation, challenges, and future research directions.

LLM-powered Agents for Recommender Systems: A Comprehensive Survey

This comprehensive survey examines the integration of LLM-powered agents in recommender systems, focusing on their architecture, capabilities, and applications. The paper discusses how these agents can enhance recommendation quality through better understanding of user preferences, improved interaction patterns, and more sophisticated reasoning capabilities. It also addresses key challenges in deployment, including computational efficiency, privacy concerns, and evaluation methodologies.

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Proposes MRKL (Modular Reasoning, Knowledge and Language), a neuro-symbolic architecture to overcome limitations of large language models (LMs). It integrates multiple neural models with discrete knowledge and reasoning modules, aiming to handle complex tasks involving knowledge, reasoning, and language processing more effectively than standalone LMs. The paper outlines the architecture, implementation challenges, and AI21 Labs’ implementation, Jurassic-X.

Design Patterns for Securing LLM Agents against Prompt Injections

As AI agents powered by Large Language Models (LLMs) become increasingly versatile and capable of addressing a broad spectrum of tasks, ensuring their security has become a critical challenge. This paper proposes a set of principled design patterns for building AI agents with provable resistance to prompt injection attacks, which exploit the agent’s reliance on natural language inputs. The authors systematically analyze these patterns, discuss their trade-offs in terms of utility and security, and illustrate their real-world applicability through case studies. This is particularly important when agents are granted tool access or handle sensitive information.

Machine Learning

Improving Pinterest Search Relevance Using Large Language Models

Pinterest improved their search relevance using a two-stage approach: (1) A cross-encoder LLM teacher model trained on human-annotated data to predict 5-scale relevance scores, and (2) A lightweight student model trained via knowledge distillation for production serving. The system leverages rich Pin text features including titles, descriptions, synthetic image captions, and user engagement data. The LLM-based approach improved search feed relevance by 2.18% (nDCG@20) and increased fulfillment rates by over 1.5% globally. Key innovations include using multilingual LLMs to generalize across languages, enriched text representations, and large-scale semi-supervised learning through distillation.

Reinforcement Learning: An Overview

This is a pretty hefty paper and can take several hours to read through once but is a good refresher and/or overview. It provides a comprehensive overview of reinforcement learning (RL) and sequential decision making from end of 2024 (December). Based off of a textbook from Kevin Murphy. The paper covers value-based RL, policy-gradient methods, and model-based approaches. It also briefly discusses the integration of RL with large language models (LLMs). This work updates and expands upon chapters 34 and 35 of Murphy’s earlier textbook.

HealthBench: A Benchmark for Evaluating Language Models in Healthcare

HealthBench is a new benchmark for evaluating language models in healthcare applications. It focuses on assessing models’ capabilities in medical knowledge, clinical reasoning, and patient communication. The benchmark includes tasks like medical question answering, clinical case analysis, and patient education. This is particularly important as healthcare applications require high accuracy and reliability, and the benchmark helps identify areas where models need improvement to be safely deployed in healthcare settings.