Discover, Discuss, and Read arXiv papers

15 Aug 2025
multi-modal-learning · reasoning · deep-reinforcement-learning
Thyme, developed by a collaboration including Kuaishou, CASIA, and other Chinese academic institutions, equips Multimodal Large Language Models (MLLMs) with the ability to autonomously generate and execute code for diverse image processing and complex computations. This open-source framework enhances high-resolution visual perception and quantitative reasoning capabilities, often outperforming larger open-source baselines and reducing hallucination.
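The code-as-tool loop described above can be sketched in a few lines. This is a minimal illustration, not Thyme's actual API: all function names are invented, the "sandbox" is just a restricted `exec` namespace, and the scripted model stands in for a real MLLM that would propose image-processing or arithmetic snippets.

```python
def execute_snippet(code: str, inputs: dict) -> dict:
    """Run a model-generated snippet in a restricted namespace; the snippet
    is expected to write its outputs into the `result` dict."""
    safe_builtins = {"min": min, "max": max, "len": len, "range": range, "sum": sum}
    namespace = {"__builtins__": safe_builtins, "result": {}, **inputs}
    exec(code, namespace)  # a real system would sandbox and time-limit this
    return namespace["result"]

def tool_loop(model_step, question, inputs, max_turns=3):
    """Alternate model reasoning and code execution until the model stops
    proposing code (returns None) or the turn budget is exhausted."""
    transcript = [question]
    for _ in range(max_turns):
        code = model_step(transcript)
        if code is None:
            break
        transcript.append(execute_snippet(code, inputs))
    return transcript[-1]

# Toy run: the "model" proposes one snippet that inspects a 2x2 "image",
# then declines to continue, so the loop returns the execution result.
image = [[0, 5], [9, 2]]
snippet = "result['max_pixel'] = max(max(row) for row in image)"
calls = iter([snippet, None])
answer = tool_loop(lambda t: next(calls), "Which pixel is brightest?", {"image": image})
```

In the real framework the executed code operates on the actual high-resolution image (cropping, zooming, measuring), and its textual output is fed back into the model's context for the next reasoning step.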
18 Aug 2025
multi-modal-learning · reasoning · transformers
An empirical study by SenseTime Research and S-Lab NTU benchmarks the spatial intelligence capabilities of state-of-the-art Large Multimodal Models, finding that while GPT-5 establishes a new performance benchmark, it still significantly lags human proficiency in complex spatial reasoning tasks such as mental reconstruction and deformation. The research also introduces a comprehensive taxonomy and standardized protocols for evaluating spatial intelligence in LMMs.
18 Aug 2025
reinforcement-learning · text-generation · fine-tuning
The Rubicon framework extends reinforcement learning for large language models to open-ended and subjective tasks by using rubric-based rewards, achieving a +5.2% absolute improvement on humanities benchmarks compared to its base model and outperforming a 671B model by +2.4%. This method facilitates fine-grained stylistic control and produces more human-like responses without compromising general reasoning abilities.
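A rubric-based reward in this spirit reduces to a weighted checklist scored by a judge. The sketch below is illustrative, not Rubicon's implementation: the criterion names, weights, and keyword-matching judge are all invented, standing in for the LLM judge a real system would use.

```python
# Hypothetical rubric: (criterion, weight) pairs; each criterion is scored
# in [0, 1] and the weighted mean becomes the scalar RL reward.
RUBRIC = [
    ("addresses_the_prompt", 0.4),
    ("cites_concrete_evidence", 0.3),
    ("tone_matches_request", 0.3),
]

def rubric_reward(response: str, judge) -> float:
    """judge(criterion, response) -> score in [0, 1]."""
    total = sum(w for _, w in RUBRIC)
    return sum(w * judge(name, response) for name, w in RUBRIC) / total

def toy_judge(criterion: str, response: str) -> float:
    # Crude heuristics standing in for an LLM judge.
    checks = {
        "addresses_the_prompt": "essay" in response,
        "cites_concrete_evidence": "because" in response,
        "tone_matches_request": not response.isupper(),
    }
    return 1.0 if checks[criterion] else 0.0

reward = rubric_reward("An essay arguing X, because of Y.", toy_judge)
```

Because each criterion is scored separately, the same machinery that produces the scalar reward also exposes which stylistic dimensions a response satisfied, which is what enables the fine-grained control the summary mentions.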
18 Aug 2025
image-generation · generative-models · sequence-modeling
Researchers from Nanyang Technological University and SenseTime Research introduce Next Visual Granularity Generation (NVG), a framework that decomposes image generation into iterative coarse-to-fine refinement based on varying unique token counts at a consistent resolution. The framework demonstrates enhanced control over visual structure and content, achieving a Fréchet Inception Distance of 2.06 on ImageNet 256x256 while enabling structure-guided image synthesis.
17 Aug 2025
deep-reinforcement-learning · reinforcement-learning · optimization-methods
Web proxies such as NGINX commonly rely on least-recently-used (LRU) eviction, which is size-agnostic and can thrash under periodic bursts and mixed object sizes. We introduce Cold-RL, a learned eviction policy for NGINX that replaces LRU's forced-expire path with a dueling Deep Q-Network served by an ONNX sidecar within a strict microsecond budget. On each eviction, Cold-RL samples the K least-recently-used objects, extracts six lightweight features (age, size, hit count, inter-arrival time, remaining TTL, and last origin RTT), and requests a bitmask of victims; a hard timeout of 500 microseconds triggers immediate fallback to native LRU. Policies are trained offline by replaying NGINX access logs through a cache simulator with a simple reward: a retained object earns one point if it is hit again before TTL expiry. We compare against LRU, LFU, size-based, adaptive LRU, and a hybrid baseline on two adversarial workloads. With a 25 MB cache, Cold-RL raises hit ratio from 0.1436 to 0.3538, a 146 percent improvement over the best classical baseline; at 100 MB, from 0.7530 to 0.8675, a 15 percent gain; and at 400 MB it matches classical methods (about 0.918). Inference adds less than 2 percent CPU overhead and keeps 95th percentile eviction latency within budget. To our knowledge, this is the first reinforcement learning eviction policy integrated into NGINX with strict SLOs.
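The eviction path described in the abstract can be sketched as follows. This is a pure-Python illustration under stated assumptions: the feature names follow the abstract, but the linear scorer is a stand-in for the ONNX dueling DQN, and the weights and objects are invented for the example.

```python
import time

FEATURES = ("age", "size", "hit_count", "inter_arrival", "remaining_ttl", "origin_rtt")

def extract_features(obj: dict) -> list:
    return [float(obj[f]) for f in FEATURES]

def q_score(features: list, weights: list) -> float:
    # Stand-in for the dueling DQN: higher score = more evictable.
    return sum(w * x for w, x in zip(weights, features))

def choose_victims(lru_tail: list, weights: list, n_victims: int = 1,
                   budget_us: int = 500) -> list:
    """Score the K least-recently-used objects; if the microsecond budget
    is exceeded, fall back to plain LRU order (front of lru_tail)."""
    start = time.perf_counter_ns()
    scored = []
    for obj in lru_tail:
        if (time.perf_counter_ns() - start) / 1_000 > budget_us:
            return lru_tail[:n_victims]          # native-LRU fallback
        scored.append((q_score(extract_features(obj), weights), obj))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [obj for _, obj in scored[:n_victims]]

# Toy policy that prefers evicting large, rarely-hit objects.
weights = [0.0, 1.0, -5.0, 0.0, 0.0, 0.0]
a = dict(age=10, size=100, hit_count=9, inter_arrival=1, remaining_ttl=60, origin_rtt=20)
b = dict(age=10, size=500, hit_count=0, inter_arrival=1, remaining_ttl=60, origin_rtt=20)
victims = choose_victims([a, b], weights)
```

The hard-timeout fallback mirrors the design constraint in the paper: the learned policy may only add latency up to a fixed budget, after which the proxy must behave exactly like stock NGINX.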
14 Aug 2025
synthetic-data · text-generation · transformers
DatologyAI's BeyondWeb framework generates high-quality synthetic data, enabling Large Language Models to achieve state-of-the-art performance with up to 7.7x faster training speed compared to open web data. This approach allows smaller models to outperform larger ones trained on conventional datasets, addressing the data scarcity challenge in pretraining.
14 Aug 2025
reinforcement-learning · agents · transformers
A reinforcement learning framework called SSRL trains large language models to perform internal 'self-search,' enabling them to leverage their parametric knowledge as an information source. This approach reduces training costs for LLM agents, enhances efficiency, and demonstrates robust generalization to real external search environments.
15 Aug 2025
vision-language-models · transformers · reasoning
Alibaba Group's Ovis2.5 introduces a Multimodal Large Language Model (MLLM) capable of native-resolution visual perception and advanced reasoning through a 'reflection' thinking mode. The model sets a new state-of-the-art among open-source MLLMs in its parameter class, achieving an average score of 78.3 on the OpenCompass leaderboard for its 9B version, while also releasing a high-performing 2B model for resource-constrained scenarios.
18 Aug 2025
transformers · reasoning · chain-of-thought
This paper introduces OptimalThinkingBench, a unified benchmark designed to evaluate how Large Language Models balance performance and computational efficiency, specifically addressing tendencies of overthinking on simple tasks and underthinking on complex ones. Its comprehensive evaluation of 33 models reveals that no current LLM achieves an optimal balance, highlighting a fundamental trade-off that existing models and mitigation strategies struggle to resolve.
18 Aug 2025
generative-models · neural-rendering · image-generation
4DNeX introduces the first feed-forward framework that generates dynamic 3D scene representations from a single image, enabling high-quality 4D geometry inference and novel-view video synthesis within 15 minutes. This approach tackles data scarcity by curating the 4DNeX-10M dataset and adapts a pretrained video diffusion model using a unified 6D video representation.