Practical AI Methodology Meets Cognitive Science
AI/ML Reading List
Curated links with summaries.
- [D] ICML, no rebuttal ack so far..
- [R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)
VOID is a video inpainting model that removes objects while maintaining physical consistency by modeling counterfactual scene evolution (e.g., stopping domino chains or preventing collisions when objects are removed). The approach uses VLM-guided masks and paired synthetic training data to handle object interactions, outperforms existing methods (64.8% preference in human studies), and is released as open source by Netflix, shifting practitioner capabilities in physics-aware video editing.
- Arcee AI Releases Trinity Large Thinking: An Apache 2.0 Open Reasoning Model for Long-Horizon Agents and Tool Use
Arcee AI released Trinity Large Thinking, an open-weight 400B sparse MoE reasoning model (Apache 2.0) optimized for autonomous agents, multi-turn tool use, and long-horizon tasks with 262K token context. The release demonstrates frontier-class agentic performance (#2 on PinchBench) with training innovations (Muon optimizer, SMEBU load-balancing) and represents meaningful open-source democratization of reasoning-capable models previously dominated by proprietary systems.
- TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts
TII releases Falcon Perception, a 600M-parameter unified early-fusion Transformer for open-vocabulary visual grounding and segmentation that processes image patches and text tokens in shared parameter space from layer 1. The architectural shift from modular vision-language systems demonstrates significant improvements over SAM 3 on complex semantic tasks (+21.9 points on spatial reasoning), with novel techniques like GGROPE positional embeddings and hybrid attention strategies that could influence how practitioners design multimodal perception systems.
- [D] Agentic AI: From Tantrums to Trust
- [P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.
PhAIL is an open benchmark measuring Vision-Language-Action (VLA) model performance on real warehouse robotics hardware using production metrics (UPH, MTBF). The best current models achieve only 5% of human throughput and fail every 4 minutes, a critical reality check on autonomous-manipulation claims. Fully reproducible runs, open code, and submission pathways shift how the field should evaluate robot AI.
- [D] Make. Big. Batch. Size.
- Anthropic says its leak-focused DMCA effort unintentionally hit legit GitHub forks
Anthropic's DMCA takedown targeting leaked Claude Code source on GitHub inadvertently removed 8,100 legitimate forks of its official public repository, affecting developers not involved in the leak. The incident illustrates collateral damage in IP enforcement efforts and raises questions about DMCA's application to open-source AI code in practice.
- OpenAI Buys Some Positive News
- Gemma 4: Byte for byte, the most capable open models
- Stanford CS 25 Transformers Course (OPEN TO ALL | Starts Tomorrow)
Stanford's CS 25 Transformers seminar is opening to public attendance and livestream, featuring leading researchers from OpenAI, Anthropic, Google, and NVIDIA discussing frontier work in LLMs, multimodal models, and applications across biology and robotics. This is a high-signal educational resource for practitioners tracking cutting-edge transformer research with direct access to field leaders.
- The Magic of Machine Learning That Powers Enemy AI in Arc Raiders
Embark Studios' Arc Raiders uses reinforcement learning and robotics techniques (learned locomotion + behavior trees) to generate dynamic enemy AI that adapts and recovers unpredictably rather than following scripted patterns. The application signals growing adoption of RL in game development and represents a practical deployment of sim-to-real transfer techniques in interactive media, relevant as a case study in embodied AI and physics-informed learning systems.
- [D] Self-Promotion Thread
- [D] SIGIR 2026 review discussion
- [R] Best way to tackle this ICML vague response?
- LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Researchers evaluated instruction-tuned LLMs on essay scoring across three datasets, finding moderate agreement with human raters on holistic scoring but systematic negative bias on lower-order traits like grammar. The work demonstrates that small validation sets can detect these biases, enabling practical bias-correction strategies without fine-tuning, which is directly relevant to practitioners deploying LLMs in high-stakes educational contexts.
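The validation-set correction described above can be sketched generically: estimate a per-trait offset (model minus human) on a small labeled set, then subtract it from new model scores. The data and trait names below are made up for illustration; the paper's exact procedure may differ.

```python
# Illustrative sketch of validation-set bias correction for LLM essay
# scoring. Trait names and scores are hypothetical example data.
from statistics import mean

def fit_bias(val_model_scores, val_human_scores):
    """Per-trait mean offset between model and human scores."""
    traits = val_model_scores[0].keys()
    return {
        t: mean(m[t] - h[t] for m, h in zip(val_model_scores, val_human_scores))
        for t in traits
    }

def correct(scores, bias):
    """Shift model scores by the estimated bias for each trait."""
    return {t: s - bias[t] for t, s in scores.items()}

# Example: the model systematically under-scores grammar by one point.
val_model = [{"grammar": 2, "holistic": 4}, {"grammar": 3, "holistic": 3}]
val_human = [{"grammar": 3, "holistic": 4}, {"grammar": 4, "holistic": 3}]
bias = fit_bias(val_model, val_human)                 # {'grammar': -1, 'holistic': 0}
print(correct({"grammar": 2, "holistic": 5}, bias))   # {'grammar': 3, 'holistic': 5}
```

No fine-tuning is involved: the offsets are a post-hoc shift, which is what makes the strategy cheap to deploy.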
- Benchmark for Assessing Olfactory Perception of Large Language Models
Researchers introduced the Olfactory Perception (OP) benchmark to evaluate LLMs' ability to reason about smell across 1,010 questions and multiple modalities, finding that models rely on lexical associations rather than structural reasoning and that language ensembles improve performance. This work extends capability assessment beyond visual/auditory domains and reveals how LLMs encode non-visual sensory knowledge, with implications for multimodal reasoning and knowledge representation.
- Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models
MARS-GPS proposes a multi-chain-of-thought voting framework with Python code execution and entropy-based ranking to improve geometric problem solving in LLMs, achieving 88.8% on Geometry3K (+11% SOTA improvement) with public code release. This advances a core capability gap—logical inference in multimodal reasoning—relevant to researchers building reasoning systems and those tracking LLM mathematical cognition progress.
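Majority voting over sampled chains with an entropy-based confidence signal, two ingredients the summary mentions, can be sketched as follows. This is a generic rendition of the technique, not the MARS-GPS code.

```python
# Illustrative sketch of multi-chain-of-thought voting: sample several
# reasoning chains, take the majority final answer, and use the entropy
# of the answer distribution as a confidence signal.
import math
from collections import Counter

def vote(answers):
    """Return the majority answer and the entropy of the answer distribution.

    Low entropy means the sampled chains agree, so the vote is more
    trustworthy; high entropy flags questions worth re-sampling.
    """
    counts = Counter(answers)
    total = len(answers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    best, _ = counts.most_common(1)[0]
    return best, entropy

# Five sampled chains produced these final answers:
best, h = vote(["12", "12", "15", "12", "15"])
print(best)  # prints 12
```

In the full system each chain would also execute Python code to check its own arithmetic before casting a vote.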
- A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation
Researchers propose a safety-aware multi-agent LLM framework that decomposes behavioral health dialogue into specialized roles (empathy, action, supervision) with dynamic orchestration and continuous safety auditing, evaluated on DAIC-WOZ corpus. This bridges system design and safety concerns for practitioners building clinical-adjacent LLM applications and signals emerging best practices for role-based agent coordination in constrained domains.
- RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
RoboClaw proposes a VLM-driven agentic framework that unifies data collection, policy learning, and deployment for long-horizon robotic tasks through Entangled Action Pairs—coupled forward/recovery actions enabling autonomous, self-resetting loops. Real-world experiments show 25% success-rate improvement and 53.7% reduction in human intervention time, advancing the practical scalability of multi-policy robotic systems beyond brittle manual-reset pipelines.
- Science-T2I: Addressing Scientific Illusions in Image Synthesis
Science-T2I introduces a 20k+ adversarial image-pair dataset across 16 scientific domains to expose and measure how current text-to-image models fail at scientific reasoning despite visual fidelity, revealing a ~35-point gap between implicit and explicit scientific prompts. The authors propose SciScore (a CLIP-H reward model) and a two-stage alignment framework that achieves a 50%+ relative improvement on FLUX.1[dev], demonstrating that scientific grounding in generative models is trainable and measurable, a result directly applicable to practitioners working on model alignment and evaluation.
- EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks
EHRStruct introduces a standardized benchmark with 11 representative tasks and 2,200 samples to evaluate LLMs on structured electronic health record data, evaluating 20 models and proposing EHRMaster, a code-augmented method for improved performance. This addresses a gap in systematic evaluation frameworks for LLMs on clinical structured data reasoning—a high-value domain where model reasoning and understanding quality directly impacts clinical decision-making applications.
- The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline
Researchers develop a comprehensive mathematical framework explaining that training methodology, loss function design, and data diversity determine AI weather forecast skill at least as much as architecture, challenging the field's focus on model architecture alone. The work combines approximation theory, dynamical systems, and information theory with empirical validation across diverse models, establishing practical bounds on extreme event prediction and providing prescriptive guidance for pipeline design—directly applicable to practitioners building operational forecasting systems.
- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Researchers introduce PaperRecon, the first systematic evaluation framework for assessing quality and hallucination risks in AI-generated academic papers, tested on 51 papers via Claude Code and Codex agents. The work reveals a critical trade-off between presentation quality and factual accuracy in AI paper generation, directly relevant to researchers and practitioners concerned with AI system reliability and the integrity of research outputs.
- Transfer learning for nonparametric Bayesian networks
Researchers propose two transfer learning algorithms (PCS-TL and HC-TL) for estimating nonparametric Bayesian networks from limited data, with metrics to detect and mitigate negative transfer. The work demonstrates statistical improvements on synthetic and UCI datasets, addressing a real constraint in industrial deployment where data scarcity is endemic.
- Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting
DANCEMATCH introduces a quantized motion fingerprinting framework combining skeleton motion quantization with spatio-temporal transformers to enable efficient large-scale dance video retrieval, addressing the scalability limits of continuous embedding methods. The work includes a released benchmark dataset (DANCETYPESBENCHMARK) and demonstrates generalization across choreographic styles, making it relevant to practitioners in motion analysis, retrieval systems, and structured representation learning for video understanding.
- Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
GameSight proposes a two-stage knowledge-enhanced visual reasoning model for automatic soccer commentary generation, improving entity alignment by 18.5% over Gemini 2.5-pro by combining visual reasoning with external statistics and game state tracking. While technically solid and addressing real limitations in end-to-end commentary systems, this is an application-layer advance in sports analytics rather than a foundational ML breakthrough with cross-domain relevance.
- RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning
RefineRL introduces a self-refinement reinforcement learning method that enables smaller LLMs (4B) to solve competitive programming problems at the level of much larger models (32B–235B) through iterative validation and skeptical re-attempts. This represents a concrete advance in reasoning-task efficiency and scaling, with direct implications for practitioners optimizing model deployment and researchers studying LLM iteration capabilities.
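The iterative validate-and-retry behavior described above can be sketched as a generate-validate-refine loop. `generate` and `run_tests` are hypothetical placeholders (an LLM call and a sandboxed test run); this illustrates the control flow, not RefineRL's training procedure.

```python
# Hedged sketch of a generate-validate-refine loop in the spirit of
# self-refinement: keep re-attempting a problem, feeding test failures
# back into the next generation, until the tests pass or the budget runs out.
def solve(problem, generate, run_tests, max_attempts=4):
    """Re-attempt a problem until its tests pass, feeding failures back."""
    feedback = None
    for _ in range(max_attempts):
        code = generate(problem, feedback)   # LLM proposes a solution
        ok, feedback = run_tests(code)       # execute against unit tests
        if ok:
            return code                      # validated solution
    return None                              # exhausted the attempt budget
```

The "skeptical re-attempt" idea is that a passing-looking answer is still re-derived and re-checked before being accepted, which is what lets a 4B model close the gap to much larger ones.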
- Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents
OpenTools is a community-driven framework that standardizes tool schemas, evaluation, and monitoring for LLM-integrated agents, decoupling tool-use accuracy from intrinsic tool correctness. This addresses a real reliability gap in agentic systems and demonstrates 6-22% performance gains, with practical implications for practitioners building tool-using AI applications.
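Decoupling tool-use accuracy from tool correctness can be illustrated by replacing real tools with deterministic expectations and scoring only whether the agent picked the right tool with the right arguments. The episode format and function name below are illustrative, not OpenTools APIs.

```python
# Sketch of scoring an agent's tool *selection* separately from whether
# the tools themselves work: each episode records the expected call and
# the call the agent actually made. All names here are hypothetical.
def tool_use_accuracy(episodes):
    """Fraction of episodes where the agent's call matched the expected one."""
    hits = sum(ep["called"] == ep["expected"] for ep in episodes)
    return hits / len(episodes)

episodes = [
    {"expected": ("search", "weather Paris"), "called": ("search", "weather Paris")},
    {"expected": ("calculator", "2+2"),       "called": ("search", "2+2")},
]
print(tool_use_accuracy(episodes))  # 0.5
```

Scoring selection against fixed expectations means a flaky or buggy tool backend cannot mask (or inflate) the agent's own tool-choosing ability.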
- How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study
Researchers propose E-STEER, a framework for direct representation-level intervention on emotional signals in LLMs and agents, revealing non-monotonic emotion-behavior relationships aligned with psychology. This work bridges mechanistic interpretability, agent behavior control, and safety, moving beyond surface-level style manipulation to show that emotions can enhance both capability and alignment.
- Polychromic Objectives for Reinforcement Learning
Researchers propose 'polychromic objectives' for reinforcement learning fine-tuning, a method that prevents policies from collapsing into repetitive outputs by explicitly enforcing diverse generation exploration during training. This matters because mode collapse limits both the breadth of learned behaviors and the effectiveness of test-time compute scaling in RL, making diversity-preserving objectives directly relevant to practitioners improving pretrained policies.
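A diversity-preserving objective can be sketched as a per-sample reward shaped by dissimilarity to the rest of the sampled group. This toy version only illustrates the idea of penalizing mode collapse; the paper's actual polychromic objective is defined differently in detail.

```python
# Toy sketch of a diversity-augmented RL objective: each sampled output
# earns its task reward plus a bonus proportional to its mean distance
# from the other samples in the group, discouraging repetitive outputs.
def diversity_bonus(samples, distance):
    """Mean pairwise distance from each sample to the rest of the group."""
    n = len(samples)
    return [
        sum(distance(samples[i], samples[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def shaped_rewards(samples, task_reward, distance, beta=0.1):
    """Task reward plus a weighted diversity bonus for each sample."""
    return [
        task_reward(s) + beta * b
        for s, b in zip(samples, diversity_bonus(samples, distance))
    ]
```

Because the bonus is computed within each sampled group, a policy that emits near-duplicates forfeits the bonus, which is exactly the collapse mode the summary describes.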
- AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation
AdaLoRA-QAT combines adaptive low-rank adaptation and quantization-aware training for efficient chest X-ray segmentation, achieving 16.6× parameter reduction and 2.24× compression while maintaining clinical accuracy. Demonstrates practical pathway for deploying foundation models in resource-constrained medical settings; open-source release increases accessibility for practitioners.
- VibeGuard: A Security Gate Framework for AI-Generated Code
VibeGuard is a pre-publish security framework targeting vulnerabilities introduced by AI-assisted code generation (artifact hygiene, packaging drift, source-map exposure, secrets, supply-chain risk), demonstrated on synthetic projects with 94.44% F1 score. Directly addresses a real failure mode (Claude Code CLI's March 2026 leak of 512K lines via misconfigured source map) that existing static analysis tools missed, establishing a new threat model for practitioners deploying AI code assistants in production.
- OrgAgent: Organize Your Multi-Agent System like a Company
OrgAgent introduces a hierarchical three-layer organizational structure (governance, execution, compliance) for multi-agent LLM systems, demonstrating 102% performance improvement and 74% token reduction over flat architectures on SQuAD 2.0. The work identifies organizational structure as a first-order factor in multi-agent reasoning efficiency and effectiveness, with direct implications for practitioners scaling complex reasoning systems.