Autonomous Agents
Research papers on autonomous agents, updated daily. See also the Resources section.
Research papers
Chronological order.
6th of August 2024
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Reviews scaling up inference (test-time) compute in order to build self-improving agents. Quantifies the improvement gained from increasing inference compute.
- Test-time compute outperforms 14x larger models.
- A compute-optimal scaling strategy can improve the efficiency of test-time compute by a factor of up to 4x.
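The core idea, spending more inference compute on repeated sampling instead of a larger model, can be sketched with a simple best-of-N majority vote. The `sample_answer` stub below is a hypothetical stand-in for one stochastic LLM call, not the paper's verifier-guided method:

```python
import random
from collections import Counter

def sample_answer(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for one stochastic LLM sample: a weak "model"
    # that answers correctly only 70% of the time.
    return "correct" if rng.random() < 0.7 else "wrong"

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    # Simple test-time compute scaling: sample n responses, majority-vote.
    rng = random.Random(seed)
    votes = Counter(sample_answer(prompt, rng) for _ in range(n))
    return votes.most_common(1)[0][0]

single = best_of_n("2+2?", n=1)    # one sample: right only ~70% of the time
scaled = best_of_n("2+2?", n=101)  # 101 samples: majority is near-certain
```

Spending more samples on a weak model makes the aggregate answer far more reliable, which is the intuition behind test-time compute rivaling much larger models.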
5th of August 2024
ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems
- ReDel (Recursive Delegation): Recursive multi-agent framework, where LLM decides when to delegate/how to delegate (delegation graph).
- Includes custom tool-use, delegation schema, event-based logging and interactive replay (web UI).
- Includes an open-source Python package.
- ReDel delegation schemes include DelegateOne (parent agent waits until the child agent completes) and DelegateWait (provides a separate function for the parent agent to retrieve the child agent's response).
- Event-driven logging includes built-in and custom events.
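The two delegation schemes differ mainly in whether the parent blocks. A minimal sketch (hypothetical `Agent` class and `run` method, not ReDel's actual Python API):

```python
from concurrent.futures import Future, ThreadPoolExecutor

class Agent:
    def __init__(self, name: str):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=4)

    def run(self, task: str) -> str:
        # Stand-in for an LLM call that solves the task.
        return f"{self.name} solved: {task}"

    def delegate_one(self, child: "Agent", task: str) -> str:
        # DelegateOne: the parent blocks until the child finishes.
        return child.run(task)

    def delegate_wait(self, child: "Agent", task: str) -> Future:
        # DelegateWait: the child runs in the background; the parent
        # retrieves the result later via a separate call.
        return self._pool.submit(child.run, task)

parent, child = Agent("parent"), Agent("child")
blocking = parent.delegate_one(child, "subtask A")
pending = parent.delegate_wait(child, "subtask B")
deferred = pending.result()  # the separate retrieval step
```

DelegateWait lets a parent fan out several sub-tasks concurrently before collecting any results.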
3rd of August 2024
The Drama Machine: Simulating Character Development with LLM Agents
- Drama Machine: Reviews automated identity generation with LLMs. Uses multiple LLMs to simulate dynamic, complex AI characters in drama scenes: interview/detective.
- Roles include Ego, SuperEgo, Autobiography, Director and Critic.
2nd of July 2024
Coalitions of Large Language Models Increase the Robustness of AI Agents
- A coalition of LLMs outperforms a single model and fine-tuned LLMs.
- Specific LLMs fit particular tasks, with cheaper inference.
1st of August 2024
- AgentGen: Generates diverse LLM-agent environments and planning tasks. An LLM fine-tuned with this data significantly improves its planning capabilities.
- Uses an inspirational corpus to generate environment context (actions/restrictions/etc.). Generates tasks with "difficulty diversification": easy/medium/hard, with bidirectional evolution (Bi-Evol) to smoothly acquire new planning skills.
31st of July 2024
Tulip Agent -- Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries
- Tulip Agent and AutoTulipAgent: the LLM agent has privileges to create, update, delete and edit its tool library.
- The tool library is recursively self-extendible.
- AutoTulipAgent includes 5 generic tools: 2 to decompose tasks and search tools, plus capabilities to create/delete/update tools.
28th of July 2024
Solving Robotics Problems in Zero-Shot with Vision-Language Models
- Wonderful Team: uses an off-the-shelf VLM for high-level planning, low-level location extraction and action execution.
25th of July 2024
PersonaGym: Evaluating Persona Agents and LLMs
- Introduces the PersonaGym benchmark to evaluate persona LLM agents.
- Sets an automatic PersonaScore-metric to evaluate five different capabilities.
- Finds SOTA level LLMs to offer highly varying level of capabilities as persona-agents.
- Increasing model size does not guarantee better persona-agent performance; performance varies widely across models.
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
- RISE (Recursive IntroSpEction): iteratively self-improves LLM responses through fine-tuning with RL.
- RISE starts with turn 1, where only the prompt is provided. In turn 2, the prompt, the original response and its feedback are provided to generate the turn-2 response. Majority voting selects the final response from the multiple responses generated.
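The two-turn inference procedure with majority voting can be sketched as below; `model` and `feedback` are hypothetical stubs standing in for LLM sampling and the critique signal, and the RL fine-tuning part is omitted:

```python
from collections import Counter

def model(prompt: str) -> list[str]:
    # Hypothetical stand-in for sampling k candidate responses.
    canned = {"Q": ["draft answer", "draft answer", "other"]}
    return canned.get(prompt, ["final answer", "final answer", "noise"])

def feedback(response: str) -> str:
    # Stand-in for an external verifier or self-generated critique.
    return f"critique of: {response}"

def rise(prompt: str) -> str:
    # Turn 1: respond to the prompt alone.
    turn1 = Counter(model(prompt)).most_common(1)[0][0]
    # Turn 2: condition on prompt + previous response + its feedback.
    revised_prompt = f"{prompt}\n{turn1}\n{feedback(turn1)}"
    candidates = model(revised_prompt)
    # Majority voting selects the final response.
    return Counter(candidates).most_common(1)[0][0]

result = rise("Q")
```

The point of the sketch is the data flow: the turn-2 prompt carries the turn-1 answer and its critique, which is also the structure of the fine-tuning data RISE trains on.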
24th of July 2024
Reinforced Prompt Personalization for Recommendation with Large Language Models
- Reinforced Prompt Personalization (RPP): uses instance-based prompting with MARL.
- Instead of task-based prompting (role-play/history/reasoning guidance/output format), instance-based prompting personalises these four characteristics with MARL.
- AI-gadget Kit: multi-agent-driven Swarm UI (SUI) tabletop gaming system, which consists of meta-motion, interactive behaviour, interactive relationship and application.
3D Question Answering for City Scene Understanding
- Sg-CityU: 3D multimodal QA, which uses a scene graph to answer questions about spatial relationships in city scenes.
23rd of July 2024
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
- RedAgent: Introduces the concept of a "jailbreaking strategy" (strategies attackers use to construct jailbreak prompts) for red teaming through multi-agent self-reflection based on context feedback and skill memory.
- The approach can jailbreak LLMs and LLM-based apps (which are even more vulnerable) using just a few queries.
- The Red-Agent architecture includes skill memory and multiple roles (profile constructor/planner/attacker/evaluator) and short/long term memory.
AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game
- AmongAgents: multi-agent LLM-framework with memory, reflection and interaction in social deduction game with ambiguous and deceptive characters.
- Includes meeting/task-phases.
- Agents possess a personality component, generated with a personality prompt from a pre-defined set of personalities (behaviour/decision-making), which contributes to more dynamism/realism.
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
- OpenDevin: LLM-based multi-agent framework, where agents interact as human-like SW agents writing code, using command line and browsing web.
- The framework includes: interaction mechanism (event stream), environment(sandbox environment for code execution), interface(human-like), multi-agent delegation (co-operate) and evaluation framework.
- The event stream tracks the history of actions and observations.
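An event stream of this kind is essentially an append-only log of agent actions and environment observations. A minimal sketch (hypothetical classes, not OpenDevin's actual interface):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str      # "action" or "observation"
    payload: str

@dataclass
class EventStream:
    # Append-only history of agent actions and environment observations.
    events: list[Event] = field(default_factory=list)

    def add(self, kind: str, payload: str) -> None:
        self.events.append(Event(kind, payload))

    def history(self) -> list[tuple[str, str]]:
        # Replayable trace: each agent turn can condition on this history.
        return [(e.kind, e.payload) for e in self.events]

stream = EventStream()
stream.add("action", "run: pytest")
stream.add("observation", "2 passed")
```

Keeping actions and observations in one ordered log is what makes delegation, replay and evaluation possible over the same trace.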
PyBench: Evaluating LLM Agent on various real-world coding tasks
- Introduces the PyBench benchmark for real-world-like coding tasks with LLM agents.
- Introduces high-performance PyLlama3 model for coding tasks.
Artificial Agency and Large Language Models
- Reviews theoretical models for agents, LLM agents and concept of artificial agency.
LawLuo: A Chinese Law Firm Co-run by LLM Agents
- LawLuo: includes LLM-based receptionist/lawyer/secretary/boss agents to run a realistic legal consultation company based on SOPs (Standard Operating Procedures).
22nd of July 2024
[TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON](https://arxiv.org/abs/2407.15734)
- TaskGen: LLM-agent framework that solves tasks by dividing them into sub-tasks, each executed by its own agent/equipped function. Manages memory/information on a need-to-know basis. Uses the StrictJSON format.
- Includes meta-agent, inner agents, function calls, sub-tasks, shared memory (sub-tasks completed/list of past equipped-function inputs or outputs/shared variables) and passing of context/shared memory to inner agents/functions.
- Utilises global context to add data to the default LLM prompt (carrying shared variables throughout a task/storing the current state of a dynamic environment variable/specific instructions).
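Need-to-know context passing with a strict JSON schema can be sketched as below; `strict_json` and the key names are illustrative assumptions, not TaskGen's actual API:

```python
import json

def strict_json(schema: dict, payload: dict) -> str:
    # Stand-in for StrictJSON-style filtering: keep only keys declared in
    # the schema, so inner agents see information on a need-to-know basis.
    filtered = {k: payload[k] for k in schema if k in payload}
    return json.dumps(filtered, sort_keys=True)

# Hypothetical shared memory held by the meta-agent.
shared_memory = {
    "completed_subtasks": ["parse input"],
    "equipped_function_outputs": {"parse input": "ok"},
    "shared_variables": {"budget": 10},
    "meta_only_state": "not for inner agents",
}

# The meta-agent passes only schema-declared context to an inner agent.
inner_context = strict_json(
    {"completed_subtasks": "list", "shared_variables": "dict"},
    shared_memory,
)
```

Filtering through an explicit schema keeps inner-agent prompts small and prevents leaking meta-level state.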
Odyssey: Empowering Agents with Open-World Skills
- Odyssey: interactive (plan-actor-critic) LLM-agent (fine-tuned Llama 3) with real world skill library.
- Introduces long-term planning/dynamic-immediate planning/autonomous exploration benchmark.
- Planner decomposes long-term goals into sub-goals with ultimate goals/behavioural constraints/agent states/achievements.
- Actor executes skill code using query context/similarity match/skill selection.
- Critic uses execution feedback/self-validation/self-reflection.
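The planner-actor-critic loop described above can be sketched roughly as follows (all three functions are hypothetical stubs for LLM calls and the skill-library lookup):

```python
def planner(goal: str) -> list[str]:
    # Decompose a long-term goal into ordered sub-goals.
    return [f"{goal}: step {i}" for i in (1, 2)]

def actor(subgoal: str, skills: dict) -> str:
    # Similarity match stand-in: pick the first skill name found in the
    # sub-goal text and "execute" its skill code.
    name = next((s for s in skills if s in subgoal), "explore")
    return skills.get(name, "wander around")

def critic(result: str) -> bool:
    # Validate execution feedback; a failed step would trigger
    # self-reflection and replanning in the real system.
    return result != "wander around"

skills = {"step 1": "gather wood", "step 2": "craft table"}
trace = [(sg, actor(sg, skills)) for sg in planner("build shelter")]
all_ok = all(critic(r) for _, r in trace)
```

The real system closes the loop: critic failures feed back into the planner, and successful executions can be stored as new skills.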
19th of July 2024
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
- Explores feasibility of Autonomic Computing Vision (ACV) with multi-agent framework based on LLMs.
- LLM-based multi-agent framework achieves level 3 autonomy.
- The original ACV-framework identified 4 pillars: self-configuration, self-optimization, self-healing and self-protection.
12th of July 2024
PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents
- PersonaRAG: Includes the components k-docs retrieval, user interaction analysis (user profile/contextual retrieval/live session/document ranking/feedback agents) and cognitive dynamic adaptation (selective/collaborative use of agents).
Instruction Following with Goal-Conditioned Reinforcement Learning in Virtual Environments
- IGOR (Instruction following with GOal-conditioned RL): LLM translates instructions into high-level action plan with sub-goals and RL executes them.
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
- LLMs generate novel and diverse biomedical hypotheses through multi-agent interaction.
11th of July 2024
GTA: A Benchmark for General Tool Agents
- GTA benchmark: evaluates general tool usage of LLM agents on real user queries with real deployed tools, for example web-page screenshots.
- Evaluates perception, operation, logic and creativity tools.
- Defines "real-world" as helping humans in real life, with tasks being step- and tool-implicit.
- GPT-4 solves 50% of these tasks.
- Includes illustration of executable tool chains.
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- Internet of Agents (IoA): addresses LLM agents' inability to interact in dynamic environments with other agents outside their hard-coded communication pipelines.
- Limitations include: ecosystem isolation, single-device simulation and rigid communication/coordination.
- IoA acts in Internet-like environment to achieve collective intelligence and new capabilities.
- Includes architectural design of the IoA-framework.
- LAAs (LLM-empowered Autonomous Agents): Introduces concept of LAAs, which include three elements: external tools, LLMs (knowledge modelling) and Agentic workflow (human-like symbolic reasoning).
- LAAs are characterised by natural-language dialogue, decision making, planning, task decomposition and action execution.
GPT-4 is judged more human than humans in displaced and inverted Turing tests
- Introduces the inverted Turing test.
Beyond Instruction Following: Evaluating Rule Following of Large Language Models
- RuleBench-benchmark: evaluates LLMs capability to follow rules.
- Evaluation dimensions include: executing rules, triggering rules, following formal rules, applying rules and following counterfactual rules.
Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
- Argues that LLMs in their current form cannot be linguistic agents, as they lack embodiment, participation and precariousness.
- Reviews integration of LLMs into Automated Production Systems.
10th of July 2024
WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment
- Finds that covering just 0.5% of WikiHow instructions already requires roughly 300 APIs, which can be considered a lower bound for covering a wide variety of WikiHow instructions in embodied-agent tasks.
- The framework iteratively produces action spaces for APIs to be used by a LLM based embodied agent.
- This two-step process works iteratively: the LLM few-shot generates (partly through hallucination) semi-executable Python agent policies from WikiHow instructions, then parses the partial/full Python programs into a pool of APIs.
9th of July 2024
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models
- Hypothetical Minds: Introduces "Theory-of-Mind"-module. Includes as well perception, memory and hierarchical two-level planning.
Vision language models are blind
- Reviews 7 visual tasks on which SOTA-level VLMs perform shockingly badly.
5th of July 2024
On scalable oversight with weak LLMs judging strong LLMs
- Explores debate and consultancy to supervise AI.
- Finds debate outperforms consultancy in general. Better debater models modestly improve judge accuracy.
- Reviews toxicity/bias in LLM-agent multi-step inputs/outputs, instead of individual LLM input-output pairs.
- Reviews LLMs in strategic games. LLMs come with systematic biases: positional bias, payoff bias and behavioural bias. LLM performance decreases when these bias dimensions are misaligned.
3rd of July 2024
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
- LivePortrait: generates realistic video from single portrait image with facial expressions and head poses from different angles.
- Offers better computational efficiency and controllability than diffusion models by using an implicit-keypoint-based framework.
- Generation speed is 12.8 ms on an RTX 4090.
Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory
- Cactus: multi-turn dialogue dataset for mental-health counseling, consisting of goal-oriented, structured Cognitive Behavioral Therapy interactions.
- Trains the Camel LLM using the Cactus dataset.
2nd of July 2024
GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
- GRASP: Large scale spatial reasoning benchmark and dataset in structured grid environment requiring planning and commonsense reasoning.
[MMedAgent: Learning to Use Medical Tools with Multi-modal