***Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant.***
Authors: Gaole He, Gianluca Demartini, Ujwal Gadiraju
Web Information Systems Group, Faculty of EEMCS, Delft University of Technology

Corresponding author: Ujwal Gadiraju
Contact Information:
u.k.gadiraju@tudelft.nl

Delft University of Technology - Faculty of EEMCS, Department of Software Technology, Web Information Systems Group
Building 28, Van Mourik Broekmanweg 6
2628 XE Delft
The Netherlands

***General Introduction***
This dataset contains data collected during experiments at Delft University of Technology as part of Gaole He's PhD thesis project. It is being made public both as supplementary data for the publications and PhD thesis of Gaole He and so that other researchers can use it in their own work.

The data in this dataset were collected in the Web Information Systems Group, Faculty of EEMCS, between May 2024 and September 2024.


***Purpose of the dataset***
The purpose of these experiments was to investigate two questions: (1) How does human involvement in the high-level planning and real-time execution of tasks shape users' trust in an AI system powered by LLM agents? (2) How does human involvement in the high-level planning and real-time execution of tasks with an AI system powered by LLM agents affect the overall task performance?


***Description of the data in this dataset***
The dataset (data.zip) consists of the user responses from our main study (anonymized_data.zip), collected on planning tasks from the [UltraTool](https://github.com/JoeYing1019/UltraTool) dataset, and the data analysis code (code.zip). We also release the interface (interface.zip). All released data are anonymized.

The user responses were collected on the Prolific platform. util.py in code.zip shows how to load the following csv files:

- TiA_postq.csv: contains user_id and the responses to all questions in the trust in automation questionnaire. Each row holds all data collected from one participant.
- user_plan_quality.csv: contains the expert annotations of plan quality made by the authors.
- userbehavior.csv: includes all user expertise and user cognitive load data.
- userfeedback.csv: includes all user actions and messages sent in the conversation.
- userinfo.csv: has the header "user_id,task_order_str,user_group,created_time". `task_order_str` records the task ordering of each participant. `user_group` indicates the user group, one of [AP-AE | AP-UE | UP-AE | UP-UE].
- usertask.csv: has the header "user_id,task_id,answer_type,choice,created_time" and records the user's responses on each task. `answer_type` identifies what was recorded (user choice, user confidence, risk perception, etc.), and `choice` holds the corresponding value the user selected.
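The two headers listed above share a `user_id` column, so the per-task responses can be linked to each participant's condition. The following is a minimal sketch of that join using only the Python standard library; it is not the loading code from util.py. The sample rows are placeholders invented for illustration (they are not taken from the dataset), and the value formats shown for `task_order_str`, `task_id`, `answer_type`, and `created_time` are assumptions.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; NOT real data from the release.
# Only the header lines follow the README; every value format is assumed.
USERINFO_CSV = """user_id,task_order_str,user_group,created_time
u1,1-2-3,AP-AE,2024-05-01 10:00:00
u2,3-2-1,UP-UE,2024-05-01 11:00:00
"""
USERTASK_CSV = """user_id,task_id,answer_type,choice,created_time
u1,t1,confidence,4,2024-05-01 10:05:00
u1,t2,confidence,5,2024-05-01 10:15:00
u2,t1,confidence,3,2024-05-01 11:05:00
"""


def read_rows(handle):
    """Read a csv file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def join_on_user(userinfo_rows, usertask_rows):
    """Attach each participant's user_group to their usertask rows."""
    group_of = {row["user_id"]: row["user_group"] for row in userinfo_rows}
    joined = []
    for row in usertask_rows:
        merged = dict(row)
        # None marks a task row whose user_id is missing from userinfo.
        merged["user_group"] = group_of.get(row["user_id"])
        joined.append(merged)
    return joined


def count_by_group(joined_rows):
    """Count joined task rows per experimental condition."""
    counts = defaultdict(int)
    for row in joined_rows:
        counts[row["user_group"]] += 1
    return dict(counts)


userinfo = read_rows(io.StringIO(USERINFO_CSV))
usertask = read_rows(io.StringIO(USERTASK_CSV))
joined = join_on_user(userinfo, usertask)
counts = count_by_group(joined)
```

To run this against the released files, replace the `io.StringIO(...)` objects with `open("userinfo.csv", newline="")` and `open("usertask.csv", newline="")`, adjusting the paths to wherever anonymized_data.zip was extracted.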


***Code Usage***

The name of each code file (e.g., xx.ipynb) typically indicates its purpose. Experimental results are also shown in the notebooks.
Code used in the data analysis (data_analysis.zip):

- analysis_cognitive_load.ipynb: analysis of cognitive load across conditions (section 5.1)
- analysis_confidence.ipynb: analysis of user confidence across conditions (section 5.3.4)
- analysis_correlation.ipynb: Spearman rank-order correlation coefficients between covariates and dependent variables (Table 7), and task-specific Spearman rank-order correlation coefficients for plan quality and risk perception (Table 8)
- analysis_H1_task_specific.ipynb: task-specific analysis for hypothesis 1 in paper (Table 3).
- analysis_H1.ipynb: analysis for hypothesis 1 in paper.
- analysis_H2.ipynb: analysis for hypothesis 2 in paper.
- analysis_H3.ipynb: analysis for hypothesis 3 in paper.
- analysis_H4.ipynb: analysis for hypothesis 4 in paper.
- analysis_TiA.ipynb: analysis of trust in automation
- calc_measure_action_seq.py: calculate measures associated with action sequence.
- calc_measures.py: calculate the dependent variables related to reliance and appropriate reliance.
- descriptive_statistics.ipynb: Sec 5.1, Distribution of covariates, Performance Overview
- cognitive_load_bar_plot.py: bar plot illustrating the cognitive load (Figure 4 in section 5.1)
- confidence_planning_execution.py: bar plot for confidence dynamics (Figure 5)
- estimation_plot.ipynb: code to draw estimation plot for user trust across conditions (not shown in paper)
- failure_analysis.ipynb: failure analysis (section 5.3.3)
- qualitative_analysis_execution.ipynb: The interaction analysis (click buttons / interaction with LLMs / give feedback) in the execution stage
- qualitative_analysis_planning.ipynb: The interaction analysis (edit/delete/add/split steps) in the planning stage
- util.py: prepare necessary variables for evaluation (e.g., correct answers, user responses)
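For readers unfamiliar with the Spearman rank-order correlation coefficient reported by analysis_correlation.ipynb, the sketch below shows how the statistic is defined: it is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. This is a standard-library illustration only and is not the code from the notebooks, which may rely on a statistics library instead. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the computation, any monotonic relationship between two variables yields a coefficient of 1 or -1 regardless of its shape, which makes the measure suitable for ordinal questionnaire responses.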

***Interface***

The interface is implemented with Flask. For deployment with Render, please refer to [this README](https://github.com/RichardHGL/CHI2025_Plan-then-Execute_LLMAgent/blob/main/interface/README.md).
