***To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems.***
Authors: Gaole He, Abri Bharos, Ujwal Gadiraju
Web Information Systems Group, Faculty of EEMCS, Delft University of Technology

Corresponding author: Ujwal Gadiraju
Contact Information:
u.k.gadiraju@tudelft.nl

Delft University of Technology - Faculty of EEMCS, Department of Software Technology, Web Information Systems Group
Building 28, Van Mourik Broekmanweg 6
2628 XE Delft
The Netherlands

***General Introduction***
This dataset contains data collected during experiments at Delft University of Technology as part of Gaole He's PhD thesis project. It is being made public both as supplementary data for the publications and PhD thesis of Gaole He and so that other researchers can use the data in their own work.

The data in this dataset were collected in the Web Information Systems Group, Faculty of EEMCS, between December 2021 and October 2022.


***Purpose of the dataset***
The purpose of these experiments was to investigate two questions: (1) How can a debugging intervention help users estimate the performance of an AI system, both at the instance level and at the global level? (2) How does a debugging intervention affect users' reliance on an AI system?


***Description of the data in this dataset***
The dataset (data.zip) consists of selected deceptive review detection tasks from the [Deception Detection](https://github.com/vivlai/deception-machine-in-the-loop) dataset (data/review_data) and user responses from our main study (data/anonymous_data). The data analysis code is provided separately (data_analysis.zip). All released data are anonymized.

reviews_p1.json and reviews_p2.json store the tasks we selected, split into two batches of similar difficulty.
All reviews are stored in JSON format, so we explain only the key-value mapping.
- reviews_p1.json
  - `id`: task ID.
  - `tokens`: the tokens (i.e., words) in the hotel review.
  - `hwords`: (list of words) highlighted tokens, generated with BERT-LIME.
  - `attributes`: (list of float numbers) word contribution to the model prediction, generated with BERT-LIME.
  - `classification`: [d | t], model prediction. `t` indicates genuine and `d` indicates deceptive.
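
The key-value mapping above can be illustrated with a minimal Python sketch. This is not the authors' loading code. The sample record is invented, and two points are assumptions rather than documented facts: that each file holds a list of review objects, and that `attributes` is aligned position by position with `tokens`. The helper name `summarize` is hypothetical.

```python
import json

# Invented sample record that follows the documented keys.
# Assumption: the real files hold a list of such objects.
sample = """
[
  {
    "id": 1,
    "tokens": ["great", "hotel", "stay"],
    "hwords": ["great"],
    "attributes": [0.42, -0.05, 0.10],
    "classification": "d"
  }
]
"""

# Documented label mapping: `t` is genuine, `d` is deceptive.
LABELS = {"t": "genuine", "d": "deceptive"}


def summarize(review):
    """Return the task ID, a readable prediction, and the strongest token.

    Assumption: `attributes[i]` is the contribution of `tokens[i]`.
    """
    pairs = list(zip(review["tokens"], review["attributes"]))
    top_token, top_weight = max(pairs, key=lambda p: abs(p[1]))
    return {
        "id": review["id"],
        "prediction": LABELS[review["classification"]],
        "top_token": top_token,
        "top_weight": top_weight,
    }


reviews = json.loads(sample)
summaries = [summarize(r) for r in reviews]
print(summaries)
```

To apply this to the released files, replace `json.loads(sample)` with `json.load` on an opened reviews_p1.json, after checking that the top-level structure matches the assumption above.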

The user responses were collected on the Prolific platform. util.py in data_analysis.zip shows how to load these CSV files.

- userinfo.csv, with header "user_id,task_order_str,reverse_flag_str,user_group,analogy_type,created_time". `task_order_str` records the task ordering for each participant. `reverse_flag_str` indicates when the misleading AI advice was shown. `user_group` indicates the user group, one of [control | accuracy | analogy]. `analogy_type` indicates which of the three analogy types [train | weather | vaccine] was provided in our experiments.
- usertask.csv, with header "user_id,task_id,answer_type,choice,created_time". Each row records a user's choice on one task. `answer_type` is one of [base | analogy | attention], indicating [user initial decision | user decision after advice | attention check]. `choice` indicates the user's choice for that answer.
- TiA_PostQ.csv, contains user_id and all questions in the trust in automation questionnaire. Each row represents all data collected from one participant.
- user_input_ati.csv, contains user_id and the 9 questions in the ATI questionnaire. Each row represents all data collected from one participant.
- user_attention_check.csv, contains user_id and the attention checks recorded in our study.
- user_input_debugging.csv, contains user_id and the task order in the debugging phase.
- user_input_nasa_tlx.csv, contains user_id and cognitive load (assessed with the NASA-TLX questionnaire).
- user_input_post_task_questionnaire_1.csv, users' assessment of their own performance and of their performance with AI assistance on task batch 1.
- user_input_post_task_questionnaire_2.csv, users' assessment of their own performance and of their performance with AI assistance on task batch 2.
- user_input_post_task_questionnaire_tia_1.csv, user results of trust in automation questionnaire after task batch 1.
- user_input_post_task_questionnaire_tia_2.csv, user results of trust in automation questionnaire after task batch 2.
- user_input_pre_task_questionnaire_tia.csv, user results for two subscales (propensity to trust and familiarity) of the trust in automation questionnaire, collected before the formal tasks.
- user_input_user_answers_phase_1.csv, contains user answers and confidence for tasks in task batch 1.
- user_input_user_answers_phase_2.csv, contains user answers and confidence for tasks in task batch 2.
- user_input_user_answers_phase_nt.csv, contains user answers and confidence for tasks in the debugging phase.
- user_input_user_data.csv, contains the user group and task batch ordering for each participant.
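
As an illustration of the usertask.csv layout described above, the following Python sketch pairs each user's initial decision (`base`) with the decision after advice (`analogy`) and skips attention checks. It is a sketch under stated assumptions, not a replacement for util.py. The sample rows are invented, the `choice` values `d`/`t` and the timestamp format are assumptions, and the helper name `decisions_by_task` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented rows using the documented header.
# Assumptions: `choice` values and the `created_time` format.
sample_csv = """user_id,task_id,answer_type,choice,created_time
u1,t1,base,d,2022-01-01 10:00:00
u1,t1,analogy,t,2022-01-01 10:01:00
u1,t2,base,t,2022-01-01 10:02:00
u1,t2,analogy,t,2022-01-01 10:03:00
u1,a1,attention,t,2022-01-01 10:04:00
"""


def decisions_by_task(rows):
    """Group choices by (user_id, task_id), keyed by answer type.

    Attention-check rows are skipped because they are not task decisions.
    """
    grouped = defaultdict(dict)
    for row in rows:
        if row["answer_type"] == "attention":
            continue
        key = (row["user_id"], row["task_id"])
        grouped[key][row["answer_type"]] = row["choice"]
    return dict(grouped)


rows = list(csv.DictReader(io.StringIO(sample_csv)))
decisions = decisions_by_task(rows)

# True where the decision after advice differs from the initial decision.
switched = {
    key: answers["base"] != answers["analogy"]
    for key, answers in decisions.items()
    if "base" in answers and "analogy" in answers
}
print(switched)
```

To use the released data, pass an opened usertask.csv file to `csv.DictReader` instead of the in-memory sample.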

***Code Usage***

The name of each script (xx.py) typically indicates its purpose. Most files can be executed with `python xx.py`.
Python code used in the data analysis (data_analysis.zip):

- analysis_confidence.py: analysis of user confidence (section 5.3.4)
- analysis_covariates.py: analysis of user covariates (section 5.3.2)
- analysis_gender_time.py: analysis of user demographics (age and gender) and time across conditions (section 5.1)
- analysis_H1.py: analysis for hypothesis 1 in the paper.
- analysis_H2.py: analysis for hypothesis 2 in the paper.
- analysis_H3.py: analysis for hypothesis 3 in the paper.
- analysis_top_quatile_performance.py: analysis of users’ estimation of AI trustworthiness (section 5.3.3)
- analysis_trust.py: exploratory analysis of debugging intervention on user trust (section 5.3.1)
- descriptive_statistics.py: distribution of variables and box plot of cognitive load (section 5.1)
- util.py: prepare necessary variables for evaluation (e.g., correct answers, user responses)