***  
# Thom van der Velden – Student Help-Seeking Questions Before & After Intervention  

**Authors:** Thom van der Velden  
**Institution:** Delft University of Technology, Faculty of Technology, Policy and Management  
**Corresponding author:** Thom van der Velden  
**Contact Information:** vdveldenthom@gmail.com


### **General Introduction**  
This dataset was created for a master’s thesis investigating how high-school students’ help-seeking behavior in introductory programming changes when they are exposed to a Large Language Model (LLM) tutor. We collected student questions submitted **before** and **after** an intervention and classified each question into categories, manually and in part automatically. The goal is to enable future researchers to reproduce our analyses and to explore the impact of LLM feedback on novice programmers’ question-asking.  

Data were gathered in Spring 2024 in a Dutch secondary‐education setting, under the supervision of the TU Delft Educational Research group. All participants provided informed consent, and the study was approved by the TU Delft ethics committee.  

***  

### **Purpose of the Dataset**  
To provide:  
- A corpus of real student questions posed to an LLM-based tutor both **prior to** and **after** an instructional intervention.  
- A human-annotated codebook linking each question to behavioral categories (e.g., copying questions from the chatbot, implementation questions, understanding).  

***  

### **Description of the Files**  

1. **Questions Before Intervention.xlsx**  
   - **Contents:**  Raw text of student questions before intervention.  
   - **Columns:**  
     - `Participant id` (integer): anonymized student identifier  
     - `Message` (string): full question text submitted  
     - `Manual Classification` (string): assigned code name  

2. **Questions After Intervention.xlsx**  
   - **Contents:**  Raw text of student questions after intervention.  
   - **Columns:**  
     - `Participant id` (integer): matches pre-intervention ID  
     - `Message` (string): full question text submitted  
     - `Manual classification` (string): assigned code name  

3. **Thesis Codes.xlsx**  
   - **Contents:**  Codebook mapping each classification label to descriptive comments and high-level groups.  
   - **Columns:**  
     - `name` (string): code name  
     - `comment` (string): guidelines/description for code  
     - `codegroup 1` (string): thematic bucket (instrumental, executive or unclear)

4. **predictions.csv**
   - **Contents:**  Model-predicted categories and types for student questions, alongside the true (human) labels.  
   - **Columns:**  
     - `question` (string): the original question text  
     - `true_category` (string): human-annotated high-level category  
     - `pred_category` (string): model’s predicted high-level category  
     - `true_type` (string): human-annotated question type (e.g., instrumental, other)  
     - `pred_type` (string): model’s predicted question type  
     - `merged_true` (string): human-annotated label after the code comprehension and concept comprehension categories are merged into one group  
     - `merged_pred` (string): model’s predicted label under the same merged grouping  
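
As an illustration of how the column pairs in **predictions.csv** can be compared, the following minimal sketch computes the share of questions where the model's label equals the human label. It uses only the Python standard library. The `accuracy` helper, the sample rows, and the labels `cat_a` and `cat_b` are made up for this example and are not taken from the dataset; stripping surrounding whitespace before comparing is likewise an assumption, not a documented processing step.

```python
import csv
import io


def accuracy(rows, true_col, pred_col):
    """Fraction of rows where the predicted label equals the human label."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    # Assumption: labels are compared after stripping surrounding whitespace.
    hits = sum(1 for r in rows if r[true_col].strip() == r[pred_col].strip())
    return hits / len(rows)


# Made-up sample in the documented column layout (NOT real data).
SAMPLE = """question,true_category,pred_category,true_type,pred_type,merged_true,merged_pred
"How do I write a loop?",cat_a,cat_a,instrumental,instrumental,cat_a,cat_a
"Can you give me the answer?",cat_b,cat_a,other,other,cat_b,cat_a
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))

category_acc = accuracy(sample_rows, "true_category", "pred_category")
type_acc = accuracy(sample_rows, "true_type", "pred_type")
merged_acc = accuracy(sample_rows, "merged_true", "merged_pred")

print(f"category: {category_acc:.2f}, type: {type_acc:.2f}, merged: {merged_acc:.2f}")
```

To score the real file, the sample rows could be replaced with `list(csv.DictReader(f))`, where `f` is **predictions.csv** opened with `newline=""`; the file's text encoding is not stated here and may need to be set explicitly.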

***  

### **Data Collection & Processing**  
- **Collection:** Students interacted with a web-based LLM tutor. Questions were logged automatically.  
- **Extraction:** Raw logs were filtered to extract only the student’s question text and anonymized IDs.  
- **Coding:** Two independent annotators applied the codebook in **Thesis Codes.xlsx**. Discrepancies were resolved through consensus.  

***  

### **Data Structure & Relationships**  
- Use `Participant id` to link pre- and post-intervention questions.  
- Join `Manual Classification` (listed as `Manual classification` in the after-intervention file) with `name` in **Thesis Codes.xlsx** to retrieve `comment` and `codegroup 1`.  
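
The two relationships above can be sketched in plain Python as follows. This is a minimal illustration, assuming each spreadsheet has already been loaded into a list of dictionaries keyed by column name (the loading step is left out because it depends on the spreadsheet library used). The helper names `find_column`, `attach_codebook`, and `pair_by_participant`, and all sample rows and labels, are hypothetical. The classification column is matched case-insensitively because its capitalization differs between the two question files as listed in this README.

```python
from collections import defaultdict


def find_column(row, wanted):
    """Return the key in `row` matching `wanted`, ignoring case and outer spaces."""
    target = wanted.strip().lower()
    for key in row:
        if key.strip().lower() == target:
            return key
    raise KeyError(wanted)


def attach_codebook(questions, codebook):
    """Left-join question rows to codebook rows on the classification label."""
    by_name = {entry["name"]: entry for entry in codebook}
    joined = []
    for row in questions:
        label = row[find_column(row, "Manual Classification")]
        entry = by_name.get(label, {})  # unknown labels keep None values
        joined.append({
            **row,
            "comment": entry.get("comment"),
            "codegroup 1": entry.get("codegroup 1"),
        })
    return joined


def pair_by_participant(before, after):
    """Group question rows per participant, split into before/after phases."""
    paired = defaultdict(lambda: {"before": [], "after": []})
    for phase, rows in (("before", before), ("after", after)):
        for row in rows:
            paired[row["Participant id"]][phase].append(row)
    return dict(paired)


# Made-up rows in the documented column layout (NOT real data).
codebook = [{"name": "code_x", "comment": "placeholder guideline",
             "codegroup 1": "instrumental"}]
before = [{"Participant id": 1, "Message": "toy question",
           "Manual Classification": "code_x"}]
after = [{"Participant id": 1, "Message": "another toy question",
          "Manual classification": "code_y"}]

joined_before = attach_codebook(before, codebook)
joined_after = attach_codebook(after, codebook)
paired = pair_by_participant(joined_before, joined_after)
```

Labels that do not appear in the codebook are kept with empty `comment` and `codegroup 1` values rather than dropped, so mismatches stay visible. With pandas, the same left join can be expressed with `DataFrame.merge` using `left_on` for the classification column and `right_on="name"`.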

***  

### **Sharing & Access**  
**License:** CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/)  
**Use restrictions:** For academic and educational research only; commercial reuse requires prior permission.  

***  
