Supplementary material for the paper "Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI"

DOI: 10.4121/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3.v1
The DOI displayed above is for this specific version of this dataset, which is currently the latest. Newer versions may be published in the future. For a link that will always point to the latest version, please use
DOI: 10.4121/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3

Datacite citation style

Alam, Md Shadab; Bazilinskyy, Pavlo (2025): Supplementary material for the paper "Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI". Version 1. 4TU.ResearchData. dataset. https://doi.org/10.4121/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3.v1
Other citation styles (APA, Harvard, MLA, Vancouver, Chicago, IEEE) are available at Datacite

Dataset

Supplementary material for the paper: Alam, M. S., & Bazilinskyy, P. (2025). Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI. Adjunct Proceedings of the 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutoUI). Brisbane, QLD, Australia. https://doi.org/10.1145/3744335.3758477


This study evaluates the effectiveness of large language model (LLM)-based personas for assessing external Human-Machine Interfaces (eHMIs) on automated vehicles. Thirteen models, namely BakLLaVA, ChatGPT-4o, DeepSeek-VL2-Tiny, Gemma3:12B, Gemma3:27B, Granite Vision 3.2, LLaMA 3.2 Vision, LLaVA-13B, LLaVA-34B, LLaVA-LLaMA-3, LLaVA-Phi3, MiniCPM-V, and Moondream, were tasked with simulating pedestrian decision-making for 227 images of vehicles equipped with an eHMI. Confidence scores (0-100) were collected under two conditions: no memory (each image assessed independently) and memory-enabled (conversation history preserved), with 15 independent trials per condition. The model outputs were compared with the ratings of 1,438 human participants. Gemma3:27B achieved the highest correlation with human ratings without memory (r = 0.85), while ChatGPT-4o performed best with memory (r = 0.81). DeepSeek-VL2-Tiny and BakLLaVA showed little sensitivity to context, and LLaVA-LLaMA-3, LLaVA-Phi3, LLaVA-13B, and Moondream consistently produced limited-range output.
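
To illustrate how correlations of this kind can be computed from the averaged files included in this dataset (described below), here is a minimal sketch using pandas and SciPy. The human benchmark file name and all column names are hypothetical placeholders rather than the actual headers used in the dataset.

```python
# Minimal sketch: correlate averaged LLM confidence scores with averaged human ratings.
# The human benchmark file name and all column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

model_scores = pd.read_csv("data/avg_without_memory.csv")    # averaged LLM confidence per image
human_scores = pd.read_csv("crowd_data/human_averages.csv")  # hypothetical benchmark file

merged = model_scores.merge(human_scores, on="image")        # hypothetical shared image identifier
r, p = pearsonr(merged["gemma3_27b"], merged["human_avg"])   # hypothetical column names
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```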


The dataset has the following structure:

* code/

* code/.python-version: Pins the Python interpreter version (3.9.21) for environment consistency.

* code/analysis.py: Main analysis script that processes outputs, computes statistics (e.g., correlations with human data), and produces result figures.

* code/common.py: Contains functions for configuration management, dictionary search, and data serialisation.

* code/custom_logger.py: Implements a custom logger class for handling string formatting and logging at various levels.

* code/default.config: Configuration file specifying paths for data, plotly template, and plots directory.

* code/logmod.py: Initialises and configures the logger with customisable display and storage options, supporting coloured logs, threading, and multiprocessing.

* code/main.py: Python script that produces all figures and analyses.

* code/Makefile: Defines shortcut commands for setup, running analysis, and cleaning project outputs.

* code/pyproject.toml: Defines project dependencies and metadata for the `uv` environment manager.

* code/uv.lock: Lockfile with pinned dependency versions for reproducible builds.

* code/models/

* code/models/chat_gpt.py: Wrapper for interacting with ChatGPT (Vision), including prompt formatting, sending images, and parsing responses.

* code/models/deepseek.py: Wrapper for DeepSeek-VL2 models, coordinating inference, inputs, and outputs.

* code/models/ollama.py: Interface to run local Ollama models with specific parameters (temperature, context, history); a minimal usage sketch is shown after this file listing.

* code/deepseek_vl2/

* code/deepseek_vl2/__init__.py: Makes the deepseek_vl2 folder a package; initialises the DeepSeek-VL2 module structure.

* code/deepseek_vl2/models/: Contains model definition files for DeepSeek-VL2.

* code/deepseek_vl2/serve/: Implements server or API endpoints for running DeepSeek-VL2 inference.

* code/deepseek_vl2/utils/: Utility scripts (helper functions, preprocessing, logging, etc.) used across DeepSeek-VL2.

* data/

* data/avg_with_memory.csv: Stores the averaged model confidence scores across 15 trials (with conversation memory enabled), aggregated per image.

* data/avg_without_memory.csv: Stores the averaged model confidence scores across 15 trials (without conversation memory), aggregated per image.

* data/with_memory/: Contains all the raw output files directly generated by the LLMs under the memory condition.

* data/with_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.

* data/without_memory/: Contains all the raw output files generated by the LLMs under the no-memory condition.

* data/without_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.

* crowd_data/: Contains the original images shown to participants and the corresponding averaged human responses, which serve as the benchmark against which LLM outputs are compared (sourced from DOI: 10.54941/ahfe1002444).
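
As a complement to the description of code/models/ollama.py above, the following is a minimal sketch of how a local Ollama vision model might be queried for a 0-100 crossing-confidence score using the ollama Python package. The model tag, prompt wording, image path, and temperature are assumptions for illustration and may differ from what the study's scripts actually use.

```python
# Minimal sketch (not the study's exact code): elicit a 0-100 crossing-confidence
# score for one eHMI image from a locally pulled Ollama vision model.
import ollama

PROMPT = (
    "You are a pedestrian standing at the curb. Based on the vehicle and its eHMI "
    "shown in the image, how confident are you (0-100) that it is safe to cross? "
    "Reply with a single number."
)  # hypothetical prompt wording

messages = [{
    "role": "user",
    "content": PROMPT,
    "images": ["crowd_data/example_image.jpg"],  # hypothetical image path
}]

response = ollama.chat(
    model="llava:13b",             # any locally pulled vision model tag
    messages=messages,
    options={"temperature": 0.7},  # hypothetical sampling temperature
)
print(response["message"]["content"])  # raw reply; the numeric score is parsed downstream

# For the memory-enabled condition, the conversation history would be preserved by
# appending the assistant reply and the next image prompt to the same messages list.
```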

History

  • 2025-09-01 first online, published, posted

Publisher

4TU.ResearchData

Format

.jpeg; .py

Organizations

TU Eindhoven, Department of Industrial Design
