Supplementary material for the paper "Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI"
DOI: 10.4121/cb208bd8-7cf4-42d5-ae5e-9ad2c654aeb3
Supplementary material for the paper: Alam, M. S., & Bazilinskyy, P. (2025). Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI. Adjunct Proceedings of the 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutoUI). Brisbane, QLD, Australia. https://doi.org/10.1145/3744335.3758477
This study evaluates the effectiveness of large language model (LLM)-based personas for assessing external Human-Machine Interfaces (eHMIs) in automated vehicles. Thirteen models (BakLLaVA, ChatGPT-4o, DeepSeek-VL2-Tiny, Gemma3:12B, Gemma3:27B, Granite Vision 3.2, LLaMA 3.2 Vision, LLaVA-13B, LLaVA-34B, LLaVA-LLaMA-3, LLaVA-Phi3, MiniCPM-V, and Moondream) were tasked with simulating pedestrian decision-making for 227 images of vehicles equipped with an eHMI. Confidence scores (0-100) were collected under two conditions, no memory (each image assessed independently) and memory-enabled (conversation history preserved), in 15 independent trials each. The model outputs were compared with the ratings of 1,438 human participants. Gemma3:27B achieved the highest correlation with the human ratings without memory (r = 0.85), while ChatGPT-4o performed best with memory (r = 0.81). DeepSeek-VL2-Tiny and BakLLaVA showed little sensitivity to context, and LLaVA-LLaMA-3, LLaVA-Phi3, LLaVA-13B, and Moondream consistently produced limited-range outputs.
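The reported correlations are per-image comparisons between the averaged model scores and the averaged human ratings. A minimal sketch of that comparison, assuming hypothetical file and column names ("image", "gemma3_27b", "human_mean"); the actual CSV headers in data/ may differ and should be checked against the files themselves:

```python
# Minimal sketch: correlate averaged LLM confidence scores with the
# human benchmark. File and column names below are illustrative
# placeholders, not guaranteed to match the dataset.
import pandas as pd
from scipy.stats import pearsonr

llm = pd.read_csv("data/avg_without_memory.csv")            # per-image averages over 15 trials
humans = pd.read_csv("data/crowd_data/human_ratings.csv")   # hypothetical filename

merged = llm.merge(humans, on="image")                       # join on a shared image identifier
r, p = pearsonr(merged["gemma3_27b"], merged["human_mean"])
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```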
The supplementary material has the following structure:
* code/
* code/.python-version: Pins the Python interpreter version (3.9.21) for environment consistency.
* code/analysis.py: Main analysis script that processes outputs, computes statistics (e.g., correlations with human data), and produces result figures.
* code/common.py: Contains functions for configuration management, dictionary search, and data serialisation.
* code/custom_logger.py: Implements a custom logger class for handling string formatting and logging at various levels.
* code/default.config: Configuration file specifying paths for data, plotly template, and plots directory.
* code/logmod.py: Initialises and configures the logger with customisable display and storage options, supporting coloured logs, threading, and multiprocessing.
* code/main.py: Python script that produces all figures and analyses.
* code/Makefile: Defines shortcut commands for setup, running analysis, and cleaning project outputs.
* code/pyproject.toml: Defines project dependencies and metadata for the `uv` environment manager.
* code/uv.lock: Lockfile with pinned dependency versions for reproducible builds.
* code/models/
* code/models/chat_gpt.py: Wrapper for interacting with ChatGPT (Vision), including prompt formatting, sending images, and parsing responses.
* code/models/deepseek.py: Wrapper for DeepSeek-VL2 models, coordinating inference, inputs, and outputs.
* code/models/ollama.py: Interface for running local Ollama models with specific parameters (temperature, context, history); a minimal usage sketch of the two memory conditions follows this listing.
* code/deepseek_vl2/
* code/deepseek_vl2/__init__.py: Makes the deepseek_vl2 folder a package; initialises the DeepSeek-VL2 module structure.
* code/deepseek_vl2/models/: Contains model definition files for DeepSeek-VL2.
* code/deepseek_vl2/serve/: Implements server or API endpoints for running DeepSeek-VL2 inference.
* code/deepseek_vl2/utils/: Utility scripts (helper functions, preprocessing, logging, etc.) used across DeepSeek-VL2.
* data/
* data/avg_with_memory.csv: Stores the averaged model confidence scores across 15 trials (with conversation memory enabled), aggregated per image.
* data/avg_without_memory.csv: Stores the averaged model confidence scores across 15 trials (with conversation memory disabled), aggregated per image.
* data/with_memory/: Contains all the raw output files generated by the LLMs under the memory condition.
* data/with_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.
* data/without_memory/: Contains all the raw output files generated by the LLMs under the no-memory condition.
* data/without_memory/analysed/: Subdirectory that stores the numeric values extracted from the raw outputs.
* data/crowd_data/: Includes the original images shown to participants and the corresponding averaged human responses, which serve as the benchmark for comparing against LLM outputs (sourced from DOI: 10.54941/ahfe1002444).
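For the two conditions described above, here is a minimal, hypothetical sketch of how an interface like code/models/ollama.py might query a local model via the `ollama` Python client; the prompt wording, model tag, temperature, and image paths are illustrative placeholders, not values taken from the repository:

```python
# Minimal sketch of the no-memory and memory-enabled prompting conditions
# using the ollama Python client (assumes a running Ollama server).
import re
from typing import Optional

import ollama

PROMPT = ("You are a pedestrian standing in front of this automated vehicle. "
          "On a scale of 0-100, how confident are you that it is safe to "
          "cross? Answer with a single number.")
IMAGES = ["image_001.jpg", "image_002.jpg"]  # placeholder paths

def extract_score(text: str) -> Optional[int]:
    """Pull the first integer in 0-100 out of a raw reply, mirroring the kind
    of post-processing stored in the analysed/ subdirectories."""
    match = re.search(r"\b(100|[0-9]{1,2})\b", text)
    return int(match.group(1)) if match else None

# No-memory condition: each image is rated in a fresh conversation.
for img in IMAGES:
    reply = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user", "content": PROMPT, "images": [img]}],
        options={"temperature": 0.7},  # illustrative value
    )
    print(img, extract_score(reply["message"]["content"]))

# Memory condition: the conversation history is preserved across images.
history = []
for img in IMAGES:
    history.append({"role": "user", "content": PROMPT, "images": [img]})
    reply = ollama.chat(model="gemma3:27b", messages=history,
                        options={"temperature": 0.7})
    history.append({"role": "assistant", "content": reply["message"]["content"]})
    print(img, extract_score(reply["message"]["content"]))
```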
History
- 2025-09-01: first online, published
Publisher
4TU.ResearchData
Format
.jpeg; .py
Associated peer-reviewed publication
Cross or Nah? LLMs Get in the Mindset of a Pedestrian in front of Automated Car with an eHMI
Organizations
TU Eindhoven, Department of Industrial Design
Files (1)
- Supplementary material.zip (129,039,301 bytes; MD5: 9e80c364d38653431ef98663fccfa825)