Analysis question results

Author: Martin Dierikx
Date: 26-09-2023
Required files: data/preprocessed_database_data.csv
Output files: none

This file contains the code that generates the results for the analysis questions, shown in Figures 4.7, 4.8, 4.9, 4.10, 4.11, 4.12, 4.13, and 4.14 and in Table 4.3.

All the needed imports:

In [1]:
import math
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats as sc
from sklearn import metrics

Global variables needed for the analysis:

In [2]:
# The number of values for each feature with format [self-motivation, self-efficacy, context state]
state_sizes = [2, 2, 2]

# Lookup tables to convert a reported mood into a number between 0 and 10 for use in the context state
possible_moods = ['glad', 'happy', 'pleased', 'delighted', 'serene', 'content', 'satisfied', 'relaxed', 'calm',
                  'excited', 'astonished', 'aroused', 'sleepy', 'neutral', 'tired', 'tense', 'alarmed', 'afraid',
                  'droopy', 'bored', 'angry', 'annoyed', 'frustrated', 'distressed', 'depressed', 'sad', 'gloomy',
                  'miserable']
mood_lookup = [10, 10, 9, 9, 9, 9, 9, 9, 9, 8, 7, 7, 5, 5, 5, 5, 5, 4, 3, 3, 3, 3, 2, 1, 1, 1, 1, 0]
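# For example, mood_lookup[possible_moods.index('happy')] == 10 and mood_lookup[possible_moods.index('miserable')] == 0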

# The standard (base) reward per action, used by the reward functions that include a reward per action
action_lookup = [0.8, 0.85, 0.9, 0.95, 1]

# The discount factor and epsilon used in the value iteration
discount = 0.85
epsilon = 0.01

# List all the possible states and actions
possible_actions = ['dec', 'sdec', 'nothing', 'sinc', 'inc']
possible_states = []
for sm in range(state_sizes[0]):
    for se in range(state_sizes[1]):
        for cs in range(state_sizes[2]):
            possible_states.append(f"[{sm}, {se}, {cs}]")
            
# The file with the pre-processed database data
data_filename = "../data/preprocessed_database_data.csv"

Function to calculate the reward:

In [3]:
def calculateReward(steps_taken, recommended_goal, action, reward_function_to_use, goal_achievability):
    """
    Function to calculate the reward for a sample.
    
    Args:
        steps_taken: number of steps taken in the sample.
        recommended_goal: recommended goal to achieve in the sample.
        action: reinforcement learning action taken in the sample.
        reward_function_to_use: number indicating which reward function to use for the calculation of the rewards:
        - 0: default reward,
        - 1: include perceived goal achievability,
        - 2: include standard reward per action,
        - 3: include goal achievability and standard reward per action,
        - 4: default reward without penalty for overachieving.
        goal_achievability: perceived goal achievability in the sample.

    Returns: calculated reward belonging to the sample.
    """
    diff = abs(recommended_goal - steps_taken)
    # If the goal was achieved (overachieved), the difference in steps is penalized less
    if recommended_goal < steps_taken:
        diff = 0.5 * diff

    # The standard reward function
    if reward_function_to_use == 0:
        reward = 1 - diff / recommended_goal
        return reward
    
    # Include goal achievability in the reward function
    elif reward_function_to_use == 1:
        reward = 1 - diff / recommended_goal + 0.1 * goal_achievability
        return reward
    
    # Include standard reward per action in the reward function
    elif reward_function_to_use == 2:
        reward = action_lookup[action] - diff / recommended_goal
        return reward
    
    # Include both goal achievability and standard reward per action in the reward function
    elif reward_function_to_use == 3:
        reward = action_lookup[action] - diff / recommended_goal + 0.1 * goal_achievability
        return reward
    
    # Remove the penalty for overachieving
    else:
        if recommended_goal < steps_taken:
            reward = 1
        else:
            reward = 1 - diff / recommended_goal
        return reward
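
# Quick sanity check of the default reward function (illustrative, made-up numbers): with a recommended goal of 5000
# steps and 6000 steps taken, overachieving halves the difference (0.5 * 1000 = 500), so the default reward (option 0)
# is 1 - 500/5000 = 0.9. The goal-achievability argument (7 here) is ignored by the default reward function.
assert abs(calculateReward(6000, 5000, 2, 0, 7) - 0.9) < 1e-9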

Function to extract data from the database data file:

In [4]:
def read_data(index_to_leave_out, reward_function_to_use):
    """
    Method to read the sample data from the csv file and put it into variables.
    If index_to_leave_out is specified, the samples with that pid will be retrieved as left-out samples for analysis.

    Args:
        index_to_leave_out: if specified, the samples with that pid will be retrieved as left-out samples for analysis.
        reward_function_to_use: number indicating which reward function to use for the calculation of the rewards:
        - 0: default reward,
        - 1: include perceived goal achievability,
        - 2: include standard reward per action,
        - 3: include goal achievability and standard reward per action.

    Returns:
        - a list of states for every sample,
        - a list of actions for every sample,
        - a list of next states for every sample,
        - a list of rewards for every sample,
        - a table with the mean rewards for each state-action pair,
        - a list with states for every sample of the (possible) left-out person,
        - a list with actions for every sample of the (possible) left-out person,
        - a list with rewards for every sample of the (possible) left-out person,
        - a list with next states for every sample of the (possible) left-out person,
        - a lookup list with the transformed values (0-1) for every plain self-motivation value (0-10),
        - a lookup list with the transformed values (0-1) for every plain self-efficacy value (0-10),
        - a lookup list with the transformed values (0-1) for every plain context state value (0-30),
        - a list with session_numbers for every sample
    """
    # Lists for all self-motivation, self-efficacy, and context state values
    sm_list = []
    se_list = []
    cs_list = []

    # Lists for the other information on the data samples
    full_states = []
    transformed_states = []
    actions = []
    full_next_states = []
    transformed_next_states = []
    rewards = []
    steps_taken = []
    recommended_goals = []
    session_numbers = []

    # Lists for the information on the left-out samples
    loocv_datapoints = []
    loocv_actions = []
    loocv_rewards = []
    loocv_next_states = []

    # Read the data from the file
    for d in open(data_filename):
        data = d.split(",")
        # Skip the header row
        if data[0] != "ID":
            # Check if this data point needs to be left out for analysis
            if data[0] == "p" + str(index_to_leave_out):
                # State data
                loocv_datapoints.append([int(data[1]), int(data[2]), int(data[3]), int(data[4]), mood_lookup[possible_moods.index(data[5])]])
                sm_list.append(int(data[1]))
                se_list.append(int(data[2]))
                cs_list.append(int(data[3]) + int(data[4]) + mood_lookup[possible_moods.index(data[5])])
                
                # Action data
                loocv_actions.append(possible_actions.index(data[6]))
                
                # Reward data
                prev_activity = data[13].split(";")
                prev_steps = []
                for steps in prev_activity:
                    prev_steps.append(int(steps))
                prev_steps.sort()
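                # Recommended goal: the 60th percentile of the previous step counts, rounded up to the
                # nearest 100 and clipped to the range [2000, 10000]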
                recommended_goal = max(2000, min(10000, int(math.ceil(np.percentile(prev_steps, 60) / 100.0)) * 100))
                loocv_rewards.append(calculateReward(int(data[14]), recommended_goal, possible_actions.index(data[6]), reward_function_to_use, int(data[15])))
                
                # Next state data
                loocv_next_states.append([int(data[7]), int(data[8]), int(data[9]), int(data[10]), mood_lookup[possible_moods.index(data[11])]])
            
            else:
                # State data
                self_motivation = int(data[1])
                sm_list.append(self_motivation)
                self_efficacy = int(data[2])
                se_list.append(self_efficacy)
                rest = int(data[3])
                available_time = int(data[4])
                mood = mood_lookup[possible_moods.index(data[5])]
                cs_list.append(rest + available_time + mood)
                full_states.append([self_motivation, self_efficacy, rest, available_time, mood])

                # Action data
                action = possible_actions.index(data[6])
                actions.append(action)

                # Reward data
                prev_activity = data[13].split(";")
                prev_steps = []
                for steps in prev_activity:
                    prev_steps.append(int(steps))
                prev_steps.sort()
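                # Recommended goal: same rule as above (60th percentile of previous steps, rounded up to the
                # nearest 100, clipped to [2000, 10000])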
                recommended_goal = max(2000, min(10000, int(math.ceil(np.percentile(prev_steps, 60) / 100.0)) * 100))
                initial_reward = calculateReward(int(data[14]), recommended_goal, action, reward_function_to_use, int(data[15]))
                rewards.append(initial_reward)
                steps_taken.append(int(data[14]))
                recommended_goals.append(recommended_goal)

                # Next state data
                self_motivation_next = int(data[7])
                self_efficacy_next = int(data[8])
                rest_next = int(data[9])
                available_time_next = int(data[10])
                mood_next = mood_lookup[possible_moods.index(data[11])]
                full_next_states.append([self_motivation_next, self_efficacy_next, rest_next, available_time_next, mood_next])

                session_numbers.append(int(data[20].strip('\n')))

    # Get the transformations of the self-motivation
    sm_lookup = []
    sm_percentiles = []
    for i in range(11):
        percentile = sc.percentileofscore(sm_list, i, 'weak')
        sm_percentiles.append(percentile)
        # Transform into a 1 if the value is in the upper 50% of the data
        if percentile > 50:
            sm_lookup.append(1)
        else:
            sm_lookup.append(0)

    # Get the transformations of the self-efficacy
    se_lookup = []
    se_percentiles = []
    for i in range(11):
        percentile = sc.percentileofscore(se_list, i, 'weak')
        se_percentiles.append(percentile)
        # Transform into a 1 if the value is in the upper 50% of the data
        if percentile > 50:
            se_lookup.append(1)
        else:
            se_lookup.append(0)

    # Get the transformations of the context state
    cs_lookup = []
    cs_percentiles = []
    for i in range(31):
        percentile = sc.percentileofscore(cs_list, i, 'weak')
        cs_percentiles.append(percentile)
        # Transform into a 1 if the value is in the upper 50% of the data
        if percentile > 50:
            cs_lookup.append(1)
        else:
            cs_lookup.append(0)

    # Get the mean rewards for each state-action pair
    mean_rewards = [[0 for i in range(len(possible_actions))] for j in range(state_sizes[2] * state_sizes[1] * state_sizes[0])]
    state_action_pair_counter = [[0 for i in range(len(possible_actions))] for j in range(state_sizes[2] * state_sizes[1] * state_sizes[0])]
    for i in range(len(rewards)):
        # Transform the state
        transformed_sm = sm_lookup[full_states[i][0]]
        transformed_se = se_lookup[full_states[i][1]]
        transformed_cs = cs_lookup[full_states[i][2] + full_states[i][3] + full_states[i][4]]
        transformed_states.append([transformed_sm, transformed_se, transformed_cs])
        
        # Make a unique number from the state features to use it as index
        state_as_number = transformed_sm * state_sizes[1] * state_sizes[2] + transformed_se * state_sizes[2] + transformed_cs

        # Transform the next state
        transformed_next_sm = sm_lookup[full_next_states[i][0]]
        transformed_next_se = se_lookup[full_next_states[i][1]]
        transformed_next_cs = cs_lookup[full_next_states[i][2] + full_next_states[i][3] + full_next_states[i][4]]
        transformed_next_states.append([transformed_next_sm, transformed_next_se, transformed_next_cs])

        # Keep track of the mean reward R(s,a)
        reward = mean_rewards[state_as_number][actions[i]]
        state_action_pair_encounters = state_action_pair_counter[state_as_number][actions[i]]

        if state_action_pair_encounters == 0:
            # The first encounter for this state-action pair, so just use the sample reward as mean reward
            mean_rewards[state_as_number][actions[i]] = rewards[i]
            state_action_pair_counter[state_as_number][actions[i]] = 1
        else:
            # Update the previously saved mean reward
            total_reward = reward * state_action_pair_encounters + rewards[i]
            state_action_pair_counter[state_as_number][actions[i]] += 1
            mean_rewards[state_as_number][actions[i]] = total_reward / (state_action_pair_encounters + 1)

    # Return the data
    return transformed_states, actions, transformed_next_states, rewards, mean_rewards, loocv_datapoints, loocv_actions, loocv_rewards, loocv_next_states, sm_lookup, se_lookup, cs_lookup, session_numbers

Calculate the number of samples for each state-action pair

If this count is lower than the target sample size divided by the number of state-action pairs (380 / 40 ≈ 10), we balance the lack of samples by blending in the overall mean reward (for rewards) or a uniform distribution (for transitions), as sketched below.
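In formula form: with n < 10 samples, the balanced value is (n / 10) times the observed value plus ((10 - n) / 10) times a fallback value. A toy illustration with made-up numbers:

In [ ]:
# Toy illustration (made-up numbers) of the balancing rule used throughout this notebook
n = 4                        # samples observed for some state-action pair (below the target of 10)
observed_mean_reward = 0.9   # mean reward computed from those n samples
overall_mean_reward = 0.7    # fallback: mean reward over all samples
balanced = (n / 10) * observed_mean_reward + ((10 - n) / 10) * overall_mean_reward  # 0.78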

In [5]:
# Extract the sample data
states, actions, next_states, rewards, mean_rewards, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 0)

# Count how many times each state-action pair occurs
sample_count = [[0 for i in range(len(possible_actions))] for j in range(len(possible_states))]
state_to_number_lookup = [state_sizes[1] * state_sizes[2], state_sizes[2], 1]
for i in range(len(states)):
    state_as_number = state_to_number_lookup[0] * states[i][0] + state_to_number_lookup[1] * states[i][1] + states[i][2]
    sample_count[state_as_number][actions[i]] += 1
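
# Optional sanity check: the flattened state index 4*sm + 2*se + cs used above matches the
# enumeration order of possible_states
for sm in range(state_sizes[0]):
    for se in range(state_sizes[1]):
        for cs in range(state_sizes[2]):
            index = state_to_number_lookup[0] * sm + state_to_number_lookup[1] * se + state_to_number_lookup[2] * cs
            assert possible_states[index] == f"[{sm}, {se}, {cs}]"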

Function to calculate the transition table:

In [6]:
def calculateTransitions(states, actions, next_states):
    """
    Method to calculate the transition function from the data.
    
    Args:
        states: list with the states from all the samples in the data.
        actions: list with the actions from all the samples in the data.
        next_states: list with the next states from all the samples in the data.

    Returns: three-dimensional list containing a list of probabilities per next state for each state-action pair.
    """
    # Every state-action pair has a list with probabilities for each next state
    transitions = [[[] for i in range(len(possible_actions))] for j in range(len(possible_states))]
    
    # Get the next state for each datapoint and sort them by state-action pair
    for i in range(len(states)):
        state_as_number = state_to_number_lookup[0] * states[i][0] + state_to_number_lookup[1] * states[i][1] + state_to_number_lookup[2] * states[i][2]
        action_as_number = actions[i]
        transitions[state_as_number][action_as_number].append(next_states[i])

    # Calculate the probability of each next state happening for each state-action pair
    for i in range(len(possible_states)):
        for j in range(len(possible_actions)):
            next_states_probs = [0 for l in range(len(possible_states))]
            number_of_next_states = len(transitions[i][j])
            
            for k in range(number_of_next_states):
                next_state = transitions[i][j][k]
                next_state_as_number = state_to_number_lookup[0] * next_state[0] + state_to_number_lookup[1] * next_state[1] + state_to_number_lookup[2] * next_state[2]
                
                if next_states_probs[next_state_as_number] == 0:
                    # Count how many of the transitions of a state-action pair are towards a specific next state
                    number_of_occurrences_of_next_state = transitions[i][j].count(next_state)
                    next_states_probs[next_state_as_number] = number_of_occurrences_of_next_state / number_of_next_states
            
            transitions[i][j] = next_states_probs
    
    # Balance the transition probabilities for the state-action pairs with too little data
    for i in range(len(possible_states)):
        for j in range(len(possible_actions)):
            if sample_count[i][j] < 10:
                for k in range(len(possible_states)):
                    transitions[i][j][k] = (sample_count[i][j] / 10) * transitions[i][j][k] + ((10 - sample_count[i][j]) / 10) * (1 / len(possible_states))
    
    return transitions
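
# Optional sanity check (illustration only, using the data loaded above): every state-action row of the
# transition table should sum to (approximately) 1, including the rows blended towards a uniform distribution
for action_rows in calculateTransitions(states, actions, next_states):
    for next_state_probs in action_rows:
        assert abs(sum(next_state_probs) - 1) < 1e-9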

Function to calculate the q-values:

In [7]:
def calculateQValues(mean_rewards, transitions):
    """
    Method to calculate the q-values using value iteration.
    
    Args:
        mean_rewards: list with mean rewards for each state-action pair.
        transitions: list containing the probabilities for each transition for each state-action pair.

    Returns: list containing the q-values for each state-action pair.
    """
    delta = 10
    q_values = [[0 for i in range(len(possible_actions))] for j in range(len(possible_states))]
    
    # Continue for as long as there is a significant change in the q-table after an iteration
    while delta > epsilon:
        delta = 0
        next_qvalues = [[0 for i in range(len(possible_actions))] for j in range(len(possible_states))]
        for i in range(len(possible_states)):
            for j in range(len(possible_actions)):
                future_reward = 0
                
                # Calculate the future rewards for each next state
                for k in range(len(possible_states)):
                    if transitions[i][j][k] != 0:
                        best_future_reward = 0
                        
                        # Use the future action that gives the highest q-value
                        for l in range(len(possible_actions)):
                            best_future_reward = max(best_future_reward, q_values[k][l])
                        
                        future_reward += transitions[i][j][k] * best_future_reward
                
                # Calculate the new q-values and calculate the change in values
                next_qvalues[i][j] = mean_rewards[i][j] + discount * future_reward
                delta = max(delta, abs(next_qvalues[i][j] - q_values[i][j]))
        
        # Update the q-values
        q_values = next_qvalues.copy()

    return q_values
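
# Example usage (a sketch mirroring the analysis cells below, but using the unbalanced mean rewards from above;
# the analysis cells first balance the rewards of under-sampled state-action pairs):
# fit the transition model, run value iteration, and read off the greedy policy per state
example_transitions = calculateTransitions(states, actions, next_states)
example_q_values = calculateQValues(mean_rewards, example_transitions)
example_optimal_policy = [q_row.index(max(q_row)) for q_row in example_q_values]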

Analysis question 1, Figures 4.7 and 4.12:

We predicted people's rewards using two approaches: basing the prediction on the training samples with the same action only, and basing it on the training samples with both the same action and the same state.

After that, we repeated the analysis with the penalty for overachieving removed from the reward function.
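As a reminder of the metric, the reported prediction error is the L1 (absolute) difference between the observed reward of a left-out sample and the mean reward of the matching training samples. A toy illustration with made-up numbers:

In [ ]:
# Toy illustration (made-up numbers) of the prediction and its L1 error
observed_reward = 0.8
matching_training_rewards = [0.6, 0.9, 0.75]  # hypothetical samples sharing the same action (and state)
prediction = sum(matching_training_rewards) / len(matching_training_rewards)  # 0.75
l1_error = abs(observed_reward - prediction)  # 0.05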

In [8]:
print("Analysis Question 1:")

# Keep track of the prediction error per state when predicting based on only actions or both actions and states
total_prediction_error_actions_per_state = [[] for i in range(len(possible_states))]
total_prediction_error_actions_states_per_state = [[] for i in range(len(possible_states))]

# Leave one person out and predict for the samples of that person
for iteration in range(118):
    # Extract the sample data with one left-out person
    states, actions, next_states, rewards, mean_rewards, loocv_states, loocv_actions, loocv_rewards, loocv_next_states, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(iteration, 0)

    # Calculate prediction errors for every sample of the left-out person
    for j in range(len(loocv_states)):
        total_reward_action = 0
        total_reward_action_state = 0
        action_counter = 0
        action_state_counter = 0
        loocv_state_as_number = state_to_number_lookup[0] * sm_lookup[loocv_states[j][0]] + state_to_number_lookup[1] * se_lookup[loocv_states[j][1]] + cs_lookup[loocv_states[j][2] + loocv_states[j][3] + loocv_states[j][4]]
        
        for i in range(len(actions)):
            if actions[i] == loocv_actions[j]:
                # Sum the rewards to base the prediction on if the action is the same
                total_reward_action += rewards[i]
                action_counter += 1
                
                if state_to_number_lookup[0] * states[i][0] + state_to_number_lookup[1] * states[i][1] + states[i][2] == loocv_state_as_number:
                    # Sum the rewards to base the prediction on if the action and the state are the same
                    total_reward_action_state += rewards[i]
                    action_state_counter += 1

        # Balance the total reward if the number of samples for a certain state-action pair is too little            
        number_of_samples_for_action_state = sample_count[loocv_state_as_number][loocv_actions[j]]
        if number_of_samples_for_action_state < 10:
            total_reward_action += (10 - number_of_samples_for_action_state) * np.mean(rewards)
            total_reward_action_state += (10 - number_of_samples_for_action_state) * np.mean(rewards)
            action_counter += 10 - number_of_samples_for_action_state
            action_state_counter += 10 - number_of_samples_for_action_state

        # Predict the reward for the sample and calculate the prediction error
        average_reward_action_state = total_reward_action_state / action_state_counter
        average_reward_action = total_reward_action / action_counter
        l1_error_action = abs(loocv_rewards[j] - average_reward_action)
        l1_error_action_and_state = abs(loocv_rewards[j] - average_reward_action_state)
        total_prediction_error_actions_per_state[loocv_state_as_number].append(l1_error_action)
        total_prediction_error_actions_states_per_state[loocv_state_as_number].append(l1_error_action_and_state)

average_prediction_error_actions_per_state_graph = []
average_prediction_error_actions_states_per_state_graph = []
# Lists for calculating the credibility intervals for the prediction error
means1 = []
means2 = []

for i in range(len(possible_states)):
    # Calculate the average prediction error per state
    average_prediction_error_actions_per_state_graph.append(sum(total_prediction_error_actions_per_state[i]) / len(total_prediction_error_actions_per_state[i]))
    average_prediction_error_actions_states_per_state_graph.append(sum(total_prediction_error_actions_states_per_state[i]) / len(total_prediction_error_actions_states_per_state[i]))

    # Calculate the 95% credibility intervals
    mean, variance, std = sc.bayes_mvs(total_prediction_error_actions_per_state[i], 0.95)
    means1.append(abs(mean[1][0] - mean[1][1]) * 0.5)
    mean, variance, std = sc.bayes_mvs(total_prediction_error_actions_states_per_state[i], 0.95)
    means2.append(abs(mean[1][0] - mean[1][1]) * 0.5)

# Extract the sample data without leaving a person out
states, actions, next_states, rewards, mean_rewards, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 0)

# Calculate the mean reward per state after balancing the state-action pairs with too little data
mean_rewards_per_state = []
for i in range(len(possible_states)):
    total_reward_per_state = 0
    for j in range(len(possible_actions)):
        if sample_count[i][j] < 10:
            total_reward_per_state += (sample_count[i][j] / 10) * mean_rewards[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards)
        else:
            total_reward_per_state += mean_rewards[i][j]
    total_reward_per_state = total_reward_per_state / len(possible_actions)
    mean_rewards_per_state.append(total_reward_per_state)
# Calculate the overall mean reward
average_reward_total = [np.mean(rewards) for k in range(len(possible_states))]

# Plot the prediction errors
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(len(possible_states))]
x2 = [i + 0.4 for i in x1]
x3 = [i + 0.2 for i in x1]
plt.bar(x1, average_prediction_error_actions_per_state_graph, width=0.4, label='action', color='lightblue', yerr=means1)
plt.bar(x2, average_prediction_error_actions_states_per_state_graph, width=0.4, label='action and state', color='orange', hatch='/', yerr=means2, edgecolor='white')
plt.xticks(x3, possible_states)
plt.subplots_adjust(bottom=0.2)
plt.ylim(0.00, 0.35)
plt.legend(loc='upper center')
plt.xlabel("State [self-motivation, self-efficacy, context state]")
plt.ylabel("Mean prediction error")

# Plot the mean rewards
axes2 = plt.twinx()
axes2.plot(x3, mean_rewards_per_state, label='per state', color='grey')
axes2.plot(x3, average_reward_total, label='overall', color='black', linestyle='dashed')
axes2.set_ylabel("Mean reward")
axes2.set_ylim(0.0, 1.0)
plt.legend(loc='upper right')
plt.show()



print("Analysis Question 1 without penalty for overachieving:")

# Keep track of the prediction error per state when predicting based on only actions or both actions and states
total_prediction_error_actions_per_state = [[] for i in range(len(possible_states))]
total_prediction_error_actions_states_per_state = [[] for i in range(len(possible_states))]

# Leave one person out and predict for the samples of that person
for iteration in range(118):
    # Extract the sample data with one left-out person
    states, actions, next_states, rewards, mean_rewards, loocv_states, loocv_actions, loocv_rewards, loocv_next_states, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(iteration, 4)

    # Calculate prediction errors for every sample of the left-out person
    for j in range(len(loocv_states)):
        total_reward_action = 0
        total_reward_action_state = 0
        action_counter = 0
        action_state_counter = 0
        loocv_state_as_number = state_to_number_lookup[0] * sm_lookup[loocv_states[j][0]] + state_to_number_lookup[1] * se_lookup[loocv_states[j][1]] + cs_lookup[loocv_states[j][2] + loocv_states[j][3] + loocv_states[j][4]]
        
        for i in range(len(actions)):
            if actions[i] == loocv_actions[j]:
                # Sum the rewards to base the prediction on if the action is the same
                total_reward_action += rewards[i]
                action_counter += 1
                
                if state_to_number_lookup[0] * states[i][0] + state_to_number_lookup[1] * states[i][1] + states[i][2] == loocv_state_as_number:
                    # Sum the rewards to base the prediction on if the action and the state are the same
                    total_reward_action_state += rewards[i]
                    action_state_counter += 1

        # Balance the total reward if the number of samples for a certain state-action pair is too little            
        number_of_samples_for_action_state = sample_count[loocv_state_as_number][loocv_actions[j]]
        if number_of_samples_for_action_state < 10:
            total_reward_action += (10 - number_of_samples_for_action_state) * np.mean(rewards)
            total_reward_action_state += (10 - number_of_samples_for_action_state) * np.mean(rewards)
            action_counter += 10 - number_of_samples_for_action_state
            action_state_counter += 10 - number_of_samples_for_action_state

        # Predict the reward for the sample and calculate the prediction error
        average_reward_action_state = total_reward_action_state / action_state_counter
        average_reward_action = total_reward_action / action_counter
        l1_error_action = abs(loocv_rewards[j] - average_reward_action)
        l1_error_action_and_state = abs(loocv_rewards[j] - average_reward_action_state)
        total_prediction_error_actions_per_state[loocv_state_as_number].append(l1_error_action)
        total_prediction_error_actions_states_per_state[loocv_state_as_number].append(l1_error_action_and_state)

average_prediction_error_actions_per_state_graph = []
average_prediction_error_actions_states_per_state_graph = []
# Lists for calculating the credibility intervals for the prediction error
means1 = []
means2 = []

for i in range(len(possible_states)):
    # Calculate the average prediction error per state
    average_prediction_error_actions_per_state_graph.append(sum(total_prediction_error_actions_per_state[i]) / len(total_prediction_error_actions_per_state[i]))
    average_prediction_error_actions_states_per_state_graph.append(sum(total_prediction_error_actions_states_per_state[i]) / len(total_prediction_error_actions_states_per_state[i]))

    # Calculate the 95% credibility intervals
    mean, variance, std = sc.bayes_mvs(total_prediction_error_actions_per_state[i], 0.95)
    means1.append(abs(mean[1][0] - mean[1][1]) * 0.5)
    mean, variance, std = sc.bayes_mvs(total_prediction_error_actions_states_per_state[i], 0.95)
    means2.append(abs(mean[1][0] - mean[1][1]) * 0.5)

# Extract the sample data without leaving a person out
states, actions, next_states, rewards, mean_rewards, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 4)

# Calculate the mean reward per state after balancing the state-action pairs with too little data
mean_rewards_per_state = []
for i in range(len(possible_states)):
    total_reward_per_state = 0
    for j in range(len(possible_actions)):
        if sample_count[i][j] < 10:
            total_reward_per_state += (sample_count[i][j] / 10) * mean_rewards[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards)
        else:
            total_reward_per_state += mean_rewards[i][j]
    total_reward_per_state = total_reward_per_state / len(possible_actions)
    mean_rewards_per_state.append(total_reward_per_state)
# Calculate the overall mean reward
average_reward_total = [np.mean(rewards) for k in range(len(possible_states))]

# Plot the prediction errors
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(len(possible_states))]
x2 = [i + 0.4 for i in x1]
x3 = [i + 0.2 for i in x1]
plt.bar(x1, average_prediction_error_actions_per_state_graph, width=0.4, label='action', color='lightblue', yerr=means1)
plt.bar(x2, average_prediction_error_actions_states_per_state_graph, width=0.4, label='action and state', color='orange', hatch='/', yerr=means2, edgecolor='white')
plt.xticks(x3, possible_states)
plt.subplots_adjust(bottom=0.2)
plt.ylim(0.00, 0.35)
plt.legend(loc='center')
plt.xlabel("State [self-motivation, self-efficacy, context state]")
plt.ylabel("Mean prediction error")

# Plot the mean rewards
axes2 = plt.twinx()
axes2.plot(x3, mean_rewards_per_state, label='per state', color='grey')
axes2.plot(x3, average_reward_total, label='overall', color='black', linestyle='dashed')
axes2.set_ylabel("Mean reward")
axes2.set_ylim(0.0, 1.0)
plt.legend(loc='center right')
plt.show()
Analysis Question 1:
Analysis Question 1 without penalty for overachieving:

Analysis question 2, Figure 4.8:

We calculated the likelihood of the observed next state using three approaches: assuming an equal probability for each next state, predicting that people stay in the same state, and predicting transitions with the calculated transition function.

Running the code gives a warning, which comes from calculating the credibility interval for an empty list, namely for the predictions that people stay in the same state for state [1, 0, 0]. This is harmless and the correct results are still generated; the sketch below shows how to silence the warning if desired.
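An optional sketch (not part of the original analysis) for hiding that warning:

In [ ]:
import warnings

# Optional: hide the harmless RuntimeWarning raised by sc.bayes_mvs when a state has no usable data
warnings.filterwarnings("ignore", category=RuntimeWarning)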

In [9]:
print("Analysis Question 2:")

# Keep track of the likelihood of the different approaches for predicting the next state
total_likelihood_transition_function_per_state = [[] for i in range(len(possible_states))]
total_likelihood_same_state_per_state = [[] for i in range(len(possible_states))]
likelihood_counter_per_state = [0 for i in range(len(possible_states))]

# Leave one person out and predict for the samples of that person
for iteration in range(118):
    # Extract the data
    states, actions, next_states, rewards, mean_rewards, loocv_states, loocv_actions, loocv_rewards, loocv_next_states, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(iteration, 0)

    # Calculate the transition likelihood for each of the samples of the left-out person
    for j in range(len(loocv_states)):
        # Convert the left-out sample's state and next state into a number to use as index
        loocv_state_as_number = state_to_number_lookup[0] * sm_lookup[loocv_states[j][0]] + state_to_number_lookup[1] * se_lookup[loocv_states[j][1]] + cs_lookup[loocv_states[j][2] + loocv_states[j][3] + loocv_states[j][4]]
        loocv_next_state_as_number = state_to_number_lookup[0] * sm_lookup[loocv_next_states[j][0]] + state_to_number_lookup[1] * se_lookup[loocv_next_states[j][1]] + cs_lookup[loocv_next_states[j][2] + loocv_next_states[j][3] + loocv_next_states[j][4]]

        # Calculate the transition function
        transitions = calculateTransitions(states, actions, next_states)

        # Get the likelihood of the next state according to the transition function
        likelihood_for_loocv_next_state = transitions[loocv_state_as_number][loocv_actions[j]][loocv_next_state_as_number]
        total_likelihood_transition_function_per_state[loocv_state_as_number].append(likelihood_for_loocv_next_state)

        # Get the likelihood of the next state when assuming to stay in the same state
        if loocv_state_as_number == loocv_next_state_as_number:
            total_likelihood_same_state_per_state[loocv_state_as_number].append(1)
        else:
            total_likelihood_same_state_per_state[loocv_state_as_number].append(0)

        # Keep track of the amount of predictions per state
        likelihood_counter_per_state[loocv_state_as_number] += 1


average_likelihood_equal_prob_per_state_graph = []
average_likelihood_same_state_per_state_graph = []
average_likelihood_transition_function_per_state_graph = []
# Lists for calculating the credibility intervals for the prediction error
means1 = []
means2 = []

# Calculate the mean likelihood per state using the three approaches
for i in range(len(possible_states)):
    average_likelihood_equal_prob_per_state_graph.append(1 / len(possible_states))
    average_likelihood_same_state_per_state_graph.append(sum(total_likelihood_same_state_per_state[i]) / likelihood_counter_per_state[i])
    average_likelihood_transition_function_per_state_graph.append(sum(total_likelihood_transition_function_per_state[i]) / likelihood_counter_per_state[i])

    # Calculate the 95% credibility intervals
    mean, variance, std = sc.bayes_mvs(total_likelihood_same_state_per_state[i], 0.95)
    means1.append(abs(mean[1][0] - mean[1][1]) * 0.5)
    mean, variance, std = sc.bayes_mvs(total_likelihood_transition_function_per_state[i], 0.95)
    means2.append(abs(mean[1][0] - mean[1][1]) * 0.5)

# Plot the results
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(len(possible_states))]
x2 = [i + 0.3 for i in x1]
x3 = [i + 0.3 for i in x2]
plt.bar(x1, average_likelihood_equal_prob_per_state_graph, width=0.3, label='equal probability', color='lightblue')
plt.bar(x2, average_likelihood_same_state_per_state_graph, width=0.3, label='same next state', color='orange', hatch='/', yerr=means1, edgecolor='white')
plt.bar(x3, average_likelihood_transition_function_per_state_graph, width=0.3, label='transition function', color='grey', hatch='-', yerr=means2, edgecolor='white')
plt.xticks(x2, possible_states)
plt.xlabel("State [self-motivation, self-efficacy, context state]")
plt.ylabel("Mean likelihood of next state")
plt.subplots_adjust(bottom=0.2)
plt.legend(loc='upper center')
plt.show()
Analysis Question 2:
C:\Users\marti\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1920: RuntimeWarning: invalid value encountered in multiply
  lower_bound = self.a * scale + loc
C:\Users\marti\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1921: RuntimeWarning: invalid value encountered in multiply
  upper_bound = self.b * scale + loc

Analysis question 3, Figures 4.9, 4.10, 4.13, and 4.14:

We first investigated, for each state, the most common transitions under the optimal policy and whether they lead to a state with a higher or lower value.

Then, we simulated people over time to see how they transition between states.

Finally, we simulated people again, but this time following the worst policy.
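For reference, the "value" of a state used in the comparison below is the optimal state value, i.e. the highest q-value available in that state. A toy illustration with made-up q-values:

In [ ]:
# Toy illustration (made-up q-values): the state value reported below is V*(s) = max over actions of Q(s, a)
toy_q_values_for_state = [5.1, 5.4, 5.2, 5.3, 5.0]
state_value = max(toy_q_values_for_state)  # 5.4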

In [10]:
print("Analysis Question 3:")

# Extract the data
states, actions, next_states, rewards, mean_rewards, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 0)

# Balance the mean rewards if there was too little data in a state-action pair
for i in range(len(possible_states)):
    for j in range(len(possible_actions)):
        if sample_count[i][j] < 10:
            mean_rewards[i][j] = (sample_count[i][j] / 10) * mean_rewards[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards)

# Calculate transitions
transitions = calculateTransitions(states, actions, next_states)

# Calculate q-values
q_values = calculateQValues(mean_rewards, transitions)

# Check for each state which transitions are significant and whether they lead to a state with a higher value
for i in range(len(possible_states)):
    print(f"For state {possible_states[i]} (V*(s) = {round(max(q_values[i]), 2)}), the transitions are as follows:")
    
    # Follow the optimal policy
    best_action = q_values[i].index(max(q_values[i]))
    transitions_for_best_action = transitions[i][best_action]
    
    for j in range(len(possible_states)):
        # Only check transitions with a probability of at least 12.5% (1 / number of states)
        if transitions_for_best_action[j] >= (1/len(possible_states)):
            if max(q_values[i]) - max(q_values[j]) > 0:
                print(f"To state {possible_states[j]} with probability {transitions_for_best_action[j]} which gives a LOWER value")
            
            elif max(q_values[i]) - max(q_values[j]) == 0:
                print(f"To itself with probability {transitions_for_best_action[j]} which gives the same value")
            
            else:
                print(f"To state {possible_states[j]} with probability {transitions_for_best_action[j]} which gives a HIGHER value")
            
    print("\n")
    

print("Analysis Question 3 part 2:")

# Initialize an even distribution amongst all the states
distribution = [1000 for i in range(len(possible_states))]

# Track the percentage of people in each state for certain moments in time
distribution_0 = [1/len(possible_states)*100 for i in range(len(possible_states))]
distribution_1 = [0 for i in range(len(possible_states))]
distribution_2 = [0 for i in range(len(possible_states))]
distribution_5 = [0 for i in range(len(possible_states))]
distribution_10 = [0 for i in range(len(possible_states))]
distribution_20 = [0 for i in range(len(possible_states))]

# Simulate 20 time steps of the optimal policy
for i in range(1, 21):
    new_distribution = [0 for j in range(len(possible_states))]
    
    for j in range(len(possible_states)):
        # Follow the optimal policy
        best_action = q_values[j].index(max(q_values[j]))
        transitions_for_best_action = transitions[j][best_action]
        
        # Redistribute people according to the transition function
        for k in range(len(possible_states)):
            if transitions_for_best_action[k] > 0:
                new_distribution[k] += distribution[j] * transitions_for_best_action[k]
    
    distribution = new_distribution.copy()
    
    # Get the percentage of people in each state
    graph_distribution = []
    for j in range(len(new_distribution)):
        graph_distribution.append(round(new_distribution[j] / (len(possible_states) * 1000), 3) * 100)
    
    # Save the distribution for the specified timesteps
    if i == 1:
        distribution_1 = graph_distribution.copy()
    elif i == 2:
        distribution_2 = graph_distribution.copy()
    elif i == 5:
        distribution_5 = graph_distribution.copy()
    elif i == 10:
        distribution_10 = graph_distribution.copy()
    elif i == 20:
        distribution_20 = graph_distribution.copy()

# Plot the distribution of people over the states over time
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(len(possible_states))]
x2 = [i + 0.15 for i in x1]
x3 = [i + 0.15 for i in x2]
x4 = [i + 0.15 for i in x3]
x5 = [i + 0.15 for i in x4]
x6 = [i + 0.15 for i in x5]
plt.bar(x1, distribution_0, width=0.15, label='0', color='lightgreen')
plt.bar(x2, distribution_1, width=0.15, label='1', color='mediumaquamarine')
plt.bar(x3, distribution_2, width=0.15, label='2', color='lightseagreen')
plt.bar(x4, distribution_5, width=0.15, label='5', color='cadetblue')
plt.bar(x5, distribution_10, width=0.15, label='10', color='teal')
plt.bar(x6, distribution_20, width=0.15, label='20', color='darkslategrey')
plt.xticks(x3, possible_states)
plt.subplots_adjust(bottom=0.2)
plt.ylim(0, 40)
plt.xlabel("State [self-motivation, self-efficacy, context state]")
plt.ylabel("Percentage of people")
plt.legend(loc='upper center')
plt.show()




print("Analysis Question 3 part 2 with worst policy:")

# Initialize an even distribution amongst all the states
distribution = [1000 for i in range(len(possible_states))]

# Track the percentage of people in each state for certain moments in time
distribution_0 = [1/len(possible_states)*100 for i in range(len(possible_states))]
distribution_1 = [0 for i in range(len(possible_states))]
distribution_2 = [0 for i in range(len(possible_states))]
distribution_5 = [0 for i in range(len(possible_states))]
distribution_10 = [0 for i in range(len(possible_states))]
distribution_20 = [0 for i in range(len(possible_states))]

# Simulate 20 time steps of the worst policy
for i in range(1, 21):
    new_distribution = [0 for j in range(len(possible_states))]
    
    for j in range(len(possible_states)):
        # Follow the worst policy
        worst_action = q_values[j].index(min(q_values[j]))
        transitions_for_worst_action = transitions[j][worst_action]
        
        # Redistribute people according to the transition function
        for k in range(len(possible_states)):
            if transitions_for_worst_action[k] > 0:
                new_distribution[k] += distribution[j] * transitions_for_worst_action[k]
    
    distribution = new_distribution.copy()
    
    # Get the percentage of people in each state
    graph_distribution = []
    for j in range(len(new_distribution)):
        graph_distribution.append(round(new_distribution[j] / (len(possible_states) * 1000), 3) * 100)
    
    # Save the distribution for the specified timesteps
    if i == 1:
        distribution_1 = graph_distribution.copy()
    elif i == 2:
        distribution_2 = graph_distribution.copy()
    elif i == 5:
        distribution_5 = graph_distribution.copy()
    elif i == 10:
        distribution_10 = graph_distribution.copy()
    elif i == 20:
        distribution_20 = graph_distribution.copy()

# Plot the distribution of people over the states over time
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(len(possible_states))]
x2 = [i + 0.15 for i in x1]
x3 = [i + 0.15 for i in x2]
x4 = [i + 0.15 for i in x3]
x5 = [i + 0.15 for i in x4]
x6 = [i + 0.15 for i in x5]
plt.bar(x1, distribution_0, width=0.15, label='0', color='lightgreen')
plt.bar(x2, distribution_1, width=0.15, label='1', color='mediumaquamarine')
plt.bar(x3, distribution_2, width=0.15, label='2', color='lightseagreen')
plt.bar(x4, distribution_5, width=0.15, label='5', color='cadetblue')
plt.bar(x5, distribution_10, width=0.15, label='10', color='teal')
plt.bar(x6, distribution_20, width=0.15, label='20', color='darkslategrey')
plt.xticks(x3, possible_states)
plt.subplots_adjust(bottom=0.2)
plt.ylim(0, 40)
plt.xlabel("State [self-motivation, self-efficacy, context state]")
plt.ylabel("Percentage of people")
plt.legend(loc='upper center')
plt.show()
Analysis Question 3:
For state [0, 0, 0] (V*(s) = 5.38), the transitions are as follows:
To itself with probability 0.2631578947368421 which gives the same value
To state [0, 0, 1] with probability 0.15789473684210525 which gives a LOWER value
To state [0, 1, 0] with probability 0.15789473684210525 which gives a HIGHER value
To state [1, 1, 1] with probability 0.21052631578947367 which gives a LOWER value


For state [0, 0, 1] (V*(s) = 5.36), the transitions are as follows:
To state [0, 0, 0] with probability 0.2625 which gives a HIGHER value
To state [0, 1, 0] with probability 0.1625 which gives a HIGHER value
To state [0, 1, 1] with probability 0.1625 which gives a HIGHER value
To state [1, 1, 0] with probability 0.1625 which gives a HIGHER value


For state [0, 1, 0] (V*(s) = 5.4), the transitions are as follows:
To state [0, 0, 0] with probability 0.2 which gives a LOWER value
To state [0, 0, 1] with probability 0.2 which gives a LOWER value
To state [1, 1, 0] with probability 0.3 which gives a HIGHER value


For state [0, 1, 1] (V*(s) = 5.42), the transitions are as follows:
To itself with probability 0.2125 which gives the same value
To state [1, 1, 1] with probability 0.6124999999999999 which gives a LOWER value


For state [1, 0, 0] (V*(s) = 5.37), the transitions are as follows:
To state [1, 1, 0] with probability 0.21250000000000002 which gives a HIGHER value


For state [1, 0, 1] (V*(s) = 5.42), the transitions are as follows:
To state [0, 0, 0] with probability 0.15 which gives a LOWER value
To state [0, 0, 1] with probability 0.15 which gives a LOWER value
To itself with probability 0.15 which gives the same value
To state [1, 1, 0] with probability 0.15 which gives a HIGHER value
To state [1, 1, 1] with probability 0.25 which gives a LOWER value


For state [1, 1, 0] (V*(s) = 5.47), the transitions are as follows:
To itself with probability 0.4 which gives the same value
To state [1, 1, 1] with probability 0.3 which gives a LOWER value


For state [1, 1, 1] (V*(s) = 5.37), the transitions are as follows:
To state [0, 0, 0] with probability 0.14285714285714285 which gives a HIGHER value
To state [1, 0, 1] with probability 0.14285714285714285 which gives a HIGHER value
To itself with probability 0.5714285714285714 which gives the same value


Analysis Question 3 part 2:
Analysis Question 3 part 2 with worst policy:

Analysis question 4, Figure 4.11:

We simulated people over time again, this time following three different policies (optimal, average, and worst), and kept track of the mean reward under each of them. The optimal policy uses the action with the highest q-value, the worst policy uses the action with the lowest q-value, and the average policy assumes that all actions are used equally often per timestep.
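Since there are five actions, weighting each action by 0.2 (as the cell below does) is the same as taking an unweighted mean over the actions. A toy illustration with made-up numbers:

In [ ]:
# Toy illustration (made-up numbers): the 0.2-weighted sum over five actions equals the unweighted mean
toy_rewards_for_state = [0.6, 0.7, 0.8, 0.9, 1.0]
avg_policy_reward = sum(0.2 * r for r in toy_rewards_for_state)  # 0.8
assert abs(avg_policy_reward - np.mean(toy_rewards_for_state)) < 1e-9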

In [11]:
print("Analysis Question 4:")

# Extract the data
states, actions, next_states, rewards, mean_rewards, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 0)

# Balance the mean rewards if there was too little data in a state-action pair
for i in range(len(possible_states)):
    for j in range(len(possible_actions)):
        if sample_count[i][j] < 10:
            mean_rewards[i][j] = (sample_count[i][j] / 10) * mean_rewards[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards)

# Calculate transitions
transitions = calculateTransitions(states, actions, next_states)

# Calculate q-values
q_values = calculateQValues(mean_rewards, transitions)

# Initialize the distribution by using the actual states from the first session
distribution_opt = [0 for i in range(len(possible_states))]
number_of_samples = 0
for i in range(len(states)):
    if session_numbers[i] == 1:
        # Only use the samples of the first session to create the distribution
        number_of_samples += 1
        state_as_number = state_to_number_lookup[0] * states[i][0] + state_to_number_lookup[1] * states[i][1] + state_to_number_lookup[2] * states[i][2]
        distribution_opt[state_as_number] += 1

distribution_avg = distribution_opt.copy()
distribution_bad = distribution_opt.copy()

mean_reward_per_transition_opt_graph = []
mean_reward_per_transition_avg_graph = []
mean_reward_per_transition_bad_graph = []

# For 100 time steps simulate 3 different policies and check the mean reward
for i in range(100):
    new_distribution_opt = [0 for j in range(len(possible_states))]
    new_distribution_avg = [0 for j in range(len(possible_states))]
    new_distribution_bad = [0 for j in range(len(possible_states))]
    
    total_reward_opt = 0
    total_reward_avg = 0
    total_reward_bad = 0
    
    for j in range(len(possible_states)):
        # Get the action following a policy
        best_action = q_values[j].index(max(q_values[j]))
        worst_action = q_values[j].index(min(q_values[j]))

        # Get the total reward per transition following a policy
        total_reward_opt += distribution_opt[j] * mean_rewards[j][best_action]
        total_reward_avg += 0.2 * distribution_avg[j] * mean_rewards[j][0] + 0.2 * distribution_avg[j] * mean_rewards[j][1] + 0.2 * distribution_avg[j] * mean_rewards[j][2] + 0.2 * distribution_avg[j] * mean_rewards[j][3] + 0.2 * distribution_avg[j] * mean_rewards[j][4]
        total_reward_bad += distribution_bad[j] * mean_rewards[j][worst_action]

        # Simulate the transition following a policy
        transitions_for_best_action = transitions[j][best_action]
        transitions_for_avg_action = [0 for j in range(len(possible_states))]
        for k in range(len(possible_states)):
            # Calculate the transitions for the average policy
            transitions_for_avg_action[k] = 0.2 * transitions[j][0][k] + 0.2 * transitions[j][1][k] + 0.2 * transitions[j][2][k] + 0.2 * transitions[j][3][k] + 0.2 * transitions[j][4][k]
        transitions_for_avg_action[j] += 1 - sum(transitions_for_avg_action)
        transitions_for_worst_action = transitions[j][worst_action]
        
        # Simulate the transitions per policy
        for k in range(len(possible_states)):
            if transitions_for_best_action[k] > 0:
                new_distribution_opt[k] += distribution_opt[j] * transitions_for_best_action[k]
        
        for k in range(len(possible_states)):
            if transitions_for_avg_action[k] > 0:
                new_distribution_avg[k] += distribution_avg[j] * transitions_for_avg_action[k]
        
        for k in range(len(possible_states)):
            if transitions_for_worst_action[k] > 0:
                new_distribution_bad[k] += distribution_bad[j] * transitions_for_worst_action[k]
    
    # Update the distribution of people over the states
    distribution_opt = new_distribution_opt.copy()
    distribution_avg = new_distribution_avg.copy()
    distribution_bad = new_distribution_bad.copy()

    # Get the mean reward per transition following a policy for certain timesteps
    if i in [0, 1, 2, 3, 4, 9, 19, 29, 49, 99]:
        mean_reward_per_transition_opt_graph.append(total_reward_opt / sum(distribution_opt))
        mean_reward_per_transition_avg_graph.append(total_reward_avg / sum(distribution_avg))
        mean_reward_per_transition_bad_graph.append(total_reward_bad / sum(distribution_bad))

# Plot the results
plt.subplots(figsize=(10, 6))
x1 = [i for i in range(10)]
plt.plot(x1, mean_reward_per_transition_opt_graph, label='optimal policy', color='lightblue')
plt.plot(x1, mean_reward_per_transition_avg_graph, label='average policy', color='orange')
plt.plot(x1, mean_reward_per_transition_bad_graph, label='worst policy', color='grey')
plt.xticks(x1, [1, 2, 3, 4, 5, 10, 20, 30, 50, 100])
plt.subplots_adjust(bottom=0.2)
plt.ylim(0.0, 1.0)
plt.ylabel("Mean reward")
plt.xlabel("Timestep")
plt.legend(loc='center right')
plt.show()
Analysis Question 4:

Analysis question 5, Table 4.3:

We calculated the optimal policy under different reward functions and compared the resulting policies using a linearly weighted Cohen's kappa. The optimal actions for each reward function are displayed with the following symbols:
-- means decrease,
- means slightly decrease,
0 means do nothing,
+ means slightly increase,
++ means increase.
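As a toy illustration of the similarity measure (made-up policies, not the study data): identical policies give a linearly weighted kappa of 1, and any disagreement lowers it, with disagreements further apart on the ordinal action scale penalized more heavily.

In [ ]:
# Toy illustration (made-up policies) of the linearly weighted Cohen's kappa used below
policy_a = [0, 1, 2, 3, 4, 0, 1, 2]
policy_b = [0, 1, 2, 3, 4, 0, 1, 2]
print(metrics.cohen_kappa_score(policy_a, policy_b, weights='linear'))  # identical policies -> 1.0
policy_c = [0, 1, 2, 3, 4, 0, 1, 4]  # one large disagreement on the last sample
print(metrics.cohen_kappa_score(policy_a, policy_c, weights='linear'))  # lower than 1.0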

In [12]:
# Extract the data using the different reward functions
states, actions, next_states, rewards_default, mean_rewards_default, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 0)
states, actions, next_states, rewards_goal_achievability, mean_rewards_goal_achievability, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 1)
states, actions, next_states, rewards_action_reward, mean_rewards_action_reward, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 2)
states, actions, next_states, rewards_ga_and_ar, mean_rewards_ga_and_ar, loocv_state, loocv_action, loocv_reward, loocv_next_state, sm_lookup, se_lookup, cs_lookup, session_numbers = read_data(-1, 3)

# Balance the mean rewards for the state-action pairs with too few samples
for i in range(len(possible_states)):
    for j in range(len(possible_actions)):
        if sample_count[i][j] < 10:
            mean_rewards_default[i][j] = (sample_count[i][j] / 10) * mean_rewards_default[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards_default)
            mean_rewards_goal_achievability[i][j] = (sample_count[i][j] / 10) * mean_rewards_goal_achievability[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards_goal_achievability)
            mean_rewards_action_reward[i][j] = (sample_count[i][j] / 10) * mean_rewards_action_reward[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards_action_reward)
            mean_rewards_ga_and_ar[i][j] = (sample_count[i][j] / 10) * mean_rewards_ga_and_ar[i][j] + ((10 - sample_count[i][j]) / 10) * np.mean(rewards_ga_and_ar)

# Calculate transitions
transitions = calculateTransitions(states, actions, next_states)

# Calculate q-values for the different reward functions
q_values_default = calculateQValues(mean_rewards_default, transitions)
q_values_goal_achievability = calculateQValues(mean_rewards_goal_achievability, transitions)
q_values_action_reward = calculateQValues(mean_rewards_action_reward, transitions)
q_values_ga_and_ar = calculateQValues(mean_rewards_ga_and_ar, transitions)

opt_policy_default = []
opt_policy_goal_achievability = []
opt_policy_action_reward = []
opt_policy_ga_and_ar = []
# Symbols used to display the optimal actions (kept in a separate variable so the sample actions are not overwritten)
action_symbols = ['--', '-', '0', '+', '++']
# Get the optimal policies for each of the reward functions
print("state | default reward | goal achievability | standard reward per action | both ga and sr")
for i in range(len(possible_states)):
    opt_policy_default.append(q_values_default[i].index(max(q_values_default[i])))
    opt_policy_goal_achievability.append(q_values_goal_achievability[i].index(max(q_values_goal_achievability[i])))
    opt_policy_action_reward.append(q_values_action_reward[i].index(max(q_values_action_reward[i])))
    opt_policy_ga_and_ar.append(q_values_ga_and_ar[i].index(max(q_values_ga_and_ar[i])))
    print(f"s{possible_states[i]} | {action_symbols[opt_policy_default[i]]} | {action_symbols[opt_policy_goal_achievability[i]]} | {action_symbols[opt_policy_action_reward[i]]} | {action_symbols[opt_policy_ga_and_ar[i]]}")

# Calculate Cohen's kappa to measure similarities in the policies
kappa_1 = metrics.cohen_kappa_score(opt_policy_default, opt_policy_goal_achievability, weights='linear')
kappa_2 = metrics.cohen_kappa_score(opt_policy_default, opt_policy_action_reward, weights='linear')
kappa_3 = metrics.cohen_kappa_score(opt_policy_default, opt_policy_ga_and_ar, weights='linear')
kappa_4 = metrics.cohen_kappa_score(opt_policy_goal_achievability, opt_policy_ga_and_ar, weights='linear')
kappa_5 = metrics.cohen_kappa_score(opt_policy_action_reward, opt_policy_ga_and_ar, weights='linear')

print(f"The similarity between the default and goal achievability policies is {round(kappa_1, 2)}")
print(f"The similarity between the default and action reward policies is {round(kappa_2, 2)}")
print(f"The similarity between the default and goal achievability + action reward policies is {round(kappa_3, 2)}")
print(f"The similarity between the goal achievability and goal achievability + action reward policies is {round(kappa_4, 2)}")
print(f"The similarity between the action reward and goal achievability + action reward policies is {round(kappa_5, 2)}")
state | default reward | goal achievability | standard reward per action | both ga and sr
s[0, 0, 0] | + | + | + | ++
s[0, 0, 1] | ++ | ++ | ++ | ++
s[0, 1, 0] | - | -- | ++ | +
s[0, 1, 1] | - | - | 0 | -
s[1, 0, 0] | -- | -- | + | +
s[1, 0, 1] | - | - | ++ | ++
s[1, 1, 0] | + | + | ++ | +
s[1, 1, 1] | ++ | - | ++ | ++
The similarity between the default and goal achievability policies is 0.67
The similarity between the default and action reward policies is 0.19
The similarity between the default and goal achievability + action reward policies is 0.33
The similarity between the goal achievability and goal achievability + action reward policies is 0.16
The similarity between the action reward and goal achievability + action reward policies is 0.41
In [ ]: