*** Phylogenetic network rearrangement move heuristics ***
Author: Remie Janssen
TU Delft, Electrical Engineering, Mathematics and Computer Science, Department of Applied Mathematics

Contact information:
r.janssen-2@tudelft.nl



***General introduction***
This dataset contains the data collected to test the rearrangement heuristics developed in Remie Janssen's PhD thesis. This research project was made possible by Leo van Iersel's Vidi grant 639.072.602.


***Experiments***
The heuristic is meant to find sequences of rearrangement moves between a pair of phylogenetic networks. This dataset contains the data for experiments that test the heuristic's computational speed and the quality of its solutions (the length of the sequences, where shorter is better). The data comprises four experiments:
 - running time,
 - quality for isomorphic inputs,
 - quality for small inputs,
 - quality for small distances.
 
For each of these experiments, networks were generated by the Ntk-generator by Zhang (2016), which can be found at http://phylnet.univ-mlv.fr/tools/randomNtkGenerator.php. To generate inputs with small distances, we started with a network generated by this generator and performed a small number of rearrangement moves of the appropriate type (e.g., when testing the head move heuristic, we performed head moves).

 
***Test equipment***
The running time was tested on a Linux system with an Intel Core i7-8650U CPU running at 1.90 GHz and 8192 MB of DDR4 RAM clocked at 2400 MT/s. The operating system was Ubuntu 18.04.4 with a 4.15.0-118-generic kernel. The software was written in Python version 2.7.17.


***Description of the dataset***
The data is structured per experiment, where the running time and isomorphic input experiments share a folder (RunningTimeInputsFormatted). The inputs are grouped in folders by size. In RunningTimeInputs and SmallNetworksFormatted, the folders are named with the convention n_k, where n is the number of leaves in the networks and k is the number of reticulations. In SmallDistanceNetworkPairs, the inputs are first grouped by move type: move = rSPR moves, move-h = head moves, move-t = tail moves. Within those folders, the inputs are grouped by number of leaves (n), number of reticulations (k) and number of moves used to create the second network (m); the folder names are n_k_m. Each file then contains a pair of networks in edge-list format, and the name of each file corresponds to the indices of the networks, so 0_4.txt tests the distance between networks 0 and 4.
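
The naming convention for SmallDistanceNetworkPairs can be unpacked with a few lines of Python. The following is only a sketch: the function name parse_pair_path and the example path (including the values 10_3_2) are made up for illustration and are not part of the dataset or of the original scripts.

```python
import os

# Move-type folder names inside SmallDistanceNetworkPairs, as described above.
MOVE_FOLDERS = {"move": "rSPR", "move-h": "head", "move-t": "tail"}


def parse_pair_path(path):
    """Split '<move folder>/<n_k_m>/<i_j>.txt' into its components.

    Hypothetical helper, not part of the original software.
    """
    parts = path.replace("\\", "/").split("/")
    move_folder, size_folder, file_name = parts[-3:]
    # Folder name n_k_m: leaves, reticulations, moves used for the second network.
    n, k, m = [int(x) for x in size_folder.split("_")]
    # File name i_j.txt: indices of the two networks in the pair.
    stem = os.path.splitext(file_name)[0]
    first, second = [int(x) for x in stem.split("_")]
    return {
        "move_type": MOVE_FOLDERS[move_folder],
        "leaves": n,
        "reticulations": k,
        "moves": m,
        "networks": (first, second),
    }


# Example path with made-up values for n, k and m.
info = parse_pair_path("SmallDistanceNetworkPairs/move-h/10_3_2/0_4.txt")
```

For the n_k folders of the other experiments, the same idea applies with two fields instead of three.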

The three folders RunningTimeInputs, SmallDistanceNetworkPairs and SmallNetworksFormatted also contain the results of applying the heuristic, stored in .txt files in csv format with a ; separator. Each line of these files contains the following fields.

result_times.txt:        filename;heuristic;time_read_trees;time_find_sequence;length;sequence
results_isom_Quality:    filename;heuristic;length;sequence
results_small_distance:  filename;heuristic;length;sequence
results_small_networks:  filename;algorithm;move;time_read;time_algo;length;sequence

filename:               the name of the file the heuristic is applied to 
heuristic:              the parameters used to run the heuristic in the terminal: -t means only tail moves are used, -h means only head moves are used, and the absence of both -t and -h means rSPR moves are used; -rand means the random version of the heuristic was used.
time_read_trees:        the time the script needed to read the networks from the input file; this data is not used.
time_find_sequence:     the time the heuristic took to find a sequence between the pair of input networks in seconds.
length:                 the length of the found rearrangement sequence.
sequence:               the found sequence of rearrangement moves.
algorithm:              `heuristic' if the heuristic is used; `exact' if the breadth first search is used.
move:                   `move' if rSPR moves are used, `move-t' if tail moves are used, and `move-h' if head moves are used.
time_read:              like time_read_trees
time_algo:              like time_find_sequence
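
A result line can be split into the fields listed above as in the sketch below. It rests on several assumptions that should be checked against the actual files: that the files have no header row, that the flags in the heuristic field are separated by whitespace, and that only the last field (sequence) may itself contain a ; character. The helper names and the example line are invented for illustration.

```python
# Column layouts copied from the list of result files above.
RESULT_COLUMNS = {
    "result_times": ["filename", "heuristic", "time_read_trees",
                     "time_find_sequence", "length", "sequence"],
    "results_isom_Quality": ["filename", "heuristic", "length", "sequence"],
    "results_small_distance": ["filename", "heuristic", "length", "sequence"],
    "results_small_networks": ["filename", "algorithm", "move", "time_read",
                               "time_algo", "length", "sequence"],
}


def parse_result_line(line, columns):
    """Map one ;-separated line onto the given column names (values stay strings)."""
    # Limit the number of splits so that any ; inside the final
    # 'sequence' field is kept intact (assumption, see lead-in).
    fields = line.rstrip("\r\n").split(";", len(columns) - 1)
    return dict(zip(columns, fields))


def describe_heuristic(flags):
    """Return (move type, random version used?) for a 'heuristic' field."""
    tokens = flags.split()  # assumes whitespace-separated flags
    if "-t" in tokens:
        move = "tail"
    elif "-h" in tokens:
        move = "head"
    else:
        move = "rSPR"
    return move, "-rand" in tokens


# Made-up example line in the layout of result_times.txt.
row = parse_result_line("0_4.txt;-t -rand;0.01;1.5;3;made-up;sequence",
                        RESULT_COLUMNS["result_times"])
move_type, is_random = describe_heuristic(row["heuristic"])
```

Splitting with a maximum number of fields, rather than on every ;, is a defensive choice in case the sequence field contains the separator.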
 
 

