*** Self-adaptive Executors for Big Data Processing ***


Authors: Sobhan Omranian Khorasani, Jan S. Rellermeyer, Dick Epema
Distributed Systems Group
Delft University of Technology



***General Introduction***

This dataset contains the measurements obtained with Apache Spark using different strategies for adapting the number of executor threads to reduce I/O contention. The two main strategies explored are a static solution, in which the number of executor threads for I/O-intensive stages is pre-determined, and a dynamic solution, which employs an active control loop that measures the epoll_wait time.

This data was collected by running our enhanced version of Spark on OpenJDK 8 in different settings on the DAS-5 cluster. A complete description of the experimental setup can be found in our ACM/IFIP Middleware 2019 paper "Self-adaptive Executors for Big Data Processing". The source code and the scripts for executing the experiments are available at: https://github.com/SobhanOmranian/spark-dca



***Purpose of the Experiments***
The purpose of the experiments is to show that reducing the executor thread count for I/O-intensive phases of the big data workload yields better performance due to the reduction of I/O contention.



***Overview of Files and Description of the Data***

Filename: figure1_experiment[io_wait_time].csv

Description: This experiment shows the amount of IO wait time and CPU utilization in each stage of the application.

Header:
•	appName: name of the application
•	stage: a number representing the stage of the application
•	usr: the percentage of CPU utilization that occurred while executing at the user level (application).
•	nice: the percentage of CPU utilization that occurred while executing at the user level with nice priority.
•	sys: the percentage of CPU utilization that occurred while executing at the system level (kernel).
•	iowait: the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
•	irq: the percentage of time spent by the CPU or CPUs to service hardware interrupts.
•	soft: the percentage of time spent by the CPU or CPUs to service software interrupts.
•	steal: the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
•	gnice: the percentage of time spent by the CPU or CPUs to run a niced guest.
•	idle: the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
•	duration: time taken for a given stage to complete. (unit: millisecond)
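As an illustration, the per-stage I/O-wait behaviour behind Figure 1 can be summarized with a few lines of Python. The sample rows below are hypothetical and only mimic the documented schema (restricted to the columns used here); the "iowait greater than usr" rule is just an illustrative heuristic, not the classification used in the paper.

```python
import csv
import io

# Hypothetical sample rows following the documented header
# (only the columns used here are shown).
sample = io.StringIO(
    "appName,stage,usr,iowait,duration\n"
    "terasort,0,25.2,45.1,120000\n"
    "terasort,1,80.4,5.3,60000\n"
)

# Flag stages where the CPUs spent more time waiting on outstanding
# disk I/O than executing user code -- a rough indicator of I/O intensity.
io_intensive = [
    int(row["stage"])
    for row in csv.DictReader(sample)
    if float(row["iowait"]) > float(row["usr"])
]
```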


Filename: figure2_figure4_experiment[static_solution].csv

Description: This experiment shows the performance of the static solution for different numbers of threads. The static solution infers the IO-intensive stages by examining the structure of the application and uses a user-defined number of threads for such stages.

Header:
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	duration: time taken for a given stage to complete. (unit: millisecond)
•	usedCores: actual number of cores used by all the executors across the cluster.
•	totalCores: total number of available cores across all the executors in the cluster.
•	adaptive: which adaptivity method was selected (100 means the static solution) 
•	isIo: a boolean flag indicating whether a given stage is considered IO-intensive (0 means non-IO stage and 1 means IO stage).
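A typical way to use this file is to aggregate stage durations per application, e.g. total runtime and the share spent in IO stages. The rows below are hypothetical placeholders that follow the documented header; they are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows following the documented header.
sample = io.StringIO(
    "appName,stage,duration,usedCores,totalCores,adaptive,isIo\n"
    "terasort,0,90000,8,32,100,1\n"
    "terasort,1,60000,32,32,100,0\n"
)

# Total runtime per application and the share spent in IO stages.
total_ms = defaultdict(int)
io_ms = defaultdict(int)
for row in csv.DictReader(sample):
    total_ms[row["appName"]] += int(row["duration"])
    if row["isIo"] == "1":
        io_ms[row["appName"]] += int(row["duration"])
```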


Filename: figure3_experiment[io_variability].csv

Description: This experiment compares the IO performance (reading and writing 30GB of data) of all the nodes across the cluster.

Header:
•	nodeNum: name of the node.
•	blockSize: the block size of the read/written file (unit: bytes).
•	blockCount: number of blocks for the read/written file.
•	threadNum: number of threads performing the read/write.
•	writeTime: total time it takes to write the file (unit: seconds).
•	readTime: total time it takes to read the file (unit: seconds).
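From these columns, per-node bandwidth (the quantity compared in Figure 3) can be derived as blockSize × blockCount ÷ writeTime. A minimal sketch, with hypothetical values (roughly 30 GiB written in about 322 s):

```python
# Node-level write bandwidth derived from blockSize, blockCount, and
# writeTime; the values below are hypothetical.
def write_bandwidth_mb_s(block_size_bytes, block_count, write_time_s):
    """Write bandwidth in MB/s."""
    return block_size_bytes * block_count / write_time_s / 1e6

bw = write_bandwidth_mb_s(block_size_bytes=1_048_576, block_count=30_720,
                          write_time_s=322.1)
```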


Filename: figure5_experiment[io_utilization].csv

Description: This experiment shows the average IO utilization of the nodes across the cluster for different numbers of threads. The data file records the IO information for each node at one-second intervals. This information was collected using the “iostat” command-line tool.

Header:
•	nodeId: name of the node.
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	user: the percentage of CPU utilization that occurred while executing at the user level (application).
•	nice: the percentage of CPU utilization that occurred while executing at the user level with nice priority.
•	system: the percentage of CPU utilization that occurred while executing at the system level (kernel).
•	iowait: the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
•	steal: the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
•	idle: the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
•	rrqmPerS: The number of read requests merged per second that were queued to the device.
•	wrqrmPerS: The number of write requests merged per second that were queued to the device.
•	rPerS: The number (after merges) of read requests completed per second for the device.
•	wPerS: The number (after merges) of write requests completed per second for the device.
•	rkBPerS: The amount of data read from the device per second. (unit: KB/s)
•	wkBPerS: The amount of data written to the device per second. (unit: KB/s)
•	avgrq_sz: The average size of the requests that were issued to the device. (unit: sectors)
•	avgqu_sz: The average queue length of the requests that were issued to the device.
•	await: The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	r_await: The average time for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	w_await: The average time for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	svctm: The average service time for I/O requests that were issued to the device. (unit: milliseconds)
•	util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device).


Filename: figure6_experiment[thread_selection].csv

Description: This experiment shows the number of threads chosen by the dynamic solution for each executor in different stages of Terasort.

Header:
•	executorId: id of the executor
•	stage: a number representing the stage of the application.
•	cores: number of cores (i.e., threads) in use.
•	totalEpollWait: the amount of time threads have to wait for I/O events on an epoll file descriptor to complete. (unit: seconds)
•	normalisedByThreadNum: epoll wait time divided by the number of threads in use.
•	normalisedByTime: epoll wait time divided by the duration of the interval during which the number of threads is equal to the value in the ‘cores’ column.
•	normalisedByBoth: epoll wait time divided by both the number of threads and the duration.
•	normalisedByTotalTaskThroughputFromSampling: epoll wait time divided by the I/O throughput reported by the Spark metric system for a given interval.
•	avgTaskThroughput: the average I/O throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskThroughput: the average I/O throughput of the running tasks, computed at the end of the interval. (unit: MB/s)
•	totalTaskReadThroughputFromSampling: the average read throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskWriteThroughputFromSampling: the average write throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskBothThroughputFromSampling: the average read and write throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	avgTaskThroughputFromSampling: the average I/O throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTime: duration of an interval. (unit: milliseconds)
•	totalFileReadTime: the overhead of reading the strace output for calculating the epoll wait time. (unit: milliseconds)
•	selection: Number of threads selected by the dynamic solution at the end of each interval.
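The selection rule can be sketched in a few lines: among the candidate intervals, the dynamic solution picks the thread count whose epoll wait time normalised by I/O throughput is lowest (the smaller the value, the better). This is a minimal sketch, not the actual spark-dca implementation, and the measurements below are hypothetical.

```python
# Minimal sketch of the thread-selection rule described above.
def select_threads(intervals):
    """intervals: iterable of (cores, epoll_wait_s, throughput_mb_s).

    Returns the thread count minimising epoll wait per unit of throughput.
    """
    return min(intervals, key=lambda iv: iv[1] / iv[2])[0]

# Hypothetical measurements for one executor in one stage:
choice = select_threads([(32, 40.0, 200.0), (16, 12.0, 220.0), (8, 4.0, 210.0)])
```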


Filename: figure7_experiment[epoll_wait_and_throughput].csv

Description: This experiment shows how the epoll wait time, the I/O throughput, and the combined metric are affected when the number of threads changes in all stages of Terasort.

Header:
•	executorId: id of the executor
•	stage: a number representing the stage of the application.
•	cores: number of cores (i.e., threads) in use.
•	totalEpollWait: the amount of time threads have to wait for I/O events on an epoll file descriptor to complete. (unit: seconds)
•	normalisedByThreadNum: epoll wait time divided by the number of threads in use.
•	normalisedByTime: epoll wait time divided by the duration of the interval during which the number of threads is equal to the value in the ‘cores’ column.
•	normalisedByBoth: epoll wait time divided by both the number of threads and the duration.
•	normalisedByTotalTaskThroughputFromSampling: epoll wait time divided by the I/O throughput reported by the Spark metric system for a given interval. Based on this value, the algorithm decides which interval has performed better. The smaller the value, the better.
•	avgTaskThroughput: the average I/O throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskThroughput: the average I/O throughput of the running tasks, computed at the end of the interval. (unit: MB/s)
•	totalTaskReadThroughputFromSampling: the average read throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskWriteThroughputFromSampling: the average write throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTaskBothThroughputFromSampling: the average read and write throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	avgTaskThroughputFromSampling: the average I/O throughput of the running tasks, sampled at each second in an interval. (unit: MB/s)
•	totalTime: duration of an interval. (unit: milliseconds)
•	totalFileReadTime: the overhead of reading the strace output for calculating the epoll wait time. (unit: milliseconds)
•	selection: Number of threads selected by the dynamic solution at the end of each interval.


Filename: figure8_experiment[dynamic_solution].csv

Description: This experiment compares the performance of the dynamic solution with baseline Spark and the best-fit static solution.

Header:
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	duration: time taken for a given stage to complete. (unit: millisecond)
•	usedCores: actual number of cores used by all the executors across the cluster.
•	totalCores: total number of available cores across all the executors in the cluster.
•	adaptive: which adaptivity method was selected (100 = static solution, 14 = dynamic solution) 


Filename: figure9_experiment[dynamic_solution].csv

Description: The aim of this experiment is to show how the different methods scale. It compares the results of the default, static, and dynamic solutions between a 4-node and a 16-node version of the experiment for the Terasort application. The input size is increased proportionally (4 times) in the 16-node version.

Header:
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	duration: time taken for a given stage to complete. (unit: millisecond)
•	usedCores: actual number of cores used by all the executors across the cluster.
•	totalCores: total number of available cores across all the executors in the cluster.
•	adaptive: which adaptivity method was selected (100 = baseline spark, 999 = static solution, 14 = dynamic solution) 


Filename: figure10_experiment[static_solution_ssd].csv

Description: The aim of this experiment is to show how the underlying disk (HDD vs. SSD) affects the output of the static solution for the Terasort application. All the data in this file is for the SSD version.

Header:
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	duration: time taken for a given stage to complete. (unit: millisecond)
•	usedCores: actual number of cores used by all the executors across the cluster.
•	totalCores: total number of available cores across all the executors in the cluster.
•	adaptive: which adaptivity method was selected (100 means the static solution) 
•	isIo: a boolean flag indicating whether a given stage is considered IO-intensive (0 means non-IO stage and 1 means IO stage).


Filename: figure11_experiment[dynamic_solution_ssd].csv

Description: The aim of this experiment is to show how the underlying disk (HDD vs. SSD) affects the output of the dynamic solution for the Terasort application. All the data in this file is for the SSD version.

Header:
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	duration: time taken for a given stage to complete. (unit: millisecond)
•	usedCores: actual number of cores used by all the executors across the cluster.
•	totalCores: total number of available cores across all the executors in the cluster.
•	adaptive: which adaptivity method was selected (100 = static solution, 14 = dynamic solution)


Filename: figure12_experiment[io_performance_hdd].csv

Description: This experiment shows a line chart of the average IO throughput of the nodes across the cluster at each second for different numbers of threads. The nodes in this experiment have HDD disks. The data file records the IO information for each node at one-second intervals. This information was collected using the “iostat” command-line tool.

Header:
•	nodeId: name of the node.
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	user: the percentage of CPU utilization that occurred while executing at the user level (application).
•	nice: the percentage of CPU utilization that occurred while executing at the user level with nice priority.
•	system: the percentage of CPU utilization that occurred while executing at the system level (kernel).
•	iowait: the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
•	steal: the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
•	idle: the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
•	rrqmPerS: The number of read requests merged per second that were queued to the device.
•	wrqrmPerS: The number of write requests merged per second that were queued to the device.
•	rPerS: The number (after merges) of read requests completed per second for the device.
•	wPerS: The number (after merges) of write requests completed per second for the device.
•	rkBPerS: The amount of data read from the device per second. (unit: KB/s)
•	wkBPerS: The amount of data written to the device per second. (unit: KB/s)
•	avgrq_sz: The average size of the requests that were issued to the device. (unit: sectors)
•	avgqu_sz: The average queue length of the requests that were issued to the device.
•	await: The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	r_await: The average time for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	w_await: The average time for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	svctm: The average service time for I/O requests that were issued to the device. (unit: milliseconds)
•	util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device).


Filename: figure12_experiment[io_performance_sdd].csv

Description: This experiment shows a line chart of the average IO throughput of the nodes across the cluster at each second for different numbers of threads. The nodes in this experiment have SSD disks. The data file records the IO information for each node at one-second intervals. This information was collected using the “iostat” command-line tool.

Header:
•	nodeId: name of the node.
•	appName: name of the application.
•	stage: a number representing the stage of the application.
•	user: the percentage of CPU utilization that occurred while executing at the user level (application).
•	nice: the percentage of CPU utilization that occurred while executing at the user level with nice priority.
•	system: the percentage of CPU utilization that occurred while executing at the system level (kernel).
•	iowait: the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
•	steal: the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
•	idle: the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
•	rrqmPerS: The number of read requests merged per second that were queued to the device.
•	wrqrmPerS: The number of write requests merged per second that were queued to the device.
•	rPerS: The number (after merges) of read requests completed per second for the device.
•	wPerS: The number (after merges) of write requests completed per second for the device.
•	rkBPerS: The amount of data read from the device per second. (unit: KB/s)
•	wkBPerS: The amount of data written to the device per second. (unit: KB/s)
•	avgrq_sz: The average size of the requests that were issued to the device. (unit: sectors)
•	avgqu_sz: The average queue length of the requests that were issued to the device.
•	await: The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	r_await: The average time for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	w_await: The average time for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. (unit: milliseconds)
•	svctm: The average service time for I/O requests that were issued to the device. (unit: milliseconds)
•	util: Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device).


Filename: table2_experiment[io_activity].csv

Description: This experiment shows how much I/O activity (reads and writes) different Spark applications generate in total compared to their input size.

Header:
•	appName: name of the application.
•	totalReadKb: total amount of reads. (unit: KB)
•	totalWriteKb: total amount of writes. (unit: KB)
•	totalReadAndWriteKb: total amount of reads and writes. (unit: KB)
•	inputBytes: size of the application input. (unit: bytes)
•	outputBytes: size of the application output. (unit: bytes)
•	stagesOutputBytes: size of the output produced by the Spark shuffle stages. (unit: bytes)
•	replicationFactor: value of the replication factor parameter for HDFS.
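The comparison behind Table 2 amounts to an I/O amplification ratio: total bytes read and written divided by the input size. A minimal sketch, with hypothetical values (the columns use different units, so the KB total must be converted to bytes first):

```python
# I/O amplification: total disk traffic relative to the input size.
# Note totalReadAndWriteKb is in KB while inputBytes is in bytes.
def io_amplification(total_read_and_write_kb, input_bytes):
    return total_read_and_write_kb * 1024 / input_bytes

# Hypothetical values: 300 million KB of total I/O for a 100 GB input.
ratio = io_amplification(total_read_and_write_kb=300_000_000,
                         input_bytes=100_000_000_000)
```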

