Title of Dataset: Data underlying the publication: Mining Software Testing Knowledge from Stack Overflow
Author: Dibyendu Gupta
Software Engineering Group, Faculty of Electrical Engineering, Mathematics & Computer Science (EEMCS), TU Delft

Contact Information: D.T.Gupta@student.tudelft.nl

This dataset contains 138 .csv files of posts retrieved from Stack Overflow (SO) through the Stack Exchange API. The files can be grouped by year (2017-2023), which appears at the end of each file name (xxx-xxx-20xx.csv). The naming convention is search_term-testing-20xx.csv, where search_term is the type of testing concerned (acceptance, compatibility, e2e, unit, etc.). The search_term is used to query posts from SO: a post is included if the search term appears in either its title or its body. The following filters were used to query these posts:
1. Sort - Relevance
2. Order - Descending
3. Starting date - 2017-01-01
4. Ending date - 2017-12-31
5. Title - The search terms provided (system testing, integration testing, etc.)
6. Body - Same search terms provided
7. Additional parameters:
(a) withbody - To query the content of the post
(b) has_more - To query subsequent pages until the search results or the API limit are exhausted
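
As an illustration only, the filters listed above could be expressed as request parameters in the following way. This is a minimal sketch and not the contents of SOScript.py: the endpoint (search/advanced, API version 2.3), the page size, and the helper names build_params and to_epoch are assumptions. It also does not show how the "title or body" condition was combined in the original script.

```python
from datetime import datetime, timezone

# Assumed endpoint; this README does not state which API route SOScript.py calls.
SEARCH_URL = "https://api.stackexchange.com/2.3/search/advanced"


def to_epoch(year, month, day):
    """Convert a calendar date (UTC midnight) to a Unix timestamp,
    the format the Stack Exchange API expects for fromdate/todate."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_params(search_term, year, page=1):
    """Build the query parameters for one search term and one calendar year.
    Hypothetical helper mirroring filters 1-7 in the list above."""
    return {
        "site": "stackoverflow",
        "sort": "relevance",               # 1. Sort
        "order": "desc",                   # 2. Order
        "fromdate": to_epoch(year, 1, 1),  # 3. Starting date
        "todate": to_epoch(year, 12, 31),  # 4. Ending date
        "title": search_term,              # 5. Title
        "body": search_term,               # 6. Body
        "filter": "withbody",              # 7(a). include the post body
        "page": page,
        "pagesize": 100,                   # assumed page size
    }


params = build_params("unit testing", 2017)
```

The resulting dictionary could be passed to an HTTP client as the query string of a GET request to SEARCH_URL.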

There are 11 files in the dataset that have no year at the end of the file name. Each of these files covers the full period from 01-01-2017 to 31-05-2023 in one file (without any segmentation by year). The name of each file indicates the search term used to query posts from SO. These files also have an extra column called "critical points" that was not queried from SO. This column was compiled by the researcher through manual inspection and notes important information about each post. The following filters were used to query these posts:
1. Sort - Relevance
2. Order - Descending
3. Starting date - 2017-01-01
4. Ending date - 2023-05-31
5. Title - The search terms provided (advice, best-practices, etc.)
6. Body - Same search terms provided
7. Tagged - The tags that need to be linked to the post (test;tests;testing;software-testing)
8. Additional parameters:
(a) withbody - To query the content of the post
(b) has_more - To query subsequent pages until the search results or the API limit are exhausted
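
The paging behaviour described under the additional parameters can be sketched as a loop that requests page after page while the response reports more results. This is a hedged illustration, not the original script: collect_all, fetch_page, and max_pages are hypothetical names, and the response shape (an "items" list plus a "has_more" flag) is assumed to follow the Stack Exchange API response wrapper.

```python
def collect_all(fetch_page, max_pages=25):
    """Collect posts from consecutive result pages.

    fetch_page is any callable that takes a 1-based page number and returns
    a dict with an "items" list and a "has_more" boolean. max_pages is an
    assumed safety cap standing in for the API limit.
    """
    items = []
    page = 1
    while page <= max_pages:
        response = fetch_page(page)
        items.extend(response.get("items", []))
        # Stop as soon as the API reports that no further pages exist.
        if not response.get("has_more", False):
            break
        page += 1
    return items


# Stand-in for a real HTTP call, so the loop can be tried offline.
def fake_fetch(page):
    pages = {
        1: {"items": [{"title": "post A"}, {"title": "post B"}], "has_more": True},
        2: {"items": [{"title": "post C"}], "has_more": False},
    }
    return pages[page]


posts = collect_all(fake_fetch)
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, so it can be checked without spending API quota.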

Both sets of files described above contain the title, body, tags, view count, number of answers, and link of each post. No personal information was recorded or queried.
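
Because the two sets of files differ only in whether the file name ends with a year, they can be told apart by name. The sketch below shows one way to do this; parse_filename is a hypothetical helper, and the example file names are illustrative rather than a listing of the dataset.

```python
import re

# Matches "search_term-testing-20xx.csv" as well as names without a year.
FILENAME_PATTERN = re.compile(r"^(?P<term>.+?)(?:-(?P<year>20\d{2}))?\.csv$")


def parse_filename(name):
    """Split a file name into (search term, year).
    The year is None for the files that cover the whole period."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a dataset .csv file name: {name}")
    year = match.group("year")
    return match.group("term"), int(year) if year else None


# Illustrative names only.
per_year = parse_filename("unit-testing-2017.csv")
whole_period = parse_filename("advice.csv")
```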

Apart from the files that were queried from SO, there are 2 other files that were compiled/written entirely by the researcher: "SOScript.py" and "SO data.xlsx". The first is a Python script that was used to query the files mentioned above. The script has a few comments explaining what the code does. The second file compiles the results of the frequency analysis of the above-mentioned files and contains other information that was extracted directly from SO. This metadata was also visualized using Excel's data analysis feature.
