# Emergence of novel SARS-CoV-2 variants in the Netherlands
This is the data repository accompanying our publication, "Emergence of novel SARS-CoV-2 variants in the Netherlands", in *Scientific Reports*. It contains the data required to conduct our study of SARS-CoV-2 genomes, in which we explore the viral population diversity in the Netherlands within a global context. Our work lends insight into the genetic variation of SARS-CoV-2 in the later stages of the pandemic, in April and early May. We also provide output files where necessary.

## Description of the dataset
Complete, high-quality genome sequences of SARS-CoV-2 (fewer than 1% undetermined bases across the whole sequence), isolated from human hosts only, were obtained from GISAID, NCBI and China’s National Genomics Data Center (NGDC) on June 13th. The dataset contained 29,503 sequences with unique identifiers in total, including the Wuhan-Hu-1 reference sequence (accession ID NC_045512.2). The “Collection date” field was also extracted for all sequences; it is referred to as “date” throughout this work. The acknowledgement table for GISAID sequences can be found in Supplementary file 2, and the full list of sequence identifiers for NCBI and NGDC records is provided in Supplementary file 3.
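
The quality criterion above (fewer than 1% undetermined bases) can be illustrated with a short Python sketch. This is not the pipeline used in the study: it assumes undetermined bases are encoded as `N`, and the function names and toy records are hypothetical.

```python
def undetermined_fraction(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are 'N' (case-insensitive)."""
    if not sequence:
        return 1.0  # treat an empty record as fully undetermined
    return sequence.upper().count("N") / len(sequence)


def is_high_quality(sequence: str, max_fraction: float = 0.01) -> bool:
    """Keep a genome only if its undetermined fraction is below the threshold."""
    return undetermined_fraction(sequence) < max_fraction


# Toy records (hypothetical identifiers, not real genomes).
records = {
    "toy_clean": "ACGT" * 250,              # 1000 bases, no N
    "toy_noisy": "ACGT" * 245 + "N" * 20,   # 1000 bases, 2% N
}
kept = [name for name, seq in records.items() if is_high_quality(seq)]
```

In this toy example only `toy_clean` passes the filter, since `toy_noisy` exceeds the 1% threshold.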

You can find the final collection of SARS-CoV-2 genome sequences (after preprocessing and cleaning the metadata) in this repository. The `sarscov2_sequences.fasta` file contains the DNA sequences of 29,503 SARS-CoV-2 genomes in FASTA format. We also provide the genome metadata file (`sarscov2_metadata.tsv`) and the mutations obtained from the *coronapp* web application (`sarscov2_mutations_coronapp.tsv`).
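
As a minimal sketch of how the metadata file might be read, the following Python snippet uses only the standard `csv` module. The `country` and `date-fixed` column names come from the field list in this README; the `strain` column name, the toy rows and the spelling `Netherlands` are assumptions made for illustration.

```python
import csv
import io


def rows_for_country(handle, country):
    """Yield metadata rows (as dicts) whose `country` field equals `country`."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["country"] == country:
            yield row


# Toy stand-in for sarscov2_metadata.tsv (hypothetical IDs and values).
toy_tsv = (
    "strain\tdate-fixed\tgisaid-clade\tregion\tcountry\n"
    "toy/001\t2020-04-02\tG\tEurope\tNetherlands\n"
    "toy/002\t2020-03-15\tS\tAsia\tChina\n"
    "toy/003\t2020-05-01\tGR\tEurope\tNetherlands\n"
)
dutch = list(rows_for_country(io.StringIO(toy_tsv), "Netherlands"))
dates = sorted(row["date-fixed"] for row in dutch)
```

To run this against the real file, pass a handle opened with `open("sarscov2_metadata.tsv", newline="")` instead of the `io.StringIO` object. Because `date-fixed` is in YYYY-MM-DD format, sorting the strings also sorts them chronologically.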

- `sarscov2_metadata.tsv`: A tab-separated file that contains all the metadata for the 29,503 SARS-CoV-2 genomes used in our work. The table is indexed by genome sequence ID and consists of 27 fields. The fields used in the analysis are:
  - `date-fixed`: The date of submission, fixed so that all samples have a full date in the same format (YYYY-MM-DD).
  - `gisaid-clade`: The clade assigned to the sample based on the mutations it carries, following the GISAID clade designations.
  - `region`: The geographical region where the virus was sampled.
  - `country`: The country where the virus was sampled.
- `sarscov2_sequences.fasta`: DNA sequences of 29,503 SARS-CoV-2 genomes in FASTA format.
- `sarscov2_mutations_coronapp.tsv`: Mutations obtained from the *coronapp* web application. After running the `mafft` alignment on all SARS-CoV-2 sequences, we filtered the alignment to remove identical sequences, trimmed it to remove the gaps from the Wuhan-Hu-1 reference (accession ID NC_045512.2), and uploaded the resulting alignment file to *coronapp*. Since *coronapp* has recently enforced a limit on the maximum number of sequences allowed, we provide the original output in the 4TU.ResearchDatabase.
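
The alignment post-processing mentioned for the *coronapp* input (removing identical sequences, then trimming gaps from the reference) can be sketched in Python as below. This is an illustrative interpretation, not the code used in the study: it assumes the alignment is held as a dictionary of equal-length strings, that "trimming" means dropping every column where the reference row has a gap (`-`), and that the first copy of each duplicate is kept. The function names and toy rows are hypothetical.

```python
def drop_identical(alignment):
    """Keep the first record seen for each distinct aligned sequence."""
    seen = set()
    unique = {}
    for name, seq in alignment.items():
        if seq not in seen:
            seen.add(seq)
            unique[name] = seq
    return unique


def trim_to_reference(alignment, reference_id):
    """Drop every alignment column in which the reference row has a gap ('-')."""
    reference = alignment[reference_id]
    keep = [i for i, base in enumerate(reference) if base != "-"]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment (hypothetical rows; real input would be the mafft output).
toy = {
    "NC_045512.2": "AC-GT-A",
    "toy/001": "ACTGTCA",  # insertions relative to the reference
    "toy/002": "AC-GT-A",  # identical to the reference row
    "toy/003": "AT-GT-A",  # one substitution
}
trimmed = trim_to_reference(drop_identical(toy), "NC_045512.2")
```

After these two steps every remaining row has the same length as the ungapped reference, so positions can be read directly as reference coordinates.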


## Citation
- If you use our curated collection of genomes, or if you find any part of our work in this repository useful, please cite the [GitHub repository](https://github.com/aysunrhn/sarscov2-variants/) or the original publication:

```bibtex
@article{urhan2021sarscov2,
    author = {Aysun Urhan and Thomas Abeel},
    title = {Emergence of novel SARS-CoV-2 variants in the Netherlands},
    journal = {Scientific Reports},
    volume = {11},
    number = {1},
    pages = {6625},
    year = {2021},
    publisher = {Nature Publishing Group UK London},
    doi = {10.1038/s41598-021-85363-7},
}
```