TY - DATA
T1 - Papyrus - A large scale curated dataset aimed at bioactivity predictions
PY - 2022/04/04
AU - Olivier Béquignon
AU - Brandon Bongers
AU - W. (Willem) Jespers
AU - Adriaan P. IJzerman
AU - Bob van de Water
AU  - Gerard JP van Westen
UR - https://data.4tu.nl/articles/dataset/Papyrus_-_A_large_scale_curated_dataset_aimed_at_bioactivity_predictions/16896406/3
DO - 10.4121/16896406.v3
KW - Papyrus
KW  - machine learning predictions
KW - cheminformatics tool
KW - bioactivity data
KW - curated dataset
N2 - <p>This repository contains the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the manuscript "Papyrus - A large scale curated dataset aimed at bioactivity predictions" (Work in Progress).</p><p>With the
recent rapid growth of publicly available ligand-protein bioactivity data,
there is a trove of viable data that can be used to train machine learning
algorithms. However, not all data is equal in terms of size and quality, and a
significant portion of researchers’ time is needed to adapt the data to their
needs. On top of that, finding the right data for a research question can often
be a challenge on its own. To address this, we have constructed the
Papyrus dataset, comprising around 60 million data points. This dataset
combines multiple large publicly available datasets, such as ChEMBL and
ExCAPE-DB, with smaller datasets containing high-quality data. This
aggregated data has been standardised and normalised in a manner that is
suitable for machine learning. We show how data can be filtered in a variety of
ways, and also perform some rudimentary quantitative structure-activity
relationship and proteochemometrics modeling. Our ambition is to create a
benchmark set that can be used for constructing predictive models, while also
providing a solid baseline for related research.</p>
ER -