PS4 Dataset Released

I’m pleased to share PS4, the largest open-source dataset for protein secondary structure prediction. Alongside the dataset, I’m sharing PS4-Mega and PS4-Conv, two new state-of-the-art algorithms for predicting protein secondary structure.

If you've ever worked with protein secondary structure and machine learning, you know that the existing datasets are fragmented and redundant with one another, which makes it hard to know how reliable your evaluation results are. In many cases the included proteins aren't even identified, so researchers have no easy way to check for that redundancy themselves. In PS4, every protein is identified by its PDB code and non-redundancy is guaranteed, including against CB513, another major benchmark in secondary structure (SS) prediction.

All code and data are fully open source, to facilitate reproducibility and to help the bioinformatics community develop these ideas further.