BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning
Paper: arXiv:2507.14468
This dataset contains the benchmark data used in the paper "BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning" published in Bioinformatics.
The dataset includes three biomedical knowledge graph completion tasks with background knowledge integration:
| Dataset | Task | Background Knowledge Sources | Main Dataset (Target) | Total Triples |
|---|---|---|---|---|
| Disease-Gene Prediction | Disease-gene association prediction | Drug-disease relationships: SIDER (14,631); protein-chemical relationships: STITCH (277,745) | DisGeNet (130,820); target: gene | ~423K |
| Protein-Chemical Interaction | Protein-chemical interaction prediction | Drug-disease relationships: SIDER (14,631); disease-gene relationships: DisGeNet (130,820) | STITCH (23,074); target: chemical | ~168K |
| Medical Ontology Reasoning | Medical concept reasoning | Various medical relationships: UMLS (4,006) | UMLS (2,523); target: multi-domain entities | ~6.5K |
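The totals in the last column can be cross-checked against the per-source counts. The short sketch below assumes that "Total Triples" is simply the sum of the background-knowledge counts and the main-dataset count in each row; that reading is an assumption, not something the dataset card states. The variable names are illustrative only.

```python
# Triple counts copied from the table above.
SIDER = 14_631
STITCH_BACKGROUND = 277_745
DISGENET = 130_820
STITCH_MAIN = 23_074
UMLS_BACKGROUND = 4_006
UMLS_MAIN = 2_523

# Assumption: each task's total is background triples plus main-dataset triples.
totals = {
    "Disease-Gene Prediction": SIDER + STITCH_BACKGROUND + DISGENET,
    "Protein-Chemical Interaction": SIDER + DISGENET + STITCH_MAIN,
    "Medical Ontology Reasoning": UMLS_BACKGROUND + UMLS_MAIN,
}

for task, count in totals.items():
    print(f"{task}: {count:,}")
```

Under that assumption the sums are 423,196, 168,525, and 6,529, which agree with the rounded figures of ~423K, ~168K, and ~6.5K.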
```python
from datasets import load_dataset

# Load the complete dataset
dataset = load_dataset("Y-TARL/BioGraphFusion")

# Load specific task
disgenet_data = load_dataset("Y-TARL/BioGraphFusion", "Disease-Gene")
stitch_data = load_dataset("Y-TARL/BioGraphFusion", "Protein-Chemical")
umls_data = load_dataset("Y-TARL/BioGraphFusion", "umls")
```
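After loading, it is often useful to check how many rows each split contains. The helper below is a minimal sketch: `summarize_splits` is a hypothetical name, and the split names and toy triples in the stand-in are illustrative assumptions, not the actual contents of this dataset. Because a `DatasetDict` behaves like a dictionary of splits that each support `len()`, the same helper should also accept an object returned by `load_dataset`.

```python
from collections.abc import Mapping, Sized


def summarize_splits(splits: Mapping[str, Sized]) -> dict[str, int]:
    """Return the number of rows in each split of a dict-like collection."""
    return {name: len(rows) for name, rows in splits.items()}


# Stand-in for a loaded dataset so the sketch runs without network access.
# The split names and triples here are made up for illustration.
toy_splits = {
    "train": [("disease_1", "associated_with", "gene_1"),
              ("disease_2", "associated_with", "gene_2")],
    "test": [("disease_1", "associated_with", "gene_3")],
}

print(summarize_splits(toy_splits))
```

With a real download, calling `summarize_splits(disgenet_data)` would report whatever splits that configuration actually provides.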
If you use this dataset in your research, please cite our paper:
```bibtex
@article{lin2025biographfusion,
  title={BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning},
  author={Lin, Yitong and He, Jiaying and Chen, Jiahe and Zhu, Xinnan and Zheng, Jianwei and Tao, Bo},
  journal={Bioinformatics},
  pages={btaf408},
  year={2025},
  publisher={Oxford University Press}
}
```
This dataset is released under the Apache 2.0 License.
We thank the original data providers: DisGeNet, STITCH, SIDER, and UMLS.
For questions about the dataset, please open an issue in the GitHub repository.