Information Extraction and Database Construction

The Yang lab is building an integrated enzymology data ecosystem that makes the “dark matter” of enzyme kinetics accessible for predictive modeling and method development. IntEnzyDB provides a fast, flattened relational architecture that unifies enzyme structure and function data across six EC classes, which enables facile statistical modeling and machine learning, and it exposes a public web interface for streamlined access. Using 1,050 structure-kinetics pairs from IntEnzyDB, we showed that efficiency-enhancing mutations are globally encoded, whereas deleterious mutations concentrate near active sites.
To transform literature into model-ready data at scale, we developed EnzyExtract, a large language model pipeline that processes full-text PDFs/XMLs to automatically extract, verify, and structure enzyme-substrate-kinetics records. From 137,892 publications, EnzyExtract assembled more than 218,095 entries, including 218,095 kcat and 167,794 Km values, mapped across 3,569 unique four-digit EC numbers (84,464 entries are assigned at least a first-digit EC number). It uncovered 89,544 kinetic entries absent from BRENDA. After aligning enzymes to UniProt and substrates to PubChem, it yielded 92,286 high-confidence sequence-mapped records, compiled as EnzyExtractDB.
Benchmarking shows high accuracy against manual curation and strong consistency with BRENDA. Retraining state-of-the-art kcat predictors (MESI, DLKcat, TurNuP) on EnzyExtractDB improves RMSE, MAE, and R² on held-out test sets. Together, IntEnzyDB and EnzyExtractDB supply the breadth, structure linkage, and quality needed to power generalizable, data-driven enzyme engineering.
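For readers unfamiliar with the three benchmark metrics named above, the following is a minimal sketch of how RMSE, MAE, and R² can be computed for a kcat predictor on a held-out set. It is not taken from the EnzyExtract code base: the function name `regression_metrics`, the toy values, and the choice to score on a log10(kcat) scale are all assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for paired measured and predicted values.

    Hypothetical helper for illustration; not part of EnzyExtract.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares around the mean of the measured values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when all measured values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Toy held-out set: made-up log10(kcat) values (assumed scale).
measured = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 1.5, 3.0]

rmse, mae, r2 = regression_metrics(measured, predicted)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE and higher R² indicate better agreement between predicted and measured values, so an improvement after retraining would show up as movement in those directions on the same held-out set.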
IntEnzyDB development was led by Bailu Yan and Xinchun Ran; EnzyExtractDB development was led by Galen Wei and Xinchun Ran.
Software code:
https://github.com/ChemBioHTP/IntEnzyDB
https://github.com/ChemBioHTP/EnzyExtract
Web Interface:
https://colab.research.google.com/drive/1MwKSEZzLPNOseksRshbzkkFoO_cgJhva
Publications:
https://onlinelibrary.wiley.com/doi/full/10.1002/pro.70251
https://pubs.acs.org/doi/10.1021/acs.jcim.2c01139