{"id":203,"date":"2021-01-24T13:08:38","date_gmt":"2021-01-24T19:08:38","guid":{"rendered":"https:\/\/lab.prd.vanderbilt.edu\/zyang-lab\/?p=203"},"modified":"2025-09-21T10:05:19","modified_gmt":"2025-09-21T16:05:19","slug":"integrated-database-construction","status":"publish","type":"post","link":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/2021\/01\/24\/integrated-database-construction\/","title":{"rendered":"Information Extraction and Database Construction"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/pubs.acs.org\/cms\/10.1021\/acs.jcim.2c01139\/asset\/images\/medium\/ci2c01139_0006.gif\" alt=\"Abstract Image\" \/><\/p>\n<p>The Yang lab is building an integrated enzymology data ecosystem that makes the \u201cdark matter\u201d of enzyme kinetics accessible for predictive modeling and method development. IntEnzyDB provides a fast, flattened relational architecture that unifies enzyme structure and function across six EC classes and exposes a public web interface for streamlined access; using 1,050 structure-kinetics pairs, we quantified how efficiency-enhancing mutations are globally encoded while deleterious effects concentrate near active sites, enabling facile statistical modeling and machine learning. To transform literature into model-ready data at scale, we developed EnzyExtract, a large language model pipeline that processes full-text PDFs\/XMLs to automatically extract, verify, and structure enzyme-substrate-kinetics records. From 137,892 publications, EnzyExtract assembled &gt;218,095 entries, including 218,095 kcat and 167,794 Km values, mapped across 3,569 unique four-digit EC numbers (84,464 entries assigned \u2265 first-digit EC). It uncovered 89,544 kinetic entries absent from BRENDA, and after aligning enzymes and substrates to UniProt and PubChem, yielded 92,286 high-confidence sequence-mapped records compiled as EnzyExtractDB. 
Benchmarking shows high accuracy against manual curation and strong consistency with BRENDA. Retraining state-of-the-art kcat predictors (MESI, DLKcat, TurNuP) on EnzyExtractDB improves RMSE, MAE, and R\u00b2 on held-out test sets. Together, IntEnzyDB and EnzyExtractDB supply the breadth, structure linkage, and quality needed to power generalizable, data-driven enzyme engineering.<\/p>\n<p>IntEnzyDB development was led by Bailu Yan and Xinchun Ran; EnzyExtractDB development was led by Galen Wei and Xinchun Ran.<\/p>\n<p><strong>Source code<\/strong>:<\/p>\n<p>https:\/\/github.com\/ChemBioHTP\/IntEnzyDB<\/p>\n<p>https:\/\/github.com\/ChemBioHTP\/EnzyExtract<\/p>\n<p><strong>Web Interface<\/strong>:<\/p>\n<p>https:\/\/colab.research.google.com\/drive\/1MwKSEZzLPNOseksRshbzkkFoO_cgJhva<\/p>\n<p><strong>Publications<\/strong>:<\/p>\n<p>https:\/\/onlinelibrary.wiley.com\/doi\/full\/10.1002\/pro.70251<\/p>\n<p>https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.2c01139<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Yang lab is building an integrated enzymology data ecosystem that makes the \u201cdark matter\u201d of enzyme kinetics accessible for predictive modeling and method development. 
IntEnzyDB provides a fast, flattened relational architecture that unifies enzyme structure and function across six EC classes and exposes a public web interface for streamlined access; using 1,050 structure-kinetics pairs,&#8230;<\/p>\n","protected":false},"author":253,"featured_media":652,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[9],"tags":[],"class_list":["post-203","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software"],"acf":[],"_links":{"self":[{"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/posts\/203","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/users\/253"}],"replies":[{"embeddable":true,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/comments?post=203"}],"version-history":[{"count":3,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/posts\/203\/revisions"}],"predecessor-version":[{"id":651,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/posts\/203\/revisions\/651"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/media\/652"}],"wp:attachment":[{"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/media?parent=203"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/categories?post=203"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lab.vanderbilt.edu\/zyang-lab\/wp-json\/wp\/v2\/tags?post=203"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}