Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained on a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts present within text regions (sentences), and 3) identifying the text regions in a document that provide evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MEDLINE documents with MeSH annotations.
To tap into the wealth of knowledge in biomedical literature, it is imperative to develop computational methods that automatically extract information from free text and encode it in a computable form, so that a knowledgebase populated with such information can be used for reasoning and deriving new knowledge. In the bioinformatics domain, the controlled vocabulary developed by the Gene Ontology Consortium [1] is the de facto standard for representing knowledge about genes and proteins from a molecular biology perspective. While the accurately curated knowledgebase from the GO Consortium provides invaluable information, manual processes of encoding knowledge cannot keep up with the ever-increasing rate of knowledge accumulation [2]. The need for computational approaches to address this challenge has led to recent advances in biomedical natural language processing (BioNLP) and text mining [3-5]. Many research efforts have been devoted to identifying biological concepts in free text and mapping them to GO terms, including the community-wide challenge tasks in BioCreative [6-8] and the TREC Genomics Track [9].
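The evidence-identification task of the scLDA model can be illustrated with a small Python sketch. It assumes the generic correspondence-LDA setup, in which each annotation is linked to one sentence chosen uniformly at random and is then drawn from a GO-term distribution tied to that sentence's topic. The text here does not give the model's equations, so this is an illustration under that assumption and not the authors' implementation. The function name `evidence_posterior`, the table `TOY_GO_GIVEN_TOPIC`, and all probabilities in it are hypothetical toy values.

```python
from typing import Dict, List

# Hypothetical toy values: P(GO term | topic) for two topics.
# These numbers are made up for illustration, not learned from data.
TOY_GO_GIVEN_TOPIC: Dict[int, Dict[str, float]] = {
    0: {"GO:0006915": 0.1, "GO:0005634": 0.9},
    1: {"GO:0006915": 0.8, "GO:0005634": 0.2},
}


def evidence_posterior(
    sentence_topics: List[int],
    go_given_topic: Dict[int, Dict[str, float]],
    go_term: str,
) -> List[float]:
    """Return P(sentence | GO term) for every sentence in one document.

    Assumes one topic per sentence and a uniform prior over sentences,
    so the posterior is proportional to P(go_term | topic of sentence).
    """
    if not sentence_topics:
        return []
    weights = [go_given_topic[z].get(go_term, 0.0) for z in sentence_topics]
    total = sum(weights)
    if total == 0.0:
        # No topic in the document supports this term: fall back to the prior.
        return [1.0 / len(sentence_topics)] * len(sentence_topics)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three sentences with assumed topic assignments 0, 1, 0.
    posterior = evidence_posterior([0, 1, 0], TOY_GO_GIVEN_TOPIC, "GO:0006915")
    best = max(range(len(posterior)), key=posterior.__getitem__)
    print("posterior over sentences:", posterior)
    print("most likely evidence sentence index:", best)
```

In this sketch the sentence with the highest posterior is the candidate textual evidence for the annotation. In a trained model, the topic assignments and the GO-term distributions would come from inference over the annotated corpus instead of a hand-written table.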