BioCADDIE Pilot 3.2

Note: this project has been archived because it is no longer actively maintained.

#Introduction

BioCADDIE Pilot 3.2 is a scalable data mining platform for cross-linking data and publications. This pilot project is part of the BioCADDIE project and provides tools for extracting data set mentions from full-text publications in the PubMedCentral Open Access Subset. Its initial focus is the extraction of data mentions for Protein Data Bank data sets, but the framework can be extended to other data resources. It also offers tools to analyze citation networks in PubMedCentral, using a number of network metrics to rank data mentions by importance.

This project operates on the following data sets:

- Protein Data Bank (PDB): >110,000 3D structures of biomolecules (current list)
- PubMedCentral Open Access Subset: >1 million free text articles (current list)

##What are Data Mentions?

Data mentions are references to data sets in publications. They fall into two categories:

1. Structured data mentions can be recognized by regular expression matching.
2. Unstructured data mentions require natural language processing and machine learning to distinguish valid from invalid data mentions.

Structured data mentions for PDB Identifiers

| Reference | Example |
|---|---|
| PDB ID | `PDB ID: 1STP` |
| PDB DOI | `http://dx.doi.org/10.2210/pdb1stp/pdb` |
| RCSB PDB URL | `http://www.rcsb.org/../structureId=1stp` |
| NXML External Link | `<ext-link .. ext-link-type="pdb" xlink:href="1STP">` |
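
The structured forms listed above can be recognized with regular expressions. The following is a minimal Python sketch of that idea, not the project's actual implementation: the `PATTERNS` table and the `find_structured_mentions` helper are hypothetical names, the patterns cover only the four example forms shown, and the NXML pattern assumes that `ext-link-type` precedes `xlink:href`.

```python
import re

# A PDB ID is a digit followed by three alphanumeric characters.
PDB_ID = r"[0-9][A-Za-z0-9]{3}"

# Hypothetical patterns, one per structured form in the table above.
PATTERNS = {
    "PDB ID": re.compile(r"PDB ID:\s*(" + PDB_ID + r")\b"),
    "PDB DOI": re.compile(r"10\.2210/pdb(" + PDB_ID + r")/pdb"),
    "RCSB PDB URL": re.compile(r"rcsb\.org/\S*structureId=(" + PDB_ID + r")\b"),
    "NXML External Link": re.compile(
        r'<ext-link[^>]*ext-link-type="pdb"[^>]*xlink:href="(' + PDB_ID + r')"'
    ),
}


def find_structured_mentions(text):
    """Return (reference type, upper-cased PDB ID) pairs found in text."""
    mentions = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((kind, match.group(1).upper()))
    return mentions


if __name__ == "__main__":
    print(find_structured_mentions("see http://dx.doi.org/10.2210/pdb1stp/pdb"))
```

IDs are normalized to upper case so that `1stp` and `1STP` resolve to the same entry.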

Unstructured data mentions for PDB Identifiers

| Type | Example |
|---|---|
| Valid PDB ID (4AHQ) | `The structure of the active site of the K165C enzyme (4AHQ) ...` |
| Invalid PDB ID (2C19) | `The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene ...` |
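
Both example sentences contain a token with the shape of a PDB ID, which is why pattern matching alone cannot separate valid from invalid mentions. The short Python sketch below only illustrates that point. The `find_candidates` helper is a hypothetical name and is not taken from the project; it collects every standalone four-character token that looks like a PDB ID, leaving the decision about validity to a later disambiguation step.

```python
import re

# A standalone token with the shape of a PDB ID:
# a digit followed by three alphanumeric characters.
CANDIDATE = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")


def find_candidates(text):
    """Return all tokens in text that have the shape of a PDB ID."""
    return CANDIDATE.findall(text)


if __name__ == "__main__":
    valid = "The structure of the active site of the K165C enzyme (4AHQ) ..."
    invalid = "The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene ..."
    # Both sentences yield a candidate; only the first one is a real data mention.
    print(find_candidates(valid))
    print(find_candidates(invalid))
```

Tokens embedded in longer words, such as `165C` in `K165C` or `2C19` in `CYP2C19`, are skipped because of the word boundaries, but ordinary four-digit numbers such as years still match, which is another reason a disambiguation model is needed.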

##Data Mention Extraction for PDB IDs

The extraction of data mentions involves the following steps:

1. Download PDB and PMC metadata
2. Download PMC OA full-text articles
3. Create positive and negative training/test sets for data mention disambiguation
4. Fit a machine learning model for data mention disambiguation
5. Predict PDB data mentions for all PMC OA articles

Data Mention Extraction details

Publication Network Analysis

Citation Network Analysis

##Project Status

This project is in active development. Expect major refactoring of current code.

##Want to Use or Contribute to this Project?

Contact us at pwrose@ucsd.edu

##Technology Stack

This project relies on the open-source technologies Apache Spark and Apache Parquet to make literature data mining fast and parallelizable.

###Apache Spark

Apache Spark is a fast and general framework for large-scale in-memory data processing. It runs locally or in an in-house or commercial cloud environment. We use Spark DataFrames to store, filter, sort, and join data sets.

###Apache Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. We store Spark DataFrames as Parquet files for high-performance data handling.
