DataONE:Protocols/Find GEO reuses

2019.8.11
zhaochenxu

Identify reuses of GEO datasets

Aim

The aim of this protocol is to collect data on the reuses of datasets in the published literature. This particular protocol focuses on reuses of gene expression microarray datasets stored in NCBI's Gene Expression Omnibus (GEO) repository and tracks reuses attributed through accession numbers within the full text of articles in PubMed Central.

Background

Little research has been done on the patterns and prevalence of data reuse. A few superstar success stories need no analysis: data from GenBank and the Protein Data Bank are reused heavily and successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. There have been a few surveys, but they suffer from limited scope and self-reporting biases. Download stats are poorly correlated with perceived value <<citation?>>. So let’s track reuse in the published literature.

Unfortunately, there are no well-established attribution formats and standards for data to facilitate the sort of automated citation analysis that bibliometricians perform with journal articles. Following the track of data is difficult in several additional ways: datasets do not have unambiguous identifiers, attribution is often within full text and thus difficult to query across journals and disciplines, and it is difficult to disambiguate the mention of a dataset in the context of reuse from the mention of a dataset deposit.

Restricting our focus to gene expression microarray data helps to address several of these issues. First, most shared gene expression microarray data is shared in one central repository: the NCBI's Gene Expression Omnibus (GEO). It is common practice to refer to datasets by their GEO accession numbers, and the GEO accession numbers have a fairly unique format. Furthermore, most creations and reuses of gene expression microarray data in the published literature are indexed by PubMed and are increasingly (as per NIH mandate) available for full-text query in PubMed Central. The coordinated Entrez databases and eUtils web service mean that full text can be queried automatically, links between articles and datasets can be monitored, and standard indexing metadata can be collected. All disciplines should be so lucky.
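As a rough illustration of what querying full text automatically can look like, here is a minimal sketch that builds an eUtils ESearch URL against the PubMed Central database. The helper name `build_pmc_search_url` and the example accession are assumptions made for illustration; this is not the protocol's own code, and it only constructs the URL rather than sending the request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(term, retmax=100):
    """Return an ESearch URL that runs `term` against the PMC database.

    `build_pmc_search_url` is a hypothetical helper, not part of the protocol.
    """
    params = {"db": "pmc", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Illustrative only: an arbitrary accession number used as the search term.
example_url = build_pmc_search_url("GSE2034")
print(example_url)
```

Fetching the URL returns an XML list of matching article IDs. Anyone running this at scale should follow NCBI's usage guidelines for request rates.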

Below, then, is a protocol for using these resources to collect information on reuse. Please note the limitations section, and contribute if you have other ideas!

Protocol Overview

Optionally:

Materials

Online connection

Installed software

Used python source code:

NOTE: I'm still getting my git together, so the code at the above links may not be fully standalone or easily run by others. I'm working on it... in the meantime, feel free to email me if you want details!

Procedure

Summary

Accession number formats
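A minimal sketch of how accession numbers might be recognised in article text. It assumes the publicly documented GEO prefixes GSE (series), GDS (dataset), GSM (sample), and GPL (platform), each followed by digits; the exact formats this protocol searched for are not listed here, so treat the pattern and the helper name `find_geo_accessions` as assumptions.

```python
import re

# Assumed pattern: a GEO prefix (GSE, GDS, GSM, or GPL) followed by digits.
GEO_ACCESSION = re.compile(r"\bG(?:SE|DS|SM|PL)\d+\b")


def find_geo_accessions(text):
    """Return the distinct GEO-style accession numbers found in `text`, sorted."""
    return sorted(set(GEO_ACCESSION.findall(text)))


# Hypothetical sentence, written only to exercise the pattern.
sample = "Data are in GSE1234 (samples GSM5678, GSM5679) on platform GPL96."
print(find_geo_accessions(sample))
```

Requiring at least one digit keeps unrelated terms such as GSEA from matching, but a pattern this loose can still pick up identifiers from other sources that happen to share the same shape.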

Exclude data creation studies
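Once the data-creation filters in this section have been run, the exclusion itself can be thought of as a set difference: articles that mention an accession number, minus articles flagged as describing a data deposit. The sketch below assumes both result sets are already available as lists of PMCIDs; the function name and the IDs are hypothetical.

```python
def exclude_data_creation(mentioning_pmcids, creation_pmcids):
    """Return PMCIDs that mention an accession but were not flagged as data creation."""
    return sorted(set(mentioning_pmcids) - set(creation_pmcids))


# Hypothetical IDs for illustration only.
mentions = ["PMC1", "PMC2", "PMC3", "PMC2"]
creations = ["PMC2"]
print(exclude_data_creation(mentions, creations))
```

Whatever remains after the subtraction is only a candidate reuse; manual review may still be needed for borderline cases.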

 (geo OR omnibus) 
 AND microarray 
 AND "gene expression"       
 AND accession
 NOT (databases 
        OR user OR users
        OR (public AND accessed) 
        OR (downloaded AND published))

 "gene expression omnibus" AND (submitted OR deposited)

Estimate what percentage of reusers weren't the original authors

Is the PMC paper by the same investigators as those who originally created the data?
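One crude way to approximate this check is to compare author lists on last name plus first initial. This is a sketch under strong assumptions: names are given in the PubMed-style "Lastname Initials" form, and any shared (last name, first initial) pair counts as the same investigator. Both helper names are hypothetical, and real author disambiguation is considerably harder than this.

```python
def normalize_author(name):
    """Reduce 'Lastname Initials' to a (last name, first initial) key."""
    parts = name.strip().rsplit(None, 1)
    last = parts[0].lower()
    # Keep only the first initial; empty if no initials were given.
    initial = parts[1][0].lower() if len(parts) > 1 else ""
    return (last, initial)


def shares_author(dataset_authors, paper_authors):
    """Return True if the two author lists share at least one normalized name."""
    dataset_keys = {normalize_author(a) for a in dataset_authors}
    paper_keys = {normalize_author(a) for a in paper_authors}
    return bool(dataset_keys & paper_keys)


# Hypothetical author lists for illustration only.
print(shares_author(["Piwowar HA", "Chapman WW"], ["Smith J", "Chapman W"]))
```

Common surnames will produce false matches with this approach, so the result is better read as a flag for manual review than as an answer.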

Extrapolate from PubMed Central to PubMed

 number of articles in PMC:  6311, 
 number of articles in PubMed:  21569, 
 so PMC contains 29.26% of related papers
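The arithmetic behind this extrapolation can be sketched as follows. The two article counts come from the figures above; the scaling function assumes that reuse is as common in PubMed articles outside PMC as inside it, which is a simplification, and the name `extrapolate_to_pubmed` is hypothetical.

```python
pmc_articles = 6311
pubmed_articles = 21569

# Fraction of the related PubMed articles that are also in PMC.
pmc_fraction = pmc_articles / pubmed_articles
print(f"PMC contains {pmc_fraction:.2%} of related papers")


def extrapolate_to_pubmed(count_in_pmc, fraction=pmc_fraction):
    """Scale a count observed in PMC up to an estimate for all of PubMed."""
    return count_in_pmc / fraction


# Hypothetical count of reuses observed in PMC, for illustration only.
print(round(extrapolate_to_pubmed(100)))
```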

Validation

Variants

Reuses of ArrayExpress datasets

Application

Example data

Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:
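Data in this shape, one (accession number, PMCID) pair per row, can be summarised as the number of distinct papers mentioning each dataset. The sketch below assumes the rows have already been loaded as tuples; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def count_papers_per_accession(pairs):
    """Return {accession: number of distinct PMCIDs that mention it}."""
    papers = defaultdict(set)
    for accession, pmcid in pairs:
        papers[accession].add(pmcid)
    return {accession: len(pmcids) for accession, pmcids in papers.items()}


# Hypothetical rows for illustration only; the duplicate row is counted once.
rows = [
    ("GSE1", "PMC10"),
    ("GSE1", "PMC11"),
    ("GSE2", "PMC10"),
    ("GSE1", "PMC10"),
]
print(count_papers_per_accession(rows))
```

Using a set per accession means a paper that mentions the same dataset several times contributes a single count.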

Potential uses

Known uses

Assumptions, Limitations, and Unknowns

This protocol captures a subset of all dataset reuses because of several limitations:

Furthermore, extrapolations based on this data may be biased:

Open Questions

Possible Enhancements

  1. Torvik VI and Smalheiser NR. Author Name Disambiguation in MEDLINE. ACM Trans Knowl Discov Data 2009 Jul 1; 3(3). pmid:20072710. [Authority2009]

Related references

  1. Piwowar HA. Studying Reuse Of GEO Datasets In The Published Literature. Research Remix. July 5 2010. Blog post. [Piwowar-blogGauntlet]

  2. Piwowar HA and Chapman WW. Identifying data sharing in biomedical literature. AMIA Annu Symp Proc 2008 Nov 6 596-600. pmid:18998887. [Piwowar-AMIA2008]

  3. Piwowar HA and Chapman WW. Linking database submissions to primary citations with PubMed Central. BioLINK 2008, Toronto, Canada. [Piwowar-BioLINK2008]

Notes

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input! Please sign your name to your note by adding *~~~~: to the beginning of your tip.

  1. List troubleshooting tips here.

  2. Anecdotal observations that might be of use to others can also be posted here.

