Crowdsourcing data citation links

According to the Australian National Data Service, "data citation refers to the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to printed resources." Many popular citation guidelines have been extended with templates for data publication and citation. This enables more informed review and reuse of scientific work, as readers of scholarly publications can easily consult the relevant datasets and assess their quality. References to datasets could also become an integral part of bibliographic algorithms, adding data-specific statistics to traditional citation graphs. Going a step further, datasets could have their own form of citation: a dataset could be a composite of, derived from, a subset of, an aggregate of, or a new version of other datasets.

Within this project we aim to develop tools to support the a-posteriori provenance-enabled crowdsourcing of data citation graphs.

  • By a-posteriori we mean that the information is captured “after publication”, as opposed to “at the time of writing or submission”. This is motivated by the large number of publications and datasets that are already published but not interlinked.
  • Provenance-enabled refers to the fact that the collected data is modelled on an extension of the W3C PROV data model, and that the provenance of each participant's contributions is captured as a provenance bundle.
  • Crowdsourcing is the mechanism we use to elicit the required information to create the citation graph between publications and datasets.
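
To make the three properties above more concrete, the following is a minimal sketch of how crowdsourced citation links might be grouped per participant and then merged into one graph. It is an illustration only, not the project's implementation: the relation names, the `Link` and `Bundle` classes, the `merge` function and the identifiers such as `paper:1` are all hypothetical, and the sketch uses plain Python data classes rather than the actual PROV extension.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical relation names, loosely following the dataset relations
# mentioned in the text; the project's actual vocabulary may differ.
RELATIONS = {"uses", "derived_from", "subset_of", "aggregate_of", "version_of"}


@dataclass(frozen=True)
class Link:
    source: str    # identifier of a publication or dataset
    relation: str  # one of RELATIONS
    target: str    # identifier of a dataset


@dataclass
class Bundle:
    """Links asserted by one participant, kept together so that their
    origin stays traceable (in the spirit of a PROV bundle)."""
    contributor: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    links: list = field(default_factory=list)

    def assert_link(self, source, relation, target):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.links.append(Link(source, relation, target))


def merge(bundles):
    """Combine bundles into one graph that maps each link to the set of
    contributors who asserted it."""
    graph = {}
    for bundle in bundles:
        for link in bundle.links:
            graph.setdefault(link, set()).add(bundle.contributor)
    return graph


# Two hypothetical participants annotate already-published material.
alice = Bundle("alice")
alice.assert_link("paper:1", "uses", "dataset:A")
alice.assert_link("dataset:B", "subset_of", "dataset:A")

bob = Bundle("bob")
bob.assert_link("paper:1", "uses", "dataset:A")

graph = merge([alice, bob])
```

Keeping the contributor set per link is one possible design choice: it allows agreement between participants to be counted later without losing track of who asserted what.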

A first instantiation of the tool was used in a small user study during the 4th edition of the USEWOD workshop, held at ESWC2014. The tool is still available online at prov.usewod.org. Besides creating the actual data citation graph for the workshop publications and datasets, the study was designed to test several aspects of the system: the suitability of the vocabulary created, the perceived necessity of detailed usage metadata in contrast with simple general usage links, and the overall suitability of crowdsourcing for this kind of data collection. More details on the design and results are available in the related publications.