Tuesday, March 24, 2015

DOIs for Stats Canada and Co. Datasets


While trying to work on solutions for research data identifiers (DOIs in our case), we have stumbled upon some unique issues with the Stats Canada datasets you are all familiar with.

In a nutshell, we have more than 27,000 data files that our ever-amazing Paul Lesack is managing for four BC schools for licensed data in our Dataverse <http://dvn.library.ubc.ca/dvn/>.

As we are planning to assign DOIs to any Dataverse files on our end, we are wondering about the federal government datasets. These do not seem to have any unique persistent identifiers (e.g. handles or DOIs). If we assign DOIs to these files on our end and, say, five other schools assign DOIs to the same datasets, we will have a small army of DOIs floating around for the same data.

Working with NRC/CISTI as a Canadian DataCite agency, we seem to be the first ones to deal with this issue. Mind you, we also plan to assign DOIs to all our digital objects (CONTENTdm, DSpace, etc). Still, we are really interested to hear your thoughts.

I have recommended that CISTI proactively approach Stats Can and Health Can (and others) about assigning DOIs to content they produce. Would you agree with this practice?

And please excuse my ignorance, as I am not really a data librarian but am trying to build a research data service for our large campus.


I would agree with you that CISTI should proactively approach STC and other gov. depts. about DOIs. It is an issue which was first mentioned a couple of years ago to DLI and which needs to be addressed.

You have hit the nail on the head when you state that different schools/repositories cannot all be assigning unique DOIs to the same STC dataset. At Scholars Portal we have discussed this before with respect to <odesi> and we, like you, had concerns with the idea of there being many DOIs floating around for the same data.

Personally, I have no problem with assigning DOIs to the locally hosted datasets, but then again these datasets will have numerous persistent identifiers depending on the hosting institution. I am guessing the DLI Nesstar server would not work at this stage for this?

However, in regards to how DOIs work, it is my understanding that a DOI registered by Statistics Canada would resolve to a webpage maintained by them. So, any time this DOI were cited somewhere it would point back to Statistics Canada, which makes sense. Except what if you specifically wanted to point people back to the copy of the dataset housed in your Dataverse? Would you ever want to do that? For instance, we recently noticed that at least one third party (in the US) has registered at least one DOI for StatCan data: <http://data.datacite.org/10.6068/DP14A4B06A47153>. Not sure if this is just poor practice, or if it is something that will occur on a regular basis.
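As a rough illustration of the point above, here is a minimal Python sketch that splits a DOI name into its prefix and suffix and builds a resolver link. The function names are hypothetical, and the use of https://doi.org/ as the resolver is an assumption on my part. The prefix is the part issued to whoever registered the DOI, which is not necessarily the data producer, so the third-party example above carries the third party's prefix rather than one belonging to Statistics Canada.

```python
from urllib.parse import quote


def split_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash.

    Hypothetical helper for illustration only.
    """
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:]
    prefix, sep, suffix = cleaned.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError("not a DOI name: " + repr(doi))
    return prefix, suffix


def resolver_url(doi, resolver="https://doi.org/"):
    """Build a resolver link; the resolver base URL is an assumption."""
    prefix, suffix = split_doi(doi)
    # Percent-encode unusual characters in the suffix, keeping slashes.
    return resolver + prefix + "/" + quote(suffix, safe="/")


# The third-party DOI mentioned above.
prefix, suffix = split_doi("10.6068/DP14A4B06A47153")
print(prefix)   # the registrant's prefix, not a StatCan one
print(resolver_url("doi:10.6068/DP14A4B06A47153"))
```

Whichever page the registrant has recorded for that DOI is where the link ends up, which is why a DOI registered by Statistics Canada would point back to them rather than to a local Dataverse copy.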

Dataverse allows us to host a variety of licensed and research data, even data on social housing of dairy calves <hdl.handle.net/11272/10178>. This is the blurb on NRC's website re DataCite: "NRC is a founding member of DataCite and is its DOI allocation agent for Canada." I believe this is the direction many are going with respect to DOIs and assignment for multiple iterations/versions/copies of data. However, there needs to be some authority record or reference identifier to properly identify data sets and the study.

At last year’s IASSIST conference there was a panel discussion on data discovery that mentioned the use of data set identifiers and I believe this issue was discussed by the panelists <http://www.library.yorku.ca/cms/iassist/program/sb4/#sb4o>. Many data organizations host the same data sets (in terms of the data's content), but the iterations/copies used for access are different (UBC hosted, vs. <odesi>, vs. DLI etc.). There is no metadata clearinghouse for data sets, DataCite being the closest thing.

I think there could be a practice in place where the original data producer assigned the DOI for the data set, and that DOI reference was carried forward into other iterations of the data set for reuse, with some customization of the identifier for different access points. Likewise, I believe there is a way to customize the DOI to include reference to other identifiers such as STC catalogue #s or the IMDB, etc., in cases where there is no DOI. For example the ISBN > DOI integration <http://www.doi.org/factsheets/ISBN-A.html>.

However, we need to be very careful with regard to assigning DOIs, especially where we may be assigning multiple DOIs to the same object. Such a practice is discouraged by doi.org. The DOI Handbook <http://www.doi.org/hb.html> states:

"Each DOI® name is a unique "number", assigned to identify only one entity. Although the DOI system will assure that the same DOI name is not issued twice, it is a primary responsibility of the Registrant (the company or individual assigning the DOI name) and its Registration Agency to identify uniquely each object within a DOI name prefix.

Uniqueness (specification by a DOI name of one and only one referent) is enforced by the DOI system. It is desirable that two DOI names should not be assigned to the same thing."

Likewise, in regards to republished or duplicate datasets:

"We strongly recommend that DOIs be created only for ‘original’ datasets, not duplicate datasets. There may at times be a need to deposit duplicate copies of a dataset in multiple data centres, for example where a project has been funded by multiple funders and each funder requires deposition in a different data centre. If possible we would suggest identifying the primary version of the dataset and assigning a DOI to this version only. Where there is an unavoidable need to publish a dataset in different locations each with a separate DOI, the metadata for each appearance of the dataset should indicate the association."

<http://cisti-icist.nrc-cnrc.gc.ca/obj/cisti-icist/doc/datacite/datasets.pdf> (scroll to the bottom)
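To make the last sentence of that recommendation a little more concrete, here is a minimal Python sketch of how the metadata for a locally hosted copy might "indicate the association" with the primary version. Everything in it is an assumption for illustration: the dictionary layout is a simplification and not the actual DataCite XML schema, the DOIs and title are made-up placeholders, and the choice of `IsIdenticalTo` as the relation type reflects my understanding of the DataCite relation vocabulary rather than anything stated in the guidance quoted above.

```python
def duplicate_copy_record(local_doi, primary_doi, title):
    """Describe a hosted copy and link it to the primary dataset's DOI.

    Simplified, hypothetical record layout (not the real DataCite schema).
    """
    return {
        "identifier": {"identifierType": "DOI", "value": local_doi},
        "title": title,
        "relatedIdentifiers": [
            {
                "relatedIdentifierType": "DOI",
                "relationType": "IsIdenticalTo",  # assumed relation type
                "value": primary_doi,
            }
        ],
    }


def group_by_primary(records):
    """Group copy records by the primary DOI they declare themselves identical to."""
    groups = {}
    for record in records:
        for related in record["relatedIdentifiers"]:
            if related["relationType"] == "IsIdenticalTo":
                groups.setdefault(related["value"], []).append(
                    record["identifier"]["value"]
                )
    return groups


# Placeholder DOIs: one producer-assigned primary, two institutional copies.
copies = [
    duplicate_copy_record("10.5072/school-a-copy", "10.5072/producer-primary", "Example survey"),
    duplicate_copy_record("10.5072/school-b-copy", "10.5072/producer-primary", "Example survey"),
]
print(group_by_primary(copies))
```

With links like these in place, a harvester or catalogue could at least collapse the "small army" of institutional DOIs back to the one dataset they all describe.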