DIGRS - Data Interest Group for Reference Services: March 2021

Monday, March 15, 2021

UBC Library Works With FRDR to Make Statistics Canada Data More Discoverable

October 2, 2020

Portage and the UBC Library’s Abacus Data Network are pleased to announce the addition of the Statistics Canada Open License Dataverse to the Federated Research Data Repository (FRDR) Discovery Service:

Statistics Canada Open License

The Statistics Canada Open License Dataverse includes more than 1,000 public use microdata files (PUMFs) as well as additional datasets that are not currently part of the federal Open Government Portal. The Abacus Data Network is a data repository collaboration that involves the University of British Columbia (UBC), Simon Fraser University (SFU), the University of Northern British Columbia (UNBC) and the University of Victoria (UVic).

FRDR currently has over 70 Canadian research data repositories included in its discovery service, with more being added.

For more information, please contact FRDR Support at support@frdr-dfdr.ca.

For support with Statistics Canada Open License data, please contact your local institution’s data librarian.

For Abacus Dataverse support, please contact jeremy.buhler@ubc.ca.

How to Cite Statistics Canada Products

February 18, 2021

An updated edition of How to Cite Statistics Canada Products (12-591-X) is now available. This guide aims to provide direction on the creation of bibliographic references for Statistics Canada's products and services.

In the absence of international standards, citing statistics and data has been a neglected grey area in academic publishing. This new edition of How to Cite Statistics Canada Products fills the gap by covering an even broader range of Statistics Canada and other statistical products and services.

GSS - Caregiving and Care Receiving 2018 (Cycle 32) PUMF

February 4, 2021

Question

Can you confirm if there is a PUMF for the General Social Survey – Caregiving and Care Receiving 2018 (Cycle 32)? I don’t see GSS Cycle 32 in the DLI EFT /MAD_PUMF_FMGD_DAM/Root/4502_GSS-Care_ESG-Soins folder.

Answer

GSS Cycle 32 is currently on hold. The subject matter team had to move resources over in order to work on other releases so they are hoping to begin working on it again by the summer.

More Data Now Discoverable Through Geodisy

February 10, 2021

Geodisy now includes a much larger collection of data for map-based search and discovery! Geodisy, an open-source geospatial discovery platform for Canadian open research data, previously harvested data exclusively from Scholars Portal Dataverse.

With this update, Geodisy now pulls directly from the Federated Research Data Repository (FRDR)’s harvested metadata collection, which includes content from over 70 Canadian research data repositories. In addition to indexing datasets from Dataverse repositories with either geospatial metadata or geospatial files, Geodisy now also includes datasets with bounding box metadata from all repositories harvested by FRDR. The team continues to work toward processing geospatial files and place name metadata from additional repository platforms.

For more information please contact support@frdr-dfdr.ca.

SHS 2017 Bootstrap Weights

February 12, 2021

Question

I have a prof using the 2017 SHS. They have a question about the bootstrap weights:

"I have a question you might be able to answer about these bootstrap weights. In order to use them in Stata, I need to know whether they were produced using the 'mean bootstrap' method and if so, I would need to specify how many samples were used to produce each weight in order to adjust the variance estimates to account for mean bootstrap weights (see for example https://www150.statcan.gc.ca/n1/pub/12-002-x/2014001/article/11901-eng.htm under “STATA 12”; the example refers to the GSS, but I’m assuming the procedure applies as well to the SHS PUMF)."

Maybe there is some literature we missed?

Answer

The bootstrap weights created for the 2017 SHS PUMF should be treated as regular bootstrap weights. You are right that the mean bootstrap method was not used, so the number of samples should be set to 1 (bsn=1).
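To illustrate what setting bsn=1 means, here is a minimal Python sketch of a bootstrap variance calculation for a weighted mean. It is not Statistics Canada's or Stata's code: the function names and the toy data are made up for illustration, and it assumes deviations are measured from the full-sample estimate. With mean bootstrap weights the sum of squared deviations is scaled up by the number of samples averaged into each weight; with regular bootstrap weights (bsn=1) no scaling happens.

```python
def weighted_mean(values, weights):
    """Weighted mean of values using one weight per respondent."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def bootstrap_variance(values, full_weights, replicate_weights, bsn=1):
    """Bootstrap variance of a weighted mean (illustrative helper).

    replicate_weights holds one weight vector per bootstrap replicate.
    bsn is the number of bootstrap samples averaged into each replicate
    weight; bsn=1 means regular bootstrap weights, so nothing is rescaled.
    """
    estimate = weighted_mean(values, full_weights)
    replicates = [weighted_mean(values, w) for w in replicate_weights]
    squared_deviations = sum((r - estimate) ** 2 for r in replicates)
    return bsn * squared_deviations / len(replicates)


# Made-up toy data, for illustration only (not SHS values).
values = [10.0, 20.0, 30.0]
full_weights = [1.0, 1.0, 1.0]
replicate_weights = [
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 0.0],
]

variance_regular = bootstrap_variance(values, full_weights, replicate_weights, bsn=1)
variance_if_mean_bootstrap = bootstrap_variance(values, full_weights, replicate_weights, bsn=2)
```

In this sketch, wrongly treating regular weights as mean bootstrap weights built from two samples each would double the variance estimate, which is why the bsn value matters.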

Canadian Internet Use Survey

February 10, 2021

Since the release of the 2018 Canadian Internet Use Survey (CIUS) microdata file, it has been noticed that the ‘Valid skip’ (valid skip = 6) and ‘Not stated’ (not stated = 9) categories are missing from some of the ‘Universe statements’ in the codebook (data dictionary).

In general, this issue only impacts questions that rely on a flow from a previous question. Please refer to the questionnaire flow document to ensure that you are using the proper universe when conducting research and creating indicators.

Note that this issue is being corrected, and a new version of the 2018 CIUS codebook will be released when it becomes available.
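As a rough illustration of why the universe matters when building indicators, the Python sketch below separates substantive answers from the two reserved codes named in the notice. It assumes a single-digit variable where 6 and 9 are the reserved codes; the yes/no item, its coding (1 = yes, 2 = no) and the function name are hypothetical, so check the codebook and questionnaire flow for each real variable.

```python
from collections import Counter

VALID_SKIP = 6  # per the notice: respondent was routed past the question
NOT_STATED = 9  # per the notice: no usable answer was recorded


def split_by_universe(responses):
    """Separate substantive answers from valid skips and not-stated cases."""
    answered = [r for r in responses if r not in (VALID_SKIP, NOT_STATED)]
    return {
        "answered": Counter(answered),
        "valid_skip": responses.count(VALID_SKIP),
        "not_stated": responses.count(NOT_STATED),
    }


# Made-up responses to a hypothetical yes/no item (1 = yes, 2 = no).
responses = [1, 2, 6, 1, 9, 6, 2, 1]
summary = split_by_universe(responses)

# One possible denominator: everyone who was not a valid skip.
in_universe = len(responses) - summary["valid_skip"]
```

Whether not-stated cases belong in the denominator is an analytical choice; the sketch only keeps them separate so that the choice is explicit.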

The Federated Research Data Repository (FRDR) is now in Full Production

February 3, 2021

Portage’s Federated Research Data Repository (FRDR) has officially launched into full production! Full production offers many new features and benefits:

Publish research data in a Canadian-owned, bilingual national repository option
1 TB of repository storage available to all faculty members at Canadian post-secondary institutions - more storage may be available upon request
Secure repository storage, distributed geographically across multiple Compute Canada Federation hosting sites
Data curation support provided by Portage
Ability to work with multiple collaborators on a single submission
Your data will be discoverable alongside other Canadian collections in the FRDR Discovery Portal

FRDR is designed to address a longstanding gap in Canada's research infrastructure by providing researchers with a robust repository option into which large research datasets can be ingested, curated, processed for preservation, discovered, cited, and shared.

The FRDR Discovery Portal enables discovery of and access to Canadian research data, while FRDR’s repository services will help researchers store and manage their data, preserve their research for future use, and comply with institutional and funding agency data management requirements.

FRDR is made possible through a collaboration between Portage, the Compute Canada Federation and the Canadian Association of Research Libraries, with development and infrastructure support from the University of Saskatchewan, Simon Fraser University, the University of Waterloo, and the University of Toronto.

Several Portage Expert and Working Groups have contributed to FRDR’s development, including the FRDR Policy Working Group, the FRDR User Experience and Training Working Group, the FRDR Discovery Service Working Group, the FAST & the FRDR Working Group, and the Data Repositories Expert Group. The FRDR Steering Committee has been instrumental in providing direction and governance for FRDR from inception to full production.

More information about FRDR and its partners can be found at www.frdr-dfdr.ca.

Portage is offering webinars on FRDR to help researchers, faculty, librarians, and others learn how to use the platform for data sharing, deposit, and discovery. See the announcement for more details.

If you have any questions or would like to know more about FRDR, please contact support@frdr-dfdr.ca.

Funding in support of the Portage Network’s stewardship of research data within Canada is administered through the New Digital Research Infrastructure Organization (NDRIO).

FRDR Metadata Available in ProQuest Central Discovery Index

February 1, 2021

We now have a way to make all datasets indexed by the Federated Research Data Repository (FRDR) findable in ProQuest’s Central Discovery Index (CDI). Academic libraries using Summon, Primo, or Alma can now easily include records from FRDR in their discovery service using the ProQuest CDI. Simply navigate to your library’s ProQuest Client Center, search for FRDR, and include it in your ProQuest subscriptions. Detailed instructions are available on FRDR. Many thanks to our partners at UBC Library for working with ProQuest on this initiative to increase the discoverability of Canadian research data!

FRDR is a steward of Canada’s largest index of research metadata, and currently harvests metadata from over 70 Canadian research data repositories for inclusion in its discovery service, with more repositories being added.

For more information please contact support@frdr-dfdr.ca.

Maternal Mental Health 2019 Dataset

January 25, 2021

Question

I have a student looking for the Maternal Mental Health 2019 dataset. Is that available?

Answer

Unfortunately we do not have this available as a PUMF.

Census 2006 Data Question

January 6, 2021

Question

I've pulled data for each CSD from the 2006 census using Beyond 20/20, and my advisor wanted me to ask, based on the info from StatsCan that I've copied below: how can I know if zeroes in my dataset represent data suppression or are true zeroes? He is thinking that those that were suppressed will likely need to be treated as missing for my analyses, since they aren't true zeroes, but I'm not sure how to accurately differentiate these. He's thinking that I can likely do this by looking at CSD population size and the number of total private households, and if neither of these thresholds is exceeded (as outlined below) then the zero is likely a true zero (likely not many of these), otherwise I can replace the other zeroes as missing values. Does that make sense?

Census Info

Area suppression for income characteristic data

Area suppression, when applied for data quality purposes, is used to replace all income characteristic data with zeroes for geographic areas with populations and/or number of households below a specific threshold.

If a census tabulation contains any data showing income characteristics for individuals, families or households, then the following rule applies. Income characteristic data are zeroed out for areas where the population is less than 250 or where the number of private households is less than 40. These thresholds are applied to 2006 Census data as well as all previous census data. The threshold of 40 private households is based upon the fact that weighted data are being used. With the weighting factor for each household being 5, setting a threshold of 40 ensures that there will be at least 8 households used in the calculation. The private household threshold does not apply for tabulations based on place of work geographies.

This seems to be what was happening in my data, as some variables for a single CSD have ‘.’ and others zeroes. Those with zeroes typically seem to relate to income, proportion of household spend etc.

Answer

Statistics Canada places the highest priority on maintaining the privacy and confidentiality of respondents. If necessary, data are suppressed to prevent direct or residual disclosure of identifiable data. Because of this and the data quality measures in place, your client will not be able to distinguish between all “true zeros” and these suppression zeros. Area suppression is one type of suppression which involves removing all characteristic data for geographic areas with populations below a specified size.

Having counts based on the geography should let them filter most of those out.

The specified sizes are:

250 people, if the table contains income data (and at least 40 private households, if the table also contains place-of-residence data)
100 people, if it is a six-character postal code area, that is, a local delivery unit (LDU), or if it is a custom area
40 people, in all other cases

In regards to your client’s question on individual cell suppression please see the following paragraph from Chapter One of the 2006 Overview of the Census: Dissemination Rules for Statistics:

“Tables are sometimes accompanied by statistics such as averages, totals and standard deviations. There are various ways of ensuring that these statistics do not reveal sensitive information; for instance, they may be suppressed or made less precise. Some statistics, such as totals, ratios and percentages, are based on the rounded values in the tables to which they apply. A statistic will be suppressed if there are too few data to compute it. In cases of data items expressed in dollars, if the statistic must be calculated from data where the values are too close or if a value is too high compared to the others, then the statistic will be suppressed.”

Depending on the income source variable, income medians and averages are almost never a true zero, so a zero in most of these fields indicates suppression. Counts that have been rounded to zero are a feature of the confidentiality system, and you cannot distinguish those rounded down from the true zeros.
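The filtering suggested above can be sketched in Python as follows. This is a minimal illustration, not a Statistics Canada procedure: it uses only the income-data thresholds quoted in the question (population under 250 or fewer than 40 private households), the function names are hypothetical, and it ignores the exception for place-of-work tabulations. A zero in an area below either threshold is recoded as missing; a zero in a larger area is left alone, since it may still be a rounded or suppressed cell that cannot be identified.

```python
POPULATION_THRESHOLD = 250  # income data are zeroed where population is below this
HOUSEHOLD_THRESHOLD = 40    # ... or where private households are below this
WEIGHTING_FACTOR = 5        # each sampled household carries a weight of about 5

# 40 weighted households correspond to at least 8 sampled households.
MIN_SAMPLED_HOUSEHOLDS = HOUSEHOLD_THRESHOLD // WEIGHTING_FACTOR


def is_area_suppressed(population, private_households):
    """True when the area falls under either income-data threshold."""
    return (
        population < POPULATION_THRESHOLD
        or private_households < HOUSEHOLD_THRESHOLD
    )


def clean_income_value(value, population, private_households):
    """Return None (missing) for a zero in a suppressed area, else the value."""
    if value == 0 and is_area_suppressed(population, private_households):
        return None
    return value


# Made-up CSD figures, for illustration only.
small_csd = clean_income_value(0, population=180, private_households=60)
few_households = clean_income_value(0, population=900, private_households=35)
large_csd_zero = clean_income_value(0, population=900, private_households=300)
large_csd_value = clean_income_value(52000, population=900, private_households=300)
```

Note that this is the reverse of the rule of thumb proposed in the question: zeros in areas under a threshold are the suppressed ones, while zeros in areas above both thresholds are the candidates for true zeros.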

PCCF+ Question: Improving Patient Access to Hospital Clinics

December 15, 2020

Question

I have a researcher using the PCCF+ to look at improving patient access to hospital clinics. They are calculating the distance a patient travels from home to clinic but are finding that each time they run the data they are getting different lat/longs. The trouble is that they have multiple cohorts and some patients could be in more than one cohort.

Because the lat/longs differ between runs, they sometimes find that one patient’s lat/long differs by as much as 36 km. Is there a way to account for this variance in results? Is there any way to limit the amount of difference between results for a person who is in multiple cohorts? I have included the researcher’s original question below:

----- Original question -----

[Note – ‘pts’ and ‘pt’ means patient]

We are using the software to calculate distance from a pts. postal code to the hospital, and several other satellite clinics, but I am finding each time I run the data, I can get a different latitude and longitude for the same postal code within the same data run and across various data runs.

To give you an example:

Run 1:

Pts attending a clinic appointment between 2014-2017 (a pt. could have attended multiple times, some pts. have the same postal code but live in a different location – especially in rural cases)

Run 2:

Pts. attending clinic between 2018-2019 (some of these pts. may also have been seen in the 2014-17 cohort)

Hence, run 2 will still likely contain pts. that also attended an appointment in run 1, but it is quite likely PCCF+ will assign a different lat and long. for that pt.

We were planning to run different types of analysis and a pt. could be included in more than one analysis, resulting in a possible different lat and long each time, which when then comparing cohorts would mean we are not always using the same lat and long for a patient.

I am trying to figure out the possible variance. To run all the 2014-2019 data in one go and then try to separate into the various different cohorts would be extremely time consuming as cross reference would need to be made back to my tracking sheets to figure out what year the pts. appointment was. Currently planning on 6 different runs, with overlapping pts. across the runs and multiple duplicate postal codes.

I have discovered so far for one postal code the difference between the 2 locations (driving) is as much as 36km.

Any help or thoughts you could provide on this would be great.

I was also wondering if PCCF+ allows for street addresses to be used in combination with postal codes, we may be too late this time round for that route, but it would be good to know for the future.

Answer

There could be a couple of reasons for these results:

One possible cause could be that there are, at times, multiple records for each postal code. For the most accurate results, I would confirm that the record being used is always the one where the single link indicator (SLI) is equal to one.

Additionally, the coordinates are based on the geography the postal code is geocoded to. You can check which geography a postal code is geocoded to with the variable Rep_Pt_Type. The majority of records are automatically geocoded to the block level, but there are others (mainly in rural areas) that are geocoded to the census subdivision (CSD).

Another possible reason for the results is that we regularly receive updated data from Canada Post, and we also make corrections to the data as we find errors. This may be why you see slight differences from year to year.

As for the last question, I would need to confirm with the team in Health Statistics responsible for the PCCF+ if they have any plans to include new variables into the file.
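One way to act on the SLI advice above, sketched in Python under stated assumptions: keep only records where the single link indicator equals one, build a single postal code lookup, and reuse that lookup for every cohort so a patient who appears in several cohorts always gets the same coordinates. This is an illustration rather than a method from the PCCF+ team; the column names (`postal_code`, `SLI`, `lat`, `lon`), the function names and the sample records are all hypothetical.

```python
def build_coordinate_lookup(records):
    """Map each postal code to one (lat, lon) pair, using only SLI == 1 records."""
    lookup = {}
    for record in records:
        if record["SLI"] != 1:
            continue
        # setdefault keeps the first SLI == 1 record seen for a postal code.
        lookup.setdefault(record["postal_code"], (record["lat"], record["lon"]))
    return lookup


def attach_coordinates(cohort, lookup):
    """Add coordinates to each patient row from the shared lookup (None if absent)."""
    return [
        {**patient, "coords": lookup.get(patient["postal_code"])}
        for patient in cohort
    ]


# Made-up records: postal code A1A1A1 has two candidate locations.
records = [
    {"postal_code": "A1A1A1", "SLI": 0, "lat": 45.10, "lon": -75.90},
    {"postal_code": "A1A1A1", "SLI": 1, "lat": 45.42, "lon": -75.70},
    {"postal_code": "B2B2B2", "SLI": 1, "lat": 44.23, "lon": -76.49},
]
lookup = build_coordinate_lookup(records)

cohort_2014_2017 = [
    {"patient_id": 1, "postal_code": "A1A1A1"},
    {"patient_id": 2, "postal_code": "B2B2B2"},
]
cohort_2018_2019 = [{"patient_id": 1, "postal_code": "A1A1A1"}]

run_1 = attach_coordinates(cohort_2014_2017, lookup)
run_2 = attach_coordinates(cohort_2018_2019, lookup)
```

Building the lookup once from the unique postal codes across all cohorts avoids cross-referencing appointment years after the fact, because each cohort is joined to the same table.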

From Queen's University Library: Exploring CIHI's Information Resources

November 25, 2020

Please check out Exploring CIHI’s Information Resources by Queen’s University’s Graeme Campbell (Government Information Librarian) and Alexandra Cooper (Data Services Coordinator).

Campbell and Cooper created the guide in response to reference questions they had about finding open data sources at CIHI and how to access data that is restricted. They contacted CIHI with a number of questions, and this guide is based on the responses that were received.

They note that it is not comprehensive, but it should be able to get users started in finding CIHI resources. For some reference questions, they referred users directly to CIHI; those users received the help they needed and occasionally were given tables for free that contained the data they needed. Cooper notes that the guide has a CC BY-NC license, so feel free to link to it or borrow information from it.

SFS Individual Files and Weights

November 20, 2020

Question

I have a researcher who is looking for the SFS PUMFs for 2016, 2012, and 2005. I was able to help him gain access to the files that are available in Nesstar and the EFT, the family level files, but he is looking for not only the bootstrap weight files for the 2012 and 2005 SFS (I found the 2016 bootstrap weights), but also the individual level files for all three years. Are those unavailable or am I just missing them?

Answer

Unfortunately, there are no bootstrap weights for the SFS PUMF datasets for the reference years 1999, 2005 and 2012. The first time bootstrap weights were created for the SFS PUMF was for the 2016 reference period. There are also no individual files (only family files) for these years.

Recent Surveys Similar to NLSCY?

October 30, 2020

Question

I have a researcher looking for information on childhood development and well-being. They have used the National Longitudinal Study of Children and Youth but they would like to know if there is anything that is similar but more recent. Is there anything else I can suggest they look at?

Answer

I’ve received the following information from subject matter (two separate responses below):

1. “Unfortunately, we do not have any longitudinal survey on children anymore, however, we have a brand new survey that was released in July entitled: Canadian Health Survey on Children and Youth that can be accessed by selecting this table. There is not a lot of data in it because it is brand new, but that is the best data we can provide to assist the client.”

2. “This would really depend on a number of factors for your client/researcher… For a longitudinal survey, unfortunately this was our more recent initiative that focused on children and their development/well-being.

The only other initiative within our division that touched on these topics would be: Ontario Child Health Study. But its focus was primarily on Health and was limited to Ontario children…

Additionally, we did conduct a RapidStats survey in 2019 called Survey on Early Learning and Child Care Arrangements, but this initiative was more focused on Child Care Arrangements…”

CPSS-COVID - Information Sources Consulted During the Pandemic, 2020 PUMF

November 2, 2020

We are pleased to inform you that the following product is now available:

Canadian Perspectives Survey Series 4 - Information Sources Consulted During the Pandemic (CPSS-COVID), 2020 PUMF

In order to implement this survey rapidly, it will be conducted online only, among those who volunteered to participate in the Canadian Perspectives Survey Series (CPSS). Each survey in the series will take place approximately every month, with collection lasting approximately a week. Each respondent will participate in several short online surveys over the period of about a year. The CPSS is designed to produce data at a national level (excluding the territories).

Initially, the CPSS focused on topics related to the impacts of the COVID-19 pandemic on Canadians. Other topics will also be added to the series to meet the emerging data needs of a variety of users.

Information collected may be used by government organizations at all levels, as well as other types of organizations, to inform the delivery of services and support to Canadians, during and after the pandemic and to inform policy on a wide variety of other social and economic issues.

EFT: /MAD_PUMF_FMGD_DAM/Root/5311_CPSS-SEPC/Series 4

PCCF+ and Socioeconomic Status Reference File

October 27, 2020

Question

I have a question from a Grad student using the PCCF+ and a file called the socioeconomic status reference file (.txt format) that should be available with the PCCF+ files.

The student found reference to this file in CIHI Measuring Health Inequalities. From the student’s email:

In the CIHI: Measuring Health Inequalities Toolkit, it says it is possible to avoid using SAS and just get this info from the PCCF. It says the following: To assign any income quintiles using PCCF, you would assign a DA to individuals in your dataset using the PCCF, and the socioeconomic status reference file (.txt format) available in the PCCF+ package to assign (the desired) income quintiles to DAs. This seems best for us because we already have a list of every DA and the stores assigned.

I wasn’t able to find this file in the PCCF+ v7c files. Is this file available to us or is it something that only CIHI has access to?

Answer

I’ve spoken to the PCCF team and they’ve explained the following:

“I believe they may be naming it incorrectly – it’s not called a ‘socioeconomic status reference file’. We do provide area-based income quintiles as part of the standard output, if the user puts in their postal codes and uses the PCCF+ as suggested. There is information about that if they check the user guide.

They don’t need to first assign DAs to individuals. The PCCF+ will assign the DA and provide the income quintile at the same time.”

Family and Intimate Partner Violence

October 21, 2020

Question

I'm starting to get a number of requests for data on family violence and intimate partner violence--students are wanting to analyze upticks in violence with specific events.

Aside from the UCR, where else can I obtain data that covers a significant span of years (more than 10) and is current beyond 2015?

Answer

At the moment, the only items we could track down are the following 2017, 2018 and same-sex IPV reports:

https://www150.statcan.gc.ca/n1/pub/85-002-x/2018001/article/54978/02-eng.htm

https://www150.statcan.gc.ca/n1/pub/85-002-x/2019001/article/00018/02-eng.htm

https://www150.statcan.gc.ca/n1/pub/85-002-x/2019001/article/00005-eng.htm

Subject matter is working on updating the 2019 UCR data as well.

Subsequent Question

Thanks! I had seen those. We're looking for the number of incidents of assault that could be described as domestic violence, in aggregate, on a daily basis. When the UCR is updated, will it be more detailed? And when do you think this will be available?

Answer

From what I understand, the UCR will be more detailed. It sounds as if they are currently working on the file so I suspect it’ll be a matter of months until release (of course with the way things are going this year that timeline could change!).