Monday, June 7, 2021

From UBC Library: New Data Manifest Creation Tool

May 21, 2021

From UBC Library
I'd like to announce a simple new tool called "damage", created by UBC Library, which will hopefully help data professionals and researchers keep track of their data sets. It's a very simple file utility that produces file manifests. While it's intended for use with data, it can be used for any kind of file. It's a command line program that outputs a manifest in one of three formats: plain text, CSV or JSON.

For plain text files, often used for microdata, the utility produces information on:

  • Minimum line length
  • Maximum line length
  • Number of records
  • Constant records flag (i.e., all lines are of the same length)
  • Row and column of non-ASCII characters
  • Flag for DOS/Windows formatting (i.e., carriage return + line feed as opposed to just a line feed).
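
To make the items above concrete, here is a minimal Python sketch of how such line-level statistics could be computed. It is not the code used by damage, and the function name and dictionary keys are made up for illustration; it only assumes the file is UTF-8 text that fits in memory.

```python
def describe_text_file(raw: bytes) -> dict:
    """Summarize line-level properties of a plain text file (hypothetical helper)."""
    # Assumption: the file is UTF-8; undecodable bytes become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # DOS/Windows files end lines with carriage return + line feed.
    dos = b"\r\n" in raw
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    # Strip the carriage return so it does not count toward line length.
    lines = [line[:-1] if line.endswith("\r") else line for line in lines]
    lengths = [len(line) for line in lines]
    # Row and column (both 1-based) of every non-ASCII character.
    non_ascii = [
        (row, col)
        for row, line in enumerate(lines, start=1)
        for col, ch in enumerate(line, start=1)
        if ord(ch) > 127
    ]
    return {
        "records": len(lines),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        "constant": len(set(lengths)) <= 1,  # all lines the same length
        "non_ascii": non_ascii,
        "dos": dos,
    }


# Usage: pass the raw bytes of a file, for example
# describe_text_file(open("microdata.txt", "rb").read())
```

For fixed-width microdata, a false constant records flag or an unexpected non-ASCII position is usually the first sign that a file was damaged or re-encoded in transit.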

For files in SAS, SPSS and Stata formats (i.e., .sas7bdat, .sav and .dta) the utility will provide information on:

  • Number of cases (reported as rows)
  • Number of variables (reported as columns)

It's pretty simple to use. For example, to check one file, called setup.py:

>damage setup.py
setup.py
md5 checksum : e51af6d52cffdb9b355b04267bf700eb
Encoding: utf-8
Number of records: 45
Minimum line length 16, maximum line length 81, variable records

It can traverse an entire directory tree, though, and analyze all the files at once. I've included output from the recently released CPSS 6.

The manifest program could prove very useful to us or to researchers because by running it, you can see at a glance if a SAS, SPSS or Stata file has the correct number of records and/or cases, without having to open SAS, Stata or SPSS.

My hope is that a manifest like this could be included with any new additions to the DLI FTP site (or maybe even for current ones). That way, when we DLI members download the material, we can quickly run the utility and verify that what we downloaded is correct. Arguably more importantly, if there are any file changes, like a new PDF or syntax file, the checksums allow fast and easy comparisons between an old version and a new one.
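
As a rough illustration of that kind of comparison, the sketch below builds an MD5 manifest for a directory tree and reports what differs between two manifests. It uses only the Python standard library, not damage or fcheck, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 checksum of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its MD5 checksum."""
    return {
        path.relative_to(root).as_posix(): md5_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Usage, with two hypothetical local copies of a release:
# diff = compare_manifests(md5_manifest(Path("cpss6_old")), md5_manifest(Path("cpss6_new")))
```

A file listed under "changed" has the same name but different contents, which is exactly the case that is hard to spot by eye.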

The utility could be useful for researchers as well. Using the CSV output, a researcher could produce a largely complete manifest which accompanies any data. They would undoubtedly need to fill in a few details, but that's still much easier than starting from nothing.

Premade binaries for Windows and Mac are available here.

Source code, documentation and a Python library (called fcheck) that you can use in other projects are available here.

To show you how it would look in practice, I've attached sample outputs I made from a recent release of Canadian Perspectives Survey Series 6:


Any questions or concerns regarding this tool can be directed to UBC's Koerner Library.

Friday, June 4, 2021

Older Census Data - EA Level and Reference Maps

June 29, 2020



Question

I have a research team looking at sub-municipal population trends for Prince George, BC/Regional District of Fraser-Fort George, hopefully going back to the 1940s. We get back to 1981 just fine, but prior to that I am running into a wall looking for EA-level data/statistics and corresponding reference maps. Do these exist somewhere? I have found the EA-level files available from U of T, and I swear I’ve looked at the documentation, but I can’t find a map of Enumeration Areas. If anyone could point me in the right direction I would appreciate it.



Answer

The library compiled information, available here.


I searched the Library’s collection of published census reports from 1941 to 1976, and I found limited information on enumeration areas. That said, many of the instructions for enumerators mentioned that maps had been distributed to the enumerators; some of the census reports included lists of EA codes; and I found EA-level data from other resources online. I feel all this suggests that EA-level data might exist for 1976, 1971, 1966, 1961, and possibly 1956. Have you tried contacting demography or DLI to see if they might have it?


While I could not find maps in StatCan reports showing EA boundaries, I found records of EA maps at Library and Archives, as well as the University of Toronto. Furthermore, there are a couple of maps on Scholars GeoPortal that contain EA-level data; these maps do not seem to delineate the boundaries of the EAs, but if you zoom in close enough, you can see numbered blocks on the maps. From the datasets, could a user figure out which blocks are included in each EA? I’m not familiar enough with GIS, but maybe Geo Help would know? (Links to Scholars GeoPortal are in the attached document.)


In general, I think that most of the information held at StatCan covers 1971 to the present; however, it might still be worth checking with demography.


Geography had this to say about 1976 and 1971 data when I reached out to see what was available.

Digitally, our data in fichiersGEOfiles go back to 1971. Each folder has the Geographic Attribute File (GAF), which contains census geographic information at the Enumeration Area level for all of Canada. It may not have everything your client is looking for, but each record includes population and dwelling counts, land area, names, unique identifiers, and geographic codes for linkages with other census boundaries. Unfortunately, we do not have access to our own library of historical documents at the moment for information prior to 1971.

Friday, May 21, 2021

Human Trafficking Data

February 10, 2021


Question

The Daily announced today that "Preliminary national estimates on police-reported human trafficking incidents, 2020" was released, and that is all it said. There was no indication of how to access the data (are they tables, or in RTRA or the RDCs?) or whether it is part of the UCR.


I looked up the UCR tables and can’t find any 2020 data related to human trafficking. So what has been released?




Answer

What The Daily released yesterday was a data availability announcement only. This means that there is no analytical report or any associated CODR tables.

 

The data is, however, available upon request from the CCJCSS.


*UPDATE | FEBRUARY 11, 2021* The table is attached here. It’s not much at all; however, subject matter needs to announce every release (no matter how small!) in The Daily.

Census 2006 Data Question

January 6, 2021


Question

Can someone please help with answering the following question from a patron?


I've pulled data for each CSD from the 2006 census using Beyond 20/20, and my advisor wanted me to ask, based on the info from StatsCan that I've copied below: how can I know if zeroes in my dataset represent data suppression or are true zeroes? He is thinking that those that were suppressed will likely need to be treated as missing for my analyses, since they aren't true zeroes, but I'm not sure how to accurately differentiate these. He's thinking that I can likely do this by looking at CSD population size and the number of total private households, and if neither of these thresholds is exceeded (as outlined below) then the zero is likely a true zero (likely not many of these), otherwise I can replace the other zeroes as missing values. Does that make sense?


Census Info | Area suppression for income characteristic data:

Area suppression, when applied for data quality purposes, is used to replace all income characteristic data with zeroes for geographic areas with populations and/or number of households below a specific threshold. 


If a census tabulation contains any data showing income characteristics for individuals, families or households, then the following rule applies. Income characteristic data are zeroed out for areas where the population is less than 250 or where the number of private households is less than 40. These thresholds are applied to 2006 Census data as well as all previous census data. The threshold of 40 private households is based upon the fact that weighted data are being used. With the weighting factor for each household being 5, setting a threshold of 40 ensures that there will be at least 8 households used in the calculation. The private household threshold does not apply for tabulations based on place of work geographies. 


This seems to be what was happening in my data, as some variables for a single CSD have ‘.’ and others zeroes. Those with zeroes typically seem to relate to income, proportion of household spend, etc.




Answer

Statistics Canada places the highest priority on maintaining the privacy and confidentiality of respondents. If necessary, data are suppressed to prevent direct or residual disclosure of identifiable data. Because of this and the data quality measures in place, your client will not be able to distinguish between all “true zeros” and these suppression zeros. Area suppression is one type of suppression, which involves removing all characteristic data for geographic areas with populations below a specified size. Having counts based on the geography should let them filter most of those out. The specified population sizes are:

  • 250 people, if the table contains income data, and if the table also contains place-of-residence data, at least 40 private households
  • 100 people, if it is a six-character postal code area, that is, a local delivery unit (LDU), or if it is a custom area
  • 40 people, in all other cases.


With regard to your client’s question on individual cell suppression, please see the following paragraph from Chapter One of the 2006 Overview of the Census:

  • Dissemination rules for statistics - Tables are sometimes accompanied by statistics such as averages, totals and standard deviations. There are various ways of ensuring that these statistics do not reveal sensitive information; for instance, they may be suppressed or made less precise. Some statistics, such as totals, ratios and percentages, are based on the rounded values in the tables to which they apply. A statistic will be suppressed if there are too few data to compute it. In cases of data items expressed in dollars, if the statistic must be calculated from data where the values are too close or if a value is too high compared to the others, then the statistic will be suppressed.

Depending on the income source variable, income medians and averages are almost never a true zero. In most cases, a zero indicates suppression. As for counts that have been rounded to zero, that is a feature of the confidentiality system, and you cannot distinguish those rounded down from the true zeros.
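
As a sketch of the filtering suggested above, the Python below treats an income zero as missing when the area falls under the thresholds quoted in the question (population under 250 or fewer than 40 private households). The function names are hypothetical, and the rule only covers area suppression: a zero in a larger area is left unchanged even though it may still come from rounding or cell suppression.

```python
# Thresholds taken from the census note quoted in the question above.
POPULATION_THRESHOLD = 250
HOUSEHOLD_THRESHOLD = 40


def is_area_suppressed(population: int, private_households: int) -> bool:
    """True if income data for this area would have been zeroed out."""
    return (
        population < POPULATION_THRESHOLD
        or private_households < HOUSEHOLD_THRESHOLD
    )


def clean_income_value(value, population: int, private_households: int):
    """Return None (missing) for a zero in an area below the thresholds.

    Zeros in areas above the thresholds are returned unchanged; they
    cannot be told apart from rounded or cell-suppressed values here.
    """
    if value == 0 and is_area_suppressed(population, private_households):
        return None
    return value


# Usage, one CSD at a time:
# clean_income_value(0, population=180, private_households=60)
```

The household test does not apply to tabulations based on place of work geographies, so for those tables only the population check would be relevant.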

Thursday, May 20, 2021

Impact of Covid-19 on K-12 Education

July 7, 2020


Question

I have a researcher looking for data on the impact of COVID-19 on K-12 students' education. If possible, she'd also like to see comparisons based on race/ethnicity or socioeconomic status. Does anyone have suggestions? Please advise. Thanks!



Answer

There are a few articles on children, schooling and COVID-19 on the Data to Insights for a Better Canada page. They cover the online preparedness of children, academic and financial impacts on postsecondary students, and impacts on the work placements of postsecondary students. This may not be exactly what the researcher is looking for, but it may be a start.

LFS Supplementary Indicators and Visible Minorities

February 8, 2021

To support the analysis and interpretation of January 2021 LFS results, see attached links to:

 

 

This data is publicly available under the Statistics Canada Open Licence

Friday, May 14, 2021

Immigration of Catholic Priests in Canada

April 6, 2021



Question
I’m currently working on a research piece about the immigration of Catholic priests to Canada due to a decline in local priests and growing secularization in the country.


I am looking for the statistics and numbers on the immigration of religious workers in Canada since the 1990s, and more specifically Catholic priests. I would like to find the following information:

  • Number of priests that immigrated to Canada as "religious workers" each year since 1990
  • Where these priests were from
  • Which province these priests migrated to or at least which parish sent a letter to hire them


If the type of religious worker (e.g., priest/rabbi/imam) is not tracked, and only the number of religious workers as a whole is tracked, I would still like the stats on the number of religious workers that immigrated to Canada since 1990 and what province they went to.



Answer

I found some background from IRCC on the legal side of this:

Religious work – International Mobility Program https://www.canada.ca/en/immigration-refugees-citizenship/corporate/publications-manuals/operational-bulletins-manuals/temporary-residents/foreign-workers/work-without-permit/authorization-work-without-work-permit-clergy.html

 

There are two separate provisions in the Immigration and Refugee Protection Regulations (IRPR) relating to religious work:

 

  • paragraph R186(l) provides a work permit exemption for religious leaders
  • paragraph R205(d) provides a labour market impact assessment (LMIA) exemption (code C50)

 

So I looked on the Open Data Portal for data on the International Mobility Program and found the following:

 

Temporary Residents: Work Permit Holders – Ad Hoc IRCC (Specialized Datasets)

Temporary residents who are in Canada on a work permit in the observed calendar year. Datasets include Temporary Foreign Worker Program (TFWP) and International Mobility Program (IMP) work permit holders by year in which permit(s) became effective. Please note that the datasets will not be updated.

https://open.canada.ca/data/en/dataset/67fd1fae-4950-4018-a491-62e60cbd6974

 

Specialized Research Datasets: Temporary Resident – Ad Hoc IRCC (Specialized Datasets)

https://open.canada.ca/data/en/dataset/31ef4cab-d2b3-4dba-8e91-48fe64211ec5

 

But I can’t validate that the data contains the specific work permit of interest because when I click on the Access button, nothing happens on either dataset. Perhaps IRCC would have the data and could make it available upon request?