DIGRS - Data Interest Group for Reference Services: June 2021

Monday, June 7, 2021

From UBC Library: New Data Manifest Creation Tool

May 21, 2021

From UBC Library
I'd like to announce a simple new tool called "damage" that UBC Library has created, which will hopefully help data professionals and researchers keep track of their data sets. It's a very simple command line file utility that produces file manifests in plain text, CSV or JSON. While it's intended for use with data, it can be used on any kind of file.

For plain text files, often used for microdata, the utility produces information on:

Minimum line length
Maximum line length
Number of records
Constant records flag (i.e., all lines are of the same length)
Row and column of non-ASCII characters
Flag for DOS/Windows formatting (i.e., carriage return + line feed as opposed to just a line feed)
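
To make the checks in this list concrete, here is a minimal Python sketch of how they could be computed. It is not damage's own code and does not use the fcheck library; the function name text_file_stats and the dictionary keys are made up for illustration. It assumes lengths are counted in bytes, so a multi-byte UTF-8 character is reported once per byte.

```python
def text_file_stats(data: bytes) -> dict:
    """Summarise a plain text file from its raw bytes (illustrative only)."""
    # Split on line feeds only, so a trailing carriage return stays visible.
    lines = data.split(b"\n")
    # A final newline leaves one empty trailing element; it is not a record.
    if lines and lines[-1] == b"":
        lines.pop()
    # DOS/Windows formatting: at least one line ends with a carriage return.
    dos = any(line.endswith(b"\r") for line in lines)
    # Drop the carriage return before measuring line lengths.
    records = [line[:-1] if line.endswith(b"\r") else line for line in lines]
    lengths = [len(rec) for rec in records]
    # 1-based (row, column) of every byte outside the ASCII range.
    non_ascii = [
        (row, col)
        for row, rec in enumerate(records, start=1)
        for col, byte in enumerate(rec, start=1)
        if byte > 127
    ]
    return {
        "records": len(records),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        "constant": len(set(lengths)) <= 1,
        "dos": dos,
        "non_ascii": non_ascii,
    }


# Example: read a file in binary mode and print its summary.
# with open("setup.py", "rb") as handle:
#     print(text_file_stats(handle.read()))
```

Working on raw bytes keeps the carriage returns intact; opening the file in text mode would normally translate them away before they could be detected.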

For files in SAS, SPSS and Stata formats (i.e., .sas7bdat, .sav and .dta) the utility will provide information on:

Number of cases (reported as rows)
Number of variables (reported as columns)

It's pretty simple to use. For example, to check one file, called setup.py:

>damage setup.py
setup.py
md5 checksum : e51af6d52cffdb9b355b04267bf700eb
Encoding: utf-8
Number of records: 45
Minimum line length 16, maximum line length 81, variable records

It can traverse an entire directory tree, though, and analyze all the files at once. I've included output from the recently released CPSS 6.
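
For readers curious about what walking a directory tree and checksumming every file involves, the following is a rough sketch using only the Python standard library. It is an illustration of the general idea, not the tool's actual implementation; the names md5_of_file and build_manifest are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the md5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative path: md5 hex digest} for every file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root)
            # Use forward slashes so manifests compare equal across platforms.
            manifest[relative.replace(os.sep, "/")] = md5_of_file(full_path)
    return manifest


# Example: print one line per file in the current directory tree.
# for path, checksum in sorted(build_manifest(".").items()):
#     print(checksum, path)
```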

The manifest program could prove very useful to us or to researchers because by running it, you can see at a glance if a SAS, SPSS or Stata file has the correct number of cases and variables, without having to open SAS, Stata or SPSS.

My hope is that a manifest like this could be included with any new additions to the DLI FTP site (or maybe even for current ones). That way when we DLI members download the material, we can quickly run the utility and verify that what we downloaded is correct. Arguably more importantly, if there are any file changes, like a new PDF or syntax, the checksums allow fast and easy comparisons between an old version and a new one.
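
As a sketch of the kind of comparison checksums make possible, the snippet below assumes two manifests have already been loaded into Python dictionaries mapping file path to checksum (for example, from the JSON or CSV output). The function name compare_manifests, the file names and the checksum values are all invented for illustration.

```python
def compare_manifests(old, new):
    """Compare two {path: checksum} mappings and report what differs."""
    old_paths, new_paths = set(old), set(new)
    return {
        # Files present only in the new release.
        "added": sorted(new_paths - old_paths),
        # Files present only in the old release.
        "removed": sorted(old_paths - new_paths),
        # Files in both releases whose checksums differ.
        "changed": sorted(
            path for path in old_paths & new_paths if old[path] != new[path]
        ),
    }


# Hypothetical manifests; the paths and checksums are placeholders.
old = {"data.sav": "aaa111", "guide.pdf": "bbb222", "syntax.sps": "ccc333"}
new = {"data.sav": "aaa111", "guide.pdf": "ddd444", "readme.txt": "eee555"}

diff = compare_manifests(old, new)
print(diff)
```

A file that was revised without being renamed, such as an updated user guide, shows up under "changed" even though its name gives no hint that anything is different.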

The utility could be useful for researchers as well. Using the CSV output, a researcher could produce a largely complete manifest which accompanies any data. They would undoubtedly need to fill in a few details, but that's still much easier than starting from nothing.

Premade binaries for Windows and Mac are available here.

Source code, documentation and a Python library (called fcheck) that you can use in other projects are available here.

To show you how it would look in practice, I've attached sample outputs I made from a recent release of Canadian Perspectives Survey Series 6:

example of damage_csv_output.csv via Google Sheets
example of damage_output_.json via Google Docs
example of damage_text_output.txt via Google Docs

Any questions or concerns regarding this tool can be directed to UBC's Koerner Library.

Friday, June 4, 2021

Older Census Data - EA Level and Reference Maps

June 29, 2020

Question

I have a research team looking at sub-municipal population trends for Prince George, BC/Regional District of Fraser-Fort George, hopefully going back to the 1940s. We get back to 1981 just fine, but prior to that I am running into a wall looking for EA-level data/statistics and corresponding reference maps. Do these exist somewhere? I have found the EA-level files available from U of T, and I swear I’ve looked at the documentation but I can’t find a map of Enumeration Areas. If anyone could point me in the right direction I would appreciate it.

Answer

The library compiled information, available here.

I searched the Library’s collection of published census reports from 1941 to 1976, and I found limited information on enumeration areas. That said, many of the instructions for enumerators mentioned that maps had been distributed to the enumerators; some of the census reports included lists of EA codes; and I found EA-level data from other resources online. I feel all this suggests that EA-level data might exist for 1976, 1971, 1966, 1961, and possibly 1956. Have you tried contacting demography or DLI to see if they might have it?

While I could not find maps in StatCan reports showing EA boundaries, I found records of EA maps at Library and Archives, as well as the University of Toronto. Furthermore, there are a couple of maps on Scholars GeoPortal that contain EA-level data; these maps do not seem to delineate the boundaries of the EAs, but if you zoom in close enough, you can see numbered blocks on the maps. From the datasets, could a user figure out which blocks are included in each EA? I’m not familiar enough with GIS, but maybe Geo Help would know? (Links to Scholars GeoPortal are in the attached document.)

In general, I think most of the information held at StatCan runs from 1971 to the present; however, it might still be worth checking with demography.

Geography had this to say about 1976 and 1971 data when I reached out to see what was available.

Digitally, our data in fichiersGEOfiles go back to 1971. Each folder does have the Geographic Attribute File (GAF) that contains census geographic information at the Enumeration Area level for all of Canada. It may not have everything your client is looking for, but each record includes population and dwelling counts, land area, names, unique identifiers, and geographic codes for linkages with other census boundaries. Unfortunately, we do not have access to our own library of historical documents at the moment for information prior to 1971.