Monday, June 7, 2021

From UBC Library: New Data Manifest Creation Tool

May 21, 2021

I'd like to announce a simple new tool called "damage" that UBC Library has created, which we hope will help data professionals and researchers keep track of their data sets. It's a small command-line file utility that produces file manifests in a variety of formats: plain text, CSV and JSON. While it's intended for use with data, it works on any kind of file.

For plain text files, often used for microdata, the utility produces information on:

  • Minimum line length
  • Maximum line length
  • Number of records
  • Constant records flag (i.e., all lines are of the same length)
  • Row and column of non-ASCII characters
  • Flag for DOS/Windows formatting (i.e., carriage return + line feed as opposed to just a line feed)
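
For readers curious how checks like these can be computed, here is a minimal Python sketch. It is not the damage source code or the fcheck API: the function name text_stats and the dictionary keys are made up for illustration, and line lengths and column positions are counted in bytes, which is an assumption about how such a tool might report them.

```python
def text_stats(data: bytes) -> dict:
    """Summarize the raw bytes of a plain-text file (illustrative only)."""
    # splitlines() on bytes breaks on \n, \r and \r\n and drops the terminators.
    lines = data.splitlines()
    lengths = [len(line) for line in lines]

    # 1-based (row, column) of every byte outside the 7-bit ASCII range.
    non_ascii = [
        (row, col)
        for row, line in enumerate(lines, start=1)
        for col, byte in enumerate(line, start=1)
        if byte > 127
    ]

    return {
        "records": len(lines),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        # True when every line has the same length (or the file is empty).
        "constant": len(set(lengths)) <= 1,
        "non_ascii": non_ascii,
        # True when at least one carriage return + line feed pair is present.
        "dos": b"\r\n" in data,
    }
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example text_stats(open("setup.py", "rb").read()).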

For files in SAS, SPSS and Stata formats (i.e., .sas7bdat, .sav and .dta), the utility will provide information on:

  • Number of cases (reported as rows)
  • Number of variables (reported as columns)

It's pretty simple to use. For example, to check one file, called setup.py:

>damage setup.py
setup.py
md5 checksum : e51af6d52cffdb9b355b04267bf700eb
Encoding: utf-8
Number of records: 45
Minimum line length 16, maximum line length 81, variable records

It can also traverse an entire directory tree and analyze all the files at once. I've included output from the recently released Canadian Perspectives Survey Series (CPSS) 6.
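
To give a rough idea of what walking a directory tree and checksumming every file involves, here is a short sketch using only the Python standard library. It does not show how damage is implemented; the function names md5_of and walk_manifest are hypothetical, and the result is a plain dictionary rather than any of the tool's real output formats.

```python
import hashlib
import os


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_manifest(root):
    """Map each file's path (relative to root) to its MD5 checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = md5_of(full_path)
    return manifest
```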

The manifest program could prove very useful to us or to researchers because by running it, you can see at a glance whether a SAS, SPSS or Stata file has the correct number of cases and variables, without having to open SAS, Stata or SPSS.

My hope is that a manifest like this could be included with any new additions to the DLI FTP site (or maybe even for current ones). That way, when we DLI members download the material, we can quickly run the utility and verify that what we downloaded is correct. Arguably more important, if there are any file changes, like a new PDF or syntax file, the checksums allow fast and easy comparisons between an old version and a new one.
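
The old-versus-new comparison can be sketched in a few lines of Python. This assumes each manifest has already been loaded into a dictionary that maps a file path to its checksum; that structure and the function name compare_manifests are illustrative choices, not part of damage or fcheck.

```python
def compare_manifests(old, new):
    """Report which files were added, removed, or changed between two manifests.

    Both arguments map a file path to its checksum string.
    """
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        # Present in both manifests but with a different checksum.
        "changed": sorted(
            path for path in old_paths & new_paths if old[path] != new[path]
        ),
    }
```

A file that keeps its name but gets new contents, such as a revised PDF, would show up under "changed".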

The utility could be useful for researchers as well. Using the CSV output, a researcher could produce a largely complete manifest which accompanies any data. They would undoubtedly need to fill in a few details, but that's still much easier than starting from nothing.
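
As one possible illustration of that workflow, the sketch below writes manifest rows to CSV and leaves an empty column for the researcher to fill in. The column names (filename, md5, description) and the function name manifest_to_csv are assumptions made for the example; damage's actual CSV layout may well differ.

```python
import csv
import io


def manifest_to_csv(rows, extra_columns=("description",)):
    """Render manifest rows as CSV text, adding blank columns to fill in by hand.

    Each row is a dict with "filename" and "md5" keys (hypothetical layout).
    """
    fields = ["filename", "md5", *extra_columns]
    buffer = io.StringIO()
    # restval="" leaves the extra columns empty for every row.
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```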

Premade binaries for Windows and Mac are available here.

Source code, documentation and a Python library (called fcheck) that you can use in other projects are available here.

To show you how it would look in practice, I've attached sample outputs I made from a recent release of Canadian Perspectives Survey Series 6:


Any questions or concerns regarding this tool can be directed to UBC's Koerner Library.
