Normalize Crosstabbed Data in a CSV File
un-xtab.py

un-xtab.py is a Python module and command-line program that rearranges data from a crosstabulated format to a normalized format. It takes data in this form:

Station2006-05-232006-06-152006-07-19
WQ-014.53.76.8
WQ-029.75.17.2
WQ-03106.18.8

and rearranges it into this form:

StationDateValue
WQ-012006-05-234.5
WQ-022006-05-233.7
WQ-032006-05-236.8
WQ-012006-06-159.7
WQ-022006-05-155.1
WQ-032006-06-157.2
WQ-012006-07-1910
WQ-022006-07-196.1
WQ-032006-07-198.8

Input and output are both text (CSV) files.

You can use the un-xtab program to rearrange data that have been provided in a format designed for readability into a format that is more suitable for storage in a database, or for use with statistical, modeling, graphics, or other software.

The un-xtab program can deal with crosstabbed formats that include multiple rows of column headers and groups of data values that were crosstabbed together. For example, un-xtab can convert something like this:

Laboratory Name
Project Name
Arsenic Cadmium Copper
mg/kg mg/kg mg/kg
dry dry dry
Client sample ID Lab sample ID Preparation date Analysis date Concentration Qualifier Concentration Qualifier Concentration Qualifier
SD00231 45298-01 2013-06-23 2013-06-24 0.11 U 112 251
SD00232 45298-02 2013-06-23 2013-06-24 8.21 582 396
SD00233 45298-03 2013-06-23 2013-06-24 0.11 U 45.1 2.4 U
SD00235 45298-04 2013-06-23 2013-06-24 0.11 U 69 2.4 U
SD00236 45298-05 2013-06-23 2013-06-24 4.53 98.9 85.8

to this:

Client sample IDLab sample IDPreparation dateAnalysis dateAnalyteUnitsMeasurement basisConcentrationQualifier
SD0023145298-012013-06-232013-06-24Arsenicmg/kgdry0.11U
SD0023145298-012013-06-232013-06-24Cadmiummg/kgdry112
SD0023145298-012013-06-232013-06-24Coppermg/kgdry251
SD0023245298-022013-06-232013-06-24Arsenicmg/kgdry8.21
SD0023245298-022013-06-232013-06-24Cadmiummg/kgdry582
SD0023245298-022013-06-232013-06-24Coppermg/kgdry396
SD0023345298-032013-06-232013-06-24Arsenicmg/kgdry0.11U
SD0023345298-032013-06-232013-06-24Cadmiummg/kgdry45.1
SD0023345298-032013-06-232013-06-24Coppermg/kgdry2.4U
SD0023545298-042013-06-232013-06-24Arsenicmg/kgdry0.11U
SD0023545298-042013-06-232013-06-24Cadmiummg/kgdry69
SD0023545298-042013-06-232013-06-24Coppermg/kgdry2.4U
SD0023645298-052013-06-232013-06-24Arsenicmg/kgdry4.53
SD0023645298-052013-06-232013-06-24Cadmiummg/kgdry98.9
SD0023645298-052013-06-232013-06-24Coppermg/kgdry85.8

Syntax and Options

un-xtab.py [options] input_file_name output_file_name Arguments: Input file name The name of a text (CSV) file with crosstabbed data. Output file name The name of a text (CSV) to create with normalized data. Options: --version show program's version number and exit -h, --help show this help message and exit -c CONFIGFILE, --configfile=CONFIGFILE The name of the config file, with path if necessary. The default is to look for a configuration file with the same name as the input file, but with an extension of cfg, in the same directory as the input file. -d, --displayspecs Print the format specifications allowed in the configuration file, then exit. -e ENCODING, --encoding=ENCODING Character encoding of the CSV file. It should be one of the strings listed at http://docs.python.org/library/ codecs.html#standard-encodings. -n ROWSEQ, --number_rows=ROWSEQ Add a sequential number to each output row, with a column header of ROWSEQ. -o, --outputheaders Print the output column headers, then exit. -p, --printconfig Pretty-print the configuration data after reading the configuration file, then exit. -s SPECNAME, --specname=SPECNAME The name of the section to use in the configuration file. The default is to use the name of the input data file, without its extension.

Configuration Parameters

Because of the variety of layouts of crosstabbed data, the information necessary to describe each crostabbed file can be complex. For this reason, and also to facilitate documentation of the way in which a file was handled, the un-crosstabbing information is read from a configuration file rather than passed to the program on the command line.

The configuration file may contain several sections, where each section contains cofiguration data for a separate data set to be normalized. For example, each section in a configuration file could pertain to a different worksheet in an Excel workbook of crosstabbed data.

Each section is identified by a unique name. The name must be enclosed in square brackets, e.g.:.

[section_name]

The configuration data in each section consists of a set of pairs of keywords and values. The keywords and values must be separated by a colon or an equal sign, e.g.:

key:value

or

key=value

Each configuration option must be on its own line. Comment lines can be included in the configuration file; these must begin with a hash symbol (#). The configuration keywords and their meanings are as folllows.

data_columns
A list of the columns in the input file containing crosstabbed data that are to be normalized.
data_rows
A list of the rows containing data to be normalized.
row_headers
A list of the columns to the left (or right) of the crosstabbed data columns that are to be preserved in the normalized output.
column_header_rows
A list of the rows containing column header values that are to be used in the normalized output.
column_group_count
The number of columns (they must be contiguous) containing related data values that should appear on the same row of the normalized output.
column_header_label_#
The header text (label) for one of the column_group_count output columns. The value consists of either two digits or a string. If the value is two digits, these refer to a cell in the matrix of column header cells (column_header_rows by column_group_count). The digits are the row number and the column number, in that order. If the value is a string, the string is used directly. There should be exactly column_header_count column_header_label keywords, and each keyword should include a numeric suffix to make it unique. The numeric suffixes should range from 1 to column_group_count.
header_as_column_#
A multi-part value that identifies a cell in the matrix of (column_header_rows by column_group_count) cells that contains a value that is to be propagated into the normalized output as a new column. The multi-part value consists of two digits and a string. The two digits identify the cell in the column header matrix that is to be used as a data value. The digits are the row number and the column number, in that order. The string is used as the header for that column.
row_headers_row
The row on which the headers for the row_headers columns appear. This configuration parameter is optional. If omitted, the maximum of column_header_rows will be used. If specified, the value must be one of the column_header_rows.
nd_values
A list of strings that represent missing values in cells of the crosstabbed data table. If all column_header_count values of a set are missing, a row will not be written to the output file for this set of values.

List values should be separated by commas. Numeric lists may include ranges, consisting of two integers separated by a dash. All configuration parameters are required except for nd_values. The meaning of several of these configuration parameters is illustrated in the following figure.

Illustration of configuration parameters

An example set of configuration parameters is shown below. This are the parameters that were used to normalize the chemical data set shown above.

[concentration_data] data_columns=5-10 data_rows=7-11 row_headers=1-4 column_header_rows=3-6 column_group_count=2 column_header_label_1=Concentration column_header_label_2=Qualifier header_as_column_1=1,1,Analyte header_as_column_2=2,1,Units header_as_column_3=3,1,Measurement basis

Copyright and License

Copyright (c) 2014, 2016 R.Dreas Nielsen

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.