un-xtab.py is a Python module and command-line program that rearranges data from a crosstabulated format to a normalized format. It takes data in this form:
Station | 2006-05-23 | 2006-06-15 | 2006-07-19 |
---|---|---|---|
WQ-01 | 4.5 | 3.7 | 6.8 |
WQ-02 | 9.7 | 5.1 | 7.2 |
WQ-03 | 10 | 6.1 | 8.8 |
and rearranges it into this form:
Station | Date | Value |
---|---|---|
WQ-01 | 2006-05-23 | 4.5 |
WQ-02 | 2006-05-23 | 3.7 |
WQ-03 | 2006-05-23 | 6.8 |
WQ-01 | 2006-06-15 | 9.7 |
WQ-02 | 2006-05-15 | 5.1 |
WQ-03 | 2006-06-15 | 7.2 |
WQ-01 | 2006-07-19 | 10 |
WQ-02 | 2006-07-19 | 6.1 |
WQ-03 | 2006-07-19 | 8.8 |
Input and output are both text (CSV) files.
You can use the un-xtab program to rearrange data that have been provided in a format designed for readability into a format that is more suitable for storage in a database, or for use with statistical, modeling, graphics, or other software.
The un-xtab program can deal with crosstabbed formats that include multiple rows of column headers and groups of data values that were crosstabbed together. For example, un-xtab can convert something like this:
Laboratory Name | |||||||||
Project Name | |||||||||
Arsenic | Cadmium | Copper | |||||||
mg/kg | mg/kg | mg/kg | |||||||
dry | dry | dry | |||||||
Client sample ID | Lab sample ID | Preparation date | Analysis date | Concentration | Qualifier | Concentration | Qualifier | Concentration | Qualifier |
---|---|---|---|---|---|---|---|---|---|
SD00231 | 45298-01 | 2013-06-23 | 2013-06-24 | 0.11 | U | 112 | 251 | ||
SD00232 | 45298-02 | 2013-06-23 | 2013-06-24 | 8.21 | 582 | 396 | |||
SD00233 | 45298-03 | 2013-06-23 | 2013-06-24 | 0.11 | U | 45.1 | 2.4 | U | |
SD00235 | 45298-04 | 2013-06-23 | 2013-06-24 | 0.11 | U | 69 | 2.4 | U | |
SD00236 | 45298-05 | 2013-06-23 | 2013-06-24 | 4.53 | 98.9 | 85.8 |
to this:
Client sample ID | Lab sample ID | Preparation date | Analysis date | Analyte | Units | Measurement basis | Concentration | Qualifier |
---|---|---|---|---|---|---|---|---|
SD00231 | 45298-01 | 2013-06-23 | 2013-06-24 | Arsenic | mg/kg | dry | 0.11 | U |
SD00231 | 45298-01 | 2013-06-23 | 2013-06-24 | Cadmium | mg/kg | dry | 112 | |
SD00231 | 45298-01 | 2013-06-23 | 2013-06-24 | Copper | mg/kg | dry | 251 | |
SD00232 | 45298-02 | 2013-06-23 | 2013-06-24 | Arsenic | mg/kg | dry | 8.21 | |
SD00232 | 45298-02 | 2013-06-23 | 2013-06-24 | Cadmium | mg/kg | dry | 582 | |
SD00232 | 45298-02 | 2013-06-23 | 2013-06-24 | Copper | mg/kg | dry | 396 | |
SD00233 | 45298-03 | 2013-06-23 | 2013-06-24 | Arsenic | mg/kg | dry | 0.11 | U |
SD00233 | 45298-03 | 2013-06-23 | 2013-06-24 | Cadmium | mg/kg | dry | 45.1 | |
SD00233 | 45298-03 | 2013-06-23 | 2013-06-24 | Copper | mg/kg | dry | 2.4 | U |
SD00235 | 45298-04 | 2013-06-23 | 2013-06-24 | Arsenic | mg/kg | dry | 0.11 | U |
SD00235 | 45298-04 | 2013-06-23 | 2013-06-24 | Cadmium | mg/kg | dry | 69 | |
SD00235 | 45298-04 | 2013-06-23 | 2013-06-24 | Copper | mg/kg | dry | 2.4 | U |
SD00236 | 45298-05 | 2013-06-23 | 2013-06-24 | Arsenic | mg/kg | dry | 4.53 | |
SD00236 | 45298-05 | 2013-06-23 | 2013-06-24 | Cadmium | mg/kg | dry | 98.9 | |
SD00236 | 45298-05 | 2013-06-23 | 2013-06-24 | Copper | mg/kg | dry | 85.8 |
Syntax and Options
Configuration Parameters
Because of the variety of layouts of crosstabbed data, the information necessary to describe each crostabbed file can be complex. For this reason, and also to facilitate documentation of the way in which a file was handled, the un-crosstabbing information is read from a configuration file rather than passed to the program on the command line.
The configuration file may contain several sections, where each section contains cofiguration data for a separate data set to be normalized. For example, each section in a configuration file could pertain to a different worksheet in an Excel workbook of crosstabbed data.
Each section is identified by a unique name. The name must be enclosed in square brackets, e.g.:.
The configuration data in each section consists of a set of pairs of keywords and values. The keywords and values must be separated by a colon or an equal sign, e.g.:
or
Each configuration option must be on its own line. Comment lines can be included in the configuration file; these must begin with a hash symbol (#). The configuration keywords and their meanings are as folllows.
- data_columns
- A list of the columns in the input file containing crosstabbed data that are to be normalized.
- data_rows
- A list of the rows containing data to be normalized.
- row_headers
- A list of the columns to the left (or right) of the crosstabbed data columns that are to be preserved in the normalized output.
- column_header_rows
- A list of the rows containing column header values that are to be used in the normalized output.
- column_group_count
- The number of columns (they must be contiguous) containing related data values that should appear on the same row of the normalized output.
- column_header_label_#
- The header text (label) for one of the column_group_count output columns. The value consists of either two digits or a string. If the value is two digits, these refer to a cell in the matrix of column header cells (column_header_rows by column_group_count). The digits are the row number and the column number, in that order. If the value is a string, the string is used directly. There should be exactly column_header_count column_header_label keywords, and each keyword should include a numeric suffix to make it unique. The numeric suffixes should range from 1 to column_group_count.
- header_as_column_#
- A multi-part value that identifies a cell in the matrix of (column_header_rows by column_group_count) cells that contains a value that is to be propagated into the normalized output as a new column. The multi-part value consists of two digits and a string. The two digits identify the cell in the column header matrix that is to be used as a data value. The digits are the row number and the column number, in that order. The string is used as the header for that column.
- row_headers_row
- The row on which the headers for the row_headers columns appear. This configuration parameter is optional. If omitted, the maximum of column_header_rows will be used. If specified, the value must be one of the column_header_rows.
- nd_values
- A list of strings that represent missing values in cells of the crosstabbed data table. If all column_header_count values of a set are missing, a row will not be written to the output file for this set of values.
List values should be separated by commas. Numeric lists may include ranges, consisting of two integers separated by a dash. All configuration parameters are required except for nd_values. The meaning of several of these configuration parameters is illustrated in the following figure.
An example set of configuration parameters is shown below. This are the parameters that were used to normalize the chemical data set shown above.
Copyright and License
Copyright (c) 2014, 2016 R.Dreas Nielsen
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.