Check the format and content of a delimited text file
chkcsv.py

chkcsv.py is a Python module and program that checks the format and content of a comma-separated-value (CSV) or similar delimited text file. It can check whether required columns are present, and the type, length, and pattern of each column.

Syntax and Options

chkcsv.py [options] <CSV file name>

Arguments:
  <CSV file name>       The name of the CSV file to check.

Options:
  --version             Show program's version number and exit
  -h, --help            Show this help message and exit
  -s, --showspecs       Show the format specifications allowed in the
                        configuration file, and exit.
  -f FORMATSPEC, --formatspec=FORMATSPEC
                        Name of the file with the format specification.
                        The default is the name of the CSV file with an
                        extension of fmt.
  -r, --required        A data value is required in data columns for
                        which the format specification does not include
                        an explicit specification of whether data is
                        required for a column. The default is false
                        (i.e., data are not required).
  -q, --columnsnotrequired
                        Columns listed in the format configuration file
                        are not required to be present unless the
                        column_required specification is explicitly set
                        in the configuration file. The default is true
                        (i.e., all columns in the configuration file are
                        required in the CSV file).
  -c, --columnexit      Exit immediately if there are more columns in
                        the CSV file header than are specified in the
                        format configuration file.
  -l, --linelength      Allow rows of the CSV file to have fewer columns
                        than in the column headers. The default is to
                        report an error for short data rows. If short
                        data rows are allowed, any row without enough
                        columns to match the format specification will
                        still be reported as an error.
  -i, --case-insensitive
                        Case-insensitive matching of column names in the
                        format configuration file and the CSV file. The
                        default is case-sensitive (i.e., column names
                        must match exactly).
  -e ENCODING, --encoding=ENCODING
                        Character encoding of the CSV file. It should be
                        one of the strings listed at
                        http://docs.python.org/library/codecs.html#standard-encodings.
  -o OPTSECTION, --optsection=OPTSECTION
                        An alternate name for the chkcsv options section
                        in the format specification configuration file.
  -x, --exitonerror     Exit when the first error is found.
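
When chkcsv.py is run from another script, the command line can be assembled from the options above. The following is a minimal sketch in Python 3; the helper name build_chkcsv_command, its parameters, and the file names are hypothetical and are not part of chkcsv.py. It assumes that chkcsv.py is in the current directory.

```python
import sys


def build_chkcsv_command(csv_file, formatspec=None, encoding=None,
                         data_required=False, exit_on_error=False):
    """Assemble an argument list for running chkcsv.py (hypothetical helper)."""
    # Assumption: chkcsv.py is in the current working directory.
    cmd = [sys.executable, "chkcsv.py"]
    if formatspec:
        cmd += ["-f", formatspec]   # -f FORMATSPEC
    if encoding:
        cmd += ["-e", encoding]     # -e ENCODING
    if data_required:
        cmd.append("-r")            # data required unless the spec says otherwise
    if exit_on_error:
        cmd.append("-x")            # stop at the first error
    cmd.append(csv_file)            # the CSV file name is the only positional argument
    return cmd


command = build_chkcsv_command("samples.csv", formatspec="samples.fmt",
                               exit_on_error=True)
print(command[1:])
```

The resulting list can be passed to subprocess.run(). How chkcsv.py reports errors to the caller (output stream, exit status) is not described here, so the sketch makes no assumption about it.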

Format Specifications

The format of each of the columns of the CSV file is specified in a separate configuration file containing a section for each column. Each section begins with the column name in square brackets, followed by key-value pairs identifying the specifications for that column. Each key-value pair consists of a keyword and an associated value. Keywords and values should be separated by either "=" or ":". Each keyword should be at the beginning of a line.
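
The layout described above (bracketed section names, keywords separated from values by "=" or ":") is one that Python's standard configparser module can read. The sketch below, written for Python 3, shows this for a small specification. Whether chkcsv.py itself uses configparser is an assumption, and the specification text is made up for illustration.

```python
import configparser

# A small, made-up format specification using both "=" and ":" separators.
SPEC_TEXT = """\
[Station]
data_required = True
type: string
minlen = 4
maxlen = 12
"""

# interpolation=None keeps "%" characters literal, which matters if a
# pattern value ever contains one.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SPEC_TEXT)

station = parser["Station"]
print(station.getboolean("data_required"))
print(station.get("type"))
print(station.getint("minlen"), station.getint("maxlen"))
```

Section names keep their case in configparser, while keywords are lowercased by default.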

By default, the configuration file has the same name as the CSV file, but with an extension of ".fmt". An alternate configuration file can be specified with the "-f" command-line option.
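
The default naming rule can be sketched as follows. The function name default_formatspec_name is hypothetical, and the sketch assumes that only the last extension of the CSV file name is replaced; how chkcsv.py treats file names with no extension or with several dots is not stated here.

```python
import os.path


def default_formatspec_name(csv_file):
    """Return the CSV file name with its extension replaced by ".fmt"."""
    base, _ext = os.path.splitext(csv_file)  # drops only the last extension
    return base + ".fmt"


print(default_formatspec_name("samples.csv"))
```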

The keywords that can be used for column format specifications are listed below. A specific type of value should be provided for each keyword. Boolean values are indicated by "Yes", "No", "True", "False", "On", "Off", "1", or "0". Format specification keywords and values should not be quoted in the configuration file. The allowable keywords are:

column_required
Indicates whether or not the column must be present in the CSV file. This is a Boolean value. The default value is True, and can be changed with the "-q" command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
data_required
Indicates whether or not a value is required in this column on every row of the CSV file. This is a Boolean value. The default value is False, and can be changed with the "-r" command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
type
Identifies the type of data in the data column. Valid values are "string", "integer", "float", "bool", "date", and "datetime". Data values in the CSV file will be checked for compatibility with the specified type. If the data type is not specified, data values will be treated as strings; that is, the minimum length, maximum length, and pattern will be checked if they have been specified.
minlen
The required minimum length of data values for this column. This is only checked for string data types and for data with no type specified.
maxlen
The maximum allowed length of data values for this column. This is only checked for string data types and for data with no type specified.
pattern
A regular expression specifying the content of the column value. Patterns must match at the beginning of the column value. This is checked for string, date, and datetime data types, and for data with no type specified.
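
The way these keywords combine for a single data value can be sketched as follows. This is an illustration written for Python 3, not the code used by chkcsv.py. The function check_value, the dictionary form of the specification, and the message texts are hypothetical. The "bool", "date", and "datetime" type checks are left out because the accepted formats are not described here, and an empty string is assumed to mean a missing value.

```python
import re


def check_value(value, spec):
    """Return a list of problems found in one data value (illustrative only)."""
    problems = []
    if value == "":  # assumption: an empty string is a missing value
        if spec.get("data_required", False):
            problems.append("missing required value")
        return problems

    dtype = spec.get("type", "string")  # untyped data is treated as a string
    if dtype == "integer":
        try:
            int(value)
        except ValueError:
            problems.append("not an integer")
    elif dtype == "float":
        try:
            float(value)
        except ValueError:
            problems.append("not a float")

    # Lengths are checked only for string (or untyped) data.
    if dtype == "string":
        if "minlen" in spec and len(value) < spec["minlen"]:
            problems.append("shorter than minlen")
        if "maxlen" in spec and len(value) > spec["maxlen"]:
            problems.append("longer than maxlen")

    # Patterns are checked for string, date, and datetime data;
    # re.match anchors the pattern at the beginning of the value.
    if dtype in ("string", "date", "datetime") and "pattern" in spec:
        if re.match(spec["pattern"], value) is None:
            problems.append("does not match pattern")
    return problems


sample_spec = {"type": "string", "data_required": True,
               "minlen": 4, "maxlen": 20, "pattern": "(SO|SD|WA).*"}
print(check_value("SO-001", sample_spec))
print(check_value("XX1", sample_spec))
```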

Usage Notes

Examples

An example configuration file might look like this:

[Study]
data_required=True
type=string
minlen=5
maxlen=20

[Station]
data_required=True
type=string
minlen=4
maxlen=12

[SampleDate]
type=date

[Sample]
type=string
data_required=True
minlen=4
maxlen=20
pattern=(SO|SD|WA).*

[Description]
type=string
column_required=False
maxlen=120

[UpperDepth]
type=float
data_required=True

[LowerDepth]
type=float

[DepthUnits]
type=string
data_required=True
pattern=(?i)(FT|M|CM)$
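
The two patterns in this example can be tried out with Python's re module. The sketch below assumes that "match at the beginning of the column value" corresponds to re.match; the helper name matches and the test values are made up for illustration.

```python
import re

sample_pattern = r"(SO|SD|WA).*"     # from the [Sample] section
units_pattern = r"(?i)(FT|M|CM)$"    # from the [DepthUnits] section


def matches(pattern, value):
    """True if the pattern matches starting at the beginning of the value."""
    return re.match(pattern, value) is not None


print(matches(sample_pattern, "SD-0042"))    # True: starts with SD
print(matches(sample_pattern, "X-SD-0042"))  # False: SD is not at the start
print(matches(units_pattern, "ft"))          # True: (?i) ignores case
print(matches(units_pattern, "CM"))          # True
print(matches(units_pattern, "MM"))          # False: "$" requires the value to end after the unit
```

Because the match starts at the beginning of the value and the units pattern ends with "$", that pattern accepts only values that consist entirely of one of the listed units.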

Copyright and License

Copyright (c) 2011, R.Dreas Nielsen

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.