chkcsv.py is a Python module and program that checks the format and content of a comma-separated-value (CSV) or similar delimited text file. It can check whether required columns are present and whether the values in each column match a specified type, length, and pattern.
Syntax and Options
Format Specifications
The format of each of the columns of the CSV file is specified in a separate configuration file containing a section for each column. Each section begins with the column name in square brackets, followed by key-value pairs identifying the specifications for that column. Each key-value pair consists of a keyword and an associated value. Keywords and values should be separated by either "=" or ":". Each keyword should be at the beginning of a line.
By default, the configuration file has the same name as the CSV file, but with an extension of ".fmt". An alternate configuration file can be specified with the "-f" command-line option.
The keywords that can be used for column format specifications are listed below. A specific type of value should be provided for each keyword. Boolean values are indicated by "Yes", "No", "True", "False", "On", "Off", "1", or "0". Format specification keywords and values should not be quoted in the configuration file. The allowable keywords are:
- column_required
  Indicates whether or not the column must be present in the CSV file. This is a Boolean value. The default value is True, and can be changed with the "-q" command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
- data_required
  Indicates whether or not a value is required in this column on every row of the CSV file. This is a Boolean value. The default value is False, and can be changed with the "-r" command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
- type
  Identifies the type of data in the column. Valid values are "string", "integer", "float", "bool", "date", and "datetime". Data values in the CSV file will be checked for compatibility with the specified type. If the data type is not specified, data values will be treated as strings; that is, minimum and maximum lengths and the pattern will be checked if they have been specified.
- minlen
  The required minimum length of data values for this column. This is checked only for string data types and for data with no type specified.
- maxlen
  The maximum allowed length of data values for this column. This is checked only for string data types and for data with no type specified.
- pattern
  A regular expression specifying the content of the column value. Patterns must match at the beginning of the column value. This is checked for string, date, and datetime data types, and for data with no type specified.
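The way these keywords combine for a single data value can be sketched in Python. This is a minimal illustration, not chkcsv.py's actual code: the function name check_value, the dictionary layout of the specification, the error strings, and the choice to skip further checks on an empty value are all assumptions. Checks for the "bool", "date", and "datetime" types are omitted.

```python
import re


def check_value(value, spec):
    """Return a list of problems for one value (hypothetical helper)."""
    problems = []
    if value == "":
        # Assumption: an empty value is only an error when data is required.
        if spec.get("data_required", False):
            problems.append("value required")
        return problems

    dtype = spec.get("type")  # None means no type given; treated as string
    if dtype == "integer":
        try:
            int(value)
        except ValueError:
            problems.append("not an integer")
    elif dtype == "float":
        try:
            float(value)
        except ValueError:
            problems.append("not a float")

    # Lengths apply only to string and untyped columns.
    if dtype in (None, "string"):
        if "minlen" in spec and len(value) < spec["minlen"]:
            problems.append("shorter than minlen")
        if "maxlen" in spec and len(value) > spec["maxlen"]:
            problems.append("longer than maxlen")

    # Patterns apply to string, date, datetime, and untyped columns,
    # and must match at the beginning of the value (re.match).
    if dtype in (None, "string", "date", "datetime") and "pattern" in spec:
        if re.match(spec["pattern"], value) is None:
            problems.append("pattern mismatch")
    return problems


# Hypothetical specification for one column.
site_spec = {"type": "string", "minlen": 3, "maxlen": 8,
             "pattern": r"[A-Z]{2}\d+$", "data_required": True}
print(check_value("AB123", site_spec))
print(check_value("ab", site_spec))
```

Returning a list rather than raising on the first failure lets a caller report every problem found in a value at once.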
Usage Notes
- The first line of the CSV file must contain the names of the columns.
- The order of column specifications in the configuration file does not have to match the order of columns in the CSV file.
- Format specification keywords for a column may be in any order within the column section in the configuration file.
- Column names in the CSV file and in the configuration file are case-sensitive, and must match exactly by default. If column names in the configuration file and the CSV file don't match because the case is different, an error will be reported only if the unmatched column is required. The "-i" command-line option can be used to allow case-insensitive matching of column names.
- The pattern that a column should match is specified by a regular expression. The regular expression syntax supported by chkcsv.py is as documented at http://docs.python.org/library/re.html.
- Patterns (regular expressions) must match at the beginning of the column value. To ensure that the regular expression matches the entire column value, you may need to include "$" at the end of the regular expression.
- By default, every column listed in the configuration file is required. If a required column name is not present in the header row of the CSV file, chkcsv.py reports an error and halts immediately. The default can be changed with the "-q" command-line option. If "-q" is used, or the "column_required" format specification is set to False, a missing column does not cause an error. If the column is present, all of its other format specifications are still applied: a column that is not required must still pass its other tests whenever it is present.
- chkcsv.py recognizes a wide variety of date and datetime formats. It may actually recognize a date or datetime format that the target software (e.g., a DBMS) does not. In this case, specifying a pattern for the date column can usefully restrict the types of date and datetime values that are accepted.
- The CSV file is expected to have the same number of data items on each row as there are column names in the first row of the file. If the "-l" command-line option is used, the CSV file may have varying numbers of data values in each row, as long as each row has enough values to correspond to each data column that will be checked. That is, if "-l" is used, and there are columns to the right of all of the required columns, those data items may or may not be present in any row without causing an error. However, if a row is short because a value in a required column is missing, and the omission does not violate any format specification, the error may go undetected.
- chkcsv.py does not transform the input file in any way. It does not produce any output file or send any output to stdout except for help and version messages. chkcsv.py only writes error messages, if any, to stderr and sets the exit status value when it terminates.
- chkcsv.py is intended to verify that a data file is suitable for import to database, statistical, graphics, modeling, or other software. The checks that it can perform are generally sufficient to determine whether each data column is compatible with typical specifications for a database column. However, chkcsv.py does not do any row-level checks to verify that column values within a row are consistent with each other. Nor does it do any dataset-level checks to ensure, for example, that each row is unique.
- chkcsv.py includes a provision to allow additional options to be specified in a special section of the configuration file. By default, the name of this special section is "chkcsvoptions". A different name for this special section can be specified with the "-o" command-line option. Currently no special options are supported, and if this special section is present in the configuration file, it is ignored.
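Two of the notes above, case-sensitive column names and patterns that match only at the beginning of a value, can be illustrated with a short Python sketch. The helper find_missing_columns, the sample data, and the two patterns are hypothetical. The use of re.match is an assumption consistent with the stated matching rule, not a confirmed detail of chkcsv.py.

```python
import csv
import io
import re


def find_missing_columns(header, required, ignore_case=False):
    """Return the required column names that are absent from the header."""
    if ignore_case:
        present = {name.lower() for name in header}
        return [name for name in required if name.lower() not in present]
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical CSV content; the first line holds the column names.
SAMPLE = "Site_ID,sample_date\nAB123,2011-05-01\n"
header = next(csv.reader(io.StringIO(SAMPLE)))

# Exact matching reports "site_id" as missing; ignoring case does not.
print(find_missing_columns(header, ["site_id", "sample_date"]))
print(find_missing_columns(header, ["site_id", "sample_date"], ignore_case=True))

# re.match anchors the pattern at the start of the value only, so trailing
# text is accepted unless the pattern ends with "$".
START_ONLY = r"\d{4}-\d{2}-\d{2}"
WHOLE_VALUE = r"\d{4}-\d{2}-\d{2}$"
value = "2011-05-01T12:00"
print(re.match(START_ONLY, value) is not None)
print(re.match(WHOLE_VALUE, value) is not None)
```

The same anchored pattern is one way to restrict a date column to a single format, such as ISO-style dates, when the target software accepts fewer formats than chkcsv.py does.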
Examples
An example configuration file might look like this:
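The sketch below embeds a hypothetical format file in a Python string and reads it with the standard configparser module to confirm that it is well-formed. The column names (site_id, sample_date, result) and their values are illustrative assumptions. Using configparser is also an assumption about how such a file could be read, chosen because it accepts the same "=" and ":" separators and the same Boolean spellings described above.

```python
import configparser

# Hypothetical format file contents; column names are illustrative only.
EXAMPLE_FMT = r"""
[site_id]
type = string
minlen = 3
maxlen = 8
pattern = [A-Z]{2}\d+$

[sample_date]
type: date
data_required: Yes
pattern: \d{4}-\d{2}-\d{2}$

[result]
column_required = No
type = float
"""

# interpolation=None keeps "%" in regular expressions from being
# treated as interpolation syntax.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE_FMT)

for column in parser.sections():
    section = parser[column]
    # Defaults mirror the documented ones: column required, data optional.
    required = section.getboolean("column_required", fallback=True)
    needs_data = section.getboolean("data_required", fallback=False)
    print(column, section.get("type", fallback="string"), required, needs_data)
```

Saved under the CSV file's name with a ".fmt" extension, the text of EXAMPLE_FMT would be found by default; any other file name would need the "-f" option.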
Copyright and License
Copyright (c) 2011, R. Dreas Nielsen
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.