Execute SQL on CSV files
querycsv.py

querycsv.py is a Python module and program that allows you to execute SQL code against data contained in one or more comma-separated-value (CSV) files. The output of the SQL query will be displayed on the console by default, but may be saved in a new CSV file.

Capabilities

Actions that can be performed with the querycsv program include:

Syntax and Options

querycsv.py -i <fname> [other options] [SELECT_stmt] Options: -i <fname> Input CSV file name. Multiple -i options can be used to specify more than one input file. -u <fname> Use the specified sqlite file for input. Options -i, -f, and -k are ignored if -u is specified -o <fname> Send output to the named CSV file. -s Execute a SQL script from the file given as the argument. -f <fname> Use a sqlite file instead of memory for intermediate storage. -k Keep the sqlite file when done (only valid with -f). -h Print this help and exit.

Notes

  1. The first line of each input (CSV) file must contain column names
  2. Table names used in the SQL should match the input CSV file names, without either the leading path or the trailing filename extension.
  3. When multiple input files (or a sqlite file with multiple tables) are used, SQL JOIN clauses can be used to combine the data.
  4. When a SQL script file is used instead of a single SQL commmand on the command line, only the output of the last SQL command will be displayed.
  5. Output to the console (the default) is formatted for readability. Output to a disk file is formatted as CSV, with commas delimiting columns and double quotes around strings.
  6. The primary intended purpose of a command-line SQL statement is to execute a SELECT or UNION statement against the data, and UPDATE and INSERT statements do not have any effect on input CSV file(s). An effect equivalent to an UPDATE statement can be achieved with SELECT statements, however, with output directed to a new CSV file. To perform an INSERT opertation it is necessary to either save the data in a Sqlite file or to use a script file with separate INSERT and SELECT statements.
  7. The SQL language features that can be used with querycsv are those supported by the Sqlite language.

Examples

Given an input data file named testdata.csv that looks like this:

"location_id","sample_date","sample_id","material","analyte","value","units","meas_basis","quals" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Aluminum","7310","mg/kg","dry","" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Antimony","6.2","mg/kg","dry","UJ" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Arsenic","3.9","mg/kg","dry","U" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Barium","101","mg/kg","dry","" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Beryllium","0.36","mg/kg","dry","J" "RM708B1","2005-04-07 00:00:00","05144404","Sediment","Cadmium","1.6","mg/kg","dry","" . . . . . .

following are several examples of how querycsv could be used.

Run a simple SELECT statement to extract and display data

The code

querycsv.py -i testdata.csv "select distinct analyte from testdata;"

will produce output like the following.

analyte =========== Aluminum Antimony Arsenic Barium Beryllium Cadmium . . . . . .

Run a simple SELECT statement to extract data into a new CSV file

The code

querycsv.py -i testdata.csv -o maxconc.csv "select analyte, max(value) from testdata group by analyte;"

will create a new CSV file named "maxconc.csv" with the following contents.

"analyte","max(value)" "Aluminum","7310" "Antimony","7.9" "Arsenic","6.8" "Barium","863" "Beryllium","1.4" "Cadmium","4.7" . . . . . .

Execute multiple SQL statements using a script file

To view the number of samples for which each analyte has been measured in the given data file, the following SQL commands could be used.

create view v_sampanal as select distinct sample_id, analyte from testdata; select analyte, count(sample_id) as num_samps from v_sampanal group by analyte;

If these commands were saved in a text file named "sampcount.sql", the the following code

querycsv.py -i testdata.csv -s sampcount.sql"

would produce output like the following.

analyte | num_samps ======================= Aluminum | 5 Antimony | 5 Arsenic | 5 Barium | 5 Beryllium | 5 Cadmium | 5 . . . . . .

Note that this is a contrived example, as the same result could be obtained with a single command as follows.

querycsv.py -i testdata.csv "select analyte, count(sample_id) as num_samps from (select distinct sample_id, analyte from testdata) group by analyte;"

Script files can be useful for solving more complex querying problems, however. Note that only the output of the last SQL command in a script file is displayed, so that if input is taken from a Sqlite file, and a view is created (as in the example script above), then a command to drop the view should not be the last command in the script file.

The end of each SQL command in the script file is identified by a line that ends with a semicolon. Comment lines are distinguished by two dashes ("--") at the beginning of a line. No more sophisticated parsing or interpretation of the SQL commands is done, so comments and SQL commands should not be placed on the same line, nor should two different SQL commands be put on the same line.

Create and use a Sqlite file

The "-f" and "-k" options can be used to create a Sqlite file, as follows.

querycsv.py -i testdata.csv -f testdata.sqlite -k "select * from testdata.csv limit 1;"

The SQL command used in the previous command line may or may not produce useful output, but it must be present.

The Sqlite file that was created can then be used as follows.

querycsv.py -u testdata.sqlite "select distinct location_id from testdata;"

Note that the name(s) of the table(s) in the Sqlite file must be known; they will not necessarily match the name of the Sqlite file.

Copyright and License

Copyright (c) 2008, R.Dreas Nielsen

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.