Welcome to savReaderWriter’s documentation!

In the documentation below, the associated SPSS commands are given in CAPS. See also the IBM SPSS Statistics Command Syntax Reference.pdf for info about SPSS syntax.

I always appreciate getting feedback on this package, so I can keep improving it!

See also

The savReaderWriter program uses the SPSS I/O module (.so, .dll, .dylib, depending on your Operating System). Users of the SPSS I/O module should read the International License Agreement before using the SPSS I/O module. By downloading, installing, copying, accessing, or otherwise using the SPSS I/O module, licensee agrees to the terms of this agreement. Copyright © IBM Corporation™ 1989, 2012 — all rights reserved.

Installation

Platforms

As shown in Table 0 below, this program works for Linux (incl. z/Linux), Windows, Mac OS (32 and 64 bit), AIX-64, HP-UX and Solaris-64. Version 3.2 has been tested on Linux 32 (Ubuntu and Mint), Windows (mostly on Windows XP 32, but also a few times on Windows 7 64), and Mac OS (with an earlier version of savReaderWriter). Version 3.3 has been tested on Linux 64. The other OSs are entirely untested. I intend to use Jenkins CI and Vagrant to systematically test more platforms in the future (time, time!).

Table 0. Supported platforms for savReaderWriter

Operating System   Architecture
                   32 bit   64 bit
----------------   ------   ------
AIX                         X
HP-UX                       X
Linux              X        X
Mac OS             X        X?
Solaris                     X
Windows            X        X
zLinux                      X

Setup

The program can be installed by running:

python setup.py install

Or alternatively:

pip install savReaderWriter --allow-all-external

To get the ‘bleeding edge’ version straight from the repository do:

pip install -U -e git+https://bitbucket.org/fomcl/savreaderwriter.git#egg=savreaderwriter

Note

Users of Mac OS X need to do two additional things:

  • DYLD_LIBRARY_PATH needs to be set to the directory where the SPSS I/O libraries for Mac OS X live. You may also want to edit your ~/.bashrc accordingly.
  • ioLocale needs to be set manually (work-around). The ioLocale is the locale of the SPSS I/O, which is supposed to be copied from the host system, if unset (i.e., equal to None). However, Python locale.setlocale and locale.getlocale are quirky in Mac OS X (see also this OS X and Python locale snippet).

The code below shows an example that uses Python 2.7.2 (Python 3.3.5 also works) under Mac OS X Mountain Lion (10.8; the uname output reports Darwin 12.2.0):

fomcls-Mac-Pro:~ fomcl$ uname -a
Darwin fomcls-Mac-Pro.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_6
fomcls-Mac-Pro:~ fomcl$ export DYLD_LIBRARY_PATH=/Library/Python/2.7/site-packages/savReaderWriter/spssio/macos
fomcls-Mac-Pro:savReaderWriter fomcl$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
>>> import savReaderWriter
>>> savFileName = "/Library/Python/2.7/site-packages/savReaderWriter/test_data/Employee data.sav"
>>> with savReaderWriter.SavReader(savFileName, ioLocale='en_US.UTF-8') as reader:
...     for line in reader:
...         print line
...
[1.0, 'm', '1952-02-03', 15.0, 3.0, 57000.0, 27000.0, 98.0, 144.0, 0.0]
[2.0, 'm', '1958-05-23', 16.0, 1.0, 40200.0, 18750.0, 98.0, 36.0, 0.0]
[3.0, 'f', '1929-07-26', 12.0, 1.0, 21450.0, 12000.0, 98.0, 381.0, 0.0]
[4.0, 'f', '1947-04-15', 8.0, 1.0, 21900.0, 13200.0, 98.0, 190.0, 0.0]
# etc. etc.

Changed in version 3.3.

  • The savReaderWriter program now runs on Python 2 and 3. It is tested with Python 2.7, 3.3 and PyPy under Debian Linux 3.2.0-4-AMD64.
  • Under Python 3.3, the data are in bytes! Use the b' prefix when writing string data, or write data in unicode mode (ioUtf8=True).
  • Several bugs were removed, notably two that prevented the I/O modules from loading in 64-bit Linux and 64-bit Windows systems (NB: these bugs were entirely unrelated). I re-downloaded the SPSS I/O v21 FP1 modules because the Win 64 libs were incorrectly compiled. In addition, long variable labels were truncated to 120 characters, which is now fixed.
  • This has not yet been tested for performance.
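As a minimal sketch of the bytes requirement under Python 3: the helper below (to_bytes_record is a hypothetical name, not part of savReaderWriter) encodes the string values of a record before it is handed to writerow. UTF-8 is an assumption here; in codepage mode the encoding that belongs to your ioLocale may be the right one instead.

```python
def to_bytes_record(record, encoding="utf-8"):
    """Return a copy of record in which every str value is encoded to bytes.

    Hypothetical helper; numbers and values that already are bytes pass through.
    """
    return [value.encode(encoding) if isinstance(value, str) else value
            for value in record]


record = ["Test1", 1, 1]
print(to_bytes_record(record))  # [b'Test1', 1, 1]
```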

Changed in version 3.2.

  • The savReaderWriter program is now self-contained. That is, the IBM SPSS I/O modules now all load by themselves, without any changes being required anymore to PATH, LD_LIBRARY_PATH and equivalents. Also, no extra .deb files need to be installed anymore (i.e. no dependencies).
  • savReaderWriter now uses version 21.0.0.1 (i.e., Fixpack 1) of the I/O module.

Optional features

cWriterow. The cWriterow package is a faster Cython implementation of the pyWriterow method (66 % faster). To install it, you need Cython and run setup.py in the cWriterow folder:

easy_install cython
python setup.py build_ext --inplace

TODO: Note that cWriterow is not yet ready for use under Python 3 (in fact, the code may need to be fixed for Python 2, too).

psyco. The psyco package may be installed to speed up reading (66 % faster). Note that psyco is no longer maintained and that this will therefore be removed from the program at some point.

numpy. The numpy package should be installed if you intend to use array slicing (e.g. data[:2, 2:4]).

Environment variable

To issue warnings you can set an environment variable SAVRW_DISPLAY_WARNS to any of the following actions: “error”, “ignore”, “always”, “default”, “module”, “once”. If the environment variable is not defined, warnings are ignored. Note that warnings are usually harmless, e.g. SPSS_NO_LABELS. See: http://docs.python.org/2/library/warnings.html.
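The sketch below illustrates how the six actions map onto the filters of Python's warnings module. The function warning_action is hypothetical and only mimics the behaviour described above (unset means "ignore"); it is not the package's own code. It also assumes that the variable is read when savReaderWriter is imported, so it is set first.

```python
import os
import warnings

VALID_ACTIONS = ("error", "ignore", "always", "default", "module", "once")


def warning_action(environ=None):
    """Return the action named by SAVRW_DISPLAY_WARNS ('ignore' if unset)."""
    environ = os.environ if environ is None else environ
    action = environ.get("SAVRW_DISPLAY_WARNS", "ignore")
    if action not in VALID_ACTIONS:
        raise ValueError("SAVRW_DISPLAY_WARNS must be one of: %s"
                         % ", ".join(VALID_ACTIONS))
    return action


# Set the variable from within Python (assumption: do this before the import)
os.environ["SAVRW_DISPLAY_WARNS"] = "always"

# The same action string is what the standard warnings filters accept
warnings.simplefilter(warning_action())
```

From a shell, the equivalent would be to export SAVRW_DISPLAY_WARNS before starting Python.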

SavWriter – Write SPSS system files

savReaderWriter.SavWriter(savFileName, varNames, varTypes[, valueLabels=None, varLabels=None, formats=None, missingValues=None, measureLevels=None, columnWidths=None, alignments=None, varSets=None, varRoles=None, varAttributes=None, fileAttributes=None, fileLabel=None, multRespDefs=None, caseWeightVar=None, overwrite=True, ioUtf8=False, ioLocale=None, mode="wb", refSavFileName=None])

Write SPSS system files (.sav, .zsav)

Parameters:
  • savFileName – the file name of the spss data file. File names that end with ‘.zsav’ are compressed using the ZLIB (ZSAV) compression scheme (requires v21 SPSS I/O files), while for file names that end with ‘.sav’ the ‘old’ compression scheme is used (it is not possible to generate uncompressed files unless you modify the source code).
  • varNames – list of the variable names in the order in which they appear in the spss data file.
  • varTypes – varTypes dictionary {varName: varType}, where varType == 0 means ‘numeric’, and varType > 0 means ‘character’ of that length (in bytes)
  • valueLabels – value label dictionary {varName: {value: label}}. Cf. VALUE LABELS (default: None).
  • varLabels – variable label dictionary {varName: varLabel}. Cf. VARIABLE LABEL (default: None).
  • formats – print/write format dictionary {varName: spssFmt}. Commonly used formats include F (numeric, e.g. F5.4), N (numeric with leading zeroes, e.g. N8), A (string, e.g. A8) and EDATE/ADATE (European/American date, e.g. ADATE30). Cf. FORMATS (default: None). See also under Formats.
  • missingValues

    missing values dictionary {varName: {missing_value_spec}}. Cf. MISSING VALUES (default: None). For example:

    missingValues = {

      # discrete values
      b"someNumvar1": {"values": [999, -1, -2]},

      # range, cf. MISSING VALUES x (-9 THRU -1)
      # note also that 'lower', 'upper', 'value(s)' are without b' prefix
      b"someNumvar2": {"lower": -9, "upper": -1},
      b"someNumvar3": {"lower": -9, "upper": -1, "value": 999},

      # string variables can have up to three missing values
      b"someStrvar1": {"values": [b"foo", b"bar", b"baz"]},
      b"someStrvar2": {"values": b"bletch"}
    }
    

    Warning

    measureLevels, columnWidths, alignments must all three be set, if used

  • measureLevels – measurement level dictionary {varName: <level>}. Valid levels are: “unknown”, “nominal”, “ordinal”, “scale”, “ratio”, “flag”, “typeless”. Cf. VARIABLE LEVEL (default: None).
  • columnWidths – column display width dictionary {varName: <int>}. Cf. VARIABLE WIDTH. (default: None –> >= 10 [stringVars] or automatic [numVars]).
  • alignments – alignment dictionary {varName: <left/center/right>} Cf. VARIABLE ALIGNMENT (default: None –> numerical: right, string: left).
  • varSets – sets dictionary {setName: [list_of_valid_varNames]}. Cf. SETSMACRO extension command. (default: None).
  • varRoles – variable roles dictionary {varName: varRole}. VarRoles may be any of the following: ‘both’, ‘frequency’, ‘input’, ‘none’, ‘partition’, ‘record ID’, ‘split’, ‘target’. Cf. VARIABLE ROLE (default: None).
  • varAttributes

    variable attributes dictionary {varName: {attribName: attribValue}}. Cf. VARIABLE ATTRIBUTES. (default: None). For example:

    varAttributes = {b'gender': {b'Binary': b'Yes'},
                     b'educ': {b'DemographicVars': b'1'}}
    
  • fileAttributes

    file attributes dictionary {attribName: attribValue}. Square brackets indicate attribute arrays, which must start with 1. Cf. FILE ATTRIBUTES. (default: None). For example:

    fileAttributes = {b'RevisionDate[1]': b'10/29/2004',
                      b'RevisionDate[2]': b'10/21/2005'}
    
  • fileLabel – file label string, which defaults to “File created by user <username> at <datetime>” if file label is None. Cf. FILE LABEL (default: None).
  • multRespDefs – Multiple response sets definitions (dichotomy groups or category groups) dictionary {setName: <set definition>}. In SPSS syntax, ‘setName’ has a dollar prefix (‘$someSet’). See also docstring of multRespDefs method. Cf. MRSETS. (default: None).
  • caseWeightVar – valid varName that is set as case weight. Cf. WEIGHT BY command.
  • overwrite – Boolean that indicates whether an existing SPSS file should be overwritten (default: True).
  • ioUtf8 – Boolean that indicates the mode in which text communicated to or from the I/O Module will be. Valid values are True (UTF-8/unicode mode, cf. SET UNICODE=ON) or False (Codepage mode, SET UNICODE=OFF) (default: False).
  • ioLocale – indicates the locale of the I/O module, cf. SET LOCALE (default: None, which is the same as ".".join(locale.getlocale())). Locale specification is OS-dependent. See also under SavHeaderReader.
  • mode – indicates the mode in which <savFileName> should be opened. Possible values are “wb” (write), “ab” (append), “cp” (copy: initialize header using <refSavFileName> as a reference file, cf. APPLY DICTIONARY). (default: “wb”).
  • refSavFileName – reference file that should be used to initialize the header (aka the SPSS data dictionary) containing the variable label, value label, missing value, etc. definitions. Only relevant in conjunction with mode="cp". (default: None).

Typical use:

from savReaderWriter import SavWriter

savFileName = 'someFile.sav'
records = [[b'Test1', 1, 1], [b'Test2', 2, 1]]
varNames = ['var1', 'v2', 'v3']
varTypes = {'var1': 5, 'v2': 0, 'v3': 0}
with SavWriter(savFileName, varNames, varTypes) as writer:
    for record in records:
        writer.writerow(record)
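A fuller sketch with some of the optional metadata arguments follows. It mirrors the variable names of the typical use above. The helper check_display_trio is hypothetical (not part of savReaderWriter) and merely enforces the warning that measureLevels, columnWidths and alignments must be set together. Whether keys and labels have to be bytes under Python 3 is an assumption to verify against the version 3.3 notes. The actual SavWriter call sits in a function that is not invoked here, so the sketch can be run without the SPSS I/O libraries.

```python
def check_display_trio(kwargs):
    """Raise ValueError unless measureLevels, columnWidths and alignments
    are either all given or all absent (hypothetical pre-flight check)."""
    trio = ("measureLevels", "columnWidths", "alignments")
    given = [name for name in trio if kwargs.get(name) is not None]
    if given and len(given) != len(trio):
        missing = sorted(set(trio) - set(given))
        raise ValueError("also set: " + ", ".join(missing))
    return kwargs


kwargs = dict(
    savFileName="someFile.sav",
    varNames=["var1", "v2", "v3"],
    varTypes={"var1": 5, "v2": 0, "v3": 0},
    varLabels={"var1": "a string variable", "v2": "a numeric variable"},
    valueLabels={"v3": {1: "yes", 2: "no"}},
    measureLevels={"var1": "nominal", "v2": "scale", "v3": "nominal"},
    columnWidths={"var1": 10, "v2": 8, "v3": 8},
    alignments={"var1": "left", "v2": "right", "v3": "right"},
)
check_display_trio(kwargs)


def write_records(records, **kwargs):
    """Write records with SavWriter (imported lazily, see the note above)."""
    from savReaderWriter import SavWriter
    with SavWriter(**kwargs) as writer:
        for record in records:
            writer.writerow(record)
```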

SavReader – Read SPSS system files

savReaderWriter.SavReader(savFileName[, returnHeader=False, recodeSysmisTo=None, verbose=False, selectVars=None, idVar=None, rawMode=False, ioUtf8=False, ioLocale=None])

Read SPSS system files (.sav, .zsav)

Parameters:
  • savFileName – the file name of the spss data file
  • returnHeader – Boolean that indicates whether the first record should be a list of variable names (default = False)
  • recodeSysmisTo – indicates to which value missing values should be recoded (default = None, i.e. no recoding is done)
  • selectVars – indicates which variables in the file should be selected. The variables should be specified as a list or a tuple of valid variable names. If None is specified, all variables in the file are used (default = None)
  • idVar – indicates which variable in the file should be used as the id variable for the ‘get’ method (default = None)
  • verbose – Boolean that indicates whether information about the spss data file (e.g., number of cases, variable names, file size) should be printed on the screen (default = False).
  • rawMode – Boolean that indicates whether values should get SPSS-style formatting, and whether date variables (if present) should be converted to ISO-dates. If True, the program does not format any values, which increases processing speed. (default = False)
  • ioUtf8 – Boolean that indicates the mode in which text communicated to or from the I/O Module will be. Valid values are True (UTF-8 mode aka Unicode mode) and False (Codepage mode). Cf. SET UNICODE=ON/OFF (default = False)
  • ioLocale – indicates the locale of the I/O module. Cf. SET LOCALE (default = None, which corresponds to ".".join(locale.getlocale())). See also under SavHeaderReader.

Warning

Once a file is open, ioUtf8 and ioLocale cannot be changed. The same applies after a file could not be successfully closed. Always ensure a file is closed by calling __exit__() (i.e., using a context manager) or close() (in a try - finally suite).

Typical use:

from savReaderWriter import SavReader

with SavReader("someFile.sav", returnHeader=True) as reader:
    header = next(reader)
    for line in reader:
        process(line)
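If a context manager is not an option, the try - finally suite mentioned in the warning could look roughly like this. FakeReader is a hypothetical stand-in that only imitates iteration and close(), so that the sketch runs without the SPSS I/O libraries; with the real package you would pass SavReader as the factory.

```python
def read_all(reader_factory, savFileName):
    """Collect all rows; the reader is closed even if iteration fails."""
    reader = reader_factory(savFileName)
    try:
        return [row for row in reader]
    finally:
        reader.close()


class FakeReader(object):
    """Hypothetical stand-in for SavReader: iterable, with a close() method."""

    def __init__(self, savFileName):
        self.rows = [[1.0, b"m"], [2.0, b"f"]]
        self.closed = False

    def __iter__(self):
        return iter(self.rows)

    def close(self):
        self.closed = True


rows = read_all(FakeReader, "someFile.sav")
print(len(rows))  # 2
```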

Use of __getitem__ and other methods:

data = SavReader("someFile.sav")
with data:

    # fetch all the data, if it fits into memory
    allData = data.all()

    # fetch subsets of the data
    print("The first six records look like this\n", data[:6])
    print("The first record looks like this\n", data[0])
    print("The last four records look like this\n", data.tail(4))
    print("The first five records look like this\n", data.head())
    print("First column:\n", data[..., 0])  # requires numpy
    print("Row 4 & 5, first three cols\n", data[4:6, :3])  # requires numpy

    # check the number of records
    print("The file contains %d records" % len(data))

    # print a file report
    print(str(data))

# Do a binary search for records --> idVar
# Assumes there is a variable named 'id' in someFile.sav.
data = SavReader("someFile.sav", idVar="id")
with data:
    print(data.get(4, "not found"))  # gets 1st record where id==4
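To illustrate the idea behind such a lookup (not the package's own implementation), here is a binary search with the standard bisect module. It assumes that the records are sorted by the id variable; get_first and the sample records are made up for the example.

```python
from bisect import bisect_left


def get_first(records, ids, key, default=None):
    """Return the first record whose id equals key, else default.

    ids must be sorted in ascending order and run parallel to records.
    """
    i = bisect_left(ids, key)
    if i < len(ids) and ids[i] == key:
        return records[i]
    return default


records = [[1.0, b"m"], [2.0, b"m"], [4.0, b"f"], [4.0, b"m"]]
ids = [record[0] for record in records]
print(get_first(records, ids, 4, "not found"))  # [4.0, b'f']
print(get_first(records, ids, 3, "not found"))  # not found
```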

SavHeaderReader – Read SPSS file meta data

savReaderWriter.SavHeaderReader(savFileName[, ioUtf8=False, ioLocale=None])

Read SPSS file meta data. Yields the same information as the SPSS command DISPLAY DICTIONARY.

Parameters:
  • savFileName – the file name of the spss data file
  • ioUtf8 – Boolean that indicates the mode in which text communicated to or from the I/O Module will be. Valid values are True (UTF-8 mode aka Unicode mode) and False (Codepage mode). Cf. SET UNICODE=ON/OFF (default = False)
  • ioLocale

    indicates the locale of the I/O module. Cf. SET LOCALE (default = None, which corresponds to ".".join(locale.getlocale())). Example where this may be needed:

    # wrong: variables with accented characters are returned as v1, v2, v3
    >>> with SavHeaderReader('german.sav') as header:
    ...     print(header.varNames)
    [b'python', b'programmieren', b'macht', b'v1', b'v2', b'v3']
    
    # correct: variable names contain non-ascii characters
    # locale definition and presence is OS-specific
    # Linux: sudo localedef -f CP1252 -i de_DE /usr/lib/locale/de_DE.cp1252
    >>> with SavHeaderReader('german.sav', ioLocale='de_DE.cp1252') as header:
    ...     print(header.varNames)
    [b'python', b'programmieren', b'macht', b'\xfcberhaupt', b'v\xf6llig', b'spa\xdf']
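The byte strings in the second result are cp1252-encoded, so they can be decoded with plain Python. The sketch below also shows a cautious way to build the default locale string: locale.getlocale() may return (None, None), in which case ".".join(...) would fail. The helper locale_string is hypothetical and not part of the package.

```python
import locale

raw_names = [b'python', b'programmieren', b'macht',
             b'\xfcberhaupt', b'v\xf6llig', b'spa\xdf']

# Decode with the codepage that belongs to the ioLocale used for reading
names = [name.decode("cp1252") for name in raw_names]
print(names[3:])  # ['überhaupt', 'völlig', 'spaß']


def locale_string(loc):
    """Join a (language, encoding) pair, or return None if either is unset."""
    language, encoding = loc
    if language is None or encoding is None:
        return None
    return ".".join((language, encoding))


print(locale_string(("de_DE", "cp1252")))  # de_DE.cp1252
host_locale = locale_string(locale.getlocale())  # None if the locale is unset
```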
    

Warning

The program calls spssFree* C functions to free memory allocated to dynamic arrays. This previously sometimes caused segmentation faults. This problem now appears to be solved. However, if you do experience segmentation faults you can set segfaults=True in __init__.py. This will prevent the spssFree* functions from being called (and introduce a memory leak).

Typical use:

from savReaderWriter import SavHeaderReader

with SavHeaderReader(savFileName) as header:
    metadata = header.dataDictionary(True)
    report = str(header)
    print(report)

Formats

SPSS knows just two different data types: string and numerical data. These data types can be formatted (displayed) by SPSS in several different ways. Format names are followed by total width (w) and an optional number of decimal positions (d). Table 1 below shows a complete list of all the available formats.

String data can be alphanumeric characters (A format) or the hexadecimal representation of alphanumeric characters (AHEX format). The maximum size of a string value is 32767 bytes. String formats do not have any decimal positions (d). Currently, SavReader maps both of the string formats to a regular alphanumeric string format.

Numerical data formats include the default numeric format (F), scientific notation (E) and zero-padded (N). For example, a format of F5.2 represents a numeric value with a total width of 5, including two decimal positions and a decimal indicator. For all numeric formats, the maximum width (w) is 40. For numeric formats where decimals are allowed, the maximum number of decimals (d) is 16. SavReader does not format numerical values, except for the N format, and dates/times (see under Date formats). The N format is a zero-padded value (e.g. SPSS format N8 is formatted as Python format %08d, e.g. ‘00001234’). For most numerical values, formatting means loss of precision. For instance, formatting SPSS F5.3 to Python %5.3f means that only the first three decimals are retained. In addition, formatting incurs additional processing time. Finally, e.g. appending a percent sign to a value (PCT format) renders the value less useful for calculations.
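The two Python format strings mentioned above behave as follows in plain Python (no savReaderWriter needed); the sample values are arbitrary:

```python
# SPSS N8 corresponds to zero-padding to a width of 8
value = 1234
print("%08d" % value)  # 00001234

# SPSS F5.3 rendered with %5.3f keeps three decimals only
pi = 3.14159265
formatted = "%5.3f" % pi
print(formatted)  # 3.142

# The formatted string no longer round-trips to the original number
print(float(formatted) == pi)  # False
```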

Table 1. String and numerical formats in SPSS and savReaderWriter

Format    Description                          Format  Description
--------  -----------------------------------  ------  ----------------------------------
A         Alphanumeric                         JDATE   Julian date - yyyyddd
AHEX      Alphanumeric hexadecimal             MONTH   Month
ADATE     Date format mm/dd/yyyy               MOYR    mmm yyyy
CCA       User Programmable currency format    N       N Format- unsigned with leading 0s
CCB       User Programmable currency format    P       Packed decimal
CCC       User Programmable currency format    PCT     Percent - F followed by %
CCD       User Programmable currency format    PIB     Positive integer binary unsigned
CCE       User Programmable currency format    PIBHEX  Positive integer binary - hex
COMMA     F Format with commas                 PK      Positive integer binary unsigned
DATE      Date format dd-mmm-yyyy              QYR     q Q yyyy
DATETIME  Date and Time                        RB      Floating point binary
DOLLAR    Commas and floating dollar sign      RBHEX   Floating point binary hex
DOT       Like COMMA, switching dot for comma  SDATE   Date in yyyy/mm/dd style
DTIME     Date-time dd hh:mm:ss.s              TIME    Time format hh:mm:ss.s
E         E Format- with explicit power of 10  WKDAY   Day of the week
EDATE     Date in dd/mm/yyyy style             WKYR    ww WK yyyy
F         Default Numeric Format               Z       Zoned decimal
IB        Integer binary

Note. The User Programmable currency formats (CCA, CCB, CCC, CCD and CCE) cannot be defined or written by SavWriter and existing definitions cannot be read by SavReader.

Date formats

Dates in SPSS. Date formats are a group of numerical formats. SPSS stores dates as the number of seconds since midnight, October 14, 1582 (the beginning of the Gregorian calendar). In SPSS, the user can make these seconds human-readable by giving them a print and/or write format (usually these are set at the same time using the FORMATS command). Examples of such display formats include ADATE (American date, mmddyyyy) and EDATE (European date, ddmmyyyy), SDATE (Asian/Sortable date, yyyymmdd) and JDATE (Julian date).

Reading dates. SavReader deliberately does not honor the different SPSS date display formats, but instead tries to convert them to the more practical (sortable) and less ambiguous ISO 8601 format (yyyy-mm-dd). You can easily change this behavior by modifying the supportedDates dictionary in __init__.py. Table 2 below shows how SavReader converts SPSS dates. Where applicable, the SPSS-to-Python conversion always results in the ‘long’ version of a date/time. For instance, TIME5 and TIME40.16 both result in a %H:%M:%S.%f-style format. If you do not want SavReader to automatically convert dates, you can set rawMode=True. If you use this setting, keep in mind that SavReader will also not convert system missing values ($SYSMIS) to an empty string; instead sysmis values will appear as the smallest value that can be represented on that system (-1 * sys.float_info.max).
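If you read with rawMode=True and want to convert the raw seconds yourself, a conversion along the following lines should work. This is a minimal sketch based on the epoch and sysmis value described above, not the package's own code; spss_to_iso is a hypothetical name.

```python
import sys
from datetime import datetime, timedelta

GREGORIAN_EPOCH = datetime(1582, 10, 14)  # midnight, October 14, 1582
SYSMIS = -sys.float_info.max              # raw representation of $SYSMIS


def spss_to_iso(seconds, fmt="%Y-%m-%d"):
    """Convert SPSS date seconds to a formatted string; sysmis becomes ''."""
    if seconds == SYSMIS:
        return ""
    return (GREGORIAN_EPOCH + timedelta(seconds=seconds)).strftime(fmt)


# Build a raw value for 28 October 1990, then convert it back
seconds = (datetime(1990, 10, 28) - GREGORIAN_EPOCH).total_seconds()
print(spss_to_iso(seconds))  # 1990-10-28
print(repr(spss_to_iso(SYSMIS)))  # ''
```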

Table 2. Date formats in SPSS and SavReader

General form            Format type  Min w in  Max w out  Max d  SPSS Example [1]      strftime format [2]  savReaderWriter Example  Note
----------------------  -----------  --------  ---------  -----  --------------------  -------------------  -----------------------  ----
dd-mmm-yy               DATEw        9         9          40     28-OCT-90             %Y-%m-%d             1990-10-28               [3]
dd-mmm-yyyy             DATEw        10        11                28-OCT-1990           idem
mm/dd/yy                ADATEw       8         8          40     10/28/90              %Y-%m-%d             1990-10-28               [3]
mm/dd/yyyy              ADATEw       10        10                10/28/1990            idem
dd.mm.yy                EDATEw       8         8          40     28.10.90              %Y-%m-%d             1990-10-28               [3]
dd.mm.yyyy              EDATEw       10        10                28.10.1990            idem
yyddd                   JDATEw       5         5          40     90301                 %Y-%m-%d             1990-10-28               [3]
yyyyddd                 JDATEw       7         7                 1990301               idem
yy/mm/dd                SDATEw       8         8          40     90/10/28              %Y-%m-%d             1990-10-28               [3]
yyyy/mm/dd              SDATEw       10        10                1990/10/28            idem
q Q yy                  QYRw         4         6          40     4 Q 90                %m Q %Y              4 Q 1990                 [4]
q Q yyyy                QYRw         6         8                 4 Q 1990              idem
mmm yy                  MOYRw        6         6          40     OCT 90                %B %Y                October 1990             [5]
mmm yyyy                MOYRw        8         8                 OCT 1990              idem
ww WK yy                WKYRw        6         8          40     43 WK 90              %W WK %Y             43 WK 1990               [5]
ww WK yyyy              WKYRw        8         10                43 WK 1990            idem
(name of the day)       WKDAYw       2         2          40     SU                    %A                   Sunday                   [5]
(name of the month)     MONTHw       3         3          40     JAN                   %B                   January                  [5]
hh:mm                   TIMEw        5         5          40     01:02                 %H:%M:%S.%f          01:02:34.7500000
hh:mm:ss.s              TIMEw.d      10        10         40     01:02:34.75           idem
dd hh:mm                DTIMEw       1         1          40     20 08:03              %d %H:%M:%S          20 08:03:00
dd hh:mm:ss.s           DTIMEw.d     13        13         40     20 08:03:00           idem
dd-mmm-yyyy hh:mm       DATETIMEw    17        17         40     20-JUN-1990 08:03     %Y-%m-%d %H:%M:%S    1990-06-20 08:03:00
dd-mmm-yyyy hh:mm:ss.s  DATETIMEw.d  22        22         40     20-JUN-1990 08:03:00  idem

Note.
[1] IBM SPSS Statistics Command Syntax Reference.pdf
[2] http://docs.python.org/2/library/datetime.html
[3] ISO 8601 format dates are used wherever possible; e.g. the mmddyyyy (ADATE) and ddmmyyyy (EDATE) orderings are not maintained.
[4] Months are converted to quarters using a simple lookup table.
[5] Weekday and month names depend on the host locale (not on the ioLocale argument).
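Some of the strftime formats from Table 2 can be tried out with the standard library alone. The quarter lookup below is only a guess at what the "simple lookup table" of note [4] may look like (QUARTER is a hypothetical name). %A and %B are left out because their output depends on the host locale.

```python
from datetime import date

d = date(1990, 10, 28)
print(d.strftime("%Y-%m-%d"))  # 1990-10-28
print(d.strftime("%W WK %Y"))  # 43 WK 1990

# Hypothetical month -> quarter lookup table (cf. note [4])
QUARTER = {month: (month - 1) // 3 + 1 for month in range(1, 13)}
print("%d Q %d" % (QUARTER[d.month], d.year))  # 4 Q 1990
```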

Writing dates. With SavWriter a Python date string value (e.g. “2010-10-25”) can be converted to an SPSS Gregorian date (i.e., just a whole bunch of seconds) by using the spssDateTime method, e.g.:

from savReaderWriter import SavWriter

kwargs = dict(savFileName='/tmp/date.sav', varNames=['aDate'],
              varTypes={'aDate': 0}, formats={'aDate': 'EDATE40'})
with SavWriter(**kwargs) as writer:
    spssDateValue = writer.spssDateTime(b'2010-10-25', '%Y-%m-%d')
    writer.writerow([spssDateValue])

The display format of the date (i.e., the way it looks in the SPSS data editor after opening the .sav file) may be set by specifying the formats dictionary (see also Table 1). This is one of the optional arguments of the SavWriter initializer. Without such a specification, the date will look like a large integer (the number of seconds since the beginning of the Gregorian calendar).
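For reference, the "whole bunch of seconds" can also be computed with the standard library. This is a sketch of the arithmetic only, assuming the epoch given under Dates in SPSS; to_spss_seconds is a hypothetical name and takes a str rather than the bytes passed to spssDateTime above.

```python
from datetime import datetime

GREGORIAN_EPOCH = datetime(1582, 10, 14)  # midnight, October 14, 1582


def to_spss_seconds(date_string, fmt="%Y-%m-%d"):
    """Return the seconds between the SPSS epoch and the parsed date."""
    parsed = datetime.strptime(date_string, fmt)
    return (parsed - GREGORIAN_EPOCH).total_seconds()


print(to_spss_seconds("1582-10-15"))  # 86400.0 (one day after the epoch)
spssDateValue = to_spss_seconds("2010-10-25")
```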
