Welcome to savReaderWriter’s documentation!
See also
- In the documentation below, the associated SPSS commands are given in CAPS. See also the IBM SPSS Statistics Command Syntax Reference.pdf for info about SPSS syntax.
- The savReaderWriter program uses the SPSS I/O module (.so, .dll, or .dylib, depending on your operating system). Users of the SPSS I/O module should read the International License Agreement before using the SPSS I/O module. By downloading, installing, copying, accessing, or otherwise using the SPSS I/O module, licensee agrees to the terms of this agreement. Copyright © IBM Corporation™ 1989, 2012, all rights reserved.
Installation
Platforms
As shown in Table 0 below, this program works for Linux (incl. z/Linux), Windows, Mac OS (32 and 64 bit), AIX-64, HP-UX and Solaris-64. The program has been tested with Python 2.7, 3.3 and 3.4 on Debian Linux (32 and 64 bit), Mac OS and Windows 7 (64 bit).
Operating System | 32 bit | 64 bit |
---|---|---|
AIX | | X |
HP-UX | | X |
Linux | X | X |
Mac OS | X | X? |
Solaris | | X |
Windows | X | X |
zLinux | | X |
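As a rough illustration of the platform and architecture distinction in the table, the sketch below reports the bitness of the running interpreter and maps a `sys.platform`-style name to one of the three library extensions named earlier (.so, .dll, .dylib). The helper names `lib_extension` and `pointer_bits` are hypothetical and are not part of savReaderWriter; platforms whose extension is not named in this documentation are deliberately left unmapped.

```python
import struct
import sys

# Assumption: only the three extensions named in this documentation are mapped;
# other supported platforms (AIX, HP-UX, Solaris) are left out of this sketch.
LIB_EXTENSIONS = {"linux": ".so", "win32": ".dll", "darwin": ".dylib"}


def lib_extension(platform_name):
    """Return the shared-library extension for a sys.platform-style name, or None."""
    for prefix, extension in LIB_EXTENSIONS.items():
        if platform_name.startswith(prefix):
            return extension
    return None


def pointer_bits():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


print(sys.platform, pointer_bits(), lib_extension(sys.platform))
```

Checking the interpreter's pointer size rather than the machine's matters here because a 32-bit Python on a 64-bit operating system can only load 32-bit libraries.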
Setup
The program can be installed by running:
python setup.py install
Or alternatively:
pip install savReaderWriter
To get the ‘bleeding edge’ version straight from the repository do:
pip install -U -e git+https://bitbucket.org/fomcl/savreaderwriter.git#egg=savreaderwriter
Note
Users of Mac OS X need to do two additional things:
- DYLD_LIBRARY_PATH needs to be set to the directory where the SPSS I/O libraries for Mac OS X live. If you also set the LC_ALL environment variable, you may skip the next (ioLocale) step. You may also want to edit your ~/.bashrc accordingly.
- ioLocale needs to be set manually (work-around). The ioLocale is the locale of the SPSS I/O, which is supposed to be copied from the host system if unset (i.e., equal to None). However, Python's locale.setlocale and locale.getlocale are quirky in Mac OS X (see also this OS X and Python locale snippet).
The code below shows an example that uses Python 2.7.2 (Python 3.3.5 also works) under Mac OS X Mountain Lion (10.8):
fomcls-Mac-Pro:~ fomcl$ uname -a
Darwin fomcls-Mac-Pro.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_6
fomcls-Mac-Pro:~ fomcl$ export DYLD_LIBRARY_PATH=/Library/Python/2.7/site-packages/savReaderWriter/spssio/macos
fomcls-Mac-Pro:~ fomcl$ export LC_ALL=en_US.UTF-8 # if you also do this, specifying ioLocale is usually not needed
fomcls-Mac-Pro:savReaderWriter fomcl$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33) [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
>>> import savReaderWriter
>>> savFileName = "/Library/Python/2.7/site-packages/savReaderWriter/test_data/Employee data.sav"
>>> with savReaderWriter.SavReader(savFileName, ioLocale='en_US.UTF-8') as reader:
...     for line in reader:
...         print line
...
[1.0, 'm', '1952-02-03', 15.0, 3.0, 57000.0, 27000.0, 98.0, 144.0, 0.0]
[2.0, 'm', '1958-05-23', 16.0, 1.0, 40200.0, 18750.0, 98.0, 36.0, 0.0]
[3.0, 'f', '1929-07-26', 12.0, 1.0, 21450.0, 12000.0, 98.0, 381.0, 0.0]
[4.0, 'f', '1947-04-15', 8.0, 1.0, 21900.0, 13200.0, 98.0, 190.0, 0.0]
# etc. etc.
Users of AIX, Solaris, HP-UX, zLinux need to install the SPSS I/O libraries separately (PyPI has a file size limit of about 60 MB, so I had to exclude them, sorry):
python -m savReaderWriter.util.download_mainframe_libs
Changed in version 3.4.
- Added SavReaderNp, a class to convert .sav files to numpy arrays.
- Added savViewer, a simple PyQt4-based script to view .sav, .zsav, .xls, .xlsx, .csv, .tab files. See also this savViewer screenshot. Usage examples:
python -m savReaderWriter.util.savViewer
python -m savReaderWriter.util.savViewer '/path/to/some/file.sav'
- Removed several bugs, notably one related to memoization of SPSS datetimes (THANKS everybody for taking the time to report these bugs!)
- SavReader.__enter__ now returns self, not iter(self)
Changed in version 3.3.
- The savReaderWriter program now runs on Python 2 and 3. It is tested with Python 2.7, 3.3 and PyPy under Debian Linux 3.2.0-4-AMD64.
- Under Python 3.3, the data are in bytes! Use the b prefix (e.g. b'Test1') when writing string data, or write data in unicode mode (ioUtf8=True).
- Several bugs were removed, notably two that prevented the I/O modules from loading in 64-bit Linux and 64-bit Windows systems (NB: these bugs were entirely unrelated). I re-downloaded the SPSS I/O v21 FP1 modules because the Win 64 libs were incorrectly compiled. In addition, long variable labels were truncated to 120 characters, which is now fixed.
- This has not yet been tested for performance.
Changed in version 3.2.
- The savReaderWriter program is now self-contained. That is, the IBM SPSS I/O modules now all load by themselves, without any changes being required anymore to PATH, LD_LIBRARY_PATH and equivalents. Also, no extra .deb files need to be installed anymore (i.e. no dependencies).
- savReaderWriter now uses version 21.0.0.1 (i.e., Fixpack 1) of the I/O module.
Optional features
cWriterow. The cWriterow package is a faster Cython implementation of the pyWriterow method (66 % faster). To install it, you need Cython; then run setup.py in the cWriterow folder:
easy_install cython
python setup.py build_ext --inplace
numpy.
- The numpy package should be installed if you intend to use array slicing (e.g. data[:2,2:4]).
- numpy is also needed to use the SavReaderNp sav-to-numpy class.
Environment variables
SAVRW_DISPLAY_WARNS. To issue warnings you can set the environment variable SAVRW_DISPLAY_WARNS to any of the following actions: “error”, “ignore”, “always”, “default”, “module”, “once”. If the environment variable is not defined, warnings are ignored. Note that warnings are usually harmless, e.g. SPSS_NO_LABELS. See: http://docs.python.org/2/library/warnings.html.
SAVRW_USE_CWRITEROW. You can use this variable to toggle between the cWriterow and the pyWriterow method, by setting this variable to ON or OFF, respectively. This is intended for testing purposes.
DYLD_LIBRARY_PATH. Users of Mac OS X need to set this variable; see the note under Setup.
LC_ALL. Users of Mac OS X may need to set this variable; see the note under Setup.
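The six action names accepted by SAVRW_DISPLAY_WARNS are the standard filter actions of Python's `warnings` module. The sketch below shows one way such a variable could be turned into a warnings filter; the function `configure_warnings` is hypothetical, and how savReaderWriter itself reads the variable is an assumption here, not something this documentation confirms.

```python
import os
import warnings

# The actions listed above; all are valid arguments to warnings.simplefilter.
VALID_ACTIONS = ("error", "ignore", "always", "default", "module", "once")


def configure_warnings(environ=os.environ):
    """Apply the action named in SAVRW_DISPLAY_WARNS; fall back to 'ignore'."""
    # Assumption: an unset variable means warnings are ignored, as described above.
    action = environ.get("SAVRW_DISPLAY_WARNS", "ignore")
    if action not in VALID_ACTIONS:
        raise ValueError("unsupported warning action: %r" % action)
    warnings.simplefilter(action)
    return action


# Example: pass a plain dict instead of the real environment to try it out.
print(configure_warnings({"SAVRW_DISPLAY_WARNS": "always"}))
```

In a shell the variable would be set before starting Python, for instance with `export SAVRW_DISPLAY_WARNS=always`.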
Typical use (the TL;DR version)
The full documentation can be found in the Generated API documentation. Here are the most important parts:
Reading files:
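from savReaderWriter import SavReader

# iterate over the records one at a time
with SavReader('someFile.sav') as reader:
    header = reader.header
    for line in reader:
        process(line)  # process() stands for your own function

# or read all records at once
with SavReader('someFile.sav') as reader:
    records = reader.all()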
Writing files:
from savReaderWriter import SavWriter

savFileName = 'someFile.sav'
records = [[b'Test1', 1, 1], [b'Test2', 2, 1]]
varNames = ['var1', 'v2', 'v3']
varTypes = {'var1': 5, 'v2': 0, 'v3': 0}  # 0 = numeric, 5 = string of 5 bytes
with SavWriter(savFileName, varNames, varTypes) as writer:
    for record in records:
        writer.writerow(record)
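In the writing example, varTypes maps the numeric variables to 0 and the string variable to 5, which is the byte length of b'Test1'. Assuming that convention (0 for numeric, the width in bytes for strings), a varTypes dictionary could be derived from the records themselves. The helper `infer_var_types` below is hypothetical and not part of savReaderWriter; it is only a sketch of the idea.

```python
def infer_var_types(var_names, records):
    """Hypothetical helper: 0 for numeric columns, max byte length for bytes columns."""
    var_types = {}
    for index, name in enumerate(var_names):
        column = [record[index] for record in records]
        byte_values = [value for value in column if isinstance(value, bytes)]
        # A column holding any bytes value is treated as a string variable.
        if byte_values:
            var_types[name] = max(len(value) for value in byte_values)
        else:
            var_types[name] = 0
    return var_types


records = [[b'Test1', 1, 1], [b'Test2', 2, 1]]
var_names = ['var1', 'v2', 'v3']
print(infer_var_types(var_names, records))  # {'var1': 5, 'v2': 0, 'v3': 0}
```

Scanning every record is needed because the widest string determines the width; for large data sets a fixed, known width would be cheaper.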
Writing numpy arrays, pandas DataFrames, lists-of-lists, etc:
import numpy as np
from savReaderWriter import SavWriter

savFileName = 'someFile.sav'
args = (["v1", "v2"], dict(v1=0, v2=0))
array = np.arange(10, dtype=np.float64).reshape(5, 2)
with SavWriter(savFileName, *args) as writer:
    writer.writerows(array)
Reading file metadata:
from savReaderWriter import SavHeaderReader

with SavHeaderReader(savFileName) as header:
    metadata = header.all()
    report = str(header)
    print(metadata.valueLabels)
    print(report)
Reading files into numpy arrays:
from savReaderWriter import SavReaderNp

with SavReaderNp("Employee data.sav") as reader_np:
    array = reader_np.to_structured_array()
    mean_salary = array["salary"].mean()
Reading a file in unicode mode (default in SPSS v21 and up):
>>> with SavReader('greetings.sav', ioUtf8=True) as reader:
...     for record in reader:
...         print(record[-1])
নমস্কাৰ
আসসালামুআলাইকুম
Greetings and salutations
გამარჯობა
Сәлеметсіз бе
Здравствуйте
¡Hola!
Grüezi
สวัสดี
Bondjoû
Reading a file in codepage mode:
This could be needed when the file was created using an older SPSS for Windows version, which used codepage mode. Usually this means that (meta)data are encoded as windows-1252. In Linux, you may need to generate a locale with a windows encoding:
# wrong: variables with accented characters are returned as v1, v2, v3
>>> with SavHeaderReader('german.sav') as header:
...     print(header.varNames)
[b'python', b'programmieren', b'macht', b'v1', b'v2', b'v3']
# correct: variable names contain non-ascii characters
# locale definition and presence is OS-specific
# Linux: sudo localedef -f CP1252 -i de_DE /usr/lib/locale/de_DE.cp1252
>>> with SavHeaderReader('german.sav', ioLocale='de_DE.cp1252') as header:
...     print(header.varNames)
[b'python', b'programmieren', b'macht', b'\xfcberhaupt', b'v\xf6llig', b'spa\xdf']
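The variable names in the second listing are still bytes. As a small sketch, and assuming the file really is encoded as windows-1252 (Python codec name `cp1252`), they can be decoded to text with the standard `bytes.decode` method:

```python
# Variable names as returned in the codepage-mode example (bytes).
var_names = [b'python', b'programmieren', b'macht',
             b'\xfcberhaupt', b'v\xf6llig', b'spa\xdf']

# Assumption: the codepage is windows-1252; another codepage needs another codec.
decoded = [name.decode('cp1252') for name in var_names]

# decoded is now:
# ['python', 'programmieren', 'macht', 'überhaupt', 'völlig', 'spaß']
```

Decoding with the wrong codec would not necessarily raise an error but could silently produce the wrong characters, so the codec should match the ioLocale used for reading.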
Formats
SPSS knows just two different data types: string and numerical data. These data types can be formatted (displayed) by SPSS in several different ways. Format names are followed by total width (w) and an optional number of decimal positions (d). Table 1 below shows a complete list of all the available formats.
String data can be alphanumeric characters (A format) or the hexadecimal representation of alphanumeric characters (AHEX format). The maximum size of a string value is 32767 bytes. String formats do not have any decimal positions (d). Currently, SavReader maps both of the string formats to a regular alphanumeric string format.
Numerical data formats include the default numeric format (F), scientific notation (E) and zero-padded (N). For example, a format of F5.2 represents a numeric value with a total width of 5, including two decimal positions and a decimal indicator. For all numeric formats, the maximum width (w) is 40. For numeric formats where decimals are allowed, the maximum number of decimals (d) is 16. SavReader does not format numerical values, except for the N format, and dates/times (see under Date formats). The N format is a zero-padded value (e.g. SPSS format N8 is formatted as Python format %08d, e.g. ‘00001234’). For most numerical values, formatting means loss of precision. For instance, formatting SPSS F5.3 to Python %5.3f means that only the first three decimals are retained. In addition, formatting incurs additional processing time. Finally, appending a percent sign to a value (PCT format), for example, renders the value less useful for calculations.
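The two Python format strings mentioned above can be tried directly. This is plain Python string formatting, not savReaderWriter code; the input values are made up for illustration.

```python
value = 1234
padded = '%08d' % value          # zero-padded to width 8, like SPSS N8
print(padded)                    # 00001234

precise = 3.14159265
formatted = '%5.3f' % precise    # keeps three decimals, like SPSS F5.3
print(formatted)                 # 3.142

# The formatted string no longer carries the original precision.
print(float(formatted) == precise)   # False
```

The last line is the reason numerical values are left unformatted by default: once a value has been rendered as text, the discarded decimals cannot be recovered.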
Format | Description | Format | Description |
---|---|---|---|
A | Alphanumeric | JDATE | Julian date - yyyyddd |
AHEX | Alphanumeric hexadecimal | MONTH | Month |
ADATE | Date format mm/dd/yyyy | MOYR | mmm yyyy |
CCA | User Programmable currency format | N | N Format - unsigned with leading 0s |
CCB | User Programmable currency format | P | Packed decimal |
CCC | User Programmable currency format | PCT | Percent - F followed by % |
CCD | User Programmable currency format | PIB | Positive integer binary unsigned |
CCE | User Programmable currency format | PIBHEX | Positive integer binary - hex |
COMMA | F Format with commas | PK | Positive integer binary unsigned |
DATE | Date format dd-mmm-yyyy | QYR | q Q yyyy |
DATETIME | Date and Time | RB | Floating point binary |
DOLLAR | Commas and floating dollar sign | RBHEX | Floating point binary hex |
DOT | Like COMMA, switching dot for comma | SDATE | Date in yyyy/mm/dd style |
DTIME | Date-time dd hh:mm:ss.s | TIME | Time format hh:mm:ss.s |
E | E Format - with explicit power of 10 | WKDAY | Day of the week |
EDATE | Date in dd/mm/yyyy style | WKYR | ww WK yyyy |
F | Default Numeric Format | Z | Zoned decimal |
IB | Integer binary | | |
Note. The User Programmable currency formats (CCA, CCB, CCC and CCD) cannot be defined or written by SavWriter, and existing definitions cannot be read by SavReader.
Date formats
Dates in SPSS. Date formats are a group of numerical formats. SPSS stores dates as the number of seconds since midnight, October 14, 1582 (the beginning of the Gregorian calendar). In SPSS, the user can make these seconds human-readable by giving them a print and/or write format (usually these are set at the same time using the FORMATS command). Examples of such display formats include ADATE (American date, mmddyyyy), EDATE (European date, ddmmyyyy), SDATE (Asian/Sortable date, yyyymmdd) and JDATE (Julian date).
Reading dates. SavReader deliberately does not honor the different SPSS date display formats, but instead tries to convert them to the more practical (sortable) and less ambiguous ISO 8601 format (yyyy-mm-dd). You can easily change this behavior by modifying the supportedDates dictionary in __init__.py. Table 2 below shows how SavReader converts SPSS dates. Where applicable, the SPSS-to-Python conversion always results in the ‘long’ version of a date/time. For instance, TIME5 and TIME40.16 both result in a %H:%M:%S.%f-style format. If you do not want SavReader to automatically convert dates, you can set rawMode=True. If you use this setting, keep in mind that SavReader will also not convert system missing values ($SYSMIS) to an empty string; instead sysmis values will appear as the most negative finite value that can be represented on that system (-1 * sys.float_info.max).
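To make the raw values concrete, here is a minimal sketch of the conversion described above, using only the standard library. It assumes the epoch of midnight, October 14, 1582 and the sysmis value of -1 * sys.float_info.max given in the text; the function name `spss_seconds_to_iso` is hypothetical and this is not the code SavReader itself uses.

```python
import datetime
import sys

# Epoch stated in the text: midnight, October 14, 1582.
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 14)
# Raw representation of a system missing value, as described in the text.
SYSMIS = -sys.float_info.max


def spss_seconds_to_iso(seconds):
    """Convert raw SPSS seconds to an ISO 8601 date string ('' for sysmis)."""
    if seconds == SYSMIS:
        return ""
    moment = GREGORIAN_EPOCH + datetime.timedelta(seconds=seconds)
    return moment.date().isoformat()


print(spss_seconds_to_iso(12876451200.0))  # 1990-10-28
print(repr(spss_seconds_to_iso(SYSMIS)))   # ''
```

The sysmis check has to come first: adding such a large negative number of seconds to a datetime would raise an OverflowError.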
General form | Format type | Min w in | Min w out | Max w | SPSS Example [1] | strftime format [2] | savReaderWriter Example | Note |
---|---|---|---|---|---|---|---|---|
dd-mmm-yy | DATEw | 9 | 9 | 40 | 28-OCT-90 | %Y-%m-%d | 1990-10-28 | [3] |
dd-mmm-yyyy | DATEw | 10 | 11 | 40 | 28-OCT-1990 | idem | | |
mm/dd/yy | ADATEw | 8 | 8 | 40 | 10/28/90 | %Y-%m-%d | 1990-10-28 | [3] |
mm/dd/yyyy | ADATEw | 10 | 10 | 40 | 10/28/1990 | idem | | |
dd.mm.yy | EDATEw | 8 | 8 | 40 | 28.10.90 | %Y-%m-%d | 1990-10-28 | [3] |
dd.mm.yyyy | EDATEw | 10 | 10 | 40 | 28.10.1990 | idem | | |
yyddd | JDATEw | 5 | 5 | 40 | 90301 | %Y-%m-%d | 1990-10-28 | [3] |
yyyyddd | JDATEw | 7 | 7 | 40 | 1990301 | idem | | |
yy/mm/dd | SDATEw | 8 | 8 | 40 | 90/10/28 | %Y-%m-%d | 1990-10-28 | [3] |
yyyy/mm/dd | SDATEw | 10 | 10 | 40 | 1990/10/28 | idem | | |
q Q yy | QYRw | 4 | 6 | 40 | 4 Q 90 | %m Q %Y | 4 Q 1990 | [4] |
q Q yyyy | QYRw | 6 | 8 | 40 | 4 Q 1990 | idem | | |
mmm yy | MOYRw | 6 | 6 | 40 | OCT 90 | %B %Y | October 1990 | [5] |
mmm yyyy | MOYRw | 8 | 8 | 40 | OCT 1990 | idem | | |
ww WK yy | WKYRw | 6 | 8 | 40 | 43 WK 90 | %W WK %Y | 43 WK 1990 | [5] |
ww WK yyyy | WKYRw | 8 | 10 | 40 | 43 WK 1990 | idem | | |
(name of the day) | WKDAYw | 2 | 2 | 40 | SU | %A | Sunday | [5] |
(name of the month) | MONTHw | 3 | 3 | 40 | JAN | %B | January | [5] |
hh:mm | TIMEw | 5 | 5 | 40 | 01:02 | %H:%M:%S.%f | 01:02:34.7500000 | |
hh:mm:ss.s | TIMEw.d | 10 | 10 | 40 | 01:02:34.75 | idem | | |
dd hh:mm | DTIMEw | 1 | 1 | 40 | 20 08:03 | %d %H:%M:%S | 20 08:03:00 | |
dd hh:mm:ss.s | DTIMEw.d | 13 | 13 | 40 | 20 08:03:00 | idem | | |
dd-mmm-yyyy hh:mm | DATETIMEw | 17 | 17 | 40 | 20-JUN-1990 08:03 | %Y-%m-%d %H:%M:%S | 1990-06-20 08:03:00 | |
dd-mmm-yyyy hh:mm:ss.s | DATETIMEw.d | 22 | 22 | 40 | 20-JUN-1990 08:03:00 | idem | | |
Note.
[1] IBM SPSS Statistics Command Syntax Reference.pdf
[2] http://docs.python.org/2/library/datetime.html
[3] ISO 8601 format dates are used wherever possible, e.g. the mmddyyyy (ADATE) and ddmmyyyy (EDATE) orderings are not maintained.
[4] Months are converted to quarters using a simple lookup table.
[5] Weekday and month names depend on the host locale (not on the ioLocale argument).
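Note [4] mentions a lookup table from months to quarters. The library's own table is not shown in this documentation, so the sketch below is only an assumption of what such a table could look like; the names `MONTH_TO_QUARTER` and `format_quarter` are hypothetical.

```python
# Hypothetical lookup table: month number (1-12) mapped to quarter number (1-4).
MONTH_TO_QUARTER = {month: (month - 1) // 3 + 1 for month in range(1, 13)}


def format_quarter(year, month):
    """Render a year and month in the 'q Q yyyy' style used in Table 2."""
    return "%d Q %d" % (MONTH_TO_QUARTER[month], year)


print(format_quarter(1990, 10))  # 4 Q 1990
```

A lookup is needed because strftime has no directive for quarters, which is presumably why the %m in the strftime column is post-processed.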
Writing dates. With SavWriter, a Python date string value (e.g. “2010-10-25”) can be converted to an SPSS Gregorian date (i.e., just a whole bunch of seconds) by using the spssDateTime method, e.g.:
from savReaderWriter import SavWriter

kwargs = dict(savFileName='/tmp/date.sav', varNames=['aDate'],
              varTypes={'aDate': 0}, formats={'aDate': 'EDATE40'})
with SavWriter(**kwargs) as writer:
    # convert the date string to seconds since the Gregorian epoch
    spssDateValue = writer.spssDateTime(b'2010-10-25', '%Y-%m-%d')
    writer.writerow([spssDateValue])
The display format of the date (i.e., the way it looks in the SPSS data editor after opening the .sav file) may be set by specifying the formats dictionary (see also Table 1). This is one of the optional arguments of the SavWriter initializer. Without such a specification, the date will look like a large integer (the number of seconds since the beginning of the Gregorian calendar).
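To get a feel for that large integer, the following sketch computes the number of seconds between the Gregorian epoch and a date string using only the standard library. It illustrates the arithmetic under the epoch stated earlier (midnight, October 14, 1582) and is not the implementation of spssDateTime; the function `to_spss_seconds` is hypothetical, and it takes a regular string rather than the bytes value used in the SavWriter example.

```python
import datetime

# Epoch stated in this documentation: midnight, October 14, 1582.
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 14)


def to_spss_seconds(date_string, strptime_format="%Y-%m-%d"):
    """Return the seconds elapsed between the Gregorian epoch and date_string."""
    moment = datetime.datetime.strptime(date_string, strptime_format)
    return (moment - GREGORIAN_EPOCH).total_seconds()


print(to_spss_seconds("1582-10-15"))  # 86400.0
print(to_spss_seconds("2010-10-25"))
```

A value of this size stored without a date format such as EDATE40 is what shows up as a plain number in the data editor.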