General information

Acknowledgements

Development of this module would not have been possible without the document “OpenOffice.org’s Documentation of the Microsoft Excel File Format” (“OOo docs” for short). The latest version is available from OpenOffice.org in PDF format and ODT format . Small portions of the OOo docs are reproduced in this document. A study of the OOo docs is recommended for those who wish a deeper understanding of the Excel file layout than the xlrd docs can provide.

Provision of formatting information in version 0.6.1 was funded by Simplistix Ltd.

Unicode

This module presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode. Older files (Excel 95 and earlier) don’t keep strings in Unicode; a CODEPAGE record provides a codepage number (for example, 1252) which is used by xlrd to derive the encoding (for same example: “cp1252”) which is used to translate to Unicode.

If the CODEPAGE record is missing (possible if the file was created by third-party software), xlrd will assume that the encoding is ascii, and keep going. If the actual encoding is not ascii, a UnicodeDecodeError exception will be raised and you will need to determine the encoding yourself, and tell xlrd:

book = xlrd.open_workbook(..., encoding_override="cp1252")

If the CODEPAGE record exists but is wrong (for example, the codepage number is 1251, but the strings are actually encoded in koi8_r), it can be overridden using the same mechanism. The supplied runxlrd.py has a corresponding command-line argument, which may be used for experimentation:

runxlrd.py -e koi8_r 3rows myfile.xls

The first place to look for an encoding (“codec name”) is the Python documentation .

Dates in Excel spreadsheets

In reality, there are no such things. What you have are floating point numbers and pious hope. There are several problems with Excel dates:

(1) Dates are not stored as a separate data type; they are stored as floating point numbers and you have to rely on (a) the “number format” applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them. This module helps with (a) by inspecting the format that has been applied to each number cell; if it appears to be a date format, the cell is classified as a date rather than a number. Feedback on this feature, especially from non-English-speaking locales, would be appreciated.

(2) Excel for Windows stores dates by default as the number of days (or fraction thereof) since 1899-12-31T00:00:00. Excel for Macintosh uses a default start date of 1904-01-01T00:00:00. The date system can be changed in Excel on a per-workbook basis (for example: Tools -> Options -> Calculation, tick the “1904 date system” box). This is of course a bad idea if there are already dates in the workbook. There is no good reason to change it even if there are no dates in the workbook. Which date system is in use is recorded in the workbook. A workbook transported from Windows to Macintosh (or vice versa) will work correctly with the host Excel. When using this module’s xldate_as_tuple function to convert numbers from a workbook, you must use the datemode attribute of the Book object. If you guess, or make a judgement depending on where you believe the workbook was created, you run the risk of being 1462 days out of kilter.

Reference: http://support.microsoft.com/default.aspx?scid=KB;EN-US;q180162

(3) The Excel implementation of the Windows-default 1900-based date system works on the incorrect premise that 1900 was a leap year. It interprets the number 60 as meaning 1900-02-29, which is not a valid date. Consequently any number less than 61 is ambiguous. Example: is 59 the result of 1900-02-28 entered directly, or is it 1900-03-01 minus 2 days? The OpenOffice.org Calc program “corrects” the Microsoft problem; entering 1900-02-27 causes the number 59 to be stored. Save as an XLS file, then open the file with Excel – you’ll see 1900-02-28 displayed.

Reference: http://support.microsoft.com/default.aspx?scid=kb;en-us;214326

(4) The Macintosh-default 1904-based date system counts 1904-01-02 as day 1 and 1904-01-01 as day zero. Thus any number such that (0.0 <= number < 1.0) is ambiguous. Is 0.625 a time of day (15:00:00), independent of the calendar, or should it be interpreted as an instant on a particular day (1904-01-01T15:00:00)? The xldate_* functions in this module take the view that such a number is a calendar-independent time of day (like Python’s datetime.time type) for both date systems. This is consistent with more recent Microsoft documentation (for example, the help file for Excel 2002 which says that the first day in the 1904 date system is 1904-01-02).

(5) Usage of the Excel DATE() function may leave strange dates in a spreadsheet. Quoting the help file, in respect of the 1900 date system: “If year is between 0 (zero) and 1899 (inclusive), Excel adds that value to 1900 to calculate the year. For example, DATE(108,1,2) returns January 2, 2008 (1900+108).” This gimmick, semi-defensible only for arguments up to 99 and only in the pre-Y2K-awareness era, means that DATE(1899, 12, 31) is interpreted as 3799-12-31.

For further information, please refer to the documentation for the xldate_* functions.

Named references, constants, formulas, and macros

A name is used to refer to a cell, a group of cells, a constant value, a formula, or a macro. Usually the scope of a name is global across the whole workbook. However it can be local to a worksheet. For example, if the sales figures are in different cells in different sheets, the user may define the name “Sales” in each sheet. There are built-in names, like “Print_Area” and “Print_Titles”; these two are naturally local to a sheet.

To inspect the names with a user interface like MS Excel, OOo Calc, or Gnumeric, click on Insert/Names/Define. This will show the global names, plus those local to the currently selected sheet.

A Book object provides two dictionaries (name_map and name_and_scope_map) and a list (name_obj_list) which allow various ways of accessing the Name objects. There is one Name object for each NAME record found in the workbook. Name objects have many attributes, several of which are relevant only when obj.macro is 1.

In the examples directory you will find namesdemo.xls which showcases the many different ways that names can be used, and xlrdnamesAPIdemo.py which offers 3 different queries for inspecting the names in your files, and shows how to extract whatever a name is referring to. There is currently one “convenience method”, Name.cell(), which extracts the value in the case where the name refers to a single cell. More convenience methods are planned. The source code for Name.cell (in __init__.py) is an extra source of information on how the Name attributes hang together.

Name information is **not* extracted from files older than Excel 5.0 (Book.biff_version < 50)*

Formatting

Introduction

This collection of features, new in xlrd version 0.6.1, is intended to provide the information needed to (1) display/render spreadsheet contents (say) on a screen or in a PDF file, and (2) copy spreadsheet data to another file without losing the ability to display/render it.

The Palette; Colour Indexes

A colour is represented in Excel as a (red, green, blue) (“RGB”) tuple with each component in range(256). However it is not possible to access an unlimited number of colours; each spreadsheet is limited to a palette of 64 different colours (24 in Excel 3.0 and 4.0, 8 in Excel 2.0). Colours are referenced by an index (“colour index”) into this palette.

Colour indexes 0 to 7 represent 8 fixed built-in colours: black, white, red, green, blue, yellow, magenta, and cyan.

The remaining colours in the palette (8 to 63 in Excel 5.0 and later) can be changed by the user. In the Excel 2003 UI, Tools/Options/Color presents a palette of 7 rows of 8 colours. The last two rows are reserved for use in charts.

The correspondence between this grid and the assigned colour indexes is NOT left-to-right top-to-bottom.

Indexes 8 to 15 correspond to changeable parallels of the 8 fixed colours – for example, index 7 is forever cyan; index 15 starts off being cyan but can be changed by the user.

The default colour for each index depends on the file version; tables of the defaults are available in the source code. If the user changes one or more colours, a PALETTE record appears in the XLS file – it gives the RGB values for all changeable indexes.

Note that colours can be used in “number formats”: “[CYAN]....” and “[COLOR8]....” refer to colour index 7; “[COLOR16]....” will produce cyan unless the user changes colour index 15 to something else.

In addition, there are several “magic” colour indexes used by Excel:

0x18 (BIFF3-BIFF4), 0x40 (BIFF5-BIFF8): System window text colour for border lines (used in XF, CF, and WINDOW2 records)

0x19 (BIFF3-BIFF4), 0x41 (BIFF5-BIFF8): System window background colour for pattern background (used in XF and CF records )

0x43: System face colour (dialogue background colour)

0x4D: System window text colour for chart border lines

0x4E: System window background colour for chart areas

0x4F: Automatic colour for chart border lines (seems to be always Black)

0x50: System ToolTip background colour (used in note objects)

0x51: System ToolTip text colour (used in note objects)

0x7FFF: System window text colour for fonts (used in FONT and CF records)

Note 0x7FFF appears to be the default colour index. It appears quite often in FONT records.

Default Formatting

Default formatting is applied to all empty cells (those not described by a cell record). Firstly row default information (ROW record, Rowinfo class) is used if available. Failing that, column default information (COLINFO record, Colinfo class) is used if available. As a last resort the worksheet/workbook default cell format will be used; this should always be present in an Excel file, described by the XF record with the fixed index 15 (0-based). By default, it uses the worksheet/workbook default cell style, described by the very first XF record (index 0).

Formatting features not included in xlrd version 0.6.1

  • Rich text i.e. strings containing partial bold italic and underlined text, change of font inside a string, etc. See OOo docs s3.4 and s3.2
  • Asian phonetic text (known as “ruby”), used for Japanese furigana. See OOo docs s3.4.2 (p15)
  • Conditional formatting. See OOo docs s5.12, s6.21 (CONDFMT record), s6.16 (CF record)
  • Miscellaneous sheet-level and book-level items e.g. printing layout, screen panes.
  • Modern Excel file versions don’t keep most of the built-in “number formats” in the file; Excel loads formats according to the user’s locale. Currently xlrd’s emulation of this is limited to a hard-wired table that applies to the US English locale. This may mean that currency symbols, date order, thousands separator, decimals separator, etc are inappropriate. Note that this does not affect users who are copying XLS files, only those who are visually rendering cells.</li>

Loading worksheets on demand

This feature, new in version 0.7.1, is governed by the on_demand argument to the open_workbook() function and allows saving memory and time by loading only those sheets that the caller is interested in, and releasing sheets when no longer required.

on_demand=False (default): No change. open_workbook() loads global data and all sheets, releases resources no longer required (principally the str or mmap object containing the Workbook stream), and returns.

on_demand=True and BIFF version < 5.0: A warning message is emitted, on_demand is recorded as False, and the old process is followed.

on_demand=True and BIFF version >= 5.0: open_workbook() loads global data and returns without releasing resources. At this stage, the only information available about sheets is Book.nsheets and Book.sheet_names().

information available about sheets is Book.nsheets and Book.sheet_names().

Book.sheet_by_name() and Book.sheet_by_index() will load the requested sheet if it is not already loaded.

Book.sheets() will load all/any unloaded sheets.

The caller may save memory by calling Book.unload_sheet(sheet_name_or_index) when finished with the sheet. This applies irrespective of the state of on_demand.

The caller may re-load an unloaded sheet by calling Book.sheet_by_xxxx() – except if those required resources have been released (which will have happened automatically when on_demand is false). This is the only case where an exception will be raised.

The caller may query the state of a sheet: Book.sheet_loaded(sheet_name_or_index) -> a bool