Welcome to pdf2xlsx’s documentation!

A framework to open a zip file which contains multiple pdf files representing invoices. The invoices are parsed into Invoice and Entry (invoice entries) classes. These are converted to XLSX format.

There is a GUI to set up and start the conversion, and a logger that shows details about the process.

The project home.

pdf2xlsx

Contains a framework to open a zip file which contains multiple pdf files representing invoices. The invoices are parsed into Invoice and Entry (invoice entries) classes. These are converted to XLSX format.

pdf2xlsx.pdf2xlsx.do_it(src_name, dst_dir='', xlsx_name='Invoices01.xlsx', tmp_dir='tmp', file_extension='.pdf')

Main script to manage the zip to xlsx process. It is responsible for creating and cleaning up temporary directories and files. After zip extraction, it searches for every file which ends with file_extension. Then it builds up a list of invoices, writes them to xlsx format and opens the result in the predefined xlsx_viewer. If the dst_dir happens to be the same as the tmp_dir, the generated xlsx file is removed after the run.

Parameters:
  • src_name (str) – path to the zip file to extract
  • dst_dir (str) – path to the directory where the generated xlsx file is put; by default the cwd
  • tmp_dir (str) – temporary directory to work in. This directory is erased at the beginning of the script. By default it is tmp
  • file_extension (str) – the file extension to use during file selection. By default it is .pdf
  • xlsx_name (str) – name of the output file
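
The note that tmp_dir is erased at the beginning of the run can be illustrated with a small sketch. The helper name reset_tmp_dir is hypothetical and is not part of pdf2xlsx; the real do_it may handle the directory differently.

```python
import os
import shutil


def reset_tmp_dir(tmp_dir="tmp"):
    """Remove tmp_dir with all of its content, then create it again empty.

    Hypothetical helper, only meant to show why nothing of value should be
    kept in the directory passed as tmp_dir.
    """
    # ignore_errors=True lets the first run succeed when tmp_dir does not exist yet.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    os.makedirs(tmp_dir)
    return tmp_dir
```

Because the directory is wiped, pointing tmp_dir at a folder that holds other files would delete them.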
pdf2xlsx.pdf2xlsx.extract_invoces(pdf_list, logger)

Get the invoices from the pdf files in the pdf_list. Wrapper around the pdf2rawtxt call.

Parameters:
  • pdf_list (list) – list of pdf file paths to process
  • logger (StatLogger) – collects statistical data about parsing
Returns:

list of invoices

Return type:

list of Invoice

pdf2xlsx.pdf2xlsx.extract_zip(src_name, directory)

Extract the pdf files from the zip. There is no sanity check for the arguments.

Parameters:
  • src_name (str) – path to a zip file to extract
  • directory (str) – path to the target directory to extract the zip file into
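
A minimal sketch of such an extraction, using the standard library zipfile module. The function name extract_zip_sketch is hypothetical; the actual extract_zip body is not shown in this documentation.

```python
import zipfile


def extract_zip_sketch(src_name, directory):
    """Extract every member of the zip archive at src_name into directory."""
    # No validation of the arguments, in line with the "no sanity check" note:
    # a missing or corrupt archive simply raises an exception.
    with zipfile.ZipFile(src_name) as archive:
        archive.extractall(directory)
```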

pdf2xlsx.pdf2xlsx.get_pdf_files(directory, extension='.pdf')

Walks through the given directory and collects every file with the given extension

Parameters:
  • directory (str) – the root directory to start the walk
  • extension (str) – ‘.pdf’ by default; a file with this extension is selected
Returns:

list of pdf file paths

Return type:

list of str
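
The directory walk described above could look roughly like this. It is a sketch built on os.walk, with the hypothetical name get_files_sketch, and is not a copy of the real get_pdf_files.

```python
import os


def get_files_sketch(directory, extension=".pdf"):
    """Return the paths of all files below directory whose name ends with extension."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    return matches
```

The comparison is a plain suffix check, so it is case sensitive in this sketch.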

pdf2xlsx.pdf2xlsx.invoices2xlsx(invoices, directory='', name='Invoices01.xlsx')

Write invoice information to the xlsx template file. Goes through every invoice and writes it out. Utilizes the xlsxwriter module.

Parameters:invoices (list of Invoice) – Representation of invoices from the pdf files
pdf2xlsx.pdf2xlsx.main()

A simple wrapper around the do_it function, to demonstrate its usage

pdf2xlsx.pdf2xlsx.pdf2rawtxt(pdfile, logger)

Read the given pdf file and parse it into Invoice and Entry classes. Utilizes the PyPDF2 PdfFileReader. Goes through every page of the pdf. When a new invoice entry is found by Entry.parse_line, it is appended to Invoice.entries.

Parameters:
  • pdfile (str) – file path of the pdf to process
  • logger (StatLogger) – collects statistical data about parsing
Returns:

The invoice filled with the information from the pdf file

Return type:

Invoice

pdf2xlsx.pdf2xlsx.run_excel(xlsx_path)

Start up Excel, with the file from the argument. The location of the excel executable should be set in the configuration

Parameters:xlsx_path (str) – Path to the xlsx file to open

invoice

Classes for different invoice types

class pdf2xlsx.invoice.CreditEntry(entry_tuple=None, invo=None)

These entries contain negative prices, as these are credit invoices. Dummy!

class pdf2xlsx.invoice.CreditInvoice(no=0, orig_date='', pay_due='', total_sum=0, entries=None, orig_invo_no=0)

Credit invoice class

parse_line(line)
Parameters:line (str) – The actual line to parse
Returns:True when the parsing of the Invoice was started
Return type:bool
xlsx_write(worksheet, row, col)

Write the invoice information to a template xlsx file.

Parameters:
  • worksheet (Worksheet) – Worksheet class to write info
  • row (int) – Row number to start writing
  • col (int) – Column number to start writing
Returns:

the next position of cursor row,col

Return type:

tuple of (int,int)

class pdf2xlsx.invoice.Entry(entry_tuple=None, invo=None)

Parse, store and write invoice entries to xlsx. The invoice information is stored in the EntryTuple namedtuple. The parsing is controlled by a state variable (entry_found). Because the invoice entries are split into two lines, the tmp_str attribute is used to store the first part of the entry. The ME values are configurable, so they cannot be created at class level; they need to be recomputed at every instantiation.

Parameters:
  • entry_tuple (EntryTuple) – The invoice entry
  • invo (Invoice) – The parent invoice containing this entry
line2entry(line)

Extracts entry information from the given line. Tries to search for nine different groups in the line. See the implementation of entry_pattern. This should match the following pattern:

NNNNNN-NNN STR+WSPACE PREDEFSTR INTEGER INTEGER-. INTEGER% INTEGER-. INTEGER-. INTEGER%

Where:
  • N: a single digit, 0-9
  • STR+WSPACE: string containing white spaces, numbers and special characters
  • PREDEFSTR: string without white space (predefined)
  • INTEGER: decimal number, unknown length
  • INTEGER-.: a decimal number, grouped with . by thousands, e.g. 1.589.674
  • INTEGER%: an integer with percentage at the end

Parameters:line (str) – Line to parse; this line should begin with NNNNNN-NNN
Returns:The actual invoice entry
Return type:EntryTuple
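
To make the nine-group pattern more concrete, here is one possible regular expression for it. This is only a sketch: the real entry_pattern is not reproduced in this documentation, the helper names (build_entry_pattern, to_int, line2entry_sketch) are hypothetical, and the conversion of the numeric groups to int is an assumption. The ME values are taken from the default configuration.

```python
import re
from collections import namedtuple

EntryTuple = namedtuple(
    "EntryTuple",
    "kod nev ME mennyiseg BEgysegar Kedv NEgysegar osszesen AFA",
)

ME_VALUES = ["Pár", "Darab"]  # PREDEFSTR candidates, from the default config

GROUPED = r"(\d{1,3}(?:\.\d{3})*)"  # INTEGER-. : digits grouped by "." per thousand


def build_entry_pattern(me_values):
    """Build the pattern per call, since the ME values are configurable."""
    me_alt = "|".join(re.escape(me) for me in me_values)
    return re.compile(
        r"(\d{6}-\d{3})\s+"       # NNNNNN-NNN
        r"(.+?)\s+"               # STR+WSPACE (lazy, stops before the unit)
        r"(" + me_alt + r")\s+"   # PREDEFSTR
        r"(\d+)\s+"               # INTEGER
        + GROUPED + r"\s+"        # INTEGER-.
        r"(\d+)%\s+"              # INTEGER%
        + GROUPED + r"\s+"        # INTEGER-.
        + GROUPED + r"\s+"        # INTEGER-.
        r"(\d+)%"                 # INTEGER%
    )


def to_int(text):
    """Turn a thousands-grouped number such as 1.589.674 into an int."""
    return int(text.replace(".", ""))


def line2entry_sketch(line, pattern):
    """Return an EntryTuple for a matching line, or None when nothing matches."""
    match = pattern.match(line)
    if match is None:
        return None
    kod, nev, me, qty, gross, disc, net, total, vat = match.groups()
    return EntryTuple(kod, nev, me, int(qty), to_int(gross), int(disc),
                      to_int(net), to_int(total), int(vat))


pattern = build_entry_pattern(ME_VALUES)
sample = "123456-789 Blue shoe 42 size Pár 2 1.500 10% 1.350 2.700 27%"
entry = line2entry_sketch(sample, pattern)
print(entry.nev, entry.osszesen)  # Blue shoe 42 size 2700
```

The sample line is invented for illustration and does not come from a real invoice.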
parse_line(line)

Parse through raw text which is supplied line-by-line. This is the structure of the pdf (the brackets () indicate what should be collected):

n times:
<uninteresting rubbish>
(NNNNNN-NNN ...
...)
<uninteresting rubbish>

When the invoice code is found, the parser waits for one additional line, and then the collected text is sent to the line2entry converter.

Parameters:line (str) – The actual line to parse
Returns:True when an entry was found
Return type:bool
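
The two-line collection can be sketched as a tiny state machine. The class name TwoLineCollector and the way the two lines are joined (with a single space) are assumptions; entry_found and tmp_str mirror the attribute names mentioned in the Entry description.

```python
import re

CODE_START = re.compile(r"\d{6}-\d{3}")  # NNNNNN-NNN at the start of a line


class TwoLineCollector:
    """Hypothetical stand-in for the line handling of Entry.parse_line."""

    def __init__(self):
        self.entry_found = False  # True while waiting for the second line
        self.tmp_str = ""         # first half of the entry
        self.joined = None        # the last complete two-line entry

    def parse_line(self, line):
        """Return True when a complete two-line entry has just been collected."""
        if self.entry_found:
            # Second half arrived: join it with the stored first half.
            self.joined = self.tmp_str + " " + line.strip()
            self.entry_found = False
            self.tmp_str = ""
            return True
        if CODE_START.match(line):
            # First half found: remember it and wait for the next line.
            self.entry_found = True
            self.tmp_str = line.strip()
        return False
```

In the real class the joined text would then be handed to line2entry; here it is only stored in the joined attribute.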
xlsx_write(worksheet, row, col)

Write the entry information to a template xlsx file.

Parameters:
  • worksheet (Worksheet) – Worksheet class to write info
  • row (int) – Row number to start writing
  • col (int) – Column number to start writing
Returns:

the next position of cursor row,col

Return type:

tuple of (int,int)

class pdf2xlsx.invoice.EntryTuple(kod, nev, ME, mennyiseg, BEgysegar, Kedv, NEgysegar, osszesen, AFA)
AFA

Alias for field number 8

BEgysegar

Alias for field number 4

Kedv

Alias for field number 5

ME

Alias for field number 2

NEgysegar

Alias for field number 6

kod

Alias for field number 0

mennyiseg

Alias for field number 3

nev

Alias for field number 1

osszesen

Alias for field number 7
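
The field numbers listed above follow directly from the order of the names in the namedtuple definition. A short illustration, with invented sample values:

```python
from collections import namedtuple

# Same field order as in the class signature above.
EntryTuple = namedtuple(
    "EntryTuple",
    ["kod", "nev", "ME", "mennyiseg", "BEgysegar", "Kedv",
     "NEgysegar", "osszesen", "AFA"],
)

# Hypothetical values, only to show access by name and by index.
entry = EntryTuple("123456-789", "Sample item", "Darab", 3, 1000, 0, 1000, 3000, 27)

assert entry.kod == entry[0]       # field number 0
assert entry.osszesen == entry[7]  # field number 7
assert entry.AFA == entry[8]       # field number 8
```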

class pdf2xlsx.invoice.Invoice(no=0, orig_date='', pay_due='', total_sum=0, entries=None)

Parse, store and write invoice information to xlsx, such as Invoice Number, Invoice Date, Payment Date and Total Sum Price. It also contains a list of Entry, which is also extracted from the raw string. The parsing of the raw string is controlled by three state variables: no_parsed, orig_date_parsed and pay_due_parsed. These represent the structure of the pdf.

Parameters:
  • no (int) – Invoice number, default:0
  • orig_date (str) – Invoice date stored as a string YYYY.MM.DD
  • pay_due (str) – Payment Date stored as string YYYY.MM.DD
  • total_sum (int) – Total price of invoice
  • entries (list) – List of Entry, one for each entry in the invoice

[TODO] implement state pattern for parsing ??? [TODO] implement _to_money as a mixin class

parse_line(line)

Parse through a raw text which is supplied line-by-line. This is the structure of the pdf (the brackets () indicate what should be collected):

<uninteresting rubbish>
Számla sorszáma: (NNNNNNNN) ...
<uninteresting rubbish>
Számla kelte: (YYYY.MM.DD|DD.MM.YYYY) ...
<uninteresting rubbish>
FIZETÉSI HATÁRIDŐ:(YYYY.MM.DD|DD.MM.YYYY) (NNN[.NNN.NNN])
<uninteresting rubbish>

This structure is parsed using the three state variables, and the results are stored in the class attributes.

Parameters:line (str) – The actual line to parse
Returns:True when the parsing of the Invoice was started
Return type:bool
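
A rough sketch of how three state variables can drive this kind of header parsing. The class name InvoiceHeaderSketch and the regular expressions are assumptions based only on the structure described above; the return value here is simply whether the invoice number has been seen, which is one possible reading of "parsing was started".

```python
import re

DATE = r"(\d{4}\.\d{2}\.\d{2}|\d{2}\.\d{2}\.\d{4})"
NO_PATTERN = re.compile(r"Számla sorszáma:\s*(\d{8})")
ORIG_DATE_PATTERN = re.compile(r"Számla kelte:\s*" + DATE)
PAY_DUE_PATTERN = re.compile(
    r"FIZETÉSI HATÁRIDŐ:\s*" + DATE + r"\s+(\d{1,3}(?:\.\d{3})*)"
)


class InvoiceHeaderSketch:
    """Hypothetical reduction of Invoice to its header parsing."""

    def __init__(self):
        self.no = 0
        self.orig_date = ""
        self.pay_due = ""
        self.total_sum = 0
        # State variables: each one is set once its field has been collected.
        self.no_parsed = False
        self.orig_date_parsed = False
        self.pay_due_parsed = False

    def parse_line(self, line):
        # Only the pattern for the next expected field is tried on each line.
        if not self.no_parsed:
            match = NO_PATTERN.search(line)
            if match:
                self.no = int(match.group(1))
                self.no_parsed = True
        elif not self.orig_date_parsed:
            match = ORIG_DATE_PATTERN.search(line)
            if match:
                self.orig_date = match.group(1)
                self.orig_date_parsed = True
        elif not self.pay_due_parsed:
            match = PAY_DUE_PATTERN.search(line)
            if match:
                self.pay_due = match.group(1)
                self.total_sum = int(match.group(2).replace(".", ""))
                self.pay_due_parsed = True
        return self.no_parsed
```

Checking the fields strictly in order keeps each line test cheap, at the cost of missing fields that appear in a different order than the one described.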
xlsx_write(worksheet, row, col)

Write the invoice information to a template xlsx file.

Parameters:
  • worksheet (Worksheet) – Worksheet class to write info
  • row (int) – Row number to start writing
  • col (int) – Column number to start writing
Returns:

the next position of cursor row,col

Return type:

tuple of (int,int)

pdf2xlsx.invoice.get_invo_type(pdf_line)

TODO: add title parsing to decide between invoice types

pdf2xlsx.invoice.invo_parser(pdf_file, logger)

Factory to generate the appropriate invoice type based on the title in the PDF

gui

Not so simple tkinter based gui around the pdf2xlsx.do_it function.

class pdf2xlsx.gui.ConfOption(root, key, row)

This widget is used to place the configuration options to the ConfigWindow. It contains a label to show what is the configuration and an entry with StringVar to provide override possibility. The value of the config JsonDict is converted to a string for the entry. If the value of a configuration is a list, it is converted to a comma separated string.

Parameters:
  • root (Frame) – Tk parent frame
  • key (str) – Key to the “config” JsonDict
  • row (int) – Parameter for grid window manager
update_config()

Write the current entry value to the configuration. The original type of the config value is checked, and the string is converted to this value (int, list of int, list of string...)
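
The round trip between a config value and its entry text could be sketched as follows. Both function names (value_to_text, convert_like) are hypothetical, and the exact separator handling of the real widget is an assumption.

```python
def value_to_text(value):
    """Render a config value for the entry; lists become comma separated text."""
    if isinstance(value, list):
        return ", ".join(str(item) for item in value)
    return str(value)


def convert_like(original, text):
    """Convert the entry text back to the type of the original config value."""
    if isinstance(original, list):
        items = [item.strip() for item in text.split(",")]
        # A list of int stays a list of int; anything else is kept as strings.
        if original and all(isinstance(item, int) for item in original):
            return [int(item) for item in items]
        return items
    if isinstance(original, int):
        return int(text)
    return text
```

With the default configuration shown later in this page, value_to_text([1, 2, 3, 4]) gives "1, 2, 3, 4", and convert_like([1, 2, 3, 4], "5, 6") gives [5, 6].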

class pdf2xlsx.gui.ConfigWindow(master)

Sub window for settings. The window is hidden by default; when the user clicks the settings button it is activated. It contains the configuration options. There are two buttons, Save (which hides the window) and Accept; both of them update the configuration file. The configuration items are stored in a list.

Parameters:master – Tk parent class
accept_callback()

Goes through every configuration item and updates them one by one. Stores the updated configuration.

save_callback()

Hides the ConfigWindow and updates and stores the configuration

class pdf2xlsx.gui.PdfXlsxGui(master)

Simple GUI which lets the user select the source zip file and the destination directory for the xlsx file. Contains a file dialog for selecting the zip file to work with. There is a button to start the conversion, and also a Settings button to open the settings window.

Parameters:master – Tk parent class
browse_src_callback()

Asks for the source zip file; the opened dialog filters for zip files by default. The src_entry attribute is updated based on the selection.

config_callback()

Bring the configuration window up

process_pdf()

Facade for the do_it function. Only the source file and the destination directory are updated; the other parameters are left at their defaults.

logger

Statistics collector helper class for pdf2xlsx

class pdf2xlsx.logger.StatLogger

Collects statistics about the zip to xlsx process. Assembles a list with one item per invoice. Every item is the number of entries found during the invoice parsing. It implements a simple API: new_invo(), new_entr() and __str__(). A new instance contains an empty list: invo_list.

new_entr()

When a new entry was found increase the entry counter for the current invoice.

new_invo()

When a new invoice is found, create a new invoice log instance. The current implementation is a simple list of numbers.
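
A minimal sketch of a logger with this API. The class name StatLoggerSketch and the text produced by __str__ are hypothetical; only invo_list, new_invo() and new_entr() come from the description above.

```python
class StatLoggerSketch:
    """Keeps one counter per invoice in invo_list."""

    def __init__(self):
        self.invo_list = []

    def new_invo(self):
        # A new invoice starts with zero entries.
        self.invo_list.append(0)

    def new_entr(self):
        # Count the entry for the most recently started invoice.
        self.invo_list[-1] += 1

    def __str__(self):
        # The summary format is an assumption, not the real output.
        return "{} invoices, {} entries".format(
            len(self.invo_list), sum(self.invo_list)
        )
```

Note that new_entr() in this sketch expects new_invo() to have been called first; calling it on an empty list raises IndexError.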

config

Configuration structure, loading and storing

class pdf2xlsx.config.JsonDict

OrderedDict class extended with serialization functions, store and load. The configuration is stored in an OrderedDict; each value in it is a regular dictionary containing ‘value’ and ‘text’. The text can be used in the GUI to show what is stored in the value.

load(path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')

Update the config from the config file (path)

Parameters:path (str) – Path and filename of the config file
store(path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')

Store the actual configuration to config file (path)

Parameters:path (str) – Path and filename of the config file
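
One way to add store and load to an OrderedDict is sketched below. It assumes the file holds JSON, which the class name suggests but this documentation does not state; the class name JsonDictSketch is hypothetical and the path is always passed explicitly here.

```python
import json
from collections import OrderedDict


class JsonDictSketch(OrderedDict):
    """OrderedDict with two serialization helpers (assumed JSON on disk)."""

    def store(self, path):
        """Write the current content to path."""
        with open(path, "w", encoding="utf-8") as handle:
            # ensure_ascii=False keeps values such as 'Pár' readable in the file.
            json.dump(self, handle, ensure_ascii=False, indent=2)

    def load(self, path):
        """Update this dict from the file at path, keeping the key order of the file."""
        with open(path, encoding="utf-8") as handle:
            self.update(json.load(handle, object_pairs_hook=OrderedDict))
```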
pdf2xlsx.config.init_conf(conf=JsonDict([('tmp_dir', {'value': 'tmp', 'text': 'tmp dir'}), ('file_extension', {'value': 'pdf', 'text': 'file ext'}), ('xlsx_name', {'value': 'invoices.xlsx', 'text': 'xlsx name'}), ('invo_header_ident', {'value': [1, 2, 3, 4], 'text': 'invo header pos'}), ('ME', {'value': ['Pár', 'Darab'], 'text': 'Me category'}), ('excel_path', {'value': 'C:\\Program Files (x86)\\Microsoft Office\\Office14\\excel.exe', 'text': 'Excel:'})]), cfg_path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')

Load the config file from $HOME/pdf2xlsx/cfg_name. If it doesn’t exist, try to create it: first create the pdf2xlsx directory, and then write out the default config.
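
The load-or-create behaviour can be sketched with plain dictionaries. The function name init_conf_sketch is hypothetical and the JSON file format is an assumption; the real init_conf works on a JsonDict.

```python
import json
import os


def init_conf_sketch(defaults, cfg_path):
    """Load cfg_path into defaults, or create the file from defaults if it is missing."""
    if os.path.isfile(cfg_path):
        with open(cfg_path, encoding="utf-8") as handle:
            defaults.update(json.load(handle))
    else:
        # Create the containing directory first, then write the default config.
        cfg_dir = os.path.dirname(cfg_path)
        if cfg_dir:
            os.makedirs(cfg_dir, exist_ok=True)
        with open(cfg_path, "w", encoding="utf-8") as handle:
            json.dump(defaults, handle, ensure_ascii=False, indent=2)
    return defaults
```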

utility

Collection of utility functions

pdf2xlsx.utility.list2row(worksheet, row, col, values, positions=None)

Create header of the template xlsx file

Parameters:
  • worksheet (Worksheet) – Worksheet class to write info
  • row (int) – Row number to start writing
  • col (int) – Column number to start writing
  • values (list) – List of values to write in a row
  • positions (list) – Positions for each value (optional; if not given, the values will be printed after each other from column 0)
Returns:

the next position of cursor row,col

Return type:

tuple of (int,int)
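
A sketch of such a row writer. It relies only on the worksheet having a write(row, col, value) method, as the xlsxwriter Worksheet does. The name list2row_sketch is hypothetical, and two details are assumptions: positions are treated as offsets from col, and the returned cursor is the start of the next row.

```python
def list2row_sketch(worksheet, row, col, values, positions=None):
    """Write values into one row and return the assumed next cursor position."""
    if positions is None:
        # Without positions the values are placed in consecutive columns.
        positions = range(len(values))
    for offset, value in zip(positions, values):
        worksheet.write(row, col + offset, value)
    return row + 1, col
```

Returning the cursor lets several writers be chained, each starting where the previous one stopped.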
