Welcome to pdf2xlsx’s documentation!
A framework to open a zip file which contains multiple pdf files representing invoices. The invoices are parsed into Invoice and Entry (invoice entries) classes. These are converted to XLSX format.
A GUI supports starting the conversion, and a logger shows details about the process.
pdf2xlsx
Contains a framework to open a zip file which contains multiple pdf files representing invoices. The invoices are parsed into Invoice and Entry (invoice entries) classes. These are converted to XLSX format.
-
pdf2xlsx.pdf2xlsx.do_it(src_name, dst_dir='', xlsx_name='Invoices01.xlsx', tmp_dir='tmp', file_extension='.pdf')
Main script to manage the zip to xlsx process. It is responsible for creating and cleaning up temporary directories and files. After the zip extraction, it searches for every file which ends with file_extension. Then it builds up a list of invoices, writes them to xlsx format and opens the result in the predefined xlsx_viewer. If dst_dir happens to be the same as tmp_dir, the generated xlsx file is removed after the run.
Parameters:
- src_name (str) – path to the zip file to extract
- dst_dir (str) – path to the directory to put the generated xlsx file in; by default the cwd
- tmp_dir (str) – temporary directory to work in. This directory is erased at the beginning of the script. By default it is tmp
- file_extension (str) – the file extension to use during file selection. By default it is .pdf
- xlsx_name (str) – name of the output file
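The directory handling described for do_it can be sketched with the standard library alone. This is a minimal illustration, not the package's code: the helper names reset_tmp_dir and should_remove_output are hypothetical, and recreating the directory right after erasing it is an assumption.

```python
import os
import shutil
import tempfile


def reset_tmp_dir(tmp_dir):
    """Erase tmp_dir (if present) and recreate it empty. Hypothetical helper."""
    shutil.rmtree(tmp_dir, ignore_errors=True)
    os.makedirs(tmp_dir)


def should_remove_output(dst_dir, tmp_dir):
    """True when the xlsx would land in the temporary directory itself."""
    return os.path.abspath(dst_dir) == os.path.abspath(tmp_dir)


# Demonstrate on a scratch location so nothing real is deleted.
scratch = tempfile.mkdtemp()
tmp_dir = os.path.join(scratch, "tmp")
os.makedirs(tmp_dir)
with open(os.path.join(tmp_dir, "stale.pdf"), "w") as handle:
    handle.write("left over from a previous run")

reset_tmp_dir(tmp_dir)
remaining = os.listdir(tmp_dir)
same_dir = should_remove_output(tmp_dir, tmp_dir + os.sep)
default_dirs = should_remove_output("", "tmp")
```

With the documented defaults (dst_dir='' and tmp_dir='tmp') the two directories differ, so under this reading the generated file would be kept.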
-
pdf2xlsx.pdf2xlsx.extract_invoces(pdf_list, logger)
Get the invoices from the pdf files in the pdf_list. Wrapper around the pdf2rawtxt call.
Parameters:
- pdf_list (list) – List of pdf file paths to process.
- logger – StatLogger, collects statistical data about parsing
Returns: list of invoices
Return type: list of Invoice
-
pdf2xlsx.pdf2xlsx.extract_zip(src_name, directory)
Extract the pdf files from the zip. There is no sanity check for the arguments.
Parameters:
- src_name (str) – Path to a zip file to extract
- directory (str) – Path to the target directory to extract the zip file into
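A function with this contract can be approximated with the standard zipfile module. The sketch below is an assumption about the approach, not the actual body of extract_zip; the name extract_zip_sketch is hypothetical.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(src_name, directory):
    """Extract every member of the zip at src_name into directory."""
    with zipfile.ZipFile(src_name) as archive:
        archive.extractall(directory)


# Build a tiny archive in a scratch directory and unpack it again.
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "invoices.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("a.pdf", b"dummy content")

target_dir = os.path.join(work_dir, "out")
extract_zip_sketch(zip_path, target_dir)
extracted = sorted(os.listdir(target_dir))
```

As in the documented function, nothing is validated here: a missing or corrupt zip file simply raises an exception from zipfile.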
-
pdf2xlsx.pdf2xlsx.get_pdf_files(directory, extension='.pdf')
Walks through the given directory and collects every file with the given extension.
Parameters:
- directory (str) – the root directory to start the walk
- extension (str) – ‘.pdf’ by default; a file with this extension is selected
Returns: list of pdf file paths
Return type: list of str
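The walk described above maps naturally onto os.walk. The following is a sketch under that assumption; get_files_with_extension is a hypothetical name and the real function may differ in details such as ordering or case handling.

```python
import os
import tempfile


def get_files_with_extension(directory, extension=".pdf"):
    """Walk directory recursively and return paths ending with extension."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    return matches


# Create a small tree: two pdf files (one nested) and one unrelated file.
root_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(root_dir, "nested"))
for rel in ("a.pdf", "notes.txt", os.path.join("nested", "b.pdf")):
    with open(os.path.join(root_dir, rel), "w") as handle:
        handle.write("x")

found = sorted(
    os.path.relpath(path, root_dir)
    for path in get_files_with_extension(root_dir)
)
```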
-
pdf2xlsx.pdf2xlsx.invoices2xlsx(invoices, directory='', name='Invoices01.xlsx')
Write invoice information to the xlsx template file. Goes through every invoice and writes it out. Simple. Utilizes the xlsxwriter module.
Parameters: invoices (list of Invoice) – Representation of invoices from the pdf files
-
pdf2xlsx.pdf2xlsx.main()
A simple wrapper around the do_it function, to demonstrate its usage.
-
pdf2xlsx.pdf2xlsx.pdf2rawtxt(pdfile, logger)
Read the given pdf file and pass it to the Invoice and Entry classes for parsing. Utilizes the PyPDF2 PdfFileReader. Goes through every page of the pdf. When a new invoice entry is found by Entry.parse_line, it is appended to Invoice.entries.
Parameters:
- pdfile (str) – file path of the pdf to process
- logger – StatLogger, collects statistical data about parsing
Returns: The invoice filled up with the information from the pdf file
Return type: Invoice
-
pdf2xlsx.pdf2xlsx.run_excel(xlsx_path)
Start up Excel with the file from the argument. The location of the Excel executable should be set in the configuration.
Parameters: xlsx_path (str) – Path to the xlsx file to open
invoice
Classes for different invoice types
-
class pdf2xlsx.invoice.CreditEntry(entry_tuple=None, invo=None)
These entries contain negative prices, as these are credit invoices. Dummy!
-
class pdf2xlsx.invoice.CreditInvoice(no=0, orig_date='', pay_due='', total_sum=0, entries=None, orig_invo_no=0)
Credit invoice class
-
parse_line(line)
Parameters: line (str) – The actual line to parse
Returns: True when the parsing of the Invoice was started
Return type: bool
-
xlsx_write(worksheet, row, col)
Write the invoice information to a template xlsx file.
Parameters:
- worksheet (Worksheet) – Worksheet class to write info
- row (int) – Row number to start writing
- col (int) – Column number to start writing
Returns: the next position of the cursor (row, col)
Return type: tuple of (int, int)
-
-
class pdf2xlsx.invoice.Entry(entry_tuple=None, invo=None)
Parse, store and write invoice entries to xlsx. The invoice information is stored in the EntryTuple namedtuple. The parsing is controlled by a state variable (entry_found). Because the invoice entries are split into two lines, the tmp_str attribute is used to store the first part of the entry. The ME values are configurable, so they cannot be created at class level; they need to be recomputed at every instantiation.
Parameters:
- entry_tuple (EntryTuple) – The invoice entry
- invo (Invoice) – The parent invoice containing this entry
-
line2entry(line)
Extracts entry information from the given line. Tries to search for nine different groups in the line. See the implementation of entry_pattern. This should match the following pattern:
NNNNNN-NNN STR+WSPACE PREDEFSTR INTEGER INTEGER-. INTEGER% INTEGER-. INTEGER-. INTEGER%
Where:
- N: a single digit: 0-9
- STR+WSPACE: string containing white spaces, numbers and special characters
- PREDEFSTR: string without white space (predefined)
- INTEGER: decimal number, unknown length
- INTEGER-.: a decimal number, grouped with . by thousands, e.g. 1.589.674
- INTEGER%: an integer with percentage at the end
Parameters: line (str) – Line to parse, this line should begin with NNNNNN-NNN
Returns: The actual invoice entry
Return type: EntryTuple
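The nine-group pattern can be illustrated with a regular expression. This is only a sketch and not the real entry_pattern: it assumes PREDEFSTR is one of the configured ME values (taken here from the init_conf defaults), that the fields are separated by whitespace, and the sample line is invented for the demonstration.

```python
import re
from collections import namedtuple

EntryTuple = namedtuple(
    "EntryTuple",
    ["kod", "nev", "ME", "mennyiseg", "BEgysegar",
     "Kedv", "NEgysegar", "osszesen", "AFA"],
)

ME_VALUES = ["Pár", "Darab"]          # assumed: default ME list from init_conf
GROUPED = r"\d{1,3}(?:\.\d{3})*"      # INTEGER-. e.g. 1.589.674

ENTRY_PATTERN_SKETCH = re.compile(
    r"(\d{6}-\d{3})\s+"                                        # NNNNNN-NNN
    r"(.+?)\s+"                                                # STR+WSPACE
    r"(" + "|".join(re.escape(me) for me in ME_VALUES) + r")\s+"  # PREDEFSTR
    r"(\d+)\s+"                                                # INTEGER
    r"(" + GROUPED + r")\s+"                                   # INTEGER-.
    r"(\d+%)\s+"                                               # INTEGER%
    r"(" + GROUPED + r")\s+"                                   # INTEGER-.
    r"(" + GROUPED + r")\s+"                                   # INTEGER-.
    r"(\d+%)"                                                  # INTEGER%
)


def line2entry_sketch(line):
    """Return an EntryTuple of strings, or None when the line does not match."""
    match = ENTRY_PATTERN_SKETCH.match(line)
    if match is None:
        return None
    return EntryTuple(*match.groups())


# Invented sample line, shaped after the documented pattern.
sample = "123456-789 Running shoe 42 blue Pár 2 12.500 10% 11.250 22.500 27%"
entry = line2entry_sketch(sample)
no_entry = line2entry_sketch("no entry here")
```

The name group is matched lazily, so it stops at the first ME value that is followed by the numeric fields; every field is kept as a string in this sketch.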
-
parse_line(line)
Parse through raw text which is supplied line-by-line. This is the structure of the pdf (the brackets () indicate what should be collected):
n times:
<uninteresting rubbish> (NNNNNN-NNN ...
...) <uninteresting rubbish>
When the invoice code is found, one additional line is awaited, and then the joined text is sent to the line2entry converter.
Parameters: line (str) – The actual line to parse
Returns: True when an entry was found
Return type: bool
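The two-line handling can be pictured as a tiny state machine. The sketch below reuses the attribute names entry_found and tmp_str from the class description, but the class TwoLineJoiner, the joined list and the choice to join the halves with a single space are assumptions made for illustration.

```python
import re

CODE_START = re.compile(r"\d{6}-\d{3}")  # a line starting with NNNNNN-NNN


class TwoLineJoiner:
    """Collects entries whose text is split across two consecutive lines."""

    def __init__(self):
        self.entry_found = False   # True while waiting for the second half
        self.tmp_str = ""          # first half of the current entry
        self.joined = []           # complete entries, ready for line2entry

    def parse_line(self, line):
        if self.entry_found:
            # Second half arrived: join it with the stored first half.
            self.joined.append(self.tmp_str + " " + line.strip())
            self.entry_found = False
            self.tmp_str = ""
            return True
        if CODE_START.match(line):
            # First half: remember it and wait for one more line.
            self.entry_found = True
            self.tmp_str = line.strip()
        return False


# Invented input lines, shaped after the documented structure.
lines = [
    "page header",
    "123456-789 Running shoe",
    "Pár 2 12.500 10% 11.250 22.500 27%",
    "page footer",
]
joiner = TwoLineJoiner()
results = [joiner.parse_line(line) for line in lines]
```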
-
xlsx_write(worksheet, row, col)
Write the entry information to a template xlsx file.
Parameters:
- worksheet (Worksheet) – Worksheet class to write info
- row (int) – Row number to start writing
- col (int) – Column number to start writing
Returns: the next position of the cursor (row, col)
Return type: tuple of (int, int)
-
class pdf2xlsx.invoice.EntryTuple(kod, nev, ME, mennyiseg, BEgysegar, Kedv, NEgysegar, osszesen, AFA)
- AFA – Alias for field number 8
- BEgysegar – Alias for field number 4
- Kedv – Alias for field number 5
- ME – Alias for field number 2
- NEgysegar – Alias for field number 6
- kod – Alias for field number 0
- mennyiseg – Alias for field number 3
- nev – Alias for field number 1
- osszesen – Alias for field number 7
-
-
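The field numbers listed for EntryTuple follow directly from the order of the constructor arguments. A short sketch with collections.namedtuple shows the correspondence; the values are invented and the definition is reconstructed from the signature above, not copied from the package.

```python
from collections import namedtuple

# Reconstructed from the documented signature; field order defines the index.
EntryTuple = namedtuple(
    "EntryTuple",
    "kod nev ME mennyiseg BEgysegar Kedv NEgysegar osszesen AFA",
)

# Invented values, only to show name-based and index-based access.
entry = EntryTuple(
    "123456-789", "Running shoe", "Pár", 2, 12500, 10, 11250, 22500, 27
)
afa_index = EntryTuple._fields.index("AFA")
same_value = entry.osszesen == entry[7]
```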
class pdf2xlsx.invoice.Invoice(no=0, orig_date='', pay_due='', total_sum=0, entries=None)
Parse, store and write invoice information to xlsx, such as Invoice Number, Invoice Date, Payment Date and Total Sum Price. It also contains a list of Entry, which is also extracted from the raw string. The parsing of the raw string is controlled by three state variables: no_parsed, orig_date_parsed and pay_due_parsed. These represent the structure of the pdf.
Parameters:
- no (int) – Invoice number, default: 0
- orig_date (str) – Invoice date stored as a string YYYY.MM.DD
- pay_due (str) – Payment date stored as a string YYYY.MM.DD
- total_sum (int) – Total price of the invoice
- entries (list) – List of Entry containing each entry in the invoice
[TODO] implement state pattern for parsing ???
[TODO] implement _to_money as a mixin class
-
parse_line(line)
Parse through raw text which is supplied line-by-line. This is the structure of the pdf (the brackets () indicate what should be collected):
<uninteresting rubbish> Számla sorszáma: (NNNNNNNN) ...
<uninteresting rubbish> Számla kelte: (YYYY.MM.DD|DD.MM.YYYY) ...
<uninteresting rubbish> FIZETÉSI HATÁRIDŐ:(YYYY.MM.DD|DD.MM.YYYY) (NNN[.NNN.NNN]) <uninteresting rubbish>
This structure is parsed using the three state variables, and the results are stored in the class attributes.
Parameters: line (str) – The actual line to parse
Returns: True when the parsing of the Invoice was started
Return type: bool
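One way to read the three state variables is as a sequence of gates: each pattern is only looked for once the previous one has been found. The sketch below follows that reading. The class InvoiceSketch and its regular expressions are assumptions built from the structure shown above, and it further assumes that the total sum sits on the same line as the payment date and that True is returned on the line carrying the invoice number.

```python
import re

DATE = r"\d{4}\.\d{2}\.\d{2}|\d{2}\.\d{2}\.\d{4}"
NO_PATTERN = re.compile(r"Számla sorszáma:\s*(\d+)")
ORIG_DATE_PATTERN = re.compile(r"Számla kelte:\s*(" + DATE + r")")
PAY_DUE_PATTERN = re.compile(
    r"FIZETÉSI HATÁRIDŐ:\s*(" + DATE + r")\s+(\d{1,3}(?:\.\d{3})*)"
)


class InvoiceSketch:
    """Hypothetical stand-in for Invoice, reduced to the header parsing."""

    def __init__(self):
        self.no = 0
        self.orig_date = ""
        self.pay_due = ""
        self.total_sum = 0
        self.no_parsed = False
        self.orig_date_parsed = False
        self.pay_due_parsed = False

    def parse_line(self, line):
        if not self.no_parsed:
            match = NO_PATTERN.search(line)
            if match:
                self.no = int(match.group(1))
                self.no_parsed = True
                return True  # assumed meaning of "parsing was started"
        elif not self.orig_date_parsed:
            match = ORIG_DATE_PATTERN.search(line)
            if match:
                self.orig_date = match.group(1)
                self.orig_date_parsed = True
        elif not self.pay_due_parsed:
            match = PAY_DUE_PATTERN.search(line)
            if match:
                self.pay_due = match.group(1)
                # Drop the thousands separators before converting.
                self.total_sum = int(match.group(2).replace(".", ""))
                self.pay_due_parsed = True
        return False


# Invented lines, shaped after the documented structure.
invoice = InvoiceSketch()
started = [
    invoice.parse_line(line)
    for line in (
        "some header text",
        "xx Számla sorszáma: 12345678 yy",
        "Számla kelte: 2017.03.01 zz",
        "FIZETÉSI HATÁRIDŐ:2017.03.31 1.589.674",
    )
]
```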
-
xlsx_write(worksheet, row, col)
Write the invoice information to a template xlsx file.
Parameters:
- worksheet (Worksheet) – Worksheet class to write info
- row (int) – Row number to start writing
- col (int) – Column number to start writing
Returns: the next position of the cursor (row, col)
Return type: tuple of (int, int)
-
pdf2xlsx.invoice.get_invo_type(pdf_line)
TODO: add title parsing to decide between invoice types
-
pdf2xlsx.invoice.invo_parser(pdf_file, logger)
Factory to generate the appropriate invoice type based on the title in the PDF
gui
Not so simple tkinter based GUI around the pdf2xlsx.do_it function.
-
class pdf2xlsx.gui.ConfOption(root, key, row)
This widget is used to place the configuration options in the ConfigWindow. It contains a label to show what the configuration is, and an entry with a StringVar to make it possible to override the value. The value of the config JsonDict is converted to a string for the entry. If the value of a configuration is a list, it is converted to a comma separated string.
Parameters:
- root (Frame) – Tk parent frame
- key (str) – Key to the “config” JsonDict
- row (int) – Parameter for grid window manager
-
update_config()
Write the current entry value to the configuration. The original type of the config value is checked, and the string is converted to this type (int, list of int, list of string...)
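The round trip between a config value and its entry string can be sketched as two small functions. Both names (value_to_text, text_to_value) are hypothetical, and the separator handling (comma plus optional spaces) is an assumption; the real widget may format the string differently.

```python
def value_to_text(value):
    """Render a config value for the entry: lists become comma separated."""
    if isinstance(value, list):
        return ", ".join(str(item) for item in value)
    return str(value)


def text_to_value(text, original):
    """Convert the entry text back, using the type of the original value."""
    if isinstance(original, list):
        items = [part.strip() for part in text.split(",") if part.strip()]
        if original and all(isinstance(item, int) for item in original):
            return [int(item) for item in items]   # list of int
        return items                               # list of str
    if isinstance(original, int):
        return int(text)
    return text


# Values taken from the documented init_conf defaults.
header_text = value_to_text([1, 2, 3, 4])
header_back = text_to_value("1, 2, 3, 5", [1, 2, 3, 4])
me_back = text_to_value("Pár, Darab", ["Pár", "Darab"])
name_back = text_to_value("out.xlsx", "invoices.xlsx")
```

Checking the type of the stored value, rather than guessing from the text, keeps a string such as '42' a string when the option was originally a string.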
-
class pdf2xlsx.gui.ConfigWindow(master)
Sub window for settings. The window is hidden by default; it is activated when the user clicks the Settings button. It contains the configuration options. There are two buttons, Save (which also hides the window) and Accept; both of them update the configuration file. The configuration items are stored in a list.
Parameters: master – Tk parent class
-
accept_callback()
Goes through every configuration item and updates them one by one. Stores the updated configuration.
-
save_callback()
Hides the ConfigWindow, then updates and stores the configuration.
-
-
class pdf2xlsx.gui.PdfXlsxGui(master)
Simple GUI which lets the user select the source zip file and the destination directory for the xlsx file. Contains a file dialog for selecting the zip file to work with. There is a button to start the conversion, and also a Settings button to open the settings window.
Parameters: master – Tk parent class
-
browse_src_callback()
Asks for the source zip file; the opened dialog filters for zip files by default. The src_entry attribute is updated based on the selection.
-
config_callback()
Bring the configuration window up.
-
process_pdf()
Facade for the do_it function. Only the src file and destination dir are updated; the other parameters are left at their defaults.
-
logger
Statistics collector helper class for pdf2xlsx
-
class pdf2xlsx.logger.StatLogger
Collects statistics about the zip to xlsx process. Assembles a list with one item per invoice; every item is the number of entries found while parsing that invoice. It implements a simple API: new_invo(), new_entr() and __str__(). A new instance contains an empty list: invo_list.
-
new_entr()
When a new entry is found, increase the entry counter for the current invoice.
-
new_invo()
When a new invoice is found, create a new invoice log instance. The current implementation is a simple list of numbers.
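The described API is small enough to sketch in full. StatLoggerSketch below mirrors the documented method names and the invo_list attribute, but it is an illustration only; in particular the text produced by __str__ is invented, since the real format is not documented here.

```python
class StatLoggerSketch:
    """One counter per invoice; each counter is that invoice's entry count."""

    def __init__(self):
        self.invo_list = []

    def new_invo(self):
        # Start counting entries for a freshly found invoice.
        self.invo_list.append(0)

    def new_entr(self):
        # Count one more entry for the most recent invoice.
        # Raises IndexError if no invoice has been registered yet.
        self.invo_list[-1] += 1

    def __str__(self):
        # Invented summary format.
        return "{} invoices, {} entries".format(
            len(self.invo_list), sum(self.invo_list)
        )


# Two invoices: the first with two entries, the second with one.
logger = StatLoggerSketch()
logger.new_invo()
logger.new_entr()
logger.new_entr()
logger.new_invo()
logger.new_entr()
summary = str(logger)
```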
-
config
Configuration structure, loading and storing
-
class pdf2xlsx.config.JsonDict
OrderedDict class extended with the serialization functions store and load. The configuration is stored in an OrderedDict; each value in it is a regular dictionary containing ‘value’ and ‘text’. The text can be used by the GUI to show what is stored in the value.
-
load(path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')
Update the config from the config file (path).
Parameters: path (str) – Path and filename of the config file
-
store(path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')
Store the actual configuration to the config file (path).
Parameters: path (str) – Path and filename of the config file
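A dictionary with store and load methods can be sketched on top of the standard json module. That JSON is the on-disk format is an assumption suggested by the class name; JsonDictSketch is a hypothetical stand-in, and it takes an explicit path instead of the documented default.

```python
import json
import os
import tempfile
from collections import OrderedDict


class JsonDictSketch(OrderedDict):
    """OrderedDict that can write itself to, and update itself from, a file."""

    def store(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self, handle, indent=2, ensure_ascii=False)

    def load(self, path):
        with open(path, encoding="utf-8") as handle:
            # object_pairs_hook keeps the key order found in the file.
            self.update(json.load(handle, object_pairs_hook=OrderedDict))


# Two options in the documented {'value': ..., 'text': ...} layout.
conf = JsonDictSketch()
conf["tmp_dir"] = {"value": "tmp", "text": "tmp dir"}
conf["ME"] = {"value": ["Pár", "Darab"], "text": "Me category"}

cfg_path = os.path.join(tempfile.mkdtemp(), "config.txt")
conf.store(cfg_path)

restored = JsonDictSketch()
restored.load(cfg_path)
restored_keys = list(restored.keys())
```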
-
-
pdf2xlsx.config.init_conf(conf=JsonDict([('tmp_dir', {'value': 'tmp', 'text': 'tmp dir'}), ('file_extension', {'value': 'pdf', 'text': 'file ext'}), ('xlsx_name', {'value': 'invoices.xlsx', 'text': 'xlsx name'}), ('invo_header_ident', {'value': [1, 2, 3, 4], 'text': 'invo header pos'}), ('ME', {'value': ['Pár', 'Darab'], 'text': 'Me category'}), ('excel_path', {'value': 'C:\\Program Files (x86)\\Microsoft Office\\Office14\\excel.exe', 'text': 'Excel:'})]), cfg_path='C:\\Users\\tibger01\\.pdf2xlsx\\config.txt')
Load the config file from $HOME/pdf2xlsx/cfg_name. If it doesn’t exist, try to create it: first create the pdf2xlsx directory, and then write out the default config.
utility
Collection of utility functions
-
pdf2xlsx.utility.list2row(worksheet, row, col, values, positions=None)
Create header of the template xlsx file.
Parameters:
- worksheet (Worksheet) – Worksheet class to write info
- row (int) – Row number to start writing
- col (int) – Column number to start writing
- values (list) – List of values to write in a row
- positions (list) – Positions for each value (optional; if not given, the values will be printed after each other from column 0)
Returns: the next position of the cursor (row, col)
Return type: tuple of (int, int)
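The behaviour of list2row can be sketched against a stand-in worksheet, so the example runs without xlsxwriter. FakeWorksheet is hypothetical and only mimics a write(row, col, value) call. Treating positions as offsets relative to col, and returning the start of the next row as the cursor, are both assumptions.

```python
class FakeWorksheet:
    """Stand-in that records writes instead of producing an xlsx file."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


def list2row_sketch(worksheet, row, col, values, positions=None):
    """Write values into one row and return the assumed next cursor."""
    if positions is None:
        # No positions given: place the values in consecutive columns.
        positions = range(len(values))
    for offset, value in zip(positions, values):
        worksheet.write(row, col + offset, value)
    return row + 1, col


sheet = FakeWorksheet()
next_cursor = list2row_sketch(sheet, 0, 0, ["No", "Date", "Total"])
spaced_cursor = list2row_sketch(sheet, 2, 1, ["a", "b"], positions=[0, 3])
```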