pdf2xlsx

Contains a framework to unzip a zip file that contains multiple PDF files representing invoices. The invoices are parsed into Invoice and Entry (invoice entry) classes, which are then converted to XLSX format.

pdf2xlsx.pdf2xlsx.do_it(src_name, dst_dir='', xlsx_name='Invoices01.xlsx', tmp_dir='tmp', file_extension='.pdf')

Main script that manages the zip-to-xlsx process. It is responsible for creating and cleaning up temporary directories and files. After extracting the zip, it searches for every file whose name ends with file_extension. It then builds a list of invoices, writes them to xlsx format, and opens the result in the predefined xlsx_viewer. If dst_dir happens to be the same as tmp_dir, the generated xlsx file is removed after the run.

Parameters:
  • src_name (str) – path to the zip file to extract
  • dst_dir (str) – path to the directory in which to put the generated xlsx file; by default, the cwd
  • tmp_dir (str) – temporary directory to work in. This directory is erased at the beginning of the script. By default it is tmp
  • file_extension (str) – the file extension to use during file selection. By default it is .pdf
  • xlsx_name (str) – name of the output file
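
The note that tmp_dir is erased at the beginning of the script can be illustrated with a minimal sketch. The helper name reset_tmp_dir is hypothetical and is not part of pdf2xlsx; it only shows one plausible way to get an empty working directory with the standard library.

```python
import os
import shutil


def reset_tmp_dir(tmp_dir="tmp"):
    """Hypothetical helper: erase tmp_dir if it exists, then recreate it empty."""
    if os.path.isdir(tmp_dir):
        # Everything inside the directory is deleted, so never point this
        # at a directory that holds files you want to keep.
        shutil.rmtree(tmp_dir)
    os.makedirs(tmp_dir)
    return tmp_dir
```

Because the directory is wiped on every run, choosing a dst_dir different from tmp_dir is the safer default if the generated file should survive.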
pdf2xlsx.pdf2xlsx.extract_invoces(pdf_list, logger)

Get the invoices from the PDF files in pdf_list. Wrapper around the pdf2rawtxt call.

Parameters:
  • pdf_list (list) – list of PDF file paths to process
  • logger – StatLogger, collects statistical data about parsing
Returns:

list of invoices

Return type:

list of Invoice

pdf2xlsx.pdf2xlsx.extract_zip(src_name, directory)

Extract the PDF files from the zip. There is no sanity check for the arguments.

Parameters:
  • src_name (str) – path to the zip file to extract
  • directory (str) – path to the target directory to extract the zip file into
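
As an illustration of this kind of extraction, here is a minimal sketch based on the standard zipfile module. The function name extract_zip_sketch is hypothetical, and the code is an assumption about the general approach, not the package's actual implementation.

```python
import os
import zipfile


def extract_zip_sketch(src_name, directory):
    """Extract every member of the zip file at src_name into directory.

    As in the documented function, the arguments are not validated:
    a missing file raises FileNotFoundError and a file that is not a
    zip archive raises zipfile.BadZipFile.
    """
    with zipfile.ZipFile(src_name) as archive:
        # extractall creates the target directory if it does not exist.
        archive.extractall(directory)
```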

pdf2xlsx.pdf2xlsx.get_pdf_files(directory, extension='.pdf')

Walks through the given directory and collects every file with the given extension.

Parameters:
  • directory (str) – the root directory to start the walk
  • extension (str) – ‘.pdf’ by default; a file is selected if it has this extension
Returns:

list of PDF file paths

Return type:

list of str
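
A directory walk of this kind can be sketched with os.walk. The name collect_files is hypothetical and the code only assumes the behaviour described above (recursive walk, match on file-name suffix); the real function may differ in details such as ordering or case sensitivity.

```python
import os


def collect_files(directory, extension=".pdf"):
    """Return the paths of all files under directory that end with extension."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Plain suffix comparison: "A.PDF" does not match ".pdf".
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    return matches
```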

pdf2xlsx.pdf2xlsx.invoices2xlsx(invoices, directory='', name='Invoices01.xlsx')

Write invoice information to an xlsx template file. Go through every invoice and write it out. Simple. Utilizes the xlsxwriter module.

Parameters:invoices (list of Invoice) – representation of the invoices from the PDF files
pdf2xlsx.pdf2xlsx.main()

A simple wrapper around the do_it function, to demonstrate its usage.

pdf2xlsx.pdf2xlsx.pdf2rawtxt(pdfile, logger)

Read the given PDF file into Invoice and Entry classes, which parse it. Utilizes the PyPDF2 PdfFileReader. Goes through every page of the PDF. When a new invoice entry is found by Entry.parse_line, it is appended to Invoice.entries.

Parameters:
  • pdfile (str) – file path of the pdf to process
  • logger – StatLogger, collects statistical data about parsing
Returns:

The invoice filled with the information from the PDF file

Return type:

Invoice
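
The page-by-page parsing loop can be sketched without PyPDF2 by starting from already extracted page texts. Everything below is hypothetical: the Entry and Invoice classes are simplified stand-ins for the package's own classes, the line layout ("description quantity price") is invented for the example, and parse_line returning True for a recognised entry is an assumption.

```python
import re


class Entry:
    """Hypothetical stand-in for the package's Entry class."""

    # Assumed line layout, for this sketch only: "<description> <quantity> <price>"
    PATTERN = re.compile(r"^(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$")

    def __init__(self):
        self.description = None
        self.quantity = None
        self.price = None

    def parse_line(self, line):
        """Fill the entry from line; return True if the line was an entry."""
        match = self.PATTERN.match(line.strip())
        if match is None:
            return False
        self.description = match.group("desc")
        self.quantity = int(match.group("qty"))
        self.price = float(match.group("price"))
        return True


class Invoice:
    """Hypothetical stand-in for the package's Invoice class."""

    def __init__(self):
        self.entries = []


def pages_to_invoice(pages):
    """Build an Invoice from a list of page texts (one string per page)."""
    invoice = Invoice()
    for page_text in pages:
        for line in page_text.splitlines():
            entry = Entry()
            # Only lines recognised as entries are appended to the invoice.
            if entry.parse_line(line):
                invoice.entries.append(entry)
    return invoice
```

In the real function the page texts would come from the PDF reader and the logger would record parsing statistics; both are left out here to keep the sketch self-contained.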

pdf2xlsx.pdf2xlsx.run_excel(xlsx_path)

Start Excel with the file given in the argument. The location of the Excel executable should be set in the configuration.

Parameters:xlsx_path (str) – Path to the xlsx file to open