pdf2xlsx
Contains a framework to unzip a zip file that holds multiple pdf files representing invoices. The invoices are parsed into Invoice and Entry (invoice entry) classes, which are then converted to XLSX format.
-
pdf2xlsx.pdf2xlsx.do_it(src_name, dst_dir='', xlsx_name='Invoices01.xlsx', tmp_dir='tmp', file_extension='.pdf')
Main script to manage the zip-to-xlsx process. It is responsible for creating and cleaning up temporary directories and files. After extracting the zip, it searches for every file that ends with file_extension. It then builds a list of invoices, writes them out in xlsx format, and opens the result in the predefined xlsx_viewer. If dst_dir happens to be the same as tmp_dir, the generated xlsx file is removed after the run.
Parameters: - src_name (str) – path to the zip file to extract
- dst_dir (str) – path to the directory in which to put the generated xlsx file. By default it is the cwd
- tmp_dir (str) – temporary directory to work in. This directory is erased at the beginning of the script. By default it is tmp
- file_extension (str) – the file extension to use during file selection. By default it is .pdf
- xlsx_name (str) – name of the output file
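The temporary-directory handling described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: `reset_tmp_dir` and `is_same_dir` are hypothetical helper names, and the sketch assumes an empty dst_dir means the current working directory, as the parameter description states.

```python
import os
import shutil


def reset_tmp_dir(tmp_dir="tmp"):
    """Erase tmp_dir if it exists, then recreate it empty (hypothetical helper)."""
    if os.path.isdir(tmp_dir):
        shutil.rmtree(tmp_dir)
    os.makedirs(tmp_dir)
    return os.path.abspath(tmp_dir)


def is_same_dir(dst_dir, tmp_dir):
    """Return True when both paths name the same directory (hypothetical helper).

    An empty dst_dir is treated as the current working directory,
    mirroring the documented default.
    """
    return os.path.abspath(dst_dir or ".") == os.path.abspath(tmp_dir)
```

A caller could use `is_same_dir(dst_dir, tmp_dir)` to decide whether the generated xlsx file should be deleted at the end of the run.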
-
pdf2xlsx.pdf2xlsx.extract_invoces(pdf_list, logger)
Get the invoices from the pdf files in the pdf_list. Wrapper around the pdf2rawtxt call.
Parameters: - pdf_list (list) – list of pdf file paths to process
- logger – StatLogger, collects statistical data about parsing
Returns: list of invoices
Return type: list of Invoice
-
pdf2xlsx.pdf2xlsx.extract_zip(src_name, directory)
Extract the pdf files from the zip. There is no sanity check for the arguments.
Parameters: - src_name (str) – path to the zip file to extract
- directory (str) – path to the target directory into which the zip file is extracted
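As an illustration of this step, a zip can be extracted with the standard zipfile module. The function name `extract_zip_sketch` is hypothetical and the body is an assumption about how such a step might look, not the package's implementation. Like the documented function, it does not validate its arguments.

```python
import zipfile


def extract_zip_sketch(src_name, directory):
    """Extract every member of the zip at src_name into directory.

    No argument checking is done here: a missing file raises
    FileNotFoundError and a non-zip file raises zipfile.BadZipFile.
    """
    with zipfile.ZipFile(src_name) as archive:
        archive.extractall(directory)
```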
-
pdf2xlsx.pdf2xlsx.get_pdf_files(directory, extension='.pdf')
Walks through the given directory and collects every file with the given extension.
Parameters: - directory (str) – the root directory to start the walk
- extension (str) – ‘.pdf’ by default, if the file has this extension it is selected
Returns: list of pdf file paths
Return type: list of str
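A directory walk of this kind can be sketched with os.walk. The name `collect_files` is hypothetical, and the suffix test below is an assumption. The package may match extensions differently, for example case-insensitively.

```python
import os


def collect_files(directory, extension=".pdf"):
    """Return the paths of all files under directory whose name ends with extension."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Case-sensitive suffix match: "a.PDF" is not selected here.
            if name.endswith(extension):
                found.append(os.path.join(root, name))
    return found
```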
-
pdf2xlsx.pdf2xlsx.invoices2xlsx(invoices, directory='', name='Invoices01.xlsx')
Write invoice information to the xlsx template file. Goes through every invoice and writes it out. Utilizes the xlsxwriter module.
Parameters: invoices (list of Invoice) – representation of the invoices from the pdf files
-
pdf2xlsx.pdf2xlsx.main()
A simple wrapper around the do_it function, to demonstrate its usage.
-
pdf2xlsx.pdf2xlsx.pdf2rawtxt(pdfile, logger)
Read the given pdf file and parse it into Invoice and Entry classes. Utilizes the PyPDF2 PdfFileReader. Goes through every page of the pdf. When Entry.parse_line finds a new invoice entry, it is appended to Invoice.entries.
Parameters: - pdfile (str) – file path of the pdf to process
- logger – StatLogger, collects statistical data about parsing
Returns: the invoice filled with the information from the pdf file
Return type: Invoice
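The page-by-page parsing loop can be illustrated without a pdf library by feeding already-extracted page text to stand-in classes. Everything below is hypothetical: `SketchEntry`, `SketchInvoice`, `pages_to_invoice` and the `name;quantity` line format are assumptions made for the example and do not reflect the package's real Invoice and Entry classes or the layout of real invoices.

```python
class SketchEntry:
    """Hypothetical stand-in for Entry: one line of the form 'name;quantity'."""

    def __init__(self, name, quantity):
        self.name = name
        self.quantity = quantity

    @classmethod
    def parse_line(cls, line):
        """Return a SketchEntry if the line holds an entry, otherwise None."""
        parts = line.split(";")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            return None
        return cls(parts[0].strip(), int(parts[1]))


class SketchInvoice:
    """Hypothetical stand-in for Invoice: only collects entries."""

    def __init__(self):
        self.entries = []


def pages_to_invoice(pages):
    """Scan the text of every page and append each entry that is found."""
    invoice = SketchInvoice()
    for page_text in pages:
        for line in page_text.splitlines():
            entry = SketchEntry.parse_line(line)
            if entry is not None:
                invoice.entries.append(entry)
    return invoice
```

In the real function the page text would come from the pdf reader rather than from a list of strings.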
-
pdf2xlsx.pdf2xlsx.run_excel(xlsx_path)
Start up Excel with the file given in the argument. The location of the Excel executable should be set in the configuration.
Parameters: xlsx_path (str) – Path to the xlsx file to open