Saturday, January 30, 2016

Reading text out of PDF bank statements: takes some pain out of tax time!

I was working on my wife's business taxes for Washington State today. Our state has a simple and simply not very fair tax: the B&O (Business & Occupation) tax. But that's not today's issue; the issue is that I was grinding my teeth over retyping numbers from a bunch of PDF bank statements.


I had started on this, sighing, and then thought: hey, there should be a tool for this!

And indeed good old Ghostscript came to the rescue!

I already had Ghostscript on my 2013 Macintosh, but it was an older version that lacked the necessary output device (txtwrite). So I downloaded and compiled the latest version, Ghostscript 9.18, and was able to run:

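#!/usr/bin/python

import glob
import subprocess

# Convert every PDF in the current directory to plain text.
for f in glob.glob('*.pdf'):
    # Passing the arguments as a list keeps file names with spaces intact.
    subprocess.call(['gs', '-sDEVICE=txtwrite', '-o', f + '.txt', f])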

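This program converts every PDF file in the current directory to plain text, writing each result next to the original with .txt appended to the name (for example, a file named statement.pdf would become statement.pdf.txt)!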
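
Since my older Ghostscript lacked the txtwrite device, it can be handy to check for it before converting anything. Here is a minimal sketch, assuming that `gs -h` lists the compiled-in devices as whitespace-separated names (that is how my build prints them, but I have not checked every version). The helper names `has_device` and `gs_help_text` are my own inventions for this example:

```python
import subprocess

def has_device(help_text, device='txtwrite'):
    """Return True if `device` appears as a whole word in the text of `gs -h`."""
    return device in help_text.split()

def gs_help_text():
    """Run `gs -h` and return its output as text (needs gs on the PATH)."""
    return subprocess.check_output(['gs', '-h']).decode('utf-8', 'replace')

# Usage, on a machine with Ghostscript installed:
#     print(has_device(gs_help_text()))
```

If this reports False, the conversion above will most likely fail with an unknown device error, which is the situation I was in before upgrading.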

Then, to pull out the data I was looking for, a lovely Unix-y command-line pipeline did the trick:

grep Rain *.txt | awk '{ print $3 }' | ~/bin/add.py 

I was looking for the transactions starting with 'Rain'; the amounts were in the third whitespace-separated field, and the final program in the chain is a simple adder:

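#!/usr/bin/python
import sys

# Sum the numbers arriving on standard input, one per line.
n = 0.0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # awk prints an empty line when the third field is missing
    n += float(line)

print(n)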
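
For anyone who would rather skip the shell, the whole grep, awk and add chain can be sketched in one Python file. This is only a sketch under a few assumptions of mine: the function names `sum_matching` and `sum_files` are made up for this example, the amounts may carry a dollar sign or thousands separators, and Decimal is swapped in for float so that cents add up exactly. Note that the field number here counts columns within the line itself. It may differ from awk's $3, because grep prefixes each match with the file name when it is given several files.

```python
import glob
from decimal import Decimal

def sum_matching(lines, keyword='Rain', field=3):
    """Add up the `field`-th whitespace-separated column of lines containing `keyword`."""
    total = Decimal('0')
    for line in lines:
        if keyword not in line:
            continue
        parts = line.split()
        if len(parts) < field:
            continue  # not enough columns on this line; skip it
        # Assumption: strip a dollar sign and thousands separators if present.
        amount = parts[field - 1].replace(',', '').replace('$', '')
        total += Decimal(amount)
    return total

def sum_files(pattern='*.txt', keyword='Rain', field=3):
    """Apply sum_matching to every file matching `pattern` and add the results."""
    total = Decimal('0')
    for name in glob.glob(pattern):
        with open(name) as fh:
            total += sum_matching(fh, keyword, field)
    return total

# Usage, in the directory holding the converted statements:
#     print(sum_files())
```

A column that is not a number will raise an error rather than being silently ignored, which seems like the safer behavior for tax figures.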

Hope this is helpful!