Parsing PDFs in Python with Tika

A few months ago, one of my friends asked me if I could help him extract some data from a collection of PDFs. The PDFs contained records of his financial transactions over a period of years and he wanted to analyze them. Unfortunately, Excel and plain text versions of the files were no longer available, so the PDFs were his only option.

I reviewed a few Python-based PDF parsers and decided to try tika, a Python port of Apache Tika. It parsed the PDFs quickly and accurately. I extracted the data my friend needed and sent it to him in CSV format so he could analyze it with the program of his choice. Tika was so fast and easy to use that I decided to write a blog post about parsing PDFs with it.
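The workflow described above (extract text from a PDF, pull out the records, write CSV) could look roughly like the sketch below. It is not the code used for the original project. It assumes the tika package is installed and a Java runtime is available, and the one-transaction-per-line layout, the column names, and the helper names extract_text, text_to_rows, and rows_to_csv are all hypothetical, chosen only for illustration.

```python
import csv
import io


def extract_text(pdf_path):
    """Return the plain text Tika extracts from a PDF (empty string if none)."""
    # Imported lazily so the helpers below work without tika installed.
    # Assumption: the tika package and a Java runtime are available.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None when Tika finds no extractable text.
    return parsed.get("content") or ""


def text_to_rows(text):
    """Split text into (date, description, amount) tuples.

    Hypothetical layout, one transaction per line:
        2016-03-01 Grocery store 42.10
    Lines with fewer than three whitespace-separated fields are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        rows.append((parts[0], " ".join(parts[1:-1]), parts[-1]))
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With a real file, the pieces chain together as rows_to_csv(text_to_rows(extract_text("statement.pdf"))), where statement.pdf is a placeholder path. Real statements rarely follow such a clean layout, so text_to_rows is the part that would need adapting to each document.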
Submitted by elementlist on Jan 13, 2017