Tuesday, April 20, 2010

PdfBox text extraction & GAE

How to extract text from PDF files using PdfBox on Google App Engine

(Warning: I used an old version of PdfBox: 0.7.3.)

PdfBox is a very popular Java library for creating and managing PDF files. It can also extract text from existing PDF files. PdfBox is published as a jar file.
I wanted to use it on Google App Engine (Java version) to extract text from a particular area of a page of a PDF file. PdfBox allows that: the class to use is PDFTextStripperByArea. I tried it, but GAE blocked me because PDFTextStripperByArea uses JRE classes that are not allowed, in particular java.awt.Rectangle and Rectangle2D. GAE applies a "white list" approach: only a subset of the standard JRE classes is allowed to run on GAE, and 99% of java.awt.* is blocked. See http://code.google.com/appengine/docs/java/jrewhitelist.html
There is another problem as well. During text extraction PdfBox uses a temp file, which by default is created on the file system. GAE also blocks access to the file system.

My solution was:
  • use my own Rectangle instead of java.awt.Rectangle
  • use an "in memory" temp file
The first point required modifying and recompiling PdfBox.

My own Rectangle

I created my own Rectangle and Rectangle2D classes. My implementation is not complete compared to the AWT one: I only created the fields and methods that were required.
Then I created a new PDFTextStripperByArea: PDFTextStripperByAreaGAE. I did not modify the original PDFTextStripperByArea because I didn't want to break compatibility with the PdfBox library.
The new class only uses my Rectangle, with no more references to java.awt, so GAE now allows it to run.
The new PDFTextStripperByAreaGAE is otherwise identical to the old PDFTextStripperByArea: the only difference is the use of my Rectangle instead of java.awt.Rectangle. I copied and pasted 99% of the original code.
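
To give a rough idea of what such a replacement can look like, here is a minimal sketch. It is not the class from my modified PdfBox: the name SimpleRectangle is hypothetical, and it assumes that the stripper mostly needs to store the region and test whether a text position falls inside it.

```java
// Hypothetical stand-in for java.awt.Rectangle (illustration only,
// not the class shipped in the modified PdfBox jar).
// It has no dependency on java.awt, so it is not affected by the GAE white list.
final class SimpleRectangle {
    final double x;
    final double y;
    final double width;
    final double height;

    SimpleRectangle(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True when the point lies inside the rectangle.
    // The left/top edges are inclusive and the right/bottom edges exclusive,
    // which mirrors the behaviour of the AWT contains(x, y) method.
    boolean contains(double px, double py) {
        return width > 0 && height > 0
                && px >= x && py >= y
                && px < x + width && py < y + height;
    }
}
```

A region stripper built on this would only need to call contains(...) with the position of each piece of text to decide which region it belongs to.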

Temp file in memory

PdfBox uses the file system by default, but you can force it to use an "in memory" buffer. PdfBox ships with org.pdfbox.io.RandomAccessBuffer, which is what I use.


import java.io.ByteArrayInputStream;
import org.pdfbox.io.RandomAccessBuffer;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDPage;

byte[] pdfBytes; // contains the bytes of the Pdf file
// Keep the PdfBox temp data in memory instead of on the file system
RandomAccessBuffer tempMemBuffer = new RandomAccessBuffer();
PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes), tempMemBuffer);
PDFTextStripperByAreaGAE sa = new PDFTextStripperByAreaGAE();
// Rectangle here is my own class, not java.awt.Rectangle
sa.addRegion("Area1", new Rectangle(26, 86, 62, 10));
sa.addRegion("Area2", new Rectangle(99, 86, 94, 14));
...
PDPage p = (PDPage) doc.getDocumentCatalog().getAllPages().get(0); // page 1
sa.extractRegions(p);
String area1 = sa.getTextForRegion("Area1");
String area2 = sa.getTextForRegion("Area2");
...
doc.close();

Live demo

http://fhtino.appspot.com/PdfBoxGAE/demo.jsp

(Please use small PDF files.)

15 comments:

Irwin said...

Very nice. Would you mind sharing the modified jar file ?

Thanks

Fabrizio said...

@Irwin, I hope to publish it in the next few days or weeks. If you need it sooner, contact me.

Guerzize Brahim said...

Hi,

very very good,

I have been looking for this for 2 weeks with no result. Can you please share this lib?

thanks
My email : Guerzizeb@Gmail.com

Anonymous said...

hey, where can i download the modified jar file?
nice work!
Pablo

Fabrizio said...

@Pablo, contact me.
www.fhtino.it

nicolas maisonneuve said...

Hi fhtino,
nice job!
is there a way to get the jar or to publish the source code?
thanks in advance,
nicolas

Fabrizio said...

@Nicolas and others,
contact me at www.fhtino.it

Unknown said...

Hi Fabrizio,

I need to use the PDPage class on GAE. Unfortunately, it doesn't work because this class uses java.awt intensively. I would like to try your approach of replacing the java.awt classes.

Could you please show me your example with the Rectangle class? Do you know if the source code of the java.awt classes can be found somewhere?

Thanks in advance for your help.

Pascal

Fabrizio said...

@Pascal,
my approach only works for text extraction. Only a couple of awt classes are used by text extraction. So it was simple to create my own classes.
What are you trying to do? Text extraction or other?

Unknown said...

I'm just trying to get one image (JPG or other format) for each page of the documents and store them in the GAE Datastore. PdfBox was a good solution, except for this extensive use of AWT, which is not compatible with GAE...

If you have an idea...

Thanks a lot.

Anonymous said...

Hi All,
I'm trying to extract text from a PDF with PDFBox, but the extracted text is out of order, and when I force stripper.setSortByPosition(true) I get incomprehensible characters.
Any idea??
thx

Unknown said...

Hi,
how can I find the coordinates of a specific piece of text in the PDF file, to pass to Rectangle?

Thanks

Fabrizio said...

@Nuno: Rectangle requires coordinates in "pdf units". One pdf unit = 1/72 of an inch ≈ 0.3528 mm
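
If you measure the area on a printed page with a ruler, the conversion can be sketched as follows. This is only an illustration of the arithmetic (1 inch = 25.4 mm = 72 units); the class and method names are hypothetical and not part of PdfBox.

```java
// Hypothetical helper for converting between millimetres and PDF units.
// It only encodes the relation 1 inch = 25.4 mm = 72 PDF units.
final class PdfUnits {
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    private PdfUnits() {
    }

    // Millimetres measured on the page -> PDF units for Rectangle.
    static double mmToUnits(double mm) {
        return mm * UNITS_PER_INCH / MM_PER_INCH;
    }

    // PDF units -> millimetres, useful to check a region against the page.
    static double unitsToMm(double units) {
        return units * MM_PER_INCH / UNITS_PER_INCH;
    }
}
```

For example, PdfUnits.mmToUnits(25.4) gives 72 units, so an area that starts one inch from the edge starts at coordinate 72.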

chidhambaram said...

Hi Fabrizio,

I want to extract the text from a specific coordinate, but unfortunately the PDPage class doesn't work on GAE because it uses java.awt intensively. I would like to try your approach of replacing the java.awt classes. Can you show your source code?

Thanks

Fabrizio said...

@chidhambaram contact me at www.fhtino.it