(Warning: I used an old version of PdfBox: 0.7.3.)
PdfBox is a very popular Java library for creating and managing PDF files. It can also extract text from existing PDF files. PdfBox is published as a jar file.
I'd like to use it on Google App Engine (Java version) to extract text from a particular area of a page of a PDF file. PdfBox allows that: the class to use is PDFTextStripperByArea. I tried it, but GAE blocked me: PDFTextStripperByArea uses JRE classes that are not allowed, in particular java.awt.Rectangle and java.awt.geom.Rectangle2D. GAE applies a "white list" approach: only a subset of the standard JRE classes is allowed to run on GAE, and 99% of java.awt.* is blocked. http://code.google.com/appengine/docs/java/jrewhitelist.html
There is another problem as well. During text extraction PdfBox uses a temp file, which by default is created on the file system. GAE also blocks access to the file system.
My solution was:
- use my own Rectangle instead of java.awt.Rectangle
- use an "in memory" temp file
My own Rectangle
I created my own Rectangle and Rectangle2D classes. My implementation is not complete compared to the AWT one: I only created the fields and methods that text extraction requires.
Then I created a new PDFTextStripperByArea: PDFTextStripperByAreaGAE. I did not modify the original PDFTextStripperByArea because I didn't want to break compatibility with the PdfBox library.
The new class only uses my Rectangle and has no references to java.awt, so GAE now allows it to run.
Apart from that, PDFTextStripperByAreaGAE is identical to the original PDFTextStripperByArea. I copied and pasted 99% of the original code.
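A minimal sketch of what such a replacement class might look like. This is not the actual code from the modified library: the fields, the getters and the contains method are assumptions about what area-based text extraction needs, and the edge handling (left/top inclusive, right/bottom exclusive) simply mirrors the behaviour of java.awt.Rectangle.

```java
// Hypothetical stand-in for java.awt.Rectangle with no AWT dependency.
// Only the members assumed to be needed for region tests are provided.
final class Rectangle {
    private final int x;
    private final int y;
    private final int width;
    private final int height;

    Rectangle(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    int getX() { return x; }
    int getY() { return y; }
    int getWidth() { return width; }
    int getHeight() { return height; }

    // Returns true when the point (px, py) lies inside the rectangle.
    // The left and top edges are inside; the right and bottom edges are not.
    // An empty rectangle (zero or negative size) contains nothing.
    boolean contains(double px, double py) {
        return width > 0 && height > 0
                && px >= x && py >= y
                && px < x + width && py < y + height;
    }
}
```

With a class like this, a call such as new Rectangle(26, 86, 62, 10) compiles without touching java.awt, and the stripper can decide whether a character belongs to a region by passing the character position to contains.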
Temp file in memory
PdfBox uses the file system by default, but you can force it to use an "in memory" buffer instead. PdfBox ships with org.pdfbox.io.RandomAccessBuffer, so I use that.
byte[] pdfBytes; // contains the bytes of the Pdf file
RandomAccessBuffer tempMemBuffer = new RandomAccessBuffer();
PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes), tempMemBuffer);
PDFTextStripperByAreaGAE sa = new PDFTextStripperByAreaGAE();
sa.addRegion("Area1", new Rectangle(26, 86, 62, 10));
sa.addRegion("Area2", new Rectangle(99, 86, 94, 14));
...
PDPage p = (PDPage) doc.getDocumentCatalog().getAllPages().get(0); // page 1
sa.extractRegions(p);
String area1 = sa.getTextForRegion("Area1");
String area2 = sa.getTextForRegion("Area2");
...
doc.close();
Live demo
http://fhtino.appspot.com/PdfBoxGAE/demo.jsp
(please, use small pdf files)
15 comments:
Very nice. Would you mind sharing the modified jar file ?
Thanks
@Irwin, I hope to publish it in the next days/weeks. If you need it before, contact me.
Hi,
very very good,
I have been looking for this for 2 weeks with no result. Can you please share this lib?
thanks
My email : Guerzizeb@Gmail.com
hey, where can i download the modified jar file?
nice work!
Pablo
@Pablo, contact me.
www.fhtino.it
Hi fhtino,
nice job!
is there a way to get the jar or to publish the source code?
thanks in advance,
nicolas
@Nicolas and other,
contact me at www.fhtino.it
Hi Fabrizio,
I need to use the PDPage class in GAE. Unfortunately, it doesn't work, as this class makes heavy use of java.awt. I would like to try your approach of replacing the java.awt classes.
Could you please show me your example with the Rectangle class? Do you know if the source code of the java.awt classes can be found somewhere?
Thanks in advance for your help.
Pascal
@Pascal,
my approach only works for text extraction. Only a couple of awt classes are used by text extraction. So it was simple to create my own classes.
What are you trying to do? Text extraction or other?
I'm just trying to get one image (JPG or other format) for each page of a document and store the images in the GAE Datastore. PdfBox would be a good solution, except that its extensive use of AWT is not compatible with GAE...
If you have an idea...
Thanks a lot.
Hi All,
I'm trying to extract text from a PDF with PDFBox, but the extracted text is out of order, and when I force stripper.setSortByPosition(true) I get incomprehensible characters.
Any idea??
thx
Hi,
how can I find the coordinates of a specific piece of text in the PDF file, so that I can pass them to Rectangle?
Thanks
@Nuno: Rectangle requires coordinates in "PDF units". One PDF unit = 1/72 of an inch = 0.3528 mm.
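The conversion can be sketched as a small helper. The class name PdfUnits and its method names are made up for this illustration and are not part of PdfBox; the only facts used are 72 units per inch and 25.4 mm per inch.

```java
// Hypothetical helper for converting physical lengths to PDF units and back.
final class PdfUnits {
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    private PdfUnits() {
        // Utility class, not meant to be instantiated.
    }

    // Inches to PDF units.
    static double fromInches(double inches) {
        return inches * UNITS_PER_INCH;
    }

    // Millimetres to PDF units.
    static double fromMillimeters(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    // PDF units to millimetres.
    static double toMillimeters(double units) {
        return units / UNITS_PER_INCH * MM_PER_INCH;
    }
}
```

For example, a distance of 10 mm measured on the printed page corresponds to roughly 28.3 units, which would then be rounded to an integer before being passed to Rectangle.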
Hi Fabrizio,
I want to extract the text from a specific coordinate, but unfortunately the PDPage class doesn't work in GAE, as this class makes heavy use of java.awt. I would like to try your approach of replacing the java.awt classes. Can you show your source code?
Thanks
@chidhambaram contact me at www.fhtino.it