(Warning: I used an old version of PdfBox: 0.7.3.)
PdfBox is a very popular Java library for creating and manipulating PDF files. It can also extract text from existing PDF files. PdfBox is distributed as a jar file.
I wanted to use it on Google App Engine (Java version) to extract text from a particular area of a page in a PDF file. PdfBox supports that: the class to use is PDFTextStripperByArea. I tried it, but GAE blocked me because PDFTextStripperByArea uses JRE classes that are not allowed, in particular java.awt.Rectangle and Rectangle2D. GAE applies a "white list" approach: only a subset of the standard JRE classes is allowed to run on GAE, and 99% of java.awt.* is blocked. See http://code.google.com/appengine/docs/java/jrewhitelist.html
There is another problem as well. During text extraction PdfBox uses a temp file, which by default is created on the file system. GAE also blocks access to the file system.
My solution was:
- use my own Rectangle instead of java.awt.Rectangle
- use an "in-memory" temp file
My own Rectangle
I created my own Rectangle and Rectangle2D classes. My implementation is not complete compared to the AWT one: I only implemented the fields and methods that were required.
Then I created a new PDFTextStripperByArea: PDFTextStripperByAreaGAE. I did not modify the original PDFTextStripperByArea because I didn't want to break compatibility with the PdfBox library.
The new class only uses my Rectangle. There are no more references to java.awt, so GAE now allows it to run.
The new PDFTextStripperByAreaGAE is otherwise identical to the old PDFTextStripperByArea. The only difference is the use of my Rectangle instead of java.awt.Rectangle. I copied and pasted 99% of the original code.
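To give an idea of how small such a replacement can be, here is a minimal sketch. It is not the class from my project: the name GaeRectangle, the choice of double fields and the single contains method are assumptions made for illustration, on the guess that a point-in-rectangle test is the main thing an area-based stripper needs.

```java
// Hypothetical minimal stand-in for java.awt.Rectangle (illustrative only).
// It has no dependency on java.awt, so it avoids the blocked classes.
class GaeRectangle {
    final double x;      // left edge
    final double y;      // top edge
    final double width;
    final double height;

    GaeRectangle(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True when the point is inside the rectangle.
    // Left/top edges are inclusive, right/bottom edges are exclusive.
    boolean contains(double px, double py) {
        return px >= x && py >= y && px < x + width && py < y + height;
    }
}
```

With this sketch, new GaeRectangle(26, 86, 62, 10).contains(50, 90) is true, while a point at x = 88 (the right edge) is outside.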
Temp file in memory
PdfBox uses the file system by default, but you can force it to use an "in-memory" buffer instead. PdfBox ships with org.pdfbox.io.RandomAccessBuffer, so I use that.
byte[] pdfBytes; // contains the bytes of the Pdf file
RandomAccessBuffer tempMemBuffer = new RandomAccessBuffer();
PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes), tempMemBuffer);
PDFTextStripperByAreaGAE sa = new PDFTextStripperByAreaGAE();
sa.addRegion("Area1", new Rectangle(26, 86, 62, 10));
sa.addRegion("Area2", new Rectangle(99, 86, 94, 14));
...
PDPage p = (PDPage) doc.getDocumentCatalog().getAllPages().get(0); // page 1
sa.extractRegions(p);
String area1 = sa.getTextForRegion("Area1");
String area2 = sa.getTextForRegion("Area2");
...
doc.close();
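The snippet above assumes pdfBytes is already filled. As a rough sketch of one way to get there without touching the file system, the bytes can be read from an InputStream (for example an uploaded file) into memory. The class name PdfBytes and the method readAll are hypothetical helpers, not part of PdfBox; only standard java.io classes are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of PdfBox): copies a stream into a byte array
// using only in-memory buffers, so no temp file is needed.
class PdfBytes {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

The result can then be passed to PDDocument.load through a ByteArrayInputStream, as shown above. Since the whole file sits in memory, this only makes sense for small PDF files.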
Live demo
http://fhtino.appspot.com/PdfBoxGAE/demo.jsp
(Please use small PDF files.)