9/30/09

Hacking Apache POI

In the near future, presumably, I'll have to read and process complex xlsx documents containing a lots of formulas. Jexcelapi can't even open those kinds documents, so I turned to Apache POI. Unfortunately, POI also have problems with some formulas becase support of this field is not fully laboured.

[Update 2010.03.10: After some experimentations I found that Apache POI can read and process new Excel (xlsx) documents including calculating very complex formulas: Basically, cell.getStringCellValue() gives the evaluated outcome. This way, I didn't have to hack POI. That's great!]

I choosed to start two methods:
  1. Knowing exact static structure of the Excel template document, I do problematic calculations in the reader program itself, eliminating POI's formula support.
  2. I study Apache POI's internal structure and try to make up the missing functionality. Unfortunately, POI isn't amendable in a non-invasive way with registering components at certain extension points, so I had to dig in the source code in order to make modifications. Here are the basic experiences regarding setting up the POI sources locally:

Sources can be obtained from subversion http://svn.apache.org/repos/asf/poi/trunk/ or from direct download. There are some "how to build info" on the homepage but it may be useful to record my own experiences, because there are some tricks. Downloaded source has pdf and html docs, but SVN repo contains those in xml format which can be converted by running docs Ant target after installing Apache Forrest v0.5.1 (unzip & set FORREST_HOME system variable to the forrest's root directory). I had to edit forrest.build.xml at line 630, changing failonerror value to false, otherwise the build failed. Place of generated docs is written on the console after a successful build.

For running dist target, JDepend jar have to be on the classpath and setting -Xmx512m VM argument is needed. It completes in 7 minutes for me but no need to run it regularly, just to see if everything can go smoothly. Target jar completes in 45 seconds from scratch and makes all needed jars.

For building with an interactive IDE, add jars to the project from lib and ooxml-lib directories. Setting up JUnit3 may be also needed for running test cases. When running unit tests, add test-data directory to the classpath or specify POI.testdata.path system property.

Target maven-poms creates jar and POM files for Maven and a Unix shell script to install the artifacts in the build/dist directory. I created a DOS batch file with commands like this:
call mvn install:install-file -Dfile=poi-scratchpad-@VERSION@.jar -DpomFile=poi-scratchpad-@VERSION@.pom
I ommited date timestamp info from the artifacts' name. build.xml must have been modified in the jar target to ommit timestamp postfixes. maven-poms also must have been modified to copy the new maven-deploy batch file. version.id property may be modified in build.xml to flag the forked version.

Apache POI is now ready to be hacked.