Poor obfuscation implementation

POI stands for “Poor Obfuscation Implementation”, or at least that was the original meaning of the acronym. The Apache POI project contains pure Java ports for the Microsoft file formats that are based on the OLE 2 Compound Document Format.

There are several sub-projects under POI.

  • POIFS is the implementation of the OLE 2 Compound Document format; it is the basic API that all the other POI projects build on.
  • HSSF is the implementation of the Excel ’97 (.xls) file format, and XSSF is the implementation of the Excel 2007 OOXML (.xlsx) file format.
  • HPSF is the implementation for reading the so-called property set streams; essentially, this is the Document Summary that one can find in the properties of Microsoft Office files.
  • HWPF is the implementation of the Microsoft Word 97(-2007) file format. This is my favorite one: with this API you can read and write MS Word documents. Unfortunately, this project is headless for the moment. Note that HWPF does not support the docx file format (MS Word 2007).
  • HSLF is the implementation for MS PowerPoint files; it provides a way to read, create or modify PowerPoint presentations.
  • HSMF is the implementation of the Outlook MSG format.
  • HDGF is the implementation of the Visio file format; it provides the ability to read (only) the low-level contents of Visio files.
  • HPBF is the implementation for MS Publisher files. It is at an early stage and provides the ability to read parts of Publisher files; writing is not supported yet.
  • The POI library can also be compiled as a Ruby extension.

The acronyms of this project are really funny:
POIFS – Poor Obfuscation Implementation File System
HSSF – Horrible SpreadSheet Format
HWPF – Horrible Word Processor Format
and so on…

As I said before, HWPF is my favorite: with the API that this project provides we can manipulate the content of MS Word documents. Whether we like it or not, a huge amount of data has been written and saved in MS Word format. I am sure that there are a lot of use cases where you would want to access this data through a Java API. Using these Java ports we can avoid hacks and provide elegant solutions.

Two of my favorite use cases are:
1. Translating a Word document automatically using Google translator (the translation is horrible, but it is much better than nothing).
2. Paragraph alignment in bilingual text.

Let's see the first use case in a very simple implementation. The steps of the algorithm are:
a. Extract the paragraphs from the Word document.
b. Translate the paragraphs using Google translator (or any proprietary translator that you may have).
c. Insert the translated text in a new document.

Here is the code for the first use case:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.poi.hwpf.extractor.WordExtractor;
import com.google.api.translate.Language;
import com.google.api.translate.Translate;

public class Translator {
	private WordExtractor extractor;
	private String destFileName;

	public Translator(String sourceFileName, String destFileName) throws IOException {
		this.destFileName = destFileName;
		extractor = new WordExtractor(new FileInputStream(sourceFileName));
	}

	/**
	 * #a. extract the paragraphs from the word document.
	 */
	public String[] extractParagraphs() {
		return extractor.getParagraphText();
	}

	/**
	 * #b. translate the paragraphs using google translator.
	 */
	public String translate(String[] paragraphs, String sourceLanguage, String destLanguage)
			throws Exception {
		StringBuilder container = new StringBuilder();
		for (String text : paragraphs) {
			System.out.println(text);
			container.append(text);
		}
		String result = Translate.translate(container.toString(), sourceLanguage, destLanguage);
		System.out.println(result);
		return result;
	}

	/**
	 * #c. insert the translated text in a new document.
	 * Note: this writes plain text to destFileName, not the binary Word format.
	 */
	public void insertText(String text) throws IOException {
		OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(destFileName));
		try {
			out.write(text);
		} finally {
			// Always release the file handle, even if the write fails.
			out.close();
		}
	}

	public static void main(String[] args) throws Exception {
		if (args == null || args.length < 2) {
			System.out.println("Usage: Translator <sourceFile> <destFile>");
			// Non-zero exit code signals incorrect usage.
			System.exit(1);
		}
		Translator tr = new Translator(args[0], args[1]);
		// a.
		String[] paragraphs = tr.extractParagraphs();
		// b.
		String translatedText = tr.translate(paragraphs,
								Language.ENGLISH,
								Language.FRENCH);
		// c.
		tr.insertText(translatedText);

		System.out.println("\nDone!");
	}
}
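
One thing to keep in mind about translate() above is that it glues all the paragraphs into a single string and sends it in one call, so the paragraph structure may not survive the round trip. A minimal sketch of an alternative, translating paragraph by paragraph, could look like the following. The names TextTranslator and ParagraphTranslator are hypothetical and exist only for this illustration; they are not part of POI or of the Google API. A real adapter would wrap the Translate.translate(...) call and would also have to deal with the exception that call can throw.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over a translation service (not part of POI or the Google API). */
interface TextTranslator {
    String translate(String text, String from, String to);
}

/** Translates paragraphs one at a time so that the paragraph structure is kept. */
final class ParagraphTranslator {
    private final TextTranslator translator;

    ParagraphTranslator(TextTranslator translator) {
        this.translator = translator;
    }

    List<String> translateAll(String[] paragraphs, String from, String to) {
        List<String> result = new ArrayList<String>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            // Keep empty paragraphs as empty strings instead of sending them to the service.
            result.add(trimmed.length() == 0 ? "" : translator.translate(trimmed, from, to));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in translator for demonstration only; a real one would call the translation API.
        TextTranslator fake = new TextTranslator() {
            public String translate(String text, String from, String to) {
                return "[" + to + "] " + text;
            }
        };
        String[] paragraphs = { "Hello world.\r\n", "\r\n", "Second paragraph.\r\n" };
        for (String line : new ParagraphTranslator(fake).translateAll(paragraphs, "en", "fr")) {
            System.out.println(line);
        }
    }
}
```

The result has one entry per input paragraph, so it can be written out with whatever separator the target document needs.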

The second use case is a bit more complicated but equally easy. For legal reasons I cannot post the code… sorry, maybe it is a good exercise ;) for you.
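
As a starting point for that exercise, here is a deliberately naive sketch. It is not the code mentioned above, only an assumption about the simplest thing that could work: drop the empty paragraphs from both texts and pair the rest by position. The class ParagraphAligner and its methods are hypothetical names; the input arrays stand for what extractParagraphs() would return for each document.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive positional alignment of two paragraph lists (illustrative sketch only). */
final class ParagraphAligner {

    /** Trims every paragraph and drops those that contain only whitespace. */
    static List<String> nonEmpty(String[] paragraphs) {
        List<String> kept = new ArrayList<String>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (trimmed.length() > 0) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    /** Pairs paragraphs by position; the shorter side is padded with empty strings. */
    static List<String[]> align(String[] source, String[] target) {
        List<String> s = nonEmpty(source);
        List<String> t = nonEmpty(target);
        List<String[]> pairs = new ArrayList<String[]>();
        int n = Math.max(s.size(), t.size());
        for (int i = 0; i < n; i++) {
            String left = i < s.size() ? s.get(i) : "";
            String right = i < t.size() ? t.get(i) : "";
            pairs.add(new String[] { left, right });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] en = { "Hello.\r\n", "\r\n", "How are you?\r\n" };
        String[] fr = { "Bonjour.\r\n", "Comment allez-vous ?\r\n" };
        for (String[] pair : align(en, fr)) {
            System.out.println(pair[0] + " | " + pair[1]);
        }
    }
}
```

Positional pairing breaks as soon as one side splits or merges a paragraph, which is where the real work of the exercise begins.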

You can find a lot more information about the POI project @ http://poi.apache.org/
and about the Google Java translation API @ http://code.google.com/p/google-api-translate-java/

To run the code just type: java Translator c:\mydoc_en.doc c:\mydoc_fr.doc

Note that in order to compile and run the code, the following jars must be on the classpath; you can find them at the sites listed above:
poi-3.1.jar
poi-scratchpad-3.1-FINAL-20080629.jar
google-api-translate-java-0.5.jar
