Copyright © 2007 Gene Michael Stover. All rights reserved. Permission to copy, store, & view this document unmodified & in its entirety is granted.
My brother works on the GNU PSPP project, & he asked me to figure out the format of data files for SAS, a competing application.
Here's how I did it, complete with source code for programs I wrote while I deciphered the file format.
If I do most of the deciphering work in my own head, then I have to document my technique & my calculations. Then if we ever need to perform the deciphering again, someone must update the ideas, then calculate again, & success will depend on the mind of the person doing the calculations.
If I write a program which does the calculations & produces the answer, then I still have to document my technique & how the program works, but the program itself documents the process, & the program in combination with the input files can reproduce the calculations. If we ever need to improve the technique, we improve the program & run it. What's more, success at run-time does not depend on the mind of the person running the program. Success still depends on the minds which created the program, but it seems less dependent than if a human performs all the calculations.
What's more, if I write a program which does the deciphering, then anyone can verify that the program works by running it. We can verify its answer by running it on other input data sets.
I hope that lots of people write lots of programs which automatically decipher lots of file formats. It could help remove the obstacles created by undocumented file formats.
Files in formats which are documented are more useful because programmers can write their own programs to process the data. If the only way to access a file is through a single application, then if someone wants to perform their own calculations on the data, the application's creator must add hooks for general-purpose processing. This bloats the application & makes it less reliable, more expensive, & less fun to use. Jeez, it just plain sucks. But if other programs can use the files, then the original application can remain minimalist & do what it does best.
If it became common for people to write programs which decoded file formats, a lot of file formats would cease to be undocumented.
Yep, some evil entities would respond by encrypting their files, but that extreme gesture might help more people wake up to the costs of undocumented file formats. And who knows? Some clever programmer of free software might even decode the new file format in spite of the encryption.
I'm no cryptanalyst, so even though this is a simple problem in cryptanalytical terms (a chosen-plaintext attack in which all messages are enciphered with the same key), I'm sure my techniques are less than optimal. I don't apologize for this.
>manifest.lisp)''
to create manifest.lisp, a description of the data sets in Lisp.
(defstruct manifest
  (pn nil :type pathname)
  ;; (n-vals 1 :type (integer 1))
  (n-vars 1 :type (integer 1))
  ;; var-1-type WE DON'T USE THIS FIELD
  var-1-value
  ;; var-2-type WE DON'T USE THIS FIELD
  var-2-value
  (var-1-name "" :type string)
  var-2-name                    ; a string or NIL
  sas)                          ; octets from SAS file
(defvar *default-manifest-pathname*
  (make-pathname :directory '(:relative "files-2")
                 :name "manifest" :type "lisp"))
(defun read-manifest (strm)
  "Return the next MANIFEST from the input stream
or NIL."
  (declare (type stream strm))
  (assert (input-stream-p strm))
  (assert (open-stream-p strm))
  (assert (eq 'character (stream-element-type strm)))
  (let ((x (read strm nil)))
    (if x
        (let ((pn (make-pathname :directory '(:relative "files-2")
                                 :name (string-downcase (sixth x))
                                 :type "sas7bdat")))
          (make-manifest
           :pn pn
           :n-vars (first x)
           :var-1-value (if (equal "float" (third x))
                            (read-from-string (second x))
                            (second x))
           :var-2-value (cond ((equal "NONE" (fourth x))
                               ;; There is no second variable;
                               ;; there's just one variable.
                               nil)
                              ((equal "float" (fifth x))
                               (read-from-string (fourth x)))
                              (t (fourth x)))
           :var-1-name (seventh x)
           :var-2-name (eighth x)
           ;; The manifest file isn't entirely
           ;; accurate. Some of the pathnames
           ;; are krap. The easiest way to deal
           ;; with that is to ignore it. It's
           ;; one of the gillions of examples of
           ;; how the person who creates the data
           ;; file must be given an automaton which
           ;; consumes the data so the person can
           ;; tell when the data file is correct.
           :sas (with-open-file (sas pn :element-type '(unsigned-byte 8))
                  (coerce
                   (loop for y = (read-byte sas nil)
                         until (null y)
                         collect y)
                   '(array (unsigned-byte 8) (*))))))
        ;; else, It's end of file.
        nil)))
(defun load-manifest (&optional (pn *default-manifest-pathname*))
  (with-open-file (strm pn)
    (loop for x = (read-manifest strm)
          until (null x)
          collect x)))
lisp> (defvar *manifest* (load-manifest))
*MANIFEST*
lisp> (length *manifest*)
844
lisp> (count nil *manifest* :key #'manifest-sas)
0
Good.
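To make the shape of a manifest entry concrete, here is a small sketch. Judging only from the accessors READ-MANIFEST uses (FIRST through EIGHTH), each entry appears to be a list of eight elements. The entry text below is invented for illustration & is not taken from the real manifest.lisp; the names *SAMPLE-ENTRY-TEXT* & PARSE-SAMPLE-ENTRY are hypothetical, too. The function repeats READ-MANIFEST's field logic but returns a property list & never opens a SAS file, so it can run on its own.

```lisp
;; A made-up manifest entry (NOT from the real manifest.lisp).
;; Field order is inferred from READ-MANIFEST:
;;   n-vars, value 1, type 1, value 2, type 2,
;;   file name, variable 1 name, variable 2 name
(defparameter *sample-entry-text*
  "(2 \"1.5\" \"float\" \"hello\" \"string\" \"FILE0001\" \"x\" \"y\")")

(defun parse-sample-entry (text)
  "Read one entry from TEXT & return a property list of its fields."
  (let ((x (with-input-from-string (strm text)
             (read strm nil))))
    (list :n-vars (first x)
          ;; Values arrive as strings; floats are parsed by the reader.
          :var-1-value (if (equal "float" (third x))
                           (read-from-string (second x))
                           (second x))
          :var-2-value (cond ((equal "NONE" (fourth x)) nil)
                             ((equal "float" (fifth x))
                              (read-from-string (fourth x)))
                             (t (fourth x)))
          ;; READ-MANIFEST downcases this to build the pathname.
          :file-name (string-downcase (sixth x))
          :var-1-name (seventh x)
          :var-2-name (eighth x))))
```

Calling (parse-sample-entry *sample-entry-text*) gives a list in which the first value has become a number while the second stays a string, which is the distinction COUNT-TECHNIQUE later relies on when it is given NUMBERP as its test.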
(defun count-technique (technique test key &optional (lst *manifest*))
  (declare (type function technique test key)
           (type sequence lst))
  (count-if #'(lambda (m)
                (declare (type manifest m))
                (let ((value (funcall key m)))
                  (and (funcall test value)
                       (search (funcall technique value)
                               (manifest-sas m)))))
            lst))
lisp> (load "src/loadall.lisp")
T
lisp> (import 'com.cybertiggyr.gene.ff0:encode-ieee-single-float)
T
lisp> (count-technique #'encode-ieee-single-float
                       #'numberp
                       #'manifest-var-1-value)
170
lisp> (count-technique #'(lambda (x)
                           (reverse (encode-ieee-single-float x)))
                       #'numberp
                       #'manifest-var-1-value)
175
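The source of ENCODE-IEEE-SINGLE-FLOAT isn't shown here, so the following is only a sketch of what an encoder of that kind might look like, not the function from the com.cybertiggyr.gene.ff0 package. The names SKETCH-ENCODE-SINGLE-FLOAT & SKETCH-FIND-FLOAT are hypothetical. The sketch assumes the implementation's SINGLE-FLOAT is IEEE 754 binary32, & it handles only zero & normalized values (no denormals, infinities, or NaNs; negative zero is encoded as plain zero).

```lisp
(defun sketch-encode-single-float (x)
  "Return X as a list of 4 octets, most significant octet first.
Assumes SINGLE-FLOAT is IEEE 754 binary32 & X is zero or normalized."
  (let ((x (coerce x 'single-float)))
    (if (zerop x)
        (list 0 0 0 0)
        (multiple-value-bind (significand exponent sign)
            (integer-decode-float x)
          ;; SIGNIFICAND holds 24 bits, including the leading 1
          ;; that binary32 leaves implicit.  The value is
          ;; SIGNIFICAND * 2^EXPONENT, so the biased exponent
          ;; is EXPONENT + 23 + 127.
          (let ((bits (logior (ash (if (minusp sign) 1 0) 31)
                              (ash (+ exponent 23 127) 23)
                              (ldb (byte 23 0) significand))))
            (loop for shift from 24 downto 0 by 8
                  collect (ldb (byte 8 shift) bits)))))))

(defun sketch-find-float (x octets)
  "Search OCTETS for X in either byte order.
Return the position & the byte order as two values, or NIL."
  (let* ((big (sketch-encode-single-float x))
         (little (reverse big))
         (p (search big octets)))
    (if p
        (values p :big-endian)
        (let ((q (search little octets)))
          (when q (values q :little-endian))))))
```

For example, (sketch-encode-single-float 1.0) returns (63 128 0 0), which is #x3F800000, & (sketch-find-float 1.0 #(9 9 0 0 128 63 9)) finds the reversed octets at position 2. Trying both byte orders in one function mirrors the two COUNT-TECHNIQUE calls above.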
What if the makers of SAS don't want me to be able to decipher their file format? It's too late for SAS version 9, but they could change the file format for SAS version 10.
The easiest modification might be to retain the current file format but encrypt the bits before emitting them to external storage. They wouldn't even need the latest cryptosystem. Plain old DES is beyond me & most programmers. Yeah, yeah, the NSA might be able to defeat DES in a single day, but if the NSA (acting on behalf of the federal government of the USA) wants to read your secret data or determine your secret file format, you have bigger problems than clever cryptanalysts. Think money, lawyers, & prison. And there's no reason SAS couldn't use the latest, woopie-doo cryptosystem of the month. I just said that something older would do the trick just as well.
With a little more effort, SAS's file format could be protected with less security. Here are some possibilities:
A non-cryptanalyst programmer (such as I) could defeat any of the preceding techniques, but they would slow him & the deciphering program he wrote. It's like I said: ``With a little more effort, SAS's file format could be protected with less security'' than actually encrypting it.
On the other hand, if the makers of SAS wanted to make it even easier for other programmers to write programs which processed SAS files (thereby increasing the utility of SAS itself), they could document the file format. Here are some specific things they could do:
Gene Michael Stover 2008-04-20