A standard for data?

Firstly, my apologies for the long silence. Presentations and deadlines are looming, leaving little time for diversions.

Now then, on to the topic at hand (and one of my favourite topics): data! In particular, I want to talk about data correction, and whether it may be possible to define a standard for it. Not a standard for storing data: between the NeXus and canSAS file formats, and everyone just going their own way when it comes to storing data, I will leave it up to time to decide on that particular prickly subject. I am talking about a standard for data reduction.

After writing the experimental sections for a few papers, I noticed how much space it takes to indicate which data corrections have been applied, when they have been applied and how. However, when it comes to data correction, there appears to be only one way of doing it correctly (some of which is discussed in my document here, and graphically charted here). It seems to me that it then only comes down to mentioning which data corrections have been done, but even that gets tedious, and may still take up half a page. To give one example:

As (mostly) detailed in a separate publication, collected data is corrected for natural background radiation, transmission, sample thickness, measurement time, primary beam flux, parasitic background, polarisation, detector solid angle coverage and sample self-absorption. The intensity is subsequently binned using 200 bins spanning the aforementioned q-range, and scaled to absolute units using a calibrated glassy carbon standard provided by [CITE ILAVSKY]. Etc., etc.

So you see, that is still a lot of space spent merely mentioning the corrections. I think we can find a way to shorten this. Would it be crazy to suggest that all of these corrections fall under a standard (ISO-007 or so), and that we merely indicate which corrections have not been done, perhaps in a format similar to the way Creative Commons licence options are defined? (CC-BY-SA, for example, indicates Creative Commons (CC) with the attribution (BY) and share-alike (SA) options.)

As far as I can tell, we have quite a few corrections to apply to a dataset before it becomes accurate to within 1-2%. In order of appearance:

(DS) Detector strange storage corrections (Rigaku, Bruker, I am looking at you here)
(GD) Detector geometric distortion
(GA) Detector gamma correction (nonlinear response)
(FF) Flatfield
(DC) Detector darkcurrent (even for PILATUS detectors)
(MK) Masking invalid pixels or out-of-range pixels
(TI) Time
(TR) Transmission
(FL) Flux
(BG) Background
(PO) Polarisation
(SP) Spherical correction
(SA) Sample self-absorption
(AU) Absolute unit scaling

(Please correct me if I have omitted any corrections here.) As you can see, I have also put two-letter abbreviations at the front. However, with 14 corrections (most of which you do automatically anyway), it would still be a pain to write them all out. So my suggestion is to note only the corrections that have not been done (with an added “!” at the front for “not”). For my aforementioned example, this would come to:

All data corrections have been applied according to ISO-007 with the exception of (!GD-!GA-!FF) which were deemed unnecessary for the detector employed.
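
A nice side effect is that a tag like this would be easy to generate and check by machine. Below is a minimal Python sketch, assuming the fourteen two-letter codes listed above and the "!" and "-" notation from the example. The names CORRECTIONS, omitted_tag and parse_tag are made up for illustration and are not part of any existing standard or library.

```python
# Two-letter codes in the order listed above.
# Hypothetical: this list is the proposal from this post, not a real standard.
CORRECTIONS = ["DS", "GD", "GA", "FF", "DC", "MK", "TI",
               "TR", "FL", "BG", "PO", "SP", "SA", "AU"]


def omitted_tag(applied):
    """Build a tag such as '(!GD-!GA-!FF)' naming the corrections NOT applied."""
    applied = set(applied)
    unknown = applied - set(CORRECTIONS)
    if unknown:
        raise ValueError(f"unknown correction codes: {sorted(unknown)}")
    # Keep the canonical order so the same omissions always give the same tag.
    skipped = [code for code in CORRECTIONS if code not in applied]
    return "(" + "-".join("!" + code for code in skipped) + ")"


def parse_tag(tag):
    """Return the corrections that WERE applied, given a tag of omissions."""
    body = tag.strip().strip("()")
    skipped = set()
    for part in filter(None, body.split("-")):
        if not part.startswith("!") or part[1:] not in CORRECTIONS:
            raise ValueError(f"malformed entry in tag: {part!r}")
        skipped.add(part[1:])
    return [code for code in CORRECTIONS if code not in skipped]


# The example from the text: everything applied except GD, GA and FF.
done = [c for c in CORRECTIONS if c not in ("GD", "GA", "FF")]
tag = omitted_tag(done)
print(tag)
print(parse_tag(tag))
```

Listing the omissions in a fixed order is a design choice of this sketch: it means two people who skipped the same corrections end up with identical tags, which makes them easy to compare or search for.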

Does that sound like something of worth? As always, share your opinion in the comments section!

Looking At Nothing

A SA(X)S Weblog
