Everybody hates statistics…

Everybody hates statistics [1]…

 

… but it can be of major importance in our small angle world. While very few papers on small-angle scattering discuss statistics, they can tell you whether your observations are real or just imaginary. In addition, statistics will let you know whether you have been able to describe your scattering pattern with your model or not. All in all, nice to have.

 

I will briefly discuss two statistical concepts which could be of great use, as they have been for me. While I never really could understand all the concepts during statistics lectures at university (a situation which may sound familiar), I can try to explain some simple concepts. By the by, if you (dear reader) are a statistician, I would be happy to get in touch with you. The first concept is the most straightforward, and involves the uncertainties on fitting parameters. Secondly I will discuss statistics on collected intensity and how to retrieve them for a variety of detectors.

Most of us use model fits to our data in order to extract fitting parameters which, in turn, tell us something about the sample we are observing. By comparing these fitting parameters between a few samples, an attempt can be made to link the phenomenological differences in the samples to the differences in fitting parameters. In many cases, the differences in the fitting parameters are rather small, and we may start to wonder if it is significant at all.

 

While some data fitting programs will automatically give you a confidence interval or uncertainty on your fitting parameters, these uncertainties are not the ones you’re looking for. They are directly linked to the influence of that fitting parameter to the total fit and to the quality of the fit, but do not necessarily describe your real margin of error. The real margin of error may arise from changes in the instrumental accuracy between measurements (such as beam movement or detector drift), or from variations in the sample.

 

The only way to solve most of these errors (apart from a good data reduction to get rid of most sources of error) is to perform the same measurement several times. If I have a fibre I would like to measure, I should prepare five to ten samples out of that fibre, and measure them one by one (f.ex. as done in my paper here [pdf] [journal link]). Apart from systematical errors, the differences in the resulting fitting parameters should give me a very good idea on the real margin of error on each fitting parameter. Then, before publishing, one can calculate the (sample) standard deviation and score some brownie points with the reviewers!

 

The second statistics problem I would like to discuss is that of intensity error. Knowing the error on your collected intensity will prevent over-fitting of the data (f.ex. using the reduced chi-square definition as given in Pedersen, 1997: doi:10.1016/S0001-8686(97)00312-6), and allows you to determine what is visible and what is mere conjecture in problems involving polydispersity (c.f. Revisiting observability in polydisperse systems: “real” polydispersity). As we are all acutely aware, Poisson statistics dictate the absolutely minimal error possible on the intensity counted by photon counting detectors (this error is the square root of the number of photon counts for large count numbers, for interesting information on low-count problems, read this doi:10.1051/0004-6361:20064927) . This is not to say that the error is really just the square root of the number of counts, but it certainly cannot be smaller! As an aside, I have been explained that even for inefficient, but still photon-counting detectors, Poisson statistics still hold. In other words, if my beloved PILATUS detector is only 60% efficient at detecting photons at a certain energy, what it does detect still has a minimum error of the square root of the number of detected photons.

 

Most of us, though, do not have the luxury of a photon counting detector. There is still a way of getting an idea about the uncertainty of the intensity of non-photon-counting detectors, which will additionally provide us with a way to determine the actual uncertainty in photon-counting detectors. This method only works for binned data, however. The way to do this is to determine the standard error of all the intensities in your bin. This is easier done than said, as you determine the (sample) standard deviation of all intensity values in your bin and subsequently divide the number by the square root of the number of values in your bin. If you do this for your photon counting detectors, you will get a better idea of the actual error in the intensity in your bin, especially in areas where your data is affected by some parasitic scattering. You can then take the largest value, be it either the standard error or the Poisson statistics error of your data for more accurate errors.

 

I have attempted to put the latter in a new weighted binning routine. This one has been written in Python, for increased portability and reduced cost of implementation. There may be errors in the code still, but I think it is mostly sorted. If you see anything amiss or can suggest improvements to the code, please let me know. The code is licensed under an open-source BSD license.

[1] except for dr. Ben Goldacre, who cannot stop talking about it, and provides interesting case study after interesting case study.

-Python code->
def binning_weighted_1D(q,I,E=[],Nbins=200,Stats='SE'):
    #USAGE: qbin,Ibin,Ebin=binning_weighted_1D(q,I,E=[],Nbins=200,Stats='SE'):
    #python implementation of the binning routine written in Matlab. The intensities are divided across the q-range in bins of equal size. The intensities of a pixel are divided between the two neighbouring bins depending on the distances to the centres. If error provided is empty, the standard deviation of the intensities in the bins are computed.
    #optional input arguments:
    #   Nbins: integer indicating the number of bins to divide the intensity over. Alternatively, this can be an array of quidistant bin centres. If you go this route, depending on the range, not all intensity may be counted.
    #   Stats: can be set to 'auto'. This takes the maximum error between supplied Poisson statistics error-based errors or the standard error. 
    #Written by Brian R. Pauw, 2011, released under BSD open source license.
    #let's make sure the input is consistent
    if size(q)!=size(I):
        print "Incompatible lengths of q and I, q must be of the same number of elements as I"
        return
    elif (size(E)!=0) & (size(E)!=size(I)):
        print "Size of E is not identical to q and I"
        return
    if (Stats!='STD')&(Stats!='Poisson')&(Stats!='SE')&(Stats!='auto'):
        print "Statistics can only be set to 'SE' (default), or 'auto'. \n"
        print "Only use 'auto' for photon-counting detectors, selects largest error between SE and Poisson.\n"
        print "If errors are supplied, standard errors are not calculated except in the case of 'auto' "
        return
    if (size(Nbins)==1):
        if (Nbins<1):
            print "number of bins, Nbins, is smaller than one. I need at least one bin to fill"
            return
    if size(Nbins)>1:
        print "Nbins is larger than one value. Assuming that an equidistant list of bin centres has been supplied"

    #flatten q, I and E
    q=reshape(q,size(q),0)
    I=reshape(I,size(I),0)
    E=reshape(E,size(E),0)

    if size(Nbins)==1:
        #define the bin edges and centres, and find out the stepsize while we're at it. Probably, there is no need for knowing the edges...
        qbin_edges,stepsize=linspace(numpy.min(q),numpy.max(q),Nbins+1,retstep=True)#do not use qbin_edges!
        #stepsize=qbin_edges[1]-qbin_edges[0]
        qbin_centres=linspace(numpy.min(q)+0.5*stepsize,numpy.max(q)-0.5*stepsize,Nbins)
    else:
        if (numpy.min(q)>numpy.max(Nbins))|(numpy.max(q)<numpy.min(Nbins)):
            print "Bin centres supplied do not overlap with the q-range, cannot continue"
            return
        qbin_centres=sort(Nbins)
        stepsize=mean(diff(qbin_centres))
        Nbins=size(qbin_centres)

    #initialize output matrices
    Ibin=numpy.zeros(Nbins)
    SDbin=numpy.zeros(Nbins)
    SEbin=numpy.zeros(Nbins)

    #now we can fill the bins
    for Bini in range(Nbins):
        #limit ourselves to only the bits we're interested in:
        lim_bool_array=((q>(qbin_centres[Bini]-stepsize))&(q<=(qbin_centres[Bini]+stepsize)))
        q_to_bin=q[lim_bool_array]
        I_to_bin=I[lim_bool_array]
        if (size(E)!=0):
            E_to_bin=E[lim_bool_array]

        #find out the weighting factors for each q,I,E-pair in the array  
        q_dist=abs(q_to_bin-qbin_centres[Bini])
        weight_fact=(1-q_dist/stepsize)

        #sum the intensities in one bin
        Ibin[Bini]=sum(I_to_bin*weight_fact)/sum(weight_fact);

        #now we deal with the Errors:
        if (size(E)!=0): #if we have errors supplied from outside
            #standard error calculation:
            SEbin[Bini]=sqrt(sum(E_to_bin**2*weight_fact))/sum(weight_fact)
            #Ebin2[Bini]=sqrt(sum((E_to_bin*weight_fact)**2))/sum(weight_fact) old, incorrect
            if Stats=='auto':
                SDbin[Bini]=sqrt(sum((I_to_bin-Ibin[Bini])**2*weight_fact)/(sum(weight_fact)-1)) #according to the definition of sample-standard deviation
                SEbin[Bini]=numpy.max(array([SEbin[Bini],SDbin[Bini]/sqrt(sum(weight_fact))])) #maximum between standard error and Poisson statistics
        else:
            #calculate the standard deviation of the intensity in the bin both for samples with supplied error as well as for those where the error is supposed to be calculated
            SDbin[Bini]=sqrt(sum((I_to_bin-Ibin[Bini])**2*weight_fact)/(sum(weight_fact)-1)) #according to the definition of sample-standard deviation
            #calculate standard error by dividing the standard error by the square root of the number of measurements
            SEbin[Bini]=SDbin[Bini]/sqrt(sum(weight_fact))
    return qbin_centres,Ibin,SEbin
<-Python code-

Good luck!

Be the first to comment

Leave a Reply

Your email address will not be published.


*


*

This site uses Akismet to reduce spam. Learn how your comment data is processed.