  /**
   * Generic bootstrap resampling. Quite optimized, so don't be afraid to try it. Executes
   * <tt>resamples</tt> resampling steps. Each resampling step does the following:
   *
   * <ul>
   *   <li>Uniformly samples (chooses) <tt>size()</tt> random elements <i>with replacement</i> from
   *       <tt>this</tt> and fills them into an auxiliary bin <tt>b1</tt>.
   *   <li>Uniformly samples (chooses) <tt>other.size()</tt> random elements <i>with replacement</i>
   *       from <tt>other</tt> and fills them into another auxiliary bin <tt>b2</tt>.
   *   <li>Executes the comparison function <tt>function</tt> on both auxiliary bins
   *       (<tt>function.apply(b1,b2)</tt>) and adds the result of the function to an auxiliary
   *       bootstrap bin <tt>b3</tt>.
   * </ul>
   *
   * <p>Finally returns the auxiliary bootstrap bin <tt>b3</tt>, from which the measure of interest
   * can be read off.
   *
   * <p><b>Background:</b>
   *
   * <p>See also a more <A HREF="http://garnet.acns.fsu.edu/~pkelly/bootstrap.html">in-depth
   * discussion</A> of bootstrapping and related randomization methods. The classical statistical
   * test for comparing the means of two samples is the <i>t-test</i>. Unfortunately, this test
   * assumes that the two samples each come from a normal distribution and that these distributions
   * have the same standard deviation. Quite often, however, data has a distribution that is
   * non-normal in many ways. In particular, distributions are often asymmetric. For such data, the
   * t-test may produce misleading results and should thus not be used. Sometimes asymmetric data
   * can be transformed into normally distributed data by taking e.g. the logarithm, and the t-test
   * will then produce valid results, but this still requires postulating a certain distribution
   * underlying the data, which is often not warranted because too little is known about the data
   * composition.
   *
   * <p><i>Bootstrap resampling of means differences</i> (and other differences) is a robust
   * replacement for the t-test and does not require assumptions about the actual distribution of
   * the data. The idea of bootstrapping is quite simple: simulation. The only assumption required
   * is that the two samples <tt>a</tt> and <tt>b</tt> are representative of the underlying
   * distribution with respect to the statistic that is being tested - this assumption is of course
   * implicit in all statistical tests. We can now generate lots of further samples that correspond
   * to the two given ones by sampling <i>with replacement</i>. This process is called
   * <i>resampling</i>. A resample can (and usually will) have a different mean than the original
   * one, and by drawing hundreds or thousands of such resamples <tt>a<sub>r</sub></tt> from
   * <tt>a</tt> and <tt>b<sub>r</sub></tt> from <tt>b</tt> we can compute the so-called bootstrap
   * distribution of all the differences "mean of <tt>a<sub>r</sub></tt> minus mean of
   * <tt>b<sub>r</sub></tt>", that is, a bootstrap bin filled with the differences. Now we can
   * compute what fraction of these differences is, say, greater than zero. Let's assume we have
   * computed 1000 resamples of both <tt>a</tt> and <tt>b</tt> and found that only <tt>8</tt> of the
   * differences were greater than zero. Then <tt>8/1000</tt> or <tt>0.008</tt> is the p-value
   * (probability) for the hypothesis that the mean of the distribution underlying <tt>a</tt> is
   * actually larger than the mean of the distribution underlying <tt>b</tt>. From this bootstrap
   * test, we can clearly reject the hypothesis.
   *
   * <p>Instead of using means differences, we can also use other differences, for example, the
   * median differences.
   *
   * <p>Instead of p-values we can also read arbitrary confidence intervals from the bootstrap bin.
   * For example, if <tt>90%</tt> of all bootstrap differences (<tt>a</tt> minus <tt>b</tt>) are
   * left of the value <tt>-3.5</tt>, then <tt>b</tt> minus <tt>a</tt> is <tt>3.5</tt> or larger in
   * <tt>90%</tt> of the resamples. Hence a left <tt>90%</tt> confidence interval for the difference
   * <tt>b</tt> minus <tt>a</tt> would be <tt>(3.5,infinity)</tt>; in other words: <tt>b</tt>
   * exceeds <tt>a</tt> by <tt>3.5</tt> or more with probability <tt>0.9</tt>.
   *
   * <p>Sometimes we would like to compare not only means and medians, but also the variability
   * (spread) of two samples. The conventional method of doing this is the <i>F-test</i>, which
   * compares the standard deviations. It is related to the t-test and, like the latter, assumes
   * that the two samples come from a normal distribution. The F-test is very sensitive to
   * deviations from normality. Instead we can again resort to more robust bootstrap resampling and
   * compare a measure of spread, for example the inter-quartile range. This way we compute a
   * <i>bootstrap resampling of inter-quartile range differences</i> in order to arrive at a test
   * for inequality of variability.
   *
   * <p><b>Example:</b>
   *
   * <table>
   * <td class="PRE">
   * <pre>
   * // v1,v2 - the two samples to compare against each other
   * double[] v1 = { 1, 2, 3, 4, 5, 6, 7, 8, 9,10, 21, 22,23,24,25,26,27,28,29,30,31};
   * double[] v2 = {10,11,12,13,14,15,16,17,18,19, 20, 30,31,32,33,34,35,36,37,38,39};
   * hep.aida.bin.DynamicBin1D X = new hep.aida.bin.DynamicBin1D();
   * hep.aida.bin.DynamicBin1D Y = new hep.aida.bin.DynamicBin1D();
   * X.addAllOf(new cern.colt.list.DoubleArrayList(v1));
   * Y.addAllOf(new cern.colt.list.DoubleArrayList(v2));
   * cern.jet.random.engine.RandomEngine random = new cern.jet.random.engine.MersenneTwister();
   *
   * // define diff as exactly one of the following three alternatives
   *
   * // bootstrap resampling of differences of means:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *   public double apply(DynamicBin1D x, DynamicBin1D y) {return x.mean() - y.mean();}
   * };
   *
   * // bootstrap resampling of differences of medians:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *   public double apply(DynamicBin1D x, DynamicBin1D y) {return x.median() - y.median();}
   * };
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *   public double apply(DynamicBin1D x, DynamicBin1D y) {return (x.quantile(0.75)-x.quantile(0.25)) - (y.quantile(0.75)-y.quantile(0.25)); }
   * };
   *
   * DynamicBin1D boot = X.sampleBootstrap(Y,1000,random,diff);
   *
   * cern.jet.math.Functions F = cern.jet.math.Functions.functions;
   * System.out.println("p-value="+ (boot.aggregate(F.plus, F.greater(0)) / boot.size()));
   * System.out.println("left 90% confidence interval = ("+boot.quantile(0.9) + ",infinity)");
   *
   * -->
   * // bootstrap resampling of differences of means:
   * p-value=0.0080
   * left 90% confidence interval = (-3.571428571428573,infinity)
   *
   * // bootstrap resampling of differences of medians:
   * p-value=0.36
   * left 90% confidence interval = (5.0,infinity)
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * p-value=0.5699
   * left 90% confidence interval = (5.0,infinity)
   * </pre>
   * </td>
   * </table>
   *
   * @param other the other bin to compare the receiver against.
   * @param resamples the number of times resampling shall be done.
   * @param randomGenerator a random number generator. Set this parameter to <tt>null</tt> to use a
   *     default random number generator seeded with the current time.
   * @param function a difference function comparing two samples; takes as first argument a sample
   *     of <tt>this</tt> and as second argument a sample of <tt>other</tt>.
   * @return a bootstrap bin holding the results of <tt>function</tt> of each resampling step.
   * @see cern.colt.GenericPermuting#permutation(long,int)
   */
  public synchronized DynamicBin1D sampleBootstrap(
      DynamicBin1D other,
      int resamples,
      cern.jet.random.engine.RandomEngine randomGenerator,
      BinBinFunction1D function) {
    if (randomGenerator == null) randomGenerator = cern.jet.random.Uniform.makeDefaultGenerator();

    // since "resamples" can be quite large, we care about performance and memory
    int maxCapacity = 1000;
    int s1 = size();
    int s2 = other.size();

    // prepare auxiliary bins and buffers
    DynamicBin1D sample1 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer1 = sample1.buffered(Math.min(maxCapacity, s1));

    DynamicBin1D sample2 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer2 = sample2.buffered(Math.min(maxCapacity, s2));

    DynamicBin1D bootstrap = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer bootBuffer = bootstrap.buffered(Math.min(maxCapacity, resamples));

    // resampling steps
    for (int i = resamples; --i >= 0; ) {
      sample1.clear();
      sample2.clear();

      this.sample(s1, true, randomGenerator, buffer1);
      other.sample(s2, true, randomGenerator, buffer2);

      bootBuffer.add(function.apply(sample1, sample2));
    }
    bootBuffer.flush();
    return bootstrap;
  }

  /**
   * Removes the <tt>s</tt> smallest and <tt>l</tt> largest elements from the receiver. The
   * receiver's size will be reduced by <tt>s + l</tt> elements.
   *
   * @param s the number of smallest elements to trim away (<tt>s &gt;= 0</tt>).
   * @param l the number of largest elements to trim away (<tt>l &gt;= 0</tt>).
   */
  public synchronized void trim(int s, int l) {
    DoubleArrayList elems = sortedElements();
    clear();
    addAllOfFromTo(elems, s, elems.size() - 1 - l);
  }