  /**
   * Generic bootstrap resampling. Quite optimized, so don't be afraid to try it. Executes
   * <tt>resamples</tt> resampling steps. Each resampling step does the following:
   *
   * <ul>
   *   <li>Uniformly samples (chooses) <tt>size()</tt> random elements <i>with replacement</i> from
   *       <tt>this</tt> and fills them into an auxiliary bin <tt>b1</tt>.
   *   <li>Uniformly samples (chooses) <tt>other.size()</tt> random elements <i>with replacement</i>
   *       from <tt>other</tt> and fills them into another auxiliary bin <tt>b2</tt>.
   *   <li>Executes the comparison function <tt>function</tt> on both auxiliary bins
   *       (<tt>function.apply(b1,b2)</tt>) and adds the result of the function to an auxiliary
   *       bootstrap bin <tt>b3</tt>.
   * </ul>
   *
   * <p>Finally returns the auxiliary bootstrap bin <tt>b3</tt> from which the measure of interest
   * can be read off.
   *
   * <p><b>Background:</b>
   *
   * <p>Also see a more <A HREF="http://garnet.acns.fsu.edu/~pkelly/bootstrap.html"> in-depth
   * discussion</A> on bootstrapping and related randomization methods. The classical statistical
   * test for comparing the means of two samples is the <i>t-test</i>. Unfortunately, this test
   * assumes that the two samples each come from a normal distribution and that these distributions
   * have the same standard deviation. Quite often, however, data has a distribution that is
   * non-normal in many ways. In particular, distributions are often asymmetric. For such data, the
   * t-test may produce misleading results and should thus not be used. Sometimes asymmetric data
   * can be transformed into normally distributed data, for example by taking the logarithm, and
   * the t-test will then produce valid results. This still requires postulating a certain
   * distribution underlying the data, which is often not warranted because too little is known
   * about the data composition.
   *
   * <p><i>Bootstrap resampling of means differences</i> (and other differences) is a robust
   * replacement for the t-test and does not require assumptions about the actual distribution of
   * the data. The idea of bootstrapping is quite simple: simulation. The only assumption required
   * is that the two samples <tt>a</tt> and <tt>b</tt> are representative of the underlying
   * distribution with respect to the statistic that is being tested. This assumption is of course
   * implicit in all statistical tests. We can now generate lots of further samples that correspond
   * to the two given ones, by sampling <i>with replacement</i>. This process is called
   * <i>resampling</i>. A resample can (and usually will) have a different mean than the original
   * one, and by drawing hundreds or thousands of such resamples <tt>a<sub>r</sub></tt> from
   * <tt>a</tt> and <tt>b<sub>r</sub></tt> from <tt>b</tt> we can compute the so-called bootstrap
   * distribution of all the differences "mean of <tt>a<sub>r</sub></tt> minus mean of
   * <tt>b<sub>r</sub></tt>". That is, a bootstrap bin filled with the differences. Now we can
   * compute what fraction of these differences is, say, greater than zero. Let's assume we have
   * computed 1000 resamples of both <tt>a</tt> and <tt>b</tt> and found that only <tt>8</tt> of the
   * differences were greater than zero. Then <tt>8/1000</tt> or <tt>0.008</tt> is the p-value
   * (probability) for the hypothesis that the mean of the distribution underlying <tt>a</tt> is
   * actually larger than the mean of the distribution underlying <tt>b</tt>. From this bootstrap
   * test, we can clearly reject the hypothesis.
   *
   * <p>Instead of using means differences, we can also use other differences, for example, the
   * median differences.
   *
   * <p>Instead of p-values we can also read arbitrary confidence intervals from the bootstrap bin.
   * For example, if <tt>90%</tt> of all bootstrap differences are left of the value <tt>-3.5</tt>,
   * then a left <tt>90%</tt> confidence interval for the difference is
   * <tt>(-infinity,-3.5)</tt>; in other words: the difference is <tt>-3.5</tt> or smaller with
   * probability <tt>0.9</tt>.
   *
   * <p>Sometimes we would like to compare not only means and medians, but also the variability
   * (spread) of two samples. The conventional method of doing this is the <i>F-test</i>, which
   * compares the standard deviations. It is related to the t-test and, like the latter, assumes the
   * two samples to come from a normal distribution. The F-test is very sensitive to data with
   * deviations from normality. Instead we can again resort to more robust bootstrap resampling and
   * compare a measure of spread, for example the inter-quartile range. This way we compute a
   * <i>bootstrap resampling of inter-quartile range differences</i> in order to arrive at a test
   * for inequality of variability.
   *
   * <p><b>Example:</b>
   *
   * <table>
   * <td class="PRE">
   * <pre>
   * // v1,v2 - the two samples to compare against each other
   * double[] v1 = { 1, 2, 3, 4, 5, 6, 7, 8, 9,10,  21,  22,23,24,25,26,27,28,29,30,31};
   * double[] v2 = {10,11,12,13,14,15,16,17,18,19,  20,  30,31,32,33,34,35,36,37,38,39};
   * hep.aida.bin.DynamicBin1D X = new hep.aida.bin.DynamicBin1D();
   * hep.aida.bin.DynamicBin1D Y = new hep.aida.bin.DynamicBin1D();
   * X.addAllOf(new cern.colt.list.DoubleArrayList(v1));
   * Y.addAllOf(new cern.colt.list.DoubleArrayList(v2));
   * cern.jet.random.engine.RandomEngine random = new cern.jet.random.engine.MersenneTwister();
   *
   * // choose exactly one of the following three definitions of diff
   *
   * // bootstrap resampling of differences of means:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *    public double apply(DynamicBin1D x, DynamicBin1D y) {return x.mean() - y.mean();}
   * };
   *
   * // bootstrap resampling of differences of medians:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *    public double apply(DynamicBin1D x, DynamicBin1D y) {return x.median() - y.median();}
   * };
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   *    public double apply(DynamicBin1D x, DynamicBin1D y) {return (x.quantile(0.75)-x.quantile(0.25)) - (y.quantile(0.75)-y.quantile(0.25)); }
   * };
   *
   * DynamicBin1D boot = X.sampleBootstrap(Y,1000,random,diff);
   *
   * cern.jet.math.Functions F = cern.jet.math.Functions.functions;
   * System.out.println("p-value="+ (boot.aggregate(F.plus, F.greater(0)) / boot.size()));
   * System.out.println("left 90% confidence interval = (-infinity,"+boot.quantile(0.9) + ")");
   *
   * -->
   * // bootstrap resampling of differences of means:
   * p-value=0.0080
   * left 90% confidence interval = (-infinity,-3.571428571428573)
   *
   * // bootstrap resampling of differences of medians:
   * p-value=0.36
   * left 90% confidence interval = (-infinity,5.0)
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * p-value=0.5699
   * left 90% confidence interval = (-infinity,5.0)
   * </pre>
   * </td>
   * </table>
   *
   * @param other the other bin to compare the receiver against.
   * @param resamples the number of times resampling shall be done.
   * @param randomGenerator a random number generator. Set this parameter to <tt>null</tt> to use a
   *     default random number generator seeded with the current time.
   * @param function a difference function comparing two samples; takes as first argument a sample
   *     of <tt>this</tt> and as second argument a sample of <tt>other</tt>.
   * @return a bootstrap bin holding the results of <tt>function</tt> of each resampling step.
   * @see cern.colt.GenericPermuting#permutation(long,int)
   */
  public synchronized DynamicBin1D sampleBootstrap(
      DynamicBin1D other,
      int resamples,
      cern.jet.random.engine.RandomEngine randomGenerator,
      BinBinFunction1D function) {
    if (randomGenerator == null) randomGenerator = cern.jet.random.Uniform.makeDefaultGenerator();

    // since "resamples" can be quite large, we care about performance and memory
    int maxCapacity = 1000;
    int s1 = size();
    int s2 = other.size();

    // prepare auxiliary bins and buffers
    DynamicBin1D sample1 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer1 = sample1.buffered(Math.min(maxCapacity, s1));

    DynamicBin1D sample2 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer2 = sample2.buffered(Math.min(maxCapacity, s2));

    DynamicBin1D bootstrap = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer bootBuffer = bootstrap.buffered(Math.min(maxCapacity, resamples));

    // resampling steps
    for (int i = resamples; --i >= 0; ) {
      sample1.clear();
      sample2.clear();

      // draw s1 and s2 elements with replacement into the auxiliary bins
      this.sample(s1, true, randomGenerator, buffer1);
      other.sample(s2, true, randomGenerator, buffer2);

      // record the statistic of interest for this pair of resamples
      bootBuffer.add(function.apply(sample1, sample2));
    }
    bootBuffer.flush();
    return bootstrap;
  }

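/*
Illustration only, not part of this class. The resampling steps described for
sampleBootstrap can be sketched with plain arrays, under these assumptions: the
statistic is fixed to the difference of means, java.util.Random stands in for
RandomEngine, and the names BootstrapSketch, resampledMean, meanDifferences and
fractionPositive are hypothetical.

```java
import java.util.Random;

// Hypothetical stand-alone sketch of bootstrap resampling of mean differences.
final class BootstrapSketch {

  // Draws a.length elements from a with replacement and returns their mean.
  static double resampledMean(double[] a, Random rnd) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[rnd.nextInt(a.length)];
    }
    return sum / a.length;
  }

  // Returns the bootstrap distribution of "mean of a_r minus mean of b_r".
  static double[] meanDifferences(double[] a, double[] b, int resamples, Random rnd) {
    double[] diffs = new double[resamples];
    for (int r = 0; r < resamples; r++) {
      diffs[r] = resampledMean(a, rnd) - resampledMean(b, rnd);
    }
    return diffs;
  }

  // Fraction of bootstrap differences that are greater than zero.
  static double fractionPositive(double[] diffs) {
    int count = 0;
    for (double d : diffs) {
      if (d > 0) count++;
    }
    return (double) count / diffs.length;
  }

  public static void main(String[] args) {
    double[] v1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};
    double[] v2 = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39};
    double[] diffs = meanDifferences(v1, v2, 1000, new Random(42));
    System.out.println("fraction of differences > 0 = " + fractionPositive(diffs));
  }
}
```

The printed fraction plays the role of the p-value in the Javadoc example above;
its exact value depends on the random generator and seed.
*/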
  /**
   * Returns whether two bins are equal. They are equal if the other object is of the same class or
   * a subclass of this class and both have the same size, minimum, maximum, sum and sumOfSquares
   * and have the same elements, order being irrelevant (multiset equality).
   *
   * <p>Definition of <i>Equality</i> for multisets: A,B are equal &lt;=&gt; A is a superset of B
   * and B is a superset of A. (Elements must occur the same number of times, order is irrelevant.)
   */
  public synchronized boolean equals(Object object) {
    if (!(object instanceof DynamicBin1D)) return false;
    if (!super.equals(object)) return false;

    DynamicBin1D other = (DynamicBin1D) object;
    double[] s1 = sortedElements_unsafe().elements();
    synchronized (other) {
      double[] s2 = other.sortedElements_unsafe().elements();
      int n = size();
      return includes(s1, s2, 0, n, 0, n) && includes(s2, s1, 0, n, 0, n);
    }
  }

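/*
Illustration only, not part of this class. Multiset equality as defined above
can be sketched on plain arrays. Note the swapped-in technique: instead of the
two includes(...) checks on sorted elements used by equals, this sketch sorts
copies of both arrays and compares them position by position. The names
MultisetEqualitySketch and sameMultiset are hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch of multiset equality for double arrays.
final class MultisetEqualitySketch {

  // Returns true if a and b hold the same values the same number of times,
  // regardless of order. Arrays.equals compares doubles bitwise, so NaN
  // matches NaN and 0.0 does not match -0.0.
  static boolean sameMultiset(double[] a, double[] b) {
    if (a.length != b.length) return false;
    double[] s1 = a.clone(); // sort copies so the inputs stay untouched
    double[] s2 = b.clone();
    Arrays.sort(s1);
    Arrays.sort(s2);
    return Arrays.equals(s1, s2);
  }

  public static void main(String[] args) {
    System.out.println(sameMultiset(new double[] {3, 1, 2, 1}, new double[] {1, 1, 2, 3}));
    System.out.println(sameMultiset(new double[] {1, 1, 2}, new double[] {1, 2, 2}));
  }
}
```

The second call shows why comparing sizes and distinct values alone is not
enough: both arrays contain only 1 and 2, but with different multiplicities.
*/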
  /**
   * Returns the covariance of two bins, which is <tt>cov(x,y) = (1/size()) * Sum((x[i]-mean(x)) *
   * (y[i]-mean(y)))</tt>. See the <A
   * HREF="http://www.cquest.utoronto.ca/geog/ggr270y/notes/not05efg.html"> math definition</A>.
   *
   * @param other the bin to compare with.
   * @return the covariance.
   * @throws IllegalArgumentException if <tt>size() != other.size()</tt>.
   */
  public synchronized double covariance(DynamicBin1D other) {
    synchronized (other) {
      if (size() != other.size())
        throw new IllegalArgumentException("both bins must have same size");

      // accumulate Sum(x[i]*y[i])
      double s = 0;
      for (int i = size(); --i >= 0; ) {
        s += this.elements.getQuick(i) * other.elements.getQuick(i);
      }

      // one-pass form of the definition:
      // Sum((x[i]-mean(x)) * (y[i]-mean(y))) == Sum(x[i]*y[i]) - Sum(x)*Sum(y)/size()
      double cov = (s - sum() * other.sum() / size()) / size();
      return cov;
    }
  }

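/*
Illustration only, not part of this class. The Javadoc of covariance states the
definition in terms of deviations from the mean, while the method body uses the
algebraically equivalent sum-of-products form. A minimal sketch comparing both
forms on plain arrays follows; the names CovarianceSketch, covarianceTwoPass and
covarianceOnePass are hypothetical.

```java
// Hypothetical stand-alone sketch comparing two equivalent covariance formulas.
final class CovarianceSketch {

  private static void checkSameLength(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("both arrays must have same length");
    }
  }

  // Definition: (1/n) * Sum((x[i]-mean(x)) * (y[i]-mean(y)))
  static double covarianceTwoPass(double[] x, double[] y) {
    checkSameLength(x, y);
    int n = x.length;
    double sx = 0, sy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
    }
    double mx = sx / n, my = sy / n;
    double s = 0;
    for (int i = 0; i < n; i++) {
      s += (x[i] - mx) * (y[i] - my);
    }
    return s / n;
  }

  // Equivalent one-pass form: (Sum(x[i]*y[i]) - Sum(x)*Sum(y)/n) / n
  static double covarianceOnePass(double[] x, double[] y) {
    checkSameLength(x, y);
    int n = x.length;
    double sx = 0, sy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
      sxy += x[i] * y[i];
    }
    return (sxy - sx * sy / n) / n;
  }

  public static void main(String[] args) {
    double[] x = {1, 2, 3, 4};
    double[] y = {2, 4, 6, 8};
    System.out.println(covarianceTwoPass(x, y)); // 2.5
    System.out.println(covarianceOnePass(x, y)); // 2.5
  }
}
```

The one-pass form needs only running sums, which a bin already maintains, but it
can lose precision when the means are large relative to the spread of the data.
*/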
  /**
   * Returns the correlation of two bins, which is <tt>corr(x,y) = covariance(x,y) /
   * (stdDev(x)*stdDev(y))</tt> (Pearson's correlation coefficient). A correlation coefficient
   * ranges from -1 (for a perfect negative relationship) to +1 (for a perfect positive
   * relationship). See the <A
   * HREF="http://www.cquest.utoronto.ca/geog/ggr270y/notes/not05efg.html"> math definition</A> and
   * <A HREF="http://www.stat.berkeley.edu/users/stark/SticiGui/Text/gloss.htm#correlation_coef">
   * another definition</A>.
   *
   * @param other the bin to compare with.
   * @return the correlation.
   * @throws IllegalArgumentException if <tt>size() != other.size()</tt>.
   */
  public synchronized double correlation(DynamicBin1D other) {
    synchronized (other) {
      return covariance(other) / (standardDeviation() * other.standardDeviation());
    }
  }

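/*
Illustration only, not part of this class. Pearson's correlation coefficient as
described above can be sketched on plain arrays. Assumption: this sketch
normalizes both the covariance and the variances by 1/n, which may differ from
the normalization used by standardDeviation() in this class. The names
CorrelationSketch and pearson are hypothetical.

```java
// Hypothetical stand-alone sketch of Pearson's correlation coefficient.
final class CorrelationSketch {

  // Returns cov(x,y) / sqrt(var(x) * var(y)), all normalized by 1/n.
  static double pearson(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("both arrays must have same length");
    }
    int n = x.length;
    double sx = 0, sy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
    }
    double mx = sx / n, my = sy / n;
    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - mx;
      double dy = y[i] - my;
      cov += dx * dy;
      varX += dx * dx;
      varY += dy * dy;
    }
    // the common factor 1/n cancels between numerator and denominator
    return cov / Math.sqrt(varX * varY);
  }

  public static void main(String[] args) {
    double[] x = {1, 2, 3, 4};
    System.out.println(pearson(x, new double[] {2, 4, 6, 8})); // perfect positive relationship
    System.out.println(pearson(x, new double[] {8, 6, 4, 2})); // perfect negative relationship
  }
}
```

If either input is constant, its variance is zero and the result is NaN, since
the coefficient is undefined in that case.
*/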
  /**
   * Returns a deep copy of the receiver.
   *
   * @return a deep copy of the receiver.
   */
  public synchronized Object clone() {
    DynamicBin1D clone = (DynamicBin1D) super.clone();
    if (this.elements != null) clone.elements = clone.elements.copy();
    if (this.sortedElements != null) clone.sortedElements = clone.sortedElements.copy();
    return clone;
  }