Example 1
  /**
   * Generic bootstrap resampling. The implementation works through buffers to keep time and
   * memory costs low even for a large number of resamples. Executes <tt>resamples</tt> resampling
   * steps. In each resampling step it does the following:
   *
   * <ul>
   *   <li>Uniformly samples (chooses) <tt>size()</tt> random elements <i>with replacement</i> from
   *       <tt>this</tt> and fills them into an auxiliary bin <tt>b1</tt>.
   *   <li>Uniformly samples (chooses) <tt>other.size()</tt> random elements <i>with replacement</i>
   *       from <tt>other</tt> and fills them into another auxiliary bin <tt>b2</tt>.
   *   <li>Executes the comparison function <tt>function</tt> on both auxiliary bins
   *       (<tt>function.apply(b1,b2)</tt>) and adds the result of the function to an auxiliary
   *       bootstrap bin <tt>b3</tt>.
   * </ul>
   *
   * <p>Finally returns the auxiliary bootstrap bin <tt>b3</tt> from which the measure of interest
   * can be read off.
   *
   * <p><b>Background:</b>
   *
   * <p>Also see a more <A HREF="http://garnet.acns.fsu.edu/~pkelly/bootstrap.html"> in-depth
   * discussion</A> on bootstrapping and related randomization methods. The classical statistical
   * test for comparing the means of two samples is the <i>t-test</i>. Unfortunately, this test
   * assumes that the two samples each come from a normal distribution and that these distributions
   * have the same standard deviation. Quite often, however, data is distributed non-normally in
   * one way or another. In particular, distributions are often asymmetric. For such data, the
   * t-test may produce misleading results and should thus not be used. Sometimes asymmetric data
   * can be transformed into normally distributed data, e.g. by taking the logarithm, and the
   * t-test will then produce valid results. This still requires postulating a certain
   * distribution underlying the data, however, which is often not warranted because too little is
   * known about the composition of the data.
   *
   * <p><i>Bootstrap resampling of differences of means</i> (and of other differences) is a robust
   * replacement for the t-test and does not require assumptions about the actual distribution of
   * the data. The idea of bootstrapping is quite simple: simulation. The only assumption required
   * is that the two samples <tt>a</tt> and <tt>b</tt> are representative of the underlying
   * distribution with respect to the statistic that is being tested; this assumption is of course
   * implicit in all statistical tests. We can now generate many further samples that correspond
   * to the two given ones by sampling <i>with replacement</i>. This process is called
   * <i>resampling</i>. A resample can (and usually will) have a different mean than the original
   * sample. By drawing hundreds or thousands of such resamples <tt>a<sub>r</sub></tt> from
   * <tt>a</tt> and <tt>b<sub>r</sub></tt> from <tt>b</tt> we can compute the so-called bootstrap
   * distribution of all the differences &quot;mean of <tt>a<sub>r</sub></tt> minus mean of
   * <tt>b<sub>r</sub></tt>&quot;, that is, a bootstrap bin filled with the differences. Now we can
   * compute what fraction of these differences is, say, greater than zero. Let's assume we have
   * computed 1000 resamples of both <tt>a</tt> and <tt>b</tt> and found that only <tt>8</tt> of the
   * differences were greater than zero. Then <tt>8/1000</tt> or <tt>0.008</tt> is the p-value
   * (probability) for the hypothesis that the mean of the distribution underlying <tt>a</tt> is
   * actually larger than the mean of the distribution underlying <tt>b</tt>. From this bootstrap
   * test, we can clearly reject the hypothesis.
   *
   * <p>Instead of differences of means, we can also use other differences, for example the
   * differences of medians.
   *
   * <p>Instead of p-values we can also read arbitrary confidence intervals from the bootstrap bin.
   * For example, if <tt>90%</tt> of all bootstrap differences are left of the value <tt>-3.5</tt>,
   * then a left <tt>90%</tt> confidence interval for the difference is
   * <tt>(-infinity,-3.5)</tt>; in other words: the difference is <tt>-3.5</tt> or smaller with
   * probability <tt>0.9</tt>.
   *
   * <p>Sometimes we would like to compare not only means and medians, but also the variability
   * (spread) of two samples. The conventional method of doing this is the <i>F-test</i>, which
   * compares the standard deviations. It is related to the t-test and, like the latter, assumes
   * that the two samples come from a normal distribution. The F-test is very sensitive to
   * deviations from normality. Instead we can again resort to the more robust bootstrap
   * resampling and compare a measure of spread, for example the inter-quartile range. This way we
   * compute a <i>bootstrap resampling of inter-quartile range differences</i> in order to arrive
   * at a test for inequality of variability.
   *
   * <p><b>Example:</b>
   *
   * <table>
   * <td class="PRE">
   * <pre>
   * // v1,v2 - the two samples to compare against each other
   * double[] v1 = { 1, 2, 3, 4, 5, 6, 7, 8, 9,10,  21,  22,23,24,25,26,27,28,29,30,31};
   * double[] v2 = {10,11,12,13,14,15,16,17,18,19,  20,  30,31,32,33,34,35,36,37,38,39};
   * hep.aida.bin.DynamicBin1D X = new hep.aida.bin.DynamicBin1D();
   * hep.aida.bin.DynamicBin1D Y = new hep.aida.bin.DynamicBin1D();
   * X.addAllOf(new cern.colt.list.DoubleArrayList(v1));
   * Y.addAllOf(new cern.colt.list.DoubleArrayList(v2));
   * cern.jet.random.engine.RandomEngine random = new cern.jet.random.engine.MersenneTwister();
   *
   * // choose ONE of the following three definitions of diff:
   *
   * // bootstrap resampling of differences of means:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return x.mean() - y.mean();}
   * };
   *
   * // bootstrap resampling of differences of medians:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return x.median() - y.median();}
   * };
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return (x.quantile(0.75)-x.quantile(0.25)) - (y.quantile(0.75)-y.quantile(0.25)); }
   * };
   *
   * DynamicBin1D boot = X.sampleBootstrap(Y,1000,random,diff);
   *
   * cern.jet.math.Functions F = cern.jet.math.Functions.functions;
   * System.out.println("p-value="+ (boot.aggregate(F.plus, F.greater(0)) / boot.size()));
   * // quantile(0.9) is the value that 90% of the bootstrap differences lie to the left of
   * System.out.println("left 90% confidence interval = (-infinity,"+boot.quantile(0.9) + ")");
   *
   * -->
   * // bootstrap resampling of differences of means:
   * p-value=0.0080
   * left 90% confidence interval = (-infinity,-3.571428571428573)
   *
   * // bootstrap resampling of differences of medians:
   * p-value=0.36
   * left 90% confidence interval = (-infinity,5.0)
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * p-value=0.5699
   * left 90% confidence interval = (-infinity,5.0)
   * </pre>
   * </td>
   * </table>
   *
   * @param other the other bin to compare the receiver against.
   * @param resamples the number of times resampling shall be done.
   * @param randomGenerator a random number generator. Set this parameter to <tt>null</tt> to use a
   *     default random number generator seeded with the current time.
   * @param function a difference function comparing two samples; takes as first argument a sample
   *     of <tt>this</tt> and as second argument a sample of <tt>other</tt>.
   * @return a bootstrap bin holding the results of <tt>function</tt> of each resampling step.
   * @see cern.colt.GenericPermuting#permutation(long,int)
   */
  public synchronized DynamicBin1D sampleBootstrap(
      DynamicBin1D other,
      int resamples,
      cern.jet.random.engine.RandomEngine randomGenerator,
      BinBinFunction1D function) {
    if (randomGenerator == null) randomGenerator = cern.jet.random.Uniform.makeDefaultGenerator();

    // since "resamples" can be quite large, we care about performance and memory
    int maxCapacity = 1000;
    int s1 = size();
    int s2 = other.size();

    // prepare auxiliary bins and buffers
    DynamicBin1D sample1 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer1 = sample1.buffered(Math.min(maxCapacity, s1));

    DynamicBin1D sample2 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer2 = sample2.buffered(Math.min(maxCapacity, s2));

    DynamicBin1D bootstrap = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer bootBuffer = bootstrap.buffered(Math.min(maxCapacity, resamples));

    // resampling steps
    for (int i = resamples; --i >= 0; ) {
      sample1.clear();
      sample2.clear();

      this.sample(s1, true, randomGenerator, buffer1);
      other.sample(s2, true, randomGenerator, buffer2);

      bootBuffer.add(function.apply(sample1, sample2));
    }
    bootBuffer.flush();
    return bootstrap;
  }
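The resampling loop described in the Javadoc above can be tried without the Colt classes. The following is a minimal sketch under stated assumptions: it uses plain arrays and java.util.Random instead of DynamicBin1D and RandomEngine, all names in it (BootstrapSketch, resample, bootstrap, fractionGreaterThanZero, quantile) are hypothetical, and its nearest-rank quantile is a simplification that need not agree with the quantile method of DynamicBin1D.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-alone sketch; not part of the Colt library.
class BootstrapSketch {

    /** Draws data.length elements from data, uniformly and with replacement. */
    static double[] resample(double[] data, Random rng) {
        double[] result = new double[data.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = data[rng.nextInt(data.length)];
        }
        return result;
    }

    static double mean(double[] data) {
        double sum = 0;
        for (double v : data) sum += v;
        return sum / data.length;
    }

    /** Runs the given number of resampling steps; returns one statistic value per step. */
    static double[] bootstrap(double[] a, double[] b, int resamples, Random rng,
                              ToDoubleBiFunction<double[], double[]> statistic) {
        double[] boot = new double[resamples];
        for (int i = 0; i < resamples; i++) {
            boot[i] = statistic.applyAsDouble(resample(a, rng), resample(b, rng));
        }
        return boot;
    }

    /** Fraction of values strictly greater than zero (the p-value of the Javadoc example). */
    static double fractionGreaterThanZero(double[] values) {
        int count = 0;
        for (double v : values) if (v > 0) count++;
        return (double) count / values.length;
    }

    /** Nearest-rank empirical quantile; a simplification, not Colt's quantile(). */
    static double quantile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, index))];
    }

    public static void main(String[] args) {
        double[] v1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};
        double[] v2 = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39};
        double[] boot = bootstrap(v1, v2, 1000, new Random(42), (x, y) -> mean(x) - mean(y));
        System.out.println("p-value=" + fractionGreaterThanZero(boot));
        System.out.println("0.9 quantile of the differences=" + quantile(boot, 0.9));
    }
}
```

Passing the statistic as a function mirrors the role of BinBinFunction1D: swapping the lambda for a difference of medians or of inter-quartile ranges changes the test without touching the resampling loop.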
Example 2
  /**
   * Uniformly samples (chooses) <tt>n</tt> random elements <i>with or without replacement</i> from
   * the contained elements and adds them to the given buffer. If the buffer is connected to a bin,
   * the effect is that the chosen elements are added to the bin connected to the buffer. Also see
   * {@link #buffered(int) buffered}.
   *
   * @param n the number of elements to choose.
   * @param withReplacement <tt>true</tt> samples with replacement, otherwise samples without
   *     replacement.
   * @param randomGenerator a random number generator. Set this parameter to <tt>null</tt> to use a
   *     default random number generator seeded with the current time.
   * @param buffer the buffer to which chosen elements will be added.
   * @throws IllegalArgumentException if {@code !withReplacement && n > size()}.
   * @see cern.jet.random.sampling
   */
  public synchronized void sample(
      int n,
      boolean withReplacement,
      RandomEngine randomGenerator,
      cern.colt.buffer.DoubleBuffer buffer) {
    if (randomGenerator == null) randomGenerator = cern.jet.random.Uniform.makeDefaultGenerator();
    buffer.clear();

    if (!withReplacement) { // without replacement
      int s = size();
      if (n > s) throw new IllegalArgumentException("n must be less than or equal to size()");
      cern.jet.random.sampling.RandomSamplingAssistant sampler =
          new cern.jet.random.sampling.RandomSamplingAssistant(n, s, randomGenerator);
      // ask the assistant once per contained element (not once per requested element),
      // so that all size() elements are candidates for the n picks
      for (int i = s; --i >= 0; ) {
        if (sampler.sampleNextElement()) buffer.add(this.elements.getQuick(i));
      }
    } else { // with replacement
      cern.jet.random.Uniform uniform = new cern.jet.random.Uniform(randomGenerator);
      int s = size();
      for (int i = n; --i >= 0; ) {
        buffer.add(this.elements.getQuick(uniform.nextIntFromTo(0, s - 1)));
      }
    }
    // flush in both cases so that no chosen element is left behind in the buffer
    buffer.flush();
  }
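The "without replacement" branch above decides, element by element, whether the next element of the input sequence is picked. One well-known way to make such decisions is selection sampling (Knuth's Algorithm S). The following is a minimal sketch under stated assumptions: it is not the implementation behind RandomSamplingAssistant, it uses java.util.Random instead of RandomEngine, and the names SelectionSamplingSketch and sampleWithoutReplacement are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-alone sketch; not part of the Colt library.
class SelectionSamplingSketch {

    /**
     * Chooses exactly n of the given elements, uniformly at random and without
     * replacement, keeping them in their original order.
     */
    static double[] sampleWithoutReplacement(double[] elements, int n, Random rng) {
        if (n < 0 || n > elements.length) {
            throw new IllegalArgumentException("n must be between 0 and elements.length");
        }
        double[] chosen = new double[n];
        int picked = 0;
        // Visit every element once; stop early when n elements have been picked.
        for (int seen = 0; seen < elements.length && picked < n; seen++) {
            int remaining = elements.length - seen; // elements not yet visited
            int needed = n - picked;                // picks still to be made
            // Pick the current element with probability needed / remaining.
            if (rng.nextInt(remaining) < needed) {
                chosen[picked++] = elements[seen];
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[] data = {10, 20, 30, 40, 50, 60, 70, 80};
        double[] sample = sampleWithoutReplacement(data, 3, new Random(7));
        System.out.println(Arrays.toString(sample));
    }
}
```

Because the pick probability reaches 1 as soon as the number of picks still needed equals the number of elements left, the loop always returns exactly n elements, but only if it is allowed to visit the whole input sequence.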