Example #1
  /**
   * Generic bootstrap resampling, optimized for speed and memory use. Executes
   * <tt>resamples</tt> resampling steps. Each resampling step does the following:
   *
   * <ul>
   *   <li>Uniformly samples (chooses) <tt>size()</tt> random elements <i>with replacement</i> from
   *       <tt>this</tt> and fills them into an auxiliary bin <tt>b1</tt>.
   *   <li>Uniformly samples (chooses) <tt>other.size()</tt> random elements <i>with replacement</i>
   *       from <tt>other</tt> and fills them into another auxiliary bin <tt>b2</tt>.
   *   <li>Executes the comparison function <tt>function</tt> on both auxiliary bins
   *       (<tt>function.apply(b1,b2)</tt>) and adds the result of the function to an auxiliary
   *       bootstrap bin <tt>b3</tt>.
   * </ul>
   *
   * <p>Finally returns the auxiliary bootstrap bin <tt>b3</tt> from which the measure of interest
   * can be read off.
   *
   * <p><b>Background:</b>
   *
   * <p>Also see a more <A HREF="http://garnet.acns.fsu.edu/~pkelly/bootstrap.html"> in-depth
   * discussion</A> on bootstrapping and related randomization methods. The classical statistical
   * test for comparing the means of two samples is the <i>t-test</i>. Unfortunately, this test
   * assumes that the two samples each come from a normal distribution and that these distributions
   * have the same standard deviation. Quite often, however, data has a distribution that is
   * non-normal in many ways. In particular, distributions are often asymmetric. For such data, the
   * t-test may produce misleading results and should thus not be used. Sometimes asymmetric data
   * can be transformed into normally distributed data by taking e.g. the logarithm and the t-test
   * will then produce valid results, but this still requires postulation of a certain distribution
   * underlying the data, which is often not warranted, because too little is known about the data
   * composition.
   *
   * <p><i>Bootstrap resampling of differences of means</i> (and other differences) is a robust
   * replacement for the t-test and does not require assumptions about the actual distribution of
   * the data. The idea of bootstrapping is quite simple: simulation. The only assumption required
   * is that the two samples <tt>a</tt> and <tt>b</tt> are representative of the underlying
   * distribution with respect to the statistic that is being tested - this assumption is of course
   * implicit in all statistical tests. We can now generate lots of further samples that correspond
   * to the two given ones, by sampling <i>with replacement</i>. This process is called
   * <i>resampling</i>. A resample can (and usually will) have a different mean than the original
   * one and by drawing hundreds or thousands of such resamples <tt>a<sub>r</sub></tt> from
   * <tt>a</tt> and <tt>b<sub>r</sub></tt> from <tt>b</tt> we can compute the so-called bootstrap
   * distribution of all the differences &quot;mean of <tt>a<sub>r</sub></tt> minus mean of
   * <tt>b<sub>r</sub></tt>&quot;. That is, a bootstrap bin filled with the differences. Now we can
   * compute what fraction of these differences is, say, greater than zero. Let's assume we have
   * computed 1000 resamples of both <tt>a</tt> and <tt>b</tt> and found that only <tt>8</tt> of the
   * differences were greater than zero. Then <tt>8/1000</tt> or <tt>0.008</tt> is the p-value
   * (probability) for the hypothesis that the mean of the distribution underlying <tt>a</tt> is
   * actually larger than the mean of the distribution underlying <tt>b</tt>. From this bootstrap
   * test, we can clearly reject the hypothesis.
   *
   * <p>Instead of differences of means, we can also use other differences, for example
   * differences of medians.
   *
   * <p>Instead of p-values we can also read arbitrary confidence intervals from the bootstrap bin.
   * For example, if <tt>90%</tt> of all bootstrap differences are left of the value <tt>-3.5</tt>,
   * then a left <tt>90%</tt> confidence interval for the difference is
   * <tt>(-infinity,-3.5)</tt>; in other words: the difference is <tt>-3.5</tt> or smaller with
   * probability <tt>0.9</tt>.
   *
   * <p>Sometimes we would like to compare not only means and medians, but also the variability
   * (spread) of two samples. The conventional method of doing this is the <i>F-test</i>, which
   * compares the standard deviations. It is related to the t-test and, like the latter, assumes the
   * two samples to come from a normal distribution. The F-test is very sensitive to data with
   * deviations from normality. Instead we can again resort to more robust bootstrap resampling and
   * compare a measure of spread, for example the inter-quartile range. This way we compute a
   * <i>bootstrap resampling of inter-quartile range differences</i> in order to arrive at a test
   * for inequality of variability.
   *
   * <p><b>Example:</b>
   *
   * <table>
   * <td class="PRE">
   * <pre>
   * // v1,v2 - the two samples to compare against each other
   * double[] v1 = { 1, 2, 3, 4, 5, 6, 7, 8, 9,10,  21,  22,23,24,25,26,27,28,29,30,31};
   * double[] v2 = {10,11,12,13,14,15,16,17,18,19,  20,  30,31,32,33,34,35,36,37,38,39};
   * hep.aida.bin.DynamicBin1D X = new hep.aida.bin.DynamicBin1D();
   * hep.aida.bin.DynamicBin1D Y = new hep.aida.bin.DynamicBin1D();
   * X.addAllOf(new cern.colt.list.DoubleArrayList(v1));
   * Y.addAllOf(new cern.colt.list.DoubleArrayList(v2));
   * cern.jet.random.engine.RandomEngine random = new cern.jet.random.engine.MersenneTwister();
   *
   * // define exactly ONE of the following three difference functions per run
   * // bootstrap resampling of differences of means:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return x.mean() - y.mean();}
   * };
   *
   * // bootstrap resampling of differences of medians:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return x.median() - y.median();}
   * };
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * BinBinFunction1D diff = new BinBinFunction1D() {
   * &nbsp;&nbsp;&nbsp;public double apply(DynamicBin1D x, DynamicBin1D y) {return (x.quantile(0.75)-x.quantile(0.25)) - (y.quantile(0.75)-y.quantile(0.25)); }
   * };
   *
   * DynamicBin1D boot = X.sampleBootstrap(Y,1000,random,diff);
   *
   * cern.jet.math.Functions F = cern.jet.math.Functions.functions;
   * System.out.println("p-value="+ (boot.aggregate(F.plus, F.greater(0)) / boot.size()));
   * System.out.println("left 90% confidence interval = (-infinity," + boot.quantile(0.9) + ")");
   *
   * -->
   * // bootstrap resampling of differences of means:
   * p-value=0.0080
   * left 90% confidence interval = (-infinity,-3.571428571428573)
   *
   * // bootstrap resampling of differences of medians:
   * p-value=0.36
   * left 90% confidence interval = (-infinity,5.0)
   *
   * // bootstrap resampling of differences of inter-quartile ranges:
   * p-value=0.5699
   * left 90% confidence interval = (-infinity,5.0)
   * </pre>
   * </td>
   * </table>
   *
   * @param other the other bin to compare the receiver against.
   * @param resamples the number of times resampling shall be done.
   * @param randomGenerator a random number generator. Set this parameter to <tt>null</tt> to use a
   *     default random number generator seeded with the current time.
   * @param function a difference function comparing two samples; takes as first argument a sample
   *     of <tt>this</tt> and as second argument a sample of <tt>other</tt>.
   * @return a bootstrap bin holding the results of <tt>function</tt> of each resampling step.
   * @see cern.colt.GenericPermuting#permutation(long,int)
   */
  public synchronized DynamicBin1D sampleBootstrap(
      DynamicBin1D other,
      int resamples,
      cern.jet.random.engine.RandomEngine randomGenerator,
      BinBinFunction1D function) {
    if (randomGenerator == null) randomGenerator = cern.jet.random.Uniform.makeDefaultGenerator();

    // since "resamples" can be quite large, we care about performance and memory
    int maxCapacity = 1000;
    int s1 = size();
    int s2 = other.size();

    // prepare auxiliary bins and buffers
    DynamicBin1D sample1 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer1 = sample1.buffered(Math.min(maxCapacity, s1));

    DynamicBin1D sample2 = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer buffer2 = sample2.buffered(Math.min(maxCapacity, s2));

    DynamicBin1D bootstrap = new DynamicBin1D();
    cern.colt.buffer.DoubleBuffer bootBuffer = bootstrap.buffered(Math.min(maxCapacity, resamples));

    // resampling steps
    for (int i = resamples; --i >= 0; ) {
      sample1.clear();
      sample2.clear();

      this.sample(s1, true, randomGenerator, buffer1);
      other.sample(s2, true, randomGenerator, buffer2);

      bootBuffer.add(function.apply(sample1, sample2));
    }
    bootBuffer.flush();
    return bootstrap;
  }
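
The resampling loop above can also be sketched without Colt. The following is a minimal illustration in plain Java, not the library's implementation: the class `BootstrapSketch`, its method names, and the use of `java.util.Random` in place of `RandomEngine` are all assumptions made for this example, and it skips the buffering that the original uses to save memory.

```java
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-alone sketch; none of these names exist in Colt.
final class BootstrapSketch {

  /** Draws data.length elements uniformly, with replacement, from a non-empty array. */
  static double[] resample(double[] data, Random rng) {
    double[] out = new double[data.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = data[rng.nextInt(data.length)];
    }
    return out;
  }

  /** Arithmetic mean of a non-empty array. */
  static double mean(double[] data) {
    double sum = 0;
    for (double d : data) sum += d;
    return sum / data.length;
  }

  /**
   * Runs the given number of resampling steps. Each step resamples a and b
   * independently and records statistic(resampleOfA, resampleOfB).
   */
  static double[] bootstrap(double[] a, double[] b, int resamples, Random rng,
      ToDoubleBiFunction<double[], double[]> statistic) {
    double[] results = new double[resamples];
    for (int i = 0; i < resamples; i++) {
      results[i] = statistic.applyAsDouble(resample(a, rng), resample(b, rng));
    }
    return results;
  }

  /** Fraction of values strictly greater than zero (the p-value in the Javadoc example). */
  static double fractionGreaterThanZero(double[] values) {
    int count = 0;
    for (double v : values) {
      if (v > 0) count++;
    }
    return (double) count / values.length;
  }
}
```

Calling `BootstrapSketch.bootstrap(v1, v2, 1000, new Random(), (x, y) -> BootstrapSketch.mean(x) - BootstrapSketch.mean(y))` and passing the result to `fractionGreaterThanZero` mirrors the p-value computation in the Javadoc example; the exact value depends on the random seed.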
Example #2
  /**
   * Returns whether two bins are equal. They are equal if the other object is of the same class or
   * a subclass of this class and both have the same size, minimum, maximum, sum and sumOfSquares
   * and have the same elements, order being irrelevant (multiset equality).
   *
   * <p>Definition of <i>Equality</i> for multisets: A,B are equal &lt;=&gt; A is a superset of B and
   * B is a superset of A. (Elements must occur the same number of times, order is irrelevant.)
   */
  public synchronized boolean equals(Object object) {
    if (!(object instanceof DynamicBin1D)) return false;
    if (!super.equals(object)) return false;

    DynamicBin1D other = (DynamicBin1D) object;
    double[] s1 = sortedElements_unsafe().elements();
    synchronized (other) {
      double[] s2 = other.sortedElements_unsafe().elements();
      int n = size();
      return includes(s1, s2, 0, n, 0, n) && includes(s2, s1, 0, n, 0, n);
    }
  }
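
Multiset equality as described in the Javadoc can be illustrated on plain arrays. The sketch below is an assumption-laden stand-in, not Colt code: instead of two `includes` passes over the sorted elements it sorts copies of both arrays and compares them position by position, which gives the same answer for equal-sized inputs. The class name `MultisetEqualitySketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of Colt.
final class MultisetEqualitySketch {

  /** Returns true if both arrays hold the same values the same number of times. */
  static boolean sameElements(double[] a, double[] b) {
    if (a.length != b.length) return false;
    // Sort copies so that the callers' arrays are left untouched.
    double[] s1 = a.clone();
    double[] s2 = b.clone();
    Arrays.sort(s1);
    Arrays.sort(s2);
    // Note: Arrays.equals treats NaN as equal to NaN and 0.0 as different from -0.0.
    return Arrays.equals(s1, s2);
  }
}
```

For instance, `{3, 1, 2, 2}` and `{2, 3, 2, 1}` compare as equal, while `{1, 2, 2}` and `{1, 1, 2}` do not, because the element counts differ.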
Example #3
  /**
   * Returns the covariance of two bins, which is <tt>cov(x,y) = (1/size()) * Sum((x[i]-mean(x)) *
   * (y[i]-mean(y)))</tt>. See the <A
   * HREF="http://www.cquest.utoronto.ca/geog/ggr270y/notes/not05efg.html"> math definition</A>.
   *
   * @param other the bin to compare with.
   * @return the covariance.
   * @throws IllegalArgumentException if <tt>size() != other.size()</tt>.
   */
  public synchronized double covariance(DynamicBin1D other) {
    synchronized (other) {
      if (size() != other.size())
        throw new IllegalArgumentException("both bins must have same size");
      double s = 0;
      for (int i = size(); --i >= 0; ) {
        s += this.elements.getQuick(i) * other.elements.getQuick(i);
      }

      double cov = (s - sum() * other.sum() / size()) / size();
      return cov;
    }
  }
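
The method body does not evaluate the documented formula directly; it uses the algebraically equivalent shortcut `(Sum(x[i]*y[i]) - Sum(x)*Sum(y)/n) / n`. The sketch below, written against plain arrays with hypothetical names (`CovarianceSketch`, `byDefinition`, `byShortcut`), computes both forms so they can be compared.

```java
// Hypothetical stand-alone sketch; not part of Colt.
final class CovarianceSketch {

  /** cov(x,y) = (1/n) * Sum((x[i]-mean(x)) * (y[i]-mean(y))), as stated in the Javadoc. */
  static double byDefinition(double[] x, double[] y) {
    checkSameSize(x, y);
    int n = x.length;
    double meanX = 0, meanY = 0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double s = 0;
    for (int i = 0; i < n; i++) {
      s += (x[i] - meanX) * (y[i] - meanY);
    }
    return s / n;
  }

  /** Single-pass form: (Sum(x*y) - Sum(x)*Sum(y)/n) / n. */
  static double byShortcut(double[] x, double[] y) {
    checkSameSize(x, y);
    int n = x.length;
    double sumX = 0, sumY = 0, sumXY = 0;
    for (int i = 0; i < n; i++) {
      sumX += x[i];
      sumY += y[i];
      sumXY += x[i] * y[i];
    }
    return (sumXY - sumX * sumY / n) / n;
  }

  private static void checkSameSize(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("both arrays must have same size");
    }
  }
}
```

For `x = {1, 2, 3, 4}` and `y = {2, 4, 6, 8}` both forms give 2.5. The shortcut needs only running sums, but it can lose precision when the sums are large relative to the spread of the data.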
Example #4
 /**
  * Returns the correlation of two bins, which is <tt>corr(x,y) = covariance(x,y) /
  * (stdDev(x)*stdDev(y))</tt> (Pearson's correlation coefficient). A correlation coefficient
  * ranges from -1 (for a perfect negative relationship) to +1 (for a perfect positive
  * relationship). See the <A
  * HREF="http://www.cquest.utoronto.ca/geog/ggr270y/notes/not05efg.html"> math definition</A> and
  * <A HREF="http://www.stat.berkeley.edu/users/stark/SticiGui/Text/gloss.htm#correlation_coef">
  * another def</A>.
  *
  * @param other the bin to compare with.
  * @return the correlation.
  * @throws IllegalArgumentException if <tt>size() != other.size()</tt>.
  */
 public synchronized double correlation(DynamicBin1D other) {
   synchronized (other) {
     return covariance(other) / (standardDeviation() * other.standardDeviation());
   }
 }
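
As a rough illustration of the formula in the Javadoc, the following sketch computes Pearson's correlation on plain arrays. It assumes the population form (division by `n`) for both the covariance and the standard deviations; the class and method names are hypothetical and not part of Colt.

```java
// Hypothetical stand-alone sketch; not part of Colt.
final class CorrelationSketch {

  /** corr(x,y) = cov(x,y) / (stdDev(x) * stdDev(y)), all in population form. */
  static double correlation(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("both arrays must have same size");
    }
    int n = x.length;
    double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
    for (int i = 0; i < n; i++) {
      sumX += x[i];
      sumY += y[i];
      sumXX += x[i] * x[i];
      sumYY += y[i] * y[i];
      sumXY += x[i] * y[i];
    }
    double cov = (sumXY - sumX * sumY / n) / n;
    double varX = (sumXX - sumX * sumX / n) / n;
    double varY = (sumYY - sumY * sumY / n) / n;
    // If either sample is constant, the denominator is 0 and the result is NaN.
    return cov / Math.sqrt(varX * varY);
  }
}
```

With `x = {1, 2, 3, 4}`, the input `y = {2, 4, 6, 8}` yields 1.0 and `y = {8, 6, 4, 2}` yields -1.0; a constant sample yields `NaN`, so callers may want to check for zero spread first.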
Example #5
 /**
  * Returns a deep copy of the receiver.
  *
  * @return a deep copy of the receiver.
  */
 public synchronized Object clone() {
   DynamicBin1D clone = (DynamicBin1D) super.clone();
   if (this.elements != null) clone.elements = clone.elements.copy();
   if (this.sortedElements != null) clone.sortedElements = clone.sortedElements.copy();
   return clone;
 }
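
The pattern used here, calling `super.clone()` and then replacing each mutable field with its own copy, can be shown on a tiny class. `BinSketch` below is a hypothetical stand-in with a single array field; it is meant only to illustrate why the extra copy matters.

```java
// Hypothetical stand-alone sketch; not part of Colt.
final class BinSketch implements Cloneable {
  double[] elements;

  BinSketch(double... values) {
    this.elements = values.clone();
  }

  @Override
  public BinSketch clone() {
    try {
      // super.clone() copies the field references only (shallow copy).
      BinSketch copy = (BinSketch) super.clone();
      // Give the copy its own array so later changes do not leak across objects.
      if (this.elements != null) copy.elements = this.elements.clone();
      return copy;
    } catch (CloneNotSupportedException e) {
      // Cannot happen because this class implements Cloneable.
      throw new AssertionError(e);
    }
  }
}
```

Without the second step, the original and the clone would share one array, and adding or changing elements in one would silently affect the other.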