/**
 * Add a hit's description to the set of hits
 *
 * @param hits
 * @param hit2add
 * @param marker
 * @param showGeneDetails
 * @param compareTemplate
 */
void regionsAddHit(
    HashSet<String> hits,
    Marker hit2add,
    Marker marker,
    boolean showGeneDetails,
    boolean compareTemplate) {
  String hitStr = hit2add.getClass().getSimpleName();

  // Annotate the hit with its strand relative to the query marker
  if (compareTemplate) {
    Gene gene = (Gene) hit2add.findParent(Gene.class);
    if (gene != null)
      hitStr +=
          (hit2add.isStrandPlus() == marker.isStrandPlus())
              ? "_TEMPLATE_STRAND"
              : "_NON_TEMPLATE_STRAND";
  }

  // Append biotype, gene name and coding status for genes
  if (showGeneDetails && (hit2add instanceof Gene)) {
    Gene gene = (Gene) hit2add;
    hitStr +=
        "["
            + gene.getBioType()
            + ", "
            + gene.getGeneName()
            + ", "
            + (gene.isProteinCoding() ? "protein" : "not-protein")
            + "]";
  }

  hits.add(hitStr); // Add marker name to the set
}
/**
 * Count number of bases, for a given chromosome and marker type
 *
 * @param mtype
 * @param chr
 * @param markers
 */
void countBases(String mtype, Chromosome chr, Markers markers) {
  String chrName = chr.getChromosomeName();
  if (verbose) System.err.print(" " + chrName);

  // One flag per base (Java zero-initializes new arrays, so no explicit reset is needed)
  byte[] busy = new byte[chr.size()];

  for (Marker m : markers) {
    // Same marker type & same chromo? Count bases
    if (m.getChromosomeName().equals(chrName) && markerTypes.isType(m, mtype)) {
      for (int i = m.getStart(); i <= m.getEnd(); i++) busy[i] = 1;
    }
  }

  int latest = 0;
  for (int i = 0; i < busy.length; i++) {
    // Transition? Count another marker
    if ((i > 0) && (busy[i] != 0) && (busy[i - 1] == 0)) {
      if ((i - latest) <= readLength)
        countBases.inc(mtype, i - latest); // Intervals are less than one read away? Unify them
      else countMarkers.inc(mtype);
    }

    // Base busy? Count another base
    if (busy[i] != 0) {
      countBases.inc(mtype);
      latest = i;
    }
  }
}
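// The method above marks every covered position in a per-chromosome byte array, so bases
// shared by overlapping markers are counted only once. A minimal standalone sketch of the
// same "count covered bases" idea follows. It swaps in a sort-and-merge pass instead of the
// byte array (this avoids allocating one byte per base) and it does not reproduce the
// readLength-based unification. The class and method names are hypothetical, and intervals
// are assumed to be closed {start, end} pairs with start <= end.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CoveredBasesSketch {

  /** Count distinct positions covered by closed intervals given as {start, end} pairs. */
  static long countCoveredBases(int[][] intervals) {
    // Sort a shallow copy by start coordinate so the caller's array order is untouched
    int[][] sorted = intervals.clone();
    Arrays.sort(sorted, Comparator.comparingInt((int[] iv) -> iv[0]));

    long covered = 0;
    int curStart = 0, curEnd = -1;
    boolean open = false; // Is there a merged interval currently being extended?

    for (int[] iv : sorted) {
      if (open && iv[0] <= curEnd) {
        // Overlaps the current merged interval: extend it
        curEnd = Math.max(curEnd, iv[1]);
      } else {
        // Disjoint: flush the previous merged interval and start a new one
        if (open) covered += (long) curEnd - curStart + 1;
        curStart = iv[0];
        curEnd = iv[1];
        open = true;
      }
    }

    if (open) covered += (long) curEnd - curStart + 1;
    return covered;
  }
}
```

// Usage: CoveredBasesSketch.countCoveredBases(new int[][] {{0, 4}, {2, 6}, {10, 10}})
// merges the first two intervals into [0, 6] and keeps [10, 10] separate.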
/**
 * Predict the effect of a seqChange
 *
 * @param seqChange : Sequence change
 * @param seqChangeRef : Before analyzing results, we have to change markers using seqChangeRef
 *     to create a new reference 'on the fly'
 */
public ChangeEffects seqChangeEffect(Variant seqChange, Variant seqChangeRef) {
  ChangeEffects changeEffects = new ChangeEffects(seqChange, seqChangeRef);

  // ---
  // Chromosome missing?
  // ---
  if (Config.get().isErrorOnMissingChromo() && isChromosomeMissing(seqChange)) {
    changeEffects.addErrorWarning(ErrorWarningType.ERROR_CHROMOSOME_NOT_FOUND);
    return changeEffects;
  }

  // ---
  // Check that this is not a huge deletion.
  // Huge deletions would crash the rest of the algorithm, so we need to stop them here.
  // ---
  if (seqChange.isDel() && (seqChange.size() > HUGE_DELETION_SIZE_THRESHOLD)) {
    // Get chromosome
    String chromoName = seqChange.getChromosomeName();
    Chromosome chr = genome.getChromosome(chromoName);

    if (chr.size() > 0) {
      double ratio = seqChange.size() / ((double) chr.size());
      if (ratio > HUGE_DELETION_RATIO_THRESHOLD) {
        changeEffects.add(chr, EffectType.CHROMOSOME_LARGE_DELETION, "");
        return changeEffects;
      }
    }
  }

  // ---
  // Query interval tree: Which intervals does seqChange intersect?
  // ---
  Markers intersects = query(seqChange);

  // Show all results
  boolean hitChromo = false, hitSomething = false;
  if (intersects.size() > 0) {
    for (Marker marker : intersects) {
      if (marker instanceof Chromosome) hitChromo = true; // Do we hit any chromosome?
      else { // Analyze all markers
        marker.seqChangeEffect(seqChange, changeEffects, seqChangeRef);
        hitSomething = true;
      }
    }
  }

  // Any errors or intergenic (i.e. did not hit any gene)
  if (!hitChromo) {
    if (Config.get().isErrorChromoHit())
      changeEffects.addErrorWarning(ErrorWarningType.ERROR_OUT_OF_CHROMOSOME_RANGE);
  } else if (!hitSomething) {
    if (Config.get().isOnlyRegulation()) changeEffects.setEffectType(EffectType.NONE);
    else changeEffects.setEffectType(EffectType.INTERGENIC);
  }

  return changeEffects;
}
/**
 * Find the last position where nonsense-mediated decay (NMD) is supposed to occur. This is 50
 * bases (MND_BASES_BEFORE_LAST_JUNCTION bases) before the last exon-exon junction.
 *
 * @param tr
 * @return the position; -1 if the transcript has at most one coding exon; 0 (plus strand) or
 *     Integer.MAX_VALUE (minus strand) if the position falls outside the CDS
 */
public int lastNmdPos(Transcript tr) {
  // ---
  // Get last exon
  // ---
  int cdsEnd = tr.getCdsEnd();
  int cdsStart = tr.getCdsStart();
  Marker cds =
      new Marker(
          tr.getChromosome(),
          Math.min(cdsStart, cdsEnd),
          Math.max(cdsStart, cdsEnd),
          tr.getStrand(),
          ""); // Create a cds marker
  Exon lastExon = null;
  int countCodingExons = 0;
  for (Exon exon : tr.sortedStrand()) {
    if (exon.intersects(cdsEnd)) lastExon = exon;
    if (cds.intersects(exon)) countCodingExons++;
  }

  // Only one coding exon? => No NMD
  // Note: I'm assuming that we should have a splice event in a coding part of the transcript for
  // NMD to happen.
  if (countCodingExons <= 1) return -1;

  // Sanity check
  if (lastExon == null)
    throw new RuntimeException(
        "Cannot find last coding exon for transcript '"
            + tr.getId()
            + "' (cdsEnd: "
            + cdsEnd
            + ")\n\t"
            + tr);

  // ---
  // Find the position MND_BASES_BEFORE_LAST_JUNCTION bases before the last exon-exon junction
  // ---
  int lastExonJunction = tr.isStrandPlus() ? lastExon.getStart() : lastExon.getEnd();
  int chrPos[] = tr.baseNumberCds2Pos();
  int lastNmdPos = -1;
  for (int cdsi = chrPos.length - 1; cdsi >= 0; cdsi--) {
    if (chrPos[cdsi] == lastExonJunction) {
      if (cdsi > MND_BASES_BEFORE_LAST_JUNCTION)
        lastNmdPos = chrPos[cdsi - MND_BASES_BEFORE_LAST_JUNCTION - 1];
      else return tr.isStrandPlus() ? 0 : Integer.MAX_VALUE; // Out of CDS range
      return lastNmdPos;
    }
  }

  throw new RuntimeException(
      "Cannot find last exon junction position for transcript '" + tr.getId() + "'\n\t" + tr);
  // return -1;
}
/** Count bases covered for each marker type */ public void countBases() { // --- // Add all markers // --- Markers markers = new Markers(); markers.add(snpEffectPredictor.getMarkers()); for (Gene gene : snpEffectPredictor.getGenome().getGenes()) { markers.add(gene); markers.add(gene.markers()); } for (Chromosome chr : snpEffectPredictor.getGenome()) markers.add(chr); // --- // Calculate raw counts // --- for (Marker m : markers) { String mtype = markerTypes.getType(m); String msubtype = markerTypes.getSubType(m); rawCountMarkers.inc(mtype); rawCountBases.inc(mtype, m.size()); // Count sub-types (if any) if (msubtype != null) { rawCountMarkers.inc(msubtype); rawCountBases.inc(msubtype, m.size()); } } // --- // Count number of bases for each marker type (overlap and join) // --- for (String mtype : rawCountMarkers.keysSorted()) { if (mtype.equals(Chromosome.class.getSimpleName())) continue; // We calculate chromosomes later (it's faster) if (verbose) System.err.print(mtype + ":"); if (countMarkers.get(mtype) == 0) { for (Chromosome chr : snpEffectPredictor.getGenome()) countBases(mtype, chr, markers); } if (verbose) System.err.println(""); } // Show chromosomes length String mtype = Chromosome.class.getSimpleName(); for (Chromosome chr : snpEffectPredictor.getGenome()) { countBases.inc(mtype, chr.size()); countMarkers.inc(mtype); } }
public Gene getGene() { if (marker != null) { if (marker instanceof Gene) return (Gene) marker; return (Gene) marker.findParent(Gene.class); } return null; }
/** Get intron (if any) */ public Intron getIntron() { if (marker != null) { if (marker instanceof Intron) return (Intron) marker; return (Intron) marker.findParent(Intron.class); } return null; }
public Transcript getTranscript() { if (marker != null) { if (marker instanceof Transcript) return (Transcript) marker; return (Transcript) marker.findParent(Transcript.class); } return null; }
/** Get exon (if any) */ public Exon getExon() { if (marker != null) { if (marker instanceof Exon) return (Exon) marker; return (Exon) marker.findParent(Exon.class); } return null; }
/** * Is this deletion a LOF? * * <p>Criteria: 1) First (coding) exon deleted 2) More than 50% of coding sequence deleted * * @param changeEffect * @return */ protected boolean isLofDeletion(ChangeEffect changeEffect) { Transcript tr = changeEffect.getTranscript(); if (tr == null) throw new RuntimeException("Transcript not found for change:\n\t" + changeEffect); // --- // Criteria: // 1) First (coding) exon deleted // --- if (changeEffect.getEffectType() == EffectType.EXON_DELETED) { Variant seqChange = changeEffect.getSeqChange(); if (seqChange == null) throw new RuntimeException("Cannot retrieve 'seqChange' from EXON_DELETED effect!"); if (seqChange.includes(tr.getFirstCodingExon())) return true; } // --- // Criteria: // 2) More than 50% of coding sequence deleted // --- // Find coding part of the transcript (i.e. no UTRs) Variant seqChange = changeEffect.getSeqChange(); int cdsStart = tr.isStrandPlus() ? tr.getCdsStart() : tr.getCdsEnd(); int cdsEnd = tr.isStrandPlus() ? tr.getCdsEnd() : tr.getCdsStart(); Marker coding = new Marker(seqChange.getChromosome(), cdsStart, cdsEnd, 1, ""); // Create an interval intersecting the CDS and the deletion int start = Math.max(cdsStart, seqChange.getStart()); int end = Math.min(cdsEnd, seqChange.getEnd()); if (start >= end) return false; // No intersections with coding part of the exon? => not LOF Marker codingDeleted = new Marker(seqChange.getChromosome(), start, end, 1, ""); // Count: // - number of coding bases deleted // - number of coding bases int codingBasesDeleted = 0, codingBases = 0; for (Exon exon : tr) { codingBasesDeleted += codingDeleted.intersectSize(exon); codingBases += coding.intersectSize(exon); } // More than a threshold? => It is a LOF double percDeleted = codingBasesDeleted / ((double) codingBases); return (percDeleted > deleteProteinCodingBases); }
/**
 * Is the chromosome missing in this marker?
 *
 * @param marker
 * @return true if the marker has no chromosome, or the chromosome is unknown to the genome,
 *     empty, or absent from the interval forest
 */
boolean isChromosomeMissing(Marker marker) {
  // Missing chromosome in marker?
  if (marker.getChromosome() == null) return true;

  // Missing chromosome in genome?
  String chrName = marker.getChromosomeName();
  Chromosome chr = genome.getChromosome(chrName);
  if (chr == null) return true;

  // Chromosome length is less than 1 (i.e. empty)?
  if (chr.size() < 1) return true;

  // Tree not found in interval forest?
  if (!intervalForest.hasTree(chrName)) return true;

  // OK, we have the chromosome
  return false;
}
/** Save nextprot markers */ void save() { String nextProtBinFile = config.getDirDataVersion() + "/nextProt.bin"; if (verbose) Timer.showStdErr("Saving database to file '" + nextProtBinFile + "'"); // Add chromosomes HashSet<Chromosome> chromos = new HashSet<Chromosome>(); for (Marker m : markers) chromos.add(m.getChromosome()); // Create a set of all markers to be saved Markers markersToSave = new Markers(); markersToSave.add(genome); for (Chromosome chr : chromos) markersToSave.add(chr); for (Marker m : markers) markersToSave.add(m); // Save MarkerSerializer markerSerializer = new MarkerSerializer(); markerSerializer.save(nextProtBinFile, markersToSave); }
/** Set values for codons around change. */
public void setCodonsAround(String codonsLeft, String codonsRight) {
  // Flanking codons in lower case, changed codons in upper case
  codonsAroundOld = codonsLeft.toLowerCase() + codonsRef.toUpperCase() + codonsRight.toLowerCase();
  codonsAroundNew = codonsLeft.toLowerCase() + codonsAlt.toUpperCase() + codonsRight.toLowerCase();

  // Amino acids surrounding the ones changed
  CodonTable codonTable = marker.codonTable();
  String aasLeft = codonTable.aa(codonsLeft);
  String aasRight = codonTable.aa(codonsRight);
  aasAroundOld = aasLeft.toLowerCase() + aaRef.toUpperCase() + aasRight.toLowerCase();
  aasAroundNew = aasLeft.toLowerCase() + aaAlt.toUpperCase() + aasRight.toLowerCase();
}
/** Return functional class of this effect (i.e. NONSENSE, MISSENSE, SILENT or NONE) */ public FunctionalClass getFunctionalClass() { if (variant.isSnp()) { if (!aaAlt.equals(aaRef)) { CodonTable codonTable = marker.codonTable(); if (codonTable.isStop(codonsAlt)) return FunctionalClass.NONSENSE; return FunctionalClass.MISSENSE; } if (!codonsAlt.equals(codonsRef)) return FunctionalClass.SILENT; } return FunctionalClass.NONE; }
/**
 * Find closest gene to this marker
 *
 * <p>In case more than one 'closest' gene is found (e.g. two or more genes at the same
 * distance), the following rules apply:
 *
 * <p>i) If many genes have the same 'closest distance', coding genes are preferred.
 *
 * <p>ii) If more than one coding gene has the same 'closest distance', the first one found is
 * returned.
 *
 * @param inputInterval
 * @return the closest gene, or null if none is found
 */
public Gene queryClosestGene(Marker inputInterval) {
  int initialExtension = 1000;

  String chrName = inputInterval.getChromosomeName();
  Chromosome chr = genome.getChromosome(chrName);
  if (chr == null) return null;

  if (chr.size() > 0) {
    // Extend interval to capture 'close' genes
    for (int extend = initialExtension; extend < chr.size(); extend *= 2) {
      int start = Math.max(inputInterval.getStart() - extend, 0);
      int end = inputInterval.getEnd() + extend;
      Marker extended = new Marker(chr, start, end, 1, "");

      // Find all genes that intersect with the interval
      Markers markers = query(extended);
      Markers genes = new Markers();
      int minDist = Integer.MAX_VALUE;
      for (Marker m : markers) {
        if (m instanceof Gene) {
          int dist = m.distance(inputInterval);
          // Use '<=' so that genes tied at the minimum distance are kept as candidates;
          // with '<' ties were dropped and the coding-gene preference below never applied
          if (dist <= minDist) {
            genes.add(m);
            minDist = dist;
          }
        }
      }

      // Found something?
      if (genes.size() > 0) {
        // Find a gene having distance 'minDist'. Prefer coding genes
        Gene minDistGene = null;

        for (Marker m : genes) {
          int dist = m.distance(inputInterval);
          if (dist == minDist) {
            Gene gene = (Gene) m;
            if (minDistGene == null) minDistGene = gene;
            else if (!minDistGene.isProteinCoding() && gene.isProteinCoding()) minDistGene = gene;
          }
        }

        return minDistGene;
      }
    }
  }

  // Nothing found
  return null;
}
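// The tie-breaking rules documented above (smallest distance wins; among equal distances a
// protein-coding gene is preferred; otherwise the first one found is kept) can be shown in
// isolation. This is a minimal sketch under stated assumptions: 'Candidate' and
// 'pickClosest' are hypothetical names that stand in for Gene and the selection loop, and
// distances are assumed to be precomputed.

```java
import java.util.List;

final class ClosestGeneSketch {

  /** Hypothetical stand-in for a gene with a precomputed distance to the query interval. */
  static final class Candidate {
    final String name;
    final int distance;
    final boolean proteinCoding;

    Candidate(String name, int distance, boolean proteinCoding) {
      this.name = name;
      this.distance = distance;
      this.proteinCoding = proteinCoding;
    }
  }

  /** Pick the closest candidate, preferring protein-coding ones on ties; null if empty. */
  static Candidate pickClosest(List<Candidate> candidates) {
    Candidate best = null;
    for (Candidate c : candidates) {
      boolean closer = (best == null) || (c.distance < best.distance);
      boolean betterTie =
          (best != null)
              && (c.distance == best.distance)
              && c.proteinCoding
              && !best.proteinCoding;
      if (closer || betterTie) best = c;
    }
    return best;
  }
}
```

// A single pass is enough here because every candidate is compared against the running
// best, so ties are never discarded before the coding preference is applied.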
/** Set codon change. Calculate effect type based on codon changes (for SNPs & MNPs) */ public void setCodons(String codonsOld, String codonsNew, int codonNum, int codonIndex) { codonsRef = codonsOld; codonsAlt = codonsNew; this.codonNum = codonNum; this.codonIndex = codonIndex; CodonTable codonTable = marker.codonTable(); // Calculate amino acids if (codonsOld.isEmpty()) aaRef = ""; else { aaRef = codonTable.aa(codonsOld); codonDegeneracy = codonTable.degenerate(codonsOld, codonIndex); // Calculate codon degeneracy } if (codonsNew.isEmpty()) aaAlt = ""; else aaAlt = codonTable.aa(codonsNew); }
/** Show a string with overall effect */ public String effect( boolean shortFormat, boolean showAaChange, boolean showBioType, boolean useSeqOntology) { String e = ""; String codonEffect = codonEffect(showAaChange, showBioType, useSeqOntology); // Codon effect // Create effect string if (!codonEffect.isEmpty()) e = codonEffect; else if (isRegulation()) return getEffectTypeString(useSeqOntology) + "[" + ((Regulation) marker).getName() + "]"; else if (isNextProt()) return getEffectTypeString(useSeqOntology) + "[" + VcfEffect.vcfEffSafe(((NextProt) marker).getId()) + "]"; // Make sure this 'id' is not dangerous in a VCF 'EFF' field else if (isMotif()) return getEffectTypeString(useSeqOntology) + "[" + ((Motif) marker).getPwmId() + ":" + ((Motif) marker).getPwmName() + "]"; else if (isCustom()) { // Custom interval String label = ((Custom) marker).getLabel(); double score = ((Custom) marker).getScore(); if (!Double.isNaN(score)) label = label + ":" + score; if (!label.isEmpty()) label = "[" + label + "]"; return getEffectTypeString(useSeqOntology) + label; } else if (isIntergenic() || isIntron() || isSpliceSite()) e = getEffectTypeString(useSeqOntology); else if (!message.isEmpty()) e = getEffectTypeString(useSeqOntology) + ": " + message; else if (marker == null) e = getEffectTypeString( useSeqOntology); // There are cases when no marker is associated (e.g. "Out of // chromosome", "No such chromosome", etc.) else e = getEffectTypeString(useSeqOntology) + ": " + marker.getId(); if (shortFormat) e = e.split(":")[0]; return e; }
/**
 * Name of the regions hit by a marker
 *
 * @param marker
 * @param showGeneDetails
 * @param compareTemplate
 * @param id : Only use genes or transcripts matching this ID
 * @return set of region names hit by the marker
 */
public Set<String> regions(
    Marker marker, boolean showGeneDetails, boolean compareTemplate, String id) {
  if (Config.get().isErrorOnMissingChromo() && isChromosomeMissing(marker))
    throw new RuntimeEOFException("Chromosome missing for marker: " + marker);

  boolean hitChromo = false;
  HashSet<String> hits = new HashSet<String>();

  Markers intersects = query(marker);
  if (intersects.size() > 0) {
    for (Marker markerInt : intersects) {

      if (markerInt instanceof Chromosome) {
        hitChromo = true; // OK (we have to hit a chromosome, otherwise it's an error)
        hits.add(markerInt.getClass().getSimpleName()); // Add marker name to the list
      } else if (markerInt instanceof Gene) {
        // Analyze Genes
        Gene gene = (Gene) markerInt;
        regionsAddHit(hits, gene, marker, showGeneDetails, compareTemplate);

        // For all transcripts...
        for (Transcript tr : gene) {
          // Matches ID? (...or no ID to match)
          if ((id == null) || gene.getId().equals(id) || tr.getId().equals(id)) {

            // Does it intersect this transcript?
            if (tr.intersects(marker)) {
              regionsAddHit(hits, tr, marker, showGeneDetails, compareTemplate);

              // Does it intersect a UTR?
              for (Utr utr : tr.getUtrs())
                if (utr.intersects(marker))
                  regionsAddHit(hits, utr, marker, showGeneDetails, compareTemplate);

              // Does it intersect an exon?
              for (Exon ex : tr)
                if (ex.intersects(marker))
                  regionsAddHit(hits, ex, marker, showGeneDetails, compareTemplate);

              // Does it intersect an intron?
              for (Intron intron : tr.introns())
                if (intron.intersects(marker))
                  regionsAddHit(hits, intron, marker, showGeneDetails, compareTemplate);
            }
          }
        }
      } else {
        // No ID to match?
        if (id == null) regionsAddHit(hits, markerInt, marker, showGeneDetails, compareTemplate);
        else {
          // Is ID from transcript?
          Transcript tr = (Transcript) markerInt.findParent(Transcript.class);
          if ((tr != null) && (tr.getId().equals(id))) {
            // Transcript ID matches => count
            regionsAddHit(hits, markerInt, marker, showGeneDetails, compareTemplate);
          } else {
            // Is ID from gene?
            Gene gene = (Gene) markerInt.findParent(Gene.class);
            if ((gene != null) && (gene.getId().equals(id)))
              // Gene ID matches => count
              regionsAddHit(hits, markerInt, marker, showGeneDetails, compareTemplate);
          }
        }
      }
    }
  }

  if (!hitChromo) throw new RuntimeException("ERROR: Out of chromosome range. " + marker);
  return hits;
}
/** Show annotations counters in a table */ void analyzeSequenceConservation() { if (verbose) Timer.showStdErr( "Sequence conservation analysis." // + "\n\tAA sequence length : " + 1 // + "\n\tMin AA count : " + HIGHLY_CONSERVED_AA_COUNT // + "\n\tMin AA conservation : " + HIGHLY_CONSERVED_AA_PERCENT // ); ArrayList<String> keys = new ArrayList<String>(); keys.addAll(countAaSequenceByType.keySet()); Collections.sort(keys); // Show title StringBuilder title = new StringBuilder(); for (char aa : GprSeq.AMINO_ACIDS) title.append(aa + "\t"); title.append("\t" + title); if (verbose) System.out.println( "Amino acid regions:\n\tTotal\tMax count\tAvg len\tConservation\tCatergory\tControlled Vocabulary\t" + title + "\tOther AA sequences:"); // Show AA counts for each 'key' for (String key : keys) { long seqLen = 0, totalSeqs = 0, maxCount = 0; CountByType cbt = countAaSequenceByType.get(key); long total = cbt.sum(); boolean highlyConservedAaSequence = false; StringBuilder sb = new StringBuilder(); // For each single amino acid "sequence" for (char aa : GprSeq.AMINO_ACIDS) { long count = cbt.get("" + aa); if (count > 0) { seqLen += 1 * count; totalSeqs += count; maxCount = Math.max(maxCount, count); sb.append(count); double perc = ((double) count) / total; // We estimate that if most AA are the same, then changing this AA can cause a high impact // in protein coding if ((perc > HIGHLY_CONSERVED_AA_PERCENT) && (total >= HIGHLY_CONSERVED_AA_COUNT)) highlyConservedAaSequence = true; } sb.append("\t"); } // Sequences of more than one AA for (String aas : cbt.keySet()) { long count = cbt.get(aas); double perc = ((double) count) / total; if (aas.length() > 1) { seqLen += aas.length() * count; totalSeqs += count; maxCount = Math.max(maxCount, count); sb.append(String.format("\t" + aas + ":" + count)); if ((perc > HIGHLY_CONSERVED_AA_PERCENT) && (total >= HIGHLY_CONSERVED_AA_COUNT)) highlyConservedAaSequence = true; } } long avgLen = seqLen / totalSeqs; // Show line if (verbose) 
System.out.println( // "\t" + total // + "\t" + maxCount // + "\t" + avgLen // + "\t" + (highlyConservedAaSequence ? "High" : "") // + "\t" + key // + "\t" + sb // ); // Mark highly conserved if (highlyConservedAaSequence) { int count = 0; for (Marker m : markers) { NextProt nextProt = (NextProt) m; if (m.getId().equals(key)) { nextProt.setHighlyConservedAaSequence(true); count++; } } if (verbose) Timer.showStdErr( "NextProt " + count + " markers type '" + key + "' marked as highly conserved AA sequence"); } } }
/** Show this effect as a tab-separated line */
public String toString(boolean useSeqOntology, boolean useHgvs) {
  // Get data to show
  String geneId = "", geneName = "", bioType = "", transcriptId = "", exonId = "", customId = "";
  int exonRank = -1;

  if (marker != null) {
    Gene gene = getGene();
    Transcript tr = getTranscript();

    // Gene ID, name and biotype
    if (gene != null) {
      geneId = gene.getId();
      geneName = gene.getGeneName();
      bioType = getBiotype();
    }

    // Transcript ID
    if (tr != null) transcriptId = tr.getId();

    // Exon rank information
    Exon exon = getExon();
    if (exon != null) {
      exonId = exon.getId();
      exonRank = exon.getRank();
    }

    // Regulation
    if (isRegulation()) bioType = ((Regulation) marker).getCellType();
  }

  // Add variant's ID
  if (!variant.getId().isEmpty()) customId += variant.getId();

  // Add custom markers
  if ((marker != null) && (marker instanceof Custom))
    customId += (customId.isEmpty() ? "" : ";") + marker.getId();

  // CDS length
  int cdsSize = getCdsLength();

  String errWarn = error + (error.isEmpty() ? "" : "|") + warning;

  // Amino acid change: HGVS notation or "ref/alt"
  String aaChange = "";
  if (useHgvs) aaChange = getHgvs();
  else aaChange = ((aaRef.length() + aaAlt.length()) > 0 ? aaRef + "/" + aaAlt : "");

  return errWarn //
      + "\t" + geneId //
      + "\t" + geneName //
      + "\t" + bioType //
      + "\t" + transcriptId //
      + "\t" + exonId //
      + "\t" + (exonRank >= 0 ? exonRank : "") //
      + "\t" + effect(false, false, false, useSeqOntology) //
      + "\t" + aaChange //
      + "\t" + ((codonsRef.length() + codonsAlt.length()) > 0 ? codonsRef + "/" + codonsAlt : "") //
      + "\t" + (codonNum >= 0 ? (codonNum + 1) : "") //
      + "\t" + (codonDegeneracy >= 0 ? codonDegeneracy + "" : "") //
      + "\t" + (cdsSize >= 0 ? cdsSize : "") //
      + "\t" + (codonsAroundOld.length() > 0 ? codonsAroundOld + " / " + codonsAroundNew : "") //
      + "\t" + (aasAroundOld.length() > 0 ? aasAroundOld + " / " + aasAroundNew : "") //
      + "\t" + customId //
  ;
}