GO::TermFinder man page on Pidora

GO::TermFinder(3)     User Contributed Perl Documentation    GO::TermFinder(3)

NAME
       GO::TermFinder - identify GO nodes that annotate a group of genes with
       a significant p-value

DESCRIPTION
       This package is intended to provide a method whereby the P-values of a
       set of GO annotations can be determined for a set of genes, based on
       the number of genes that exist in the particular genome (or in a
       selected background distribution from the genome), and their
       annotation, and the frequency with which the GO nodes are annotated
       across the provided set of genes.  The P-value is simply calculated
       using the hypergeometric distribution as the probability of x or more
       out of n genes having a given annotation, given that G of N have that
       annotation in the genome in general.  We chose the hypergeometric
       distribution (sampling without replacement) since it is more accurate,
       though slower to calculate, than the binomial distribution (sampling
       with replacement).
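
       As an illustration only, the upper tail of the hypergeometric
       distribution described above can be sketched in plain Perl.  The
       function names below are hypothetical and are not part of the
       GO::TermFinder API, and the module's own implementation may well
       differ (for instance in how it caches factorials):

```perl
use strict;
use warnings;

# log(n!) by summing logs; adequate for a sketch, slow for very large n.
sub log_factorial {
    my ($n) = @_;
    my $sum = 0;
    $sum += log($_) for 2 .. $n;
    return $sum;
}

# log of "n choose k"
sub log_choose {
    my ($n, $k) = @_;
    return log_factorial($n) - log_factorial($k) - log_factorial($n - $k);
}

# Probability of x or more out of n genes having an annotation, given that
# G of the N genes in the background have it (sampling without replacement).
sub hypergeometric_pvalue {
    my ($x, $n, $G, $N) = @_;
    my $max = $n < $G ? $n : $G;
    my $p   = 0;
    for my $i ($x .. $max) {
        next if $n - $i > $N - $G;    # not enough unannotated genes to draw
        $p += exp(  log_choose($G, $i)
                  + log_choose($N - $G, $n - $i)
                  - log_choose($N, $n));
    }
    return $p > 1 ? 1 : $p;           # guard against rounding above 1
}
```

       For example, hypergeometric_pvalue(2, 2, 5, 10) is the chance that
       both of 2 genes drawn from 10 carry an annotation held by 5 of them.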

       In addition, a corrected p-value can be calculated, to correct for
       multiple hypothesis testing.  The correction factor used is the total
       number of nodes to which the provided list of genes are annotated,
       excepting any nodes which have only a single annotation in the
       background, as a priori, we know that these cannot be significantly
       enriched.  The client has access to both the corrected and uncorrected
       values.  It is also possible to correct the p-value using 1000
       simulations, which control the Family Wise Error Rate.  Using this
       option suggests that the Bonferroni correction is in fact somewhat
       liberal, rather than conservative, as might be expected.  Finally, the
       False Discovery Rate can also be calculated.

       The general idea is that a list of genes may have been identified for
       some reason, e.g. they are co-regulated, and TermFinder can be used to
       find out if any nodes annotate the set of genes to a level which is
       extremely improbable if the genes had simply been picked at random.

TODO
       1.  May want the client to decide the behavior for ambiguous names,
	   rather than having it hard coded (e.g. always ignore; use if
	   standard name (current implementation); use all databaseIds for
	   the ambiguous name; decide on a case by case basis (potentially
	   useful if running on command line)).

       2.  Create new GO::Hypothesis and GO::HypothesisSet objects, so that
	   it is easier to access the information generated about the p-value
	   etc. of any particular GO node that annotates a set of genes.

       3.  Instead of all the global variables, $k..., replace them with
	   constants, which may improve runtime, as the optimizer should
	   optimize the hash look ups to look like hard-coded strings at
	   runtime, rather than variable lookups.

       4.  Lots of other stuff....

Instance Constructor
   new
       This is the constructor.  It expects to be passed named arguments for
       an annotationProvider and an ontologyProvider.  In addition, it must
       be told the aspect of the ontology provider, so that it knows how to
       query the annotationProvider.

       There are also some additional, optional arguments:

       population:

       This argument allows a client to indicate the population that should
       be used to calculate a background distribution of GO terms.  In the
       absence of a population argument, the background distribution will
       be drawn from all genes in the annotationProvider.  This should be
       provided as an array reference, and no ambiguous names should be
       provided (see AnnotationProvider for details of name ambiguity).  This
       option is particularly pertinent in a case where, for example, you
       assayed only 2000 genes in a two hybrid experiment, and found 20
       interesting ones.  To find significant terms, you need to do it in the
       context of the genes that you assayed, not in the context of all genes
       with annotation.

       Note, new in version 0.71, if you provide a population as the
       background distribution from which genes have been drawn, any genes
       provided to the findTerms method that are not in the background
       distribution will be discarded from the calculations.  The identity of
       these genes can be retrieved using the discardedGenes() method, after
       the findTerms() method has been called.

       totalNumGenes:

       This argument allows a client to indicate that the size of the
       background distribution is in fact larger than the number of genes that
       exist in the annotation provider, and the extra genes are merely
       assumed to be entirely unannotated.

       NB: This is an API change, as totalNumGenes was previously required.

       Thus - if using 'population', the total number of genes considered as
       the background will be the number of genes in the provided population.
       If not using 'population', then the number of genes that will be
       considered as the total population will be the number of genes in the
       annotationProvider.  However, if the totalNumGenes argument is
       provided, then that number will be used as the size of the population.
       If it is not larger than the total number of genes in the
       annotationProvider, then the number of genes in the annotationProvider
       will be used.  The totalNumGenes and the population arguments are
       mutually exclusive, and both should not be provided at the same time.
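
       The rules above for choosing the background size can be summarised
       in a short sketch.  The helper name and the numAnnotatedGenes
       argument are hypothetical, and dying when both arguments are
       supplied is an assumption; the text only says that they should not
       be provided together:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the background size implied by the rules
# described above.  Not part of the GO::TermFinder API.
sub background_size {
    my (%args) = @_;

    # number of genes known to the annotation provider
    my $numAnnotated = $args{numAnnotatedGenes};

    # assumption: treat supplying both arguments as a fatal error
    die "population and totalNumGenes are mutually exclusive\n"
        if defined $args{population} && defined $args{totalNumGenes};

    # a supplied population defines the background by itself
    return scalar @{ $args{population} } if defined $args{population};

    # totalNumGenes only counts if it exceeds the annotated gene count
    if (defined $args{totalNumGenes} && $args{totalNumGenes} > $numAnnotated) {
        return $args{totalNumGenes};
    }

    return $numAnnotated;
}
```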

       Usage ($num is larger than the number of genes with annotations):

	  my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
					       ontologyProvider	 => $ontologyProvider,
					       totalNumGenes	 => $num,
					       aspect		 => <P|C|F>);

       Usage (use all annotated genes as population):

	  my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
					       ontologyProvider	 => $ontologyProvider,
					       aspect		 => <P|C|F>);

       Usage (use a subset of genes as the background population):

	  my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
					       ontologyProvider	 => $ontologyProvider,
					       population	 => \@genes,
					       aspect		 => <P|C|F>);

Instance Methods
   findTerms
       This method returns an array of hash references, one for each GO::Node
       that was tested as a hypothesis, that indicates which terms annotate
       the list of genes with what P-values.  The contents of the hashes in
       the returned array depend on some of the run time options.  They are:

	   key			 value
	   -------------------------------------------------------------------------

       Always Present:

	   NODE			 A GO::Node

	   PVALUE		 The P-value for having the observed number of
				 annotations that the provided list of genes
				 has to that node.

	   NUM_ANNOTATIONS	 The number of genes within the provided list that
				 are annotated to the node.

	   TOTAL_NUM_ANNOTATIONS The number of genes in the population (total
				 or provided) that are annotated to the node.

	   ANNOTATED_GENES	 A hash reference, whose keys are the
				 databaseIds that are annotated to the node,
				 and whose values are the original name
				 supplied to the findTerms() method.

       Present if corrected p-values are calculated:

	   CORRECTED_PVALUE	 The CORRECTED_PVALUE is the PVALUE, but corrected
				 for multiple hypothesis testing, due to the
				 fact that you are more likely to generate
				 significant looking p-values if you test a
				 lot of hypotheses.  See below for details of
				 how this pvalue is calculated, and the
				 options associated with it.

       Present if p-values were corrected by simulation:

	   NUM_OBSERVATIONS	 The number of simulations in which a p-value as
				 good as this one, or better, was observed.

       Present if the False Discovery Rate is calculated:

	   FDR_RATE		 The False Discovery Rate - this is a fraction
				 of how many of the nodes with p-values as good or
				 better than the node with this FDR would be expected
				 to be false positives.

	   FDR_OBSERVATIONS	 The average number of nodes during simulations
				 that had an uncorrected p-value as good or better
				 than the p-value of this node.

	   EXPECTED_FALSE_POSITIVES The expected number of false positives if this node
				    is chosen as the cut-off.

       The entries in the returned array are sorted by increasing p-value
       (i.e. least likely is first).  If there is a tie in the p-value, then
       the sort order is determined by GOID, using a cmp comparison.
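
       A comparable ordering can be reproduced on plain hashes, as in the
       sketch below.  The GOID key and the identifiers are made-up
       stand-ins for $hypothesis->{NODE}->goid, so that the example runs
       without any GO::Node objects:

```perl
use strict;
use warnings;

# Placeholder hypotheses; GOID stands in for $hypothesis->{NODE}->goid.
my @hypotheses = (
    { GOID => 'GO:0000003', PVALUE => 0.01   },
    { GOID => 'GO:0000001', PVALUE => 0.01   },
    { GOID => 'GO:0000002', PVALUE => 0.0001 },
);

# Increasing p-value first; ties broken by a string (cmp) comparison of GOID.
my @sorted = sort {
    $a->{PVALUE} <=> $b->{PVALUE}
        ||
    $a->{GOID} cmp $b->{GOID}
} @hypotheses;
```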

       findTerms() expects to be passed, by reference, a list of gene names
       for which terms will be found.  If a passed in name is ambiguous (see
       AnnotationProvider), then the following will occur:

	   1) If the name can be used as a standard name, it will assume that
	      it is that.

	   2) Otherwise it will not use it.

       Currently a warning will be printed to STDOUT in the case of an
       ambiguous name being used.

       The passed in gene names are converted into a list of databaseIds.  If
       a gene does not map to a databaseId, then an undef is put in the list -
       however, if the same gene name, which does not map to a databaseId, is
       used twice then it will produce only one undef in the list.  If more
       than one gene name maps to the same databaseId (either because you used
       the same name twice, or you used an alias as well), then that
       databaseId is only put into the list once, and a warning is printed.
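
       The collapsing behaviour described above might look roughly like
       the following sketch.  The lookup hash stands in for an
       AnnotationProvider, and the function name, gene names and ids are
       all hypothetical:

```perl
use strict;
use warnings;

# Convert gene names to databaseIds: one undef per distinct unmapped name,
# and each databaseId at most once (with a warning on repeats).
sub names_to_database_ids {
    my ($lookup, @names) = @_;
    my (%seenId, %seenUnmapped, @ids);

    for my $name (@names) {
        my $id = $lookup->{$name};

        if (!defined $id) {
            # the same unmapped name yields only a single undef
            push @ids, undef unless $seenUnmapped{$name}++;
        }
        elsif ($seenId{$id}++) {
            warn "$name maps to $id, which is already in the list\n";
        }
        else {
            push @ids, $id;
        }
    }

    return @ids;
}

# Hypothetical mapping: aliasA is an alias of geneA.
my %lookup = (geneA => 'ID1', aliasA => 'ID1', geneB => 'ID2');
```

       With this mapping, the names geneA, aliasA, geneB, mystery, mystery
       and other collapse to four entries: ID1, ID2 and two undefs.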

       If a gene name does not have any information returned from the
       AnnotationProvider, then it is assumed that the gene is entirely
       unannotated.  For these purposes, TermFinder annotates such genes to
       the root node (Gene_Ontology), its immediate child (which indicates the
       aspect of the ontology, such as biological_process), and a dummy GO
       node, corresponding to unannotated.  This node will have a goid of
       'GO:XXXXXXX', and a term name of 'unannotated'.  No other information
       will be set up for this GO::Node, so you should not count on being able
       to retrieve it.  What it does mean is that you can determine if the
       predominant feature of a set of genes is that they have no annotation.

       If more genes are provided than have been indicated to exist in the
       genome (as provided during object construction), then an error message
       will be printed out, and an empty list will be returned.

       In addition, it is possible that, for a small list of genes, no
       hypotheses will be tested - in this case, those genes will only have
       annotated nodes with a count of 1, other than the Gene_Ontology node
       itself, and the node corresponding to the aspect of the ontology.
       Neither of these is considered for p-value testing, as a priori they
       must have a p-value of 1.

       MULTIPLE HYPOTHESIS CORRECTION

       An optional argument, 'correction', may be used, which indicates what
       method of multiple hypothesis correction should be used.  Multiple
       hypothesis correction attempts to keep the overall chance of getting
       any false positives at the same level (e.g. 0.05).  Acceptable values
       are:

       bonferroni, none, simulation

	: 'bonferroni' will correct the p-values by using as the correction
	   factor the total number of nodes to which the provided list of
	   genes are annotated, either directly or indirectly, excepting any
	   nodes that are annotated only once in the background distribution,
	   as, a priori, these cannot be overrepresented.

	: 'none' will perform no multiple hypothesis correction

	: 'simulation' will run 1000 simulations with random lists of genes
	  (the same size as the originally provided gene list), and determine
	  a corrected value by how many simulations produced a p-value as
	  good as or better than the p-value associated with one of the real
	  hypotheses.  E.g. if a node from the real data has a p-value of
	  0.05, but a p-value that good or better is generated in 500 out of
	  1000 trials, the corrected pvalue will be 0.5.  In the case that a
	  p-value generated from a real list of genes is never seen in the
	  simulations, it will be given a corrected p-value of < 0.001, and
	  the NUM_OBSERVATIONS attribute of the hypothesis will be 0.  Using
	  this option takes 1000 times as long!

       The default for this argument, if not provided, is bonferroni.
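
       The arithmetic behind the two corrections can be sketched as
       follows.  Both function names are hypothetical, capping the
       Bonferroni result at 1 is an assumption, and the simulated p-values
       are taken as given rather than generated from random gene lists:

```perl
use strict;
use warnings;

# Bonferroni: multiply by the number of hypotheses tested.
# Assumption: the result is capped at 1.
sub bonferroni {
    my ($pvalue, $numHypotheses) = @_;
    my $corrected = $pvalue * $numHypotheses;
    return $corrected > 1 ? 1 : $corrected;
}

# Simulation: $bestSimulatedPvalues holds the best (smallest) p-value seen
# in each simulation.  Returns the corrected p-value and the number of
# simulations in which a p-value as good or better was observed.
sub simulation_corrected {
    my ($pvalue, $bestSimulatedPvalues) = @_;
    my $numObservations = scalar(grep { $_ <= $pvalue } @$bestSimulatedPvalues);
    my $numSimulations  = scalar @$bestSimulatedPvalues;
    return ($numObservations / $numSimulations, $numObservations);
}
```

       When the count is 0, this sketch returns a corrected value of 0; the
       text above describes such a case as being reported as < 0.001.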

       FALSE DISCOVERY RATE

       As a way of preempting the potential problems of using p-values
       corrected for multiple hypothesis testing, the False Discovery Rate can
       instead be calculated, and you can instead set your cutoff based on an
       acceptable false discovery rate, such as 0.01 (1%), or 0.05 (5%) etc.
       Thus, the optional argument 'calculateFDR' can be used.  A non-zero
       value means that the False Discovery Rate will be calculated for each
       node, such that you can determine, if you chose your p-value cut-off at
       that node, what the FDR would be.  The FDR is calculated by running 50
       simulations, and counting the average number of times a p-value as good
       or better than a p-value generated from the real data is seen.  This is
       used as the numerator.  The denominator is the number of p-values in
       the real data that are as good or better than it.
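
       The ratio just described can be written out as a small sketch.  The
       function name is hypothetical, the simulated p-values are supplied
       by the caller rather than generated, and no cap is applied to the
       result, which may differ from what the module does:

```perl
use strict;
use warnings;

# $realPvalues:      array ref of uncorrected p-values from the real data
# $simulatedPvalues: array ref of array refs, one list per simulation
# Returns the FDR at the given p-value cut-off and the average number of
# simulated p-values per simulation that were as good or better.
sub fdr_for {
    my ($pvalue, $realPvalues, $simulatedPvalues) = @_;

    my $total = 0;
    for my $simulation (@$simulatedPvalues) {
        $total += scalar(grep { $_ <= $pvalue } @$simulation);
    }

    # numerator: average count across simulations
    my $fdrObservations = $total / scalar @$simulatedPvalues;

    # denominator: real p-values as good or better than the cut-off
    my $numReal = scalar(grep { $_ <= $pvalue } @$realPvalues);

    my $fdr = $numReal ? $fdrObservations / $numReal : 0;
    return ($fdr, $fdrObservations);
}
```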

       Usage example - in this example, the default (Bonferroni) correction is
       used to calculate a corrected p-value, and in addition, the False
       Discovery Rate is also calculated:

	   my @pvalueStructures = $termFinder->findTerms(genes	      => \@genes,
							 calculateFDR => 1);

	   my $hypothesis = 1;

	   foreach my $pvalue (@pvalueStructures){

	       print "-- $hypothesis of ", scalar @pvalueStructures, "--\n",
		   "GOID\t", $pvalue->{NODE}->goid, "\n",
		   "TERM\t", $pvalue->{NODE}->term, "\n",
		   "P-VALUE\t", $pvalue->{PVALUE}, "\n",
		   "CORRECTED P-VALUE\t", $pvalue->{CORRECTED_PVALUE}, "\n",
		   "FALSE DISCOVERY RATE\t", $pvalue->{FDR_RATE}, "\n",
		   "NUM_ANNOTATIONS\t", $pvalue->{NUM_ANNOTATIONS}, " (of ", $pvalue->{TOTAL_NUM_ANNOTATIONS}, ")\n",
		   "ANNOTATED_GENES\t", join(", ", values (%{$pvalue->{ANNOTATED_GENES}})), "\n\n";

	       $hypothesis++;

	   }

       If a background population had been provided when the object was
       constructed, you should check to see if any of your genes for which you
       are finding terms were discarded, due to not being found in the
       background population, e.g.:

	   my @pvalueStructures = $termFinder->findTerms(genes	      => \@genes,
							 calculateFDR => 1);

	   my @discardedGenes = $termFinder->discardedGenes;

	   if (@discardedGenes){

	       print "The following genes were not considered in the pvalue ",
		     "calculations, as they were not found in the provided ",
		     "background population.\n\n",
		     join("\n", @discardedGenes), "\n\n";

	   }

   discardedGenes
       This method returns an array of genes which were discarded from the
       pvalue calculations, because they could not be found in the background
       population.  It should only be called after findTerms.  It will either
       return an empty list, if no genes were discarded, or an array of genes
       that were discarded.

       Usage:

	   my @pvalueStructures = $termFinder->findTerms(genes	      => \@genes,
							 calculateFDR => 1);

	   my @discardedGenes = $termFinder->discardedGenes;

	   if (@discardedGenes){

	       print "The following genes were not considered in the pvalue ",
		     "calculations, as they were not found in the provided ",
		     "background population.\n\n",
		     join("\n", @discardedGenes), "\n\n";

	   }

   genesDatabaseIds
       This method returns an array of databaseIds corresponding to the genes
       that were used for the findTerms() method.  Thus it allows a client to
       find out how many actual entities the list of genes they passed in
       mapped to, e.g. they may have passed in the same thing with two
       different names.  By calling this method immediately after the
       findTerms() method, they can determine how many genes their list
       collapsed to.

   totalNumGenes
       This returns the total number of genes that are in the background set
       of genes from which the genes of interest were drawn.  Unannotated
       genes are included in this count.

   aspect
       Returns the aspect with which the GO::TermFinder object was
       constructed.

       Usage:

	   my $aspect = $termFinder->aspect;

Authors
	   Gavin Sherlock; sherlock@genome.stanford.edu
	   Elizabeth Boyle; ell@mit.edu
	   Ihab Awad; ihab@genome.stanford.edu

perl v5.14.1			  2008-05-14		     GO::TermFinder(3)