From the GTF definition:
The following feature types are required: ‘CDS’, ‘start_codon’, ‘stop_codon’. The features ‘5UTR’, ‘3UTR’, ‘inter’, ‘inter_CNS’, ‘intron_CNS’ and ‘exon’ are optional. All other features will be ignored.
So genes and transcripts are not explicitly defined. The extent of a transcript is, after all, implied by all exons that share a single transcript_id, and the extent of a gene is implied by all exons that share a single gene_id. However, this can be tedious to calculate by hand.
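To make the "implied extent" idea concrete, here is a minimal sketch in plain Python. It is not how gffutils implements the inference; the exon records and the `infer_extents` helper are hypothetical names invented for illustration. It only shows that an extent is the minimum start and maximum end over all exons that share an ID.

```python
# Hypothetical exon records; the coordinates and IDs are made up for illustration.
exons = [
    {"start": 100, "end": 200, "gene_id": "geneA", "transcript_id": "txA.1"},
    {"start": 300, "end": 400, "gene_id": "geneA", "transcript_id": "txA.1"},
    {"start": 150, "end": 250, "gene_id": "geneA", "transcript_id": "txA.2"},
    {"start": 500, "end": 650, "gene_id": "geneA", "transcript_id": "txA.2"},
]


def infer_extents(records, id_field):
    """Return {id: (start, end)} spanning every record that shares that id."""
    extents = {}
    for rec in records:
        key = rec[id_field]
        start, end = rec["start"], rec["end"]
        if key in extents:
            old_start, old_end = extents[key]
            # Widen the span to cover the new exon.
            extents[key] = (min(old_start, start), max(old_end, end))
        else:
            extents[key] = (start, end)
    return extents


transcript_extents = infer_extents(exons, "transcript_id")
gene_extents = infer_extents(exons, "gene_id")
print(transcript_extents)
print(gene_extents)
```

Grouping by transcript_id gives one span per transcript, and grouping the same exons by gene_id gives the wider gene span.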
gffutils infers the gene and transcript extents when the file is imported into a database, and adds new “derived” features for each gene and transcript. That way, a gene can be easily accessed by its ID, just like for GFF files.
However, not all files meet the official GTF specification, in which each feature has a transcript_id attribute to indicate its parent feature and a gene_id attribute to indicate its “grandparent” feature. To accommodate this, gffutils provides some extra options for the gffutils.create_db() function:
transcript_key and gene_key
These kwargs name the attributes from which the parent (transcript) and grandparent (gene) of each feature are extracted.
By default, transcript_key="transcript_id" and gene_key="gene_id". If your data file uses different attribute names, change these kwargs to match.
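The role of the two keys can be sketched in plain Python. This is only an illustration under stated assumptions, not gffutils code: `parse_attributes` and `parent_and_grandparent` are hypothetical helpers, and the "nonstandard" attribute names `Transcript` and `Gene` are invented for the example.

```python
import re


def parse_attributes(attr_string):
    """Parse a GTF-style attribute column, e.g. 'gene_id "g1"; transcript_id "t1";'."""
    return dict(re.findall(r'(\S+)\s+"([^"]*)"', attr_string))


def parent_and_grandparent(attr_string,
                           transcript_key="transcript_id",
                           gene_key="gene_id"):
    """Return (parent, grandparent) IDs, looked up under the configured keys."""
    attrs = parse_attributes(attr_string)
    return attrs.get(transcript_key), attrs.get(gene_key)


standard = 'gene_id "geneA"; transcript_id "txA.1";'
nonstandard = 'Gene "geneA"; Transcript "txA.1";'  # invented attribute names

# The default keys work for a conforming line...
print(parent_and_grandparent(standard))
# ...but find nothing on the nonstandard line unless the keys are overridden.
print(parent_and_grandparent(nonstandard))
print(parent_and_grandparent(nonstandard,
                             transcript_key="Transcript",
                             gene_key="Gene"))
```

With the default keys, the nonstandard line yields no parent or grandparent at all, which is why a nonconforming file needs the keys changed before genes and transcripts can be inferred.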
subfeature
Genes and transcripts are inferred from their component “exon” features. If your data file does not conform to the GTF standard, you can use the subfeature kwarg to change which feature type is used. By default, subfeature="exon", but see the example wormbase_gff2_alt.txt for a case where this needs to be changed.