1'''GFF3 format (:mod:`skbio.io.format.gff3`)
2=========================================
3
4.. currentmodule:: skbio.io.format.gff3
5
6GFF3 (Generic Feature Format version 3) is a standard file format for
7describing features for biological sequences. It contains lines of
8text, each consisting of 9 tab-delimited columns [1]_.
9
10Format Support
11--------------
12**Has Sniffer: Yes**
13
14+------+------+---------------------------------------------------------------+
15|Reader|Writer|                          Object Class                         |
16+======+======+===============================================================+
17|Yes   |Yes   |:mod:`skbio.sequence.Sequence`                                 |
18+------+------+---------------------------------------------------------------+
19|Yes   |Yes   |:mod:`skbio.sequence.DNA`                                      |
20+------+------+---------------------------------------------------------------+
21|Yes   |Yes   |:mod:`skbio.metadata.IntervalMetadata`                         |
22+------+------+---------------------------------------------------------------+
23|Yes   |Yes   |generator of tuple (seq_id of str type,                        |
24|      |      |:mod:`skbio.metadata.IntervalMetadata`)                        |
25+------+------+---------------------------------------------------------------+
26
27Format Specification
28--------------------
29**State: Experimental as of 0.5.1.**
30
31The first line of the file is a comment that identifies the format and
32version. This is followed by a series of data lines. Each data line
33corresponds to an annotation and consists of 9 columns: SEQID, SOURCE,
34TYPE, START, END, SCORE, STRAND, PHASE, and ATTR.
35
36Column 9 (ATTR) is list of feature attributes in the format
37"tag=value". Multiple "tag=value" pairs are delimited by
38semicolons. Multiple values of the same tag are separated with the
39comma ",". The following tags have predefined meanings: ID, Name,
40Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term,
41and Is_circular.
42
43The meaning and format of these columns and attributes are explained
44detail in the format specification [1]_. And they are read in as the
45vocabulary defined in GenBank parser (:mod:`skbio.io.format.genbank`).
46
47Format Parameters
48-----------------
49
50Reader-specific Parameters
51^^^^^^^^^^^^^^^^^^^^^^^^^^
52``IntervalMetadata`` GFF3 reader requires 1 parameter: ``seq_id``.
53It reads the annotation with the specified
54sequence ID from the GFF3 file into an ``IntervalMetadata`` object.
55
56``DNA`` and ``Sequence`` GFF3 readers require ``seq_num`` of int as
57parameter. It specifies which GFF3 record to read from a GFF3 file
58with annotations of multiple sequences in it.
59
60Writer-specific Parameters
61^^^^^^^^^^^^^^^^^^^^^^^^^^
62``skip_subregion`` is a boolean parameter used by all the GFF3 writers. It
63specifies whether you would like to write each non-contiguous
64sub-region for a feature annotation. For example, if there is
65interval feature for a gene with two exons in an ``IntervalMetadata``
66object, it will write one line into the GFF3 file when ``skip_subregion`` is
67``True`` and will write 3 lines (one for the gene and one for each
68exon, respectively) when ``skip_subregion`` is ``False``. Default is ``True``.
69
70In addition, ``IntervalMetadata`` GFF3 writer needs a parameter of
71``seq_id``. It specify the sequence ID (column 1 in GFF3 file) that
72the annotation belong to.
73
74Examples
75--------
76
77Let's create a file stream with following data in GFF3 format:
78
79>>> from skbio import Sequence, DNA
80>>> gff_str = """
81... ##gff-version 3
82... seq_1\\t.\\tgene\\t10\\t90\\t.\\t+\\t0\\tID=gen1
83... seq_1\\t.\\texon\\t10\\t30\\t.\\t+\\t.\\tParent=gen1
84... seq_1\\t.\\texon\\t50\\t90\\t.\\t+\\t.\\tParent=gen1
85... seq_2\\t.\\tgene\\t80\\t96\\t.\\t-\\t.\\tID=gen2
86... ##FASTA
87... >seq_1
88... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
89... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
90... >seq_2
91... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
92... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
93... """
94>>> import io
95>>> from skbio.metadata import IntervalMetadata
96>>> from skbio.io import read
97>>> gff = io.StringIO(gff_str)
98
99We can read it into ``IntervalMetadata``. Each line will be read into
100an interval feature in ``IntervalMetadata`` object:
101
102>>> im = read(gff, format='gff3', into=IntervalMetadata,
103...           seq_id='seq_1')
104>>> im  # doctest: +SKIP
1053 interval features
106-------------------
107Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], \
108fuzzy=[(False, False)], metadata={'type': 'gene', \
109'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'})
110Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], \
111fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', \
112'type': 'exon', 'Parent': 'gen1', 'score': '.'})
113Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], \
114fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', \
115'type': 'exon', 'Parent': 'gen1', 'score': '.'})
116
117We can write the ``IntervalMetadata`` object back to GFF3 file:
118
119>>> with io.StringIO() as fh:    # doctest: +NORMALIZE_WHITESPACE
120...     print(im.write(fh, format='gff3', seq_id='seq_1').getvalue())
121##gff-version 3
122seq_1	.	gene	10	90	.	+	0	ID=gen1
123seq_1	.	exon	10	30	.	+	.	Parent=gen1
124seq_1	.	exon	50	90	.	+	.	Parent=gen1
125<BLANKLINE>
126
127If the GFF3 file does not have the sequence ID, it will return an empty object:
128
129>>> gff = io.StringIO(gff_str)
130>>> im = read(gff, format='gff3', into=IntervalMetadata,
131...           seq_id='foo')
132>>> im
1330 interval features
134-------------------
135
136We can also read the GFF3 file into a generator:
137
138>>> gff = io.StringIO(gff_str)
139>>> gen = read(gff, format='gff3')
140>>> for im in gen:   # doctest: +SKIP
141...     print(im[0])   # the seq id
142...     print(im[1])   # the interval metadata on this seq
143seq_1
1443 interval features
145-------------------
146Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], \
147fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', \
148'source': '.', 'score': '.', 'strand': '+', 'phase': 0})
149Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], \
150fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', \
151'Parent': 'gen1', 'source': '.', 'score': '.'})
152Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], \
153fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', \
154'Parent': 'gen1', 'source': '.', 'score': '.'})
155seq_2
1561 interval feature
157------------------
158Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], \
159fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', \
160'ID': 'gen2', 'source': '.', 'score': '.'})
161
162For the GFF3 file with sequences, we can read it into ``Sequence`` or ``DNA``:
163
164>>> gff = io.StringIO(gff_str)
165>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1)
166>>> seq
167Sequence
168--------------------------------------------------------------------
169Metadata:
170    'description': ''
171    'id': 'seq_1'
172Interval metadata:
173    3 interval features
174Stats:
175    length: 100
176--------------------------------------------------------------------
1770  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
17860 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
179
180>>> gff = io.StringIO(gff_str)
181>>> seq = read(gff, format='gff3', into=DNA, seq_num=2)
182>>> seq
183DNA
184--------------------------------------------------------------------
185Metadata:
186    'description': ''
187    'id': 'seq_2'
188Interval metadata:
189    1 interval feature
190Stats:
191    length: 120
192    has gaps: False
193    has degenerates: False
194    has definites: True
195    GC-content: 50.00%
196--------------------------------------------------------------------
1970  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
19860 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
199
200
201References
202----------
203.. [1] https://github.com/The-Sequence-Ontology/\
204Specifications/blob/master/gff3.md
205
206'''
207
208# ----------------------------------------------------------------------------
209# Copyright (c) 2013--, scikit-bio development team.
210#
211# Distributed under the terms of the Modified BSD License.
212#
213# The full license is in the file COPYING.txt, distributed with this software.
214# ----------------------------------------------------------------------------
215
216import re
217from collections import Iterable
218
219from skbio.sequence import DNA, Sequence
220from skbio.io import create_format, GFF3FormatError
221from skbio.metadata import IntervalMetadata
222from skbio.io.format._base import (
223    _line_generator, _too_many_blanks, _get_nth_sequence)
224from skbio.io.format.fasta import _fasta_to_generator
225from skbio.io.format._sequence_feature_vocabulary import (
226    _vocabulary_change, _vocabulary_skip)
227from skbio.io import write
228
229
230gff3 = create_format('gff3')
231
232
233@gff3.sniffer()
234def _gff3_sniffer(fh):
235    # check the 1st real line is a valid ID line
236    if _too_many_blanks(fh, 5):
237        return False, {}
238
239    try:
240        line = next(_line_generator(fh, skip_blanks=True, strip=False))
241    except StopIteration:
242        return False, {}
243
244    if re.match(r'##gff-version\s+3', line):
245        return True, {}
246    else:
247        return False, {}
248
249
250@gff3.reader(None)
251def _gff3_to_generator(fh):
252    '''Parse the GFF3 into the existing IntervalMetadata
253
254    Parameters
255    ----------
256    fh : file
257        file handler
258
259    Yields
260    ------
261    tuple
262        str of seq id, IntervalMetadata
263    '''
264    id_lengths = {}
265    for data_type, sid, data in _yield_record(fh):
266        if data_type == 'length':
267            # get length from sequence-region pragma.
268            # the pragma lines are always before the real annotation lines.
269            id_lengths[sid] = data
270        elif data_type == 'data':
271            length = id_lengths.get(sid)
272            yield sid, _parse_record(data, length)
273
274
275@gff3.writer(None)
276def _generator_to_gff3(obj, fh, skip_subregion=True):
277    '''Write list of IntervalMetadata into file.
278
279    Parameters
280    ----------
281    obj : Iterable of (seq_id, IntervalMetadata)
282    fh : file handler
283    skip_subregion : bool
284        write a line for each sub-regions of an ``Interval`` if it is ``False``
285    '''
286    # write file header
287    fh.write('##gff-version 3\n')
288    for seq_id, obj_i in obj:
289        _serialize_interval_metadata(obj_i, seq_id, fh, skip_subregion)
290
291
292@gff3.reader(Sequence)
293def _gff3_to_sequence(fh, seq_num=1):
294    return _construct_seq(fh, Sequence, seq_num)
295
296
297@gff3.writer(Sequence)
298def _sequence_to_gff3(obj, fh, skip_subregion=True):
299    # write file header
300    fh.write('##gff-version 3\n')
301    _serialize_seq(obj, fh, skip_subregion)
302
303
304@gff3.reader(DNA)
305def _gff3_to_dna(fh, seq_num=1):
306    return _construct_seq(fh, DNA, seq_num)
307
308
309@gff3.writer(DNA)
310def _dna_to_gff3(obj, fh, skip_subregion=True):
311    # write file header
312    fh.write('##gff-version 3\n')
313    _serialize_seq(obj, fh, skip_subregion)
314
315
316@gff3.reader(IntervalMetadata)
317def _gff3_to_interval_metadata(fh, seq_id):
318    '''Read a GFF3 record into the specified interval metadata.
319
320    Parameters
321    ----------
322    fh : file handler
323    seq_id : str
324        sequence ID which the interval metadata is associated with
325    '''
326    length = None
327    for data_type, sid, data in _yield_record(fh):
328        if seq_id == sid:
329            if data_type == 'length':
330                # get length from sequence-region pragma
331                length = data
332            elif data_type == 'data':
333                return _parse_record(data, length)
334            else:
335                raise GFF3FormatError(
336                    'Unknown section in the input GFF3 file: '
337                    '%r %r %r' % (data_type, sid, data))
338    # return an empty instead of None
339    return IntervalMetadata(None)
340
341
342@gff3.writer(IntervalMetadata)
343def _interval_metadata_to_gff3(obj, fh, seq_id, skip_subregion=True):
344    '''Output ``IntervalMetadata`` object to GFF3 file.
345
346    Parameters
347    ----------
348    obj : IntervalMetadata
349    fh : file object like
350    seq_id : str
351        ID for column 1 in the GFF3 file.
352    skip_subregion : bool
353        write a line for each sub-regions of an ``Interval`` if it is ``False``
354    '''
355    # write file header
356    fh.write('##gff-version 3\n')
357    _serialize_interval_metadata(obj, seq_id, fh, skip_subregion=True)
358
359
360def _construct_seq(fh, constructor=DNA, seq_num=1):
361    lines = []
362    for i, (data_type, seq_id, l) in enumerate(_yield_record(fh), 1):
363        if data_type == 'data' and seq_num == i:
364            lines = l
365    seq = _get_nth_sequence(_fasta_to_generator(fh, constructor=constructor),
366                            seq_num=seq_num)
367    seq.interval_metadata = _parse_record(lines, len(seq))
368    return seq
369
370
371def _yield_record(fh):
372    '''Yield (seq_id, lines) that belong to the same sequence.'''
373    lines = []
374    current = False
375    for line in _line_generator(fh, skip_blanks=True, strip=True):
376        if line.startswith('##sequence-region'):
377            _, seq_id, start, end = line.split()
378            length = int(end) - int(start) + 1
379            yield 'length', seq_id, length
380        if line.startswith('##FASTA'):
381            # stop once reaching to sequence section
382            break
383        if not line.startswith('#'):
384            try:
385                seq_id, _ = line.split('\t', 1)
386            except ValueError:
387                raise GFF3FormatError(
388                    'Wrong GFF3 format at line: %s' % line)
389            if current == seq_id:
390                lines.append(line)
391            else:
392                if current is not False:
393                    yield 'data', current, lines
394                lines = [line]
395                current = seq_id
396    if current is False:
397        # if the input file object is empty, it should return
398        # an empty generator
399        return
400        yield
401    else:
402        yield 'data', current, lines
403
404
405def _parse_record(lines, length):
406    '''Parse the lines into a IntervalMetadata object.'''
407    interval_metadata = IntervalMetadata(length)
408    for line in lines:
409        columns = line.split('\t')
410        # there should be 9 columns
411        if len(columns) != 9:
412            raise GFF3FormatError(
413                'do not have 9 columns in this line: "%s"' % line)
414        # the 1st column is seq ID for every feature. don't store
415        # this repetitive information
416        metadata = {'source': columns[1],
417                    'type': columns[2],
418                    'score': columns[5],
419                    'strand': columns[6]}
420        phase = columns[7]
421        # phase value can only be int or '.'
422        try:
423            metadata['phase'] = int(phase)
424        except ValueError:
425            if phase != '.':
426                raise GFF3FormatError(
427                    'unknown value for phase column: {!r}'.format(phase))
428        metadata.update(_parse_attr(columns[8]))
429
430        start, end = columns[3:5]
431
432        bounds = [(int(start)-1, int(end))]
433
434        interval_metadata.add(bounds, metadata=metadata)
435
436    return interval_metadata
437
438
439def _parse_attr(s):
440    '''parse attribute column'''
441    voca_change = _vocabulary_change('gff3')
442    md = {}
443    # in case the line ending with ';', strip it.
444    s = s.rstrip(';')
445    for attr in s.split(';'):
446        k, v = attr.split('=')
447        if k in voca_change:
448            k = voca_change[k]
449        md[k] = v
450    return md
451
452
453def _serialize_interval_metadata(interval_metadata, seq_id, fh,
454                                 skip_subregion=True):
455    '''Serialize an IntervalMetadata to GFF3.
456
457    Parameters
458    ----------
459    interval_metadata : IntervalMetadata
460    seq_id : str
461        Seq id for the current annotation. It will be used as the 1st column
462        in the GFF3.
463    fh : file handler
464        the file object to output
465    skip_subregion : bool
466        Whether to skip outputting each sub region as a line in GFF3.
467    '''
468    column_keys = ['source', 'type', 'score', 'strand', 'phase']
469    voca_change = _vocabulary_change('gff3', False)
470    voca_skip = _vocabulary_skip('gff3')
471    voca_skip.extend(column_keys)
472
473    # these characters have reserved meanings in column 9 and must be
474    # escaped when used in other contexts
475    escape = str.maketrans({';': '%3B',
476                            '=': '%3D',
477                            '&': '%26',
478                            ',': '%2C'})
479
480    for interval in interval_metadata._intervals:
481        md = interval.metadata
482        bd = interval.bounds
483        start = str(bd[0][0] + 1)
484        end = str(bd[-1][-1])
485
486        source, feat_type, score, strand, phase = [
487            str(md.get(i, '.')) for i in column_keys]
488        columns = [seq_id, source, feat_type, start, end, score, strand, phase]
489
490        # serialize the attributes in column 9
491        attr = []
492        # use sort to make the output order deterministic
493        for k in sorted(md):
494            if k in voca_skip:
495                # skip the metadata that doesn't go to attribute column
496                continue
497            v = md[k]
498            if k in voca_change:
499                k = voca_change[k]
500            if isinstance(v, Iterable) and not isinstance(v, str):
501                # if there are multiple values for this attribute,
502                # convert them to str and concat them with ","
503                v = ','.join(str(i).translate(escape) for i in v)
504            else:
505                v = v.translate(escape)
506            attr.append('%s=%s' % (k.translate(escape), v))
507
508        columns.append(';'.join(attr))
509
510        fh.write('\t'.join(columns))
511        fh.write('\n')
512
513        # if there are multiple regions for this feature,
514        # output each region as a standalone line in GFF3.
515        if len(bd) > 1 and skip_subregion is False:
516            for start, end in bd:
517                # if this is a gene, then each sub region should be an exon
518                if columns[2] == 'gene':
519                    columns[2] = 'exon'
520                columns[3] = str(start + 1)
521                columns[4] = str(end)
522                try:
523                    parent = md['ID']
524                except KeyError:
525                    raise GFF3FormatError(
526                        'You need provide ID info for '
527                        'the parent interval feature: %r' % interval)
528                columns[8] = 'Parent=%s' % parent
529                fh.write('\t'.join(columns))
530                fh.write('\n')
531
532
533def _serialize_seq(seq, fh, skip_subregion=True):
534    '''Serialize a sequence to GFF3.'''
535    _serialize_interval_metadata(
536        seq.interval_metadata, seq.metadata['id'], fh, skip_subregion)
537    fh.write('##FASTA\n')
538    write(seq, into=fh, format='fasta')
539