1'''GFF3 format (:mod:`skbio.io.format.gff3`) 2========================================= 3 4.. currentmodule:: skbio.io.format.gff3 5 6GFF3 (Generic Feature Format version 3) is a standard file format for 7describing features for biological sequences. It contains lines of 8text, each consisting of 9 tab-delimited columns [1]_. 9 10Format Support 11-------------- 12**Has Sniffer: Yes** 13 14+------+------+---------------------------------------------------------------+ 15|Reader|Writer| Object Class | 16+======+======+===============================================================+ 17|Yes |Yes |:mod:`skbio.sequence.Sequence` | 18+------+------+---------------------------------------------------------------+ 19|Yes |Yes |:mod:`skbio.sequence.DNA` | 20+------+------+---------------------------------------------------------------+ 21|Yes |Yes |:mod:`skbio.metadata.IntervalMetadata` | 22+------+------+---------------------------------------------------------------+ 23|Yes |Yes |generator of tuple (seq_id of str type, | 24| | |:mod:`skbio.metadata.IntervalMetadata`) | 25+------+------+---------------------------------------------------------------+ 26 27Format Specification 28-------------------- 29**State: Experimental as of 0.5.1.** 30 31The first line of the file is a comment that identifies the format and 32version. This is followed by a series of data lines. Each data line 33corresponds to an annotation and consists of 9 columns: SEQID, SOURCE, 34TYPE, START, END, SCORE, STRAND, PHASE, and ATTR. 35 36Column 9 (ATTR) is list of feature attributes in the format 37"tag=value". Multiple "tag=value" pairs are delimited by 38semicolons. Multiple values of the same tag are separated with the 39comma ",". The following tags have predefined meanings: ID, Name, 40Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, 41and Is_circular. 42 43The meaning and format of these columns and attributes are explained 44detail in the format specification [1]_. And they are read in as the 45vocabulary defined in GenBank parser (:mod:`skbio.io.format.genbank`). 46 47Format Parameters 48----------------- 49 50Reader-specific Parameters 51^^^^^^^^^^^^^^^^^^^^^^^^^^ 52``IntervalMetadata`` GFF3 reader requires 1 parameter: ``seq_id``. 53It reads the annotation with the specified 54sequence ID from the GFF3 file into an ``IntervalMetadata`` object. 55 56``DNA`` and ``Sequence`` GFF3 readers require ``seq_num`` of int as 57parameter. It specifies which GFF3 record to read from a GFF3 file 58with annotations of multiple sequences in it. 59 60Writer-specific Parameters 61^^^^^^^^^^^^^^^^^^^^^^^^^^ 62``skip_subregion`` is a boolean parameter used by all the GFF3 writers. It 63specifies whether you would like to write each non-contiguous 64sub-region for a feature annotation. For example, if there is 65interval feature for a gene with two exons in an ``IntervalMetadata`` 66object, it will write one line into the GFF3 file when ``skip_subregion`` is 67``True`` and will write 3 lines (one for the gene and one for each 68exon, respectively) when ``skip_subregion`` is ``False``. Default is ``True``. 69 70In addition, ``IntervalMetadata`` GFF3 writer needs a parameter of 71``seq_id``. It specify the sequence ID (column 1 in GFF3 file) that 72the annotation belong to. 73 74Examples 75-------- 76 77Let's create a file stream with following data in GFF3 format: 78 79>>> from skbio import Sequence, DNA 80>>> gff_str = """ 81... ##gff-version 3 82... seq_1\\t.\\tgene\\t10\\t90\\t.\\t+\\t0\\tID=gen1 83... seq_1\\t.\\texon\\t10\\t30\\t.\\t+\\t.\\tParent=gen1 84... seq_1\\t.\\texon\\t50\\t90\\t.\\t+\\t.\\tParent=gen1 85... seq_2\\t.\\tgene\\t80\\t96\\t.\\t-\\t.\\tID=gen2 86... ##FASTA 87... >seq_1 88... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC 89... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC 90... >seq_2 91... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC 92... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC 93... """ 94>>> import io 95>>> from skbio.metadata import IntervalMetadata 96>>> from skbio.io import read 97>>> gff = io.StringIO(gff_str) 98 99We can read it into ``IntervalMetadata``. Each line will be read into 100an interval feature in ``IntervalMetadata`` object: 101 102>>> im = read(gff, format='gff3', into=IntervalMetadata, 103... seq_id='seq_1') 104>>> im # doctest: +SKIP 1053 interval features 106------------------- 107Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], \ 108fuzzy=[(False, False)], metadata={'type': 'gene', \ 109'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'}) 110Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], \ 111fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', \ 112'type': 'exon', 'Parent': 'gen1', 'score': '.'}) 113Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], \ 114fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', \ 115'type': 'exon', 'Parent': 'gen1', 'score': '.'}) 116 117We can write the ``IntervalMetadata`` object back to GFF3 file: 118 119>>> with io.StringIO() as fh: # doctest: +NORMALIZE_WHITESPACE 120... print(im.write(fh, format='gff3', seq_id='seq_1').getvalue()) 121##gff-version 3 122seq_1 . gene 10 90 . + 0 ID=gen1 123seq_1 . exon 10 30 . + . Parent=gen1 124seq_1 . exon 50 90 . + . Parent=gen1 125<BLANKLINE> 126 127If the GFF3 file does not have the sequence ID, it will return an empty object: 128 129>>> gff = io.StringIO(gff_str) 130>>> im = read(gff, format='gff3', into=IntervalMetadata, 131... seq_id='foo') 132>>> im 1330 interval features 134------------------- 135 136We can also read the GFF3 file into a generator: 137 138>>> gff = io.StringIO(gff_str) 139>>> gen = read(gff, format='gff3') 140>>> for im in gen: # doctest: +SKIP 141... print(im[0]) # the seq id 142... print(im[1]) # the interval metadata on this seq 143seq_1 1443 interval features 145------------------- 146Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], \ 147fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', \ 148'source': '.', 'score': '.', 'strand': '+', 'phase': 0}) 149Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], \ 150fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', \ 151'Parent': 'gen1', 'source': '.', 'score': '.'}) 152Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], \ 153fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', \ 154'Parent': 'gen1', 'source': '.', 'score': '.'}) 155seq_2 1561 interval feature 157------------------ 158Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], \ 159fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', \ 160'ID': 'gen2', 'source': '.', 'score': '.'}) 161 162For the GFF3 file with sequences, we can read it into ``Sequence`` or ``DNA``: 163 164>>> gff = io.StringIO(gff_str) 165>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1) 166>>> seq 167Sequence 168-------------------------------------------------------------------- 169Metadata: 170 'description': '' 171 'id': 'seq_1' 172Interval metadata: 173 3 interval features 174Stats: 175 length: 100 176-------------------------------------------------------------------- 1770 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC 17860 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC 179 180>>> gff = io.StringIO(gff_str) 181>>> seq = read(gff, format='gff3', into=DNA, seq_num=2) 182>>> seq 183DNA 184-------------------------------------------------------------------- 185Metadata: 186 'description': '' 187 'id': 'seq_2' 188Interval metadata: 189 1 interval feature 190Stats: 191 length: 120 192 has gaps: False 193 has degenerates: False 194 has definites: True 195 GC-content: 50.00% 196-------------------------------------------------------------------- 1970 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC 19860 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC 199 200 201References 202---------- 203.. [1] https://github.com/The-Sequence-Ontology/\ 204Specifications/blob/master/gff3.md 205 206''' 207 208# ---------------------------------------------------------------------------- 209# Copyright (c) 2013--, scikit-bio development team. 210# 211# Distributed under the terms of the Modified BSD License. 212# 213# The full license is in the file COPYING.txt, distributed with this software. 214# ---------------------------------------------------------------------------- 215 216import re 217from collections import Iterable 218 219from skbio.sequence import DNA, Sequence 220from skbio.io import create_format, GFF3FormatError 221from skbio.metadata import IntervalMetadata 222from skbio.io.format._base import ( 223 _line_generator, _too_many_blanks, _get_nth_sequence) 224from skbio.io.format.fasta import _fasta_to_generator 225from skbio.io.format._sequence_feature_vocabulary import ( 226 _vocabulary_change, _vocabulary_skip) 227from skbio.io import write 228 229 230gff3 = create_format('gff3') 231 232 233@gff3.sniffer() 234def _gff3_sniffer(fh): 235 # check the 1st real line is a valid ID line 236 if _too_many_blanks(fh, 5): 237 return False, {} 238 239 try: 240 line = next(_line_generator(fh, skip_blanks=True, strip=False)) 241 except StopIteration: 242 return False, {} 243 244 if re.match(r'##gff-version\s+3', line): 245 return True, {} 246 else: 247 return False, {} 248 249 250@gff3.reader(None) 251def _gff3_to_generator(fh): 252 '''Parse the GFF3 into the existing IntervalMetadata 253 254 Parameters 255 ---------- 256 fh : file 257 file handler 258 259 Yields 260 ------ 261 tuple 262 str of seq id, IntervalMetadata 263 ''' 264 id_lengths = {} 265 for data_type, sid, data in _yield_record(fh): 266 if data_type == 'length': 267 # get length from sequence-region pragma. 268 # the pragma lines are always before the real annotation lines. 269 id_lengths[sid] = data 270 elif data_type == 'data': 271 length = id_lengths.get(sid) 272 yield sid, _parse_record(data, length) 273 274 275@gff3.writer(None) 276def _generator_to_gff3(obj, fh, skip_subregion=True): 277 '''Write list of IntervalMetadata into file. 278 279 Parameters 280 ---------- 281 obj : Iterable of (seq_id, IntervalMetadata) 282 fh : file handler 283 skip_subregion : bool 284 write a line for each sub-regions of an ``Interval`` if it is ``False`` 285 ''' 286 # write file header 287 fh.write('##gff-version 3\n') 288 for seq_id, obj_i in obj: 289 _serialize_interval_metadata(obj_i, seq_id, fh, skip_subregion) 290 291 292@gff3.reader(Sequence) 293def _gff3_to_sequence(fh, seq_num=1): 294 return _construct_seq(fh, Sequence, seq_num) 295 296 297@gff3.writer(Sequence) 298def _sequence_to_gff3(obj, fh, skip_subregion=True): 299 # write file header 300 fh.write('##gff-version 3\n') 301 _serialize_seq(obj, fh, skip_subregion) 302 303 304@gff3.reader(DNA) 305def _gff3_to_dna(fh, seq_num=1): 306 return _construct_seq(fh, DNA, seq_num) 307 308 309@gff3.writer(DNA) 310def _dna_to_gff3(obj, fh, skip_subregion=True): 311 # write file header 312 fh.write('##gff-version 3\n') 313 _serialize_seq(obj, fh, skip_subregion) 314 315 316@gff3.reader(IntervalMetadata) 317def _gff3_to_interval_metadata(fh, seq_id): 318 '''Read a GFF3 record into the specified interval metadata. 319 320 Parameters 321 ---------- 322 fh : file handler 323 seq_id : str 324 sequence ID which the interval metadata is associated with 325 ''' 326 length = None 327 for data_type, sid, data in _yield_record(fh): 328 if seq_id == sid: 329 if data_type == 'length': 330 # get length from sequence-region pragma 331 length = data 332 elif data_type == 'data': 333 return _parse_record(data, length) 334 else: 335 raise GFF3FormatError( 336 'Unknown section in the input GFF3 file: ' 337 '%r %r %r' % (data_type, sid, data)) 338 # return an empty instead of None 339 return IntervalMetadata(None) 340 341 342@gff3.writer(IntervalMetadata) 343def _interval_metadata_to_gff3(obj, fh, seq_id, skip_subregion=True): 344 '''Output ``IntervalMetadata`` object to GFF3 file. 345 346 Parameters 347 ---------- 348 obj : IntervalMetadata 349 fh : file object like 350 seq_id : str 351 ID for column 1 in the GFF3 file. 352 skip_subregion : bool 353 write a line for each sub-regions of an ``Interval`` if it is ``False`` 354 ''' 355 # write file header 356 fh.write('##gff-version 3\n') 357 _serialize_interval_metadata(obj, seq_id, fh, skip_subregion=True) 358 359 360def _construct_seq(fh, constructor=DNA, seq_num=1): 361 lines = [] 362 for i, (data_type, seq_id, l) in enumerate(_yield_record(fh), 1): 363 if data_type == 'data' and seq_num == i: 364 lines = l 365 seq = _get_nth_sequence(_fasta_to_generator(fh, constructor=constructor), 366 seq_num=seq_num) 367 seq.interval_metadata = _parse_record(lines, len(seq)) 368 return seq 369 370 371def _yield_record(fh): 372 '''Yield (seq_id, lines) that belong to the same sequence.''' 373 lines = [] 374 current = False 375 for line in _line_generator(fh, skip_blanks=True, strip=True): 376 if line.startswith('##sequence-region'): 377 _, seq_id, start, end = line.split() 378 length = int(end) - int(start) + 1 379 yield 'length', seq_id, length 380 if line.startswith('##FASTA'): 381 # stop once reaching to sequence section 382 break 383 if not line.startswith('#'): 384 try: 385 seq_id, _ = line.split('\t', 1) 386 except ValueError: 387 raise GFF3FormatError( 388 'Wrong GFF3 format at line: %s' % line) 389 if current == seq_id: 390 lines.append(line) 391 else: 392 if current is not False: 393 yield 'data', current, lines 394 lines = [line] 395 current = seq_id 396 if current is False: 397 # if the input file object is empty, it should return 398 # an empty generator 399 return 400 yield 401 else: 402 yield 'data', current, lines 403 404 405def _parse_record(lines, length): 406 '''Parse the lines into a IntervalMetadata object.''' 407 interval_metadata = IntervalMetadata(length) 408 for line in lines: 409 columns = line.split('\t') 410 # there should be 9 columns 411 if len(columns) != 9: 412 raise GFF3FormatError( 413 'do not have 9 columns in this line: "%s"' % line) 414 # the 1st column is seq ID for every feature. don't store 415 # this repetitive information 416 metadata = {'source': columns[1], 417 'type': columns[2], 418 'score': columns[5], 419 'strand': columns[6]} 420 phase = columns[7] 421 # phase value can only be int or '.' 422 try: 423 metadata['phase'] = int(phase) 424 except ValueError: 425 if phase != '.': 426 raise GFF3FormatError( 427 'unknown value for phase column: {!r}'.format(phase)) 428 metadata.update(_parse_attr(columns[8])) 429 430 start, end = columns[3:5] 431 432 bounds = [(int(start)-1, int(end))] 433 434 interval_metadata.add(bounds, metadata=metadata) 435 436 return interval_metadata 437 438 439def _parse_attr(s): 440 '''parse attribute column''' 441 voca_change = _vocabulary_change('gff3') 442 md = {} 443 # in case the line ending with ';', strip it. 444 s = s.rstrip(';') 445 for attr in s.split(';'): 446 k, v = attr.split('=') 447 if k in voca_change: 448 k = voca_change[k] 449 md[k] = v 450 return md 451 452 453def _serialize_interval_metadata(interval_metadata, seq_id, fh, 454 skip_subregion=True): 455 '''Serialize an IntervalMetadata to GFF3. 456 457 Parameters 458 ---------- 459 interval_metadata : IntervalMetadata 460 seq_id : str 461 Seq id for the current annotation. It will be used as the 1st column 462 in the GFF3. 463 fh : file handler 464 the file object to output 465 skip_subregion : bool 466 Whether to skip outputting each sub region as a line in GFF3. 467 ''' 468 column_keys = ['source', 'type', 'score', 'strand', 'phase'] 469 voca_change = _vocabulary_change('gff3', False) 470 voca_skip = _vocabulary_skip('gff3') 471 voca_skip.extend(column_keys) 472 473 # these characters have reserved meanings in column 9 and must be 474 # escaped when used in other contexts 475 escape = str.maketrans({';': '%3B', 476 '=': '%3D', 477 '&': '%26', 478 ',': '%2C'}) 479 480 for interval in interval_metadata._intervals: 481 md = interval.metadata 482 bd = interval.bounds 483 start = str(bd[0][0] + 1) 484 end = str(bd[-1][-1]) 485 486 source, feat_type, score, strand, phase = [ 487 str(md.get(i, '.')) for i in column_keys] 488 columns = [seq_id, source, feat_type, start, end, score, strand, phase] 489 490 # serialize the attributes in column 9 491 attr = [] 492 # use sort to make the output order deterministic 493 for k in sorted(md): 494 if k in voca_skip: 495 # skip the metadata that doesn't go to attribute column 496 continue 497 v = md[k] 498 if k in voca_change: 499 k = voca_change[k] 500 if isinstance(v, Iterable) and not isinstance(v, str): 501 # if there are multiple values for this attribute, 502 # convert them to str and concat them with "," 503 v = ','.join(str(i).translate(escape) for i in v) 504 else: 505 v = v.translate(escape) 506 attr.append('%s=%s' % (k.translate(escape), v)) 507 508 columns.append(';'.join(attr)) 509 510 fh.write('\t'.join(columns)) 511 fh.write('\n') 512 513 # if there are multiple regions for this feature, 514 # output each region as a standalone line in GFF3. 515 if len(bd) > 1 and skip_subregion is False: 516 for start, end in bd: 517 # if this is a gene, then each sub region should be an exon 518 if columns[2] == 'gene': 519 columns[2] = 'exon' 520 columns[3] = str(start + 1) 521 columns[4] = str(end) 522 try: 523 parent = md['ID'] 524 except KeyError: 525 raise GFF3FormatError( 526 'You need provide ID info for ' 527 'the parent interval feature: %r' % interval) 528 columns[8] = 'Parent=%s' % parent 529 fh.write('\t'.join(columns)) 530 fh.write('\n') 531 532 533def _serialize_seq(seq, fh, skip_subregion=True): 534 '''Serialize a sequence to GFF3.''' 535 _serialize_interval_metadata( 536 seq.interval_metadata, seq.metadata['id'], fh, skip_subregion) 537 fh.write('##FASTA\n') 538 write(seq, into=fh, format='fasta') 539