\chapter{KEGG}
\label{chapter:kegg}

KEGG (\url{https://www.kegg.jp/}) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Please note that the KEGG parser implementation in Biopython is incomplete. While the KEGG website lists many flat-file formats, only parsers and writers for compound, enzyme, and map are currently implemented. However, a generic parser is implemented to handle the other formats.

\section{Parsing KEGG records}
Parsing a KEGG record is as simple as using any other file format parser in Biopython.
(Before running the following code, please open \url{http://rest.kegg.jp/get/ec:5.4.2.2} in your web browser and save it as \verb|ec_5.4.2.2.txt|.)

%doctest examples
\begin{minted}{pycon}
>>> from Bio.KEGG import Enzyme
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}

Alternatively, if the input KEGG file has exactly one entry, you can use \verb|read|:

%doctest examples
\begin{minted}{pycon}
>>> from Bio.KEGG import Enzyme
>>> record = Enzyme.read(open("ec_5.4.2.2.txt"))
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}

The following section shows how to download the above enzyme using the KEGG API, as well as how to use the generic parser with data that does not have a custom parser implemented.

\section{Querying the KEGG API}

Biopython fully supports querying the KEGG API: every endpoint and every method documented by KEGG (\url{https://www.kegg.jp/kegg/rest/keggapi.html}) is available. The interface performs some validation of queries, following the rules defined on the KEGG site. However, invalid queries that return a 400 or 404 status must be handled by the user.

First, here is how to extend the above example by downloading the relevant enzyme and passing it through the Enzyme parser.

%want online doctest here
\begin{minted}{pycon}
>>> from Bio.KEGG import REST
>>> from Bio.KEGG import Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> with open("ec_5.4.2.2.txt", "w") as handle:  # close the file before re-reading it
...     _ = handle.write(request.read())
...
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}
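
Because the wrapper leaves 400 and 404 responses to the caller, it can help to wrap queries in a small guard. The following is a minimal sketch, not part of Biopython: \verb|fetch_or_none| is a hypothetical helper name, and it assumes that the query function reports a failed request by raising \verb|urllib.error.HTTPError|, as \verb|urllib.request.urlopen| does. The \verb|fake_query| function is only an offline stand-in for a real REST call.

```python
import urllib.error


def fetch_or_none(query, *args):
    """Call query(*args); return None if the server answers 400 or 404.

    Hypothetical helper (not part of Biopython). ``query`` is any callable
    that raises urllib.error.HTTPError for a failed HTTP request.
    """
    try:
        return query(*args)
    except urllib.error.HTTPError as err:
        if err.code in (400, 404):
            return None  # malformed query or unknown entry
        raise  # other failures (e.g. a server error) are passed on


# Offline stand-in for a REST call, used only to demonstrate the guard
def fake_query(entry):
    if entry != "ec:5.4.2.2":
        raise urllib.error.HTTPError(
            "http://example.invalid", 404, "Not Found", None, None
        )
    return "ENTRY       EC 5.4.2.2"


print(fetch_or_none(fake_query, "ec:5.4.2.2") is not None)
print(fetch_or_none(fake_query, "ec:0.0.0.0") is None)
```

With Biopython installed and network access available, the same guard could be applied to a real query, for example \verb|fetch_or_none(REST.kegg_get, "ec:5.4.2.2")|.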

Now, here's a more realistic example that combines several KEGG API queries. It extracts the unique set of gene symbols from all human pathways related to DNA repair, in three steps. First, get a list of all human pathways. Second, keep only those whose description mentions ``repair''. Finally, collect the gene symbols from every repair pathway.

%want online doctest here
\begin{minted}{python}
from Bio.KEGG import REST

human_pathways = REST.kegg_list("pathway", "hsa").read()

# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # Iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, and only read the genes in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if section:
            current_section = section

        if current_section == "GENE":
            # maxsplit=1 keeps a description that itself contains "; " intact
            gene_identifiers, gene_description = line[12:].split("; ", 1)
            gene_id, gene_symbol = gene_identifiers.split()

            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print(
    "There are %d repair pathways and %d repair genes. The genes are:"
    % (len(repair_pathways), len(repair_genes))
)
print(", ".join(repair_genes))
\end{minted}
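
The column-based bookkeeping in the loop above applies to any KEGG flat file, so it can be factored into a reusable function. The sketch below is a hand-written illustration rather than Biopython's own generic parser: \verb|split_sections| is a hypothetical name, and the sample text is a heavily abbreviated, hand-typed stand-in for a real pathway record.

```python
def split_sections(text):
    """Group the lines of one KEGG flat-file record by section name.

    Hypothetical helper. Assumes the layout used in the loop above: the
    section name sits in the first 12 columns, and continuation lines
    leave those columns blank.
    """
    sections = {}
    current = None
    for line in text.rstrip().split("\n"):
        if line.startswith("///"):  # record terminator
            break
        name = line[:12].strip()
        if name:
            current = name
        if current is not None:
            sections.setdefault(current, []).append(line[12:].strip())
    return sections


# Abbreviated stand-in for a pathway record (not real KEGG output)
sample = (
    "ENTRY       hsa03410                    Pathway\n"
    "NAME        Base excision repair - Homo sapiens (human)\n"
    "GENE        4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "            4595  MUTYH; mutY DNA glycosylase\n"
    "///\n"
)

sections = split_sections(sample)
# Each GENE entry has the form "<gene id>  <symbol>; <description>"
symbols = [entry.split("; ", 1)[0].split()[1] for entry in sections["GENE"]]
print(symbols)
```

Returning a dictionary keyed by section name keeps the sketch format-agnostic: the caller decides how to interpret each section, as the list comprehension does here for \verb|GENE|.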

The KEGG API wrapper is compatible with all endpoints. To translate a request, split the URL path on its slashes: the first component selects the corresponding \verb|kegg_*| function in \verb|Bio.KEGG.REST|, and the remaining components become its arguments, with entries joined by \verb|+| passed as a list. Here are a few examples from the API documentation (\url{https://www.kegg.jp/kegg/docs/keggapi.html}).

\begin{minted}{text}
/list/hsa:10458+ece:Z5100          -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight  -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq     -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
\end{minted}
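
The translation rule can also be written down as code. The following sketch is illustrative only: \verb|path_to_call| is a hypothetical helper, not a Biopython function, and it merely builds the function name and argument list that the examples above pair with each URL path. It makes no network request.

```python
def path_to_call(path):
    """Translate a KEGG REST path into a (function name, arguments) pair.

    Hypothetical helper for illustration. The first path component selects
    the operation; each remaining component becomes one argument, and a
    component holding several entries joined by "+" becomes a list.
    """
    operation, *components = path.strip("/").split("/")
    args = [c.split("+") if "+" in c else c for c in components]
    return "kegg_" + operation, args


for path in (
    "/list/hsa:10458+ece:Z5100",
    "/find/compound/300-310/mol_weight",
    "/get/hsa:10458+ece:Z5100/aaseq",
):
    name, args = path_to_call(path)
    print(f"REST.{name}({', '.join(repr(a) for a in args)})")
```

With Biopython installed, the resulting pair could be dispatched as \verb|getattr(REST, name)(*args)|, which would send the request to KEGG.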