\chapter{KEGG}
\label{chapter:kegg}

KEGG (\url{https://www.kegg.jp/}) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Please note that the KEGG parser implementation in Biopython is incomplete. While the KEGG website lists many flat file formats, only parsers and writers for compound, enzyme, and map are currently implemented. However, a generic parser is implemented to handle the other formats.

\section{Parsing KEGG records}
Parsing a KEGG record is as simple as using any other file format parser in Biopython.
(Before running the following code, please open \url{http://rest.kegg.jp/get/ec:5.4.2.2} with your web browser and save it as \verb|ec_5.4.2.2.txt|.)

%doctest examples
\begin{minted}{pycon}
>>> from Bio.KEGG import Enzyme
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}

Alternatively, if the input KEGG file has exactly one entry, you can use \verb|read|:

%doctest examples
\begin{minted}{pycon}
>>> from Bio.KEGG import Enzyme
>>> record = Enzyme.read(open("ec_5.4.2.2.txt"))
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}

The following section shows how to download the above enzyme using the KEGG API, as well as how to use the generic parser with data that does not have a custom parser implemented.

\section{Querying the KEGG API}

Biopython has full support for querying the KEGG API.
All KEGG endpoints can be queried, and all methods documented by KEGG (\url{https://www.kegg.jp/kegg/rest/keggapi.html}) are supported. The interface performs some validation of queries, following the rules defined on the KEGG site. However, invalid queries that return a 400 or 404 must be handled by the user.

First, here is how to extend the above example by downloading the relevant enzyme and passing it through the Enzyme parser.

%want online doctest here
\begin{minted}{pycon}
>>> from Bio.KEGG import REST
>>> from Bio.KEGG import Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> open("ec_5.4.2.2.txt", "w").write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
\end{minted}

Now, here is a more realistic example that combines several KEGG API queries. It demonstrates how to extract the unique set of gene symbols from all human pathways related to DNA repair. The steps are as follows. First, we get a list of all human pathways. Second, we filter those for pathways whose description mentions ``repair''. Last, we collect the gene symbols from all repair pathways.

%want online doctest here
\begin{minted}{python}
from Bio.KEGG import REST

human_pathways = REST.kegg_list("pathway", "hsa").read()

# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, only read the gene in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if section:
            current_section = section

        if current_section == "GENE":
            # split only at the first "; " so that a description which itself
            # contains "; " does not break the unpacking
            gene_identifiers, gene_description = line[12:].split("; ", 1)
            gene_id, gene_symbol = gene_identifiers.split()

            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print(
    "There are %d repair pathways and %d repair genes. The genes are:"
    % (len(repair_pathways), len(repair_genes))
)
print(", ".join(repair_genes))
\end{minted}

The KEGG API wrapper is compatible with all endpoints. To use it, split the URL path at its slashes: the first component selects the corresponding method in the KEGG module, and the remaining components become that method's arguments. Entries joined with \verb|+| are passed as a Python list. Here are a few examples from the API documentation (\url{https://www.kegg.jp/kegg/docs/keggapi.html}).

\begin{minted}{text}
/list/hsa:10458+ece:Z5100          -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight  -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq     -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
\end{minted}
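
The mapping from URL paths to method calls can be sketched in a few lines of plain Python. The helper \verb|kegg_path_to_call| below is hypothetical and is not part of Biopython; it only illustrates the rule, assuming that the first path component names the operation and that a \verb|+| always separates multiple entries.

\begin{minted}{python}
def kegg_path_to_call(path):
    """Translate a KEGG REST path into a (method name, arguments) pair.

    Hypothetical helper for illustration only; it is not part of Bio.KEGG.
    """
    # Drop the surrounding slashes and split the path into its components.
    operation, *parts = path.strip("/").split("/")
    # Entries joined with "+" become a Python list; other parts stay strings.
    arguments = [part.split("+") if "+" in part else part for part in parts]
    return "kegg_" + operation, arguments


for path in (
    "/list/hsa:10458+ece:Z5100",
    "/find/compound/300-310/mol_weight",
    "/get/hsa:10458+ece:Z5100/aaseq",
):
    method, arguments = kegg_path_to_call(path)
    print(path, "->", method, arguments)
\end{minted}

With Biopython installed and a network connection available, the resulting pair could then be dispatched with \verb|getattr(REST, method)(*arguments)|, which should be equivalent to writing the calls listed above by hand.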