Return only the longest matches in re.finditer when identifying an Open
Reading Frame
I am trying to write code that will identify open reading frames in a DNA
sequence. An ORF is defined as a portion of a sequence that starts with
ATG and ends with an in-frame stop codon (TAG, TAA, or TGA). I have used a
lookahead expression to find overlapping matches. However, I want only the
longest match for each stop codon to be printed. For example, if we have:
ATGAAAATGAAATAAGTCGTCGGG
Then only ATGAAAATGAAATAA should be returned. The shorter ATGAAATAA, which
starts at the second ATG and ends at the same stop codon, should not be
returned.
    import re

    def find_orfs(sequence, aa):
        orfs = []
        # The lookahead makes overlapping ORFs visible to finditer;
        # aa is the minimum number of codons between ATG and the stop codon.
        orfre = r'(?=(ATG(?:[ATGC]{3}){%d,}?(?:TAG|TAA|TGA)))' % aa
        for match in re.finditer(orfre, sequence):
            orfs.append({
                'start': match.start(1) + 1,
                'stop': match.end(1),
                'stop codon': sequence[match.end(1) - 3:match.end(1)],
                'nucleotide length': match.end(1) - match.start(1),
                'amino acid length': (match.end(1) - match.start(1) - 3) // 3,
                'reading frame': match.start() % 3,
            })
            print(match.group(1))
        return orfs
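One possible way to keep only the longest match per stop codon, as a minimal sketch rather than a definitive answer: finditer scans left to right, so the first match that ends at a given stop codon starts at the earliest in-frame ATG and is therefore the longest one for that stop. Any later match with the same end position can be skipped. The names longest_orfs, ORF_RE and min_codons are hypothetical and not part of the code above. The sketch also assumes the minimum-length check is applied after matching, so the lazy quantifier always stops at the first in-frame stop codon.

```python
import re

# Lazy quantifier: each ATG is paired with the first in-frame stop codon after it.
# The lookahead lets finditer report overlapping candidates.
ORF_RE = re.compile(r'(?=(ATG(?:[ATGC]{3})*?(?:TAG|TAA|TGA)))')

def longest_orfs(sequence, min_codons=0):
    """Return one ORF per stop codon: the one starting at the earliest in-frame ATG.

    min_codons (hypothetical parameter) is the minimum number of codons
    between the ATG and the stop codon.
    """
    seen_stops = set()   # end positions of stop codons already reported
    orfs = []
    for match in ORF_RE.finditer(sequence):
        stop = match.end(1)
        if stop in seen_stops:
            # A longer ORF ending at this stop codon was already found.
            continue
        seen_stops.add(stop)
        orf = match.group(1)
        # Subtract the start and stop codons (6 nucleotides) before counting.
        if (len(orf) - 6) // 3 >= min_codons:
            orfs.append(orf)
    return orfs

print(longest_orfs("ATGAAAATGAAATAAGTCGTCGGG"))
```

Two matches that share an end position are necessarily in the same reading frame, since each match has a length that is a multiple of three, so tracking the end position alone is enough to tell nested ORFs apart from ORFs in other frames.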