1
2
3
4
5
6
7
8 """Substitution matrices, log odds matrices, and operations on them.
9
10 General:
11 -------
12
13 This module provides a class and a few routines for generating
14 substitution matrices, similar ot BLOSUM or PAM matrices, but based on
15 user-provided data.
16 The class used for these matrices is SeqMat
17
18 Matrices are implemented as a dictionary. Each index contains a 2-tuple,
19 which are the two residue/nucleotide types replaced. The value differs
20 according to the matrix's purpose: e.g in a log-odds frequency matrix, the
21 value would be log(Pij/(Pi*Pj)) where:
22 Pij: frequency of substitution of letter (residue/nucleotide) i by j
23 Pi, Pj: expected frequencies of i and j, respectively.
24
25 Usage:
26 -----
27 The following section is laid out in the order by which most people wish
28 to generate a log-odds matrix. Of course, interim matrices can be
29 generated and investigated. Most people just want a log-odds matrix,
30 that's all.
31
32 Generating an Accepted Replacement Matrix:
33 -----------------------------------------
34 Initially, you should generate an accepted replacement matrix (ARM)
35 from your data. The values in ARM are the _counted_ number of
36 replacements according to your data. The data could be a set of pairs
37 or multiple alignments. So for instance if Alanine was replaced by
38 Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding
39 ARM entries would be:
40 ['A','C']: 10,
41 ['C','A'] 12
42 As order doesn't matter, user can already provide only one entry:
43 ['A','C']: 22
44 A SeqMat instance may be initialized with either a full (first
45 method of counting: 10, 12) or half (the latter method, 22) matrix. A
46 Full protein alphabet matrix would be of the size 20x20 = 400. A Half
47 matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because
48 same-letter entries don't change. (The matrix diagonal). Given an
49 alphabet size of N:
50 Full matrix size:N*N
51 Half matrix size: N(N+1)/2
52
53 If you provide a full matrix, the constructor will create a half-matrix
54 automatically.
55 If you provide a half-matrix, make sure of a (low, high) sorted order in
56 the keys: there should only be
57 a ('A','C') not a ('C','A').
58
59 Internal functions:
60
61 Generating the observed frequency matrix (OFM):
62 ----------------------------------------------
63 Use: OFM = _build_obs_freq_mat(ARM)
64 The OFM is generated from the ARM, only instead of replacement counts, it
65 contains replacement frequencies.
66
67 Generating an expected frequency matrix (EFM):
68 ---------------------------------------------
69 Use: EFM = _build_exp_freq_mat(OFM,exp_freq_table)
70 exp_freq_table: should be a freqTableC instantiation. See freqTable.py for
71 detailed information. Briefly, the expected frequency table has the
72 frequencies of appearance for each member of the alphabet
73
74 Generating a substitution frequency matrix (SFM):
75 ------------------------------------------------
76 Use: SFM = _build_subs_mat(OFM,EFM)
77 Accepts an OFM, EFM. Provides the division product of the corresponding
78 values.
79
80 Generating a log-odds matrix (LOM):
81 ----------------------------------
82 Use: LOM=_build_log_odds_mat(SFM[,logbase=10,factor=10.0,roundit=1])
83 Accepts an SFM. logbase: base of the logarithm used to generate the
84 log-odds values. factor: factor used to multiply the log-odds values.
85 roundit: default - true. Whether to round the values.
86 Each entry is generated by log(LOM[key])*factor
87 And rounded if required.
88
89 External:
90 ---------
91 In most cases, users will want to generate a log-odds matrix only, without
92 explicitly calling the OFM --> EFM --> SFM stages. The function
93 build_log_odds_matrix does that. User provides an ARM and an expected
94 frequency table. The function returns the log-odds matrix.
95
96 Methods for subtraction, addition and multiplication of matrices:
97 -----------------------------------------------------------------
98 * Generation of an expected frequency table from an observed frequency
99 matrix.
100 * Calculation of linear correlation coefficient between two matrices.
101 * Calculation of relative entropy is now done using the
102 _make_relative_entropy method and is stored in the member
103 self.relative_entropy
104 * Calculation of entropy is now done using the _make_entropy method and
105 is stored in the member self.entropy.
106 * Jensen-Shannon distance between the distributions from which the
107 matrices are derived. This is a distance function based on the
108 distribution's entropies.
109 """
110
111
112 import re
113 import sys
114 import copy
115 import math
116 import warnings
117
118
119 import Bio
120 from Bio import Alphabet
121 from Bio.SubsMat import FreqTable
122
123 log = math.log
124
125 NOTYPE = 0
126 ACCREP = 1
127 OBSFREQ = 2
128 SUBS = 3
129 EXPFREQ = 4
130 LO = 5
131 EPSILON = 0.00000000000001
132
133
135 """A Generic sequence matrix class
136 The key is a 2-tuple containing the letter indices of the matrix. Those
137 should be sorted in the tuple (low, high). Because each matrix is dealt
138 with as a half-matrix."""
139
141 ab_dict = {}
142 s = ''
143 for i in self:
144 ab_dict[i[0]] = 1
145 ab_dict[i[1]] = 1
146 for i in sorted(ab_dict):
147 s += i
148 self.alphabet.letters = s
149
150 - def __init__(self, data=None, alphabet=None, mat_name='', build_later=0):
198
200 keylist = self.keys()
201 for key in keylist:
202 if key[0] > key[1]:
203 self[(key[1], key[0])] = self[key]
204 del self[key]
205
207 """
208 Convert a full-matrix to a half-matrix
209 """
210
211
212
213
214 N = len(self.alphabet.letters)
215
216 if len(self) == N*(N+1)/2:
217 return
218 for i in self.ab_list:
219 for j in self.ab_list[:self.ab_list.index(i)+1]:
220 if i != j:
221 self[j, i] = self[j, i] + self[i, j]
222 del self[i, j]
223
225 for i in self.ab_list:
226 for j in self.ab_list[:self.ab_list.index(i)+1]:
227 self[j, i] = 0.
228
230 self.entropy = 0
231 for i in self:
232 if self[i] > EPSILON:
233 self.entropy += self[i]*log(self[i])/log(2)
234 self.entropy = -self.entropy
235
237 result = {}
238 for letter in self.alphabet.letters:
239 result[letter] = 0.0
240 for pair, value in self.iteritems():
241 i1, i2 = pair
242 if i1 == i2:
243 result[i1] += value
244 else:
245 result[i1] += value / 2
246 result[i2] += value / 2
247 return result
248
249 - def print_full_mat(self, f=None, format="%4d", topformat="%4s",
250 alphabet=None, factor=1, non_sym=None):
251 f = f or sys.stdout
252
253
254 assert non_sym is None or isinstance(non_sym, float) or \
255 isinstance(non_sym, int)
256 full_mat = copy.copy(self)
257 for i in self:
258 if i[0] != i[1]:
259 full_mat[(i[1], i[0])] = full_mat[i]
260 if not alphabet:
261 alphabet = self.ab_list
262 topline = ''
263 for i in alphabet:
264 topline = topline + topformat % i
265 topline = topline + '\n'
266 f.write(topline)
267 for i in alphabet:
268 outline = i
269 for j in alphabet:
270 if alphabet.index(j) > alphabet.index(i) and non_sym is not None:
271 val = non_sym
272 else:
273 val = full_mat[i, j]
274 val *= factor
275 if val <= -999:
276 cur_str = ' ND'
277 else:
278 cur_str = format % val
279
280 outline = outline+cur_str
281 outline = outline + '\n'
282 f.write(outline)
283
284 - def print_mat(self, f=None, format="%4d", bottomformat="%4s",
285 alphabet=None, factor=1):
286 """Print a nice half-matrix. f=sys.stdout to see on the screen
287 User may pass own alphabet, which should contain all letters in the
288 alphabet of the matrix, but may be in a different order. This
289 order will be the order of the letters on the axes"""
290
291 f = f or sys.stdout
292 if not alphabet:
293 alphabet = self.ab_list
294 bottomline = ''
295 for i in alphabet:
296 bottomline = bottomline + bottomformat % i
297 bottomline = bottomline + '\n'
298 for i in alphabet:
299 outline = i
300 for j in alphabet[:alphabet.index(i)+1]:
301 try:
302 val = self[j, i]
303 except KeyError:
304 val = self[i, j]
305 val *= factor
306 if val == -999:
307 cur_str = ' ND'
308 else:
309 cur_str = format % val
310
311 outline = outline + cur_str
312 outline = outline + '\n'
313 f.write(outline)
314 f.write(bottomline)
315
317 """Print a nice half-matrix."""
318 output = ""
319 alphabet = self.ab_list
320 n = len(alphabet)
321 for i in range(n):
322 c1 = alphabet[i]
323 output += c1
324 for j in range(i+1):
325 c2 = alphabet[j]
326 try:
327 val = self[c2, c1]
328 except KeyError:
329 val = self[c1, c2]
330 if val == -999:
331 output += ' ND'
332 else:
333 output += "%4d" % val
334 output += '\n'
335 output += '%4s' * n % tuple(alphabet) + "\n"
336 return output
337
339 """ returns a number which is the subtraction product of the two matrices"""
340 mat_diff = 0
341 for i in self:
342 mat_diff += (self[i] - other[i])
343 return mat_diff
344
346 """ returns a matrix for which each entry is the multiplication product of the
347 two matrices passed"""
348 new_mat = copy.copy(self)
349 for i in self:
350 new_mat[i] *= other[i]
351 return new_mat
352
354 new_mat = copy.copy(self)
355 for i in self:
356 new_mat[i] += other[i]
357 return new_mat
358
359
361 """Accepted replacements matrix"""
362
363
365 """Observed frequency matrix"""
366
367
369 """Expected frequency matrix"""
370
371
373 """Substitution matrix"""
374
376 """Calculate and return the relative entropy with respect to an
377 observed frequency matrix"""
378 relative_entropy = 0.
379 for key, value in self.iteritems():
380 if value > EPSILON:
381 relative_entropy += obs_freq_mat[key] * log(value)
382 relative_entropy /= log(2)
383 return relative_entropy
384
385
387 """Log odds matrix"""
388
390 """Calculate and return the relative entropy with respect to an
391 observed frequency matrix"""
392 relative_entropy = 0.
393 for key, value in self.iteritems():
394 relative_entropy += obs_freq_mat[key] * value / log(2)
395 return relative_entropy
396
397
399 """
400 build_obs_freq_mat(acc_rep_mat):
401 Build the observed frequency matrix, from an accepted replacements matrix
402 The acc_rep_mat matrix should be generated by the user.
403 """
404
405 total = float(sum(acc_rep_mat.values()))
406 obs_freq_mat = ObservedFrequencyMatrix(alphabet=acc_rep_mat.alphabet,
407 build_later=1)
408 for i in acc_rep_mat:
409 obs_freq_mat[i] = acc_rep_mat[i] / total
410 return obs_freq_mat
411
412
414 exp_freq_table = {}
415 for i in obs_freq_mat.alphabet.letters:
416 exp_freq_table[i] = 0.
417 for i in obs_freq_mat:
418 if i[0] == i[1]:
419 exp_freq_table[i[0]] += obs_freq_mat[i]
420 else:
421 exp_freq_table[i[0]] += obs_freq_mat[i] / 2.
422 exp_freq_table[i[1]] += obs_freq_mat[i] / 2.
423 return FreqTable.FreqTable(exp_freq_table, FreqTable.FREQ)
424
425
427 """Build an expected frequency matrix
428 exp_freq_table: should be a FreqTable instance
429 """
430 exp_freq_mat = ExpectedFrequencyMatrix(alphabet=exp_freq_table.alphabet,
431 build_later=1)
432 for i in exp_freq_mat:
433 if i[0] == i[1]:
434 exp_freq_mat[i] = exp_freq_table[i[0]]**2
435 else:
436 exp_freq_mat[i] = 2.0*exp_freq_table[i[0]]*exp_freq_table[i[1]]
437 return exp_freq_mat
438
439
440
441
442
444 """ Build the substitution matrix """
445 if obs_freq_mat.ab_list != exp_freq_mat.ab_list:
446 raise ValueError("Alphabet mismatch in passed matrices")
447 subs_mat = SubstitutionMatrix(obs_freq_mat)
448 for i in obs_freq_mat:
449 subs_mat[i] = obs_freq_mat[i]/exp_freq_mat[i]
450 return subs_mat
451
452
453
454
455
457 """_build_log_odds_mat(subs_mat,logbase=10,factor=10.0,round_digit=1):
458 Build a log-odds matrix
459 logbase=2: base of logarithm used to build (default 2)
460 factor=10.: a factor by which each matrix entry is multiplied
461 round_digit: roundoff place after decimal point
462 keep_nd: if true, keeps the -999 value for non-determined values (for which there
463 are no substitutions in the frequency substitutions matrix). If false, plants the
464 minimum log-odds value of the matrix in entries containing -999
465 """
466 lo_mat = LogOddsMatrix(subs_mat)
467 for key, value in subs_mat.iteritems():
468 if value < EPSILON:
469 lo_mat[key] = -999
470 else:
471 lo_mat[key] = round(factor*log(value)/log(logbase), round_digit)
472 mat_min = min(lo_mat.values())
473 if not keep_nd:
474 for i in lo_mat:
475 if lo_mat[i] <= -999:
476 lo_mat[i] = mat_min
477 return lo_mat
478
479
480
481
482
483
484
485
486 -def make_log_odds_matrix(acc_rep_mat, exp_freq_table=None, logbase=2,
487 factor=1., round_digit=9, keep_nd=0):
495
496
502
503
504 -def read_text_matrix(data_file):
505 matrix = {}
506 tmp = data_file.read().split("\n")
507 table=[]
508 for i in tmp:
509 table.append(i.split())
510
511 for rec in table[:]:
512 if (rec.count('#') > 0):
513 table.remove(rec)
514
515
516 while (table.count([]) > 0):
517 table.remove([])
518
519 alphabet = table[0]
520 j = 0
521 for rec in table[1:]:
522
523 row = alphabet[j]
524
525 if re.compile('[A-z\*]').match(rec[0]):
526 first_col = 1
527 else:
528 first_col = 0
529 i = 0
530 for field in rec[first_col:]:
531 col = alphabet[i]
532 matrix[(row, col)] = float(field)
533 i += 1
534 j += 1
535
536 for i in matrix.keys():
537 if '*' in i:
538 del(matrix[i])
539 ret_mat = SeqMat(matrix)
540 return ret_mat
541
542 diagNO = 1
543 diagONLY = 2
544 diagALL = 3
545
546
548 rel_ent = 0.
549 key_list_1 = sorted(mat_1)
550 key_list_2 = sorted(mat_2)
551 key_list = []
552 sum_ent_1 = 0.
553 sum_ent_2 = 0.
554 for i in key_list_1:
555 if i in key_list_2:
556 key_list.append(i)
557 if len(key_list_1) != len(key_list_2):
558 sys.stderr.write("Warning: first matrix has more entries than the second\n")
559 if key_list_1 != key_list_2:
560 sys.stderr.write("Warning: indices not the same between matrices\n")
561 for key in key_list:
562 if diag == diagNO and key[0] == key[1]:
563 continue
564 if diag == diagONLY and key[0] != key[1]:
565 continue
566 if mat_1[key] > EPSILON and mat_2[key] > EPSILON:
567 sum_ent_1 += mat_1[key]
568 sum_ent_2 += mat_2[key]
569
570 for key in key_list:
571 if diag == diagNO and key[0] == key[1]:
572 continue
573 if diag == diagONLY and key[0] != key[1]:
574 continue
575 if mat_1[key] > EPSILON and mat_2[key] > EPSILON:
576 val_1 = mat_1[key] / sum_ent_1
577 val_2 = mat_2[key] / sum_ent_2
578
579 rel_ent += val_1 * log(val_1/val_2)/log(logbase)
580 return rel_ent
581
582
583
585 try:
586 import numpy
587 except ImportError:
588 raise ImportError("Please install Numerical Python (numpy) if you want to use this function")
589 values = []
590 assert mat_1.ab_list == mat_2.ab_list
591 for ab_pair in mat_1:
592 try:
593 values.append((mat_1[ab_pair], mat_2[ab_pair]))
594 except KeyError:
595 raise ValueError("%s is not a common key" % ab_pair)
596 correlation_matrix = numpy.corrcoef(values, rowvar=0)
597 correlation = correlation_matrix[0, 1]
598 return correlation
599
600
601
602
604 assert mat_1.ab_list == mat_2.ab_list
605 assert pi_1 > 0 and pi_2 > 0 and pi_1< 1 and pi_2 <1
606 assert not (pi_1 + pi_2 - 1.0 > EPSILON)
607 sum_mat = SeqMat(build_later=1)
608 sum_mat.ab_list = mat_1.ab_list
609 for i in mat_1:
610 sum_mat[i] = pi_1 * mat_1[i] + pi_2 * mat_2[i]
611 sum_mat.make_entropy()
612 mat_1.make_entropy()
613 mat_2.make_entropy()
614
615 dJS = sum_mat.entropy - pi_1 * mat_1.entropy - pi_2 * mat_2.entropy
616 return dJS
617
618 """
619 This isn't working yet. Boo hoo!
620 def two_mat_print(mat_1, mat_2, f=None, alphabet=None, factor_1=1, factor_2=1,
621 format="%4d", bottomformat="%4s", topformat="%4s",
622 topindent=7*" ", bottomindent=1*" "):
623 f = f or sys.stdout
624 if not alphabet:
625 assert mat_1.ab_list == mat_2.ab_list
626 alphabet = mat_1.ab_list
627 len_alphabet = len(alphabet)
628 print_mat = {}
629 topline = topindent
630 bottomline = bottomindent
631 for i in alphabet:
632 bottomline += bottomformat % i
633 topline += topformat % alphabet[len_alphabet-alphabet.index(i)-1]
634 topline += '\n'
635 bottomline += '\n'
636 f.write(topline)
637 for i in alphabet:
638 for j in alphabet:
639 print_mat[i, j] = -999
640 diag_1 = {}
641 diag_2 = {}
642 for i in alphabet:
643 for j in alphabet[:alphabet.index(i)+1]:
644 if i == j:
645 diag_1[i] = mat_1[(i, i)]
646 diag_2[i] = mat_2[(alphabet[len_alphabet-alphabet.index(i)-1],
647 alphabet[len_alphabet-alphabet.index(i)-1])]
648 else:
649 if i > j:
650 key = (j, i)
651 else:
652 key = (i, j)
653 mat_2_key = [alphabet[len_alphabet-alphabet.index(key[0])-1],
654 alphabet[len_alphabet-alphabet.index(key[1])-1]]
655 # print mat_2_key
656 mat_2_key.sort()
657 mat_2_key = tuple(mat_2_key)
658 # print key, "||", mat_2_key
659 print_mat[key] = mat_2[mat_2_key]
660 print_mat[(key[1], key[0])] = mat_1[key]
661 for i in alphabet:
662 outline = i
663 for j in alphabet:
664 if i == j:
665 if diag_1[i] == -999:
666 val_1 = ' ND'
667 else:
668 val_1 = format % (diag_1[i]*factor_1)
669 if diag_2[i] == -999:
670 val_2 = ' ND'
671 else:
672 val_2 = format % (diag_2[i]*factor_2)
673 cur_str = val_1 + " " + val_2
674 else:
675 if print_mat[(i, j)] == -999:
676 val = ' ND'
677 elif alphabet.index(i) > alphabet.index(j):
678 val = format % (print_mat[(i, j)]*factor_1)
679 else:
680 val = format % (print_mat[(i, j)]*factor_2)
681 cur_str = val
682 outline += cur_str
683 outline += bottomformat % (alphabet[len_alphabet-alphabet.index(i)-1] +
684 '\n')
685 f.write(outline)
686 f.write(bottomline)
687 """
688