LINUX & PERL, 学习和分析生物学信息的电脑工具

ArticleCategory:

Software Development

AuthorImage:

La Foto

TranslationInfo:

original in es Carlos Andrés Pérez

en to CN 汪 君

AboutTheAuthor:

Carlos Andrés Pérez 是分子模拟的专家,生物学博士,GIEV 的技术顾问(GIEV, the Grupo de Investigación en Educación Virtual (GIEV) - Research Group in Virtual Learning,即虚拟学习技术研究小组)。地址: Universidad Santiago de Cali, Calle 5ª carrera 62 Campus Pampalinda, Cali – Colombia.

Abstract:

这篇文章介绍了对DNA、RNA和蛋白质序列数据库的生物信息提取时,在Unix上的Perl程序的一些优点。这些Perl程序可用来作数据比对处理和分析。人类基因组计划和DNA克隆技术的发展加速了这个领域的进步。这些领域每天产生的大量信息使得我们处理这些信息的方式不得不有所改进。

不同基因组(生物有机体上的一组基因)上激增的生物信息推动着生物信息学成为处理分析这些数据的基础手段。

ArticleIllustration:

[Illustration]

ArticleBody:

生物信息学(Bioinformatics)

生物信息学开始于科学家们将生物学数据以数字格式存放并且用程序来处理这些数据。很长一段时间以来,生物信息学都限制在序列的分析上。然而,随着构建分子的结构模型的重要性开始显现,电子计算机也开始成为理论生物化学的重要工具。每天都不断有关于分子3D信息和数据被采集,人们对基因的认识和研究也从单个的基因研究转变为从整体上或者扩展式的研究。由于生物信息学的发展,现在更容易理解蛋白质之间为何相互之间那样作用、又是如何通过新陈代谢来组织相互的。而且我们现在也越来越清醒的认识到组织好这些数据的重要性。

生物信息学里面至少有两点特征使她变得非常有趣。其一,生物信息学的研究目标是找出各种生命分子的关系;而这个目标恰恰是一个有趣的程序设计问题,因为这就需要我们联合并整合我们得到的那些信息,然后从中得到对生命活动的一些整体的和有效的一些认识。我们还发现,将计算机科学中的不同领域的知识结合起来是非常必要的,比如数据的管理和整合、高效可靠的算法和强劲的硬件--格点技术、多处理器的使用等。

Perl

Larry Wall 于1986年开始开发Perl。 Perl是一种解释型的语言,是处理文本、文件和进程的强大的工具。Perl使得我们能够很快的开发出小程序。可以说,Perl是高级编程语言(例如C)和脚本语言(如bash)的一种有效组合。

Perl程序可以运行在多种操作系统/平台上,尽管Perl是在Unix上诞生并且快速发展的。由于Perl广泛的用于web程序设计,其发展很快便超出了其预想。在Perl之前,人们使用awk,thirstgrep 来分析文件并提取信息。

Perl将这些UNIX上广泛使用的工具统一在一个程序里面,并将这些功能扩展和现代化以适应各种需求。

Perl是一种免费/自由的程序语言,可以运行在现代生物实验室里使用的各种操作系统上。在UNIX和MacOSX上,它是预安装好的,在其他系统上,得先安装好Perl。http://www.cpan.org 网站上有安装和使用Perl的很多实用信息。

在Linux下,运行Perl程序,是将这个程序的文件名作为perl 这个命令的一个参数,然后perl 会依次解释执行这个程序里的命令。

另一种常用的方法,不需要运行perl 这个命令,为此,我们需要做以下两件事: (a)在程序的文件里加入一行特殊的注释:

#!/usr/bin/env perl
print "Hi\n";

(b) 保存此文件并给它加上可执行的属性:

% chmod +x greetings.pl

这样,我们就可以直接通过文件名来运行这个程序:

% ./greetings.pl

用Perl来文件管理:

当我们有了文本格式的分子序列,我们可以用Perl写一个序列搜索工具。下面的例子我们可以看到如何在SWISS-PROT(db_human_swissprot)格式的数据库中用id码来查找蛋白质序列。

#!/usr/bin/perl
# Look for aminoacid sequence in a database
# SWISS-PROT formated, with a given id code
# Ask for the code in the ID field
# and it assigns it from the input(STDIN)to a variable
print "Enter the ID to search: "; $id_query=<STDIN>; chomp $id_query; # We open the database file
# but if it isn't possible the program ends
open (db, "human_kinases_swissprot.txt") || die "problem opening the file human_kinases_swissprot.txt\n"; # Look line by line in the database
while (<db>) { chomp $_; # Check if we are in the ID field if ($_ =~ /^ID/) { # If it is possitive we gather the information
# breaking the line by spaces
($a1,$id_db) = split (/\s+/,$_); # but if there is no coincidence of ID we continue to the following
next if ($id_db ne $id_query); # When they coincide, we put a mark
$signal_good=1; # Then we check the sequence field
# and if the mark is 1 (chosen sequence) # If possitive, we change the mark to 2,to collect the sequence
} elsif (($_ =~ /^SQ/) && ($signal_good==1)) { $signal_good=2; # Finally, if the mark is 2, we present each line
# of the sequence, until the line begins with // # is such case we broke the while } elsif ($signal_good == 2) { last if ($_ =~ /^\/\//); print "$_\n"; } } # When we left the while instruction we check the mark
# if negative that means that we don't find the chosen sequence
# that will give us an error
if (!$signal_good) { print "ERROR: "."Sequence not found\n"; } # Finally, we close the file # that still si open
close (db); exit;

查找氨基酸的模式(Search for aminoacid patterns)

#!/usr/bin/perl
# Searcher for aminoacid patterns
# Ask the user the patterns for search
print "Please, introduce the pattern to search in query.seq: ";
$patron = <STDIN>;
chomp $patron;
# Open the database file
# but if it can't it ends the program
open (query, "query_seq.txt") || die "problem opening the file query_seq.txt\n";
# Look line by line the SWISS-PROT sequence
while (<query>) {
chomp $_;
# When arrives to the SQ field,put the mark in 1
if ($_ =~ /^SQ/) {
$signal_seq = 1; # When arrive to the end of sequence, leave the curl
# Check that this expression is put before to check
# the mark=1,because this line doesn't belong to the aminoacid sequence
} elsif ($_ =~ /^\/\//) {
last; # Check the mark if it is equal to 1, if possitive
# eliminate the blank spaces in the sequence line
# and join every line in a new variable
# To concatenate, we also can do:
# $secuencia_total.=$_;
} elsif ($signal_seq == 1) {
$_ =~ s/ //g;
$secuencia_total=$secuencia_total.$_;
}
} # Now check the sequence, collected in its entirety,
# for the given pattern
if ($secuencia_total =~ /$patron/) {
print "The sequence query.seq contains the pattern $patron\n";
} else {
print "The sequence query.seq doesn't contains the pattern $patron\n";
} # Finally we close the file
# and leave the program
close (query);
exit;

如果想知道数据库里模式的具体位置,我们必须使用特殊变量`$&',这个变量在对正则表达式求值后仍然保存着找到的模式(应该将它放在`if ($$secuencia_total>= ~/$$patron>/ 一句的后面)。另外,可以将变量` $ ` ' 和` $ ´ '组合起来使用,它们会将找到的模式的左右位置的信息保存。将这些变量正确的加入前面的程序中,我们就可以给出模式的确切位置。注意:length也是非常有用的,它会给出一串数据的长度。

 

# Only we need to change the if where the pattern was found
# Now check the sequence, collected in its entirety,
# for the given pattern
# and check its position in the sequence
if ($secuencia_total =~ /$patron/) {
$posicion=length($`)+1;
print "The sequence query_seq.txt contains the pattern $patron in the following position $posicion\n"; } else {
print "The sequence query_seq.txt doesn't contains the pattern $patron\n";
}

计算氨基酸的频度(Calculus of aminoacid frequences):

不同蛋白质里,特定的氨基酸出现的频度是不同的,这是因为他们处在不同的环境里面、并且功能不同。下面,我们给出一个例子来展示如何计算给定氨基酸序列里某种氨基酸频度。


#!/usr/bin/perl # Calculates the frequency of aminoacid in a proteinic sequence # Gets the file name from the command line # (SWISS-PROT formatted) # Also can be asked with print from the <STDIN> if (!$ARGV[0]) {print "The execution line shall be: program.pl file_swissprot\n";} $fichero = $ARGV[0]; # Initialize the variable $errores my $errores=0; # Open the file for reading open (FICHA, "$fichero") || die "problem opening the file $fichero\n"; # First we check the sequence as did in the example 2 while (<FICHA>) { chomp $_; if ($_ =~ /^SQ/) { $signal_good = 1; } elsif ($signal_good == 1) { last if ($_ =~ /^\/\//); $_ =~ s/\s//g; $secuencia.=$_; } } close (FICHA); # Now use a curl that checks every position of the aminoacid # in the sequence (from a funcion of its own,that can be used after in other # programs) comprueba_aa ($secuencia); # Print the results to the screen # First the 20 aminoacids and then the array with their frequencies # In this case 'sort' can't be used in foreach, # because the array contains the frequencies (numbers) print"A\tC\tD\tE\tF\tG\tH\tI\tK\tL\tM\tN\tP\tQ\tR\tS\tT\tV\tW\tY\n"; foreach $each_aa (@aa) { print "$each_aa\t"; } # Ten it gives the possible errors # and ends the program print "\nerrores = $errores\n"; exit; # Functions # This one calculates each aminoacid frequency # from a proteinic sequence sub comprueba_aa { # Gets the sequence my ($secuencia)=@_; # and runs aminoacid by aminoacid, using a for running # from 0 until the sequence length for ($posicion=0 ; $posicion<length $secuencia ; $posicion++ ) { # Gets the aminoacid $aa = substr($secuencia, $posicion, 1); # and checks which one is using if # when it is checked it aggregates 1 to the correspondant frequency # in an array using a pointer for each one # ordered in alphabetic way if ( $aa eq 'A' ) { $aa[0]++; } elsif ( $aa eq 'C' ) { $aa[1]++; } elsif ( $aa eq 'D' ) { $aa[2]++; } elsif ( $aa eq 'E' ) { $aa[3]++; } elsif ( $aa eq 'F' ) { $aa[4]++; } elsif ( $aa eq 'G' ) { $aa[5]++; } elsif ( $aa eq 'H' ) { $aa[6]++; } elsif ( $aa eq 'I' ) { $aa[7]++; } elsif ( $aa eq 'K' ) { $aa[8]++; } elsif ( $aa eq 'L' ) { $aa[9]++; } elsif ( $aa eq 'M' ) { $aa[10]++; } elsif ( $aa eq 'N' ) { $aa[11]++; } elsif ( $aa eq 'P' ) { $aa[12]++; } elsif ( $aa eq 'Q' ) { $aa[13]++; } elsif ( $aa eq 'R' ) { $aa[14]++; } elsif ( $aa eq 'S' ) { $aa[15]++; } elsif ( $aa eq 'T' ) { $aa[16]++; } elsif ( $aa eq 'V' ) { $aa[17]++; } elsif ( $aa eq 'W' ) { $aa[18]++; } elsif ( $aa eq 'Y' ) { $aa[19]++; # If the aminoacid is not found # it aggregates 1 to the errors } else { print "ERROR: Aminoacid not found: $aa\n"; $errores++; } } # Finally returns to the frequency array return @aa; }

下面就让我们跟着大自然的步伐,看看细胞中的信息流向了何方。其中之一就是转录,RNA 从DNA(基因)中复制出遗传信息,然后又将这些信息传递给蛋白质或者氨基酸序列。为此,我们必须使用与氨基酸对应的基因密码--所谓的RNA/DNA三联密码子。我们要提取Escherichia coli(一种埃[舍利]希氏杆菌属的大肠杆菌) 的基因所对应的氨基酸序列,而这些信息都是以EMBL(European Molecular Biology Laboratory)要求的格式。做完这些转换之后,我们将与已有的转录信息校验。对这个例子,非常有必要引进数组的关联变量(associative variables of arrays)和哈希表。


#!/usr/bin/perl # Translates an ADN sequence from an EMBL fiche # to the aminoacid correspondant # Gets the file name from the command line # (SWISS-PROT formatted) # Also can be asked with print from the <STDIN> if (!$ARGV[0]) {print "The program line shall be: program.pl ficha_embl\n";} $fichero = $ARGV[0]; # Open the file for reading open (FICHA, "$fichero") || die "problem opening the file $fichero\n"; # First we check the sequence as did in the example 2 while (<FICHA>) { chomp $_; if ($_ =~ /^FT CDS/) { $_ =~ tr/../ /; ($a1,$a2,$a3,$a4) = split (" ",$_); } elsif ($_ =~ /^SQ/) { $signal_good = 1; } elsif ($signal_good == 1) { last if ($_ =~ /^\/\//); # Eliminate numbers and spaces $_ =~ tr/0-9/ /; $_ =~ s/\s//g; $secuencia.=$_; } } close (FICHA); # Now we define an associate array with the correpondence # of every aminoacids with their nucleotide # correspondants (also in an own function, # for if the same genetic code is used in other program my(%codigo_genetico) = ( 'TCA' => 'S',# Serine 'TCC' => 'S',# Serine 'TCG' => 'S',# Serine 'TCT' => 'S',# Serine 'TTC' => 'F',# Fenilalanine 'TTT' => 'F',# Fenilalanine 'TTA' => 'L',# Leucine 'TTG' => 'L',# Leucine 'TAC' => 'Y',# Tirosine 'TAT' => 'Y',# Tirosine 'TAA' => '*',# Stop 'TAG' => '*',# Stop 'TGC' => 'C',# Cysteine 'TGT' => 'C',# Cysteine 'TGA' => '*',# Stop 'TGG' => 'W',# Tryptofane 'CTA' => 'L',# Leucine 'CTC' => 'L',# Leucine 'CTG' => 'L',# Leucine 'CTT' => 'L',# Leucine 'CCA' => 'P',# Proline 'CCC' => 'P',# Proline 'CCG' => 'P',# Proline 'CCT' => 'P',# Proline 'CAC' => 'H',# Hystidine 'CAT' => 'H',# Hystidine 'CAA' => 'Q',# Glutamine 'CAG' => 'Q',# Glutamine 'CGA' => 'R',# Arginine 'CGC' => 'R',# Arginine 'CGG' => 'R',# Arginine 'CGT' => 'R',# Arginine 'ATA' => 'I',# IsoLeucine 'ATC' => 'I',# IsoLeucine 'ATT' => 'I',# IsoLeucine 'ATG' => 'M',# Methionina 'ACA' => 'T',# Treonina 'ACC' => 'T',# Treonina 'ACG' => 'T',# Treonina 'ACT' => 'T',# Treonina 'AAC' => 'N',# Asparagina 'AAT' => 'N',# Asparagina 'AAA' => 'K',# Lisina 'AAG' => 'K',# Lisina 'AGC' => 'S',# Serine 'AGT' => 'S',# Serine 'AGA' => 'R',# Arginine 'AGG' => 'R',# Arginine 'GTA' => 'V',# Valine 'GTC' => 'V',# Valine 'GTG' => 'V',# Valine 'GTT' => 'V',# Valine 'GCA' => 'A',# Alanine 'GCC' => 'A',# Alanine 'GCG' => 'A',# Alanine 'GCT' => 'A',# Alanine 'GAC' => 'D',# Aspartic Acid 'GAT' => 'D',# Aspartic Acid 'GAA' => 'E',# Glutamic Acid 'GAG' => 'E',# Glutamic Acid 'GGA' => 'G',# Glicine 'GGC' => 'G',# Glicine 'GGG' => 'G',# Glicine 'GGT' => 'G',# Glicine ); # Translate every codon in its correspondant aminoacid # and aggregates to the proteinic sequence print $a3; for($i=$a3 - 1; $i < $a4 - 3 ; $i += 3) { $codon = substr($secuencia,$i,3); # Pass the codon from subcase (EMBL format) to uppercase $codon =~ tr/a-z/A-Z/; $protein.= codon2aa($codon); } print "This proteinic sequence of the gen:\n$secuencia\nis the following:\n$protein\n\n"; exit;

Bibliographic References