
I have data saved in a .txt file like the following:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK

I am trying to split each sequence into as many 10-letter regions as possible.

For example, the rows that start with > remain the same, and the output becomes like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVC
NDDDDTSVCL
DDDDTSVCLG
DDDTSVCLGT
.
.
.
.

I can easily split the data into regions of, e.g., 10 letters using the code below. However, I don't want to hard-code the number of letters inside the script. I want to be able to choose whatever number I want when I run it. I am trying to use Getopt::Std. Can anyone help me find a way to do that?

For example, I want to run the code like this:

perl script.pl data.txt 10

use warnings;
use strict;
use Getopt::Std

unless (defined $DESIRED_LENGTH and $DESIRED_LENGTH =~ /^[0-9]+$/) {
my $DESIRED_LENGTH ;
while (<>) {
    chomp; # remove trailing newline
    if (m/^>/) {          # if line starts with '>'
        print "$_\n";     # just print it
    } else {
        my $i = 0;
        while ($i + $DESIRED_LENGTH <= length($_)) {
            print substr($_, $i, $DESIRED_LENGTH);
            print "\n";
            $i++;
        }
    }
}
}

I also tried this:

use warnings;
use strict;
use Getopt::Std

getopts('i');
our($opt_i)
my $DESIRED_LENGTH = $opt_i;
while (<>) {
    chomp; # remove trailing newline
    if (m/^>/) {          # if line starts with '>'
        print "$_\n";     # just print it
    } else {
        my $i = 0;
        while ($i + $DESIRED_LENGTH <= length($_)) {
            print substr($_, $i, $DESIRED_LENGTH);
            print "\n";
            $i++;
        }
    }
}
Learner
  • The first one will raise errors because it violates `strict` requirements. What does the second one do after you fix the errors in it? – Shawn Apr 06 '19 at 00:18
  • Also, `Getopt::Long` is preferable. Much more user friendly and plays better with strict. – Shawn Apr 06 '19 at 00:19
  • @Shawn Can you please give me a solution? I have been working on this simple problem for 1 day and I could not figure it out – Learner Apr 06 '19 at 00:24
  • `say substr($_, 0, $desired_length, "") while length($_);` – ikegami Apr 06 '19 at 00:30
  • @ikegami do you think the rest of the code is okay ? – Learner Apr 06 '19 at 00:32
  • Is there a problem, or are you looking for a code review? There's a separate site for the latter. – ikegami Apr 06 '19 at 00:36
  • @ikegami I basically cannot get this to run! On which website should I ask the question? – Learner Apr 06 '19 at 00:39
  • You haven't said what your actual problem is yet. Errors? Not the expected output (if so, what *are* you getting?), etc – Shawn Apr 06 '19 at 00:44

2 Answers

  • You're missing a couple of semi-colons.
  • You didn't ensure that -i was provided or provide a default for when it wasn't.
  • You didn't tell getopts that the -i option expected a parameter.
  • You didn't validate the provided length.
  • You were incrementing $i by 1 instead of by how much you had already printed.
  • You were cutting off the end of every sequence unless they happened to be an exact multiple of the specified length. This could lead to entire sequences being lost if they were short enough.
  • -i is a weird choice for a length, but maybe you're trying to be consistent with another tool?
  • You were chomping lines that start with > only to add the line feed right back.

Fixed:

use warnings;
use strict;
use feature qw( say );

use Getopt::Std;

our $opt_i;
getopts('i:');
die("Illegal value for -i\n") if defined($opt_i) && $opt_i !~ /^[1-9][0-9]*\z/;

my $max_len = $opt_i // 70;

while (<>) {
    if (/^>/) {
        print;
    } else {
        chomp;
        while (length($_)) {
           say substr($_, 0, $max_len, "");
        }
    }
}

Of course, if the sequences in the file were previously wrapped to a line length shorter than the specified length, the above program doesn't extend them to the desired length. The following version joins the lines of each sequence before re-wrapping them:

use warnings;
use strict;
use feature qw( say );

use Getopt::Std;

our $opt_i;
getopts('i:');
die("Illegal value for -i\n") if defined($opt_i) && $opt_i !~ /^[1-9][0-9]*\z/;

my $desired_len = $opt_i // 70;

my $seq = "";
while (1) {
   my $line = <>;
   if (!defined($line) || $line =~ /^>/) {
      while (length($seq)) {
         say substr($seq, 0, $desired_len, "");
      }

      last if !defined($line);

      print($line);
      $seq = "";
   } else {
      chomp($line);
      $seq .= $line;
   }
}
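
A comment on the question recommends Getopt::Long over Getopt::Std. As a rough sketch of how the same option handling might look with it: the helper name `parse_length` and the option name `--length` are made up for illustration, and the default of 70 simply mirrors the programs above.

```perl
use strict;
use warnings;
use Getopt::Long qw( GetOptionsFromArray );

# Hypothetical helper: pull a --length option out of an argument list.
# Returns the chosen length followed by the remaining arguments (file names).
sub parse_length {
    my @args = @_;
    my $len  = 70;    # assumed default, same as the programs above

    # 'length=i' declares an option that requires an integer value.
    GetOptionsFromArray(\@args, 'length=i' => \$len)
        or die "Usage: $0 [--length N] [file ...]\n";

    die "--length must be a positive integer\n" if $len < 1;
    return ($len, @args);
}
```

In a full script this could be called as `my ($len, @files) = parse_length(@ARGV); @ARGV = @files;` before the `while` loop, so that `<>` only sees the file names.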
ikegami
  • I get this error `perl second_split.pl data.txt 10 Global symbol "$i" requires explicit package name at second_split.pl line 10. Execution of second_split.pl aborted due to compilation errors.` – Learner Apr 06 '19 at 01:26
  • Surely you've seen that error before and know what it means. Or googled it. We'll help you, but you gotta start putting in some effort! – ikegami Apr 06 '19 at 02:13
  • since you posted your answer, I tried to find the problem and solve it myself. for example here https://stackoverflow.com/questions/26226034/perl-global-symbol-requires-explicit-package-name-error but I don't see any error with () which I should change to {} and the `I` is out of `while` loop https://stackoverflow.com/questions/23854436/global-symbol-i-requires-explicit-package-name – Learner Apr 06 '19 at 02:16
  • I am having some issues. I will try to solve them, and if it works then I will accept and like your answer. I ran your code like this: `perl second_split.pl data.txt -I 10` but I get an error. Thanks for all your help – Learner Apr 06 '19 at 02:33
  • I accept it. My own code prints the lines that start with `>` but yours discards them. Is it possible to add that as well? – Learner Apr 06 '19 at 02:42
  • Oops, I accidentally deleted the `print` from my second solution. – ikegami Apr 06 '19 at 02:45
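
The invocation quoted in the comments, `perl second_split.pl data.txt -I 10`, differs from what `getopts('i:')` expects in two ways: the switch is lowercase `-i`, and Getopt::Std stops looking for switches at the first argument that is not one. A small sketch of the second point follows; the helper name `split_args` is made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: run getopts('i:') on a copy of an argument list and
# return the value of -i plus whatever is left over for <> to read.
sub split_args {
    local @ARGV = @_;
    my %opt;
    getopts('i:', \%opt);
    return ($opt{i}, [@ARGV]);
}

# Switch first: -i receives 10 and only the file name is left over.
my ($len, $rest) = split_args('-i', '10', 'data.txt');

# File name first: parsing stops immediately, so -i is never seen and
# '-i' and '10' remain in the list, where <> would treat them as file names.
my ($len2, $rest2) = split_args('data.txt', '-i', '10');
```

So with the programs above, the switch should come before the file name: `perl second_split.pl -i 10 data.txt`.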

If your data is in a file named 'd':

perl -ne 'if(/^>s.+\n/) {chomp; $_.="\n".<>; /^(>s.+\n)(\w+\n)/; print "\n$1"; print substr $2,$_ for 0..9}' d
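
The sample output in the question shows overlapping windows that advance one letter at a time. A minimal sketch of that behaviour, assuming overlapping windows really are what is wanted, is given below; the helper name `windows` and its optional `$step` argument are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every window of $len letters in $seq,
# advancing $step letters each time (a step of 1 gives overlapping windows).
sub windows {
    my ($seq, $len, $step) = @_;
    $step //= 1;
    my @out;
    for (my $i = 0; $i + $len <= length($seq); $i += $step) {
        push @out, substr($seq, $i, $len);
    }
    return @out;
}

# The first 13 letters of the first sequence in the question give 4 windows.
print "$_\n" for windows('RNDDDDTSVCLGT', 10);
```

Sequences shorter than the window length produce no output here. To run it as `perl script.pl data.txt 10`, the length could be taken off the end of the argument list with `pop @ARGV` before a `while (<>)` loop, so that `<>` only sees the file name.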