Random mutagenesis with bash

I have a string e.g.

1234567890

and I want to replace the characters at random positions of that string with the character at the same position in a randomly chosen sequence from a set of other strings, e.g.

ABCDEFGHIJ

KLMNOPQRST

UVWXYZABCD

...

If I choose to make 3 replacements, the script should choose 3 random positions, e.g. 3, 7, 8, and 3 random sequences, e.g. 1, 1, 3, then make the replacements to generate the expected output:

12C456GB90
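For reference, the worked example above can be reproduced with fixed (non-random) choices. This is only a minimal bash sketch: the function name replace_at and the array names are illustrative and not part of the original post.

```shell
#!/bin/bash
# replace_at SEQ POS DONOR: print SEQ with the character at 1-based POS
# replaced by the character at the same position in DONOR.
replace_at() {
    local seq=$1 pos=$2 donor=$3
    local i=$((pos - 1))    # bash substring offsets are 0-based
    printf '%s\n' "${seq:0:i}${donor:i:1}${seq:i+1}"
}

donors=(ABCDEFGHIJ KLMNOPQRST UVWXYZABCD)
seq=1234567890
positions=(3 7 8)    # the positions from the example
lines=(1 1 3)        # the donor sequences from the example (1-based)

for k in 0 1 2; do
    seq=$(replace_at "$seq" "${positions[k]}" "${donors[lines[k]-1]}")
done
echo "$seq"    # prints 12C456GB90
```

Swapping the fixed arrays for random draws gives the behaviour asked for; the answers below do that far more efficiently.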

Is there a way to do this without significant looping? I wrote a simple bash script to generate a random position and a random sequence line then do 1 replacement, then repeat the process on the output, repeat, repeat. This works perfectly, however in my real-life files (much larger than the examples), I want to generate 10,000 or more replacements. Oh, and I will need to do this multiple times to generate multiple 'mutated' variant sequences.

EDIT: At the moment I am using something like this:

# choose a random number between 1 and the number of characters in the string
randomposition=$(jot -r 1 1 $seqpositions)

# choose a random number between 1 and the number of lines in the set of potential replacement strings
randomline=$(jot -r 1 1 $alignlines)

# find the character at randomline:randomposition
newAA=$(sed -n "$randomline,$randomline p" $alignmentfile | cut -c$randomposition)

# replace the character at 'string:randomposition' with the character at 'randomline:randomposition'
sed "s/./$newAA/$randomposition" $sequencefile

(with some additional bits, obviously) and just looping through this thousands of times

edited Oct 10 at 16:15

Rui F Ribeiro


asked Oct 8 at 18:20

catchprj



What is the application of this? Do you work in a field where this sort of thing may already be implemented by a field specific standard tool?
– Kusalananda
Oct 8 at 18:34

Looks like a computational biology question: I agree with @Kusalananda: give us a real-world example.
– Fabby
Oct 8 at 18:37

Yes, it's indeed biology based. The idea is that I have a protein sequence, and I want to mutate that sequence randomly-ish; that is, I want to mutate it randomly, but only allow specific characters (amino acid residues) that have been observed previously at that specific position i.e. by swapping with characters in my alignment file at the same position. By doing it this way, I keep the protein sequence resembling something true, while also maintaining some information about the frequency of characters at each position (mutation to a rare character only occurs rarely etc). Does that help?
– catchprj
Oct 8 at 20:34

I edited the original question to include the relevant parts of the script I am currently using. Perhaps someone can see how I could make it change more than one position at a time?
– catchprj
Oct 8 at 20:50


Could you please include the expected output so we can work on it? Currently, it is not clear what the result should be.
– user90704
Oct 8 at 22:11


bash scripting


3 Answers


accepted

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about it ;-)

The following Perl script will mutate a list of ~1M sequences, using ~10k alignments, in about 10 seconds on my laptop.

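#! /usr/bin/perl
# usage: mutagen number_of_replacements alignment_file [ sequence_file ..]
use strict;

my $max = shift() - 1;    # highest replacement index (count - 1)
my $algf = shift;
open my $alg, '<', $algf or die "open $algf: $!";
my @alg = <$alg>;         # all alignment lines, held in memory

# return $max + 1 random integers in [0, $_[0]); repeats are allowed
sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){
    my @ip = prand length() - 1;    # positions in the sequence (newline excluded)
    my @op = prand scalar @alg;     # indices of alignment lines
    for my $i (0..$max){
        my $p = $ip[$i];
        # 4-argument substr replaces one character in place
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
}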

Usage example:

$ cat seq
1634870295
5684937021
2049163587
6598471230

$ cat alg
DPMBHZJEIO
INTMJZOYKQ
KNTXGLCJSR
GLJZRFVSEX
SYJVHEPNAZ

$ perl mutagen 3 alg seq
1L3V8702I5
5684HE7Y21
2049JZC587
6598H7C2E0
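A quick way to sanity-check output like this is to confirm that every changed position holds a character that some alignment line has at that same position. The following is a minimal bash sketch of such a check; check_mutant is a hypothetical helper, not part of the script above, and it takes the alignment lines as arguments rather than reading the file.

```shell
#!/bin/bash
# check_mutant ORIGINAL MUTANT ALIGNMENT_LINE...
# Succeeds if every position where MUTANT differs from ORIGINAL holds a
# character that at least one alignment line has at that same position.
check_mutant() {
    local orig=$1 mut=$2; shift 2
    local i line ok
    (( ${#orig} == ${#mut} )) || return 1
    for (( i = 0; i < ${#orig}; i++ )); do
        [[ ${orig:i:1} == "${mut:i:1}" ]] && continue    # unchanged position
        ok=0
        for line in "$@"; do
            [[ ${line:i:1} == "${mut:i:1}" ]] && { ok=1; break; }
        done
        (( ok )) || return 1
    done
    return 0
}

alg=(DPMBHZJEIO INTMJZOYKQ KNTXGLCJSR GLJZRFVSEX SYJVHEPNAZ)
check_mutant 1634870295 1L3V8702I5 "${alg[@]}" && echo "consistent"    # prints consistent
```

The check cannot tell how many replacements were made, because a replacement may happen to insert the character that was already there.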

If the n generated random numbers must all be distinct, then prand should be changed to:

sub prand {
    my (@r, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
        my $r = int(rand() * $_[0]);
        # on a collision, step to the next free value (wrapping around)
        $r = ($r + 1) % $_[0] while $h{$r};
        $h{$r} = 1;
        push @r, $r;
    }
    @r;
}
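For comparison, distinct picks can also be drawn in bash. The sketch below uses a different technique from the Perl sub above: a partial Fisher-Yates shuffle instead of stepping past collisions. The function name pick_distinct and the global array picked are illustrative, and because it relies on $RANDOM it assumes fewer than 32768 candidates.

```shell
#!/bin/bash
# pick_distinct N K: fill the global array "picked" with K distinct
# integers drawn from 0..N-1 (assumes N is below 32768, the range of $RANDOM).
pick_distinct() {
    local n=$1 k=$2 i j tmp
    local -a pool
    (( k <= n )) || { echo "more picks than candidates" >&2; return 1; }
    for (( i = 0; i < n; i++ )); do pool[i]=$i; done
    # Partial Fisher-Yates shuffle: only the first k slots need to be settled.
    for (( i = 0; i < k; i++ )); do
        j=$(( i + RANDOM % (n - i) ))
        tmp=${pool[i]}; pool[i]=${pool[j]}; pool[j]=$tmp
    done
    picked=("${pool[@]:0:k}")
}

pick_distinct 10 3
echo "${picked[@]}"
```

The result goes into a global array rather than standard output so that the caller does not need a command substitution, which would start a subshell for every call.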

A debug-enabled version that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl
# usage: mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
use strict;

my $debug = $ARGV[0] eq '-d' ? shift : 0;
my $max = shift() - 1;
my $algf = shift;
open my $alg, '<', $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
        # print the chosen positions and lines, then the sequence with markers under it
        my $t = ' ' x (length() - 1);
        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
        warn "@ip | @op\n    $_    $t\n";
        for my $i (0..$max){
            my $t = $alg[$op[$i]];
            # highlight the donor character in red
            $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
            printf STDERR " %2d %s", $op[$i], $t;
        }
    }
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
        # print the result again with the replaced characters in red
        my @t = split "", $_;
        for my $i (0..$max){
            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
        }
        warn "  = ", @t, "\n";
    }
}

edited Oct 9 at 10:03

answered Oct 8 at 22:58

qubert


Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50

I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to choose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, that they all come from the same alignment sequence).
– catchprj
Oct 9 at 9:07

Yes, that's very easy to do, and it will even simplify the script a lot -- replace the sub prand {...} with sub prand { map int(rand() * $_[0]), 0..$max }.
– qubert
Oct 9 at 9:32

@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54


This one-liner can generate as many random keys as you like:

cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1

Sample output:

MB0JZZ85VI

2OKOY4JL61

2YN7B71Z6K

KH29TYCQ4K

B4N1XOFY5O

Explanation:

/dev/random, /dev/urandom or even /dev/arandom are special files that serve as pseudorandom number generators in the system. They allow access to environmental noise collected from device drivers and other sources.

The fold command is a command-line utility that wraps the lines of the specified files, or of standard input. By default it wraps lines at a maximum width of 80 columns. It can also wrap at a different column width, or count bytes instead of columns. The -w flag sets the width, which here determines how many characters each random key contains.

The character set given to tr (a set, not a regular expression) controls which characters appear in the random keys: -d deletes characters and -c complements the set, so everything except A-Z and 0-9 is discarded.

head -n sets how many random keys are generated. For example, replacing -n 1 with -n 10000 generates 10,000 keys.
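The pieces explained above can be wrapped in a small function so that the number of keys and their width become parameters. This is only a sketch: random_keys is a hypothetical name, and LC_ALL=C is an added assumption so that tr treats the input as raw bytes on systems with a multibyte locale.

```shell
#!/bin/bash
# random_keys COUNT WIDTH: print COUNT random keys of WIDTH characters each,
# drawn from the uppercase letters and digits.
random_keys() {
    local count=$1 width=$2
    # tr keeps only A-Z and 0-9, fold cuts the stream into WIDTH-character
    # lines, and head stops the otherwise endless pipeline after COUNT lines.
    LC_ALL=C tr -dc 'A-Z0-9' < /dev/urandom | fold -w "$width" | head -n "$count"
}

random_keys 5 10
```

Reading /dev/urandom with a redirection instead of cat also saves one process.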

edited Oct 9 at 10:08

answered Oct 8 at 19:02

user88036

This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
– Kusalananda
Oct 8 at 19:42

@Kusalananda, I appreciate your comment! Honestly, I could not find simpler method to generate 10.000 replacements without looping, let's wait for the OP feedback and then we can go from there ;-)
– user88036
Oct 8 at 20:02


Your original bash attempt was slow because of the number of external processes being started. Each random number called jot, and each string manipulation used two sed and a cut.

As you're using bash, and not pure sh, you can benefit from the $RANDOM variable, Substring Expansion and Arrays. These make it possible to perform the replacements with no external commands -- not even any bash subshells.

#!/bin/bash

count=$1                                          # number of replacements
read -r sequence < "$2"                           # first line of the sequence file
IFS=$'\n' read -r -d '' -a replacements < "$3"    # one array element per line
len=${#sequence}
choices=${#replacements[*]}

while ((count--)) ; do
        pos=$(($RANDOM % $len))
        choice=$(($RANDOM % $choices))
        replacement=${replacements[$choice]}
        sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
done

echo "$sequence"

Note that $RANDOM won't exceed 32767, so if your sequences are bigger than that (or even approaching that size), you will need something more complex than $RANDOM % maximum.
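One possible way around that limit, sketched here under the assumption that 30 bits of range are enough, is to glue two $RANDOM draws together and reject values that would bias the modulo. The names rand_below and rand_result are illustrative only.

```shell
#!/bin/bash
# rand_below MAX: store a random integer in [0, MAX) in the global variable
# rand_result, for MAX between 1 and 2^30.
rand_below() {
    local max=$1 limit r
    (( max > 0 && max <= (1 << 30) )) || return 1
    # Largest multiple of max that fits in 30 bits; draws at or above it are
    # rejected so that every remainder stays equally likely.
    limit=$(( (1 << 30) / max * max ))
    while :; do
        r=$(( (RANDOM << 15) | RANDOM ))    # two 15-bit draws make 30 bits
        (( r < limit )) && break
    done
    rand_result=$(( r % max ))
}

rand_below 100000
echo "$rand_result"
```

In the script above, pos=$(($RANDOM % $len)) would then become a call to rand_below "$len" followed by pos=$rand_result, which still avoids subshells.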

This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.

answered Oct 9 at 13:32

JigglyNaga


add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f474057%2frandom-mutagenesis-with-bash%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
0
down vote

accepted

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl

# usage mutagen number_of_replacements alignment_file [ sequence_file ..]

use strict;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

}

Usage example:

$ cat seq

1634870295

5684937021

2049163587

6598471230

$ cat alg

DPMBHZJEIO

INTMJZOYKQ

KNTXGLCJSR

GLJZRFVSEX

SYJVHEPNAZ

$ perl mutagen 3 alg seq

1L3V8702I5

5684HE7Y21

2049JZC587

6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {

    my (@r, $m, %h);

    die "more replacements than positions/alignments" if $max >= $_[0];

    for(0..$max){

        my $r = int(rand() * $_[0]);

        $r = ($r + 1) % $_[0] while $h{$r};

        $h{$r} = 1;

        push @r, $r;

    }

    @r;

}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl

# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]

use strict;



my $debug = $ARGV[0] eq '-d' ? shift : 0;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max } 

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;



    if($debug){

        my $t = ' ' x (length() - 1);

        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;

        warn "@ip | @opn    $_    $tn";

        for my $i (0..$max){

            my $t = $alg[$op[$i]];

            $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;

            printf STDERR " %2d %s", $op[$i], $t;

        }

    }

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

    if($debug){

        my @t = split "", $_;

        for my $i (0..$max){

            $_ = "e[1;31m$_e[m" for $t[$ip[$i]];

        }

        warn "  = ", @t, "n";

    }

}

edited Oct 9 at 10:03

answered Oct 8 at 22:58

qubert

5476

Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50

I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to chose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, they all come from the same alignment sequence)
– catchprj
Oct 9 at 9:07

Yes that's very easy to do, it will even simplify the script a lot -- replace the sub prand {...} with sub sub prand { map int(rand() * $_[0]), 0..$max }.
– qubert
Oct 9 at 9:32

@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54

add a comment |

up vote
0
down vote

accepted

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl

# usage mutagen number_of_replacements alignment_file [ sequence_file ..]

use strict;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

}

Usage example:

$ cat seq

1634870295

5684937021

2049163587

6598471230

$ cat alg

DPMBHZJEIO

INTMJZOYKQ

KNTXGLCJSR

GLJZRFVSEX

SYJVHEPNAZ

$ perl mutagen 3 alg seq

1L3V8702I5

5684HE7Y21

2049JZC587

6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {

    my (@r, $m, %h);

    die "more replacements than positions/alignments" if $max >= $_[0];

    for(0..$max){

        my $r = int(rand() * $_[0]);

        $r = ($r + 1) % $_[0] while $h{$r};

        $h{$r} = 1;

        push @r, $r;

    }

    @r;

}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl

# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]

use strict;



my $debug = $ARGV[0] eq '-d' ? shift : 0;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max } 

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;



    if($debug){

        my $t = ' ' x (length() - 1);

        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;

        warn "@ip | @opn    $_    $tn";

        for my $i (0..$max){

            my $t = $alg[$op[$i]];

            $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;

            printf STDERR " %2d %s", $op[$i], $t;

        }

    }

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

    if($debug){

        my @t = split "", $_;

        for my $i (0..$max){

            $_ = "e[1;31m$_e[m" for $t[$ip[$i]];

        }

        warn "  = ", @t, "n";

    }

}

edited Oct 9 at 10:03

answered Oct 8 at 22:58

qubert

5476

Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50

I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to chose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, they all come from the same alignment sequence)
– catchprj
Oct 9 at 9:07

Yes that's very easy to do, it will even simplify the script a lot -- replace the sub prand {...} with sub sub prand { map int(rand() * $_[0]), 0..$max }.
– qubert
Oct 9 at 9:32

@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54

add a comment |

up vote
0
down vote

accepted

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl

# usage mutagen number_of_replacements alignment_file [ sequence_file ..]

use strict;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

}

Usage example:

$ cat seq

1634870295

5684937021

2049163587

6598471230

$ cat alg

DPMBHZJEIO

INTMJZOYKQ

KNTXGLCJSR

GLJZRFVSEX

SYJVHEPNAZ

$ perl mutagen 3 alg seq

1L3V8702I5

5684HE7Y21

2049JZC587

6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {

    my (@r, $m, %h);

    die "more replacements than positions/alignments" if $max >= $_[0];

    for(0..$max){

        my $r = int(rand() * $_[0]);

        $r = ($r + 1) % $_[0] while $h{$r};

        $h{$r} = 1;

        push @r, $r;

    }

    @r;

}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl

# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]

use strict;



my $debug = $ARGV[0] eq '-d' ? shift : 0;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max } 

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;



    if($debug){

        my $t = ' ' x (length() - 1);

        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;

        warn "@ip | @opn    $_    $tn";

        for my $i (0..$max){

            my $t = $alg[$op[$i]];

            $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;

            printf STDERR " %2d %s", $op[$i], $t;

        }

    }

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

    if($debug){

        my @t = split "", $_;

        for my $i (0..$max){

            $_ = "e[1;31m$_e[m" for $t[$ip[$i]];

        }

        warn "  = ", @t, "n";

    }

}

edited Oct 9 at 10:03

answered Oct 8 at 22:58

qubert

5476

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl

# usage mutagen number_of_replacements alignment_file [ sequence_file ..]

use strict;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max }

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

}

Usage example:

$ cat seq

1634870295

5684937021

2049163587

6598471230

$ cat alg

DPMBHZJEIO

INTMJZOYKQ

KNTXGLCJSR

GLJZRFVSEX

SYJVHEPNAZ

$ perl mutagen 3 alg seq

1L3V8702I5

5684HE7Y21

2049JZC587

6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {

    my (@r, $m, %h);

    die "more replacements than positions/alignments" if $max >= $_[0];

    for(0..$max){

        my $r = int(rand() * $_[0]);

        $r = ($r + 1) % $_[0] while $h{$r};

        $h{$r} = 1;

        push @r, $r;

    }

    @r;

}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl

# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]

use strict;



my $debug = $ARGV[0] eq '-d' ? shift : 0;

my $max = shift() - 1;

my $algf = shift;

open my $alg, $algf or die "open $algf: $!";

my @alg = <$alg>;



sub prand { map int(rand() * $_[0]), 0..$max } 

while(<>){

    my @ip = prand length() - 1;

    my @op = prand scalar @alg;



    if($debug){

        my $t = ' ' x (length() - 1);

        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;

        warn "@ip | @opn    $_    $tn";

        for my $i (0..$max){

            my $t = $alg[$op[$i]];

            $t =~ s/(.{$ip[$i]})(.)/$1e[1;31m$2e[m/;

            printf STDERR " %2d %s", $op[$i], $t;

        }

    }

    for my $i (0..$max){

        my $p = $ip[$i];

        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;

    }

    print;

    if($debug){

        my @t = split "", $_;

        for my $i (0..$max){

            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];

        }

        warn "  = ", @t, "\n";

    }

}

edited Oct 9 at 10:03

answered Oct 8 at 22:58

qubert

5476


Wow! I can't even begin to understand that, but it works perfectly. I bow down to you. Thanks so much
– catchprj
Oct 9 at 8:50

I notice the script breaks if you request more replacements than there are sequences in the alignment, which suggests that it will only access each alignment sequence once, right? Is it possible to choose a random alignment sequence each time, even if it has been chosen before? That way, in your usage example above, it would be possible to do >5 replacements (it's even possible, though unlikely, that they all come from the same alignment sequence).
– catchprj
Oct 9 at 9:07

Yes, that's very easy to do, and it will even simplify the script a lot: replace the sub prand {...} with sub prand { map int(rand() * $_[0]), 0..$max }.
– qubert
Oct 9 at 9:32

@catchprj I've updated the answer.
– qubert
Oct 9 at 9:54


This one-liner draws random keys from an endless stream; head limits how many are printed:

cat /dev/urandom | tr -dc 'A-Z0-9' | fold -w 10 | head -n 1

Sample output (with head -n 5):

MB0JZZ85VI

2OKOY4JL61

2YN7B71Z6K

KH29TYCQ4K

B4N1XOFY5O

Explanation:

The character set passed to tr (not a regular expression) controls which characters appear in the random keys: -d together with -c deletes every character that is not in the set.

head -n sets how many random keys are generated. For example, replacing -n 1 with -n 10000 generates 10,000 keys.

edited Oct 9 at 10:08

answered Oct 8 at 19:02

user88036

This would give a sequence of 10 random uppercase characters and/or digits, but it would not mutate a given string the way that the question asked about. Note that the strings in the question (the original one and the set of other strings) are not random. These (I assume) represent, for example, DNA that is mutated in different ways.
– Kusalananda
Oct 8 at 19:42

@Kusalananda, I appreciate your comment! Honestly, I could not find a simpler method to generate 10,000 replacements without looping. Let's wait for the OP's feedback and then we can go from there ;-)
– user88036
Oct 8 at 20:02


Your original bash attempt was slow because of the number of external processes it started: every random number called jot, and every string manipulation used two sed calls and a cut. The version below uses only bash built-ins:

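#!/bin/bash

# usage: script count sequence_file replacements_file
count=$1
read -r sequence < "$2"
# read all replacement lines into an array, one element per line
IFS=$'\n' read -r -d '' -a replacements < "$3"
len=${#sequence}
choices=${#replacements[*]}

while ((count--)) ; do
        pos=$((RANDOM % len))
        choice=$((RANDOM % choices))
        replacement=${replacements[$choice]}
        # splice in the character at the same position of the chosen line
        sequence=${sequence:0:$pos}${replacement:$pos:1}${sequence:$((pos+1))}
done

echo "$sequence"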

Note that $RANDOM won't exceed 32767, so if your sequences are bigger than that (or even approaching that size), you will need something more complex than $RANDOM % maximum.

This is still unlikely to beat a dedicated scripting language for speed, let alone a compiled language.
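As a rough sketch of "something more complex" for the $RANDOM limit noted above, two 15-bit draws can be combined into one 30-bit value. The helper name big_random is hypothetical, and the final modulo still carries a slight bias, so treat this as an illustration rather than a vetted generator. The result is returned in REPLY instead of being echoed, which avoids forking a subshell for every draw.

```shell
#!/bin/bash
# big_random MAX: set REPLY to a random integer in [0, MAX).
# Combines two 15-bit $RANDOM draws into one 30-bit value, so MAX
# can be as large as 2^30. (big_random is a hypothetical helper,
# not part of the answer above.)
big_random() {
    REPLY=$(( ((RANDOM << 15) | RANDOM) % $1 ))
}

big_random 1000000
echo "$REPLY"
```

In the loop above, pos=$(($RANDOM % $len)) would then become big_random "$len"; pos=$REPLY.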

answered Oct 9 at 13:32

JigglyNaga

3,673829
