Remove rows which contain duplicate strings between the first 4 characters of two columns
I have a large file that contains 4 columns and 7,000 lines. I need to remove the rows in which the start of the second column is the same as the start of the fourth column.
Input:
Gator_locus75 AATTCCATGTACG Gator_locus23 CTAGAGGAAGT
Gator_locus18 AATTCCATTATGG Gator_locus14 AATTCAAAAAAT
Gator_locus13 CTAGAACCCACC Gator_locus72 CTAGAATGTATG
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC
Output:
Gator_locus75 AATTCCATGTACG Gator_locus23 CTAGAGGAAGT
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC
I need to remove the rows in which the string in the second column starts with "AATT" and the string in the fourth column of the same row also starts with "AATT". I also need to do the same when the string in the second column starts with "CTAG" and the string in the fourth column starts with "CTAG".
text-processing bioinformatics
edited Nov 21 at 21:05 by Rui F Ribeiro
asked Feb 12 at 20:10 by Josh
Is it only the first 4 characters of each that we should compare? Or did DopeGhoti interpret you correctly by literally comparing only AATT and CTAG?
– Jeff Schaller
Feb 12 at 20:39
Yes, only the first four characters of each should be compared. Sorry for the confusion.
– Josh
Feb 12 at 20:55
Sorry for the delay. They both work
– Josh
Mar 9 at 19:34
2 Answers
2 votes, accepted
To print lines where the first 4 characters of column 2 are not equal to the first 4 characters of column 4:
awk 'substr($2, 1, 4) != substr($4, 1, 4)' < input
This uses the main code as a "test" to see whether a line should be printed; there's no explicit action section, since the default-print action is what we want. The main code simply extracts the first four characters from each column and compares them.
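As a quick check, the sketch below runs the same test against the five sample rows from the question. The file name sample.txt and the variable name filtered are assumptions made for this demonstration only; substitute your own file.

```shell
# Write the sample rows from the question to a scratch file
# (the name sample.txt is only for this demonstration).
cat > sample.txt <<'EOF'
Gator_locus75 AATTCCATGTACG Gator_locus23 CTAGAGGAAGT
Gator_locus18 AATTCCATTATGG Gator_locus14 AATTCAAAAAAT
Gator_locus13 CTAGAACCCACC Gator_locus72 CTAGAATGTATG
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC
EOF

# Keep only rows whose 2nd and 4th fields differ in their first four characters.
filtered=$(awk 'substr($2, 1, 4) != substr($4, 1, 4)' < sample.txt)
printf '%s\n' "$filtered"
```

The rows for Gator_locus18 (AATT in both columns) and Gator_locus13 (CTAG in both columns) should be dropped, leaving the three rows shown in the question's expected output.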
answered Feb 12 at 20:38 by Jeff Schaller
0 votes
To remove rows in which the second field starts with AATT and the fourth field also starts with AATT, and likewise for CTAG:
awk '!($2 ~ /^AATT/ && $4 ~ /^AATT/) && !($2 ~ /^CTAG/ && $4 ~ /^CTAG/) {print}' /path/to/file
As a more general solution:
awk 'substr($2,1,4) != substr($4,1,4) {print}' /path/to/file
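If the number of leading characters to compare might change, the length can be passed in as an awk variable instead of being hard-coded. This is only a sketch: the function name drop_same_prefix and the variable name n are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: drop rows whose 2nd and 4th fields share the same
# first n characters. Usage: drop_same_prefix N FILE
drop_same_prefix() {
    awk -v n="$1" 'substr($2, 1, n) != substr($4, 1, n)' "$2"
}
```

For the question as asked, it would be called as drop_same_prefix 4 /path/to/file, with the output redirected to a new file if the result should be saved.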
answered Feb 12 at 20:38 by DopeGhoti