Remove rows which contain duplicate strings between the first 4 characters of two columns


























I have a large file that contains 4 columns and 7,000 lines. I need to remove the rows in which the start of the second column is the same as the start of the fourth column.



Input:



Gator_locus75   AATTCCATGTACG   Gator_locus23   CTAGAGGAAGT
Gator_locus18 AATTCCATTATGG Gator_locus14 AATTCAAAAAAT
Gator_locus13 CTAGAACCCACC Gator_locus72 CTAGAATGTATG
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC


Output:



Gator_locus75   AATTCCATGTACG   Gator_locus23   CTAGAGGAAGT
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC


I need to remove the rows in which the string in the second column starts with "AATT" and the string in the fourth column of the same row also starts with "AATT". I also need to do the same thing when the string in the second column starts with "CTAG" and the string in the fourth column starts with "CTAG".







text-processing bioinformatics














edited Nov 21 at 21:05 by Rui F Ribeiro
asked Feb 12 at 20:10 by Josh












  • Is it only the first 4 characters of each that we should compare? Or did DopeGhoti interpret you correctly by literally comparing only AATT and CTAG?
    – Jeff Schaller
    Feb 12 at 20:39

  • Yes, only the first four characters of each should be compared. Sorry for the confusion.
    – Josh
    Feb 12 at 20:55

  • Sorry for the delay. They both work.
    – Josh
    Mar 9 at 19:34




























2 Answers
          accepted


















To print lines where the first 4 characters of column 2 are not equal to the first 4 characters of column 4:

    awk 'substr($2, 1, 4) != substr($4, 1, 4)' < input

The whole program is a single awk pattern that acts as a test for whether a line should be printed; there is no explicit action block, since the default action (print the line) is what we want. The pattern simply extracts the first four characters of each column and compares them.

















answered Feb 12 at 20:38 by Jeff Schaller
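
As a quick sanity check (an editorial sketch, not part of the answer itself), the comparison can be run against the five sample rows from the question. The scratch file created with mktemp and the variable names input and kept are assumptions made for illustration only.

```shell
# Write the sample rows from the question to a scratch file (name is arbitrary).
input=$(mktemp)
cat > "$input" <<'EOF'
Gator_locus75   AATTCCATGTACG   Gator_locus23   CTAGAGGAAGT
Gator_locus18 AATTCCATTATGG Gator_locus14 AATTCAAAAAAT
Gator_locus13 CTAGAACCCACC Gator_locus72 CTAGAATGTATG
Gator_locus16 AATTCATCCTCT Gator_locus15 CTAGATTGCCAA
Gator_locus24 CTAGAGCTGCTG Gator_locus12 AATTCAGTCCAC
EOF

# Keep only rows whose 2nd and 4th fields differ in their first four characters.
# awk's default field splitting treats any run of blanks as one separator,
# so the extra spaces in the first row do not matter.
kept=$(awk 'substr($2, 1, 4) != substr($4, 1, 4)' "$input")
printf '%s\n' "$kept"

# Clean up the scratch file.
rm -f "$input"
```

With this input, the rows for Gator_locus18 and Gator_locus13 should be dropped and the other three printed unchanged, which matches the output requested in the question. For the real file, redirecting the result to a new file (rather than overwriting the input in the same command) is the safer choice.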




































To remove rows in which the second field starts with AATT and the fourth field starts with AATT, and the same with CTAG:

    awk '!($2 ~ /^AATT/ && $4 ~ /^AATT/) && !($2 ~ /^CTAG/ && $4 ~ /^CTAG/) {print}' /path/to/file

As a more general solution, comparing the first four characters of the two fields whatever they are:

    awk 'substr($2,1,4) != substr($4,1,4) {print}' /path/to/file
















answered Feb 12 at 20:38 by DopeGhoti
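
If the prefix length or the columns ever need to change, the general form can be parameterised. The following is only a sketch under stated assumptions: the function name filter_prefix_dupes and its argument order are hypothetical, not something either answer proposes, and the two test rows are taken from the question's sample input.

```shell
# Hypothetical helper (name and argument order are illustrative assumptions).
# Usage: filter_prefix_dupes FILE [PREFIX_LEN] [COL_A] [COL_B]
# Prints rows whose COL_A and COL_B fields differ in their first PREFIX_LEN characters.
filter_prefix_dupes() {
    awk -v n="${2:-4}" -v a="${3:-2}" -v b="${4:-4}" \
        'substr($a, 1, n) != substr($b, 1, n)' "$1"
}

# Two rows from the question whose fields share a 4-character prefix.
sample=$(mktemp)
printf '%s\n' \
    'Gator_locus18 AATTCCATTATGG Gator_locus14 AATTCAAAAAAT' \
    'Gator_locus13 CTAGAACCCACC Gator_locus72 CTAGAATGTATG' > "$sample"

out4=$(filter_prefix_dupes "$sample")      # default: compare the first 4 characters
out6=$(filter_prefix_dupes "$sample" 6)    # compare the first 6 characters instead
printf 'n=4: %s\n' "$out4"
printf 'n=6: %s\n' "$out6"

rm -f "$sample"
```

With the default of 4, both rows are filtered out. With 6, the Gator_locus18 row is kept (AATTCC versus AATTCA) while the Gator_locus13 row is still removed (CTAGAA in both fields). Passing the numbers in with -v keeps them out of the quoted awk program, so nothing needs shell interpolation inside the single quotes.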






























                       
