Script optimisation to find duplicate filenames in huge CSV

I have several CSV files, from 1 MB to 6 GB, generated by an inotify script, with a list of events formatted as:
timestamp;fullpath;event;size.

Those files are formatted like this:

timestamp;fullpath;event;size
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160


My goal is to identify files with the same name that appear in different folders.

In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir/.

My desired output is simply the list of files that appear in more than one folder; in this case it would look like this:

quad_list_14.json

To do this I've written this small piece of code:

#this line cuts the file to only keep the unique filepaths
PATHLIST=$(cut -d';' -f 2 ${1} | sort -u)
FILENAMELIST=""

#this loop builds a list of basenames from the list of filepaths
for path in ${PATHLIST}
do
    FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done

#once the list is built, I simply find the duplicates with uniq -d as the list is already sorted
echo "${FILENAMELIST}" | sort | uniq -d



Don't use this code at home, it's terrible; I should have replaced this script with a one-liner like this:

#this gets all the file paths, sorts them and only keeps unique entries, then
#removes the path to get the basename of the file
#and finally sorts and outputs the duplicate entries.
cut -d';' -f 2 ${1} | sort -u | grep -o '[^/]*$' | sort | uniq -d

My problem remains, though: there are a lot of files, and the shortest takes 0.5 seconds but the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.



I need to improve this code to make it more efficient.
My only limitation is that I can't fully load the files in RAM.

bash awk grep uniq optimization

asked Apr 11 '18 at 15:28 by Kiwy


3 Answers

The following AWK script should do the trick, without using too much memory:

#!/usr/bin/awk -f

BEGIN {
    FS = ";"
}

{
    idx = match($2, "/[^/]+$")
    if (idx > 0) {
        path = substr($2, 1, idx)
        name = substr($2, idx + 1)
        if (paths[name] && paths[name] != path && !output[name]) {
            print name
            output[name] = 1
        }
        paths[name] = path
    }
}

It extracts the path and name of each file, and stores the last path it’s seen for every name. If it had previously seen another path, it outputs the name, unless it’s already output it.
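
For reference, here is a minimal usage sketch (mine, not part of the original answer; the file name dup_names.awk is just an assumed example), including the IN_OPEN-only variant discussed in the comments below:

# hypothetical invocation; save the script above as dup_names.awk first
awk -f dup_names.awk events.csv

# variant restricted to IN_OPEN events, as suggested in the comments:
# guard the main block with a pattern, i.e. replace its opening "{" with
#     /IN_OPEN/ {
# (or, more strictly, $3 == "IN_OPEN" {)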






answered Apr 11 '18 at 15:39 by Stephen Kitt (edited Apr 12 '18 at 8:54)

• Thank you for the help, I really need to learn awk, it looks awesomely effective
  – Kiwy, Apr 12 '18 at 8:44

• You made the most effective script.
  – Kiwy, Apr 12 '18 at 12:18

The main issue with your code is that you're collecting all the pathnames in a variable and then looping over it to call basename. This makes it slow.

The loop also runs over the unquoted variable expansion ${PATHLIST}, which would be unwise if the pathnames contain spaces or shell globbing characters. In bash (or other shells that support it), one would have used an array instead.

Suggestion:

$ sed -e '1d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d
quad_list_14.json

The first sed picks out the pathnames (and discards the header line). This might also be written as awk -F';' 'NR > 1 { print $2 }' file.csv, or as tail -n +2 file.csv | cut -d ';' -f 2.

The sort -u gives us unique pathnames, and the following sed gives us the basenames. The final sort with uniq -d at the end tells us which basenames are duplicated.

The last sed 's#.*/##' which gives you the basenames is reminiscent of the parameter expansion ${pathname##*/} which is equivalent to $( basename "$pathname" ). It just deletes everything up to and including the last / in the string.

The main difference from your code is that instead of the loop that calls basename multiple times, a single sed is used to produce the basenames from a list of pathnames.
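
As an aside (this sketch is mine, not part of the original answer), the array and ${pathname##*/} remarks above could be combined into a pure-bash loop that avoids calling the external basename on every iteration, although the single sed above remains the recommended approach:

#!/usr/bin/env bash
# sketch only: read the unique pathnames into an array, then strip the
# directory part with parameter expansion instead of basename
mapfile -t paths < <(cut -d';' -f 2 "$1" | sort -u)

for p in "${paths[@]}"; do
    printf '%s\n' "${p##*/}"    # everything after the last /
done | sort | uniq -d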





Alternative for only looking at IN_OPEN entries:

sed -e '/;IN_OPEN;/!d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d





answered Apr 11 '18 at 15:51 by Kusalananda (edited Apr 12 '18 at 9:00)

• Grep seems to be 20% faster: grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d
  – Isaac, Apr 12 '18 at 7:08

• @isaac Possibly. It would depend on what implementations of grep and sed one used. BSD sed is generally faster than GNU sed, and GNU grep may be faster than GNU sed too... I'm on a BSD system, so using GNU grep isn't something I'm "automatically" doing.
  – Kusalananda, Apr 12 '18 at 7:10

• @Kusalananda see my answer after some testing ;-) thank you for your help
  – Kiwy, Apr 12 '18 at 8:43

Thank you both for your answers, and thanks Isaac for the comments.
I've taken all your code and put it in the scripts stephen.awk, kusa.sh and isaac.sh; after that I ran a small benchmark like this:

for i in *.csv
do
    script.sh "$i"
done

With the time command I compared them, and here are the results:

stephen.awk

real    2m35,049s
user    2m26,278s
sys     0m8,495s

stephen.awk: updated with /IN_OPEN/ before the second block

real    0m35,749s
user    0m15,711s
sys     0m4,915s

kusa.sh

real    8m55,754s
user    8m48,924s
sys     0m21,307s

Update with filter on IN_OPEN:

real    0m37,463s
user    0m9,340s
sys     0m4,778s

Side note: though correct, I got a lot of blank lines output with sed; your script was the only one like that.



isaac.sh

grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d

real    7m2,715s
user    6m56,009s
sys     0m18,385s

With filter on IN_OPEN:

real    0m32,785s
user    0m8,775s
sys     0m4,202s

my script

real    6m27,645s
user    6m13,742s
sys     0m20,570s

@Stephen you clearly win this one, with a very impressive time decrease by a factor of 2.5.

Though after thinking a bit more, I came up with another idea: what if I only look at the OPEN file events? It would reduce the complexity, and you're not supposed to access a file or write to it without opening it first, so I did that:

#see I added grep "IN_OPEN" to reduce complexity
PATHLIST=$(grep "IN_OPEN" "${1}" | cut -d';' -f 2 | sort -u)
FILENAMELIST=""
for path in ${PATHLIST}
do
    FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
echo "${FILENAMELIST}" | sort | uniq -d

With only this modification, which gave me the same result, I end up with these times:

real    0m56,412s
user    0m27,439s
sys     0m9,928s

And I'm pretty sure there's plenty of other stuff I could do.
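
As the comments below point out, the same IN_OPEN pre-filter can also go in front of the loop-free pipelines; purely as an illustration (mine, not part of the original post), combining it with the one-liner from the question gives:

grep ';IN_OPEN;' "${1}" | cut -d';' -f 2 | sort -u | grep -o '[^/]*$' | sort | uniq -d

which avoids the basename loop entirely.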






answered Apr 12 '18 at 8:42 by Kiwy (edited Apr 12 '18 at 12:17)

• Plug that grep into Stephen's and my solution too and you'll probably see an equivalent speed increase. Just don't do that loop with basename!
  – Kusalananda, Apr 12 '18 at 8:52

• Thanks for the benchmarking! I’m curious how the AWK version improves if you change the second block to start with /IN_OPEN/ {...
  – Stephen Kitt, Apr 12 '18 at 8:52

• Likewise for mine. I have added this restriction at the end of my answer.
  – Kusalananda, Apr 12 '18 at 9:01

• @Kusalananda what if I use an array instead? @StephenKitt See updated result with the filtering. awk is definitely quicker.
  – Kiwy, Apr 12 '18 at 9:01

• I simply fail to see how looping over basename can even get close to being as fast as a single sed invocation. Our approaches are identical in all other aspects (depending on what tool you use for the first step of the pipeline, but I gave a few alternatives).
  – Kusalananda, Apr 12 '18 at 9:03