Script optimisation to find duplicate filenames in huge CSV

I have several CSV files, from 1 MB to 6 GB, generated by an inotify script, with a list of events formatted as:
timestamp;fullpath;event;size.

Those files are formatted like this:

timestamp;fullpath;event;size
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160


My goal is to identify files with the same name that appear in different folders.

In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir/.

My desired output is simply the list of files that appear in more than one folder; in this case it would look like this:

quad_list_14.json

To do this I've written this small piece of code:

#this line cuts the file to only keep the unique filepaths
PATHLIST=$(cut -d';' -f 2 ${1} | sort -u)
FILENAMELIST=""

#this loop builds a list of basenames from the list of filepaths
for path in ${PATHLIST}
do
    FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done

#once the list is built, I simply find the duplicates with uniq -d as the list is already sorted
echo "${FILENAMELIST}" | sort | uniq -d



Don't use this code at home, it's terrible; I should have replaced this script with a one-liner like this:

#this gets all the file paths, sorts them and only keeps unique entries, then
#removes the path to get the basename of the file
#and finally sorts and outputs the duplicate entries.
cut -d';' -f 2 ${1} | sort -u | grep -o '[^/]*$' | sort | uniq -d

My problem remains, though: there are a lot of files, and the shortest takes 0.5 seconds but the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.



I need to improve this code to make it more efficient.
My only limitation is that I can't fully load the files in RAM.

bash awk grep uniq optimization

asked Apr 11 '18 at 15:28 by Kiwy


3 Answers

The following AWK script should do the trick, without using too much memory:

#!/usr/bin/awk -f

BEGIN {
    FS = ";"
}

{
    idx = match($2, "/[^/]+$")
    if (idx > 0) {
        path = substr($2, 1, idx)
        name = substr($2, idx + 1)
        if (paths[name] && paths[name] != path && !output[name]) {
            print name
            output[name] = 1
        }
        paths[name] = path
    }
}

It extracts the path and name of each file, and stores the last path it’s seen for every name. If it had previously seen another path, it outputs the name, unless it’s already output it.
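
For reference, here is a minimal usage sketch (mine, not part of the original answer; the file name dup_names.awk is just an assumed example), including the IN_OPEN-only variant discussed in the comments below:

# hypothetical invocation; save the script above as dup_names.awk first
awk -f dup_names.awk events.csv

# variant restricted to IN_OPEN events, as suggested in the comments:
# guard the main block with a pattern, i.e. replace its opening "{" with
#     /IN_OPEN/ {
# (or, more strictly, $3 == "IN_OPEN" {)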






answered Apr 11 '18 at 15:39 by Stephen Kitt (edited Apr 12 '18 at 8:54)

• Thank you for the help, I really need to learn awk, it looks awesomely effective
  – Kiwy, Apr 12 '18 at 8:44

• You made the most effective script.
  – Kiwy, Apr 12 '18 at 12:18

The main issue with your code is that you're collecting all the pathnames in a variable and then looping over it to call basename. This makes it slow.

The loop also runs over the unquoted variable expansion ${PATHLIST}, which would be unwise if the pathnames contain spaces or shell globbing characters. In bash (or other shells that support it), one would have used an array instead.

Suggestion:

$ sed -e '1d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d
quad_list_14.json

The first sed picks out the pathnames (and discards the header line). This might also be written as awk -F';' 'NR > 1 { print $2 }' file.csv, or as tail -n +2 file.csv | cut -d ';' -f 2.

The sort -u gives us unique pathnames, and the following sed gives us the basenames. The final sort with uniq -d at the end tells us which basenames are duplicated.

The last sed 's#.*/##' which gives you the basenames is reminiscent of the parameter expansion ${pathname##*/} which is equivalent to $( basename "$pathname" ). It just deletes everything up to and including the last / in the string.

The main difference from your code is that instead of the loop that calls basename multiple times, a single sed is used to produce the basenames from a list of pathnames.
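
As an aside (this sketch is mine, not part of the original answer), the array and ${pathname##*/} remarks above could be combined into a pure-bash loop that avoids calling the external basename on every iteration, although the single sed above remains the recommended approach:

#!/usr/bin/env bash
# sketch only: read the unique pathnames into an array, then strip the
# directory part with parameter expansion instead of basename
mapfile -t paths < <(cut -d';' -f 2 "$1" | sort -u)

for p in "${paths[@]}"; do
    printf '%s\n' "${p##*/}"    # everything after the last /
done | sort | uniq -d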





Alternative for only looking at IN_OPEN entries:

sed -e '/;IN_OPEN;/!d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d





answered Apr 11 '18 at 15:51 by Kusalananda (edited Apr 12 '18 at 9:00)

• Grep seems to be 20% faster: grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d
  – Isaac, Apr 12 '18 at 7:08

• @isaac Possibly. It would depend on what implementations of grep and sed one used. BSD sed is generally faster than GNU sed, and GNU grep may be faster than GNU sed too... I'm on a BSD system, so using GNU grep isn't something I'm "automatically" doing.
  – Kusalananda, Apr 12 '18 at 7:10

• @Kusalananda see my answer after some testing ;-) thank you for your help
  – Kiwy, Apr 12 '18 at 8:43

Thank you both for your answers, and thanks Isaac for the comments.
I've taken all your code and put it in the scripts stephen.awk, kusa.sh and isaac.sh; after that I ran a small benchmark like this:

for i in *.csv
do
    script.sh "$i"
done

With the time command I compared them, and here are the results:

stephen.awk

real    2m35,049s
user    2m26,278s
sys     0m8,495s

stephen.awk: updated with /IN_OPEN/ before the second block

real    0m35,749s
user    0m15,711s
sys     0m4,915s

kusa.sh

real    8m55,754s
user    8m48,924s
sys     0m21,307s

Update with filter on IN_OPEN:

real    0m37,463s
user    0m9,340s
sys     0m4,778s

Side note: though correct, I got a lot of blank lines output with sed; your script was the only one like that.



isaac.sh

grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d

real    7m2,715s
user    6m56,009s
sys     0m18,385s

With filter on IN_OPEN:

real    0m32,785s
user    0m8,775s
sys     0m4,202s

my script

real    6m27,645s
user    6m13,742s
sys     0m20,570s

@Stephen you clearly win this one, with a very impressive time decrease by a factor of 2.5.

Though after thinking a bit more, I came up with another idea: what if I only look at the OPEN file events? It would reduce the complexity, and you're not supposed to access a file or write to it without opening it first, so I did that:

#see I added grep "IN_OPEN" to reduce complexity
PATHLIST=$(grep "IN_OPEN" "${1}" | cut -d';' -f 2 | sort -u)
FILENAMELIST=""
for path in ${PATHLIST}
do
    FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
echo "${FILENAMELIST}" | sort | uniq -d

With only this modification, which gave me the same result, I end up with these times:

real    0m56,412s
user    0m27,439s
sys     0m9,928s

And I'm pretty sure there's plenty of other stuff I could do.
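
As the comments below point out, the same IN_OPEN pre-filter can also go in front of the loop-free pipelines; purely as an illustration (mine, not part of the original post), combining it with the one-liner from the question gives:

grep ';IN_OPEN;' "${1}" | cut -d';' -f 2 | sort -u | grep -o '[^/]*$' | sort | uniq -d

which avoids the basename loop entirely.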






answered Apr 12 '18 at 8:42 by Kiwy (edited Apr 12 '18 at 12:17)

• Plug that grep into Stephen's and my solution too and you'll probably see an equivalent speed increase. Just don't do that loop with basename!
  – Kusalananda, Apr 12 '18 at 8:52

• Thanks for the benchmarking! I’m curious how the AWK version improves if you change the second block to start with /IN_OPEN/ {...
  – Stephen Kitt, Apr 12 '18 at 8:52

• Likewise for mine. I have added this restriction at the end of my answer.
  – Kusalananda, Apr 12 '18 at 9:01

• @Kusalananda what if I use an array instead? @StephenKitt See updated result with the filtering. awk is definitely quicker.
  – Kiwy, Apr 12 '18 at 9:01

• I simply fail to see how looping over basename can even get close to being as fast as a single sed invocation. Our approaches are identical in all other aspects (depending on what tool you use for the first step of the pipeline, but I gave a few alternatives).
  – Kusalananda, Apr 12 '18 at 9:03