Script optimisation to find duplicate filenames in huge CSVs
I have several CSV files, from 1 MB to 6 GB, generated by an inotify script. Each file is a list of events in the format timestamp;fullpath;event;size.
The files look like this:
timestamp;fullpath;event;size
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_OPEN;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_ACCESS;2324
1521540649.02;/home/workdir/ScienceXMLIn/config.cfg;IN_CLOSE_NOWRITE;2324
1521540649.02;/home/workdir/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_ACCESS;2160
1521540649.03;/home/workdir/quad_list_14.json;IN_CLOSE_NOWRITE;2160
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc;IN_OPEN;70
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.1;IN_OPEN;80
1521540649.03;/home/workdir/ScienceXMLIn/masterbias_list.asc.2;IN_OPEN;70
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_OPEN;2160
1521540649.03;/home/workdir/otherfolder/quad_list_14.json;IN_CLOSE_NOWRITE;2160
My goal is to identify filenames that appear in more than one folder.
In this example, the file quad_list_14.json appears in both /home/workdir/otherfolder and /home/workdir/.
My desired output is simply the list of files that appear in more than one folder; in this case it would look like this:
quad_list_14.json
To do this I've written this small piece of code:
#this line extracts the unique file paths
PATHLIST=$(cut -d';' -f 2 "${1}" | sort -u)
FILENAMELIST=""
#this loop builds a list of basenames from the list of file paths
for path in ${PATHLIST}
do
FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
#once the list is built, I sort it and find the duplicates with uniq -d
echo "${FILENAMELIST}" | sort | uniq -d
Don't use this code at home, it's terrible; I should have replaced this script with a one-liner like this:
#this gets all the file paths, sorts them and keeps only the unique entries, then
#removes the path to get the basename of each file
#and finally sorts and outputs the duplicate entries.
cut -d';' -f 2 "${1}" | sort -u | grep -o '[^/]*$' | sort | uniq -d
My problem remains, though: there are a lot of files, and while the shortest takes 0.5 seconds, the longest takes 45 seconds on an SSD (and my production disk won't be as fast) to find duplicate filenames in different folders.
I need to improve this code to make it more efficient.
My only limitation is that I can't fully load the files in RAM.
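As an untested aside (an assumption on my part, not something benchmarked here): GNU sort already does an external merge sort using temporary files, so the sort steps themselves shouldn't need to hold everything in RAM, and forcing the C locale is said to speed them up considerably on large inputs, e.g.:
#same pipeline as above, with byte-order (C locale) sorting
cut -d';' -f 2 "${1}" | LC_ALL=C sort -u | grep -o '[^/]*$' | LC_ALL=C sort | uniq -d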
bash awk grep uniq optimization
asked Apr 11 '18 at 15:28 by Kiwy
3 Answers
The following AWK script should do the trick, without using too much memory:
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
}
{
    # split the pathname in field 2 at the last "/"
    idx = match($2, "/[^/]+$")
    if (idx > 0) {
        path = substr($2, 1, idx)
        name = substr($2, idx + 1)
        # a name previously seen with a different path is a duplicate;
        # print it only the first time it is detected
        if (paths[name] && paths[name] != path && !output[name]) {
            print name
            output[name] = 1
        }
        paths[name] = path
    }
}
It extracts the path and name of each file, and stores the last path it has seen for every name. If it has previously seen a different path for that name, it outputs the name, unless it has already output it.
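For example, assuming the script is saved as stephen.awk (the name used in the benchmarks below) and made executable, running it on the sample data in the question would print:
$ chmod +x stephen.awk
$ ./stephen.awk file.csv
quad_list_14.json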
Thank you for the help, I really need to learn awk; it looks awesomely effective
– Kiwy
Apr 12 '18 at 8:44
You made the most effective script.
– Kiwy
Apr 12 '18 at 12:18
The main issue with your code is that you're collecting all the pathnames in a variable and then looping over it to call basename. This makes it slow.
The loop also runs over the unquoted variable expansion ${PATHLIST}, which would be unwise if the pathnames contain spaces or shell globbing characters. In bash (or other shells that support it), one would have used an array instead.
Suggestion:
$ sed -e '1d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d
quad_list_14.json
The first sed picks out the pathnames (and discards the header line). This might also be written as awk -F';' 'NR > 1 { print $2 }' file.csv, or as tail -n +2 file.csv | cut -d ';' -f 2.
The sort -u gives us unique pathnames, and the following sed gives us the basenames. The final sort with uniq -d at the end tells us which basenames are duplicated.
The last sed 's#.*/##', which gives you the basenames, is reminiscent of the parameter expansion ${pathname##*/}, which is equivalent to $( basename "$pathname" ). It just deletes everything up to and including the last / in the string.
The main difference from your code is that instead of the loop that calls basename multiple times, a single sed is used to produce the basenames from a list of pathnames.
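For completeness, here is a minimal sketch (my illustration, not benchmarked) of the array plus ${pathname##*/} approach mentioned above; it avoids calling basename, but it still loops in the shell, so the single-sed pipeline should remain faster:
#!/bin/bash
# read the unique pathnames into an array (assumes bash 4+ for mapfile)
mapfile -t pathnames < <(cut -d';' -f 2 "$1" | sort -u)

# strip the directory part with ${pathname##*/} instead of calling basename
for pathname in "${pathnames[@]}"; do
    printf '%s\n' "${pathname##*/}"
done | sort | uniq -d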
Alternative for only looking at IN_OPEN entries:
sed -e '/;IN_OPEN;/!d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d
Grep seems to be 20% faster: grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d
– Isaac
Apr 12 '18 at 7:08
@isaac Possibly. It would depend on what implementations of grep and sed one used. BSD sed is generally faster than GNU sed, and GNU grep may be faster than GNU sed too... I'm on a BSD system, so using GNU grep isn't something I'm "automatically" doing.
– Kusalananda
Apr 12 '18 at 7:10
@Kusalananda see my answer after some testing ;-) thank you for your help
– Kiwy
Apr 12 '18 at 8:43
Thank you both for your answers, and thanks Isaac for the comments.
I've taken all your code and put it into the scripts stephen.awk, kusa.sh and isaac.sh, and after that I've run a small benchmark like this:
for i in *.csv
do
script.sh "$i"
done
With the time command I compare them, and here are the results:
stephen.awk
real 2m35,049s
user 2m26,278s
sys 0m8,495s
stephen.awk: updated with /IN_OPEN/ before the second block
real 0m35,749s
user 0m15,711s
sys 0m4,915s
kusa.sh
real 8m55,754s
user 8m48,924s
sys 0m21,307s
Update with filter on IN_OPEN:
real 0m37,463s
user 0m9,340s
sys 0m4,778s
Side note: though correct, I had a lot of blank lines output with sed; your script was the only one like that.
isaac.sh
grep -oP '^[^;]*;\K[^;]*' file.csv | sort -u | grep -oP '.*/\K.*' | sort | uniq -d
real 7m2,715s
user 6m56,009s
sys 0m18,385s
With filter on IN_OPEN:
real 0m32,785s
user 0m8,775s
sys 0m4,202s
my script
real 6m27,645s
user 6m13,742s
sys 0m20,570s
@Stephen you clearly win this one, with a very impressive time decrease by a factor of 2.5.
Though after thinking a bit more, I came up with another idea: what if I only look at OPEN file events? That would reduce the complexity, and you're not supposed to access a file or write to it without opening it first, so I did that:
#note that I add grep "IN_OPEN" to reduce the amount of data processed
PATHLIST=$(grep "IN_OPEN" "${1}" | cut -d';' -f 2 | sort -u)
FILENAMELIST=""
for path in ${PATHLIST}
do
FILENAMELIST="$(basename "${path}")
${FILENAMELIST}"
done
echo "${FILENAMELIST}" | sort | uniq -d
With only this modification, which gave me the same result, I end up with these time values:
real 0m56,412s
user 0m27,439s
sys 0m9,928s
And I'm pretty sure there's plenty of other stuff I could do.
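For instance (a rough sketch based on the suggestion in the comments below, not benchmarked separately): plugging the same IN_OPEN filter into the sed/cut pipeline would avoid the basename loop entirely:
grep ';IN_OPEN;' "${1}" | cut -d';' -f 2 | sort -u | sed 's#.*/##' | sort | uniq -d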
Plug that grep into Stephen's and my solution too and you'll probably see an equivalent speed increase. Just don't do that loop with basename!
– Kusalananda
Apr 12 '18 at 8:52
Thanks for the benchmarking! I'm curious how the AWK version improves if you change the second block to start with /IN_OPEN/ { ...
– Stephen Kitt
Apr 12 '18 at 8:52
Likewise for mine. I have added this restriction at the end of my answer.
– Kusalananda
Apr 12 '18 at 9:01
@Kusalananda what if I use an array instead? @StephenKitt See updated result with the filtering. awk is definitely quicker.
– Kiwy
Apr 12 '18 at 9:01
I simply fail to see how looping over basename can even get close to being as fast as a single sed invocation. Our approaches are identical in all other aspects (depending on what tool you use for the first step of the pipeline, but I gave a few alternatives).
– Kusalananda
Apr 12 '18 at 9:03