How to merge pre-sorted files into a single BIG file, without excessive memory or temporary disk use
I'm trying to sort a 1.4TB file (and ideally remove duplicate lines from it).
Splitting and sorting the individual chunks is not an issue, but reassembling them is turning out to be a challenge. I expected from the man page that 'sort -m' (under FreeBSD 11) would do a simple merge, creating an aggregate, perfectly sorted output, optionally suppressing duplicates with the -u option.
But after leaving it to run for a while, I discovered that sort had (so far) generated several hundred gigabytes' worth of temporary files, just as if it were sorting the input as normal.
I don't have enough disk space to store the same data three times. Are there any utilities that can do a simple merge of already-sorted files, without requiring temporary disk space?
=== Outcome ===
I ended up using a "standard" sort. It took around 50 hours of high CPU and disk load, including the generation of several hundred temporary files, despite the input already being perfectly sorted. I'm still interested in learning whether there is a simple utility that can neatly merge pre-sorted files.
Tags: freebsd, sort, merge
asked Dec 3 at 8:42 by rowan194 (a new contributor), edited yesterday
How large are your chunks? Are they smaller than what sort would have made? It sounds like you are basically mimicking what plain sort would have done anyway...
– Kusalananda, Dec 3 at 11:27
2 Answers
It would help to know some characteristics of the data. If there's lots of duplication (say, you expect the output to only be 1GB of the original 1.4TB of data) there are some tricks that can be used. Alternatively, are there certain common duplicates (that you could special-case) interspersed among the other non-duplicated data?
Also, are each of the individual files-to-be-merged already deduplicated? Is the input file stored on a ZFS dataset with a high level of compression enabled? This might also squeeze some extra disk space out for you, especially if you're able to split the original and have all the pieces on the disk.
answered Dec 3 at 13:53 by Gumnos (a new contributor)
I'm using ZFS with compression so I'm already pushing the limits of my storage. I expect the duplicate count is relatively small (a fraction of a percent) and not a small set of lines massively duplicated. In addition, a given line may be duplicated over more than one sorted chunk. So final de-duplication really needs to be done as part of the merge.
– rowan194, Dec 3 at 14:13
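A minimal sketch of the point in the comment above, under the assumption that a merged stream of sorted chunks is piped through it: because the input is sorted, any duplicates are adjacent, so suppressing them only needs the previous line as state and no extra disk space. This Python filter is illustrative only; the script name and the example pipeline are hypothetical.

    # Hypothetical filter: drop duplicate lines from an already-sorted stream.
    # In sorted input, duplicates are adjacent, so one remembered line is enough.
    import sys

    def unique_sorted(lines):
        """Yield each line once, assuming the input is already sorted."""
        prev = None
        for line in lines:
            if line != prev:
                yield line
                prev = line

    if __name__ == "__main__":
        # Example use:  sort -m chunk.* | python3 dedup_sorted.py > merged.txt
        sys.stdout.writelines(unique_sorted(sys.stdin))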
Your requirements (no spare RAM, storage, or cloud) are going to make this really slow, but it is possible by writing your own file system driver. However, if you have the time and skill to do that, it would be faster/cheaper to rent/buy/sell/return a $37 2TB drive and use external sorting:
https://en.m.wikipedia.org/wiki/External_sorting
A workaround might be zram and/or 7z/filesystem compression: if the file is compressible, you could make room for a second copy.
https://en.m.wikipedia.org/wiki/Zram
https://en.m.wikipedia.org/wiki/Category:Compression_file_systems
If there is space for the output without removing the input, and the input is pre-sorted, then it's trivial.
answered Dec 3 at 10:48 by user1133275, edited Dec 3 at 17:43
No sorting needs to be done at all as the chunks are fully sorted. All that is needed is a merge: basically, open all files, read a line from each, and whichever is "lowest" compared to the others is output. Repeat until all input is exhausted. I could probably code this myself, but I want to check first that there's nothing already available.
– rowan194, Dec 3 at 14:05
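A minimal sketch of the merge described in that comment, for anyone who does end up coding it by hand: open every pre-sorted chunk, repeatedly emit the lowest current line, and drop adjacent duplicates on the way out. Python's heapq.merge does the "pick the lowest" step, reads each file sequentially, and needs no temporary files. The chunk file names here are hypothetical, lines are compared as raw bytes (i.e. LC_ALL=C ordering), and the number of chunks must stay below the open-file limit.

    # Hypothetical k-way merge of pre-sorted chunk files, with duplicate removal.
    import glob
    import heapq
    from contextlib import ExitStack

    def merge_chunks(pattern="chunk.*.sorted", out_path="merged.txt"):
        with ExitStack() as stack:
            # Open every chunk; each file object iterates over its lines.
            chunks = [stack.enter_context(open(name, "rb"))
                      for name in sorted(glob.glob(pattern))]
            out = stack.enter_context(open(out_path, "wb"))
            prev = None
            # heapq.merge yields the smallest current line across all chunks.
            for line in heapq.merge(*chunks):
                if line != prev:  # sorted stream, so duplicates are adjacent
                    out.write(line)
                    prev = line

    if __name__ == "__main__":
        merge_chunks()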