How to merge pre-sorted files into a single BIG file, without excessive memory or temporary disk use

I'm trying to sort a 1.4TB file (and ideally remove duplicate lines from it).



Splitting and sorting the individual chunks is not an issue, but reassembling them is turning out to be a challenge. I expected from the man page that 'sort -m' (under FreeBSD 11) would do a simple merge, producing a single, perfectly sorted output and optionally suppressing duplicates with the -u option.



But after leaving it to run for a while, I discovered that sort had (so far) generated several hundred gigabytes' worth of temporary files, just as if it were sorting the input normally.



I don't have enough disk space to store the same data three times over. Are there any utilities that can do a simple merge of already-sorted files without requiring temporary disk space?



=== Outcome ===



I ended up using a "standard" sort. It took around 50 hours of high CPU and disk load, including the generation of several hundred temporary files, even though the input was already perfectly sorted. I'm still interested in learning whether there is a simple utility that can neatly merge pre-sorted files.


freebsd sort merge

asked Dec 3 at 8:42
rowan194


  • How large are your chunks? Are they smaller than what sort would have made? It sounds like you are basically mimicking what plain sort would have done anyway...
    – Kusalananda
    Dec 3 at 11:27



2 Answers


It would help to know some characteristics of the data. If there's lots of duplication (say, you expect the output to be only 1GB of the original 1.4TB of data), there are some tricks that can be used. Alternatively, are there certain common duplicates (that you could special-case) interspersed among the otherwise non-duplicated data?



Also, is each of the individual files-to-be-merged already deduplicated? Is the input file stored on a ZFS dataset with a high level of compression enabled? That might also squeeze some extra disk space out for you, especially if you're able to split the original and keep all the pieces on disk.
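
If it isn't yet known how much duplication there actually is, a quick streaming pass over one chunk can measure it. Here is a minimal sketch (Python; it assumes the chunk is already sorted, so duplicate lines are adjacent, and the filename is a placeholder):

    # Estimate how much de-duplication would save on one pre-sorted chunk.
    # Because the chunk is sorted, duplicate lines are adjacent.
    # "chunk.000.sorted" is a placeholder filename.
    dupes = 0
    total = 0
    prev = None
    with open("chunk.000.sorted", "rb") as f:  # bytes, matching sort's byte-wise view
        for line in f:
            total += 1
            if line == prev:
                dupes += 1
            prev = line
    print(f"{dupes} duplicate lines out of {total} ({100.0 * dupes / max(total, 1):.3f}%)")

Reading in binary keeps the comparison consistent with sort's byte-wise ordering rather than any locale-dependent collation.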






answered Dec 3 at 13:53
Gumnos

  • I'm using ZFS with compression, so I'm already pushing the limits of my storage. I expect the duplicate count is relatively small (a fraction of a percent), rather than a small set of lines being massively duplicated. In addition, a given line may be duplicated across more than one sorted chunk, so the final de-duplication really needs to be done as part of the merge.
    – rowan194
    Dec 3 at 14:13


Your requirements (no spare RAM, storage, or cloud capacity) are going to make this really slow, but it is possible, for example by writing your own file system driver. However, if you have the time and skill to do that, it would be faster and cheaper to rent or buy (and later sell or return) a $37 2TB drive and use external sorting:



https://en.m.wikipedia.org/wiki/External_sorting



A workaround might be zram and/or 7z/filesystem compression: if the file is compressible, you could make room for a second copy.



https://en.m.wikipedia.org/wiki/Zram



https://en.m.wikipedia.org/wiki/Category:Compression_file_systems



If there is space for the output without removing the input, and the input is pre-sorted, then the merge itself is trivial.
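
For example, here is a minimal sketch of that streaming merge with de-duplication (Python's heapq.merge buffers only one line per input file, so it needs no temporary disk space; it assumes every chunk is plain line-oriented text ending in a newline and sorted with the same byte-wise ordering, and the chunk filename pattern is a placeholder):

    # k-way merge of pre-sorted chunks with de-duplication.
    # heapq.merge streams its inputs, holding one buffered line per file,
    # so no temporary files and very little memory are needed.
    # Assumes all chunks are sorted with the same byte-wise ordering
    # (e.g. sorted under LC_ALL=C) and each ends with a trailing newline.
    # The "chunk.*.sorted" pattern is a placeholder.
    import glob
    import heapq
    import sys

    chunks = [open(name, "rb") for name in sorted(glob.glob("chunk.*.sorted"))]

    out = sys.stdout.buffer
    prev = None
    for line in heapq.merge(*chunks):
        if line != prev:          # suppress duplicates, like sort -u
            out.write(line)
            prev = line

    for f in chunks:
        f.close()

Run it with stdout redirected to the destination file; if there are many hundreds of chunks, the process's open-file limit may need to be raised first.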







answered Dec 3 at 10:48, edited Dec 3 at 17:43
user1133275

  • No sorting needs to be done at all, as the chunks are fully sorted. All that is needed is a merge: basically, open all the files, read a line from each, and output whichever line is "lowest" compared to the others. Repeat until all input is exhausted. I could probably code this myself, but I want to check first that there's nothing already available.
    – rowan194
    Dec 3 at 14:05










