How to merge pre-sorted files into a single BIG file, without excessive memory or temporary disk use
I'm trying to sort a 1.4TB file (and ideally remove duplicate lines from it).
Splitting and sorting the individual chunks is not an issue, but reassembling them is turning out to be a challenge. I expected from the man page that 'sort -m' (under FreeBSD 11) would do a simple merge, creating an aggregate, perfectly sorted output, optionally suppressing duplicates with the -u option.
But after leaving it to run for a while, I discovered that sort had (so far) generated several hundred gigabytes' worth of temporary files, just as if it were sorting the input as normal.
I don't have enough disk space to store the same data three times. Are there any utilities that can do a simple merge of already-sorted files, without requiring temporary disk space?
=== Outcome ===
I ended up using a "standard" sort. It took around 50 hours of high CPU and disk load, including the generation of several hundred temporary files, despite the input already being perfectly sorted. I'm still interested in learning whether there is a simple utility that can neatly merge pre-sorted files.
Tags: freebsd, sort, merge
asked Dec 3 at 8:42 by rowan194 (a new contributor), edited yesterday
How large are your chunks? Are they smaller than what sort would have made? It sounds like you are basically mimicking what plain sort would have done anyway...
– Kusalananda, Dec 3 at 11:27
2 Answers
It would help to know some characteristics of the data. If there's lots of duplication (say, you expect the output to only be 1GB of the original 1.4TB of data) there are some tricks that can be used. Alternatively, are there certain common duplicates (that you could special-case) interspersed among the other non-duplicated data?
Also, are each of the individual files-to-be-merged already deduplicated? Is the input file stored on a ZFS dataset with a high level of compression enabled? This might also squeeze some extra disk space out for you, especially if you're able to split the original and have all the pieces on the disk.
answered Dec 3 at 13:53 by Gumnos (a new contributor)
I'm using ZFS with compression so I'm already pushing the limits of my storage. I expect the duplicate count is relatively small (a fraction of a percent) and not a small set of lines massively duplicated. In addition, a given line may be duplicated over more than one sorted chunk. So final de-duplication really needs to be done as part of the merge.
– rowan194, Dec 3 at 14:13
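A minimal sketch of the point in the comment above, under the assumption that a merged stream of sorted chunks is piped through it: because the input is sorted, any duplicates are adjacent, so suppressing them only needs the previous line as state and no extra disk space. This Python filter is illustrative only; the script name and the example pipeline are hypothetical.

    # Hypothetical filter: drop duplicate lines from an already-sorted stream.
    # In sorted input, duplicates are adjacent, so one remembered line is enough.
    import sys

    def unique_sorted(lines):
        """Yield each line once, assuming the input is already sorted."""
        prev = None
        for line in lines:
            if line != prev:
                yield line
                prev = line

    if __name__ == "__main__":
        # Example use:  sort -m chunk.* | python3 dedup_sorted.py > merged.txt
        sys.stdout.writelines(unique_sorted(sys.stdin))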
Your requirements (no spare RAM, storage, or cloud) are going to make this really slow, but it is possible by writing your own file system driver. However, if you have the time and skill to do that, it would be faster/cheaper to rent/buy/sell/return a $37 2TB drive and use external sorting:
https://en.m.wikipedia.org/wiki/External_sorting
A workaround might be zram and/or 7z/filesystem compression: if the file is compressible, you could make room for a second copy.
https://en.m.wikipedia.org/wiki/Zram
https://en.m.wikipedia.org/wiki/Category:Compression_file_systems
If there is space for the output without removing the input, and the input is pre-sorted, then it's trivial.
answered Dec 3 at 10:48 by user1133275, edited Dec 3 at 17:43
No sorting needs to be done at all as the chunks are fully sorted. All that is needed is a merge: basically, open all files, read a line from each, and whichever is "lowest" compared to the others is output. Repeat until all input is exhausted. I could probably code this myself, but I want to check first that there's nothing already available.
– rowan194, Dec 3 at 14:05
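A minimal sketch of the merge described in that comment, for anyone who does end up coding it by hand: open every pre-sorted chunk, repeatedly emit the lowest current line, and drop adjacent duplicates on the way out. Python's heapq.merge does the "pick the lowest" step, reads each file sequentially, and needs no temporary files. The chunk file names here are hypothetical, lines are compared as raw bytes (i.e. LC_ALL=C ordering), and the number of chunks must stay below the open-file limit.

    # Hypothetical k-way merge of pre-sorted chunk files, with duplicate removal.
    import glob
    import heapq
    from contextlib import ExitStack

    def merge_chunks(pattern="chunk.*.sorted", out_path="merged.txt"):
        with ExitStack() as stack:
            # Open every chunk; each file object iterates over its lines.
            chunks = [stack.enter_context(open(name, "rb"))
                      for name in sorted(glob.glob(pattern))]
            out = stack.enter_context(open(out_path, "wb"))
            prev = None
            # heapq.merge yields the smallest current line across all chunks.
            for line in heapq.merge(*chunks):
                if line != prev:  # sorted stream, so duplicates are adjacent
                    out.write(line)
                    prev = line

    if __name__ == "__main__":
        merge_chunks()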