Bulk data generation

I need to generate nearly 1 billion records of unique integers. I tried awk, but it does not generate more than 5 million records. Below is what I have tried so far:

awk -v loop=10000000000 -v range=10000000000 'BEGIN{
    srand()
    do {
        numb = 1 + int(rand() * range)
        if (!(numb in prev)) {
            print numb
            prev[numb] = 1
            count++
        }
    } while (count < loop)
}'

But it did not generate more than 599,160,237 records before the process got killed automatically.

awk regular-expression

asked Dec 26 '15 at 14:43 by anurag, edited Nov 21 at 21:34 by Rui F Ribeiro

  • You need to provide us with what you have done so far, so that we might spot why it's not working. Post the code in your question, at least the relevant parts.
    – zagrimsan
    Dec 26 '15 at 14:46

  • awk -v loop=10000000000 -v range=10000000000 'BEGIN{ srand() do { numb = 1 + int(rand() * range) if (!(numb in prev)) { print numb prev[numb] = 1 count++ } } while (count<loop) }' — tried the above, but only 599,160,237 records were generated, and after that the process got killed. :(
    – anurag
    Dec 26 '15 at 14:47

  • Please edit your question and put the relevant information there so that people see it right away without having to wade through comments (it's also easier to format text in a question).
    – zagrimsan
    Dec 26 '15 at 14:48

  • Python may be a more suitable choice for this task; related link: stackoverflow.com/questions/2076838/…
    – Ijaz Ahmad Khan
    Dec 26 '15 at 15:07

  • Unfortunately I don't know Python. :(
    – anurag
    Dec 26 '15 at 15:10

4 Answers

You could use GNU seq plus sort: first generate a list of 1B unique integers (in sequential order) with seq, then shuffle them into random order with sort -R. While this is not CPU-efficient, it is memory-agnostic, as sort will use as much memory as is available and then fall back to temporary files.

This will take several minutes (depending on your machine's CPU/RAM/disk):

$ seq 1000000000 > 1B.txt

$ ls -lhog 1B.txt
-rw-rw-r-- 1 9.3G Dec 26 17:31 1B.txt

$ sort -R 1B.txt > 1B.random.txt

If you have access to a machine with enough RAM, you can use GNU shuf:

$ shuf -i 1-1000000000 > 1B.random.txt

Empirically, shuf needed ~8 GB of free RAM and ~6 minutes of runtime on my machine.

answered Dec 26 '15 at 22:46 by A. Gordon

  • I think this is a very good solution, and it can easily support the additional requirement that the range of the numbers be different from (larger than) the count of generated numbers. Just make the argument to seq the range, and after shuffling the numbers, use e.g. head -n 10000000 1B.random.txt > 10M.random.txt to take the first 10 million numbers from the set (the question has range = 10 billion and count = 1 billion).
    – zagrimsan
    Dec 27 '15 at 7:26

  • Worked for me... Thanks a ton, buddy :)
    – anurag
    Dec 27 '15 at 7:44

It would be better to use a program that does not allocate much memory to complete the task. However, there is a problem with random number generation: if you need completely random numbers, you need to use a "good" random number source such as /dev/urandom.

I think this C program can help you with the task. It generates the numbers on the fly, and you specify three arguments: the start integer, the end integer, and the count of numbers to generate. So to generate 100 ints in the range (0..200), you do:

./mkrnd 0 200 100

You will probably want to redirect the output to a file, so do:

./mkrnd 0 200 100 >randomints.txt

Compiling is simple: just do gcc mkrnd.c -o mkrnd (or I can compile it for you).

It is believed to be fast enough, but I think it will still take hours to run. For me, on an Athlon64 5000+:

% time null ./mkrnd 0 1000000000 10000000

real 0m33.471s
user 0m0.000s
sys 0m0.000s

Remove #if 0 ... #endif to make it grab random integers from /dev/urandom (maybe slower).

As for memory requirements: it takes only 4K RSS on a musl system during its entire runtime.

EDIT: Replacing gettimeofday with clock_gettime gives double the speed.

answered Dec 27 '15 at 1:03 by user140866, edited Dec 27 '15 at 6:18
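
The mkrnd.c source the answer links to is not reproduced on this page. As a rough illustration only, a minimal sketch of a generator with the same command-line interface might look like this (the seeding and the choice of random() are assumptions, and, like any purely random draw, it does not guarantee unique outputs):

/* mkrnd-like sketch (NOT the program from the answer above).
 * Prints COUNT pseudo-random integers in [START, END].
 * Assumptions: seeded from clock_gettime(), uses random();
 * the modulo introduces a slight bias; outputs are NOT deduplicated. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s start end count\n", argv[0]);
        return 1;
    }
    long start = atol(argv[1]);
    long end   = atol(argv[2]);
    long count = atol(argv[3]);
    long span  = end - start + 1;

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* cheap seed, per the EDIT above */
    srandom((unsigned)(ts.tv_sec ^ ts.tv_nsec));

    for (long i = 0; i < count; i++)
        printf("%ld\n", start + random() % span);
    return 0;
}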






In Python 3.4 you can generate and play with huge numbers like this:

#!/bin/python3.4
import random
print(random.sample(range(1, 1000000000000), 1000000000))

This will print one billion unique numbers.

If there is a memory problem allocating such a huge sample, one can use the range and print the numbers in a loop, but they will come out in sequence, not at random:

x = range(1, 1000000000000)
for i in x:
    print(i)  # or process i, whatever the operation is

answered Dec 26 '15 at 15:57 by Ijaz Ahmad Khan, edited Dec 26 '15 at 17:12

  • I think you got the scale wrong: max 10 billion, not 1000 billion... Produces MemoryError here, but that's what I'd expect from that approach unless given huge resources.
    – zagrimsan
    Dec 26 '15 at 16:13

  • range(1, 1000000000000) will not result in a memory error because it doesn't allocate the memory at once; the number after the comma is the sample size, and that will result in a memory error if it's too huge. The other approach would be to use range(1, 1000000000000) and get the numbers one by one in a loop.
    – Ijaz Ahmad Khan
    Dec 26 '15 at 17:08

  • With the second approach you're just iterating over x; that would just print the numbers between 1 and 1000000000000 in order, with nothing random about it.
    – iruvar
    Dec 28 '15 at 4:35

  • Yes, it will not be random, but instead of printing you may add randomness to it along the way and use it the way you want.
    – Ijaz Ahmad Khan
    Dec 28 '15 at 11:01

The reason for the process getting killed might be that awk has a bug or limitation in arrays that your code is hitting, or your code is simply so space-consuming that it hits some per-process limit.

I mean, you're trying to build an array with a maximum index of 10 billion (based on the range) holding 1 billion defined values. So awk potentially needs to reserve space for 10 billion variables. I'm not familiar enough with awk internals to tell how much space that would mean, but 10 billion 16-bit integers would mean 18.5 GB, and even if awk is clever in building such a sparse array, it would need over 1.8 GB just to store the numbers you're generating.

To keep the results unique, you need to keep all the previous values somewhere, so the approach is necessarily heavy on space, but it might be that some other language would allow the algorithm to finish.

How to escape the huge memory requirements, then?

A. Gordon presents one option, relying on a sequence and just shuffling it for randomness. That works well when the results really have to be numbers and you want them to come from a given range. If the range is more complex than one to N, you could generate the sequence with awk and then pass it to sort -R. Also see my comment on that answer for how to make the range and the count of produced numbers differ.

One option could be to use a cryptographic (hash) function to produce the random numbers, but in that case you can't define the range to be 1 to N, since those functions usually produce N-bit output and you can't mangle the results without risking a collision (a duplicate number in the set). Such functions, however, are guaranteed to easily produce 1 billion unique outputs (those hash functions are designed not to produce the same output twice even over an extremely large number of calls). Depending on the implementation, their output might not be numbers but strings, and one could convert the string output to numbers, but since their output size is typically quite large, the range of numbers resulting from the conversion would be really huge. You could start from this Stackoverflow question if you're interested in exploring this option.
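
To make the "unique by construction" idea concrete (this code is not from the answer; the splitmix64 finalizer is used here as an assumed example of such a function): a 64-bit mixing function built only from invertible steps is a bijection, so feeding it the counters 0, 1, 2, ... yields values that can never repeat, at the cost of the range being the full 64 bits rather than 1 to N. A minimal sketch:

/* Sketch: unique pseudo-random values from a bijective 64-bit mixer.
 * Each step (add, odd-constant multiply, xor-shift) is invertible,
 * so distinct counters map to distinct outputs -- duplicates are
 * impossible by construction. Range: all of 0..2^64-1. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;                  /* golden-ratio increment */
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

int main(void)
{
    for (uint64_t i = 0; i < 1000000000ULL; i++) /* 1 billion unique values */
        printf("%" PRIu64 "\n", mix64(i));
    return 0;
}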



If you can accept the chance of a collision, even though it is rather unlikely, you could try using a good source of randomness (/dev/urandom is one option) to generate the 1 billion numbers. I don't know how likely it is that you would get 1 billion unique numbers that way, but it would surely be worth a try. There is no memory-efficient way of telling whether the result set contains a duplicate, though, since that would require holding all the numbers in memory for comparison.

answered Dec 26 '15 at 15:32 by zagrimsan, edited May 23 '17 at 11:33 by Community
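
For illustration (not from the answer): drawing 64-bit values straight from /dev/urandom is simple, and with 10^9 draws from a 2^64 space the birthday bound puts the chance of at least one duplicate at roughly n^2/2^65, i.e. about 3%. A minimal sketch:

/* Sketch: print 1 billion 64-bit values read from /dev/urandom.
 * Duplicates are possible but unlikely (birthday bound ~3% overall);
 * nothing here checks for them. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    FILE *src = fopen("/dev/urandom", "rb");
    if (!src) { perror("/dev/urandom"); return 1; }

    uint64_t buf[4096];
    uint64_t remaining = 1000000000ULL;
    while (remaining > 0) {
        size_t want = remaining < 4096 ? (size_t)remaining : 4096;
        size_t got = fread(buf, sizeof buf[0], want, src);
        if (got == 0) { perror("fread"); return 1; }
        for (size_t i = 0; i < got; i++)
            printf("%" PRIu64 "\n", buf[i]);
        remaining -= got;
    }
    fclose(src);
    return 0;
}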






  • Thanks for the explanation... can you please provide a piece of code to do the same task in some other language, maybe Python... I don't know Python, unfortunately :(
    – anurag
    Dec 26 '15 at 15:36

  • I think A. Gordon's approach is very good, as it gets rid of the memory requirement altogether.
    – zagrimsan
    Dec 27 '15 at 7:20