Bulk data generation
I need to generate nearly 1 billion records of unique integers.
I tried with awk, but it does not generate more than 5 million records.
Below is what I have tried so far:
awk -v loop=10000000000 -v range=10000000000 'BEGIN{
    srand()
    do {
        numb = 1 + int(rand() * range)
        if (!(numb in prev)) {
            print numb
            prev[numb] = 1
            count++
        }
    } while (count < loop)
}'
But it does not generate more than 599160237 records, and after that the process gets killed automatically.
Tags: awk, regular-expression
You need to provide us with what you have done so far, so that we might spot why it's not working. Post the code in your question, at least the relevant parts.
– zagrimsan
Dec 26 '15 at 14:46
I tried with the awk script shown above, but only 599160237 records were generated and after that the process got killed. :(
– anurag
Dec 26 '15 at 14:47
Please edit your question and put the relevant information there so that people see it right away without having to wade through comments (it's also easier to format text in the question).
– zagrimsan
Dec 26 '15 at 14:48
Python may be a more suitable choice for this task; see this related question: stackoverflow.com/questions/2076838/…
– Ijaz Ahmad Khan
Dec 26 '15 at 15:07
Unfortunately I don't know Python :(
– anurag
Dec 26 '15 at 15:10
4 Answers
You could use GNU seq + sort to first generate a list of 1 billion unique integers (in sequential order), then sort -R to shuffle them randomly.
While this is not CPU-efficient, it is memory-agnostic: sort will use as much memory as is available and then fall back to temporary files.
This will take several minutes (depending on your machine's CPU/RAM/disk):
$ seq 1000000000 > 1B.txt
$ ls -lhog 1B.txt
-rw-rw-r-- 1 9.3G Dec 26 17:31 1B.txt
$ sort -R 1B.txt > 1B.random.txt
If you have access to a machine with enough RAM, you can use GNU shuf:
$ shuf -i 1-1000000000 > 1B.random.txt
Empirically, shuf needed ~8 GB of free RAM and ~6 minutes of runtime on my machine.
I think this is a very good solution, and it can also easily support a requirement that the range of the numbers be different from (larger than) the count of generated numbers. Just make the argument for seq the range; after shuffling the numbers, e.g. head -n 10000000 1B.random.txt > 10M.random.txt would take the first 10 million numbers from the set (the question has range=10 billion and count=1 billion).
– zagrimsan
Dec 27 '15 at 7:26
Worked for me... Thanks a ton, buddy :)
– anurag
Dec 27 '15 at 7:44
It would be better to use a program that does not allocate much memory to complete the task. However, there is a problem with the random number generation: if you need truly random numbers, you need to use a "good" random number source such as /dev/urandom.
I think this C program can help you with this task. It generates the numbers on the fly, and you specify three arguments: the start int, the end int, and how many numbers to generate. So to generate 100 ints in the range (0..200), you would do:
./mkrnd 0 200 100
You will probably want to redirect the output to a file, so do:
./mkrnd 0 200 100 >randomints.txt
Compiling is simple: just do gcc mkrnd.c -o mkrnd (or I can compile it for you).
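The program itself is not reproduced above. Purely as a rough, hypothetical sketch of what such a tool could look like (this is not the original mkrnd source, and it makes no attempt to avoid duplicate values), consider something along these lines:

/* mkrnd.c -- hypothetical sketch, not the original program.
 * Usage: ./mkrnd START END COUNT
 * Prints COUNT pseudo-random integers in [START, END]; duplicates are possible. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s start end count\n", argv[0]);
        return 1;
    }
    long long start = atoll(argv[1]);
    long long end   = atoll(argv[2]);
    long long count = atoll(argv[3]);
    long long span  = end - start + 1;

    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);           /* seed from the clock */
    srandom((unsigned)(ts.tv_nsec ^ ts.tv_sec));

    for (long long i = 0; i < count; i++) {
        /* combine two 31-bit random() values to cover ranges wider than 31 bits;
           the modulo introduces a small bias, which is ignored here */
        unsigned long long r = ((unsigned long long)random() << 31) | (unsigned long long)random();
        printf("%lld\n", start + (long long)(r % (unsigned long long)span));
    }
    return 0;
}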
It should be fast enough, but I think it will still take hours to finish. For me, on an Athlon64 5000+:
% time null ./mkrnd 0 1000000000 10000000
real 0m33.471s
user 0m0.000s
sys 0m0.000s
Remove #if 0 ... #endif to make it grab random integers from /dev/urandom (maybe slower).
As for memory requirements: it takes only 4 KB RSS on a musl system during its entire runtime.
EDIT: Replacing gettimeofday with clock_gettime gives double speed.
In Python 3.4 you can generate and play with huge numbers like this:
#!/bin/python3.4
import random
print(random.sample(range(1, 1000000000000),1000000000))
This will print one billion unique numbers.
If there is a memory problem allocating such a huge sample, you can iterate over the range and print the numbers in a loop, but then they come out in sequence, not in random order:
x = range(1, 1000000000000)
for i in x:
    print(i)  # or process i, whatever the operation is
I think you got the scale wrong: max 10 billion, not 1000 billion... Produces a MemoryError here, but that's what I'd expect from that approach unless given huge resources.
– zagrimsan
Dec 26 '15 at 16:13
range(1, 1000000000000) will not result in a memory error because it doesn't allocate the memory at once; the number after the comma is the sample size, and that will result in a memory error if it's too huge. The other approach would be to use range(1, 1000000000000) and get the numbers one by one in a loop.
– Ijaz Ahmad Khan
Dec 26 '15 at 17:08
With the second approach, you're just iterating over x; that would just print the numbers between 1 and 1000000000000 in order, nothing random about it.
– iruvar
Dec 28 '15 at 4:35
Yes, it will not be random, but instead of just printing you may add randomness along the way and use it the way you want.
– Ijaz Ahmad Khan
Dec 28 '15 at 11:01
The reason for the process getting killed might be that awk has a bug or limitation in arrays that your code is hitting, or your code is simply so space-consuming that it hits some per-process limit.
I mean, you're trying to build an array with a maximum index of 10 billion (based on the range) with 1 billion defined values. So awk potentially needs to reserve space for 10 billion variables. I'm not familiar enough to tell how much space that would mean, but 10 billion 16-bit integers would mean 18.5 GB, and even if awk is clever in building such a sparse array, it would require over 1.8 GB just to store the numbers you're generating.
To keep the results unique, you need to store all the previous values somewhere, so the task is necessarily heavy on space requirements, but it might be that some other language would allow the algorithm to finish.
How to escape the huge memory requirements, then?
A. Gordon presents one option: rely on a sequence and just shuffle it for randomness. That works well when the results really need to be numbers and you want them to come from a given range. If the range is more complex than 1 to N, you could generate the sequence with awk and then pass it to sort -R. Also see my comment on that answer for how to make the range and the count of produced numbers differ.
One option could be to use a cryptographic hash function to produce the random numbers, but in that case you can't define the range to be 1 to N, since those functions usually produce fixed-width N-bit output and you can't mangle the results without risking a collision (a duplicate number in the set). Such functions, however, would be practically guaranteed to produce 1 billion unique outputs, as they are designed not to produce the same output twice even over an extremely large number of calls. Depending on the implementation, their output might not be numbers but strings, and while one could convert the string output to numbers, their output size is typically quite large, so the range of the resulting numbers would be really huge. You could start from this Stack Overflow question if you're interested in exploring this option.
If you can accept the risk of a collision, even if it is rather unlikely, you could try using a good source of randomness (/dev/urandom is one option) to generate the 1 billion numbers. I don't know how likely it is that you would get 1 billion unique numbers that way, but it would surely be worth a try. There is no memory-efficient way of telling whether there is a duplicate in the result set, though, since that would require holding all the numbers in memory for comparison.
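To make that last idea concrete, here is a minimal, hypothetical C sketch (not code from the answer itself) that reads raw bytes from /dev/urandom and maps them onto a range like the one in the question; note that it makes no attempt to detect duplicates, so uniqueness is only probabilistic:

/* urand_ints.c -- hypothetical sketch illustrating the /dev/urandom idea.
 * Prints COUNT random integers in [1, RANGE]; duplicates are NOT filtered out. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t range = 10000000000ULL;  /* assumed range, as in the question */
    const uint64_t count = 1000000000ULL;   /* how many numbers to print */

    FILE *ur = fopen("/dev/urandom", "rb");
    if (!ur) {
        perror("/dev/urandom");
        return 1;
    }
    for (uint64_t i = 0; i < count; i++) {
        uint64_t r;
        if (fread(&r, sizeof r, 1, ur) != 1) {  /* 8 random bytes per number */
            perror("fread");
            return 1;
        }
        /* map to [1, range]; the modulo bias is negligible for this purpose */
        printf("%llu\n", (unsigned long long)(1 + r % range));
    }
    fclose(ur);
    return 0;
}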
Thanks for the explanation... Can you please provide a piece of code to do the same task in some other language, maybe Python? I don't know Python, unfortunately :(
– anurag
Dec 26 '15 at 15:36
I think A. Gordon's approach is very good, as it gets rid of the memory requirement altogether.
– zagrimsan
Dec 27 '15 at 7:20