How can I restructure a table of points with lengthy text ids?
I have a dataset that looks like this:
4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d,82,49.226215,9.23175,244,1531840519
4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d,86,49.22584,9.229898,260,1531840525
4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d,82,49.225726,9.22834,263,1531840530
ff9d6b955f214ff0@@@599e1d04-7fc4-4ce4-8c92-22e9d7ac6596,84,49.456019,11.026769,123,1531840479
ff9d6b955f214ff0@@@599e1d04-7fc4-4ce4-8c92-22e9d7ac6596,82,49.455454,11.02813,122,1531840484
ff9d6b955f214ff0@@@599e1d04-7fc4-4ce4-8c92-22e9d7ac6596,75,49.454935,11.029419,121,1531840489
These are multiple tracks of cars from a car-sharing service. The first field is a random ID that marks one rental (I guess something like hash(user, car, RNG)). Every row consists of: ID, speed, lat, lng, orientation (deg), epoch.
...but the ID column takes up at least half of the storage. (Don't ask me why this format was chosen.)
I want to keep the IDs, but I'm not sure how to do it. (In PostgreSQL with Django.)
My idea is to use one table that contains only the ID strings and a second table that contains all the other information plus a foreign key to the first table.
I just wonder whether there is already a feature for this directly within PostgreSQL, as I can imagine that many repetitions of the same string in a column is a common problem.
postgresql database-design
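The two-table idea described above can be sketched end to end with SQLite standing in for PostgreSQL (a sketch only; the table and column names are my assumptions, not an existing schema):

```python
import sqlite3

# Proposed normalization: one table holding each long rental ID exactly
# once, one table holding per-point data with a small integer foreign key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rental (
        id     INTEGER PRIMARY KEY,    -- small surrogate key
        ext_id TEXT UNIQUE NOT NULL    -- the 55-char text ID, stored once
    );
    CREATE TABLE point (
        rental_id   INTEGER NOT NULL REFERENCES rental(id),
        speed       INTEGER,
        lat         REAL,
        lng         REAL,
        orientation INTEGER,
        epoch       INTEGER
    );
""")

rows = [
    ("4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d", 82, 49.226215, 9.23175, 244, 1531840519),
    ("4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d", 86, 49.22584, 9.229898, 260, 1531840525),
    ("ff9d6b955f214ff0@@@599e1d04-7fc4-4ce4-8c92-22e9d7ac6596", 84, 49.456019, 11.026769, 123, 1531840479),
]
for ext_id, *rest in rows:
    # store the long ID once, then reference it by its integer key
    con.execute("INSERT OR IGNORE INTO rental (ext_id) VALUES (?)", (ext_id,))
    (rid,) = con.execute("SELECT id FROM rental WHERE ext_id = ?", (ext_id,)).fetchone()
    con.execute("INSERT INTO point VALUES (?, ?, ?, ?, ?, ?)", (rid, *rest))

print(con.execute("SELECT COUNT(*) FROM rental").fetchone()[0])  # 2 distinct rentals
print(con.execute("SELECT COUNT(*) FROM point").fetchone()[0])   # 3 points
```

Each distinct text ID is written once regardless of how many points reference it, which is exactly the saving the question is after.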
Comments:
[2] How much storage are we talking about that it requires action? – mustaccio, 10 hours ago
[2] What's the point? You'll need more space and more work to maintain both tables. – McNets, 10 hours ago
[1] One table for IDs + surrogate key + index, current table + new PK. Is there a chance to change the current IDs to some serial field? – McNets, 10 hours ago
[2] @McNets but wouldn't storing that ID only once, and the ID being something like an integer, reduce storage? Then you are only storing that long ID 1 time vs 3 times, like in the example above. – Anthony Genovese, 10 hours ago
[1] But you still have these IDs in one table >>> same disk space, plus additional space for the new key. And keep in mind you'll need 2 reads or 2 writes to select/insert that table. You would save space if you had more than one table. – McNets, 10 hours ago
Asked 11 hours ago by FooTheBar (new contributor); edited 6 hours ago by Evan Carroll.
2 Answers
4 votes; answered 7 hours ago by Evan Carroll, edited 7 hours ago.
First, you shouldn't use those IDs in the database as-is. To start off, they're clearly a compound ID, made of
- a bigint (hex encoded)
- a uuid
The @@@ is just a text separator. That means right off the bat you can make your table substantially smaller very easily.
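That both halves really do parse as a 64-bit integer and a UUID can be checked with a few lines of Python (just a sanity check on one sample ID from the question):

```python
import uuid

raw = "4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d"
hex_part, uuid_part = raw.split("@@@")

# the first half is 16 hex digits, so it fits in 64 bits (a bigint)
id_num = int(hex_part, 16)
assert id_num < 2**64

# the second half is a standard UUID
id_uuid = uuid.UUID(uuid_part)

print(id_num)
print(id_uuid)
```

So the 55-character string carries no information that an 8-byte bigint plus a 16-byte uuid cannot hold.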
ALTER TABLE foo
ADD COLUMN id_uuid uuid,
ADD COLUMN id_num bigint;
Now you need to populate them. (Essentially, we make sure the hex part is left-padded to 16 characters, prepend an x, and then cast it to a bigint by way of ::bit(64).)
UPDATE foo
SET id_num =
    (
      'x'
      || lpad(
           (regexp_match(id, '(.*?)@'))[1]  -- regexp_match arrays are 1-based
         , 16
         , '0'
         )
    )::bit(64)::bigint,
    id_uuid = (regexp_match(id, '.*@(.*)$'))[1]::uuid;
After this you could drop the original text id column; that's what you should do once you commit to a direction.
Now you have two id columns, but one of them is 16 bytes and the other is 8 bytes, compared to one text column that is 55 bytes (plus varlena overhead, which is about 6 bytes in your case).
What you really want to do, though, is rewrite the table with these two columns at the start, because that will affect the physical layout and make a big difference. Google for PostgreSQL column Tetris for the reasoning.
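The per-row saving quoted above is easy to verify (back-of-envelope only; the 6-byte varlena figure is the answer's own, and alignment padding is ignored):

```python
# bytes used by the text ID today vs. the uuid + bigint replacement
text_id = len("4cd1c79bf6a87692@@@b6aa5eeb-8a83-433f-a0ea-6c72708abc1d")  # 55 chars
varlena = 6                       # per-value overhead cited in the answer
old = text_id + varlena           # 61 bytes per row for the text ID
new = 16 + 8                      # uuid (16) + bigint (8)

print(old, new, old - new)        # 61 24 37 -> ~37 bytes saved per row
```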
Once you have the two ids and can start to understand the data, the questions emerge:
- Are both ids necessary? Can I drop one, and will it make a difference?
- If they're both necessary, can I store both of them as external vendor ids in a lookup table and assign my own simpler int4 ids?
Depending on the answers to those questions, you may want to set up a trigger so that on insertion a function you create takes the text argument, decodes it into a composite type of (uuid, bigint), and assigns the values to the new columns in one pass.
Moreover, you have another massive question here: this is a spatial problem. You have GPS coordinates. If you're talking about billions of rows, you may consider using a table like this until the routes are complete, and then migrating to a GIS solution with LINESTRINGs and PostGIS. Then you can map the routes in the database, find which roads were traveled and how many miles the routes were, and place them on a map (or do other fancy goodness). The GIS solution makes the points columnar and gives you the ability to index them, too.
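As a taste of what that GIS step buys you, here is the length of the second sample track from the question, computed with a plain haversine function (an approximation; in PostGIS, ST_Length over a geography LINESTRING would do this properly):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * r * asin(sqrt(a))

# the three points of the second rental in the question (10 s of driving)
track = [(49.456019, 11.026769), (49.455454, 11.02813), (49.454935, 11.029419)]
length = sum(haversine_m(*p, *q) for p, q in zip(track, track[1:]))
print(round(length), "metres")
```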
You may want a consultant.
1 vote; answered 9 hours ago by mustaccio.
I have around 1.4 billion (1,449,000,000) data points. So in uncompressed CSV it's around 120 GB.
At this scale, and given that an average track has 50 or more points, it's probably worth a shot. While storage itself is cheap these days, you're also improving physical I/O and buffer pool efficiency, squeezing more records into every page.
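At that row count the per-row saving from the other answer's restructuring adds up quickly (rough arithmetic, ignoring padding and indexes):

```python
rows = 1_449_000_000
saved_per_row = (55 + 6) - (16 + 8)    # text ID + varlena vs. uuid + bigint, in bytes
total_gib = rows * saved_per_row / 2**30
print(f"{total_gib:.1f} GiB")          # roughly 50 GiB less on disk
```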