Parallelise rsync using GNU Parallel
I have been using an rsync script to synchronize data at one host with the data at another host. The data consists of numerous small files that add up to almost 1.2 TB.
To sync those files, I have been using the rsync command as follows:
rsync -avzm --stats --human-readable --include-from proj.lst /data/projects REMOTEHOST:/data/
The contents of proj.lst are as follows:
+ proj1
+ proj1/*
+ proj1/*/*
+ proj1/*/*/*.tar
+ proj1/*/*/*.pdf
+ proj2
+ proj2/*
+ proj2/*/*
+ proj2/*/*/*.tar
+ proj2/*/*/*.pdf
...
...
...
- *
As a test, I picked two of those projects (8.5 GB of data) and executed the command above. Being a sequential process, it took 14 minutes 58 seconds to complete. So, for 1.2 TB of data, it would take several hours.
If I could run multiple rsync processes in parallel (using &, xargs, or parallel), it would save time.
I tried the command below with parallel (after cd-ing to the source directory), and it took 12 minutes 37 seconds to execute:
parallel --will-cite -j 5 rsync -avzm --stats --human-readable {} REMOTEHOST:/data/ ::: .
This should have been up to five times faster, but it wasn't. I think I'm going wrong somewhere.
How can I run multiple rsync processes to reduce the execution time?
linux rhel rsync gnu-parallel
asked Mar 13 '15 at 6:51 by Mandar Shinde
Are you limited by network bandwidth? Disk IOPS? Disk bandwidth? – Ole Tange, Mar 13 '15 at 7:25
If possible, we would want to use 50% of the total bandwidth. But parallelising multiple rsyncs is our first priority. – Mandar Shinde, Mar 13 '15 at 7:32
Can you let us know your network bandwidth, disk IOPS, disk bandwidth, and the bandwidth actually used? – Ole Tange, Mar 13 '15 at 7:41
In fact, I do not know those parameters. For the time being, we can neglect the optimization part; multiple rsyncs in parallel is the primary focus now. – Mandar Shinde, Mar 13 '15 at 7:47
There is no point in going parallel if the limitation isn't the CPU. It can even make matters worse (conflicting disk-arm movements on the source or target disk). – xenoid, Nov 22 at 15:55
6 Answers
Accepted answer (score 11), answered Apr 11 '15 at 13:53 by Mandar Shinde:
The following steps did the job for me:
- Run rsync with --dry-run first, to get the list of files that would be affected:
rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable /data/projects REMOTE-HOST:/data/ > /tmp/transfer.log
- Then feed the contents of /tmp/transfer.log to parallel in order to run 5 rsyncs in parallel, as follows:
cat /tmp/transfer.log | parallel --will-cite -j 5 rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} REMOTE-HOST:/data/ > result.log
Here, the --relative option ensured that the directory structure for the affected files remains the same at the source and destination (inside the /data/ directory), so the command must be run in the source folder (in this example, /data/projects).
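Putting the two steps together, here is a minimal sketch of the whole pipeline. It assumes rsync >= 3.1.0, so that --info=name (see the comment below) prints only the names of transferred items, and that no path contains a newline; hosts and paths are the ones from the answer:
# step 1: dry run, capturing one would-be-transferred name per line
rsync -azm --safe-links --ignore-existing --dry-run --info=name \
      /data/projects REMOTE-HOST:/data/ > /tmp/transfer.log
# step 2: replay the list with 5 parallel rsyncs; --relative keeps the layout,
# so run this from the source folder (/data/projects here)
parallel --will-cite -j 5 -a /tmp/transfer.log \
    rsync -azm --relative --stats --safe-links --ignore-existing \
          --human-readable {} REMOTE-HOST:/data/ > /tmp/result.log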
That would do one rsync per file. It would probably be more efficient to split up the whole file list using split and feed those filenames to parallel, then use rsync's --files-from to read the filenames out of each chunk and sync them. – Sandip Bhattacharya, Nov 17 '16 at 21:22
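Unpacked, the batching approach from this comment looks as follows (the uppercase paths are the comment's placeholders; backup.list is the file list from the dry run):
rm -f backups.*                      # clear chunks left over from a previous run
split -l 3000 backup.list backups.   # split the list into 3000-line chunks: backups.aa, backups.ab, ...
ls backups.* | parallel --line-buffer --verbose -j 5 \
    rsync --progress -av --files-from {} /LOCAL/PARENT/PATH/ REMOTE_HOST:REMOTE_PATH/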
How does the second rsync command handle the lines in the log that are not files, i.e. receiving file list ... done and created directory /data/? – Mike D, Sep 19 '17 at 16:42
On newer versions of rsync (3.1.0+), you can use --info=name in place of -v, and you'll get just the names of the files and directories. You may also want to pass --protect-args to the 'inner' transferring rsync if any files might have spaces or shell metacharacters in them. – Cheetah, Oct 12 '17 at 5:31
Answer (score 7), answered Apr 10 '17 at 3:28 by Mikhail:
I would strongly discourage anybody from using the accepted answer; a better solution is to crawl the top-level directory and launch a proportional number of rsync operations.
I have a large ZFS volume and my source was a CIFS mount. Both are linked with 10G, and in some benchmarks they can saturate the link. Performance was evaluated using zpool iostat 1.
The source drive was mounted like:
mount -t cifs -o username=,password= //static_ip/70tb /mnt/Datahoarder_Mount/ -o vers=3.0
Using a single rsync process:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/ /StoragePod
the I/O meter reads:
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.62K 0 130M
In synthetic benchmarks (CrystalDiskMark), sequential-write performance approaches 900 MB/s, which means the link is saturated. 130 MB/s is not very good, and it is the difference between waiting a weekend and waiting two weeks.
So, I built the file list and tried to run the sync again (I have a 64-core machine):
cat /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount.log | parallel --will-cite -j 16 rsync -avzm --relative --stats --safe-links --size-only --human-readable {} /StoragePod/ > /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount_result.log
and it had the same performance!
StoragePod 29.9T 144T 0 1.63K 0 130M
StoragePod 29.9T 144T 0 1.62K 0 130M
StoragePod 29.9T 144T 0 1.56K 0 129M
As an alternative, I simply ran rsync on the root folders:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/Marcello_zinc_bone /StoragePod/Marcello_zinc_bone
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/fibroblast_growth /StoragePod/fibroblast_growth
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/QDIC /StoragePod/QDIC
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/sexy_dps_cell /StoragePod/sexy_dps_cell
This actually boosted performance:
StoragePod 30.1T 144T 13 3.66K 112K 343M
StoragePod 30.1T 144T 24 5.11K 184K 469M
StoragePod 30.1T 144T 25 4.30K 196K 373M
In conclusion, as @Sandip Bhattacharya brought up, write a small script to get the directories and run rsync on them in parallel. Alternatively, pass a file list to rsync. But don't create new rsync instances for each file.
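A minimal sketch of that conclusion, using GNU parallel to fan one rsync out per top-level directory (paths follow the answer; the -j value is an assumption to tune for your hardware):
# one rsync per top-level source directory, 8 at a time; {/} is parallel's
# replacement string for the basename of the input line
find /mnt/Datahoarder_Mount/Mikhail -mindepth 1 -maxdepth 1 -type d | \
    parallel --will-cite -j 8 rsync -h -r -P -t {}/ /StoragePod/{/}/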
Answer (score 6), answered May 25 '16 at 14:15 by Julien Palard:
I personally use this simple one:
ls -1 | parallel rsync -a {} /destination/directory/
which is only useful when you have more than a few non-nearly-empty directories; otherwise you'll end up with almost every rsync terminating and the last one doing all the work alone.
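If directory names may contain spaces or newlines, a NUL-delimited variant of the same idea is safer (a sketch; the destination is the same placeholder):
# find emits \0-terminated names; parallel -0 consumes them safely
find . -mindepth 1 -maxdepth 1 -print0 | parallel -0 rsync -a {} /destination/directory/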
Answer (score 4), answered Mar 13 '15 at 7:25 by Ole Tange:
A tested way to do the parallelized rsync is described at http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallelizing-rsync
rsync is a great tool, but sometimes it will not fill up the available bandwidth. This is often a problem when copying several big files over high-speed connections.
The following will start one rsync per big file in src-dir to dest-dir on the server fooserver:
cd src-dir; find . -type f -size +100000 | \
    parallel -v ssh fooserver mkdir -p /dest-dir/{//}\; \
        rsync -s -Havessh {} fooserver:/dest-dir/{}
The directories created may end up with wrong permissions, and smaller files are not transferred. To fix those, run rsync a final time:
rsync -Havessh src-dir/ fooserver:/dest-dir/
If you are unable to push data, but need to pull it, and the files are called digits.png (e.g. 000000.png), you might be able to do:
seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/
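For reference, {//} is GNU parallel's replacement string for the directory part (dirname) of each input line, which is why the mkdir -p step recreates the source tree on the remote side before rsync copies each file into it. A quick illustration:
echo ./a/b/file.tar | parallel echo {//}    # prints: ./a/b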
Any other alternative in order to avoid find? – Mandar Shinde, Mar 13 '15 at 7:34
Limit the -maxdepth of find. – Ole Tange, Mar 17 '15 at 9:20
If I use the --dry-run option in rsync, I would have a list of files that would be transferred. Can I provide that file list to parallel in order to parallelise the process? – Mandar Shinde, Apr 10 '15 at 3:47
cat files | parallel -v ssh fooserver mkdir -p /dest-dir/{//}\; rsync -s -Havessh {} fooserver:/dest-dir/{} – Ole Tange, Apr 10 '15 at 5:51
Can you please explain the mkdir -p /dest-dir/{//}; part? Especially the {//} thing is a bit confusing. – Mandar Shinde, Apr 10 '15 at 9:49
Answer (score 0), answered Apr 10 '17 at 6:37 by ingopingo:
For multi-destination syncs, I am using:
parallel rsync -avi /path/to/source ::: host1: host2: host3:
Hint: all SSH connections are established with public keys in ~/.ssh/authorized_keys.
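A bare host1: destination means the remote user's home directory. With explicit destination paths, the same pattern would look like this sketch (host names and paths are hypothetical):
# one rsync per destination host, run concurrently
parallel rsync -avi /path/to/source {} ::: host1:/backup/ host2:/backup/ host3:/backup/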
Answer (score 0), answered Nov 22 at 15:43 by Sebastjanas (new contributor):
I always google for parallel rsync, as I always forget the full command, but no solution worked for me as I wanted: either it involves multiple steps or requires installing parallel. I ended up using this one-liner to sync multiple folders:
find dir/ -type d | xargs -P 5 -I % sh -c 'rsync -a --delete --bwlimit=50000 $(echo dir/%/ host:/dir/%/)'
-P 5 is the number of processes you want to spawn; use 0 for unlimited (obviously not recommended).
--bwlimit avoids using all the bandwidth.
-I % is the argument provided by find (the directory found in dir/).
$(echo dir/%/ host:/dir/%/) prints the source and destination directories, which are read by rsync as arguments. % is replaced by xargs with the directory name found by find.
Let's assume I have two directories in /home: dir1 and dir2. I run find /home -type d | xargs -P 5 -I % sh -c 'rsync -a --delete --bwlimit=50000 $(echo /home/%/ host:/home/%/)'. The rsync command will then run as two processes (two processes because /home has two directories) with the following arguments:
rsync -a --delete --bwlimit=50000 /home/dir1/ host:/home/dir1/
rsync -a --delete --bwlimit=50000 /home/dir2/ host:/home/dir2/
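For reference, a variant sketch of the same xargs pattern that skips the $(echo ...) indirection and matches only top-level directories (host is a placeholder; -printf '%P\n' assumes GNU find and strips the /home/ prefix from each result):
find /home -mindepth 1 -maxdepth 1 -type d -printf '%P\n' | \
    xargs -P 5 -I % rsync -a --delete --bwlimit=50000 /home/%/ host:/home/%/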
OK, can you explain $(echo dir/%/ host:/dir/%/) now? Please do not respond in comments; edit your answer to make it clearer and more complete. – Scott, Nov 22 at 16:16