Using rclone with Amazon cloud storage

About rclone

rclone is a command-line tool for transferring data to and from cloud storage providers, often described as 'rsync for the cloud'. This guide shows how to use it to transfer data to Amazon S3, including the Glacier storage class.

Requirements

You need an AWS account to use cloud storage. At this time, these accounts are not provided by the SCU.

Be advised that using AWS will incur costs separate from any existing agreement with the SCU or other WCM groups. Use the AWS Cost Explorer to estimate the costs for this service.

Once you have an AWS account, you need to create an IAM user with permissions to use S3 and Glacier. Please refer to the Amazon documentation on how to do that.
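
If you prefer the command line, the same step can be sketched with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured with an administrative profile; the user name rclone-user is a placeholder, and the AmazonS3FullAccess managed policy is broader than strictly necessary:

$ aws iam create-user --user-name rclone-user
$ aws iam attach-user-policy --user-name rclone-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
$ aws iam create-access-key --user-name rclone-user   # prints the access key pair used below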

Using other cloud providers

See the list of rclone-supported cloud providers for notes on how to set up providers other than AWS. In general, the process is similar to the AWS setup shown below; just follow the prompts of the configuration tool.
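
To see which backends your local rclone build supports, you can also ask rclone itself; the command prints a long JSON document describing every provider and its options:

$ rclone config providers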

Configure rclone to work with AWS

rclone is available on the SCU nodes. The following example was run on pascal.med.cornell.edu.
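
You can verify that the binary is available and check which version is installed (prompts and options vary slightly between rclone versions):

$ which rclone
$ rclone version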

The easiest way to configure rclone is via rclone config, which starts an interactive tool. The following is an abbreviated log of the settings for AWS:

$ rclone config
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config

n/s/q> n

name> amazon_store

Type of storage to configure.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 [...]
 4 / Amazon S3 Compliant Storage Provider (AWS, Alibaba, Ceph, Digital Ocean, Dreamhost, IBM COS, Minio, etc)
   \ "s3"
 [...]

Storage> 4

** See help for s3 backend at: https://rclone.org/s3/ **

Choose your S3 provider.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / Amazon Web Services (AWS) S3
   \ "AWS"
 [...]

provider> 1

Get AWS credentials from runtime (environment variables or EC2/ECS meta data if no env vars).
Only applies if access_key_id and secret_access_key is blank.
Enter a boolean value (true or false). Press Enter for the default ("false").
Choose a number from below, or type in your own value
 1 / Enter AWS credentials in the next step
   \ "false"
 [...]

env_auth> false

AWS Access Key ID.
Leave blank for anonymous access or runtime credentials.
Enter a string value. Press Enter for the default ("").

access_key_id> [YOUR ACCESS KEY ID]

AWS Secret Access Key (password)
Leave blank for anonymous access or runtime credentials.
Enter a string value. Press Enter for the default ("").

secret_access_key> [YOUR SECRET ACCESS KEY]

Region to connect to.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
   / The default endpoint - a good choice if you are unsure.
 1 | US Region, Northern Virginia or Pacific Northwest.
   | Leave location constraint empty.
   \ "us-east-1"
 [...]

region> 1

Endpoint for S3 API.
Leave blank if using AWS to use the default endpoint for the region.
Enter a string value. Press Enter for the default ("").

endpoint> [LEAVE BLANK]

Location constraint - must be set to match the Region.
Used when creating buckets only.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / Empty for US Region, Northern Virginia or Pacific Northwest.
   \ ""
 [...]

location_constraint> 1

Canned ACL used when creating buckets and storing or copying objects.
[...]
Choose a number from below, or type in your own value
 1 / Owner gets FULL_CONTROL. No one else has access rights (default).
   \ "private"
 [...]

acl> 1

The server-side encryption algorithm used when storing this object in S3.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / None
   \ ""
 [...]

server_side_encryption> 1

If using KMS ID you must provide the ARN of Key.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / None
   \ ""
 [...]

sse_kms_key_id> 1

The storage class to use when storing new objects in S3.
Enter a string value. Press Enter for the default ("").
Choose a number from below, or type in your own value
 1 / Default
   \ ""

storage_class> 1

Edit advanced config? (y/n)
y) Yes
n) No

y/n> n

Remote config
--------------------
[amazon_store]
type = s3
provider = AWS
env_auth = false
access_key_id = [REDACTED]
secret_access_key = [REDACTED]
region = us-east-1
acl = private
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote

y/e/d> y

Current remotes:

Name                 Type
====                 ====
amazon_store         s3

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config

e/n/d/r/c/s/q> q

$


Once this configuration is saved, you are ready to use rclone.
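
The settings are stored in a plain-text file (typically ~/.config/rclone/rclone.conf). You can locate and review them at any time without re-running the interactive tool:

$ rclone config file    # print the location of the configuration file
$ rclone listremotes    # list the names of the configured remotes
$ rclone config show    # dump the full configuration, including credentials

Since rclone config show prints your access keys, treat its output accordingly; the "Set configuration password" option in the rclone config menu encrypts the file if needed.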

Note that this example uses the standard S3 storage class, which is relatively expensive. Within the configuration, you can select a less expensive storage class such as Infrequent Access or Glacier. Learn more about Amazon S3 storage classes.
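
If you do not want to edit the configuration, the storage class can also be set per transfer with a flag. A minimal sketch, using GLACIER as one possible value and the bucket name from the example below:

$ rclone copy -P --s3-storage-class GLACIER ~/test_data amazon_store:rclone-tutorial

Alternatively, add storage_class = GLACIER to the remote's section in the configuration file.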

To transfer data to the new remote (named amazon_store in the example above), use this syntax:

$ rclone copy -P ~/test_data amazon_store:rclone-tutorial
Transferred:   	      192M / 192 MBytes, 100%, 24.736 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            2 / 2, 100%
Elapsed time:        7.7s
$

Note that the -P flag displays a progress indicator during the transfer.
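
To check what ended up in the bucket, list the remote:

$ rclone lsd amazon_store:                   # list buckets
$ rclone ls amazon_store:rclone-tutorial     # list files with their sizes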

To download data from the cloud, use this:

$ rclone copy -P amazon_store:rclone-tutorial ~/test-download
Transferred:   	      192M / 192 MBytes, 100%, 77.531 MBytes/s, ETA 0s
Errors:                 0
Checks:                 0 / 0, -
Transferred:            2 / 2, 100%
Elapsed time:        2.4s
$ ls ~/test-download/
test1.file  test2.file
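
You can also have rclone verify a transfer without copying anything; rclone check compares source and destination by size and hash and reports any mismatches:

$ rclone check ~/test_data amazon_store:rclone-tutorial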


Tuning rclone performance

By default, rclone is not optimized for our infrastructure. Increasing the maximum number of parallel transfers and the buffer size can increase transfer speed. This will, however, use more bandwidth and RAM, so results will vary depending on the node you run on. The following flags are recommended:

--bwlimit=0          # Do not limit bandwidth
--buffer-size=128M   # Buffer for each transfer
--checkers=32        # Run 32 checksum checkers in parallel
--transfers=32       # Run 32 transfers in parallel

Please be advised that the actual performance gain depends on both the source and destination systems, as well as their current load. Results will also vary with the type of data transferred (many small files versus a few large files). Use these parameters as a starting point for your own fine-tuning.

Use these parameters as follows:

$ rclone --bwlimit=0 --buffer-size=128M --checkers=32 --transfers=32 copy -P ~/local/source amazon_store:bucket-name

Rclone Browser - graphical user interface

If you want to run rclone on a desktop, you can use Rclone Browser as a graphical user interface to your remotes. For this to work, the rclone binary must also be installed on your local machine. Refer to the respective project sites for documentation on how to install and configure them.

Please note: this is not suitable for moving large amounts of data. Use it only for smaller transfers or for managing your inventory.