Python S3 Data Backup Application

I have been using cloud-based data backup for years (mainly CrashPlan) to provide an extra level of protection on top of my on-prem RAID1 NAS drive. These products are great as easy-to-use, set-and-forget solutions (which is worth paying for), but I got to thinking: could I write something with Python and S3 that did a similar job? The answer is of course YES! In this post I’ll go through the steps in the code that I use to provide a working solution. For reference this is a Windows-based solution, but it could easily be modified to support Mac or Linux via the file path handling.

This application does the following:

  1. Reads in the current files on the selected drive as well as what's stored in S3
  2. Compares the two, uploads any missing files, and adds a new copy of those that have been updated (which requires S3 versioning to be enabled)

The python source code can be downloaded HERE

As input, the code simply takes the local drive you want to back up, the folder to store it in on S3, and the S3 storage class:

if __name__ == '__main__':
    with open(log_file, 'w') as log_file_out:
        sync = S3Sync("s3_bucket", log_file_out)
        sync.sync("I:\\", "i_drive", 'STANDARD_IA')

Firstly you need an S3 account with API key access and the awscli application credentials set up on your PC. This gives you API access to your S3 bucket to write the files. Have a read of this POST I put together on AWS creds and boto3 if you need to set this up. Note that the S3 bucket you set up should have versioning enabled, to allow us to keep previous versions of files we update (and not just one copy).
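As a sketch of that versioning step (the bucket name here is a placeholder, and this assumes boto3 plus your awscli credentials are already in place), the configuration S3 expects is just a small dict:

```python
# The versioning configuration S3 expects (a plain dict)
versioning_config = {"Status": "Enabled"}

# With boto3 installed and awscli credentials set up, applying it is one call:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(Bucket="your-backup-bucket",
#                          VersioningConfiguration=versioning_config)
```

You can also enable versioning once in the S3 console instead, which is what matters for keeping old copies of updated files.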

The code is broken into a few functions. What we need to do is:

  1. Read all the current files in the S3 buckets (including write time), so we can work out which files locally are newer than the ones we have stored
  2. Read the local file system, ignoring various files and folders that we don't want to backup
  3. Compare the files locally with the S3 stored ones, to create a list of files that we want to backup/update
  4. Back them up to S3

Let's have a look at each of these code blocks.

Step 1 – Read in files from S3

To do this we use the S3 list_objects_v2 API call, inside a loop to deal with its limit of 1,000 objects per call:

response = self._s3.list_objects_v2(Bucket=bucket)
contents_list = response.get("Contents", [])

while "NextContinuationToken" in response:
    response = self._s3.list_objects_v2(Bucket=bucket, ContinuationToken=response['NextContinuationToken'])
    contents_list.extend(response.get("Contents", []))

This now gives us a full list of all the objects stored in this S3 bucket.
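For the comparison in Step 3 we need a quick lookup of each key's last-modified time (stored as self.object_keys in the attached code). A minimal sketch, using illustrative entries shaped like the Contents records list_objects_v2 returns:

```python
from datetime import datetime, timezone

# illustrative Contents entries, shaped like list_objects_v2 output
contents_list = [
    {"Key": "i_drive/photos/cat.jpg",
     "LastModified": datetime(2021, 5, 1, tzinfo=timezone.utc),
     "Size": 1024},
    {"Key": "i_drive/docs/notes.txt",
     "LastModified": datetime(2021, 6, 2, tzinfo=timezone.utc),
     "Size": 2048},
]

# map each S3 key to its last-modified timestamp for quick comparison later
object_keys = {obj["Key"]: obj["LastModified"] for obj in contents_list}
```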

Step 2 – Read the local filesystem

For this we use the os.scandir function. This reads the current folder (and subfolders, via a recursive loop) for files and returns both filenames and metadata like path, last write time and file size (used to show the amount remaining).

for entry in scandir(source_folder):
    if entry.is_dir(follow_symlinks=False):
        if entry.name.lower() not in self.ignore_folders \
                and not entry.path.lower().startswith(tuple(self.system_folder_ignore_list)) \
                and not entry.name.lower().startswith(tuple(self.ignore_folder_start)) \
                and not entry.name.startswith(tuple(ignore_starts_with)) \
                and not any(ele in entry.path.lower() for ele in self.ignore_folder_contains):
            self.scantree(entry.path)
    else:
        if not entry.path.lower().endswith(tuple(self.ignore_filetypes)) \
                and not entry.name.startswith(tuple(ignore_starts_with)):
            self.path_list.append(entry)

This block does a recursive dive into the folder and its subfolders, only returning filenames/paths that pass the pattern matching against our ignore file/folder arguments.
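Stripped of the ignore lists, the recursion pattern looks like this (a simplified, hypothetical scantree, not the full version in the attached code):

```python
import os

def scantree(source_folder, ignore_filetypes=(".tmp",), path_list=None):
    """Recursively collect os.DirEntry objects, skipping ignored file types."""
    if path_list is None:
        path_list = []
    for entry in os.scandir(source_folder):
        if entry.is_dir(follow_symlinks=False):
            # recurse into the subfolder, sharing the same result list
            scantree(entry.path, ignore_filetypes, path_list)
        elif not entry.path.lower().endswith(ignore_filetypes):
            path_list.append(entry)
    return path_list
```

Because DirEntry objects carry their stat info, we get the last write time and size later without extra filesystem calls.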

Step 3 – Compare the local files with the S3 stored ones, to create a list of files that we want to backup/update

Now that we have all the existing S3 objects and a full list of the local filesystem objects, we can work out which files are missing or updated. These files are added to a dictionary, which we then use to upload them.

Step 3a – create the dictionary file_to_send_data
(note this is based on Windows OS and the Sydney TZ – adjust here for Mac/Linux and your local TZ):

for file in files:
    p = Path(file.path)
    # remove drive letter (e.g. "I:\")
    s3_converted_file_path = str(p.parent)[3:]
    if s3_converted_file_path:
        # change "\" to "/"
        s3_converted_full_path = f"{bucket_folder}/" + s3_converted_file_path.replace("\\", "/") + "/" + p.name
    else:
        # root dir file - just add to the folder
        s3_converted_full_path = f"{bucket_folder}/" + p.name
    file_time = datetime.datetime.fromtimestamp(file.stat().st_mtime, tz=pytz.timezone("Australia/Sydney"))
    s3_file_time = self.object_keys.get(s3_converted_full_path, datetime.datetime(1970, 1, 1, tzinfo=tzutc()))
    # upload new file or replace an older file - enable versioning on the bucket
    if file_time > s3_file_time:
        file_to_send_data[file] = s3_converted_full_path
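To see what that path conversion produces, here is the same logic pulled out into a small hypothetical helper (using PureWindowsPath so it runs on any OS):

```python
from pathlib import PureWindowsPath

def to_s3_key(local_path, bucket_folder):
    """Convert a Windows file path into the S3 key style used above."""
    p = PureWindowsPath(local_path)
    parent = str(p.parent)[3:]  # strip the drive letter, e.g. "I:\"
    if parent:
        return f"{bucket_folder}/" + parent.replace("\\", "/") + "/" + p.name
    # root dir file - just add to the folder
    return f"{bucket_folder}/{p.name}"

print(to_s3_key("I:\\photos\\2021\\cat.jpg", "i_drive"))  # i_drive/photos/2021/cat.jpg
```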

Step 3b – upload the files from the created dictionary

We will use the S3 API call upload_file to send the required files to S3:

if file_to_send_data:
    file_count = len(file_to_send_data)
    current_file = 1
    print(f"Files to upload: {file_count}")
    self.log_file.write(f"Files to upload: {file_count}\n")
    print(f"Total size of upload: {self.file_size(total_size)}")
    self.log_file.write(f"Total size of upload: {self.file_size(total_size)}\n")
    for file, key in file_to_send_data.items():
        self._s3.upload_file(file.path, Bucket=self.bucket, Key=key, ExtraArgs={'StorageClass': storage_class})
        current_file += 1
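If you want per-file progress on large uploads, upload_file also accepts a Callback that is invoked with the number of bytes transferred. A small sketch (this ProgressTracker class is my own addition, not part of the attached code):

```python
class ProgressTracker:
    """Accumulates bytes sent for one file; usable as upload_file's Callback."""

    def __init__(self, filename):
        self.filename = filename
        self.bytes_seen = 0

    def __call__(self, bytes_amount):
        # boto3 calls this repeatedly with the size of each chunk transferred
        self.bytes_seen += bytes_amount
        print(f"{self.filename}: {self.bytes_seen} bytes sent")

# usage (inside the upload loop):
# self._s3.upload_file(file.path, Bucket=self.bucket, Key=key,
#                      ExtraArgs={'StorageClass': storage_class},
#                      Callback=ProgressTracker(file.name))
```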

And that’s it! The core functions are straightforward and give you a great working file upload solution for your home system. I have been using this for about 3 months now in place of the previous commercial solution, and it works exactly as I need it to. A few extra features are included in the attached code:

  1. try/except blocks to deal with exceptions due to file locks and failed reads
  2. logging of the activity to a file as well as stdout (on the Windows desktop)
  3. file and folder exclusion patterns to skip unwanted data

There are also some future plans for code updates to be done:

  1. add the ability to remove deleted files from S3 that are x days/months/years old
  2. stop/start uploading at certain times
  3. add reporting to show how much was uploaded per drive / how many files at the end

These can all be added in with some additional code – so feel free to adjust as needed.
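For the first future item, S3 itself can age out superseded versions via a bucket lifecycle rule, so no extra Python is needed for that part. A sketch of such a rule (the bucket name and the 90-day window are placeholders to adjust):

```python
# Expire noncurrent (superseded) object versions 90 days after replacement
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        }
    ]
}

# With boto3 and credentials in place, apply it with:
# s3.put_bucket_lifecycle_configuration(Bucket="your-backup-bucket",
#                                       LifecycleConfiguration=lifecycle_config)
```

Note this only prunes old versions of files you have updated; cleaning up files deleted locally would still need the extra code mentioned above.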

Hope you enjoy your new S3 backup application! I generally run it weekly, but run it on the schedule that works best for you.
