Synchronising Two Machines Via Internet

I am building a small Ruby script to teach myself a few things about the language and solve a current issue I have. Here are the notes about the application I am building. All comments are appreciated.

My Use Case
My company has provided me with a very cool laptop and lets me use Ubuntu Linux on it. However, it is really heavy (or I am not strong enough) and I don’t want to carry it back and forth between home and work (I would rather exercise my brain than build up my muscles).

I do not want both computers to be always on (a waste of energy). Instead, when I shut one of them down, I would like it to send all my updated data to the Internet, and when I boot one, to fetch all the new data from the Internet.

The obvious solution is to use SSH + rsync, but since I work in the corporate trenches I cannot count on anything except my faithful port 80.


My daily data update rate is pretty small, so S3 and its cheap rates for small volumes ($0.15 per GB per month + $0.20 per GB transferred) might do the trick, if and only if S3 is used as a pivot storage medium (a bus, actually). Only updated files are put on S3, and they are deleted after consumption (yes, it is a bus).

The initial sync might be expensive, but it is a one-time expense.

How to build it?
I don’t want to put all my data in S3 (mainly for cost reasons), only the updated data. The full data set will live on two different computers. Besides, since it contains all my personal data, I would rather not overwrite new data with old, so extra care must be taken there. My two PCs will not be used at the same time. If two different update sets are sent at the same time, the application will quit.

We cannot count on the system clock, since the two computers can be in different time zones and could be out of sync. We could use S3’s time or a time server, but using a logical clock is simpler and more efficient.

The granularity of this application is the file. An optimization could be applied later to send only the updated part(s) of a file. All updated data will be archived in one zip file.

S3syncer has two main actions: put and get

  • Get will fetch all newly updated data and apply it to the local computer.
  • Put will take all data updated since the last get operation and put it on S3.

Manifest
The manifest contains all the metadata of the zip archive. Since storage space is not an issue, it will be written in XML and put on S3 with the archive.

<manifest>
  <version></version>
  <logical_clock></logical_clock>
  <emitter>MAC address or any UUID</emitter>
  <new>
    <file name="…" path="…"/>
  </new>
  <updated>

  </updated>
  <deleted>
    <file name="…" path="…" archivedName="can be different if file same name"/>
  </deleted>
</manifest>

NB: Splitting the changes into three classes of actions (new, updated, deleted) saves space.
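To make the manifest layout concrete, here is a minimal sketch that builds it as a plain string. The helper names (xml_escape, file_elements, build_manifest) and the sample values are hypothetical, and an XML library could be used instead of hand-written escaping.

```ruby
# Hypothetical helpers: build the manifest XML as a plain string.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape the characters that are unsafe in XML text and attribute values.
def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Turn a list of attribute hashes into <file .../> lines.
def file_elements(files)
  files.map do |f|
    attrs = f.map { |key, val| %(#{key}="#{xml_escape(val)}") }.join(" ")
    "    <file #{attrs}/>"
  end
end

def build_manifest(version:, logical_clock:, emitter:, new_files: [], updated: [], deleted: [])
  lines = ["<manifest>",
           "  <version>#{xml_escape(version)}</version>",
           "  <logical_clock>#{logical_clock}</logical_clock>",
           "  <emitter>#{xml_escape(emitter)}</emitter>"]
  { "new" => new_files, "updated" => updated, "deleted" => deleted }.each do |tag, files|
    lines << "  <#{tag}>"
    lines.concat(file_elements(files))
    lines << "  </#{tag}>"
  end
  lines << "</manifest>"
  lines.join("\n")
end

# Sample values are placeholders, not real data.
manifest = build_manifest(
  version: "1", logical_clock: 3, emitter: "machine-a",
  new_files: [{ name: "notes.txt", path: "docs" }],
  deleted: [{ name: "old.txt", path: "docs", archivedName: "old.txt" }]
)
puts manifest
```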

File naming in S3
manifest<logical_clock>.xml
archive<logical_clock>.zip

Every field is self-explanatory. The logical clock lets a machine know whether it has missed some updates. Basically, it is incremented with each put. We have n -> n+1, where n and n+1 are logical timestamps and -> is the happened-before relation.

The logical clock is stored on each local computer and updated by a call to get. (It also allows several different updates to be retrieved.)

It goes from 0 to 7 and then back to 0 (mod 8).
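As a rough sketch of the wrap-around and of the file naming above, assuming a modulus of 8 (so the values run 0 through 7); the constant and method names are mine, not part of any library:

```ruby
CLOCK_MODULUS = 8 # assumption: the clock wraps 0, 1, ..., 7, 0, ...

# Logical timestamp that follows the given one.
def next_clock(clock)
  (clock + 1) % CLOCK_MODULUS
end

# Object names on S3, following the manifest<logical_clock>.xml convention.
def manifest_key(clock)
  "manifest#{clock}.xml"
end

def archive_key(clock)
  "archive#{clock}.zip"
end

# A machine at clock 7 expects the next update under clock 0.
expected = next_clock(7)
puts manifest_key(expected) # prints manifest0.xml
puts archive_key(expected)  # prints archive0.zip
```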

Configuration file
Since we are in Ruby, we could use the YAML serialization layer, but a simple Ruby class might be smarter, although less user-friendly than an XML file.

For the first version, a Ruby class will be fine.
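Such a class could look roughly like the sketch below. Every name and value in it (S3SyncerConfig, the bucket, the paths, the exclusion patterns) is a placeholder I made up for illustration.

```ruby
# Hypothetical configuration class; all names and values are placeholders.
class S3SyncerConfig
  BUCKET     = "my-sync-bucket"   # placeholder bucket name
  EMITTER_ID = "machine-a"        # MAC address or any UUID
  SYNC_ROOT  = "/home/me/data"    # directory to keep in sync
  STATE_FILE = File.join(SYNC_ROOT, ".s3syncer_state") # holds the logical clock
  EXCLUDES   = ["*.tmp", ".cache/*"].freeze             # glob patterns to skip

  # True when a path (relative to SYNC_ROOT) matches an exclusion pattern.
  def self.excluded?(relative_path)
    EXCLUDES.any? { |pattern| File.fnmatch(pattern, relative_path) }
  end
end

puts S3SyncerConfig.excluded?("scratch.tmp") # prints true
puts S3SyncerConfig.excluded?("notes.txt")   # prints false
```

The upside of a class over YAML is that small helpers like excluded? can live next to the settings; the downside is that editing it means editing code.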

Get in more depth
Get works this way:

  1. Get the manifest file for self.logical_clock + 1 (if it does not exist: quit).
  2. Check that emitter != self (otherwise quit, since no update needs to be done).
  3. Get the archive.
  4. Store it in a temp directory.
  5. Apply the updates to the local FS.
  6. Delete the local archive. Keep the manifest and store the local times (to handle clock changes).
  7. Delete the files on S3.
  8. self.logical_clock++
  9. Check for the next manifest file.

NB: If the logical clock does not exist, it defaults to 0.
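The get steps can be sketched with an in-memory Hash standing in for S3. That stand-in, the manifests stored as hashes rather than XML, the modulus of 8, and the name get_updates are all assumptions made to keep the example short. Unpacking the archive is reduced to collecting its content.

```ruby
CLOCK_MODULUS = 8 # assumption: the clock wraps after 7

# bus:     Hash from object name to content (stand-in for S3)
# state:   Hash holding the local logical clock
# self_id: identifier of the local machine
def get_updates(bus, state, self_id)
  state[:logical_clock] ||= 0                 # NB: a missing clock starts at 0
  applied = []
  loop do
    wanted = (state[:logical_clock] + 1) % CLOCK_MODULUS
    manifest = bus["manifest#{wanted}.xml"]
    break if manifest.nil?                    # step 1: no manifest, quit
    break if manifest[:emitter] == self_id    # step 2: our own update, quit
    archive = bus["archive#{wanted}.zip"]     # steps 3-4: fetch the archive
    applied << archive                        # step 5: placeholder for unpacking
    bus.delete("manifest#{wanted}.xml")       # step 7: consume from the bus
    bus.delete("archive#{wanted}.zip")
    state[:logical_clock] = wanted            # step 8: advance the clock
  end                                         # step 9: loop looks for the next one
  applied
end

bus = {
  "manifest1.xml" => { emitter: "machine-b" }, "archive1.zip" => "update-1",
  "manifest2.xml" => { emitter: "machine-b" }, "archive2.zip" => "update-2"
}
state = {}
p get_updates(bus, state, "machine-a") # prints ["update-1", "update-2"]
p state[:logical_clock]                # prints 2
```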

Put in more depth
Put works this way:

  1. Find all updated data using the stored timestamp. How? Using time… but that is error-prone.
  2. Check whether a manifest already exists on S3. If yes: quit, because it is an error.
  3. Create the manifest.
  4. Send it to S3.

NB: Issues can arise if the system time is changed after the get. In this case some files can be missed. This will be detected by checking that no file has been updated 2 hours before the last get. If one has, we will resend all the files.
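A minimal sketch of the put side, assuming files are selected by comparing their modification time with the time of the last get, and again using a Hash in place of S3. The names updated_since and put_updates are hypothetical, and the clock value to publish under is passed in by the caller.

```ruby
require "find"
require "tmpdir"

# Paths (relative to root) of regular files modified after last_get.
def updated_since(root, last_get)
  changed = []
  Find.find(root) do |path|
    next unless File.file?(path)
    changed << path.delete_prefix("#{root}/") if File.mtime(path) > last_get
  end
  changed.sort
end

# bus is an in-memory Hash standing in for S3 (assumption).
# Returns false without writing anything when a manifest is already there.
def put_updates(bus, clock, emitter, files)
  key = "manifest#{clock}.xml"
  return false if bus.key?(key)                    # step 2: conflict, quit
  bus[key] = { emitter: emitter, updated: files }  # steps 3-4: create and send
  true
end

Dir.mktmpdir do |root|
  old_file = File.join(root, "old.txt")
  File.write(old_file, "unchanged")
  File.write(File.join(root, "new.txt"), "changed")
  last_get = Time.now - 3600
  File.utime(last_get - 60, last_get - 60, old_file) # pretend it predates the last get
  p updated_since(root, last_get)                    # prints ["new.txt"]
end
```

Relying on modification times is exactly the error-prone part mentioned in step 1: a clock change between get and put can hide or invent updates.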

Solve conflicts
Conflicts will be detected through the logical clock. In this case, the application will quit and the manifest file should be deleted.

The language used will be Ruby (mainly because I like it and I am still learning it), with s3.rb. If I am courageous enough, I will provide a Debian and an Ubuntu package.

Other uses

We can easily extend the system to support multiple computers and use it to sync a common directory (a little like Coda does).

References
http://townx.org/blog/elliot/thoughts_on_rsync_and_s3
http://www.amazon.com/gp/browse.html?node=16427261
http://search.cpan.org/dist/File-RsyncP/FileList/FileList.pm
http://samba.anu.edu.au/rsync/documentation.html

3 Responses to “Synchronising Two Machines Via Internet”

  1. FD Says:

    it seems…. a very interesting project.
    I have absolutely no comment to make on your tech choice as I’m nothing but a bullshitter/marketing/sales guy

    My only question would be: does your company’s security policy allow data to be transferred?

  2. Nico Says:

    Of course it does, and for several reasons:

    1/ Nobody has forbidden me to do this.

    2/ It is to allow me to work more so it is for the best.

    3/ Data loss is a bigger risk than data theft.

  3. Dave Says:

    Actually, your use case is….mobile user!