Archive for the 'distributed systems' Category

Impressive Achievement of S3

Saturday, May 17th, 2008

In three years, Amazon S3 has deeply changed the web game.

Link

Peer-to-Peer File Sharing Client

Thursday, August 30th, 2007

It is a Joost-like client with embedded aggregation features (it “parses” YouTube).

I do not really know what to think about this, but it looks really cool… although I understand neither the research value nor the consumer value. I will give it a try and let you know. I do not think we would hear about it if it were not from Harvard.

Link

The Rise Of Open Source Storage Software

Thursday, August 2nd, 2007

Commodity-based open source storage software à la Red Hat GFS is, it seems, the next big wave. One of the reasons is that web companies use a lot of storage and cannot afford expensive NetApp or EMC hardware. They are also too small to create their own distributed/clustered filesystems.

Solution: open source with paid support, as MySQL did.

This is a quick win, since it is not particularly hard compared to the guaranteed lucrative market position a company would gain. I guess VCs are not interested in this market, just as they were not interested in MySQL at its start.

In the meantime, Amazon S3 is building a big lead since its competitors do not exist yet.

IFPI: Ten “inconvenient truths” about file-swapping

Wednesday, June 6th, 2007

Funny and concise article. Of course, behind the rhetoric, nobody seems to address the real issue. The issue is simple, though:

A lot of Internet users share copyrighted material. We cannot send so many people to jail. But creators need to pay their rent. What do we do?

Link

From Web Applications To Personal Virtual Machine

Thursday, May 24th, 2007

I have recently relocated from a big city to a small town in the countryside. This lifestyle change exposed several limits of the web application paradigm, since I now have only very limited Internet connectivity.

At first I tried to use web applications extensively for my private life: calendar, word processor, to-do list and so on. But it turns out this is far from an ideal situation.

For instance:

  • No confidentiality. I don’t want to expose all my data to outsiders, especially on a wifi network. I send and receive sensitive data such as my credit card number.
  • Only online access. OK, this one is obvious, but my Internet access is limited and irregular. This is the first time in more than ten years, and it is destabilizing. How do I access my calendar when there is no Internet around? Web applications do not provide this.
  • Silo effect. I usually process subsets of my data with custom-made scripts. Those scripts are not web aware. As far as I know, there is no find or grep for the Web, nor even a universal and easy way to parse any HTML page from any programming language.
  • Lack of customization. I have tailored my desktop to my needs and this adds huge value for me. For instance, Evolution, Skype and Gaim are automatically launched at startup, my SVN repositories are updated… Web applications do not offer any way to configure them this easily. What I need is a kind of Netvibes on steroids.
  • Integration with current applications. Desktop applications are heavily integrated. For instance, I can load my iPod from my desktop computer. I cannot from web applications.
  • Interactivity. I need speed and efficiency. This is the basis of interactive applications. Web applications (even Gmail) are slow and clumsy for obvious latency reasons. This is OK for email, not for code editing.

One silly example that happened yesterday: I need to fill in my tax declaration online. For this, the French government generates a certificate. But no web application allows me to store this certificate (and for good reasons).

You might object that I could have built a web application to take care of those needs, this Netvibes on steroids. Actually I started to, but along the way I found a much more elegant solution: a remote desktop system. One issue remains, though: its cost. A dedicated server is expensive.

(USB-based systems also have their own limits, mainly “no background mode”.)

A virtual machine paid by the hour, such as EC2, is perfect and nearly free (it could even be financed by advertising if a company wanted to operate such a service). It is my Personal Virtual Machine (PVM). Some companies have started offering them for free (e.g. Desktop On Demand), but their offer is unreliable and slow, and you cannot run all the applications you want or need.

In the end I installed KDE and NXE on my dedicated server (NXE is a great WAN remote desktop tool, truly impressive). It solved all my problems really fast, although it is costly (more than 30 euros per month). The marketing hype is on web applications, but we should now start to explore alternatives, especially if they empower users and are cheaper (I can demonstrate it if needed). I can access my desktop from my corporate PC, my cellphone or a cybercafe.

As a final note, I am not saying that web apps are bad, just that they are not a universal panacea, especially for single-user, interactive and heavily used applications. I will discuss this in more depth later. The PVM vision is not the perfect solution either, but for heavy computer users such as myself it offers real advantages: no need for backups ever, power consumption always the lowest possible, and access to your machine at will without leaving it on.

The next step is to be able to tie a virtual machine to a physical computer and then send it back into the cloud. VMware allows such a trick. I will discuss this in more depth later too, and I will tell you how it was to use this prototype for a month.

What do you think of this idea? Would you like me to explore those ideas more in depth?

The Long Tail In Practice

Sunday, May 20th, 2007

This artist tells us all about how to make it in this post-Internet world and it resonates well with Chris Anderson’s Long Tail ideas: this is the long tail at work.

Link

Future Of Computing

Monday, January 8th, 2007

A really interesting article from a consumer point of view. It is good to see people still being able to step back from AJAX and web apps.

Link (via the wonderful JoeyCoco)

The End Of The PC Era?

Sunday, December 10th, 2006

The PC era is coming to an end according to Greg Papadopoulos, CTO of Sun. He claims that only Amazon, Yahoo!, Google and the like will survive as computer operators, and revives the infamous Thomas J. Watson misquote.

What should we make of his assertion, especially when you are most likely reading this from your PC?

  • A lot of processes, whether computer based or human based, are sourced outside the company itself. But a company still needs to create value and cannot do so simply by aggregating different providers. It needs to add something somewhere. Although this can be done from outsourced data centers, companies will still need some infrastructure.
  • Most companies have already outsourced most of their IT to remote data centers, so this is not news.
  • Actually, the number of CPUs sold is increasing, so stating that there will be only 5 computers in the world seems to contradict the facts.

This announcement looks like a big PR event. What lies behind it is the commoditization of this hardware and the emergence of a whole new class of applications to manage and use those CPUs. Peer-to-peer seems the most relevant abstraction for this next revolution.

Link

Distributed Computing And Computer Languages

Sunday, November 19th, 2006

In this excellent post, Larry blogs about the relationship between a language and its uses. He quickly describes the tension between a language and its domain-specific attributes and points out how much effort has shifted from language development to frameworks. He gives the example of Prolog versus C to show how those languages address different issues. (By the way, have you seen Prolog code in production?)

I could not agree more with this and with his point about the shift from single-core to multi-core CPUs. I would even add that OpenMP (or other frameworks) is clearly impractical: too complex to learn, to use and to debug. Nor does it address all the issues raised by distributed computing (e.g. which consistency is needed?).

A language is a trade-off between specific use cases (e.g. embedded systems in Java) and broad abstractions (e.g. synchronized in Java). For instance, Erlang solves the threading issue but has not been embraced as a general-purpose language (for a lot of reasons).

No language yet offers powerful high- and low-level abstractions to manage multiple processors. You could use a framework (OpenMP) for those, but since this is a central feature of a modern language, it needs to be a language construct. That gives the compiler more information about the context and lets it build cleverer code.

Currently, I know only of Erlang, Java and Ada as offering some sort of high-level concurrency management. But a developer is not necessarily an expert in distributed systems. Most patterns of distributed code are known (where to add a mutex, a guard condition, …) and could be added automatically by the compiler given the right language constructs.

More to follow on this…

Link

Synchronising Two Machines Via Internet

Thursday, November 2nd, 2006

I am building a small Ruby script to teach myself a few things about the language and solve a current issue I have. Here are the notes about the application I am building. All comments are appreciated.

My Use Case
My company has provided me with a very cool laptop and lets me use Ubuntu Linux. However, it is really heavy (or I am not strong enough) and I don’t want to carry it back and forth between home and work (I would rather use my brain than develop my muscles).

I do not want both computers to be always on (a waste of energy). Instead, when I shut one of them down I would like it to send all my updated data to the Internet, and when I boot one I would like it to fetch all the new data from the Internet.

The obvious solution is to use SSH + rsync, but since I work in the corporate trenches I cannot count on anything except my faithful port 80.

My daily data update rate is pretty small. So S3 and its cheap hosting rates for small amounts of data ($0.15 per GB per month + $0.20 per GB transferred) might do the trick if and only if S3 is used as a pivot storage medium (a bus, actually). Only updated data will be put on S3, and it will be deleted after consumption (yes, it is a bus).
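
As a rough sanity check, the arithmetic can be sketched in Ruby with the two rates quoted above. The daily update size, the 30-day month, and the assumption that every update is transferred twice (one upload, one download) and stays on S3 for about a day are illustrative guesses, not measurements.

```ruby
# Rates quoted in the post (USD).
STORAGE_PER_GB_MONTH = 0.15
TRANSFER_PER_GB      = 0.20

# Estimate the monthly cost of using S3 as a bus.
# daily_update_gb is a hypothetical figure: the amount of data changed per day.
# Assumptions: each update is uploaded once and downloaded once, and at most
# one day's worth of updates sits on S3 at any time before being deleted.
def monthly_bus_cost(daily_update_gb, days: 30)
  transferred_gb = daily_update_gb * 2 * days
  storage  = daily_update_gb * STORAGE_PER_GB_MONTH
  transfer = transferred_gb * TRANSFER_PER_GB
  (storage + transfer).round(2)
end

puts monthly_bus_cost(0.05) # 50 MB of changes per day
```

With these assumptions the transfer fee dominates, which is why keeping only the deltas on S3 matters more than how long they stay there.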

The initial sync might be expensive but it is a one time expense.

How to build it?
I don’t want to put all my data in S3 (mainly for cost reasons), only the updated data. The data will live on two different computers. Besides, since it includes all my personal data, I would rather not overwrite new data with old data, so extra care must be taken there. My 2 PCs will not be used at the same time. If two different update sets are sent at the same time, the application will quit.

We cannot count on the clock since the two computers can be in different time zones and could be out of sync. We could use S3’s time or a time server, but using a logical clock is simpler and more efficient.

The granularity of this application is the file. Some optimization could be applied later to send only the updated part(s) of a file. All updated data will be archived in one zip file.

S3syncer has two main actions: put and get

  • Get will get all new updated data and set them up on the local computer.
  • Put will take all data updated after the last get operation and will put them on S3.

Manifest
The manifest contains all the metadata of the zip archive. Since storage space is not an issue, it will be written in XML and put on S3 with the archive.

<manifest>
  <version></version>
  <logical_clock></logical_clock>
  <emitter>MAC address or any UUID</emitter>
  <new>
    <file name="…" path="…"/>
  </new>
  <updated>

  </updated>
  <deleted>
    <file name="…" path="…" archivedName="can be different if file same name"/>
  </deleted>
</manifest>

NB: having three classes of actions (new, updated, deleted) saves space.
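
A minimal sketch of how such a manifest could be generated in Ruby follows. The function names (build_manifest, file_element, xml_escape), the hash layout used to describe a file, and the sample emitter and file names are all hypothetical. It builds the XML by hand to stay dependency free; a real version might prefer an XML library.

```ruby
# Minimal XML escaping for text and attribute values.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Render one <file/> element from a hash such as
# { name: 'notes.txt', path: 'docs/', archived_name: 'notes-1.txt' }.
# archived_name is optional, like the archivedName attribute in the manifest.
def file_element(file)
  attrs = %(name="#{xml_escape(file[:name])}" path="#{xml_escape(file[:path])}")
  attrs += %( archivedName="#{xml_escape(file[:archived_name])}") if file[:archived_name]
  "    <file #{attrs}/>"
end

# Build the manifest text with the three classes of actions.
def build_manifest(version:, logical_clock:, emitter:, added: [], updated: [], deleted: [])
  sections = { 'new' => added, 'updated' => updated, 'deleted' => deleted }
  body = sections.map do |tag, files|
    ["  <#{tag}>", *files.map { |f| file_element(f) }, "  </#{tag}>"]
  end
  [
    '<manifest>',
    "  <version>#{xml_escape(version)}</version>",
    "  <logical_clock>#{logical_clock}</logical_clock>",
    "  <emitter>#{xml_escape(emitter)}</emitter>",
    *body.flatten,
    '</manifest>'
  ].join("\n")
end

puts build_manifest(
  version: 1, logical_clock: 3, emitter: 'laptop-work',
  added: [{ name: 'notes.txt', path: 'docs/' }],
  deleted: [{ name: 'old.txt', path: 'tmp/' }]
)
```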

File naming in S3
manifest<logical_clock>.xml
archive<logical_clock>.zip

Every field is self-explanatory. The logical clock lets a machine know whether some updates have been missed. Basically, it is incremented with each put. We have n -> n+1, where n and n+1 are logical timestamps and -> is the happened-before relation.

The logical clock is stored on each local computer and updated by each call to get. (It also makes it possible to retrieve several successive updates.)

It goes from 0 to 7 and then back to 0 (mod 8).
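
The clock arithmetic and the naming scheme can be sketched in a few lines of Ruby. This assumes the clock really is taken modulo 8, so its values run from 0 to 7. The helper names are made up for the example.

```ruby
CLOCK_MODULUS = 8 # from the "(mod 8)" above

# Logical timestamp that follows n, wrapping around at the modulus.
def next_clock(n)
  (n + 1) % CLOCK_MODULUS
end

# S3 object names for a given logical timestamp.
def manifest_name(clock)
  "manifest#{clock}.xml"
end

def archive_name(clock)
  "archive#{clock}.zip"
end

clock = 7
puts manifest_name(next_clock(clock)) # prints "manifest0.xml"
```

Because the clock wraps around, old manifests have to be deleted from S3 after consumption, otherwise a stale manifest0.xml could be mistaken for a new update.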

Configuration file
Since we are in Ruby, we could use the YAML serialization layer, but a simple Ruby class might be smarter, although less user friendly than an XML file.

For the first version, a Ruby class will be fine.
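
Such a configuration class could look like the sketch below. Every name and value in it (SyncerConfig, the bucket name, the paths, the exclusion patterns) is a placeholder invented for the example, not part of the actual script.

```ruby
# Hypothetical configuration for S3syncer, written as a plain Ruby class.
class SyncerConfig
  BUCKET     = 'my-sync-bucket'                 # placeholder S3 bucket name
  EMITTER_ID = 'laptop-work'                    # MAC address or any UUID
  SYNC_ROOT  = '/home/me/data'                  # directory to synchronise
  STATE_FILE = File.join(SYNC_ROOT, '.s3syncer_state') # logical clock, last get time
  EXCLUDES   = ['*.tmp', '.cache/*'].freeze     # shell-style patterns to skip

  # True when a path relative to SYNC_ROOT matches one of the exclusion patterns.
  def self.excluded?(relative_path)
    EXCLUDES.any? { |pattern| File.fnmatch(pattern, relative_path) }
  end
end

puts SyncerConfig.excluded?('build/output.tmp')
```

The upside of a class over YAML or XML is that small helpers such as excluded? can live next to the values they use; the downside is that editing the configuration means editing code.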

Get in more depth
Get works this way:

  1. Get the manifest file for self.logical_clock + 1 (if it does not exist: quit).
  2. Check that emitter != self (else quit, since no update needs to be done).
  3. Get the archive.
  4. Store it in a temp directory.
  5. Apply the updates to the local FS.
  6. Delete the local archive. Keep the manifest and store the local time (to handle clock changes).
  7. Delete the files on S3.
  8. self.logical_clock++
  9. Check for the next manifest file.

NB: if the logical clock does not exist, it starts at 0.
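
The loop described by these steps can be sketched as follows. To keep the example runnable without network access, a small in-memory class (FakeBucket, invented here) stands in for S3, and a manifest is represented as a Ruby hash instead of parsed XML. Step 6 (keeping the manifest and the local time) is left out, and the modulo-8 wrap-around is an assumption taken from the logical clock section.

```ruby
# In-memory stand-in for an S3 bucket: object name => content.
# A real version would call an S3 client here instead.
class FakeBucket
  def initialize
    @objects = {}
  end

  def put(name, content)
    @objects[name] = content
  end

  def get(name)
    @objects[name]
  end

  def delete(name)
    @objects.delete(name)
  end

  def exist?(name)
    @objects.key?(name)
  end
end

CLOCK_MODULUS = 8

# Fetch and apply every pending update; returns the new logical clock.
# The block receives each manifest and archive and applies them locally.
def get_updates(bucket, clock, self_id)
  loop do
    wanted       = (clock + 1) % CLOCK_MODULUS
    manifest_key = "manifest#{wanted}.xml"
    archive_key  = "archive#{wanted}.zip"

    manifest = bucket.get(manifest_key)
    break if manifest.nil?                  # step 1: nothing newer, quit
    break if manifest[:emitter] == self_id  # step 2: our own update, quit

    archive = bucket.get(archive_key)       # steps 3-4: fetch the archive
    yield manifest, archive                 # step 5: apply to the local FS
    bucket.delete(manifest_key)             # step 7: clean up S3
    bucket.delete(archive_key)
    clock = wanted                          # step 8, then loop again (step 9)
  end
  clock
end

bucket = FakeBucket.new
bucket.put('manifest1.xml', { emitter: 'home-pc' })
bucket.put('archive1.zip', 'zip bytes')

applied = []
new_clock = get_updates(bucket, 0, 'laptop-work') { |_manifest, archive| applied << archive }
puts new_clock # prints 1
```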

Put in more depth
Put works this way:

  1. Find all data updated since the stored timestamp. How? Using file times… but this is error prone.
  2. Check whether the manifest already exists on S3. If yes: quit because it is an error.
  3. Create the manifest.
  4. Send it to S3.

NB: issues can arise if you change the timestamp after the get. In this case some files can be forgotten. This will be detected by checking that no file has been updated 2 hours before the last get. If so, we will resend all the files.
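
A minimal sketch of put, under several assumptions: a plain Hash stands in for the S3 bucket, the manifest is a Ruby hash and not XML, the zip archive and the two-hour safety check are left out, and the caller passes the logical timestamp to publish (target_clock, a name invented here). Files are selected by comparing their modification time with the time of the last get.

```ruby
require 'find'
require 'tmpdir'

# Step 1: files under root modified after the last get (a Time).
def updated_files(root, last_get)
  Find.find(root).select { |path| File.file?(path) && File.mtime(path) > last_get }.sort
end

# Steps 2-4: refuse to overwrite an existing manifest, then publish a new one.
def put_updates(bucket, target_clock, emitter, root, last_get)
  key = "manifest#{target_clock}.xml"
  raise "conflict: #{key} already exists on S3" if bucket.key?(key)

  files = updated_files(root, last_get)
  bucket[key] = { emitter: emitter, updated: files }
  files
end

Dir.mktmpdir do |root|
  old_file = File.join(root, 'old.txt')
  new_file = File.join(root, 'new.txt')
  File.write(old_file, 'unchanged')
  File.write(new_file, 'changed')

  last_get = Time.now - 60
  # Pretend old.txt was last modified an hour before the last get.
  File.utime(last_get - 3600, last_get - 3600, old_file)

  bucket = {}
  sent = put_updates(bucket, 1, 'laptop-work', root, last_get)
  puts sent.map { |path| File.basename(path) }.inspect
end
```

Relying on modification times is exactly the error-prone part mentioned in step 1: a file restored with an old timestamp, or a system clock changed between get and put, would be missed by this selection.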

Solve conflicts
The conflicts will be detected through the logical clock. In this case, the application would quit and the manifest file should be deleted.

The language used will be Ruby (mainly because I like it and I am still learning it), with s3.rb. If I am courageous enough I will provide a Debian and an Ubuntu package.

Other uses

We can easily extend the system to support multiple computers and use it to sync a common directory (a little like Coda does).

References
http://townx.org/blog/elliot/thoughts_on_rsync_and_s3
http://www.amazon.com/gp/browse.html?node=16427261
http://search.cpan.org/dist/File-RsyncP/FileList/FileList.pm
http://samba.anu.edu.au/rsync/documentation.html