Irfan’s Corner on the Web On Mac, Linux, Grid, Virtualization and Software Technology

2 Jan 2010

Is Data Science emerging as a New Domain in Computer Science?

I've just finished reading Chapter 5 of Beautiful Data. I had planned to write a blog post about the book as a whole; however, this chapter contained some new insights for me which I thought were valuable to share. The book has some excellent chapters covering significant developments in the domain of data storage, retrieval and analysis. Chapter 5, written by Jeff Hammerbacher, is titled "Information Platforms and the Rise of the Data Scientist".

The chapter explores the challenges Facebook faced in analysing the data it collects, and how existing RDBMS solutions (MySQL and Oracle) were not up to the task of collecting and enabling analysis of highly fluid data such as clickstreams from millions of users (currently 2.5 petabytes is stored and new data is collected at 14 TB/day). The author goes on to discuss the solution they developed internally at Facebook, which is based on cloud technologies such as Hadoop and works with unstructured data.
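To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 1000 TB) and a constant collection rate, which is my simplification and not something the chapter claims; the function name is just illustrative.

```python
def days_at_current_rate(stored_pb: float, rate_tb_per_day: float) -> float:
    """Days needed to collect `stored_pb` petabytes at a constant daily rate.

    Assumption: decimal units, 1 PB = 1000 TB.
    """
    stored_tb = stored_pb * 1000
    return stored_tb / rate_tb_per_day


# Figures quoted above: 2.5 PB stored, 14 TB/day of new data.
days = days_at_current_rate(2.5, 14)
print(f"{days:.0f} days")  # roughly six months
```

In other words, at the quoted rate the stored volume would double in about half a year, assuming the rate did not grow further.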

Analysis of large-scale data is becoming a common problem in a large number of domains. Web companies such as Facebook and Google are not the only ones in the world that analyse huge amounts of data. Several scientific experiments, such as the CERN LHC, produce gigantic amounts of data that need to be analysed (the recent book The Fourth Paradigm by Microsoft Research explores data-intensive scientific initiatives).

So many new skills are required to manage this data: designing storage architectures, high speed retrieval architectures, authoring data analysis workflows and finally communicating the results of the analysis. All these tasks are multi-disciplinary. Some tasks are related to Computer Science (design of data storage and retrieval systems), some to Business Analysis (authoring data analysis), some tasks belong to statisticians (the actual algorithms performing the analysis) and some to engineers (the underlying infrastructure for storing and processing the data).

Can this multi-disciplinary approach to data management be termed "Data Science"? It is a term which I believe is increasingly gaining traction.

30 Dec 2009

Making the move to Cloud backup

My Time Machine hard disk failed 2 days ago. I lost all my backups! Unfortunately I had reinstalled my system just last week and had not yet fully restored from the latest Time Machine backup. Fortunately I have recovered everything other than my pictures.

I don't want to experience such a loss again, so I'm moving towards cloud-based backup. I gain a few things, but lose some as well. The service I selected is Mozy. They provide unlimited storage at an economical rate. However, their backup/restore tool for the Mac does not support proxies (their Windows one apparently does). Moreover, Mozy's tool does not allow me to browse my backups in a fine-grained fashion as Time Machine does. In Time Machine I can restore individual folders and files and browse my backup history over weeks and months. The Mozy tool does not provide such fine-grained history browsing.

Finally, uploading is such a hassle! It took me more than a day to upload a limited subset of data from my laptop (~60GB). Downloading fortunately is faster.
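For a sense of scale, here is a quick sketch of the effective upload speed those numbers imply. It assumes decimal units (1 GB = 1000 MB) and treats "more than a day" as exactly 24 hours, so the result is an upper bound on what I actually got; the function name is just illustrative.

```python
def effective_throughput(size_gb: float, hours: float) -> tuple:
    """Return (megabytes per second, megabits per second) for a transfer.

    Assumption: decimal units, 1 GB = 1000 MB.
    """
    size_mb = size_gb * 1000
    seconds = hours * 3600
    mb_per_s = size_mb / seconds
    return mb_per_s, mb_per_s * 8


# ~60 GB uploaded in (at least) 24 hours.
mb_s, mbit_s = effective_throughput(60, 24)
print(f"{mb_s:.2f} MB/s, {mbit_s:.2f} Mbit/s")
```

That works out to well under 1 MB/s sustained, which is why a full initial cloud backup takes so long.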

What do I gain from a cloud-based backup solution? Hopefully I will not lose my data again.

However, because local backups have certain advantages as well, I plan to do daily Time Machine backups on a new 1TB HD and weekly cloud backups. As for my pictures, I have a MobileMe subscription, and I still have the albums I shared there with friends and family. So in future I plan to upload all my new pictures to MobileMe.
