To give some background on the types of systems we work with, we thought it would be helpful to share some high-level statistics about our infrastructure.
Incoming data volumes exceed 50 TB per day, with more than 10^11 new items/lines/records added daily. Our analytical processing infrastructure exceeds 12 PB of physical storage, with over 4.5 PB in our largest cluster.
We leverage compression technologies wherever possible and achieve compression ratios as high as 96% on our highest-volume data feeds.
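A quick sketch of what that ratio means in practice, reading the 96% figure as the fraction of space saved. The 50 TB/day volume is taken from the stats above; treating the whole daily volume as one feed is purely illustrative:

```python
# Illustrative compression arithmetic, assuming "96% compression ratio"
# means 96% of the raw space is saved (i.e., data shrinks to 4% of its
# original size).

def compressed_size(raw_bytes: float, savings: float) -> float:
    """Bytes on disk after compression, where `savings` is the
    fraction of space saved (0.96 -> data stored at 4% of raw size)."""
    return raw_bytes * (1.0 - savings)

TB = 1024 ** 4
raw = 50 * TB                      # daily incoming volume from the text
stored = compressed_size(raw, 0.96)
print(round(stored / TB, 2))       # about 2 TB stored per day at that ratio
```

At that ratio, a 50 TB/day feed occupies roughly 2 TB/day of physical storage, which is what makes multi-petabyte clusters viable at these ingest rates.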
On any given day our massively parallel systems process more than 50 PB of data, not counting the various caching layers that serve similar activities and significantly reduce physical I/O.
We execute millions of requests daily, ranging from near-real-time, highly localized lookups to enormous jobs that span hundreds of terabytes in a single model or a series of models.
Early on we adopted certain enterprise-level philosophies, such as "answer any question, any time" and "all data, at the most atomic level".
We know that we could never predict the questions and demands of tomorrow, so ultimate flexibility is one of our key design principles.
While data reduction and aggregation are part of our daily work, we primarily deal with far more complex tasks that combine data structures drawn from our more than 10^4 entities and more than 10^5 data elements. By combining various technologies and designs, we achieve the best possible throughput and availability per dollar invested. Total cost of ownership is paramount: we mix server-grade and ultra-low-cost equipment to achieve unprecedented scalability.
Our analytical systems serve thousands of analysts and business users in a true 24x7x365 operation.
A typical cluster in our environment will process 10^10 records, about 1 TB of data, in 5 seconds or less on a few dozen processing nodes.
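A back-of-the-envelope calculation of the per-node scan rate those figures imply. The 1 TB and 5-second numbers come from the text; the node count of 36 is a hypothetical stand-in for "a few dozen":

```python
# Rough per-node throughput implied by the cluster figures above.
# Node count (36) is an assumed value, not from the source.

def per_node_throughput_gbps(total_bytes: float, seconds: float,
                             nodes: int) -> float:
    """Average scan rate each node must sustain, in GB/s."""
    GB = 1024 ** 3
    return total_bytes / seconds / nodes / GB

TB = 1024 ** 4
rate = per_node_throughput_gbps(1 * TB, 5, 36)
print(round(rate, 2))  # roughly 5.69 GB/s per node
```

Sustaining multiple GB/s per node is well beyond a single spinning disk, which is one reason caching layers and compression matter so much in this kind of design.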
At any point in time our systems ingest massive amounts of new incoming data while simultaneously processing hundreds to thousands of requests.
We are currently working on a next-generation architecture to process in excess of hundreds of petabytes of data per day.