
Mar 22
2008

Science - DB Research Meeting

Posted by Oliver Ratzesberger in xldb, super computing, mpp

Next week I will be attending the next iteration of the xldb workshop, the group's series of events organized around eXtremely Large Database applications.

With 100s of Peta Bytes of information waiting to be captured and analyzed, new concepts are required to scale today's platforms by 1-3 orders of magnitude.

Today we 'limit' ourselves to capturing 'only' 40TB/day of incremental incoming data, but next-generation requirements demand a much more detailed collection of event data. 100TB/day is already on the horizon, giving us just 10 days of history per Peta Byte. With deep historical requirements of 3+ years of information, data volume growth will outpace Moore's Law. And I would not be surprised if this time next year we are thinking about how to deal with 250TB/day. The writing is on the wall.
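To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 Peta Byte = 1000 TB), a 365-day year, and raw uncompressed volumes with no replication or indexes. The function names are made up for illustration.

```python
# Back-of-the-envelope capacity math for the daily volumes mentioned above.
# Assumptions (not from the post): 1 PB = 1000 TB, 365 days per year,
# raw data only (no compression, replication, or index overhead).

TB_PER_PB = 1000


def days_of_history_per_pb(tb_per_day):
    """How many days of incoming data fit into one petabyte."""
    return TB_PER_PB / tb_per_day


def raw_volume_pb(tb_per_day, years, days_per_year=365):
    """Raw petabytes needed to keep `years` of history at a constant daily rate."""
    return tb_per_day * days_per_year * years / TB_PER_PB


if __name__ == "__main__":
    for rate in (40, 100, 250):
        print(
            f"{rate:>3} TB/day: "
            f"{days_of_history_per_pb(rate):5.1f} days per PB, "
            f"{raw_volume_pb(rate, 3):7.2f} PB for 3 years"
        )
```

At a constant 100TB/day, three years of history comes to roughly 110 Peta Bytes before any compression, which is where the '100s of Peta Bytes' come from once the daily rate keeps growing.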

Improvements in processing power per CPU and advances in memory and storage are not going to be able to make up for the exponential growth of data processing requirements.

Advanced compression techniques are a 'must-have' for anybody who wants to continue in this environment. Forget gzip and get real, but that's the easy part. Thanks to the Science community, new advanced algorithms that achieve never-before-seen compression ratios are becoming reality.

We also have to transparently blur the lines between relational and non-relational systems. Structured as well as unstructured data, or highly volatile data streams, need to be able to exist next to each other, without the need to switch platforms or move information between technologies. Forget the 100-byte average record width: it's going to be 50 Bytes, 10^4 Bytes, 523 Bytes, 10^3 Bytes, any variation. Think of a highly volatile data stream with a constantly changing information 'payload' and very few defined patterns.

We have to be able to manage very chaotic and ever-changing information feeds. You can't always design structures for them beforehand, and even if you did, they would become obsolete the very second you turned them live.

Trillions of records, lines, objects, arrays, pointers, vectors, ... need to be managed, aggregated, reduced, joined, sampled, redistributed, scanned, ...

It is not going to work to have a few Peta Bytes here in one architecture and a couple more over there. Bringing them together would become virtually impossible.

With 100s to 1000s of individual nodes, Parallel Efficiency (PE) will have to come in at above 99% or system overhead will prevent the required scalability. And if you have not started measuring it on existing implementations, you will be in for a big surprise.
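If you want to start measuring, here is a minimal Python sketch of two common ways to put a number on Parallel Efficiency. Neither definition is spelled out in this post, so treat both as assumptions: the first is the classic speedup divided by node count, the second is a skew-based ratio of average to maximum per-node busy time. All function names and sample figures are hypothetical.

```python
# Two common, assumed definitions of Parallel Efficiency (PE).
# The post does not say which one it means; both are illustrations only.
from statistics import mean


def speedup_efficiency(t_one_node, t_n_nodes, nodes):
    """Classic PE: speedup (t_one_node / t_n_nodes) divided by the node count."""
    return t_one_node / (nodes * t_n_nodes)


def skew_efficiency(node_busy_seconds):
    """Skew-based PE: average per-node busy time over the busiest node's time.

    The job finishes only when the slowest node does, so any skew
    shows up as a ratio below 1.0.
    """
    return mean(node_busy_seconds) / max(node_busy_seconds)


def elapsed_at_efficiency(t_one_node, nodes, efficiency):
    """Elapsed time on `nodes` nodes if the system runs at the given PE."""
    return t_one_node / (nodes * efficiency)


if __name__ == "__main__":
    # Hypothetical job: 1000 node-hours of work spread over 1000 nodes.
    for pe in (0.99, 0.90, 0.80):
        hours = elapsed_at_efficiency(1000.0, 1000, pe)
        print(f"PE {pe:.0%}: {hours:.3f} hours elapsed")

    # Hypothetical per-node busy times with one hot node.
    print(f"skew-based PE: {skew_efficiency([5, 5, 5, 10]):.3f}")
```

The point of the skew-based variant is that a single hot node drags down the whole system: in the made-up sample, three nodes sit idle half the time while they wait for the fourth.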

It's going to be an interesting couple of days in Asilomar with the xldb team. Some of the smartest minds in Research, Science and Industry will brainstorm about these and other challenges.

