|
Apr 29
2013
|
Back in Action
Posted by Oliver Ratzesberger in myblog, general, bigdata |
|
Sep 03
2011
|
Analytics as a Service - Social SQL
Posted by Oliver Ratzesberger in social, bigdata, agile |
The past 12 months had us move Analytics as a Service (A3S) to new maturity levels. For the very first time we have a single point interface for all of our A3S services: the DataHub. I recently presented an overview and demo to a group of industry analysts and the feedback has been overwhelmingly positive.
The combination of Social, private Cloud and Analytics as a Service, based on a social portal built on Open Source (Joomla + Kunena), is turning into a killer application for the global enterprise. Never before have we seen agile and community, BI and Analytics brought together through a fully social experience that allows users, analysts, scientists, executives, PMs - pretty much anybody in the organization - to follow each other, link up, like, create groups, publish Analytics and discover new data, new analytics and new insights on the fly.
Search and metadata are great, as long as you know what to look for. In today's world of BigData this is becoming increasingly complex,
|
Feb 12
2011
|
Project Singularity
Posted by Oliver Ratzesberger in xldb, super computing, mpp, bigdata |
It has been a while since I actively blogged on this personal site of ours. It has been a busy couple of years and our teams have pushed the boundaries of pretty much any technology out there that deals with Data and Analytics.
Some 4-5 years ago we started an internal project and, based on Ray Kurzweil's The Singularity Is Near, we dubbed it Singularity.
We are only weeks away from launching V3 of our Singularity platform and it's nothing short of amazing. We set out to scale big and economically, make the complex easy, and put the impossible in the hands of all our analysts, without special training or knowledge of complex programming languages. That means putting hundreds of trillions of behavioral patterns to use, structuring complex data just enough to make it simple to use while keeping loosely structured patterns the way they are, and storing unstructured data as is, projecting logic and structure at runtime.
|
Nov 15
2008
|
Today I ran across some discussion about Agile Development in Data Warehousing, and noted that we talk about this in the context of DW development, but not in relation to the Business. I believe these processes need to be distinguished quite clearly. Most simply put - one is applying Agile to DW development; the other is applying Agile to Business Analysis.
Core DW foundations involve modeling root components of business data needs and implementing a data model which allows for flexibility to answer questions of the data - a concept I call "Designing for the Unknown". The more renormalization and change from the source system, typically the more transformation logic and less flexibility, and ergo higher cost and less organizational agility.
Effective Agile development of the DW infrastructure itself involves delineating the methodologies which can be used for what types of development functions. For example, creating a core or "root key" entity in the data model
|
Oct 23
2008
|
Just got back from Chicago, where over the past 2 days a small group of scientists, academia and industry discussed various aspects of cloud computing and related topics.
One of the topics was comparing extremely large-scale analytical problems and the systems leveraged to solve them. In order to compare classes of supercomputers, Alex Szalay (Johns Hopkins University) explained a simple yet interesting figure: the Amdahl number (Amdahl's Law; Bell, Gray and Szalay 2006).
Alex explained the Amdahl number (BW) as one bit of IO/sec per instruction/sec.
Why is this figure interesting? To compare the analytical capabilities of various extremely large clusters, it is important to categorize them into different groups based on their processing capabilities. Typical commercial applications of large-scale analytics require large amounts of IO per available CPU, while various scientific applications require less IO per available CPU.
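To make the ratio concrete, here is a minimal Python sketch of the figure as described above: IO bandwidth in bits per second divided by instructions per second. The function name and the two sample machines are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
def amdahl_number(io_bits_per_sec: float, instructions_per_sec: float) -> float:
    """Return bits of IO per second available per instruction per second."""
    if instructions_per_sec <= 0:
        raise ValueError("instructions_per_sec must be positive")
    return io_bits_per_sec / instructions_per_sec


# Hypothetical machines (assumed values, not real benchmarks).
io_heavy = amdahl_number(io_bits_per_sec=8e11, instructions_per_sec=1e12)
cpu_heavy = amdahl_number(io_bits_per_sec=1e11, instructions_per_sec=1e13)

print(f"IO-heavy cluster:  BW = {io_heavy:.3f}")
print(f"CPU-heavy cluster: BW = {cpu_heavy:.3f}")
```

A value close to 1 suggests a system balanced for IO-bound analytical work, while a value far below 1 suggests a system tuned for compute-bound workloads.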
For a Blue Gene the BW=0.013, the JHU cluster BW = 0.664. So off
|
Oct 23
2008
|
New software licensing is needed
Posted by Michael McIntire in xldb, general, cost |
Over the last year I have spent a good amount of time thinking about the cost of analytics, and a few things worry me about our industry and how vendors price in it. In the Data Warehousing and BI industry, we're starting to see pricing models based on data volume or size. I know of one vendor which prices by SPECint, so - get a bigger system or virtualize your systems - get a big bill.
The problem with these licensing schemes is that they actually make doing analytics more expensive over time. If I am a good technology user, over time I'm driving down the costs per unit of data volume to create and maintain the data flowing into my DW. In the simple case you could say that the cost to generate data is declining at the inverse rate of Moore's Law, particularly for a web business where most of the cost is in hardware.
Back to the point: using Moore's Law as the example, the cost of data generation declines per unit of volume at a rate of ~50% every 18 months.
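The arithmetic behind that claim can be sketched in a few lines of Python. The 18-month halving period comes from the text; the constant hardware budget and the flat price per unit of data are assumptions made purely for illustration, not any vendor's actual pricing.

```python
HALVING_PERIOD_MONTHS = 18.0  # from the text: ~50% every 18 months


def relative_unit_cost(months: float) -> float:
    """Cost to generate one unit of data, relative to month 0."""
    return 0.5 ** (months / HALVING_PERIOD_MONTHS)


def volume_at_fixed_budget(months: float) -> float:
    """Data volume a constant budget can generate, relative to month 0."""
    return 1.0 / relative_unit_cost(months)


def license_bill(months: float, price_per_unit: float = 1.0) -> float:
    """Hypothetical volume-based license bill at a flat price per unit of data."""
    return price_per_unit * volume_at_fixed_budget(months)


for months in (0, 18, 36, 54):
    print(
        f"month {months:2d}: unit cost x{relative_unit_cost(months):.3f}, "
        f"volume x{volume_at_fixed_budget(months):.0f}, "
        f"license bill x{license_bill(months):.0f}"
    )
```

Under these assumptions the same hardware spend produces four times the data after 36 months, so a license priced per unit of volume also quadruples while the cost of generating each unit has fallen to a quarter.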
|
Apr 21
2008
|
Analytics as a Service
Posted by Oliver Ratzesberger in xldb, mpp, efficiency, agile |
Analytics as a Service
What do you think about Agile Analytics? Ever heard about it? Well, here are a couple of thoughts from the guys who deal with it on a daily basis.
Looking forward to seeing your comments on this.
|
Mar 22
2008
|
Science - DB Research Meeting
Posted by Oliver Ratzesberger in xldb, super computing, mpp |
Next week I will be attending the next iteration of the xldb group events organized around eXtreme Large Database Applications. xldb workshop
With 100s of Peta Bytes of information waiting to be captured and analyzed, new concepts are required to scale today's platforms by 1-3 orders of magnitude.
Today we 'limit' ourselves to 'only' capturing 40TB/day of incremental incoming data volumes, but next generation requirements demand a much more detailed collection of event detail data. 100TB/day is already on the horizon, giving us just 10 days of history per Peta Byte. With deep historical requirements of 3+ years of information, data volume growth will outpace Moore's Law. And I would not be surprised if this time next year we are thinking about how to deal with 250TB/day. The writing is on the wall.
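The numbers above are easy to check with a short Python sketch. It assumes decimal units (1 PB = 1000 TB), 365-day years and raw volume only, ignoring compression, replication and indexes; the function names are illustrative.

```python
TB_PER_PB = 1000  # decimal units assumed


def days_of_history_per_pb(tb_per_day: float) -> float:
    """How many days of incoming data fit into one petabyte."""
    return TB_PER_PB / tb_per_day


def petabytes_for_history(tb_per_day: float, years: float = 3.0) -> float:
    """Raw petabytes needed to keep `years` of history at a constant daily rate."""
    return tb_per_day * 365 * years / TB_PER_PB


for rate in (40, 100, 250):
    print(
        f"{rate:3d} TB/day: {days_of_history_per_pb(rate):4.1f} days per PB, "
        f"{petabytes_for_history(rate):6.2f} PB for 3 years"
    )
```

At 100TB/day this reproduces the 10 days of history per Peta Byte quoted above, and three years of history at that rate already exceeds 100 Peta Bytes of raw data.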
Improvements in Processing Power per CPU and advances in Memory and Storage are not going to be able to make up for the exponential growth of data processing requirements.
|
Mar 21
2008
|
TACC Ranger goes live
Posted by Oliver Ratzesberger in super computing, mpp |
On February 22nd 2008 TACC formally introduced the go-live of RANGER - a massive scale supercomputer. While not a traditional relational processing system, the design shares many components and basic principles with large scale processing platforms.
Of particular interest is the multi-terabit InfiniBand interconnect that allows the system to (re)distribute massive amounts of data.
One of the early learnings from the system is that loading massive amounts of data can at times be a larger challenge than processing that very same data once loaded into the system. It points out a very common issue with large scale data processing:
|
Mar 21
2008
|
A Systems overview
Posted by Oliver Ratzesberger in mpp, general |
Finally I got to complete a high level systems overview. I realize it does not contain too much detail, but as you can imagine, we are bound by pretty strict NDAs.
Nevertheless, it should give you a good feel for how much data we process on any given day. The stats are pretty much the figures as we go into 2008, and they are growing rapidly.
Here is a link to the article: Our Systems
Enjoy the reading and post your comments!
Oliver