Over the last year I have spent a good amount of time thinking about the cost of analytics, and a few things worry me about how vendors in our industry price their products. In the Data Warehousing and BI industry, we're starting to see pricing models based on data volume or size. I know of one vendor that prices by SPECint, so get a bigger system or virtualize your systems and you get a bigger bill.
The problem with these licensing schemes is that they actually make analytics more expensive over time. If I am a good technology user, then over time I am driving down the cost per unit of data volume to create and maintain the data flowing into my DW. In the simple case you could say that the cost to generate data declines in step with Moore's Law, particularly for a web business where most of the cost is in hardware.
Back to the point: using Moore's Law as the example, the cost of data generation per unit of volume declines at a rate of roughly 50% every 18 months. Unfortunately for most web businesses the volume of data increases at an even faster rate, but for simplicity, let's assume data volume grows at the same rate.
In the case of the vendor who uses SPECint to fix the cost per unit of resource consumed, it now takes 2x as much resource to do the same analytics on twice the data, and therefore 2x as much licensing cost from that particular vendor, even though my total cost to generate that data has stayed roughly flat.
This is certainly a worst case, but I'll bet that if you look at your cost to generate data over time and map it against your cost to process that data, you will find that, per unit, data generation costs are declining much faster than analytics costs.
Most importantly, because analytics licensing costs are not keeping pace, the gap between these cost curves is widening over time, not narrowing. This is not a recipe for a successful relationship, nor for the future health of my suppliers.
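To make the arithmetic concrete, here is a minimal sketch of that divergence. The starting costs and the clean doubling/halving rates are hypothetical, chosen only to match the simplified assumptions above; none of this reflects any vendor's actual pricing.

```python
# Toy model: per-unit data generation cost halves every 18 months (Moore's Law),
# data volume doubles every 18 months, and licensing is priced per unit of
# compute consumed. All dollar figures are hypothetical.
PERIOD_MONTHS = 18
GEN_COST_PER_UNIT = 1.00      # $ to generate/maintain one unit of data today
LICENSE_COST_PER_UNIT = 1.00  # $ of licensing to process one unit of data today

print("months  generation_total  licensing_total  ratio")
for period in range(5):
    months = period * PERIOD_MONTHS
    volume = 2 ** period                                     # data volume doubles each period
    gen_total = (GEN_COST_PER_UNIT / 2 ** period) * volume   # total generation cost stays flat
    license_total = LICENSE_COST_PER_UNIT * volume           # licensing bill doubles each period
    print(f"{months:6d}  {gen_total:16.2f}  {license_total:15.2f}  {license_total / gen_total:4.0f}x")
```

In this toy model, after four doubling periods (six years) the licensing bill is 16x the data generation cost, even though the business is doing proportionally the same analytics per unit of data.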
The cost of these licensing models may explain why companies like Google, Yahoo, and others pursue map/reduce solutions. They simply cannot afford to implement a vendor solution given their volumes of data... There are other reasons for this, which I'll cover in another blog post, but I think you get the point.
There is another factor to consider, which is that the analytics you already know about are cheap. The ones you don't know about, and the process of finding them, are quite expensive. And as you mine for yet more information, the relative cost to extract that information from the same amount of data rises. The analytics simply become more and more complex, and soon the cost of the analytics is going to outweigh the value in the data - not for a lack of good business value in the data, but because the analytics costs are too high.
To get to the end point, I think we need some new licensing schemes. Vendors have to make money - and I want them to. But I also think we need licensing schemes that are tied more to the functional benefit to the business than to the volume of resource consumption.
written by Jeff, October 23, 2008
Glad to see you guys blogging again! When I was at Facebook, these licensing schemes definitely played a role in our choice of Hadoop as an attractive option for analytics. There were other reasons, of course, but the pricing model played a role.
You mentioned at the XLDB meetup that you meter by the CPU second at eBay. Would you recommend this metric as the basis for billing data warehousing customers? If not, what pricing schemes would you like to see?
written by Oliver Ratzesberger, October 24, 2008
It was good seeing you in Chicago!
As for the CPU seconds portion of your question:
In fact we calculate fractions of CPU seconds. For our large clusters (>2000 virtual instances) we capture this and other metrics on a per-request level and store the details for any request lasting longer than 5 seconds. For the rest we only keep a rolling summary by the account and group that submitted the request(s). On demand we can drill down to step-level detail that allows us to look at individual sorts, redistributions, aggregations and so on.
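To illustrate what that kind of metering might look like in code, here is a minimal sketch; the class names are hypothetical and the 5-second detail threshold simply mirrors the description above, not eBay's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field

DETAIL_THRESHOLD_SECONDS = 5.0  # keep full detail only for long-running requests

@dataclass
class RequestRecord:
    account: str
    group: str
    cpu_seconds: float            # fractional CPU seconds consumed
    elapsed_seconds: float
    steps: list = field(default_factory=list)  # step-level detail: sorts, redistributions, ...

class UsageMeter:
    def __init__(self):
        self.detailed = []  # per-request records, kept only for long requests
        self.summary = defaultdict(lambda: {"requests": 0, "cpu_seconds": 0.0})

    def record(self, req: RequestRecord):
        key = (req.account, req.group)            # rolling summary by account and group
        self.summary[key]["requests"] += 1
        self.summary[key]["cpu_seconds"] += req.cpu_seconds
        if req.elapsed_seconds > DETAIL_THRESHOLD_SECONDS:
            self.detailed.append(req)             # available for step-level drill-down
```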
This information is then leveraged by the system's priority manager and queuing engine to manage millions of requests per day according to predefined processing budgets. And to keep an eye on efficiency, we calculate Amdahl-like numbers (see my latest blog post from CCA08) as well as Parallel Efficiency, at the per-request level as well as at the group and system level.
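For readers who haven't seen the term, one common way to compute a per-request parallel efficiency number in an MPP system is the ratio of average to peak CPU time across the parallel units; whether this is exactly the formula used at eBay is my assumption, not something stated here.

```python
def parallel_efficiency(cpu_seconds_per_unit):
    """Average CPU time divided by the maximum CPU time across parallel units.

    1.0 means the work was spread perfectly evenly; lower values indicate skew,
    i.e. one unit doing far more of the work than the others.
    """
    peak = max(cpu_seconds_per_unit)
    if peak == 0:
        return 1.0  # no work done, treat as perfectly efficient
    average = sum(cpu_seconds_per_unit) / len(cpu_seconds_per_unit)
    return average / peak

# Example: a request whose work lands mostly on one of four virtual instances.
print(parallel_efficiency([10.0, 2.0, 1.5, 1.0]))  # ~0.36, heavily skewed
```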
Have a great day!