Big-data generates big traffic, it seems, and that calls for some serious planning on the part of enterprise IT managers.
Big-data. It's popping up in the news more and more frequently, portrayed as the next, ahem, big thing in IT. Ploughing through mountains of data in real-time has been helping companies such as Yahoo, eBay, and Google serve up just the right data for visitors to their sites and now, increasingly, other non-Web-centric companies are embracing the technology, too. Among the bigger names to have described their big-data activities: JPMorgan, Disney, Nokia, and CBS.
How come? Obviously, there's beaucoup data available, thanks to a world chock-full of computers and things controlled and monitored by computers. And, equally important, the cost of collecting and sifting through great amounts of data has been slashed with help from open-source software like Hadoop, which is designed specifically to run superbly on even low-cost commodity processors and disk drives.
What many enterprises are finding out, according to Internet Research Group (IRG), is that big-data leads to big traffic, and that some special planning is required to get big-data's big datasets copied and moved to where they need to be, before and after they're mined for valuable insights.
Granted, IRG's research into big-data's communications needs was funded in part by a company called Infineta Systems, which specializes in WAN optimization (that is, using caching and other techniques to squeeze maximum effective bandwidth from long-distance links). But the reasoning makes sense: In many large enterprises, large collections of mineable data may be dispersed across several different storage systems and even disparate datacenters. Yet to be fully exploited, these datasets need to be consolidated at more or less central analytics facilities.
In the case of early big-data users, such as Yahoo, data storage and analytics co-exist by design. But in many enterprises, for reasons of history, analytics facilities and big pools of data may end up in quite different locations. And that, IRG says, makes big-data a matter of "both throughput capacity [typically involving large clusters of Hadoop-based processing nodes] and intelligent data movement." The factors that call for consideration include:
- Which datasets are required by jobs queuing up for execution.
- The policies for moving and securing data in transit.
- What resources may be required as jobs execute.
- The allocation of the completed datasets to execution servers.
IRG asserts:
"To get the most from their Hadoop investment, most organizations will eventually want to run hundreds of Hadoop jobs daily, some running for only a few seconds while others may run for hours. The pressure to run more jobs leads to shrinking data movement windows for all jobs. If the data arrival rate slows due to the impact of big traffic (saturated WAN links and increased latency), then job execution slows as well. Fewer jobs can be run, scalability suffers, the utilization rate of the cluster drops, and with it the return on investment."
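To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. It is not anything IRG published: the function names, the 100 GB dataset, the 1 Gbps link, and the `efficiency` factor (the share of nominal bandwidth actually delivered) are all illustrative assumptions, and it treats data movement as the only bottleneck.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_seconds(dataset_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the seconds needed to move a dataset across a WAN link.

    dataset_gb -- dataset size in decimal gigabytes (hypothetical input)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of nominal bandwidth actually achieved
    """
    if dataset_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need dataset_gb >= 0, link_mbps > 0, 0 < efficiency <= 1")
    bits_to_move = dataset_gb * 8 * 1000**3
    effective_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits_to_move / effective_bits_per_second


def max_jobs_per_day(dataset_gb: float, link_mbps: float, efficiency: float = 0.7) -> int:
    """Upper bound on daily jobs if each job must wait for its own dataset to arrive."""
    seconds = transfer_seconds(dataset_gb, link_mbps, efficiency)
    if seconds == 0:
        raise ValueError("dataset_gb must be positive to bound the job count")
    return int(SECONDS_PER_DAY // seconds)


if __name__ == "__main__":
    # Compare a clean link with one delivering half its nominal rate.
    for eff in (1.0, 0.5):
        secs = transfer_seconds(100, 1000, eff)
        jobs = max_jobs_per_day(100, 1000, eff)
        print(f"efficiency={eff}: {secs:.0f} s per dataset, at most {jobs} jobs/day")
```

Under these assumptions, halving the effective throughput doubles the time each dataset spends in transit and halves the ceiling on daily jobs, which is the squeeze IRG describes.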
The solution: Solid and early planning. First, the appropriate people in the enterprise IT department must familiarize themselves with the workings and operating characteristics of software like Hadoop and its MapReduce processing model. Building and playing around with a prototype is a particularly good approach. Early on, though, IRG urges IT to consider the movement of data among Hadoop clusters, which may well be isolated from each other by large distances.
What's more, many enterprises are likely to forgo investing in their own big-data facilities and will instead turn to one of the various rent-a-Hadoop services that are springing up. But here, too, consideration of WAN transit times, optimization techniques, and strictly limited processing windows is a must.
What's your take on big-data? Has it raised any communications issues in your setup?