Specialized hardware may help analyze social networks and other big-data graphs.
Sometimes I think the Web is just a giant version of Ripley's Believe It or Not!, an endless string of extreme headlines. Leica's selling a digital camera for $30,000 -- all-white and sporting a see-in-the-dark f/0.95 lens. A Frenchman with no arms or legs has set out to swim between five continents. A 13-employee photo-sharing company with no discernible revenue has sold itself for $1 billion in cash and stock.
And now, this: Cray Inc. has come out with a computer that can take as much as 1 terabyte of main memory. That's about 140 bytes for every person on the planet.
Cray, of course, is an old name -- nay, the name -- in supercomputers, its machines having been used since the late 1970s to design hydrogen bombs, break secret codes, and simulate everything from truck axles to supersonic aircraft.
But that's not what its uRiKA computer is designed for. This machine, sold by Cray's YarcData unit (the "Yarc" is Cray spelled backward), is built specifically for what's called "relationship analytics." That's the corner of big data that deals with large sets of graph data, a data structure made up of webs, or networks, of "nodes" joined by multitudes of connections, known as "edges," that capture their many relationships.
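To make "nodes" and "edges" concrete, here's a minimal sketch in Python (my choice of language, nothing to do with YarcData's stack) of a graph stored as an adjacency list:

```python
# A tiny graph as an adjacency list: each node maps to the set of
# nodes it shares an edge with. Relationship-analytics graphs have
# the same shape, just with billions of entries.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice"},
    "dave":  {"bob"},
}

# "Who is exactly two hops from alice?" is the kind of question that
# means chasing edges rather than scanning rows in a table.
two_hops = {n for friend in graph["alice"] for n in graph[friend]}
two_hops -= graph["alice"] | {"alice"}
print(two_hops)  # {'dave'}
```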
Such graphs might describe all the members of Facebook, each one's "Likes," and who's "Friends" with whom. Or the details of all the telephone calls a telco has handled in the previous six months. Intelligence agencies build and analyze graphs to make sense of the piles of disparate data they collect from phone taps and field observations. Biomedical researchers analyze graphs to understand how different genes relate to one another. Ultimately, the Semantic Web, that almost-mythical next chapter in the tagging and organization of online data, would depend heavily on analyzing vast graphs of "triples" -- Obama resides in the White House, for instance -- coded in a scheme called RDF.
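In the RDF case, each fact reduces to a subject-predicate-object triple, and a query is a pattern match over the pile of triples. A bare-bones sketch (the "ex:" names are made up for illustration, not drawn from any real vocabulary):

```python
# Everything in RDF boils down to (subject, predicate, object) triples.
triples = [
    ("ex:Obama",      "ex:residesIn", "ex:WhiteHouse"),
    ("ex:WhiteHouse", "ex:locatedIn", "ex:WashingtonDC"),
]

# "Where does Obama reside?" Match the pattern, collect the objects.
answers = [o for (s, p, o) in triples
           if s == "ex:Obama" and p == "ex:residesIn"]
print(answers)  # ['ex:WhiteHouse']
```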
In short, relationship analytics is largely a matter of connecting the dots. And because a data graph typically holds so many dots, with so many links among them, the jobs can consume huge numbers of compute cycles: finding "influencers" who may hold special sway over friends and family (useful to telcos battling customer churn), say, or identifying every terrorism suspect whose travels intersected anywhere on the globe during March 2009.
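As a toy version of the influencer question, the crudest possible proxy is degree centrality: count how many edges touch each node. A sketch with made-up call records (real analyses use far more sophisticated measures):

```python
from collections import Counter

# Made-up call records: (caller, callee) pairs.
calls = [("ann", "bo"), ("ann", "cy"), ("ann", "dee"),
         ("bo", "cy"), ("dee", "ann")]

# Degree centrality: how many call endpoints each person appears in.
# A crude stand-in for "influence."
degree = Counter()
for caller, callee in calls:
    degree[caller] += 1
    degree[callee] += 1

print(degree.most_common(1))  # [('ann', 4)]
```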
How large is large? Curt Monash, my favorite database guru, writes that YarcData (a client of his) has told him of intelligence agencies envisioning graphs that contain billions of nodes, each with thousands of edges. Similarly, the company talks of telco graphs comprising around 100 million nodes and hundreds of billions of edges. One of the "smaller" bioinformatics graphs out there comprises 22 billion nodes.
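Back-of-envelope, those numbers explain the appetite for memory. Assuming (my numbers, not YarcData's) 200 billion edges for that telco graph and a bare-bones 16 bytes per edge -- two 8-byte node IDs, nothing else:

```python
# Rough memory math for the telco graph described above.
edges = 200_000_000_000      # "hundreds of billions"; assume 200 billion
bytes_per_edge = 16          # two 8-byte node IDs, no attributes

terabytes = edges * bytes_per_edge / 1e12
print(f"{terabytes:.1f} TB just for edge endpoints")  # 3.2 TB
```

And that's before you store a single attribute on a node or an edge.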
Sounds like just the kind of thing for which the cloud was built, right? Or Hadoop? Alas, traversing the myriad pathways within a graph to see what's connected to what (or whom) doesn't necessarily lend itself to divvying up across gangs of servers. It's often difficult to partition a large graph into self-contained pieces that can be analyzed in parallel; edges stubbornly cross whatever boundaries you draw, so the servers end up waiting on one another.
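A toy illustration of the problem, reusing the little graph from above: split it across two hypothetical "servers" and count how often a plain breadth-first traversal crosses the boundary. On a real cluster, each crossing is a network round trip.

```python
from collections import deque

# The same toy graph, arbitrarily split across two "servers."
graph = {"alice": {"bob", "carol"}, "bob": {"alice", "dave"},
         "carol": {"alice"}, "dave": {"bob"}}
partition = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}

# Breadth-first search from alice, counting boundary crossings.
crossings, seen, queue = 0, {"alice"}, deque(["alice"])
while queue:
    node = queue.popleft()
    for nbr in graph[node]:
        if partition[nbr] != partition[node]:
            crossings += 1
        if nbr not in seen:
            seen.add(nbr)
            queue.append(nbr)

print(crossings)  # 4 -- even this 4-node graph keeps hopping servers
```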
Which is why, of course, YarcData's "graph appliance" is built to harness so much RAM. With a large graph stored entirely in high-speed memory, the machine can analyze it much faster than swarms of standard servers linked over 10-gig Ethernet ever could. YarcData describes its machine as a "massively multithreaded graph processor" that in its largest configuration can run more than 1 million threads at once and move data in and out of its memory at rates as high as 350TB per hour.
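For perspective on that last figure (my arithmetic, not YarcData's):

```python
# 350 TB/hour expressed as sustained memory bandwidth.
tb_per_hour = 350
gb_per_second = tb_per_hour * 1000 / 3600   # decimal units: 1 TB = 1,000 GB
print(f"about {gb_per_second:.0f} GB/s")    # about 97 GB/s
```

Call it roughly 100GB per second, sustained, across the whole machine.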
Exactly how this computer differs from Cray's standard supercomputers, I'm not sure (mainly in the software stacks, I imagine). Ditto on whether Corporate America will gain much from analyzing mountains of tweets and other social chit-chat to target ads and such, a widely heralded big-data application about which Monash, for one, has serious doubts.
Me, I'll go for that Leica. In classic black, it's only $8,000.
Your thoughts on big machines for big data?